Saturday, January 25

Sunday, January 5

Pandas DataFrame cheat sheet and the Python v R debate

This has taken a lot longer than I thought it would. But I now have a rough, early draft of a cheat sheet on the Python pandas DataFrame object. When dealing with such a rich environment, it is a challenging decision on what to include and exclude in a four page set of notes. As always, comments and suggestions for improvement are always welcome. Doubtless there are many typographic errors that would be eradicated with another set of eyes and a ruthless proof reading of the draft text. 

I have also been thinking about the Python versus R debate. Which is the better tool for data analysis? Clearly, this is one of those questions to which there is no ultimate truth. Your answer will depend on what sort of analysis you do. My analytical interest is primarily data capture and visualization. While I do some simple statistical analyses, I do not use R in a deep statistical way. I do, however, depend heavily on Hadley Wickham's excellent graphics package ggplot2.

Before coming to the question, I should expose my relative experience with the two languages. I have been using R as my primary analytical language for a number of years. Over the same period I have used Python as well; albeit less frequently, and largely for data munging (loading data from various data sources into a MySQL server from which R can access it).

On the criterion of data munging, Python beats R hands down. Python is simply better able to manage dirty data in all its forms and is better for automating the cleaning and loading operations.

Another area where I think Python beats R is coding time. My impression is that I code at least twice as fast in Python. I spend much less time debugging code. I became a very defensive coder in R. Every function I wrote religiously tested the arguments to see that the values were of the right type and within the expected ranges. It was not unusual to start a function with 6 to 12 stopifnot() statements. Even so, I simply code faster in Python. There are a few reasons for this. While both languages are very expressive (compared with C, C++ or Java), I find Python the more expressive language. List comprehensions and generator expressions are powerful tools for tight code. While environments in R come close, they are no where as natural to use as dictionaries in Python. Second, Python's much stronger typing better protects me from my poor typing skills (no pun intended). Third, my learning curve with Python was much shorter than for R. But again, this may just be a product of my background (as someone coming from the C, C++, Objective-C and Java programming paradigms).

On graphics, I think Hadley Wickham's ggplot2 beats the competition from Python in the form of matplotlib. But work is afoot to replicate ggplot2 in the Python environment. When that work is well progressed I might just change ships.

Another area where R leads is idiomatic coherence. While the pandas DataFrame object is an immensely rich environment, it feels cobbled together and a little rough at the edges (it does not feel coherently designed from the ground up). Take the myriad of indexing options: [], .loc[], .iloc[], .ix[], .iat[], .at[], .xs[], which feel like the maze of twisty little passages, all alike, but each a little different. And then there are the confusing rules (for example, single indexes on a DataFrame return columns but single sliced indexes return rows). Furthermore, the Pythonic notion of container truthiness was not maintained by pandas (in the rest of Python, empty containers are False while non-empty containers are True, regardless of what they contain). I could go on. But, simply put, data.frames in R are more coherent with the rest of the R language compared with the DataFrame object and the rest of Python.

And another point of comparison in R's favour: the R help system is much more helpful than its counterpart from Python and pandas.

Finally there is something of the nanny state that annoys the hell out of me in both Hadley Wickham's ggplot2 and Wes McKinney's DataFrame. Hadley won't let you plot a chart with two y-axes. Wes, won't let you plot a time series as a bar chart. I can see the arguments for both protections. But really, is prohibition needed? Ironically, you can commit Hadley's unforgivable sin under Wes' DataFrame. And Hadley will happily let you plot a time series as a bar chart. It is time both Hadley and Wes embraced liberalism.

Wednesday, January 1

Reading ABS Excel files in pandas

Experimental code follows for reading excel files from the Australian Bureau of Statistics.

### ABSExcelLoader.py
### written in python 2.7 and pandas 0.12

import pandas as pd

class ABSExcelLoader:

    def load(self, pathName, freq='M', index_name=None, verbose=False):
        """return a dictionary of pandas DataFrame objects for
           each Data work sheet in an ABS Excel spreadsheet"""

        wb = pd.ExcelFile(pathName)
        returnDict = {}

        for name in wb.sheet_names:
            if not 'Data' in name:
                continue

            df = wb.parse(name, skiprows=8, header=9, index_col=0)
            periods = pd.PeriodIndex(pd.Series(df.index), freq=freq)
            df.set_index(keys=periods, inplace=True)
            df.index.name = index_name
            returnDict[name] = df

            if verbose:
                print ("\nFile: '{}', sheet: '{}'".format(pathName, name))
                print (df.iloc[:min(5, len(df)), :min(5, len(df.columns))])

        return returnDict