Tuesday, May 12

Plotting time-series DataFrames in pandas

Pandas provides a convenience method for plotting DataFrames: DataFrame.plot. There is also a quick guide here.

Unfortunately, when it comes to time series data, I don't always find the convenience method convenient. I often have a sparse DataFrame with lots of NaNs, which are not ignored by the convenience method. Additionally, I don't like the way that matplotlib places the lines hard against the left and right-hand sides of the canvas. I like a little bit of space at each end of the chart. Finally, I like playing with the tick marks and tick labels to get the right density of information on the x-axis.

Rather than use the inconvenient convenience method, I regularly find myself writing a short function to produce the plot layout I find a little more aesthetically pleasing. An example chart (from Mark the Ballot) and the associated python code follows.


import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter, YearLocator, MonthLocator
plt.style.use('ggplot')

def plot(df, filename, heading=None):

    fig, ax = plt.subplots(figsize=(8, 4))

    min_date = None
    max_date = None
    for col_name in df.columns.values:

        # plot the column
        col = df[col_name]
        col = col[col.notnull()] # drop NAs
        dates = [zzz.to_timestamp().date() for zzz in col.index]
        ax.plot_date(x=dates, y=col, fmt='-', label=col_name,
            tz=None, xdate=True, ydate=False, linewidth=1.5)

        # establish the date range for the data
        if min_date:
            min_date = min(min_date, min(dates))
        else:
            min_date = min(dates)
        if max_date:
            max_date = max(max_date, max(dates))
        else:
            max_date = max(dates)

    # give a bit of space at each end of the plot - aesthetics
    span = max_date - min_date
    extra = int(span.days * 0.03) * datetime.timedelta(days=1)
    ax.set_xlim([min_date - extra, max_date + extra])

    # format the x tick marks
    ax.xaxis.set_major_formatter(DateFormatter('%Y'))
    ax.xaxis.set_minor_formatter(DateFormatter('\n%b'))
    ax.xaxis.set_major_locator(YearLocator())
    ax.xaxis.set_minor_locator(MonthLocator(bymonthday=1, interval=2))

    # grid, legend and yLabel
    ax.grid(True)
    ax.legend(loc='best', prop={'size':'x-small'})
    ax.set_ylabel('Percent')

    # heading
    if heading:
        fig.suptitle(heading, fontsize=12)
    fig.tight_layout(pad=1.5)

    # footnote
    fig.text(0.99, 0.01, 'marktheballot.blogspot.com.au', ha='right', 
        va='bottom', fontsize=8, color='#999999')

    # save to file
    fig.savefig(filename, dpi=125)

Sunday, May 3

Using python statsmodels for OLS linear regression

This is a short post about using the python statsmodels package for calculating and charting a linear regression.

Let's start with some dummy data, which we will enter using iPython. We fake up normally distributed data around y ~ x + 10.

In [1]: import numpy as np

In [2]: x = np.random.randn(100)

In [3]: y = x + np.random.randn(100) + 10

We can plot this simply ...

In [4]: import matplotlib.pyplot as plt

In [5]: fig, ax = plt.subplots(figsize=(8, 4))

In [6]: ax.scatter(x, y, alpha=0.5, color='orchid')
Out[6]: 

In [7]: fig.suptitle('Example Scatter Plot')
Out[7]: 

In [8]: fig.tight_layout(pad=2); 

In [9]: ax.grid(True)

In [10]: fig.savefig('filename1.png', dpi=125)


That was easy. Next we will add a regression line. We will use the statsmodels package to calculate the regression line. Lines 11 to 15 is where we model the regression. Lines 16 to 20 we calculate and plot the regression line.

The key trick is at line 12: we need to add the intercept term explicitly. Without with this step, the regression model would be: y ~ x, rather than y ~ x + c. Similarly, at line 17, we include an intercept term in the data we provide to the predicting method at line 18. The sm.add_constant() method prepends a column of ones for the constant term in the regression model, returning a two column numpy array. The first column is ones, the second column is our original data from above.

In [11]: import statsmodels.api as sm

In [12]: x = sm.add_constant(x) # constant intercept term

In [13]: # Model: y ~ x + c

In [14]: model = sm.OLS(y, x)

In [15]: fitted = model.fit()

In [16]: x_pred = np.linspace(x.min(), x.max(), 50)

In [17]: x_pred2 = sm.add_constant(x_pred)

In [18]: y_pred = fitted.predict(x_pred2)

In [19]: ax.plot(x_pred, y_pred, '-', color='darkorchid', linewidth=2)
Out[19]: []

In [20]: fig.savefig('filename2.png', dpi=125)


If we wanted key data from the regression, the following would do the job, after line 15:

print(fitted.params)     # the estimated parameters for the regression line
print(fitted.summary())  # summary statistics for the regression

We can add a confidence interval for the regression. There is a 95 per cent probability that the true regression line for the population lies within the confidence interval for our estimate of the regression line calculated from the sample data. We will calculate this from scratch, largely because I am not aware of a simple way of doing it within the statsmodels package.

To get the necessary t-statistic, I have imported the scipy stats package at line 27, and calculated the t-statistic at line 28. 

In [22]: y_hat = fitted.predict(x) # x is an array from line 12 above

In [23]: y_err = y - y_hat

In [24]: mean_x = x.T[1].mean()

In [25]: n = len(x)

In [26]: dof = n - fitted.df_model - 1

In [27]: from scipy import stats

In [28]: t = stats.t.ppf(1-0.025, df=dof)

In [29]: s_err = np.sum(np.power(y_err, 2))

In [30]: conf = t * np.sqrt((s_err/(n-2))*(1.0/n + (np.power((x_pred-mean_x),2) / 
   ....:     ((np.sum(np.power(x_pred,2))) - n*(np.power(mean_x,2))))))

In [31]: upper = y_pred + abs(conf)

In [32]: lower = y_pred - abs(conf)

In [33]: ax.fill_between(x_pred, lower, upper, color='#888888', alpha=0.4)
Out[33]: 

In [34]: fig.savefig('filename3.png', dpi=125)


The final step is a prediction interval. There is a 95 per cent probability that the real value of y in the population for a given value of x lies within the prediction interval. There is a statsmodels method in the sandbox we can use.

In [35]: from statsmodels.sandbox.regression.predstd import wls_prediction_std

In [36]: sdev, lower, upper = wls_prediction_std(fitted, exog=x_pred2, alpha=0.05)

In [37]: ax.fill_between(x_pred, lower, upper, color='#888888', alpha=0.1)
Out[37]: 

In [38]: fig.savefig('filename4.png', dpi=125)




Sunday, April 5

M1: surge in money supply - meaningful or meaningless

The last couple of months have seen a sudden jump in liquid money supply (M1).





Saturday, February 21

Load ABS files to MySQL

When I am working in R, I tend to have my working data in a MySQL database. I found that R did not always play nicely (and quickly) with complex Microsoft Excel files.

Previously, I had quite a complex bit of python code to read Excel files and upload them to MySQL. I have now retooled the way in which I load files from the Australian Bureau of Statistics (ABS) to MySQL using Python pandas. The code is much simpler.

First, I store my MySQL database username and password in a file MysqlConnect.py (it is used by a number of different programs). It lives in the bin directory (../bin from where I do my work). And, just in case you are wondering: no, it is not my password.

host     = 'localhost'
user     = 'root'
password = 'BigRedCar'
database = 'dbase1'

Now let's move on to the function to load ABS files into MySQL. It lives in the bin directory (../bin from where I do my work), in a file named LoadABSToMySQL.py

import pandas as pd
import pymysql
from sqlalchemy import create_engine
import os.path
import re

# local imports - a file that contains database login details
import MysqlConnect as MSC

def LoadABSToMySQL(pathName):
    """ Read an Excel file from the Australian Bureau of Statistics
        and load it into a MySQL database"""

    # --- 1 --- open MySQL
    s = 'mysql+pymysql://'+MSC.user+':'+MSC.password+'@'+MSC.host+'/'+MSC.database
    engine = create_engine(s)

    # --- 2 --- identify proposed table name from file name
    (head,tail) = os.path.split(pathName)
    tail = re.split('\.', tail)
    tablename = tail[0]

    # --- 3 --- open the XL file
    wb = pd.ExcelFile(pathName)

    # --- 4 --- load XL workbooks into a single DataFrame
    df = pd.DataFrame()
    for name in wb.sheet_names:

        # -- ignore junk
        if not 'Data' in name:
            continue

        # -- read
        tmp = wb.parse(sheetname=name, header=9, index_col=0, na_values=['', '-', ' '])

        # -- amalgamate
        df = pd.merge(left=df, right=tmp, how='outer', left_index=True, right_index=True)
        tmp = None

    # --- 5 --- write this DataFrame to MySQL
    df.to_sql(tablename, engine, index=True, if_exists='replace')

Finally, an example code snippet to load some of the ABS National Account files to MySQL. This files sits in my national accounts directory and has the rather unimaginative name py-load.py. The ABS Microsoft Excel files live in the ./raw-data sub-directory.

import sys
sys.path.append( '../bin' )

from LoadABSToMySQL import LoadABSToMySQL

dataDirectory = './raw-data/'
dataFiles = [
    '5206001_key_aggregates',
    '5206002_expenditure_volume_measures',
    '5206003_expenditure_current_price',
    '5206004_expenditure_price_indexes',
    '5206006_industry_gva',
    '5206008_household_final_consumption_expenditure',
    '5206022_taxes',
    '5206023_social_assistance_benefits',
    '5206024_selected_analytical_series'
]
dataSuffix = '.xls'

for f in dataFiles :
    LoadABSToMySQL(dataDirectory + f + dataSuffix)

To run this python load file, I have a BASH shell script, which I use on my iMac. It has another unimaginative name: run-load.sh.

#!/bin/bash

# mac os x fix ...
cd "$(dirname "$0")"

python ./py-load.py

Friday, February 13

Wednesday, February 11

An Extra Dry Baltic Index

The Baltic Dry Index provides a window on world trade. The view out the window ain't that good at the moment.



The Baltic Dry Index was at 663 on 5 December 2008 (its post-GFC low point).

It was at 554 on 9 February 2015. But no need to worry, it rebounded back up to 556 on 10 February.