Sunday, December 28

Updated cheat sheets: python, pandas and matplotlib

I have updated my cheat sheets for python, pandas and matplotlib.

You can find them here:

Sunday, October 12

Seasonal adjustment

Last week the ABS made a big call. It determined that the "seasonal pattern previously evident for the July, August and September labour force estimates is not apparent in 2014".

The ABS went on to say, "As there is little evidence of seasonality in the July, August and September months for 2014, the ABS has decided that for these months the seasonal factors will be set to one (reflecting no seasonality). This means the seasonally adjusted estimates (other than for the aggregate monthly hours worked series) for these months will be the same as the original series and this will result in revisions to the previously published July and August seasonally adjusted estimates."

I have enormous respect for the ABS, so I am not going to speculate on whether it got this call right or wrong. What I would like to know is what this decision means. And to do that, we will look at the national unemployment rate statistics for persons.

First, however, a quick recap on the seasonal adjustment process. With the monthly labour-force series, ABS uses a multiplicative model of seasonal adjustment, which is described here. The key formulas are as follows:

Original(t) = Trend(t) * Seasonal(t) * Irregular(t)
SeasonallyAdjusted(t) = Original(t) / Seasonal(t)

With a little algebra from the above formulas, we can see the seasonal factor the ABS has applied (including the 1.0 for July, August and September 2014) in the next chart, which plots in blue the seasonal factor applied for each month from February 1978 to September 2014. The red horizontal line is the mean seasonal factor for each month. This chart is evidence that seasonal factors can and do change over time (have a look at December).

To have a look at what the ABS decision meant, I am going to apply the seasonal factor from 2013 for July to September to the 2014 calculation (as an approximation).  The values I am substituting back are: 0.937385, 0.975908 and 0.991445.

Because the three relevant seasonal factors are all less than one, their effect is to increase the seasonally adjusted unemployment rate for the impacted months - particularly for July (which spiked at 6.4 per cent).

The ABS uses a 13-term Henderson moving average to derive the trend from the seasonally adjusted series. The resulting trend with seasonal adjustment applied for the past three months is now noticeably higher than if it were not applied. If the July result is a rogue, I would expect this difference to decline over the next four months, as further data points emerge.

For me the really interesting question is what will the ABS do next month. If you look at the first chart above, October and November are both months where the seasonal factor boosts the raw unemployment rate to yield a higher seasonally adjusted rate.


Another way to think about this problem is to apply my own seasonal adjustment to the original data. Unfortunately, I do not have as sophisticated software tools as those used by the ABS. I cannot make seasonal adjustments for moving holidays (such as Easter). Nor do I extend the the original data series using a seasonal, auto-regressive, integrated, moving average (SARIMA) regression forecast before undertaking the seasonal decomposition. Consequently, my seasonal weights and final results differ slightly from the ABS. Nonetheless, the results I obtain are reasonably close to the above analysis.

Saturday, October 11

State of the economy

The headline indicator of economic growth looks okay ... not brilliant, but not bad either ...

But the underlying indicators look anaemic ... suggesting things might not feel too flash for the average person on the street ...

And the villain in this saga ...

Sunday, August 3

Pandas 0.14.1

I have just upgraded to Pandas 0.14.1.

It was a pain. At first, none of my graphics programs worked. It looks like a change to the API for parsing Microsoft Excel files was the problem. I am not sure whether my previous approach was wrong (but worked serendipitously), or the API was deliberately changed to break old code (an unusual practice for a point release). If someone knows, I'd appreciate something in the comments below.

What follows are the classes I use to upload Australian Bureau of Statistics (ABS) and Reserve Bank of Australia (RBA) data, with the updates to the parsing stage commented.

And yes, I use Python 2.7, not Python 3 (it's what comes with the Apple Mac).

### written in python 2.7 and pandas 0.14.1

import pandas as pd
assert( pd.__version__ >= '0.14.1' )

class ABSExcelLoader:

    def load(self, pathName, freq='M', index_name=None, verbose=False):
        """return a dictionary of pandas DataFrame objects for
           each Data work sheet in an ABS Excel spreadsheet"""

        wb = pd.ExcelFile(pathName)
        returnDict = {}

        for name in wb.sheet_names:
            if not 'Data' in name:

            # ExcelFile.parse: API behaviour change with pandas 14.1
            #df = wb.parse(sheetname=name, skiprows=8, header=9, index_col=0, na_values=['', '-', ' '])
            df = wb.parse(sheetname=name, skiprows=9, header=0, index_col=0, na_values=['', '-', ' '])

            periods = pd.PeriodIndex(pd.Series(df.index), freq=freq)
            df.set_index(keys=periods, inplace=True)
   = index_name
            returnDict[name] = df

            if verbose:
                print ("\nFile: '{}', sheet: '{}'".format(pathName, name))
                print (df.iloc[:min(5, len(df)), :min(5, len(df.columns))])

        return returnDict

### written in python 2.7 and pandas 0.14.1

import pandas as pd
assert( pd.__version__ >= '0.14.1' )

class RBAExcelLoader:

    def load(self, pathName, freq='M', index_name=None, verbose=False):
        """return a pandas DataFrame for an RBA Excel spreadsheet"""
        wb = pd.ExcelFile(pathName)
        sheetname = 'Data'

        # ExcelFile.parse: API behaviour change with pandas 14.1
        #df = wb.parse(sheetname, skiprows=9, header=10, index_col=0, na_values=['', '-', ' '])
        df = wb.parse(sheetname, skiprows=10, header=0, index_col=0, na_values=['', '-', ' '])

        periods = pd.PeriodIndex(pd.Series(df.index), freq=freq)
        df.set_index(keys=periods, inplace=True)

        if verbose:
            print "\nFile: '{}', sheet: '{}'".format(pathName, sheetname)
            print 'Columns: {}'.format(df.columns.tolist())
            print 'Top left hand corner ...'
            print '------------------------'
            print df.iloc[:min(5, len(df)), :min(5, len(df.columns))]

        return df

Tuesday, July 15

Seasonal adjustment and decomposition

Having moved from R to python/pandas, one of the things I really miss is the extensive statistics library available in R.

I particularly like the STL function: the seasonal decomposition of a time series by Loess (localised regression). The function allows you to decompose a time series into its seasonal, trend and irregular components using loess. Unfortunately, I could find nothing like it in pandas.

In frustration, I have cobbled together a python/pandas function to decompose a univariate pandas time series item into seasonal, trend and irregular components. The function does not use Loess. Rather it uses a set of processes similar to those used by the Australian Bureau of Statistics.

It also uses a Henderson Moving Average which I had coded previously (see here).

Anyway, I'd be interested to know whether this code works for you. Drop me a line if you end up using it.

Caution: still experimental code.

## produce a time series decomposition

import pandas as pd
import numpy as np
from Henderson import Henderson

# --- A selection of seasonal smoothing weights, from which you can select
#     Note: these are end-weights, they are reversed for the start of a series
#     Note: your own weights in this form should also work
s3x3 = (
    np.array([ 5, 11, 11]) / 27.0,
    np.array([ 3,  7, 10,  7]) / 27.0,
    np.array([ 1,  2,  3,  2,  1]) / 9.0,

s3x5 = (
    np.array([ 9, 17, 17, 17]) / 60.0,
    np.array([ 4, 11, 15, 15, 15]) / 60.0,
    np.array([ 4,  8, 13, 13, 13,  9]) / 60.0,
    np.array([ 1,  2,  3,  3,  3,  2,  1]) / 15.0,

s3x9 = (
    np.array([0.051, 0.112, 0.173, 0.197, 0.221, 0.246]),
    np.array([0.028, 0.092, 0.144, 0.160, 0.176, 0.192, 0.208]),
    np.array([0.032, 0.079, 0.123, 0.133, 0.143, 0.154, 0.163, 0.173]),
    np.array([0.034, 0.075, 0.113, 0.117, 0.123, 0.128, 0.132, 0.137, 0.141]),
    np.array([0.034, 0.073, 0.111, 0.113, 0.114, 0.116, 0.117, 0.118, 0.120, 0.084]),
    np.array([1,     2,     3,     3,     3,    3,    3,    3,    3,    2,    1]) / 27.0,

# --- public Decomposition function
def Decompose(s, periods=None, model='multiplicative', 
    constantSeasonal=False, seasonalSmoother=s3x5):

    """ The simple decomposition of a pandas Series s into its trend, seasonal 
        and irregular components. The default is a multiplicative model: 
        Original(t) = Trend(t) * Seasonal(t) * Irregular(t). Can specify an 
        additive model: Original(t) = Trend(t) + Seasonal(t) + Irregular(t)

        -   s - the pandas Series, without any missing or NA values, 
            and sorted in ascending order
        -   periods - either a pandas Series indicating the period to 
            which each value of s belongs (of the same length as s, 
            with the same index as s), or an int for the number of periods
            into which to decompose the series
        -   model - string - either 'multiplicative' or 'additive'
        -   constantSeasonal - bool - whether the seasonal component is 
            constant or (slowly) variable
        -   seasonalSmoother - when not using a constantSeasonal, which 
            of the seasonal smoothers to use (s3x3, s3x5 or s3x9) - 
            default is s3x5 (ie over 7 years for monthly or quarterly data)

        Returns a pandas DataFrame with columns for each step in the 
        decomposition process (enables debugging). The key columns in the 
        DataFrame are:
        -   'Original' - the original series
        -   'SeasAdj' - the seasonally adjusted series
        -   'Trend' - the trend of the seasonally adjusted series
        -   'Seasonal' - the seasonal component found through the 
            decomposition process
        -   'Irregular' - the irregular component found through the 
            decomposition process

        Based on ideas gleaned from the Australian Bureau of Statistics:
            ABS (2005), "An Introductory Course on Times Series
            Analysis -- Electronic Delivery", Catalogue: 1346,0.55.001, online at:
        Does not adjust for moving holidays, public holidays, variable number 
        of working days in month, etc. (ie. it is quite a simple decomposition)"""

    ### --- Sanity checks and initialisation --- ###

    # --- sanity checks
    if periods is None:
        raise ValueError('The periods parameter is an integer or a Series of integers')
    if not isinstance(s, pd.core.frame.Series):
        raise TypeError('The s parameter should be a pandas Series')
    if not(s.index.is_monotonic and s.index.is_unique):
        raise ValueError('The index for the s parameter should be unique and sorted')
    if any(s.isnull()) or not all(np.isfinite(s)):
        raise ValueError('The s parameter contains NA or infinite values')

    # --- initialise
    result = pd.DataFrame(s)
    result.columns = ['Original']

    # --- determine the period
    if isinstance(periods, pd.core.frame.Series):
        if not (len(s) == len(periods) and all(s.index == periods.index)) :
            raise ValueError('The s and periods parameters must have the same index')
        result['period'] = periods
        periods = len(periods.unique())
        periods = int(periods)
        result['period'] = pd.Series(range(len(result)), index=s.index) % periods
    if periods < 2:
        raise ValueError('The periods parameter should be >= 2')
    if len(s) < (periods * 2) + 1:
        raise ValueError('The s parameter is not long enough to decompose')

    # --- settle the length of the Henderson moving average
    h = max(periods, 7) # ABS uses 13-term HMA for monthly and 7-term for quarterly
    if h % 2 == 0 :
        h += 1 # we need an odd number

    ### --- On to the decomposition process --- ###

    # --- 1 - derive an initial estimate for the trend component
    result['1stTrendEst'] = pd.rolling_mean(s, window=periods+1, 
        min_periods=periods+1, center=True)
    # Note: rolling mean leaves NA values at the start/end of the trend estimate.

    # --- 2 - preliminary estimate of the seasonal component
    if model == 'multiplicative':
        result['1stSeasonalEst'] = result['Original'] / result['1stTrendEst']
        result['1stSeasonalEst'] = result['Original'] - result['1stTrendEst']

    # --- 3 - smooth the seasonal
    result = _smoothSeasonalComponent(result, periods=periods, 
        constantSeasonal=constantSeasonal, seasonalSmoother=seasonalSmoother, 
        columnToBeSmoothed='1stSeasonalEst', newColumn='2ndSeasonalEst')

    # --- 4 - extend the smoothed seasonal estimate to full scale
    if any(result['2ndSeasonalEst'].isnull()) :
        result = _extendSeries(result, periods=periods,
            columnToBeExtended='2ndSeasonalEst', newColumn='3rdSeasonalEst')
        result['3rdSeasonalEst'] = result['2ndSeasonalEst']

    # --- 5 - preliminary estimate of the seasonally adjusted data
    if model == 'multiplicative':
        result['1stSeasAdjEst'] = result['Original'] / result['3rdSeasonalEst']
        result['1stSeasAdjEst'] = result['Original'] - result['3rdSeasonalEst']

    # --- 6 - a better estimate of the trend
    result['2ndTrendEst'] =  Henderson(result['1stSeasAdjEst'], h)

    # --- 7 - final estimate of the seasonal component
    if model == 'multiplicative':
        result['4thSeasonalEst'] = result['Original'] / result['2ndTrendEst']
        result['4thSeasonalEst'] = result['Original'] - result['2ndTrendEst']

    result = _smoothSeasonalComponent(result, periods=periods, 
        constantSeasonal=constantSeasonal, seasonalSmoother=seasonalSmoother, 
        columnToBeSmoothed='4thSeasonalEst', newColumn='Seasonal')

    # --- 8 - final estimate of the seasonally adjusted series
    if model == 'multiplicative':
        result['SeasAdj'] = result['Original'] / result['Seasonal']
        result['SeasAdj'] = result['Original'] - result['Seasonal']

    # --- 9 - final trend estimate
    result['Trend'] =  Henderson(result['SeasAdj'], h)

    # --- 10 - final irregular
    if model == 'multiplicative':
        result['Irregular'] = result['SeasAdj'] / result['Trend']
        result['Irregular'] = result['SeasAdj'] - result['Trend']

    # --- 11 - our job here is done
    return (result)

# --- apply seasonal smoother
def _smoothSeasonalComponent(result, periods, constantSeasonal, seasonalSmoother,
    columnToBeSmoothed, newColumn):

    # get the key smoothing constants
    if not constantSeasonal:
        kS = len(seasonalSmoother)
        lenS = (len(seasonalSmoother) * 2) -1
        centralS = seasonalSmoother[len(seasonalSmoother)-1]

    # establish an empty return column ...
    result[newColumn] = np.repeat(np.nan, len(result))

    # populate the return column ...
    for u in result['period'].unique() :

        # get each of of the seasonals
        thisSeason = result[result['period'] == u][columnToBeSmoothed]

        # smooth to a constant seasonal value
        if constantSeasonal:
            thisSeasonSmoothed = pd.Series(np.repeat(thisSeason.mean(skipna=True),
                len(thisSeason)), index=thisSeason.index)

        # smooth to a slowly changing seasonal value
            # drop NA values which result from step 1 in the decomp process
            thisSeason = thisSeason.dropna()

            # apply the seasonalSmoother
            thisSeasonSmoothed = pd.rolling_apply(thisSeason, window=lenS,
                func=lambda x: (x * centralS).sum(), min_periods=lenS, center=True)

            # for short series the above process results in no data ... find a simple mean
            if all(thisSeasonSmoothed.isnull()) :
                # same treatment as constant seasonal value above
                thisSeasonSmoothed = pd.Series(np.repeat(thisSeason.mean(skipna=True),
                    len(thisSeason)), index=thisSeason.index)

            # handle the end-point problem ...
            for i in range(kS-1) :
                if np.isnan(thisSeasonSmoothed.iat[i]) :
                    thisSeasonSmoothed.iat[i] = (thisSeason.iloc[0: i+kS] *
                        (seasonalSmoother[i][::-1])).sum() # note: reversed order at start

            for i in range(len(thisSeason)-1, len(thisSeason)-kS, -1) :
                if np.isnan(thisSeasonSmoothed.iat[i]) :
                    thisSeasonSmoothed.iat[i] = (
                        thisSeason.iloc[i-(kS-1):len(thisSeason)] *

        # package up season by season ...
        result[newColumn] = result[newColumn].where(result[newColumn].notnull(), 

    return (result)

# --- extend seasonal components to the full length of series 
def _extendSeries(result, periods, columnToBeExtended, newColumn):

    result[newColumn] = result[columnToBeExtended].copy()

    def fillup(result, fill, startPoint, endPoint):
        i = startPoint
        while True:
            p = result.index[i]
            result[newColumn].iat[i] = fill[newColumn].at[result['period'].iat[i]]
            if p >= endPoint:
            i += 1

    # back-cast
    if np.isnan(result.iat[0, result.columns.get_loc(newColumn)]):
        fill = pd.DataFrame(result[newColumn].dropna().iloc[0:periods])
        fill['period'] = result['period'][fill.index[0]:fill.index[len(fill)-1]]
        endPoint = fill.index[0] - 1
        fill.index = fill['period']
        fillup(result=result, fill=fill, startPoint=0, endPoint=endPoint)

    # forward-cast
    if np.isnan(result.iat[len(result)-1, result.columns.get_loc(newColumn)]):
        fill = result[newColumn].dropna()
        fill = pd.DataFrame(fill[len(fill)-periods:len(fill)])
        fill['period'] = result['period'][fill.index[0]:fill.index[len(fill)-1]]
        startPoint = result.index.get_loc(fill.index[len(fill)-1] + 1)
        fill.index = fill['period']
        endPoint = result.index[len(result)-1]
        fillup(result=result, fill=fill, startPoint=startPoint, endPoint=endPoint)

    return (result)

Monday, June 23

Henderson Moving Average

I have posted my R code for a Henderson moving average here. This is the same code in python.

## calculate a Henderson moving average

import pandas as pd
import numpy as np

def hmaSymmetricWeights(n):
    """ derive an n-term array of symmetric 'Henderson Moving Average' weights
        formula from ABS (2003), 'A Guide to Interpreting Time Series', page 41.
        returns a numpy array of symmetric Henderson weights indexed from 0 to n-1"""

    # calculate the constant denominator and terms
    m = int((n-1)//2) # the mid point - n must be odd
    m1 = (m+1)*(m+1)
    m2 = (m+2)*(m+2)
    d = float(8*(m+2)*(m2-1)*(4*m2-1)*(4*m2-9)*(4*m2-25))
    m3 = (m+3)*(m+3)

    # calculate the weights
    w = np.repeat(np.nan, n) # Actually indexed from 0 to n-1
    for j in range(m+1):
        j2 = j*j
        v = (315*(m1-j2)*(m2-j2)*(m3-j2)*(3*m2-11*j2-16))/d
        w[(m+j)] = v
        if j > 0:
            w[(m-j)] = v

    w.flags.writeable = False # let's make it quasi-immutable
    return (w)

def hmaAsymmetricWeights(m, w):
    """calculate the asymmetric end-weights

        w --> an array of symmetrical henderson weights (from above function)
        m --> the number of asymmetric weights sought; where m < len(w);

        returns a numpy array of asymmetrical weights, indexed from 0 to m-1;

        formula from Mike Doherty (2001), 'The Surrogate Henderson Filters in X-11',
        Aust, NZ J of Stat. 43(4), 2001, pp901-999; see formula (1) on page 903"""

    n = len(w) # the number of weights

    # - some quick sanity checks
    if m >= n:
        raise ValueError('The m argument must be less than n')
    if m <= int((n-1)//2):
        raise ValueError('The m argument must be greater than (n-1)/2')

    # --- let's build up Doherty's formula (1) from the top of page 903

    # - the second chunk of the formula
    sumResidual = w[range(m, n)].sum() / float(m)

    # - the last chunk of the formula
    sumEnd = 0.0
    for i in range(m+1, n+1):
        sumEnd += (float(i)-((m+1.0)/2.0)) * w[i-1] # w indexed from 0 to n-1

    # - the beta squared / sigma squared - formula at the bottom of page 904
    ic = 1.0
    if n >= 13 and n < 15:
        ic = 3.5
    elif n >= 15:
     ic = 4.5
    b2s2 = (4.0/np.pi)/(ic*ic)

    # - the gnarly bit in the middle of the formula
    denominator = 1.0 + ((m*(m-1.0)*(m+1.0) / 12.0 ) * b2s2)
    u = np.repeat(np.nan, m) # return series - created empty
    for r in range(m): # r ranges 0 to m-1; but the formulae assumes 1 to m
        numerator = ((r+1.0) - (m+1.0)/2.0) * b2s2
        # - finally putting it all together
        u[r] = w[r] + sumResidual + ( numerator / denominator ) * sumEnd

    u.flags.writeable = False # let's make it quasi-immutable
    return (u)

def Henderson(s, n):
    """ Calculate an n-term Henderson Moving Average for the Series s
        Note: we blithely assume s is ordered, contiguous and without missing data"""

    # - some simple sanity checks
    if not isinstance(s, pd.core.series.Series):
        raise TypeError('The s argument should be a pandas Series')
    if not isinstance(n, int):
        raise TypeError('The n argument must be an integer')
    if n < 5:
        raise ValueError('The n argument must be >= 5')
    if n % 2 == 0:
        raise ValueError('The n argument must be odd')
    if len(s) < n:
        raise ValueError('The s argument should be a Series longer than n')

    # - calculate the symmetric weights
    w = hmaSymmetricWeights(n)

    # preliminaries
    r = pd.Series(np.repeat(np.nan, len(s)), index=s.index) # the empty return vehicle
    m = int((n-1)//2)
    l = len(s)

    # - and now move over the length of the series ...
    for i in range(len(s)) :
        if i < m:
            # --- head section of series
            u = hmaAsymmetricWeights(m+i+1, w)[::-1] # reverse - asymmetric to the left
            r.iloc[i] = (s.iloc[0:(i+m+1)] * u).sum()
        elif i + m >= l:
            # --- tail section of series
            u = hmaAsymmetricWeights(m+l-i, w)
            r.iloc[i] = (s.iloc[(i-m):l] * u).sum()
            # --- middle section of series
            r.iloc[i] = (s.iloc[(i-m):(i+m+1)] * w).sum()

    return (r)

### - test code
# Check against Table 1 in B Quenneville and B Lefrancois (2001),
# "Implicit Forecasts in Musgrave Asymmetric Averages",
# Proceedings of the Annual Meeting of the American Statistical Association,
# August 5-9, 2001.
#w = hmaSymmetricWeights(9)
#print(w.sum()) # should be one
#u = hmaAsymmetricWeights(7, w)
#print(u.sum()) # should be one
#print (Henderson(pd.Series(range(30))+pd.Series(np.random.randn(30)), 13))