Wednesday, October 30

R code for reading ABS spreadsheets

I have been playing with some of my utility functions lately. The following function is one I am testing with data from the Australian Bureau of Statistics. I use it to load an ABS spreadsheet from my local file system into an R data frame. Almost all ABS spreadsheets have the meta data and actual data in the same row/column format.

When the ABS releases new data, my usual practice is to download the spreadsheet to my local hard drive. I then run the appropriate R graphing program for that dataset. All my graphing programs begin with a call to functions to get the ABS data into an R data frame.

The first half of the function is sanity checks. I use these a lot in R. It saves time down the track by helping avoid the many obscure R bugs that can occur. The second half of the function restructures the data within the data frame (essentially making it useful).

readABS <- function(fileName=NULL, sheet=2, output=NULL) {
    # load the standard format ABS Excel spreadsheet into an R data.frame

    ### --- upfront sanity checks (programming by contract)
    if(is.null(fileName)) {
        warning('readABS(fileName): fileName name not specified')
        return(NULL)
    }

    if(length(fileName) > 1) {
        warning('readABS(fileName): not vectorised, only the first fileName will be opened')
        fileName <- fileName[1] # truncate to scalar
    }

    if(length(sheet) > 1) {
        warning('readABS(fileName): not vectorised, only the first sheet will be opened')
        sheet <- sheet[1] # truncate to scalar
    }

    if(is.na(fileName) || !is.character(fileName)) {
        warning('readABS(fileName): fileName name must be a character string')
        return(NULL)
    }

    if(!file.exists(fileName)) {
        warning(paste('readABS(fileName): fileName does not exist --> ', fileName))
        return(NULL)
    }        

    if(!('gdata' %in% rownames(installed.packages())))
        stop('readABS(fileName): the gdata package is not installed')
    
    if(!('zoo' %in% rownames(installed.packages())))
        stop('readABS(fileName): the zoo package is not installed')
    
    ### --- lets do the heavy lifting
    require('gdata') 
    df <- read.xls(fileName, sheet, stringsAsFactors=FALSE)

    # minor sanity check
    if(nrow(df) < 10) {
        warning('readABS(fileName): something is odd; ABS spreadsheets normally have 10+ rows')
        return(df)
    }        
    
    # use ABS serial number as column/variable name
    colnames(df) <- as.character(df[9, ]) 
    colnames(df)[1] <- 't' # special case for the first column
    
    # keep an optional record of the meta-data - may be useful for debugging
    if(!is.null(output) && !any(is.na(output))) {
        if(ncol(df) >= 2) {
            meta.data <- data.frame(stringsAsFactors=FALSE)
            for(i in 2:ncol(df)) {
                meta.data <- rbind(meta.data, 
                    data.frame(index=as.character(df[[9, i]]), 
                        metaData=paste0(df[1:9, i], sep='|', collapse=''),
                        stringsAsFactors=FALSE))
            }
            write.csv(meta.data, file=as.character(output), row.names=FALSE)
        }
    }
    
    # delete the meta-data rows - ie. exclude rows 1 through 9
    df <- df[-1:-9, ] 
    
    # make the columns numeric - convert characters back to numbers
    if(ncol(df) >= 2)
        for(i in 2:ncol(df))
            df[[i]] <- as.numeric(as.character(df[[i]]))
    
    # fix dates - currently character strings in the format: Jun-1981
    # set date to middle of the month or quarter
    df$t <- as.Date(paste('15-', df$t, sep=''), format='%d-%b-%Y', frac=0.5)
    if(nrow(df)>2) {
        d <- as.numeric(df[[2, 't']] - df[[1, 't']])
        if(d > 85 && d < 95) {
            # time series appears to be quarterly ...
            require('zoo')
            m <- format(as.yearmon(df[[1, 't']]), '%b')
            if(m %in% c('Mar', 'Jun', 'Sep', 'Dec')) {
                t <- as.Date(as.yearmon(df$t), frac=1.0)
                df$t <- as.Date(as.yearqtr(t), frac=0.5)
            }
            else
                df$t <- as.Date(df$t - 30, frac=0.5) # middle of middle month in quarter
        }
    }
    
    # fix row numbers
    if(nrow(df) >= 1)
        rownames(df) <- 1:nrow(df)
    
    return(df)
}

Post Script

I should explain, this is not the method I use to read ABS files. I have a series of python routines I use to load the ABS spreadsheets into a MySQL database. I then access them in R from the MySQL database.

This was a test piece of code to see if I could read the ABS spreadsheets direct from R. It turns out that this code rounds the numbers as they appear (visually) in the spreadsheet, rather than capture the double precision accuracy that is available within ABS spreadsheets. In my view, this is a significant limitation.

Saturday, October 26

Young women (not) in jobs

According to the English proverb, "one swallow does not a summer make". In statistical parlance: do not confuse noise for signal. This is particularly true at the time at which a trend might be emerging.

With those caveats in mind, let's look at the latest monthly unemployment data for females aged between 15 and 19 years. When I apply seasonal adjustment, I get two months of substantial growth in this cohort's unemployment rate. (In my view, there is not enough data yet to say we have a problem, but the scale of the change in the data we have intrigues me).


Because I am a nerd, let's look quickly at the seasonal decomposition and a cycle plot of the seasonal component for this cohort. Although September is not normally a peak in the unemployment rate for women aged 15 to 19 years, the unseasonal increase is evident in the raw data.



Certainly, we are not seeing the same degree of jump among similarly aged males (but we are seeing a jump).


The recent (post GFC) deviation in participation rates adds to the mystery. Why are men in this age cohort leaving the labour force at a faster rate than women. Is it because they are studying? Or is it because they have less perseverance than women (and the young men are simply dropping out).




What is worse? A high female unemployment rate or a low male participation rate? Both trends are suggestive of an emerging issue in the labour market for 15 to 19 year olds.

CPI

A little late, but let's look at this week's CPI print. It's a big quarter on to match a big quarter off. So the through-the-year print for the September quarter is much the same as the June quarter.











Monday, October 7

A new R Cheat Sheet

A new Cheat Sheet - this time on the trips and traps in R programming.