Note: download Australian public holiday data on this page.
In this day and age, no data scientist should have to produce less accurate models for want of public holiday data. When modelling human behaviour over time, for example, this type of data can have a tremendous impact on the predictive power of a model. Think about how differently you behave on a public holiday compared to an ordinary day. So I was surprised not to find any data set out there that I would judge as “nice”.
So what does a good data set look like?
What do I define to be good data? That depends on the purpose for which it is intended. In the context of a public holiday data set for analytics, I look for the following attributes:
- Accuracy: Needless to say. Preferably from an official source.
- Completeness: My models are hungry beasts; they need all the data they can get. Preferably since the dawn of civilisation.
- Suitable format: I don’t wish to spend time scraping and restructuring data. Tabular CSV files please.
Finding a source with one of these attributes was achievable, but not with all three.
Here is an example of data that’s designed for people, not for data scientists.
This same data could be structured in the following way, which would be much more useful to people like me.
Existing sources of data
There are currently two national data sets, both administered by the Digital Transformation Agency but residing in different domains.
The first appears current and correct, and is intended to be used as a reference rather than as a data set that one would download. This page also provides links to the relevant state and territory based sources.
The second data set is described as “machine readable”, which I believe means it is in a tabular CSV format. There is also an API. I have identified a number of issues with this source, some of which are referred to in the comments section of the site. A summary:
- The data is split across multiple files that contain duplicated data.
- Each of these files has a different API address.
- The field names are slightly different from the 2017 data set onward.
- There is one field which describes the jurisdiction in which the public holiday applies. Within the records, a bar symbol “|” is used to indicate multiple jurisdictions.
- Update: The 2019 data set is now in long format, no more bars :|
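For the older files, the bar-separated jurisdiction field can be unpacked into one row per jurisdiction. The following is only a minimal sketch: the column names `Date`, `Holiday Name` and `Jurisdiction` and the sample rows are assumptions for illustration, not the actual field names of the published files.

```python
import csv
import io

# Hypothetical sample in the older style: several jurisdictions in one field.
SAMPLE = """Date,Holiday Name,Jurisdiction
20180101,New Year's Day,NSW|VIC|QLD
20180312,Labour Day,VIC
"""

def to_long_format(csv_text, field="Jurisdiction", sep="|"):
    """Yield one record per (holiday, jurisdiction) pair."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        for code in row[field].split(sep):
            code = code.strip()
            if code:  # skip empty fragments such as a trailing bar
                yield {**row, field: code}

rows = list(to_long_format(SAMPLE))
for r in rows:
    print(r["Date"], r["Jurisdiction"], r["Holiday Name"])
```

Each output row then carries exactly one jurisdiction, which is the long format that joins cleanly onto other tables keyed by date and state.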
Upon first glance this looks like a nicely laid out data set. Upon further inspection a number of issues become apparent:
- Semi-consistent Description field for the location of the public holiday, e.g. a national holiday is referred to as either “All” or an empty record.
- Applicable states are recorded in the same record, separated by a comma.
- States may be anti-recorded, e.g. “All except Tas, WA” to represent all states except Western Australia and Tasmania.
- Use of both the abbreviated and the full name to describe the same state, e.g. Vic and Victoria.
- Incomplete data, which is not disclosed. This is quite bad, since the data may be used on the assumption that it is complete.
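Most of the formatting issues in this list can be handled with a small normalisation step. The sketch below is one possible approach, not a tested cleaner for the real file: the function name `parse_jurisdictions`, the lookup table `STATE_CODES` and the choice of upper-case codes are all my own assumptions.

```python
# Assumed mapping from the spellings seen in the data to one code per jurisdiction.
STATE_CODES = {
    "act": "ACT", "australian capital territory": "ACT",
    "nsw": "NSW", "new south wales": "NSW",
    "nt": "NT", "northern territory": "NT",
    "qld": "QLD", "queensland": "QLD",
    "sa": "SA", "south australia": "SA",
    "tas": "TAS", "tasmania": "TAS",
    "vic": "VIC", "victoria": "VIC",
    "wa": "WA", "western australia": "WA",
}
ALL_STATES = sorted(set(STATE_CODES.values()))

def _codes(text):
    """Map a comma-separated list of state names to canonical codes."""
    return [STATE_CODES[part.strip().lower()]
            for part in text.split(",") if part.strip()]

def parse_jurisdictions(description):
    """Return the sorted list of jurisdiction codes a description refers to."""
    text = (description or "").strip()
    lowered = text.lower()
    if lowered in ("", "all"):            # national holiday, recorded two ways
        return list(ALL_STATES)
    if lowered.startswith("all except"):  # the "anti-recorded" form
        excluded = set(_codes(text[len("all except"):]))
        return [s for s in ALL_STATES if s not in excluded]
    return sorted(_codes(text))

print(parse_jurisdictions("All except Tas, WA"))
print(parse_jurisdictions("Victoria"))
```

An unrecognised spelling raises a `KeyError` rather than being silently dropped, which seems the safer default when the source is known to be inconsistent. Note that this does nothing about the undisclosed gaps in the data; those can only be fixed by going back to an official source.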
A better public holiday data set
I decided that I would build the data set I wished I had access to, and perhaps it will be useful to other people too. However I soon discovered a couple of subtleties particular to public holiday data:
- Sometimes a holiday falls on a weekend, which may result in a “lieu day” on the following weekday. “Lieu” comes from the French word for “place”; “in lieu of” means “instead of”.
- In some states and territories there are part-day public holidays.
A discussion on how to treat these is given below. Note that I’m coming from the perspective of a data scientist: I want to know how to structure this data for the purpose of building a predictive model.
Public holidays on a weekend
I prefer fewer columns to more, but I couldn’t think of a way to avoid this one. An “in lieu of” flag is required to differentiate a holiday that falls on a weekend from one that falls on a weekday. Several holidays are affected by this phenomenon, namely those with fixed dates:
- New Year’s Day
- Christmas Day
- Boxing Day
- Australia Day
- Anzac Day
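Detecting which fixed-date holidays land on a weekend is simple calendar arithmetic. The sketch below only illustrates that arithmetic: the function names are hypothetical, and the “following Monday” rule is a simplifying assumption. Whether a lieu day is actually granted, and on which day, is decided by each jurisdiction and is not encoded here.

```python
from datetime import date, timedelta

def falls_on_weekend(day):
    """True when the date is a Saturday (weekday 5) or Sunday (weekday 6)."""
    return day.weekday() >= 5

def following_monday(day):
    """Return the first Monday after a weekend date; weekdays are returned unchanged."""
    if not falls_on_weekend(day):
        return day
    return day + timedelta(days=7 - day.weekday())

christmas_2021 = date(2021, 12, 25)      # a Saturday
print(falls_on_weekend(christmas_2021))  # True
print(following_monday(christmas_2021))  # 2021-12-27
```

This naive rule maps two holidays on the same weekend (Christmas Day and Boxing Day, say) to the same Monday, so it is only useful as a way to populate candidates for the “in lieu of” flag, to be checked against the official dates.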
Part-day holidays are relatively rare, but they do exist. Annoyingly, I had to add yet another column to describe them. I decided to use the 24-hour format HHMM to indicate the time of day at which the public holiday applies.
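An HHMM value is easy to turn into something a model pipeline can compare against timestamps. This is a minimal sketch with hypothetical function names; it assumes the column holds the start time of the holiday and that the holiday then runs until midnight, which may not match how every part-day holiday is defined.

```python
from datetime import datetime, time

def parse_hhmm(value):
    """Convert a four-digit 24-hour string such as '1900' to a datetime.time."""
    value = value.strip()
    if len(value) != 4 or not value.isdigit():
        raise ValueError(f"expected HHMM, got {value!r}")
    return time(hour=int(value[:2]), minute=int(value[2:]))

def in_part_day_holiday(moment, start_hhmm):
    """Assumption: the holiday applies from the HHMM start time until midnight."""
    return moment.time() >= parse_hhmm(start_hhmm)

print(parse_hhmm("1900"))
print(in_part_day_holiday(datetime(2020, 12, 24, 20, 30), "1900"))
```

The value should be read as text rather than as a number, otherwise a time such as 0700 loses its leading zero and no longer parses.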
Bringing it all together…
Below is a sample of the data set I created. The complete list of Australian public holidays from 2009 onwards can be found here.
Public holiday data sets are typically intended to be a quick check for people to see if they have the day off next week, and are not intended for analytics professionals to feed into their machine learning models. I hope I have provided something useful, or at least something that saves you some time.