
Machine learning walk-through: Predicting pedestrian traffic

Building a machine learning model to predict a number is a trivial exercise with today’s tools. With models already highly optimised, improving the accuracy of a prediction will always come down to the data, and clever ways of using it. That is the subject of this article: we’ll use R and machine learning to predict pedestrian numbers in Melbourne.

We’ll step through this in two parts:

  • adding new data to your existing data set
  • feature engineering – using the data you already have to make new variables

The data

Melbourne City Council provides a cool tool for exploring the data at http://www.pedestrian.melbourne.vic.gov.au. We can download it directly into R:

library(dplyr)
# this takes a minute
ped_data <- read.csv("https://data.melbourne.vic.gov.au/api/views/b2ak-trbp/rows.csv?accessType=DOWNLOAD")

The data set is updated monthly, and looks like the following:

ID       Date_Time            Year  Month  Mdate  Day     Time  Sensor_ID  Sensor_Name       Hourly_Counts
2286872  2018-07-01T10:00:00  2018  July   1      Sunday  10    39         Alfred Place      160
2286991  2018-07-01T13:00:00  2018  July   1      Sunday  13    4          Town Hall (West)  2852
2286992  2018-07-01T13:00:00  2018  July   1      Sunday  13    5          Princes Bridge    3350
2287071  2018-07-01T14:00:00  2018  July   1      Sunday  14    37         Lygon St (East)   278
2287072  2018-07-01T14:00:00  2018  July   1      Sunday  14    39         Alfred Place      94
2287191  2018-07-01T17:00:00  2018  July   1      Sunday  17    4          Town Hall (West)  2332
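
To confirm the columns and their types after the download, dplyr’s glimpse() gives a compact overview:

# compact summary of each column and its type
dplyr::glimpse(ped_data)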

Let’s look at just one location, Bourke Street Mall (South) in Melbourne’s CBD. We can visualise this using the ggplot2 library.

library(ggplot2)
library(lubridate)
ped_data %>% 
  filter(Sensor_Name == "Bourke Street Mall (South)") %>% 
  mutate(Date_Time = ymd_hms(Date_Time)) %>% # parse the timestamp string
  ggplot(aes(y = Hourly_Counts, x = Date_Time)) +
  geom_line(aes(group = Sensor_Name, colour = Sensor_Name)) +
  theme(legend.position = "bottom") +
  ylim(2000, 7500) # note: ylim() drops points outside the range; coord_cartesian() zooms without dropping

Interestingly, we can see a slow but steady long-term decline in overall numbers, despite Melbourne’s population increasing over those years. There are strong peaks around Christmas/New Year, and secondary peaks that possibly represent school holidays. Using just this data, we can build our predictive model.
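
As a quick check of that downward trend, we can aggregate yearly totals using the Year column already in the data (a sketch; note the first and last years in the set may be incomplete):

# total yearly counts for this sensor - a rough check of the long-term trend
ped_data %>% 
  filter(Sensor_Name == "Bourke Street Mall (South)") %>% 
  group_by(Year) %>% 
  summarise(Total_Counts = sum(Hourly_Counts)) %>% 
  arrange(Year)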

Basic Feature Engineering

This is a fancy way of saying “add a column”. Let’s extract the basic time information from the Date_Time field:

  • month
  • day of month
  • time of day
model_data <- ped_data %>% 
  filter(Sensor_Name == "Bourke Street Mall (South)") %>% 
  transmute(Date_Time = ymd_hms(Date_Time),
            Month = lubridate::month(Date_Time),
            Day = mday(Date_Time),
            Time = lubridate::hour(Date_Time),
            Hourly_Counts) %>% 
  filter(lubridate::year(Date_Time) > 2014) %>% # keep 2015 onwards
  mutate(Date_Time = as.factor(Date_Time)) # h2o doesn't recognise POSIXct, so store the timestamp as a factor

Let’s take a look. This is the basic structure of the input to train a model to predict Hourly_Counts:

Date_Time            Month  Day  Time  Hourly_Counts
2015-01-01 00:00:00  1      1    0     867
2015-01-01 01:00:00  1      1    1     614
2015-01-01 02:00:00  1      1    2     397
2015-01-01 03:00:00  1      1    3     239
2015-01-01 04:00:00  1      1    4     111
2015-01-01 05:00:00  1      1    5     52

Basic Machine Learning model

We can use the h2o framework to carry out our machine learning. To set it up we split the data into:

  • a training data set to build our model
  • a test set to measure the accuracy of the model
library("h2o")
h2o.init()
h2o_model_data1 <- as.h2o(model_data)
 
h2o_model_data1 <- h2o.splitFrame(h2o_model_data1, 0.7)

Our training frame is h2o_model_data1[[1]], and our test data will be h2o_model_data1[[2]].
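
As a quick sanity check, the row counts of the two frames should split roughly 70/30:

# roughly 70% of rows should land in the training frame
nrow(h2o_model_data1[[1]])
nrow(h2o_model_data1[[2]])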

Let’s try the extreme gradient boosting algorithm XGBoost; it seemed to be winning the Kaggle competitions last time I checked. I find these algorithms are already highly optimised with the default parameters, but feel free to play around with them if you feel otherwise.

training_model1 <- h2o.xgboost(training_frame = h2o_model_data1[[1]], x = -1, y = 5)

To explain the above snippet: x = -1 excludes the first column (Date_Time) from the predictors, and the fifth column is Hourly_Counts, i.e. the value we’re trying to predict, so it is the ‘y’ argument in h2o parlance.
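
If numeric indices feel fragile, the same model can be specified with column names instead; the following call is equivalent (the name training_model1_named is just for illustration):

# equivalent call using column names rather than indices
training_model1_named <- h2o.xgboost(training_frame = h2o_model_data1[[1]],
                                     x = c("Month", "Day", "Time"),
                                     y = "Hourly_Counts")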

Now we use our test data set as input into the model we just built.

h2o_predicted1 <- h2o.predict(training_model1, h2o_model_data1[[2]])
test_results1 <- as.data.frame(h2o.cbind( h2o_model_data1[[2]], h2o_predicted1))

Let’s take a look:

predict  Date_Time            Month  Day  Time  Hourly_Counts
553      2015-01-01 01:00:00  1      1    1     614
58       2015-01-01 06:00:00  1      1    6     43
315      2015-01-01 09:00:00  1      1    9     322
833      2015-01-01 10:00:00  1      1    10    774
1797     2015-01-01 12:00:00  1      1    12    1879
2215     2015-01-01 13:00:00  1      1    13    2277

Note: Your results may vary slightly due to the stochastic nature of the algorithm.

This looks pretty good: the actual data in the Hourly_Counts column seem to track well with the data in our predict column. We can use the r2 metric to measure how well the model describes the data. An r2 equal to 1 means the model is perfect; an r2 equal to 0 means the model is no better than always guessing the mean.

h2o.r2(h2o.performance(training_model1, h2o_model_data1[[2]]))
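
For intuition, the same statistic can be computed by hand from the test_results1 data frame we built above, using the standard r2 formula:

# r2 = 1 - (residual sum of squares / total sum of squares)
with(test_results1,
     1 - sum((Hourly_Counts - predict)^2) /
         sum((Hourly_Counts - mean(Hourly_Counts))^2))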

This is a great result.

Add new data

Let’s see if we can improve the r2 value by adding public holiday data. There is an excellent data source here that covers the date range we are interested in.

library(readr)
pub_hols <- read_csv("https://www.michaelplazzer.com/wp-content/uploads/2018/09/Aus_public_hols_2009-2019.csv")
 
# Melbourne is in the state of Victoria (VIC); we'll assign a public holiday label to those dates
pub_hols_data <- pub_hols %>% 
  filter(State == "VIC") %>% 
  mutate(Public_Holiday = 1)

Let’s take a look. The Day_In_Lieu field represents those days when a public holiday fell on a weekend and the holiday was shifted to the following weekday.

Date        State  Weekday   Day_In_Lieu  Part_Day_From  Public_Holiday
2009-01-01  VIC    Thursday  0            0              1
2009-01-26  VIC    Monday    0            0              1
2009-03-09  VIC    Monday    0            0              1
2009-04-10  VIC    Friday    0            0              1
2009-04-11  VIC    Saturday  0            0              1
2009-04-12  VIC    Sunday    0            0              1

Now we need to join this data set with our pedestrian data; we can use the date field.

model_data2 <- model_data %>% 
  mutate(Date = as.Date(Date_Time)) %>%   
  left_join(pub_hols_data, by =  "Date") %>% 
  select(-State, -Weekday, -Date, -Part_Day_From) %>% 
  mutate(Day_In_Lieu = ifelse(is.na(Day_In_Lieu), 0, Day_In_Lieu),
         Public_Holiday = ifelse(is.na(Public_Holiday), 0, Public_Holiday)) %>% 
  mutate(Day_In_Lieu = as.factor(Day_In_Lieu),
         Public_Holiday = as.factor(Public_Holiday)) %>% # be careful with factor variables
  mutate(Date_Time = as.factor(Date_Time)) # h2o doesn't recognise these date types
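
Before modelling, it’s worth a quick check that the join behaved as expected; with roughly a dozen public holidays a year in Victoria, only a small share of hours should carry the flag:

# count hours flagged as public holidays vs ordinary hours
model_data2 %>% count(Public_Holiday)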

Let’s see what our data looks like now:

Date_Time            Month  Day  Time  Hourly_Counts  Day_In_Lieu  Public_Holiday
2015-01-01 00:00:00  1      1    0     867            0            1
2015-01-01 01:00:00  1      1    1     614            0            1
2015-01-01 02:00:00  1      1    2     397            0            1
2015-01-01 03:00:00  1      1    3     239            0            1
2015-01-01 04:00:00  1      1    4     111            0            1
2015-01-01 05:00:00  1      1    5     52             0            1

Now build and test the model as before.

 
h2o_model_data2 <- as.h2o(model_data2)
h2o_model_data2 <- h2o.splitFrame(h2o_model_data2, 0.7)
 
training_model2 <- h2o.xgboost(training_frame = h2o_model_data2[[1]], x = -1, y = 5, ignore_const_cols = FALSE )
 
h2o.r2(h2o.performance(training_model2 , h2o_model_data2[[2]]))

I thought that would have more benefit! The result is essentially the same, so I would consider leaving this data out. Perhaps we can work with the data that we have …

Adding more features

Let’s keep it simple: Can we increase the r2 just by adding a day of week field?

library(lubridate)
model_data3 <- model_data %>% 
  mutate(Day_of_week = as.factor(weekdays(ymd_hms(Date_Time))))
 
h2o_model_data3 <- as.h2o(model_data3)
h2o_model_data3 <- h2o.splitFrame(h2o_model_data3, 0.7)
 
training_model3 <- h2o.xgboost(training_frame = h2o_model_data3[[1]], x = -1, y = 5) # x = -1 excludes Date_Time, as before
 
h2o.r2(h2o.performance(training_model3, h2o_model_data3[[2]]))
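
As an aside, h2o can report how much each variable contributes to the fitted model, which shows whether the new day-of-week field is pulling its weight:

# relative importance of each predictor in the fitted model
h2o.varimp(training_model3)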

As for accuracy, this is a substantial increase: the r2 value is getting close to one, so any new features will have diminishing returns. Let’s plot some of the predictions against the actuals.
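
The chart below was produced along these lines; this is a sketch, building a predictions frame for model 3 the same way as test_results1, with January 2015 as an arbitrary window:

# build a predictions frame for model 3, mirroring the earlier steps
h2o_predicted3 <- h2o.predict(training_model3, h2o_model_data3[[2]])
test_results3 <- as.data.frame(h2o.cbind(h2o_model_data3[[2]], h2o_predicted3))
 
# overlay predicted and actual counts for one example month of the test set
test_results3 %>% 
  mutate(Date_Time = ymd_hms(as.character(Date_Time))) %>% 
  filter(Date_Time < ymd_hms("2015-02-01 00:00:00")) %>% 
  ggplot(aes(x = Date_Time)) +
  geom_line(aes(y = Hourly_Counts, colour = "Actual")) +
  geom_line(aes(y = predict, colour = "Prediction")) +
  labs(colour = NULL) +
  theme(legend.position = "bottom")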

Hourly Pedestrian Counts: Prediction vs Actual

The approach described above can be applied to the other sensor locations around Melbourne. Indeed, this is a general approach that can be applied to a range of problems.