By Wpcpey [CC BY-SA 4.0 (], from Wikimedia Commons

Machine learning walk-through: Predicting pedestrian traffic

Building a machine learning model to predict a number is a trivial exercise with today’s tools. With highly optimised models, improving the accuracy of the prediction will always be dependent on the data, and clever ways of using it. This will be the subject of this article, we’ll use R and machine learning to predict pedestrian numbers in Melbourne.

We’ll step through this in two parts:

  • adding new data to your existing data set.
  • feature engineering – using the data that you have to make new variables

The data

The Melbourne city council provides a cool tool to explore the data here We can download it directly into R:

# this takes a minute
ped_data <- read.csv("")

It is updated monthly, it looks like the following:

22868722018-07-01T10:00:002018July1Sunday1039Alfred Place160
22869912018-07-01T13:00:002018July1Sunday134Town Hall (West)2852
22869922018-07-01T13:00:002018July1Sunday135Princes Bridge3350
22870712018-07-01T14:00:002018July1Sunday1437Lygon St (East)278
22870722018-07-01T14:00:002018July1Sunday1439Alfred Place94
22871912018-07-01T17:00:002018July1Sunday174Town Hall (West)2332

Let’s look at just one location, Bourke Street Mall (South) in Melbourne’s CBD. We can visualise this using the ggplot2 library.

ped_data %>% filter(Sensor_Name == "Bourke Street Mall (South)") %>% 
  mutate(Date_Time = ymd_hms(Date_Time)) %>% 
  ggplot(aes(y = Hourly_Counts, x = Date_Time)) +
  geom_line(aes(group = Sensor_Name, colour = Sensor_Name)) +
  theme(legend.position="bottom") + ylim(2000, 7500)

Interestingly, we can see a slow but steady long term decline in overall numbers, despite Melbourn’e population increasing in those years. There are strong peaks around Christmas/New Years, and secondary peaks that possibly represent school holidays. Using just this data, we can build our predictive model.

Basic Feature Engineering

This is a fancy way of saying “add a column”. Let’s extract the basic time information from the Date_Time field:

  • month
  • day of month
  • time of day
model_data <- ped_data %>% 
  filter(Sensor_Name == "Bourke Street Mall (South)") %>% 
  transmute(Date_Time = ymd_hms(Date_Time),
            Month = lubridate::month(Date_Time),
            Day = mday(Date_Time),
            Time = lubridate::hour(Date_Time),
            Hourly_Counts) %>% 
  filter(lubridate::year(Date_Time) > 2014) %>% 
  mutate(Date_Time = as.factor(Date_Time))

Let’s take a look. This is the basic structure of the input to train a model to predict Hourly_Counts:

2015-01-01 00:00:00110867
2015-01-01 01:00:00111614
2015-01-01 02:00:00112397
2015-01-01 03:00:00113239
2015-01-01 04:00:00114111
2015-01-01 05:00:0011552

Basic Machine Learning model

We can use the h2o framework to carry out our machine learning. To set it up we split the data into:

  • training data set to build our model
  • test set to measure accuracy of the model
h2o_model_data1 <- as.h2o(model_data)
h2o_model_data1 <- h2o.splitFrame(h2o_model_data1, 0.7)

Our training frame is hex_h2o_model_data[[1]], and our test data will be hex_h2o_model_data[[2]]

Let’s try the extreme gradient boosting algorithm XGBoost, it seems to be winning the Kaggle competitions last time I checked.  I find these algorithms are highly optimised with the default parameters. Feel free to play around with these if you feel otherwise.

training_model1 <- h2o.xgboost(training_frame = h2o_model_data1[[1]], x = -1, y = 5)

To explain the above snippet, the fourth column is Hourly_Counts i.e. the value we’re trying to predict, so it is the ‘y’ argument in h2o parlance.

Now we use our test data set as input into the model we just built.

h2o_predicted1 <- h2o.predict(training_model1, h2o_model_data1[[2]])
test_results1 <- h2o_model_data1[[2]], h2o_predicted1))

Let’s take a look:

5532015-01-01 01:00:00111614
582015-01-01 06:00:0011643
3152015-01-01 09:00:00119322
8332015-01-01 10:00:001110774
17972015-01-01 12:00:0011121879
22152015-01-01 13:00:0011132277

Note: Your results may vary slightly due to the stochastic nature of the algorithm.

This looks pretty good, the actual data in the Hourly_Counts column seem to track well with the data in our predict column. We can use the r2 metric to measure how well the model describes the data. An r2 equal to 1 means the model is perfect, an r2 equal to 0.5 means the model is no better than a guess.

h2o.r2(h2o.performance(training_model1, h2o_model_data1[[2]]))

This is a great result.

Add new data

Let’s see if we can improve the r2 value by adding public holiday data. There is an excellent data source here that covers the date range we are interested in.

pub_hols <- read_csv("")
# melbourne is in the state of Victoria (VIC), and we'll assign a public holiday label to those dates
pub_hols_data <- pub_hols %>% 
  filter(State == "VIC") %>% 
  mutate(Public_Holiday = 1)

Let’s take a look, the Day_In_Lieu field represents those days when a public holiday fell on a weekend, and the holiday was shifted to the following weekday.


And we’ll need to join this data set with our pedestrian data, we can use the date field.

model_data2 <- model_data %>% 
  mutate(Date = as.Date(Date_Time)) %>%   
  left_join(pub_hols_data, by =  "Date") %>% 
  select(-State, -Weekday, -Date, -Part_Day_From) %>% 
  mutate(Day_In_Lieu = ifelse(, 0, Day_In_Lieu),
         Public_Holiday = ifelse(, 0, Public_Holiday)) %>% 
  mutate(Day_In_Lieu = as.factor(Day_In_Lieu),
         Public_Holiday = as.factor(Public_Holiday)) %>% # be careful with factor variables
  mutate(Date_Time = as.factor(Date_Time)) # h2o doesn't recognise these date types

Let’s see what our data looks like now:

2015-01-01 00:00:0011086701
2015-01-01 01:00:0011161401
2015-01-01 02:00:0011239701
2015-01-01 03:00:0011323901
2015-01-01 04:00:0011411101
2015-01-01 05:00:001155201

Now build and test the model as before.

h2o_model_data2 <- as.h2o(model_data2)
h2o_model_data2 <- h2o.splitFrame(h2o_model_data2, 0.7)
training_model2 <- h2o.xgboost(training_frame = h2o_model_data2[[1]], x = -1, y = 5, ignore_const_cols = FALSE )
h2o.r2(h2o.performance(training_model2 , h2o_model_data2[[2]]))

I thought that would have more benefit! The result is essentially the same, so I would consider leaving this data out. Perhaps we can work with the data that we have …

Adding more features

Let’s keep it simple: Can we increase the r2 just by adding a day of week field?

model_data3 <- model_data %>% 
  mutate(Day_of_week = as.factor(weekdays(ymd_hms(Date_Time))))
h2o_model_data3 <- as.h2o(model_data3)
h2o_model_data3 <- h2o.splitFrame(h2o_model_data3, 0.7)
training_model3 <- h2o.xgboost(training_frame = h2o_model_data3[[1]], y = 5)
h2o.r2(h2o.performance(training_model3, h2o_model_data3[[2]]))

This is a substantial increase, the r2 value is getting close to one, so any new features will have diminishing returns. Let’s plot some of the predictions against the actuals

Hourly Pedestrian Counts: Prediction vs Actual

The approach described above can be applied to the other sensor locations around Melbourne. Indeed, this is a general approach that can be applied to a range of problems.