Sadly the Dublin bikes do not look like this.

## Dublin bikes revisited

### Same idea, more data and more cities

Quite a while ago I wrote a post about the Dublin bikes scheme. J.C.Decaux, who run the scheme, make real-time data about the location of the bikes throughout the city available and I thought it would be interesting to collect this data and carry out some simple analysis of it.

At the time when I wrote my previous post, October 2009, I only had 7 weeks of data (about the length of time the scheme had then been running). I was pleased to discover that it was possible to learn a bit about Dubliners' bike habits even with this limited quantity of data. Nevertheless, I wanted to return to the idea when I had a bigger data-set.

It has taken a while but I have finally collected over a year of data for the Dublin bikes scheme as well as almost a year of data for similar bike schemes (also run by J.C.Decaux) in Brussels, Lyon, Paris and Seville.

### Questioning the data

To my mind, there are all sorts of questions one might investigate using this bike data. Below I describe those that I looked into. I would be interested to hear about others' work. (Incidentally, I have made the data I have available for download - see below).

### Bike usage by time of day

The fundamental datum that J.C.Decaux make available is the number of bikes docked at each station at a given point in time. Let us call this a snapshot of a given station at a moment in time. By adding snapshot data up across all stations at the same time we can deduce how many bikes are in use. Taking the average over all days in the data-set and treating week days and weekends separately (for obvious reasons) we can generate charts of bike usage as a function of time of day.

In the charts below we plot the average number of bikes in use as a function of time of day, minus the number that were in use at midnight.

Dublin bike usage by time of day (weekdays)

Dublin bike usage by time of day (weekends)
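For anyone wanting to reproduce these charts, here is a minimal sketch of the calculation, assuming records have already been reduced to `(station_id, seconds_since_midnight, available)` tuples. The function names and the 5-minute bucket size are my own choices for illustration, and the sketch quietly assumes every station reports at least once per bucket (a missing station would make that bucket's total look too low).

```python
from collections import defaultdict

def docked_profile(records, bucket_seconds=300):
    """Average docked bikes per station per time-of-day bucket, summed over stations.

    records: iterable of (station_id, seconds_since_midnight, available).
    Returns {bucket_index: estimated total bikes docked across the city}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for station_id, ssm, available in records:
        key = (station_id, int(ssm) // bucket_seconds)
        sums[key] += available
        counts[key] += 1
    profile = defaultdict(float)
    for (station_id, bucket), total in sums.items():
        profile[bucket] += total / counts[(station_id, bucket)]
    return dict(profile)

def usage_relative_to_midnight(profile):
    """Bikes in use per bucket, minus the number in use in the earliest bucket.

    Fewer docked bikes means more bikes in use, so usage is the baseline
    docked count minus the docked count in each bucket. The earliest bucket
    is assumed to be the one containing midnight.
    """
    baseline = profile[min(profile)]
    return {bucket: baseline - docked for bucket, docked in sorted(profile.items())}

# Tiny made-up example: two stations, two 5-minute buckets.
example_records = [(1, 0, 10), (2, 0, 5), (1, 300, 8), (2, 300, 5), (1, 320, 6)]
example_profile = docked_profile(example_records)
example_usage = usage_relative_to_midnight(example_profile)
```

Feeding in only weekday records (or only weekend records) gives the two separate curves.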

The peaks in the morning and evening presumably represent people going to and coming from work. As such they give us a way to estimate the length of people's working days (including the time getting to/from the bike stations). I made this observation in my previous post and it occurred to me that it might be interesting to use this as a method for comparing the length of the working day across different European cities. At the time, I did not have any data for other cities. Now however, that has all changed! Below are the corresponding (weekday) charts for Brussels, Lyon, Paris and Seville.

Brussels bike usage by time of day

Lyon bike usage by time of day

Paris bike usage by time of day

Seville bike usage by time of day

The first thing that strikes me about this data is that of the five cities, four of them have reasonably unsurprising profiles but one, Brussels, does not. Based solely on this quick plotting exercise, it looks like something may be wrong (or at least very different) with the Brussels bike scheme. In fact I was not completely surprised when I saw this because I had had some correspondence with an entrepreneurial Brussels bike user, Jonathan Van Parys, who mentioned he had concerns about the Brussels bike scheme. In his words:
> Brussels's physical geography leads to a rather inefficient allocation of the bikes as the day goes on (ie. stations at the top of hills are empty while those at the bottom are full), which can be a little frustrating.

Furthermore he has set up an excellent website with useful data about the Brussels bike scheme with the intention of drawing attention to this important issue.

Returning to the original idea of comparing different European cities' apparent/implied working hours, let us gather our data in a table:

| City | Morning peak (local time) | Evening peak (local time) | Time between peaks |
|---|---|---|---|
| Brussels | 08:45 | None | N/A |
| Dublin | 08:50 | 17:45 | 8 hrs 55 mins |
| Lyon | 08:50 | 18:10 | 9 hrs 20 mins |
| Paris | 08:55 | 18:50 | 9 hrs 55 mins |
| Seville | 07:55 | 20:25 | 12 hrs 30 mins |

So what do we make of that!? Certainly the data makes no suggestion that our European counterparts are slacking off with a shorter working week. Indeed Dublin comes in bottom of the four cities. The peak times are much less well defined for the other cities than for Dublin (this is because there are far more bikes and stations) but even allowing for this, there is no way this data suggests we work any harder here in Dublin (indeed it is hard not to conclude the opposite!). Of course this is an extremely crude method and shouldn't really be taken very seriously for a whole host of reasons but I still thought it would be fun to see what I might find.
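The peak times in the table could be read off a usage profile programmatically rather than by eye. The following is only a sketch: the search windows (06:00 to 12:00 and 15:00 to 22:00) are assumptions of mine, and the profile values are made up purely so that the example reproduces the Dublin row.

```python
def find_peak(profile, start_hhmm, end_hhmm):
    """Return the 'HH:MM' key with the highest usage in [start_hhmm, end_hhmm).

    Zero-padded 'HH:MM' strings sort in time order, so plain string
    comparison is enough to select the window.
    """
    window = {t: v for t, v in profile.items() if start_hhmm <= t < end_hhmm}
    return max(window, key=window.get)

def minutes_between(first_hhmm, second_hhmm):
    """Number of minutes from first_hhmm to second_hhmm on the same day."""
    h1, m1 = map(int, first_hhmm.split(':'))
    h2, m2 = map(int, second_hhmm.split(':'))
    return (h2 * 60 + m2) - (h1 * 60 + m1)

# Illustrative values only, not the real data.
profile = {
    '08:45': 60, '08:50': 75, '08:55': 70,
    '13:00': 40,
    '17:40': 62, '17:45': 68, '17:50': 64,
}
morning = find_peak(profile, '06:00', '12:00')
evening = find_peak(profile, '15:00', '22:00')
hours, minutes = divmod(minutes_between(morning, evening), 60)
print('%s -> %s: %d hrs %d mins' % (morning, evening, hours, minutes))
```

For a noisy profile like Paris or Brussels, smoothing the curve before taking the maximum would probably be needed.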

Two other brief points here are:

1. The reason the graphs for the other cities, especially Paris and Brussels, are so noisy is that there are a great many more stations in these cities than there are in Dublin. As a result, and because it takes time to scrape the data from each station, I have a similar number of records per day for these cities as I do for Dublin but a much smaller number of snapshots of the entire city. (E.g., if we have 100 records between 10 stations we have 10 snapshots of the city but if we have 100 records between 50 stations, we only have 2 snapshots of the city). Fewer data-points means more noise.
2. There appear to be two peaks around lunchtime for Lyon. I do not know why this is the case and would be interested if anyone has an idea. (Perhaps it is a seasonal effect and people try to avoid the hottest sun in the summer? I could look for this in the data but I haven't bothered.)

### Seasonality

Another obvious question I thought I'd ask of the data was how the Summer compared to the Winter and how Mondays compared to Fridays in terms of bike usage.

Regarding Summer vs. Winter, the situation is mildly interesting. The peaks in our usage by time-of-day graph occur at exactly the same time and the usage peaks to the same value in the morning. However usage peaks at a significantly smaller value later in the day for Winter than for Summer. My guess is that people prefer using the bikes in Summer but not so much that they are willing to be late for work! (So morning usage is unaffected by the weather but not so for afternoon/evening.)

Dublin bike usage by time of day (weekdays): Winter vs. Summer

For Monday vs. Friday, there is no significant difference. I had wondered if it might be possible to observe a late arrival/early departure time effect but the data does not even hint at its presence.

Dublin bike usage by time of day (weekdays): Monday vs. Friday
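Splitting the data for these comparisons just means labelling each day before averaging. Here is a rough sketch of such a labelling function; the month sets used for "Summer" and "Winter" are assumptions for illustration (the exact definition is not recorded above), as is the function name.

```python
import datetime

# Assumed season definitions, chosen only for illustration.
SUMMER_MONTHS = {5, 6, 7, 8, 9}
WINTER_MONTHS = {11, 12, 1, 2, 3}

def day_labels(d):
    """Return the set of comparison groups that a given date belongs to."""
    labels = set()
    # datetime.date.weekday(): Monday is 0, Sunday is 6.
    labels.add('weekend' if d.weekday() >= 5 else 'weekday')
    if d.weekday() == 0:
        labels.add('monday')
    elif d.weekday() == 4:
        labels.add('friday')
    if d.month in SUMMER_MONTHS:
        labels.add('summer')
    elif d.month in WINTER_MONTHS:
        labels.add('winter')
    return labels

print(sorted(day_labels(datetime.date(2010, 7, 5))))
```

Each record is then added to the running average of every group its day belongs to, and one time-of-day curve is plotted per group.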

### Bike usage by station

Yet another question one can ask is how the different stations compare to each other in terms of busyness. This is impossible to measure perfectly using the snapshot-type data that I have because many dockings/undockings can take place between two snapshots and it is impossible to know how many. Nevertheless we can count how many times each station's snapshot changes throughout the day to get an approximate idea of station busyness. This is a biased measure (e.g., it would be biased against stations that have intense but sparse periods of busyness and biased in favour of stations with a more even spread of busyness) but there should still be some value in its results. Here is what I found:

Busyness by station (count of snapshot changes versus busyness rank)

Evidently there is quite a significant spread. I would expect the true chart to have approximately the same ordering but a much greater range of values. Even based on this data, it is worth noting that Hardwicke Street and Parnell Square North stand out as extremely underused stations relative to the others. These two are quite close to each other, so it might be interesting to plot station usage on a map to see this data in terms of the city geography. Below is the underlying data for the chart, including the dictionary which reveals stations by busyness rank.

| Busyness rank | Station name | Busyness |
|---|---|---|
| 40 | Hardwicke Street | 10436 |
| 39 | Parnell Square North | 13651 |
| 38 | Georges Quay | 15947 |
| 37 | Blessington Street | 16722 |
| 36 | Greek Street | 18594 |
| 35 | Fitzwilliam Square West | 19516 |
| 34 | Bolton Street | 20301 |
| 33 | Custom House | 20442 |
| 32 | Ormond Quay Upper | 20794 |
| 31 | Eccles Street | 21521 |
| 30 | Christchurch Place | 21957 |
| 29 | St. Stephen's Green East | 22104 |
| 28 | Golden Lane | 23741 |
| 27 | Leinster Street South | 23997 |
| 26 | Merrion Square West | 24129 |
| 25 | Dame Street | 24400 |
| 24 | Cathal Brugha Street | 24824 |
| 23 | Parnell Street | 26293 |
| 22 | Mountjoy Square West | 26692 |
| 21 | Jervis Street | 26733 |
| 20 | James Street East | 26879 |
| 19 | Molesworth Street | 27536 |
| 18 | Earlsfort Terrace | 28074 |
| 17 | High Street | 28082 |
| 16 | Townsend Street | 28662 |
| 15 | Merrion Square East | 29301 |
| 14 | St. Stephen's Green South | 29408 |
| 13 | Fownes Street Upper | 29743 |
| 12 | Talbot Street | 31604 |
| 11 | Wilton Terrace | 32323 |
| 10 | Portobello Harbour | 32326 |
| 9 | Charlemont Street | 32999 |
| 8 | Exchequer Street | 34557 |
| 7 | Princes Street / O'Connell Street | 34765 |
| 6 | Smithfield | 34829 |
| 5 | Grantham Street | 35034 |
| 4 | Chatham Street | 36681 |
| 3 | Herbert Place | 39368 |
| 2 | Pearse Street | 39857 |
| 1 | Custom House Quay | 40035 |
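The busyness measure used here (counting how often a station's snapshot differs from its previous snapshot) can be sketched in a few lines. This assumes records of the form `(station_id, timestamp, available)`; the function name is mine, and this is an illustration of the idea rather than the exact script used to produce the table.

```python
from collections import defaultdict

def count_snapshot_changes(records):
    """Count, per station, how many consecutive snapshots differ.

    records: iterable of (station_id, timestamp, available), in any order.
    Several dockings/undockings between two snapshots count at most once,
    which is exactly the bias discussed in the text.
    """
    by_station = defaultdict(list)
    for station_id, timestamp, available in records:
        by_station[station_id].append((timestamp, available))
    busyness = {}
    for station_id, snaps in by_station.items():
        snaps.sort()  # order each station's snapshots by timestamp
        busyness[station_id] = sum(
            1 for (_, before), (_, after) in zip(snaps, snaps[1:]) if before != after
        )
    return busyness

# Made-up example: station 1 changes twice, station 2 never changes.
example = [(1, 0, 5), (1, 60, 5), (1, 120, 4), (1, 180, 6), (2, 0, 3), (2, 60, 3)]
example_busyness = count_snapshot_changes(example)
```

Sorting the resulting dictionary by value gives the busyness ranking.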

### Weather effects

Finally I thought it might be worthwhile looking at weather effects directly (not just through the seasons). I found that Met Eireann make historical weather station data available and scraped a little over a year of it (available for download below). I tested bike usage against weather station data from Dublin Airport (obviously I would prefer more central weather station data, like Merrion Square, but I could not get it). The data I tested against was:
• Rainfall (in mm)
• Sunshine (in hours)
• Minimum day's temperature (in degrees Celsius)
• Maximum day's temperature (in degrees Celsius)
I had one data-point for each of the above measurements per day. I calculated a day's busyness in terms of bike usage by adding up the number of changes to a station's snapshots over the whole day, across all stations. I then scatter-plotted this busyness against the corresponding day's weather figure for each day.

At first I expected the rainfall data to have the strongest predictive power, however this turned out not to be the case. Thinking about it, I should not have been surprised. Firstly, I only have one point in the scatter plot per day and only just over a year of data, so only about 400 data-points. While it might seem surprising to those of us who live in Dublin, it did not actually rain at all on the vast majority of days so most of the data-set is wasted on dry days. Secondly, while I believe rain is a very strong disincentive for people to cycle it usually only rains for relatively brief periods so we would probably need rainfall data on a much finer timescale, like every 5 or 10 minutes, to see the effect clearly. Finally, we have rainfall data for Dublin Airport, not Dublin City Centre. In any case, below are the charts.

Dublin rainfall (mm) as a function of time (note similarity of Summer and Winter!)

Busyness vs. rainfall (mm) (no apparent relationship)

After giving up on rainfall, I did however find that there appeared to be a relationship between maximum day's temperature and bike usage. This makes sense, and it is easy to believe that such a weather effect would be less susceptible to the problems that affected the rainfall analysis, since temperature is much less localised than rain and persists for longer. While the scatter plot is noisy, it does look like there is a plausible positive relationship. (Btw if you're an experimental physicist, the relationship probably looks weak; to a quant. in finance it looks strong!) Here are the charts:

Dublin maximum temperature as a function of time (note the maximum is below 0 in January 2010!)

Busyness vs. max temp. (plausible positive relationship)
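One way to put a number on scatter plots like these, which is an addition here and not something done in the analysis above, is a Pearson correlation coefficient between the daily weather figure and the daily busyness. The sketch below assumes two dictionaries keyed by a `yyyymmdd` string; all names and values are hypothetical.

```python
import math

def paired_by_day(busyness_by_day, weather_by_day):
    """Keep only days present in both data-sets, returning aligned lists."""
    days = sorted(set(busyness_by_day) & set(weather_by_day))
    return [weather_by_day[d] for d in days], [busyness_by_day[d] for d in days]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up figures, deliberately perfectly linear, just to exercise the code.
busyness_by_day = {'20100101': 900, '20100102': 1100, '20100103': 1500}
max_temp_by_day = {'20100101': 2.0, '20100102': 4.0, '20100103': 8.0, '20100104': 9.0}
temps, busy = paired_by_day(busyness_by_day, max_temp_by_day)
r = pearson(temps, busy)
```

With mostly dry days in the rainfall column, a single coefficient would suffer from the same problems described above, so it is no substitute for looking at the plot.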

### Further thoughts

There is a bike-sharing blog that is worth a visit if you are at all interested in this sort of thing if only to see its world map of bike schemes around the world. I had not realised they were so widespread till I saw this.

Although I like the Dublin bike scheme, I can't help wondering if it is good value for money. It may well be, but I can't help wondering. It seems to me that there is a potential conflict of interest for the city councilors who make the decisions (the scheme is popular so they could be biased in favour of it, even at a bad price) or at least that the best decisions may not be made because of a possible asymmetric perception of opportunity cost relative to upfront cost by either the councilors or the public. A significant part of the payment for the scheme is in advertising revenue forgone by providing J.C.Decaux with free hoardings throughout the city. I would be interested to see the pricing detail. I have been told by those in the know that the keyword when buying such advertising space is the rate card and it seems that this website has some figures for that. I have not quite found the time to look into this in detail myself but would be interested to hear others' thoughts. I'm sure that a few FOI requests could turn up some interesting figures.

For my own part, I feel like I have spent more than enough time analysing bike schemes for now (indeed I had to force myself to write this post). However Dublin Bus are still promising that they will be providing GPS data on the location of their buses any day (they missed their own 2010 deadline). They claim to have a pilot scheme running on the 123 route and I did manage to find one stop with a so-called real time passenger information (RTPI) display on this route, but the display was blank. However in the last few weeks I have noticed the posts for several such displays appearing beside stops on Nassau Street so I expect this data may be on the way. I look forward to gathering and analysing it to see if I can detect any correlation between timetables and the movement of the buses.

### A post-script on technical details

Based on some of the queries I have received in relation to my other post on this topic, I thought it might be useful if I included a little bit of technical information about how/where to scrape the data. In that post, I give the URLs to visit in order to get the data but I did not supply the simple python script to actually scrape the data. I have received quite a few requests for this so here it is (in all its hacky glory!):

    # Python 2 script (uses urllib.urlretrieve and list-returning map).
    import urllib, time, csv, sys, datetime, gzip
    from xml.dom import minidom

    URL_station_list = 'https://abo-%s.cyclocity.fr/service/carto'
    URL_data = 'https://abo-%s.cyclocity.fr/service/stationdetails/%d'

    def prev_day_s(d_s):
        # Given a date string yyyymmdd, return the previous day in the same format.
        d = datetime.date(int(d_s[:4]), int(d_s[4:6]), int(d_s[6:8]))
        d -= datetime.timedelta(days=1)
        return '%04d%02d%02d' % (d.year, d.month, d.day)

    def yyyymmdd_from_epoch(t):
        tm = time.localtime(t)
        return '%04d%02d%02d' % (tm.tm_year, tm.tm_mon, tm.tm_mday)

    def get_stations(city, date, get_file = True):
        # Download (or re-use) the station list and return the sorted station numbers.
        fname = '%s/%s.stations.xml' % (city, date)
        if get_file:
            urllib.urlretrieve(URL_station_list % city, fname)
        stations = map(lambda x: int(x.getAttribute('number')), minidom.parse(fname).getElementsByTagName('marker'))
        assert len(stations) > 0 # Horrible hack to quickfix case when web serves up incorrect stations data.
        return sorted(stations)

    if len(sys.argv) != 5:
        sys.stderr.write('Usage: python %s <city> <run time> <main loop delay> <request delay>\n' % sys.argv[0])
        sys.exit(1)

    city = sys.argv[1]
    (run_time, main_delay, req_delay) = map(float, sys.argv[2:])

    start_epoch = time.time()
    date = yyyymmdd_from_epoch(start_epoch)
    try:
        stations = get_stations(city, date)
    except:
        # Fall back on yesterday's station list if today's could not be fetched.
        stations = get_stations(city, prev_day_s(date), False)

    out = csv.writer(gzip.open('%s/%s.out.csv.gz' % (city, date), 'a'))
    err = csv.writer(gzip.open('%s/%s.err.csv.gz' % (city, date), 'a'))
    t = time.time()
    while t < start_epoch + run_time:
        # One pass over all stations is one snapshot of the whole city.
        for i in stations:
            t = time.time()
            try:
                out.writerow([i, t] + map(str.strip, urllib.urlopen(URL_data % (city, i))))
            except:
                err.writerow([i, t, sys.exc_info()[0], sys.exc_info()[1]])
            time.sleep(req_delay)
        time.sleep(main_delay)


I also thought it would be worth recording the summary statistics for my various data-sets. So here are a few that seem relevant:

| City | Start date | End date | Number of records | Link to data |
|---|---|---|---|---|
| Brussels | 7-Dec-2009 | 8-Nov-2010 | 33,741,610 | 121 MB |
| Dublin | 19-Sep-2009 | 8-Nov-2010 | 30,909,771 | 104 MB |
| Lyon | 7-Dec-2009 | 8-Nov-2010 | 34,236,139 | 147 MB |
| Seville | 7-Dec-2009 | 8-Nov-2010 | 33,365,277 | 124 MB |
| Paris | 7-Dec-2009 | 8-Nov-2010 | 40,232,297 | 189 MB |

I will happily supply this data to anyone who requests it from me.

A record consists of a snapshot of a station and as such consists of a station ID, a date/time-stamp and a number which is the number of bikes docked at that station at that date/time. (It would be extremely interesting to get hold of J.C.Decaux's data in which they can actually track individual bikes.)
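To make the record format concrete, here is a small sketch of pulling the docked-bike count out of one scraped XML response. The only thing taken from the real feed is the `available` tag name; the sample XML string and the function name are hypothetical.

```python
from xml.dom import minidom

def available_from_xml(xml_text):
    """Return the number of docked bikes from a station-details XML string.

    Returns None when the <available> element is missing or empty.
    """
    nodes = minidom.parseString(xml_text).getElementsByTagName('available')
    if not nodes or nodes[0].firstChild is None:
        return None
    return int(nodes[0].firstChild.data)

# Hypothetical response body; the real feed may well carry more fields.
sample = '<station><available>7</available></station>'
station_id, timestamp = 12, 1289217600.0
record = (station_id, timestamp, available_from_xml(sample))
print(record)
```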

Also, in case it is useful to somebody, here is the weather data page on Met Eireann and the data which I have: weather.csv.

One last point which might be worth making (if only to document it for my own sake) is that I ended up working with these moderately large data-sets in a raw csv format but that I experimented with using sqlite to manage them. A simple python script like this:


    # Python 2 script (uses the print statement).
    import sqlite3
    import sys
    import datetime
    import csv
    import gzip
    from xml.dom import minidom
    from xml.parsers.expat import ExpatError
    from utils import yyyymmdd_from_date, date_from_yyyymmdd, ssm_from_datetime

    path = '../raw_data/v1/%s/%s.out.csv.gz'

    city, startdate, enddate = sys.argv[1:4]
    startdate = date_from_yyyymmdd(startdate)
    enddate = date_from_yyyymmdd(enddate)

    db = sqlite3.connect('%s.db' % city)
    curs = db.cursor()
    curs.execute('create table station_snaps (station_id integer, date text, ssm integer, available integer)')

    # Load one day's scraped file at a time.
    d = startdate
    while d <= enddate:
        db_rows = []
        try:
            for row in csv.reader(gzip.open(path % (city, yyyymmdd_from_date(d)))):
                try:
                    station_id = int(row[0])
                    timestamp = float(row[1])
                    dt = datetime.datetime.fromtimestamp(timestamp) # NB: This handles DST properly.
                    xml_data = '\n'.join(row[2:])
                    fld = minidom.parseString(xml_data).getElementsByTagName('available')[0].firstChild
                    if fld is not None:
                        db_rows.append((station_id, yyyymmdd_from_date(dt), round(ssm_from_datetime(dt)), int(fld.data)))
                except (ValueError, IndexError, ExpatError):
                    pass # Skip malformed records.
        except (IOError, csv.Error):
            pass # Skip days with a missing or unreadable file.
        if len(db_rows) > 1:
            curs.executemany('insert into station_snaps values (?, ?, ?, ?)', db_rows)
            db.commit()
        print d
        d += datetime.timedelta(days=1)
    db.close()

turns the scraped xml files into a nice sqlite3 file and then we can calculate many of our statistics very easily, in principle, using SQL. For example:

    select
        ssm_300, sum(avg_available)
    from
        (select
            station_id, ssm/300 ssm_300, avg(available) avg_available
        from
            station_snaps
        group by
            station_id, ssm_300)
    group by
        ssm_300
    order by ssm_300

Unfortunately however this was just too slow (at least on my laptop) so I had to do things by hand. This is to be expected given that the database cannot use the natural time-ordering of the records which I could in my scripts.
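The sqlite loading script imports three helpers from a `utils` module that is not reproduced in this post. For completeness, here is a guess at what they might look like, written for Python 3. In particular, reading `ssm` as "seconds since midnight" is an inference from how the value is used (rounded to an integer, then divided by 300 to form 5-minute buckets), not something confirmed by the original module.

```python
import datetime

def yyyymmdd_from_date(d):
    """Format a date (or datetime) as a 'yyyymmdd' string."""
    return d.strftime('%Y%m%d')

def date_from_yyyymmdd(s):
    """Parse a 'yyyymmdd' string into a datetime.date."""
    return datetime.datetime.strptime(s, '%Y%m%d').date()

def ssm_from_datetime(dt):
    """Seconds since midnight (assumed meaning of 'ssm') for a datetime."""
    return dt.hour * 3600 + dt.minute * 60 + dt.second + dt.microsecond / 1e6

example_dt = datetime.datetime(2010, 11, 8, 1, 0, 30)
print(yyyymmdd_from_date(example_dt), ssm_from_datetime(example_dt))
```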