Data set This dataset contains all bike hire trips in Montreal, Canada, from 2014 onwards. Each trip records the pick-up and drop-off stations, the associated timestamps, and the trip duration. We also have access to the stations' GPS coordinates. Due to the harsh winters, no bike trips occur during the winter months.
Note that user-level information is not available. Likewise, none of the external demand drivers or influencing factors are present, and the instructions are to stick to this data.
https://www.bixi.com/en/open-data#trip-history
Challenge description This exercise is about forecasting the trips starting from Metro Mont Royal (code 6184) to Berri station (code 6015) between September 4th and September 10th, 2017. The data authorised for this is all the past data up to the 31st of August 2017, so this is a forecast made up to 10 days ahead of time.
Potential value Forecasting bike rentals allows:
The number of bike rides is driven by thousands of factors that lead humans to make one decision over another. Because no causal driver outside the network is included in the dataset, our options for forecasting the future are to pick up repeatable signals and to contain the noise.
Picking up signal
Reducing noise
Given the limited time available to me, I have only been able to explore some options, some of them very coarsely. In the next section I outline what I did and what I would do next.
Below are daily rides between the 4th and 10th of September 2017 from station 6184 to 6015.
Rides / day: 2.6, 3.4, 3.3, 3.4, 3.4, 3, 3.1
The reality is obviously more noisy. An overview of hourly trends in the network is available in the notebook.
A structuring hypothesis concerns the meaning of absent data. Given the context in which the data is collected, I have assumed that an absence of data means an absence of bike rides. In this sparse environment, where null values need to be replaced by zeros, I see two approaches:
To identify time-based trends in the data:
To identify network and entity-based trends:
I explored the dataset in two stages:
(a) a quick model that harvests the low-hanging fruit and gives a feel for the data:
(b) a rougher attempt to capture more network effects and aggregate signal over entity clusters
Given more time
In this exploration, I focused on predicting the number of rides between stations at the day level. I judged that trip duration could be estimated simply from historical values (e.g. the average or median over the previous week). Given the limited number of rides and the limited time, I did not break the forecast down by member / non-member. Over the whole network that split would likely matter for revenue predictions (the pricing probably differs); at the station level, it could be used to inform marketing campaign decisions.
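To illustrate that choice, here is a minimal sketch of the naive duration estimate, assuming the rides dataframe (with start_station_code, end_station_code, start_date and duration_sec columns) that is loaded later in the notebook:
# naive trip-duration estimate for the 6184 -> 6015 link:
# median duration over the last week of available data (2017-08-25 to 2017-08-31)
last_week_mask = ((rides['start_station_code'] == 6184) &
                  (rides['end_station_code'] == 6015) &
                  (rides['start_date'] >= pd.to_datetime('2017-08-25')))
duration_estimate_sec = rides.loc[last_week_mask, 'duration_sec'].median()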
Given the opportunity to get more data
Get to the root causal factors as much as possible:
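For instance, if a daily weather feed could be obtained, joining it onto the per-day rides table built later would be straightforward. A hypothetical sketch — the file name 'weather_montreal_daily.csv' and its columns temp_c and precip_mm are assumptions, not part of this dataset:
# hypothetical: merge an external daily weather file onto the per-day rides table
weather = pd.read_csv('weather_montreal_daily.csv', parse_dates=['date'])  # assumed columns: date, temp_c, precip_mm
rides_perday = rides_perday.merge(weather[['date', 'temp_c', 'precip_mm']], on='date', how='left')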
# data manipulation
import numpy as np
import pandas as pd
import datetime as dt
from datetime import date, timedelta
# to unzip and manipulate files
import zipfile
import glob
# times series prediction
from fbprophet import Prophet
from fbprophet.diagnostics import cross_validation
from fbprophet.diagnostics import performance_metrics
# to group stations with similarities (here similar demand)
from sklearn.cluster import KMeans
# plots
import matplotlib.pyplot as plt
import seaborn as sns
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
from plotly.graph_objs import *
import cufflinks as cf
init_notebook_mode()
%matplotlib inline
unzip files
notebook_directory = !pwd
notebook_directory = notebook_directory[0]
!mkdir unzip
zipped_files_list = !ls zip
for file in zipped_files_list:
    complete_path = notebook_directory + '/zip/' + file
    zip_ref = zipfile.ZipFile(complete_path, 'r')
    zip_ref.extractall(path='unzip')
    zip_ref.close()
concatenate files into 2 dataframes: stations and rides
# stations
generator_stations = (pd.read_csv(f) for f in glob.iglob('unzip/**/Stations*.csv', recursive=True))
stations = pd.concat(generator_stations, ignore_index=True)
# most stations are the same across the years, but not all: look at this in detail if time allows
stations = stations.drop_duplicates()
stations.info()
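To check the comment above about stations differing across years, a quick sketch (my addition) that lists station codes appearing with more than one set of coordinates:
# station codes that appear with more than one (latitude, longitude) pair across years
coords_per_code = stations.groupby('code')[['latitude', 'longitude']].nunique()
moving_codes = coords_per_code[(coords_per_code > 1).any(axis=1)]
moving_codes.head()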
# rides
generator_rides = (pd.read_csv(f) for f in glob.iglob('unzip/**/OD_*.csv', recursive=True))
rides = pd.concat(generator_rides, ignore_index=True)
rides['start_date'] = pd.to_datetime(rides['start_date'])
rides['end_date'] = pd.to_datetime(rides['end_date'])
# as instructed, limit dataset to dates up to the 31st of August
rides = rides.loc[rides['end_date'] < pd.to_datetime('2017-09-01')]
rides.info()
The absence of rides on some days is taken at face value, not treated as missing data.
In this sparse environment, where null values need to be replaced by zeros, I see two approaches:
Given the time constraints, I have defaulted to the former option (a cartesian product of dates and station pairs, implemented below).
# approach: cartesian product of dates, start station and end station
# let us generate all dates
d1 = date(2014,4,15)
d2 = date(2017,8,31)
all_dates = pd.DataFrame({'date':pd.to_datetime([d1 + timedelta(days=x) for x in range((d2-d1).days + 1)])})
# keep only most connected pairs
link_importance = rides.groupby(['start_station_code', 'end_station_code'])[['duration_sec']].count()
link_importance.rename(columns={'duration_sec':'rides_perlink_count'}, inplace=True)
link_importance_top1pc = link_importance.loc[link_importance['rides_perlink_count']>=700]
link_importance_top1pc.reset_index(inplace=True)
all_dates['key'] = 0
link_importance_top1pc['key'] = 0
rides_completedates = all_dates.merge(link_importance_top1pc, how='left', on='key')
rides_completedates.drop('key', axis=1, inplace=True)
all_dates.drop('key', axis=1, inplace=True)
rides['start_date_daytrunc'] = rides['start_date'].dt.floor('d')
rides_perday_activeonly = (rides.groupby(['start_station_code', 'end_station_code', 'start_date_daytrunc'])['duration_sec']
                           .agg(rides_perday='count', median_duration_sec_perday='median')
                           .reset_index())
rides_perday = rides_completedates.merge(rides_perday_activeonly, left_on = ['start_station_code', 'end_station_code', 'date'] ,
right_on = ['start_station_code', 'end_station_code', 'start_date_daytrunc'], how='left')
rides_perday['rides_perday'].fillna(value=0, inplace=True)
rides_perday.head()
# from station 6184 to station 6015
selected_stations = stations[stations['code'].isin([6184, 6015])]
plt.figure(figsize=(15,7))
plt.scatter(x=stations["longitude"], y=stations["latitude"], alpha=0.4)
plt.scatter(x=selected_stations["longitude"], y=selected_stations["latitude"], c='red')
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.title('station geolocation')
plt.show()
rides_start = rides.groupby(['start_station_code'])[['start_date']].count().reset_index()
rides_start_geo = rides_start.merge(stations, left_on='start_station_code', right_on='code', how='inner')
plt.figure(figsize=(15,7))
plt.scatter(x=rides_start_geo['longitude'], y=rides_start_geo['latitude'],
s=rides_start_geo['start_date']/1500, c=rides_start_geo['start_date'], cmap='hot')
plt.colorbar()
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.title('Number of rides departing from stations shown on the map')
plt.show()
rides_6184_to_6015 = rides_perday.loc[(rides_perday['start_station_code']==6184) & (rides_perday['end_station_code']==6015)]
rides_6184_to_6015.head()
# at first sight it is surprising that the median ride time per day varies this much, but the median is computed over very few rides...
# it may also reflect that some bikes are never returned or are used to tour the city, as well as different physical conditions, weather conditions, etc.
sns.distplot(rides_6184_to_6015.dropna()['median_duration_sec_perday'])
rides_6184_to_6015_prophet = pd.DataFrame({'ds':rides_6184_to_6015['date'].values,
'y':rides_6184_to_6015['rides_perday'].values})
m = Prophet()
m.fit(rides_6184_to_6015_prophet)
future = m.make_future_dataframe(periods=10)
forecast = m.predict(future)
fig1 = m.plot(forecast)
fig2 = m.plot_components(forecast)
df_cv = cross_validation(m, initial='900 days', horizon = '10 days', period='5 days')
df_p = performance_metrics(df_cv, rolling_window=1, metrics=['rmse', 'mse'])
df_p
requested_forecast = forecast[forecast['ds'].between(pd.to_datetime('2017-09-04'), pd.to_datetime('2017-09-10'))].set_index('ds')[['yhat', 'yhat_lower', 'yhat_upper']]
requested_forecast
Explore interactively the variations between predictions and actuals
combined_6184_to_6015 = pd.merge(rides_6184_to_6015_prophet, forecast, on='ds')
combined_6184_to_6015.head()
iplot(combined_6184_to_6015.set_index('ds')[['y', 'yhat', 'yhat_lower', 'yhat_upper']].iplot(asFigure=True,
kind='scatter',xTitle='ds',yTitle='y',title='comparison between historical data and forecast', color=['blue', 'grey', 'black', 'black'], dash = ['solid', 'dashdot' , 'dash', 'dash']))
Exceptionally, here we won't force 0 values where data is missing (very little is missing when aggregated over the entire network, and Prophet handles the gaps reasonably well). Just keep in mind that in the following trends, night rides and winter rides should be expected to be close to 0.
def winter_holidays(year):
    d1 = date(year, 12, 1)
    d2 = date(year + 1, 3, 31)
    # this will give a list containing all of the dates
    return pd.to_datetime([d1 + timedelta(days=x) for x in range((d2 - d1).days + 1)])
df_winter = pd.DataFrame({'ds':np.concatenate([winter_holidays(year) for year in np.arange(2014,2017)])})
df_winter['y'] = 0
df_winter.head()
rides['start_date_hourtrunc'] = rides['start_date'].dt.floor('h')
rides_binned = rides.groupby('start_date_hourtrunc')[['start_date']].count()
rides_prophet = pd.concat([ pd.DataFrame({'ds':rides_binned.index, 'y':rides_binned['start_date'].values}),
df_winter], ignore_index=True)
holidays = pd.DataFrame({
'holiday': 'winter',
'ds': df_winter['ds'].values
})
m = Prophet(holidays=holidays)
m.fit(rides_prophet)
future = m.make_future_dataframe(periods=10*24, freq='H')
forecast = m.predict(future)
fig1 = m.plot(forecast)
fig2 = m.plot_components(forecast)
scaling = (rides_6184_to_6015_prophet.loc[rides_6184_to_6015_prophet['ds'] >= '2017-08-01', 'y'].sum()
           / rides_prophet.loc[rides_prophet['ds'] >= '2017-08-01', 'y'].sum())
scaling
hourly_forecast = forecast[forecast['ds'].between(pd.to_datetime('2017-09-04 00:00:00'), pd.to_datetime('2017-09-10 23:59:59'))].set_index('ds')[['yhat']] * scaling
plt.figure(figsize=(17,6))
plt.plot(hourly_forecast)
plt.xlabel('date'); plt.ylabel('rides per hour'); plt.title('hourly forecast of 6184 -> 6015 trip if it was a simple scale-down of the entire network')
plt.show()
let's group station trips that have correlated demand
There are many ways to measure proximity between stations, e.g. spatial or temporal distance, number of rides on the link, or demand correlation. Here I chose demand correlation somewhat arbitrarily (it makes sense and fits the time limits, though more elaborate clusterings could be learned).
We could use clustering techniques that shift or compress timelines (such as fastdtw) if we thought there was such an information delay in the network, but given the limited time I chose to keep things simple.
Below, the number of clusters was chosen fairly arbitrarily, based on the number of items and the clusters I could visually see on the correlation matrices. Given more time, I would look at cluster coherence and use something like the elbow method (see the sketch after the clustering code below). With even more time, the number of clusters could be tuned as a hyperparameter to maximise the model's predictions.
# ----------- work on start and end station pairs ('link') -----------
rides_perday_link_pivot = rides_perday.pivot_table(values='rides_perday', index=['start_station_code', 'end_station_code'], columns='date')
plt.figure(figsize=(12,8))
sns.heatmap(rides_perday_link_pivot.transpose().corr())
kmeans_link = KMeans(n_clusters=10, random_state=0).fit_predict(rides_perday_link_pivot)
df_link_clusters = rides_perday_link_pivot.reset_index()[['start_station_code','end_station_code']].assign(link_cluster = kmeans_link)
df_link_clusters.head()
# ----------- work on start stations -----------
rides_perday_startstation = rides_perday.groupby(['start_station_code', 'date'])[['rides_perday']].sum()
rides_perday_startstation.reset_index(inplace=True)
rides_perday_startstation_pivot = rides_perday_startstation.pivot_table(values='rides_perday', index='start_station_code', columns='date')
plt.figure(figsize=(12,8))
sns.heatmap(rides_perday_startstation_pivot.transpose().corr())
kmeans_startstation = KMeans(n_clusters=10, random_state=0).fit_predict(rides_perday_startstation_pivot)
df_startstation_clusters = rides_perday_startstation_pivot.reset_index()[['start_station_code']].assign(startstation_cluster = kmeans_startstation)
df_startstation_clusters.head()
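As mentioned above, an elbow plot would be a quick way to sanity-check the choice of ten clusters. A minimal sketch on the start-station pivot (not part of the original run):
# elbow sketch: within-cluster sum of squares (inertia) for a range of k
inertias = []
k_values = range(2, 21)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=0).fit(rides_perday_startstation_pivot)
    inertias.append(km.inertia_)
plt.figure(figsize=(8, 4))
plt.plot(list(k_values), inertias, marker='o')
plt.xlabel('number of clusters k'); plt.ylabel('inertia'); plt.title('elbow plot for start-station clusters')
plt.show()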
def date_based_features(rides_perday):
    # ----- date based features -----
    rides_perday['year'] = rides_perday['date'].apply(lambda x: x.year)
    rides_perday['month'] = rides_perday['date'].apply(lambda x: x.month)
    rides_perday['week'] = rides_perday['date'].apply(lambda x: x.week)
    rides_perday['day_of_month'] = rides_perday['date'].apply(lambda x: x.day)
    rides_perday['day_of_week'] = rides_perday['date'].apply(lambda x: x.weekday())
    rides_perday['days_since_beginningdataset'] = rides_perday['date'].apply(lambda x: (x - pd.to_datetime('2014-04-15')).days)
    # average and median rides over the week for each link
    rides_perday_weekmean = (rides_perday.groupby(['week', 'start_station_code', 'end_station_code'])['rides_perday']
                             .agg(rides_perday_weekmean='mean', rides_perday_weekmedian='median')
                             .reset_index())
    rides_perday = rides_perday.merge(rides_perday_weekmean, on=['week', 'start_station_code', 'end_station_code'])
    return rides_perday
def rides_relativeto_total_startandend_stations_traffic(rides_perday):
    # ---------- relative number of rides to total start and end station traffic ----------
    # start_station rides each week + day ride as a fraction of that
    start_station_importance_perweek = (rides_perday.groupby(['start_station_code', 'week'])['rides_perday']
                                        .agg(meanrides_startstation_thisweek='mean', medianrides_startstation_thisweek='median'))
    start_station_importance_perweek.reset_index(inplace=True)
    rides_perday = rides_perday.merge(start_station_importance_perweek, on=['start_station_code', 'week'], how='inner')
    rides_perday['rides_ratio_startstation'] = rides_perday['rides_perday'] / (rides_perday['meanrides_startstation_thisweek'] + 0.01)
    # end_station rides each week + day ride as a fraction of that
    end_station_importance_perweek = (rides_perday.groupby(['end_station_code', 'week'])['rides_perday']
                                      .agg(meanrides_endstation_thisweek='mean', medianrides_endstation_thisweek='median'))
    end_station_importance_perweek.reset_index(inplace=True)
    rides_perday = rides_perday.merge(end_station_importance_perweek, on=['end_station_code', 'week'], how='inner')
    rides_perday['rides_ratio_endstation'] = rides_perday['rides_perday'] / (rides_perday['meanrides_endstation_thisweek'] + 0.01)
    # link rides each week + day ride as a fraction of that
    link_importance_perweek = (rides_perday.groupby(['start_station_code', 'end_station_code', 'week'])['rides_perday']
                               .agg(meanrides_link_thisweek='mean', medianrides_link_thisweek='median'))
    link_importance_perweek.reset_index(inplace=True)
    rides_perday = rides_perday.merge(link_importance_perweek, on=['start_station_code', 'end_station_code', 'week'], how='inner')
    rides_perday['rides_ratio_link'] = rides_perday['rides_perday'] / (rides_perday['meanrides_link_thisweek'] + 0.01)
    rides_perday['rides_ratio_linkmedian'] = rides_perday['rides_perday'] / (rides_perday['medianrides_link_thisweek'] + 0.01)
    return rides_perday
def clusters_features(rides_perday):
    # ---------- clusters ----------
    # attach clusters
    rides_perday = rides_perday.merge(df_link_clusters, on=['start_station_code', 'end_station_code'], how='inner')
    rides_perday = rides_perday.merge(df_startstation_clusters, on=['start_station_code'], how='inner')
    # ----- calculate rides statistics linked to each cluster
    # link cluster: daily predictions on the cluster, and on the specific link by taking a ratio over a week
    rides_perday_linkcluster = (rides_perday.groupby(['date', 'link_cluster'])['rides_perday']
                                .agg(rides_perday_linkcluster='mean').reset_index())
    rides_perday = rides_perday.merge(rides_perday_linkcluster, on=['date', 'link_cluster'], how='inner')
    rides_linkcluster_thisweek = (rides_perday.groupby(['week', 'link_cluster'])['rides_perday_linkcluster']
                                  .agg(rides_perday_linkcluster_weeklymean='mean', rides_perday_linkcluster_weeklymedian='median')
                                  .reset_index())
    rides_perday = rides_perday.merge(rides_linkcluster_thisweek, on=['week', 'link_cluster'])
    rides_perday['rides_linkcluster_ratio_weekly'] = rides_perday['rides_perday_weekmean'] / (rides_perday['rides_perday_linkcluster_weeklymean'] + 0.01)
    rides_perday['rides_perday_linkcluster_adj'] = rides_perday['rides_linkcluster_ratio_weekly'] * rides_perday['rides_perday_linkcluster']
    # start station cluster: daily predictions on the cluster, and on the specific station by taking a ratio over a week
    rides_perday_startstationcluster = (rides_perday.groupby(['date', 'startstation_cluster'])['rides_perday']
                                        .agg(rides_perday_startstationcluster='mean', rides_perday_startstationcluster_median='median')
                                        .reset_index())
    rides_perday = rides_perday.merge(rides_perday_startstationcluster, on=['date', 'startstation_cluster'], how='inner')
    rides_startcluster_thisweek = (rides_perday.groupby(['week', 'startstation_cluster'])['rides_perday_startstationcluster']
                                   .agg(rides_perday_startcluster_weeklymean='mean', rides_perday_startcluster_weeklymedian='median')
                                   .reset_index())
    rides_perday = rides_perday.merge(rides_startcluster_thisweek, on=['week', 'startstation_cluster'])
    rides_perday['rides_startcluster_ratio_weekly'] = rides_perday['rides_perday_weekmean'] / (rides_perday['rides_perday_startcluster_weeklymean'] + 0.01)
    rides_perday['rides_perday_startcluster_adj'] = rides_perday['rides_startcluster_ratio_weekly'] * rides_perday['rides_perday_startstationcluster']
    return rides_perday
def lags(rides_perday):
    # ----- order dataframe as we'll be using lag functions later -----
    rides_perday.sort_values(by=['start_station_code', 'end_station_code', 'date'], ascending=True, inplace=True)
    # ---------- pass past data ----------
    # use lag to get latest ride number
    rides_perday['rides_10daysago'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_perday'].shift(10)
    rides_perday['rides_11daysago'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_perday'].shift(11)
    rides_perday['rides_12daysago'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_perday'].shift(12)
    rides_perday['rides_13daysago'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_perday'].shift(13)
    rides_perday['rides_14daysago'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_perday'].shift(14)
    rides_perday['rides_15daysago'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_perday'].shift(15)
    # use lag to get latest time duration between stations, and relative number of rides
    # perhaps the median duration should be aggregated over a week instead of a single day
    rides_perday['median_duration_10daysago'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['median_duration_sec_perday'].shift(10)
    rides_perday['medianrides_startstation_week14daysago'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['medianrides_startstation_thisweek'].shift(14)
    rides_perday['medianrides_endstation_week14daysago'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['medianrides_endstation_thisweek'].shift(14)
    rides_perday['medianrides_link_week14daysago'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['medianrides_link_thisweek'].shift(14)
    rides_perday['rides_ratio_startstation_14daysago'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_ratio_startstation'].shift(14)
    rides_perday['rides_ratio_endstation_14daysago'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_ratio_endstation'].shift(14)
    rides_perday['rides_ratio_link_14daysago'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_ratio_link'].shift(14)
    rides_perday['rides_ratio_linkmedian_14daysago'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_ratio_linkmedian'].shift(14)
    # calculate change for our latest data
    rides_perday['rides_10daysago_diff1'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_10daysago'].diff(periods=1)
    rides_perday['rides_10daysago_diff7'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_10daysago'].diff(periods=7)
    rides_perday['rides_11daysago_diff1'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_11daysago'].diff(periods=1)
    rides_perday['rides_11daysago_diff7'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_11daysago'].diff(periods=7)
    # --------------- cluster data ---------------
    # cluster rides per day
    rides_perday['rides_link_cluster_14daysago'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_perday_linkcluster'].shift(14)
    rides_perday['rides_start_cluster_14daysago'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_perday_startstationcluster'].shift(14)
    # smoothed prediction of station data, from cluster
    rides_perday['rides_perday_linkcluster_adj_14daysago'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_perday_linkcluster_adj'].shift(14)
    rides_perday['rides_perday_startcluster_adj_14daysago'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_perday_startcluster_adj'].shift(14)
    # change on cluster predictions
    rides_perday['rides_perday_linkcluster_adj_14daysago_diff1'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_perday_linkcluster_adj_14daysago'].diff(periods=1)
    rides_perday['rides_perday_linkcluster_adj_14daysago_diff7'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_perday_linkcluster_adj_14daysago'].diff(periods=7)
    rides_perday['rides_perday_startcluster_adj_14daysago_diff1'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_perday_startcluster_adj_14daysago'].diff(periods=1)
    rides_perday['rides_perday_startcluster_adj_14daysago_diff7'] = rides_perday.groupby(['start_station_code', 'end_station_code'])['rides_perday_startcluster_adj_14daysago'].diff(periods=7)
    return rides_perday
# --------------- TO DO: go through my features, and check:
# - that I haven't forgotten to pass major ones through lag
# - that I haven't included too many features that are colinear
def feature_engineering(rides_perday):
    '''
    basic requirements for each row of this table:
    - date
    - start_station_code
    - end_station_code
    there should also be past data (at least 20 days, but far more in practice)
    '''
    return lags(clusters_features(rides_relativeto_total_startandend_stations_traffic(date_based_features(rides_perday))))
# to allow feature engineering, the dates to forecast need to be processed together with the historical values
# GENERATE DATES TO FORECAST HERE and add them to the dataframe at the end (a sketch follows below)
rides_perday = feature_engineering(rides_perday)
rides_perday.head(20)
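The TODO above is not implemented here. A hypothetical sketch of what it could look like for the target link — building the 4-10 September rows with rides_perday left empty, to be concatenated with the history before calling feature_engineering (the NaN target rows would then have to be set aside at training time rather than dropped):
# hypothetical sketch: rows to forecast for the 6184 -> 6015 link
forecast_rows = pd.DataFrame({'date': pd.date_range('2017-09-04', '2017-09-10', freq='D'),
                              'start_station_code': 6184,
                              'end_station_code': 6015,
                              'rides_perday': np.nan})
# rides_perday = pd.concat([rides_perday, forecast_rows], ignore_index=True, sort=False)
# rides_perday = feature_engineering(rides_perday)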
rides_perday_dropped = rides_perday.copy()
# TO DO: all these variables need tidying up
# drop variables that are probably redundant
rides_perday_dropped.drop(['day_of_month'
], axis=1, inplace=True)
# drop variables that won't be accessible for predictions
rides_perday_dropped.drop(['date', 'start_date_daytrunc', 'rides_perlink_count', 'median_duration_sec_perday',
'rides_perday_weekmean', 'rides_perday_weekmedian', 'meanrides_startstation_thisweek',
'medianrides_startstation_thisweek', 'rides_ratio_startstation', 'meanrides_endstation_thisweek',
'medianrides_endstation_thisweek', 'rides_ratio_endstation', 'meanrides_link_thisweek',
'medianrides_link_thisweek', 'rides_ratio_link', 'rides_ratio_linkmedian',
'rides_perday_linkcluster', 'rides_perday_linkcluster_weeklymean', 'rides_perday_linkcluster_weeklymedian',
'rides_linkcluster_ratio_weekly', 'rides_perday_linkcluster_adj', 'rides_perday_startstationcluster',
'rides_perday_startstationcluster_median', 'rides_perday_startcluster_weeklymean',
'rides_perday_startcluster_weeklymedian', 'rides_startcluster_ratio_weekly', 'rides_perday_startcluster_adj'
], axis=1, inplace=True)
# transform a few categorical variables into dummies
# to avoid overfitting, we'll limit it to the minimum
rides_perday_dropped = pd.get_dummies(rides_perday_dropped,
columns=['month', 'day_of_week', 'link_cluster', 'startstation_cluster'
# ,'start_station_code', 'end_station_code'
], drop_first=True)
# dummies for the start and end station codes would add a lot of columns. By providing data over the previous few weeks instead,
# the model should still pick up some of each link's specificities while generalising better
rides_perday_dropped = rides_perday_dropped.replace([np.inf, -np.inf], np.nan)
rides_perday_dropped.dropna(inplace=True, how='any')
print(pd.to_datetime('2014-04-15') + timedelta(days=1115))
print(rides_perday_dropped['days_since_beginningdataset'].max())
print(rides_perday_dropped['days_since_beginningdataset'].min())
# RMSLE: RMSE applied to the log of (1 + target) and (1 + prediction).
# It works as an approximation of the percentage error between our predictions and the target, which is a nice way to understand the errors the model makes.
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import mean_squared_error
def rmsle(ytrue, ypred):
    return np.sqrt(mean_squared_log_error(ytrue, ypred))

def rmse(ytrue, ypred):
    return np.sqrt(mean_squared_error(ytrue, ypred))
# the forecast is 4-10 days ahead for this exercise.
# We'll give the model all the data up to a certain point, and predict the next 10 days.
# To test our approach we'll run this kind of prediction multiple times, and average the results.
# days 1200 to 1234 will be reserved for the final test
rides_perday_test = rides_perday_dropped.loc[rides_perday_dropped['days_since_beginningdataset']>1200]
rides_perday_crossval = rides_perday_dropped.loc[rides_perday_dropped['days_since_beginningdataset']<=1200]
# we're left with days 0 to 1200 for training and cross-validation
# we'll dedicate at least the first 1115 days to training (a big part of which is winter closure)
# days 1115 to 1200 will be used for cross-validation
# to limit computation, the validation is repeated number_test times, each time predicting the 10 days up to the cutoff
def timeseries_crossvalidation(model, number_test=5, end_purelearning=1115, end_crossvalidation_set=1200):
    period = (end_crossvalidation_set - end_purelearning) // number_test
    mean_error_rmsle = []
    mean_error_rmse = []
    for cutoff_day in range(end_purelearning, end_crossvalidation_set, period):
        #print(cutoff_day)
        #print(cutoff_day - 9)
        train = rides_perday_crossval[rides_perday_crossval['days_since_beginningdataset'] < cutoff_day - 9]
        val = rides_perday_crossval[rides_perday_crossval['days_since_beginningdataset'].between(cutoff_day - 9, cutoff_day)]
        xtr, xts = train.drop('rides_perday', axis=1), val.drop('rides_perday', axis=1)
        ytr, yts = train['rides_perday'].values, val['rides_perday'].values
        model.fit(xtr, ytr)
        p = model.predict(xts)
        # clip negative predictions to 0 before taking the log error
        error_rmsle = rmsle(yts, np.clip(p, 0, None))
        error_rmse = rmse(yts, p)
        print('Cutoff day %d - RMSLE %.5f, RMSE %.5f' % (cutoff_day, error_rmsle, error_rmse))
        mean_error_rmsle.append(error_rmsle)
        mean_error_rmse.append(error_rmse)
    print('Mean error over cross-validation set - RMSLE %.5f, RMSE %.5f' % (np.mean(mean_error_rmsle), np.mean(mean_error_rmse)))
# rough ratio of rows to features, as a sanity check on model complexity
rides_perday_dropped.shape[0] / rides_perday_dropped.shape[1]
# baseline
# a simple naive prediction is to reuse the past data point from the same day of the week, e.g.
# p = val['rides_14daysago'].values
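A quick evaluation of that baseline on the held-out test rows, reusing the error helpers above (my addition):
# naive baseline on the final test set: the ride count observed 14 days earlier (same day of the week)
baseline_true = rides_perday_test['rides_perday'].values
baseline_pred = rides_perday_test['rides_14daysago'].values
print('Baseline - RMSLE %.5f, RMSE %.5f' % (rmsle(baseline_true, np.clip(baseline_pred, 0, None)),
                                            rmse(baseline_true, baseline_pred)))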
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.neighbors import KNeighborsRegressor
random_state = 2
models = [
LinearRegression(n_jobs=-1)
, RandomForestRegressor(n_jobs=-1, random_state=random_state, n_estimators=20)
# , AdaBoostRegressor(random_state=random_state)
# , GradientBoostingRegressor(random_state=random_state)
# , MLPRegressor(random_state=random_state)
# , SGDRegressor(random_state=random_state)
# , BaggingRegressor(n_jobs=-1, random_state=random_state)
, KNeighborsRegressor(n_jobs=-1)
]
model_names = [
"LinearRegression"
, "RandomForestRegressor"
# , "AdaBoostRegressor"
# , "GradientBoostingRegressor"
# , "MLPRegressor"
# , 'SGDRegressor'
# , 'BaggingRegressor'
, 'KNeighborsRegressor'
]
i = 0
for model in models:
    print(model_names[i])
    timeseries_crossvalidation(model, number_test=3)
    print('')
    i = i + 1
importances = models[1].feature_importances_
std = np.std([tree.feature_importances_ for tree in models[1].estimators_],
axis=0)
indices = np.argsort(importances)[::-1]
# Print the feature ranking and contribution
print("Feature ranking:")
for f in range(rides_perday_crossval.drop('rides_perday', axis=1).shape[1]):
    print("%d. feature %d %s (%f)" % (f + 1, indices[f], rides_perday_crossval.drop('rides_perday', axis=1).columns.tolist()[indices[f]], importances[indices[f]]))
# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(rides_perday_crossval.drop('rides_perday', axis=1).shape[1]), importances[indices],
color="r", yerr=std[indices], align="center")
plt.xticks(range(rides_perday_crossval.drop('rides_perday', axis=1).shape[1]), indices)
plt.xlim([-1, rides_perday_crossval.drop('rides_perday', axis=1).shape[1]])
plt.show()
For a random forest, the feature importance is the average reduction in impurity over all trees when a particular feature is used at a split point, normalised so that the importances sum to 1.
As many of our features are related, the choice of which of the correlated features gets used first at a split is somewhat arbitrary, and so is the importance attributed to each of them.
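One way to sanity-check the ranking (not done above, and requiring scikit-learn >= 0.22) is permutation importance on held-out rows; it avoids the impurity bias, although correlated features still share credit:
# permutation importance of the fitted random forest on the held-out test rows (sketch)
from sklearn.inspection import permutation_importance
x_test = rides_perday_test.drop('rides_perday', axis=1)
y_test = rides_perday_test['rides_perday'].values
perm = permutation_importance(models[1], x_test, y_test, n_repeats=5, random_state=0)
pd.Series(perm.importances_mean, index=x_test.columns).sort_values(ascending=False).head(15)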
#LinearRegression_coefficients = pd.DataFrame({"Feature":rides_perday_crossval.drop('rides_perday', axis=1).columns,"Coefficients":np.transpose(models[0].coef_)})
#LinearRegression_coefficients.sort_values(by='Coefficients', ascending=False)
GridSearchCV to tune hyperparameters (a minimal sketch below)
diagnose under- or overfitting
aggregate predictions from very different models
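A minimal sketch of the first point, tuning the random forest with GridSearchCV and a time-ordered split (TimeSeriesSplit is a simplification of the custom cross-validation used above; the parameter grid is illustrative):
# hyperparameter search sketch with a time-ordered split
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
ordered = rides_perday_crossval.sort_values('days_since_beginningdataset')
x_cv = ordered.drop('rides_perday', axis=1)
y_cv = ordered['rides_perday'].values
param_grid = {'n_estimators': [20, 50, 100], 'max_depth': [None, 10, 20]}
grid_search = GridSearchCV(RandomForestRegressor(n_jobs=-1, random_state=random_state),
                           param_grid,
                           cv=TimeSeriesSplit(n_splits=3),
                           scoring='neg_mean_squared_error')
grid_search.fit(x_cv, y_cv)
print(grid_search.best_params_, grid_search.best_score_)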