Airmet - The best helmet is the one that's with you
Airmet is a London-based company that manufactures and resells an inflatable helmet of the same name, designed specifically for regular users of cycle and other two-wheel hire schemes. The Airmet uses cutting-edge technology to provide best-in-class protection to the rider and, once deflated, folds down to the size of a credit card so it fits in your wallet and you always have it with you.
Brief
The Marketing Director at Airmet wants to launch a new marketing campaign in order to raise awareness and boost acquisition.
To target a relevant audience, his idea is to take over some cycle scheme docking stations and turn them into ads.
The Director has found a free dataset online detailing historical traffic for all the cycle scheme docking stations in London. As Airmet's go-to data guru, you are asked to come up with a recommendation that maximises the campaign's chances of success.
Facts & Constraints
Deliverables
You are expected to present your findings and recommendation, including:
You will also be asked to share the code used to analyse the data.
Assets
You can find the dataset here and details on how to get started with public-data in BigQuery here. While the dataset sits in BigQuery, you are free to use any tool and language of your choice. You are also welcome to use any other relevant public dataset available out there.
Please do reach out if you have any questions.
A few structuring questions about the study that I would ask if I were actually designing this campaign.
(1) Breadth vs repetition
Why is this information impactful? (i) If there is no intensity game, and if most customers at a station are fresh every day, or if they need a high number of ad exposures to convert, the strategy could be to keep only a few stations for a long time. (ii) On the other hand, if intensity is required, or if customers stick to the station they use and do not need a high number of impressions to convert (or never convert), we should probably run a short campaign on multiple stations at the same time.
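To make the trade-off concrete, here is a minimal toy model (all numbers are made-up placeholders, not estimates) comparing a few long-running stations against many short-lived ones under two assumed exposure-response curves: one where a single view is almost as good as many, and one where conversion only kicks in after repeated exposure.
import numpy as np
# toy model: each station is assumed to have a fixed loyal audience that sees the ad once per week it runs
def expected_conversions(weeks_per_station, n_stations, weekly_audience, conv_by_exposures):
    return n_stations * weekly_audience * conv_by_exposures(weeks_per_station)

budget_station_weeks = 12      # hypothetical budget: 12 station-weeks in total
weekly_audience = 2000         # hypothetical loyal audience per station

one_view_enough  = lambda n: 0.003 * (1 - 0.5 ** n)            # diminishing returns to repetition
needs_repetition = lambda n: 0.003 / (1 + np.exp(-(n - 4)))    # S-curve: conversion kicks in around 4 exposures

for n_stations in (2, 6, 12):
    weeks = budget_station_weeks / n_stations
    print(f'{n_stations} stations x {weeks:.0f} weeks: '
          f'{expected_conversions(weeks, n_stations, weekly_audience, one_view_enough):.1f} conversions if one view is enough, '
          f'{expected_conversions(weeks, n_stations, weekly_audience, needs_repetition):.1f} if repetition is needed')
Under the diminishing-returns curve, spreading the budget over many stations wins; under the S-curve, concentrating on a few stations for longer wins. This is exactly why the answer to this question shapes the recommendation.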
(2) Conversion and profitability
Why is this information impactful? As the campaign is parallelised across more stations, the number of people reached at each station over the life of the campaign goes down. Gross profit and conversion rate will help assess how many trips (or users) per station per week are needed for the campaign to be profitable at a given station.
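A rough sketch of that break-even calculation, using the £60 gross profit per helmet from the brief and the cost and funnel rates assumed later in the analysis (the rates are assumptions, not measured values):
def breakeven_weekly_visits(cost_per_week, gross_profit_per_helmet,
                            view_to_website_rate, website_conversion_rate):
    # each visit to the station is treated as one ad view
    gross_profit_per_view = gross_profit_per_helmet * view_to_website_rate * website_conversion_rate
    return cost_per_week / gross_profit_per_view

print(breakeven_weekly_visits(cost_per_week=200,            # assumed weekly cost of a station takeover
                              gross_profit_per_helmet=60,   # from the brief
                              view_to_website_rate=0.005,   # assumed: a third of the online view-to-click rate
                              website_conversion_rate=0.05))  # assumed website conversion rate
# ~13,300 visits per week, i.e. roughly 1,900 per day, before any frequency lift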
(3) Timing
(1.a) Previous performance marketing campaigns have shown that higher frequency drives higher purchase intent. This is due to Airmet being a new innovative product and the audience needing time to understand the offering and digest the message.
(1.b) The dataset is indeed incomplete on that level but it’s all Airmet has managed to get access to. An assumption is expected to be made here.
(2.a) Previous performance marketing campaigns and website analytics have shown the following:
(2.b) The campaign will be tracked through a unique code for each station, incentivised by 20% off the next month of Cycle Scheme subscription (a deal was struck with TfL, who are keen to use this as a pilot to understand whether advertising on stations could become an interesting new source of revenue for them).
(2.c) Gross profit per helmet is £60
(3.a) The dataset is indeed incomplete. For the purpose of this case, it is fine to assume that traffic and seasonality have remained the same since the last available data.
(3.b) Even though 6 months is a very long time for an early-stage startup, the Marketing Director is keen to understand which period of the year would be best to run the campaign.
Data structure
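The analysis relies on two tables from the bigquery-public-data.london_bicycles public dataset: cycle_stations, with one row per docking station (id, name, latitude and longitude are the fields used below), and cycle_hire, with one row per rental (rental_id, bike_id, start_date, start_station_id, end_station_id and duration).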
Modelling
Given the assumed gross margin per ad view (based on the conversion rates and margins above), only a few stations are expected to have enough traffic (particularly commuter traffic) to be profitable.
With the current assumptions, the overall strategy is to run on only 2 stations (the other stations are not estimated to be profitable):
If the conversion rates or gross margin improve, the next stations on the list are:
The suggested approach is to:
# data manipulation
import numpy as np
import pandas as pd
from datetime import date, timedelta
from math import isnan
# get the data from Google Big Query
from google.cloud import bigquery
from pandas.io import gbq
# times series prediction
from fbprophet import Prophet
# visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sql = '''
SELECT *
FROM `bigquery-public-data.london_bicycles.cycle_stations`
'''
df_stations = gbq.read_gbq(sql, project_id="airmet-2018-11-17", dialect='standard')
df_stations.head()
# the data is pretty big, so for this first-pass analysis we limit ourselves to 2015 onwards
# we will also aggregate data at the day level
#
sql = '''
SELECT
TIMESTAMP_TRUNC(start_date, DAY, 'UTC') AS ride_date,
start_station_id,
end_station_id,
COUNT(DISTINCT rental_id) AS rides,
COUNT(DISTINCT bike_id) AS distinct_bike_id,
AVG(duration) AS avg_ride_duration
FROM `bigquery-public-data.london_bicycles.cycle_hire`
WHERE start_date >= '2015-01-01'
GROUP BY start_station_id, end_station_id, ride_date
ORDER BY start_station_id, end_station_id, ride_date
'''
df_rides = gbq.read_gbq(sql, project_id="airmet-2018-11-17", dialect='standard')
df_rides.head()
# save data locally (cost and time control)
df_stations.to_csv('stations.csv')
df_rides.to_csv('rides.csv')
# convert to datetime
df_rides['ride_date'] = pd.to_datetime(df_rides['ride_date'])
df_rides.head()
print(df_rides[['ride_date']].min())
print(df_rides[['ride_date']].max())
# ----------- missing dates - probably not many at the agglomerated level -----------
d1 = date(2015,1,4)
d2 = date(2017,6,13)
all_dates = pd.DataFrame({'date':pd.to_datetime([d1 + timedelta(days=x) for x in range((d2-d1).days + 1)])})
rides_perday_agglomerated = df_rides.groupby('ride_date')[['rides']].sum().reset_index()
rides_perday_agglomerated_complete = all_dates.merge(rides_perday_agglomerated, left_on='date', right_on='ride_date',
how='left')
#If a date is missing, we interpret it as no rides that day (e.g. closure)
rides_perday_agglomerated_complete['rides'].fillna(value=0, inplace=True)
# ----------- fit the data with Prophet -----------
# Prophet: an additive regression model that nicely combines a trend (piecewise linear or logistic growth), a yearly seasonal component (Fourier series) and a weekly seasonal component (dummy variables)
rides_perday_agglomerated_prophet = pd.DataFrame({'ds':rides_perday_agglomerated_complete['date'].values,
'y':rides_perday_agglomerated_complete['rides'].values})
m = Prophet()
m.fit(rides_perday_agglomerated_prophet)
future = m.make_future_dataframe(periods=180)
forecast = m.predict(future)
fig1 = m.plot(forecast)
fig2 = m.plot_components(forecast)
Overall, the network seems dominated by commuters.
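A quick way to double-check this directly from the daily series built above (a sketch):
# compare mean daily rides on weekdays vs weekends across the whole network
rides_by_dow = rides_perday_agglomerated_complete.copy()
rides_by_dow['is_weekend'] = rides_by_dow['date'].dt.weekday >= 5
print(rides_by_dow.groupby('is_weekend')['rides'].mean())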
# approach: cartesian product of the dates and stations considered
# to avoid using unnecessary memory, we perform the join only once the dataset has been reduced for our specific purpose
# let us generate all dates
d1 = date(2016,6,20)
d2 = date(2016,8,20)
summer_2016_dates = pd.DataFrame({'date':pd.to_datetime([d1 + timedelta(days=x) for x in range((d2-d1).days + 1)])})
summer_2016_dates.head()
# restrict to peak season for speed
summer_rides = df_rides.loc[df_rides['ride_date'].between(pd.to_datetime('2016-06-20'), pd.to_datetime('2016-08-20'))]
# get number of visits at a station by adding rides coming from and arriving at that station
startstation_rides = summer_rides.groupby(['start_station_id', 'ride_date'])[['rides']].sum().reset_index()
endstation_rides = summer_rides.groupby(['end_station_id', 'ride_date'])[['rides']].sum().reset_index()
station_visit_incomplete = startstation_rides.merge(endstation_rides, left_on=['start_station_id', 'ride_date'],
right_on=['end_station_id', 'ride_date'], how='outer')
# missing values on departure or arrival are interpreted as no rides in that direction that day
station_visit_incomplete.update(station_visit_incomplete[['rides_x', 'rides_y']].fillna(value=0))
station_visit_incomplete['station_id'] = station_visit_incomplete.apply(lambda row: next(x for x in [row['start_station_id'], row['end_station_id']] if not isnan(x)), axis=1)
station_visit_incomplete['visits'] = station_visit_incomplete['rides_x'] + station_visit_incomplete['rides_y']
# cartesian product of stations and dates (if data was bigger we could restrict to top stations, but fine here)
active_stations = station_visit_incomplete[['station_id']].drop_duplicates()
summer_2016_dates['key'] = 0
active_stations['key'] = 0
cartesian_product = summer_2016_dates.merge(active_stations, how='left', on='key')
cartesian_product.drop('key', axis=1, inplace=True)
summer_2016_dates.drop('key', axis=1, inplace=True)
# merge rides and dates
station_visits = cartesian_product.merge(station_visit_incomplete[['station_id', 'ride_date', 'visits']],
left_on=['station_id', 'date'], right_on=['station_id', 'ride_date'],
how='left')
station_visits['visits'].fillna(value=0, inplace=True)
station_visits.drop('ride_date', axis=1, inplace=True)
station_visits.head()
mean_visits_perday = station_visits.groupby('station_id')[['visits']].mean()
mean_visits_perday.sort_values(by='visits', ascending=False).head(10)
plt.figure(figsize=(10,5))
sns.distplot(mean_visits_perday)
plt.xlabel('mean visits per day per station, at peak season'); plt.ylabel('stations probability density function');
plt.show()
# visualization on a map of where each station is located and its mean number of visitors per day
startstation_augmented = mean_visits_perday.merge(df_stations, how='inner', left_on='station_id', right_on='id')
plt.figure(figsize=(15,7))
plt.scatter(x=startstation_augmented['longitude'], y=startstation_augmented['latitude'],
s=startstation_augmented['visits']/3, c=startstation_augmented['visits'])
plt.colorbar()
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.title("Number of cyclists' visits to stations shown on the map")
plt.show()
installation_cost_perstation = 100
# in the absence of better estimates, we assume that online behaviours are mostly replicated in the real world
# let's assume that street advertising creates an extra barrier to purchase, as users need to navigate to the website
view_to_website = 0.015 / 3
website_conversion_rate = 0.05 # we assume the online advertising was equally or less targeted than this campaign
gross_profit_perhelmet = 60
cost_per_week = 200
gross_profit_perview = gross_profit_perhelmet * website_conversion_rate * view_to_website
dailyviews_forbreakeven = (cost_per_week / 7) / gross_profit_perview
print('Estimated gross profit per view GBP', gross_profit_perview)
print('Estimated number of daily views per station to breakeven', dailyviews_forbreakeven)
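Because the two funnel rates above are the most uncertain inputs, a quick sensitivity sweep (a sketch reusing the variables defined above) shows how the break-even threshold moves if they are halved or doubled:
# sensitivity of the daily break-even threshold to the assumed funnel rates
for vtw in (view_to_website / 2, view_to_website, view_to_website * 2):
    for conv in (website_conversion_rate / 2, website_conversion_rate, website_conversion_rate * 2):
        threshold = (cost_per_week / 7) / (gross_profit_perhelmet * conv * vtw)
        print(f'view_to_website={vtw:.4f}, conversion={conv:.3f} -> {threshold:,.0f} daily views to break even')
This gives a quick sense of how sensitive the station shortlist is to the funnel assumptions.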
# if we focus on the top 10 stations
print('mean summer visits for top 10 stations :', mean_visits_perday.sort_values(by='visits', ascending=False).head(10).mean().values[0])
print('mean summer visits for top 2 stations :', mean_visits_perday.sort_values(by='visits', ascending=False).head(2).mean().values[0])
print('mean summer visits for top 1 stations :', mean_visits_perday.sort_values(by='visits', ascending=False).head(1).mean().values[0])
station_visits['day_of_week'] = station_visits['date'].apply(lambda x: x.weekday())
station_visits['week'] = station_visits['date'].apply(lambda x: x.week)
station_visits['weekend_flag'] = station_visits['day_of_week'].apply(lambda x: 1 if x>=5 else 0)
station_visits.sort_values(by=['station_id', 'date']).head(7)
def weekend_usage_proportion(x):
    calculated_weights = station_visits.loc[x.index, "visits"]
    # if a station wasn't used during a given week, we define its weekend usage proportion to be 0 (i.e. weekday only)
    # this is arbitrary, but has no impact given how the value is used downstream
    if calculated_weights.sum() == 0:
        return 0
    else:
        return np.average(x, weights=calculated_weights)
station_visits_week = station_visits.groupby(['station_id', 'week']).agg({'weekend_flag':weekend_usage_proportion, 'visits':sum})
station_visits_week.reset_index(inplace=True)
station_visits_week.rename(columns={'weekend_flag':'weekend_usage_proportion'}, inplace=True)
station_visits_week = station_visits_week.merge(df_stations[['id', 'name']], left_on='station_id', right_on='id',
how='inner')
station_visits_week.drop(columns='id', inplace=True)
station_visits_week.head()
# estimated from Nielsen data on consumer goods and electronics
# https://www.nielsen.com/au/en/insights/news/2017/how-frequency-of-exposure-can-maximise-the-resonance-of-your-digital-campaigns.html
lift_for10views_perperson = (20 / 5)
lift_for10views_perperson
def estimated_grossmargin_independantweeks(row, lift_for10views_perperson_param=lift_for10views_perperson,
                                           gross_profit_perview_param=gross_profit_perview,
                                           cost_per_week_param=cost_per_week):
    '''
    For this first model, all customers are assumed to be new each week.
    Weekend customers are assumed to be one-off visitors, while weekday customers are assumed to visit ~10 times a week
    (based on the assumed commuter vs leisure usage patterns).
    Repeat viewership is modelled to generate a substantial lift.
    '''
    return row['visits'] * gross_profit_perview_param * (1 * row['weekend_usage_proportion'] +
                                                         lift_for10views_perperson_param * (1 - row['weekend_usage_proportion'])) - cost_per_week_param
station_visits_week['estimated_grossmargin_independantweeks'] = station_visits_week.apply(estimated_grossmargin_independantweeks, axis=1)
station_visits_week.sort_values(by='estimated_grossmargin_independantweeks', ascending=False).head(30)
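As a sanity check of this model: under these assumptions, a station with 10,000 visits in a week, 80% of them on weekdays, yields roughly 10,000 × £0.015 × (0.2 × 1 + 0.8 × 4) − £200 ≈ £310 of weekly gross margin, whereas a station with 2,000 visits and the same mix loses about £98; the weekly break-even for an 80%-weekday station therefore sits around 3,900 visits, before installation costs.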
station_visits_week_model1profitable = station_visits_week[station_visits_week['estimated_grossmargin_independantweeks']>0]
station_visits_week_model1profitable = station_visits_week_model1profitable.groupby(['station_id', 'name']).agg(
    number_week=('estimated_grossmargin_independantweeks', 'count'),
    grossmargin_perstation=('estimated_grossmargin_independantweeks', 'sum'),
    mean_weekly_visits=('visits', 'mean'),
    first_week=('week', 'min')).reset_index()
station_visits_week_model1profitable['grossmargin_perstation_withinstallationcost'] = station_visits_week_model1profitable['grossmargin_perstation'] - installation_cost_perstation
station_visits_week_model1profitable['ROI'] = station_visits_week_model1profitable['grossmargin_perstation_withinstallationcost'] / installation_cost_perstation
station_visits_week_model1profitable.sort_values(by='grossmargin_perstation_withinstallationcost', ascending=False)