The aim of this notebook is to identify 2-3 actionable levers to increase the number of successful bookings.
As the dataset pertains mostly to matching, the focus of this notebook will be on booking rates. Further analysis with access to more data should consider ratings and Customer Lifetime Value optimization (for guests and hosts).
(0) Data importation
(1) Feature engineering
(2) Get a first intuition of what features matter
(3) Visualization of key features
(4) Logistic regression to attribute impact of significant variables, controlling for confounding factors
# data manipulation
import numpy as np
import pandas as pd
# plots
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# machine-assisted exploration
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# statistical analysis and regression model
import statsmodels.discrete.discrete_model as sm
contacts = pd.read_csv('contacts.csv')
contacts.head(2)
contacts.info()
Get a feel for the context of this dataset:
contacts['id_guest_anon'].agg(['nunique', 'count'])
contacts['id_host_anon'].agg(['nunique', 'count'])
contacts['id_listing_anon'].agg(['nunique', 'count'])
contacts['ts_interaction_first'].agg(['min', 'max'])
users = pd.read_csv('users.csv')
users.head(2)
listings = pd.read_csv('listings.csv')
listings.head(2)
print(listings['listing_neighborhood'].nunique())
listings['listing_neighborhood'].value_counts()[:10]
It would be useful to visualize booking volumes and conversion rates by neighborhood. However, a large share of the listings have no neighborhood affiliation.
A map of the 'Bairros' (neighborhoods) is available at http://portalgeo-pcrj.opendata.arcgis.com/datasets/8454eb0454b7424d89c61b67742286a1_15?geometry=-77.293%2C5.937%2C-73.723%2C6.756
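A quick check of how large that unlabeled share actually is (a minimal sketch; it assumes '-unknown-' is the only placeholder used in listings.csv, as the value counts above suggest):
# share of listings whose neighborhood is the '-unknown-' placeholder
unknown_share = (listings['listing_neighborhood'] == '-unknown-').mean()
print('listings without a neighborhood label: {:.1%}'.format(unknown_share))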
Framing questions before diving in:
- KPIs: what do people care about?
- What can be done differently?
- Potential ways to identify inefficiencies
- Potential segments
# --------------- join datasets ---------------
joined = contacts.merge(users, left_on = 'id_guest_anon', right_on = 'id_user_anon', how='left')
joined = joined.merge(listings, on = 'id_listing_anon', how='left')
# checked the join match rate: 100% (re-checked in the sketch below)
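# a minimal sketch of that match check (id_user_anon comes from users, room_type from listings)
print('guest match rate: {:.1%}'.format(joined['id_user_anon'].notna().mean()))
print('listing match rate: {:.1%}'.format(joined['room_type'].notna().mean()))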
# --------------- data type conversion ---------------
joined['ts_interaction_first'] = pd.to_datetime(joined['ts_interaction_first'])
joined['ts_reply_at_first'] = pd.to_datetime(joined['ts_reply_at_first'])
joined['ts_accepted_at_first'] = pd.to_datetime(joined['ts_accepted_at_first'])
joined['ts_booking_at'] = pd.to_datetime(joined['ts_booking_at'])
joined['ds_checkin_first'] = pd.to_datetime(joined['ds_checkin_first'])
joined['ds_checkout_first'] = pd.to_datetime(joined['ds_checkout_first'])
# --------------- feature engineering ---------------
# boolean variables for the conversion funnel: reply, accepted, booking
joined['contact'] = 1  # non-predictive feature that is just convenient for funnel analysis
joined['replied'] = joined['ts_reply_at_first'].notna().astype(int)
joined['accepted'] = joined['ts_accepted_at_first'].notna().astype(int)
joined['booked'] = joined['ts_booking_at'].notna().astype(int)
# response/acceptance/booking speed (seconds)
# note: .dt.total_seconds() rather than .seconds, which wraps around every 24h
joined['delay_firstreply'] = (joined['ts_reply_at_first'] - joined['ts_interaction_first']).dt.total_seconds()
joined['delay_firstaccepted'] = (joined['ts_accepted_at_first'] - joined['ts_interaction_first']).dt.total_seconds()
joined['delay_booked'] = (joined['ts_booking_at'] - joined['ts_interaction_first']).dt.total_seconds()
# deal with NaN: treat a missing reply as a 40h delay (see rationale below)
joined['delay_firstreply'] = joined['delay_firstreply'].fillna(value=3600*40)
joined['log_delay_firstreply'] = np.log(joined['delay_firstreply']+1)
joined['response_within_15min'] = joined['delay_firstreply'].apply(lambda x: 1 if x<=15*60 else 0)
joined['response_between_15_and_45min'] = joined['delay_firstreply'].apply(lambda x: 1 if ((x>15*60)&(x<=45*60)) else 0)
# ------ dealing with NaN for response time and related features
# NaN seems to correspond to an expired response window
# a regression of conversion rate vs response time suggests that a ~40h response time would have a ~0% booking rate
# so for our attribution model, assigning 40h to the NaN response times should do the trick
#joined['delay_firstreply'] = joined['delay_firstreply'].fillna(value=3600*40)
# previous approach:
#no_answer_infinitedelay = 6*30*24*3600 # arbitrary stand-in for "never answered", to be improved
#joined[['delay_firstreply', 'delay_firstaccepted', 'delay_booked']] = joined[['delay_firstreply', 'delay_firstaccepted', 'delay_booked']].fillna(value=no_answer_infinitedelay, axis=1)
# only 35 contacts have an unknown 'guest_user_stage_first' status
# for statistical significance we'll drop that category and initially assign those users to 'new'
joined['guest_user_stage_first'] = joined['guest_user_stage_first'].apply(lambda x: 'new' if x=='-unknown-' else x)
# --> potentially create another variable: response time normalized by enquiry_to_checkin (see the sketch a few lines below)
# how far in advance is enquiry made (days)
joined['enquiry_to_checkin'] = (joined['ds_checkin_first'] - joined['ts_interaction_first']).dt.days
joined['enquiry_sameday_checkin'] = joined['enquiry_to_checkin'].apply(lambda x: 1 if x==0 else 0)
joined['enquiry_nextday_checkin'] = joined['enquiry_to_checkin'].apply(lambda x: 1 if x<=1 else 0)
joined['enquiry_sameweek_checkin'] = joined['enquiry_to_checkin'].apply(lambda x: 1 if x<=7 else 0)
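# Sketch of the normalization idea noted above: response delay as a fraction of the
# lead time to check-in. Kept as a standalone Series (a hypothetical feature, not fed
# to the models below); lead time is clipped at 1 day to avoid dividing by ~zero.
reply_delay_vs_leadtime = joined['delay_firstreply'] / (joined['enquiry_to_checkin'].clip(lower=1) * 24 * 3600)
reply_delay_vs_leadtime.describe()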
# how many nights booked
joined['nights_booked'] = (joined['ds_checkout_first'] - joined['ds_checkin_first']).dt.days
# the random forest suggests that being Brazilian vs not makes a difference
# we'll create a specific feature for this
joined['isbrazilian'] = joined['country'].apply(lambda x: 1 if x=='BR' else 0)
# keep only most frequent nationalities and neighborhoods
country_freq = joined[['id_guest_anon', 'country']].groupby('country').count()
frequent_country = set(country_freq[country_freq['id_guest_anon']>50].index)
joined['frequent_countries'] = joined['country'].apply(lambda x: x if (x in frequent_country) else 'other_country')
neighborhood_freq = joined[['id_guest_anon', 'listing_neighborhood']].groupby('listing_neighborhood').count()
frequent_neighborhood = set(neighborhood_freq[neighborhood_freq['id_guest_anon']>50].index)
joined['frequent_neighborhoods'] = joined['listing_neighborhood'].apply(lambda x: x if (x in frequent_neighborhood) else 'other_neighborhood')
not_unknown_neigh = joined[joined['listing_neighborhood']!='-unknown-']
country_neighborhood_freq = not_unknown_neigh[['id_listing_anon', 'listing_neighborhood', 'country']].groupby(['listing_neighborhood', 'country']).count()
frequent_country_neigh = set(country_neighborhood_freq[country_neighborhood_freq['id_listing_anon']>50].index)
joined['frequent_country_neigh'] = joined[['listing_neighborhood', 'country']].apply(lambda x:
x['listing_neighborhood']+'_'+x['country'] if ((x['listing_neighborhood'], x['country']) in frequent_country_neigh)
else 'other_country_neighborhood', axis=1)
# identify the very top super hosts
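# one possible sketch (the 99th-percentile threshold is an arbitrary assumption):
# flag hosts in the top 1% by number of bookings received
host_bookings = joined.groupby('id_host_anon')['booked'].sum()
top_hosts = set(host_bookings[host_bookings >= host_bookings.quantile(0.99)].index)
is_top_host = joined['id_host_anon'].isin(top_hosts).astype(int)  # standalone, not added to the model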
# -------- transform variable so response is more linear ----------
# a few values were negative... interpreted as errors and replaced with the median
median_reviews = joined['total_reviews'].median()
joined['total_reviews'] = joined['total_reviews'].apply(lambda x: x if x>=0 else median_reviews)
joined['log_reviews'] = np.log(joined['total_reviews']+1)
# a few negative values... interpreted as errors and replaced with the median
median_enquiry2checkin = joined['enquiry_to_checkin'].median()
joined['enquiry_to_checkin'] = joined['enquiry_to_checkin'].apply(lambda x: x if x>=0 else median_enquiry2checkin)
joined['log_enquiry2checkin'] = np.log(joined['enquiry_to_checkin']+1)
# hour of the day, day of the week and month of the year, for both enquiry and checkin
joined['enquiry_h'] = joined['ts_interaction_first'].dt.hour
joined['enquiry_dow'] = joined['ts_interaction_first'].dt.weekday
joined['enquiry_month'] = joined['ts_interaction_first'].dt.month
joined['enquiry_night'] = joined['enquiry_h'].apply(lambda x: 0 if ((x>=7)&(x<=23)) else 1)
joined['enquiry_middleday'] = joined['enquiry_h'].apply(lambda x: 1 if ((x>=9)&(x<=18)) else 0)
joined['enquiry_weekend'] = joined['enquiry_dow'].apply(lambda x: 1 if x>=4 else 0)
joined['enquiry_likely_unavailabletime'] = joined.apply(lambda x:
1 if ((x['enquiry_night']+(1-x['enquiry_weekend'])*x['enquiry_middleday'])>=1) else 0, axis=1)
# all these indicators are relevant to host behavior, as the timestamps are in Rio's local time
# to infer guest behavior from timing, a time-zone correction would be needed (inferring from nationality isn't great --> IP location would be better)
joined['checkin_dow'] = joined['ds_checkin_first'].dt.weekday
joined['checkin_month'] = joined['ds_checkin_first'].dt.month
joined['checkin_weekend'] = joined['checkin_dow'].apply(lambda x: 1 if x>=4 else 0)
# complete or substitute these seasonal numerical variables with:
# --> Fourier transform coefficients, or other methods (ARIMA, Prophet) -- see the sketch below
# --> perhaps some dummy variables where it makes sense (week vs weekend, working hours, sleeping hours etc.)
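# One concrete option for the Fourier idea above: sin/cos encodings of hour and month.
# A minimal sketch, kept in a separate frame so these hypothetical features do not
# silently enter the feature set used below.
seasonal_fourier = pd.DataFrame({
    'enquiry_h_sin': np.sin(2 * np.pi * joined['enquiry_h'] / 24),
    'enquiry_h_cos': np.cos(2 * np.pi * joined['enquiry_h'] / 24),
    'enquiry_month_sin': np.sin(2 * np.pi * joined['enquiry_month'] / 12),
    'enquiry_month_cos': np.cos(2 * np.pi * joined['enquiry_month'] / 12),
})
seasonal_fourier.head(2)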
joined.head()
#joined.columns
# ------------- prepare for exploratory model (random forest) -------------
# drop variables that aren't generalizable
df_model_1 = joined.drop(columns=['id_guest_anon', 'id_host_anon', 'id_listing_anon', 'ts_interaction_first', 'ts_reply_at_first',
                                  'ts_accepted_at_first', 'ts_booking_at', 'ds_checkin_first', 'ds_checkout_first',
                                  'id_user_anon', 'contact'])
# variables that are partial success factors in the conversion funnel:
# replied, accepted, (booked), and the times to reach those stages
# we kept them in the model initially, just to check that the funnel indeed progresses in stages --> they lead to ~100% prediction, so they are dropped here
df_model_1.drop(columns=['replied', 'accepted', 'delay_firstreply', 'delay_firstaccepted',
'delay_booked', 'm_interactions'], inplace=True)
# caution also about log_delay_firstreply:
# keep delay_firstreply only if interested in attributing the impact of response time
# drop variables that do not seem to have high predictive power, or are redundant
# (method: feature importances from a random forest + business judgement)
df_model_1.drop(columns=['country', 'listing_neighborhood', 'enquiry_h', 'enquiry_dow',
'enquiry_to_checkin', 'total_reviews',
'frequent_countries', 'frequent_neighborhoods'], inplace=True)
# convert categorical variables into dummy variables
df_model_1 = pd.get_dummies(df_model_1,
                            columns=['contact_channel_first', 'guest_user_stage_first',
                                     'room_type', 'frequent_country_neigh'],
                            drop_first=True)
# because we're initially using a random forest, there is no need to convert seasonal numerical variables into dummies
# ----- caution: many features --> high risk of overfitting --------
# look indicatively at the rows/features ratio (checked below), then at learning curves etc.
# ------ dropping NaN rows below is fine for a first pass, but a bad idea to keep long-term
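# the indicative rows/features check mentioned above
print('%d rows, %d features (~%.0f rows per feature)'
      % (df_model_1.shape[0], df_model_1.shape[1], df_model_1.shape[0] / df_model_1.shape[1]))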
df_model_1.dropna(inplace=True)
y = df_model_1['booked']
X = df_model_1.drop(columns='booked')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = RandomForestClassifier(n_jobs=-1, random_state=42, n_estimators=1000)
model.fit(X_train, y_train)
from sklearn.model_selection import cross_val_score, StratifiedKFold
scores = cross_val_score(model, X_train, y_train, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1_macro')
print("F1 macro: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Hyperparameter optimization and learning curves if time allows.
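A minimal sketch of what that tuning step could look like (the grid values are illustrative assumptions; the fit is left commented out because it is slow):
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [200, 500],
              'max_depth': [5, 10, None],
              'min_samples_leaf': [1, 5]}
grid = GridSearchCV(RandomForestClassifier(n_jobs=-1, random_state=42),
                    param_grid, cv=3, scoring='f1_macro')
#grid.fit(X_train, y_train)
#print(grid.best_params_, grid.best_score_)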
importances = model.feature_importances_
std = np.std([tree.feature_importances_ for tree in model.estimators_],
axis=0)
indices = np.argsort(importances)[::-1]
# Print the feature ranking and contribution
print("Feature ranking:")
for f in range(X.shape[1]):
    print("%d. feature %d %s (%f)" % (f + 1, indices[f], X.columns.tolist()[indices[f]], importances[indices[f]]))
# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
#plt.xlim([-1, X.shape[1]])
plt.show()
y = df_model_1['booked']
#X = df_model_1[['contact_channel_first_instant_book', 'm_interactions', 'contact_channel_first_contact_me', 'total_reviews', 'm_first_message_length_in_characters']]
X = df_model_1.drop(columns='booked')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
scores = cross_val_score(model, X_train, y_train, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1_macro')
print("F1 macro: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
import graphviz
from sklearn import tree
# inline rendering, suitable for small trees
dot_data = tree.export_graphviz(model, out_file=None,
feature_names=X.columns.tolist(),
class_names=["No booking","Booked"],
filled=True, rounded=False, precision=2, label='root', impurity=False,
special_characters=True, max_depth=4)
graph = graphviz.Source(dot_data)
graph
#graph.format = 'png'
#graph.render("tree")
funnel_contactchannel = joined.groupby('contact_channel_first')[['contact', 'replied', 'accepted', 'booked']].sum()
funnel_contactchannel['replied%'] = funnel_contactchannel['replied'] / funnel_contactchannel['contact']
funnel_contactchannel['accepted%'] = funnel_contactchannel['accepted'] / funnel_contactchannel['contact']
funnel_contactchannel['booked%'] = funnel_contactchannel['booked'] / funnel_contactchannel['contact']
funnel_contactchannel
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
funnel_contactchannel[['contact','replied', 'accepted', 'booked']].plot(kind='bar', ax=ax)
plt.title('Matching funnel by contact channel (Jan-Jun 2016)')
plt.tight_layout()
plt.xlabel('Initial contact channel')
plt.ylabel('Enquiries between Jan & Jun 2016')
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
funnel_contactchannel[['replied%', 'accepted%', 'booked%']].plot(kind='bar', ax=ax)
plt.title('conversion funnel by contact channel')
print('instant book')
print(joined[joined['contact_channel_first'] == 'instant_book'][['delay_firstreply', 'delay_firstaccepted', 'delay_booked']].agg(['mean', 'median']))
print()
print('contact_me or book_it')
print(joined[joined['contact_channel_first'] != 'instant_book'][['delay_firstreply', 'delay_firstaccepted', 'delay_booked']].agg(['mean', 'median']))
not_instantbooked = joined[joined['contact_channel_first'] != 'instant_book'].copy()
book_it = joined[joined['contact_channel_first'] == 'book_it'].copy()
with sns.plotting_context('notebook', font_scale=1.5):
    fig, ax = plt.subplots()
    fig.set_size_inches(16, 7)
    bins = np.arange(0, 48, 0.5)
    g = sns.distplot(book_it[book_it['booked']==1]['delay_firstreply']/3600, label='booked', ax=ax, kde=False, bins=bins)
    g = sns.distplot(book_it[book_it['booked']==0]['delay_firstreply']/3600, label='not booked', ax=ax, kde=False, bins=bins)
    g.set_xlim(left=0, right=23)
    g.legend()
    plt.title("Host response time delay, and impact on booking rates (Jan-Jun 2016, 'book it' medium)")
    plt.xlabel('Delay of first reply (h)')
    plt.ylabel('Number of enquiries')
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
g = sns.distplot(np.log(book_it[book_it['booked']==1]['delay_firstreply']+1), label='booked', ax=ax, kde=False)
g = sns.distplot(np.log(book_it[book_it['booked']==0]['delay_firstreply']+1), label='not booked', ax=ax, kde=False)
g.set_xlim(left=0, right=12)
g.legend()
plt.xlabel('Delay of first reply (log of response time in seconds)')
plt.ylabel('Number of enquiries')
firstreply_h_bin = np.concatenate(([0, 0.25, 0.5], np.arange(1,24,1)), axis=None)
book_it['firstreply_bins'] = pd.cut(book_it['delay_firstreply']/3600, bins=firstreply_h_bin, labels=False)
bookit_binned = book_it.groupby('firstreply_bins')[['booked']].agg(['count', 'sum'])
bookit_binned.columns = bookit_binned.columns.droplevel(0)
#bookit_binned.reset_index(inplace=True)
bookit_binned['booking_rate'] = bookit_binned['sum'] / bookit_binned['count']
bookit_binned_aug = pd.concat([bookit_binned, pd.DataFrame({'firstreply_h_bin':firstreply_h_bin[0:-1]})], axis=1)
bookit_binned_aug.set_index(keys='firstreply_h_bin', inplace=True)
bookit_binned_aug
with sns.plotting_context('notebook', font_scale=1.5):
    fig, ax = plt.subplots()
    fig.set_size_inches(14, 7)
    g = sns.regplot(x="firstreply_h_bin", y="booking_rate", data=bookit_binned_aug.reset_index(), ax=ax)
    #g.set_axis_labels("Response time (h)", "Average booking rate")
    g.set(ylim=(0, 0.8))
    plt.xlabel('First response delay (h)')
    plt.title("Impact of host response delay on booking rate (Jan-Jun 2016, 'book it' medium)")
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
g = sns.regplot(x="firstreply_h_bin", y="booking_rate", data=bookit_binned_aug.reset_index(), ax=ax)
g.set(xlim=(0, 48), ylim=(0, 1))
The linear regression seems to cross y=0 at ~40h.
--> we'll use that value to fill the NaNs when attributing the impact of response time on booking rate.
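A rough numerical check of that zero-crossing (a simple least-squares fit on the binned rates above):
fit_data = bookit_binned_aug.reset_index().dropna()
slope, intercept = np.polyfit(fit_data['firstreply_h_bin'], fit_data['booking_rate'], 1)
print('fitted booking rate reaches 0 at ~%.0f h' % (-intercept / slope))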
Transform the variable so the response is more linear:
firstreply_log_bin = range(0,13)
book_it['firstreply_bins'] = pd.cut(np.log(book_it['delay_firstreply']+1), bins=firstreply_log_bin, labels=False)
bookit_binned = book_it.groupby('firstreply_bins')[['booked']].agg(['count', 'sum'])
bookit_binned.columns = bookit_binned.columns.droplevel(0)
#bookit_binned.reset_index(inplace=True)
bookit_binned['booking_rate'] = bookit_binned['sum'] / bookit_binned['count']
firstreply_bin_h = (np.exp(firstreply_log_bin)/3600)[0:-1]
bookit_binned_aug = pd.concat([pd.DataFrame({'firstreply_bin_h':firstreply_bin_h}),bookit_binned], axis=1)
bookit_binned_aug.set_index('firstreply_bin_h', inplace=True)
bookit_binned_aug
fig, ax = plt.subplots()
ax.set(xscale="log")
ax.set(xlim=(30/3600, 10))
fig.set_size_inches(14, 6)
ax.plot(bookit_binned_aug[['booking_rate']])
plt.xlabel('enquiry response time (h)')
plt.title('Response time influence on booking rates')
plt.show()
with sns.plotting_context('notebook', font_scale=1.5):
    g = sns.FacetGrid(not_instantbooked[not_instantbooked['guest_user_stage_first']!='-unknown-'],
                      row='guest_user_stage_first', col='contact_channel_first', hue='booked', height=5, aspect=2)
    g.map(sns.distplot, 'log_delay_firstreply', kde=False)
    #g.set(xlim=(0, 300), ylim=(0, 2500))
    g.set_axis_labels("log(response time in s)", "Number of enquiries")
    g.add_legend()
New guests seem to have booking rates similar to returning guests on the 'book_it' channel. However, they perform noticeably worse than past bookers on the 'contact_me' channel.
joined['room_type'].value_counts()
funnel_roomtype = joined.groupby('room_type')[['contact', 'replied', 'accepted', 'booked']].sum()
funnel_roomtype['replied%'] = funnel_roomtype['replied'] / funnel_roomtype['contact']
funnel_roomtype['accepted%'] = funnel_roomtype['accepted'] / funnel_roomtype['contact']
funnel_roomtype['booked%'] = funnel_roomtype['booked'] / funnel_roomtype['contact']
funnel_roomtype
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
funnel_roomtype[['contact', 'replied', 'accepted', 'booked']].plot(kind='bar', ax=ax)
plt.title('Matching funnel by room type (counts)')
plt.tight_layout()
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
funnel_roomtype[['replied%', 'accepted%', 'booked%']].plot(kind='bar', ax=ax)
plt.title('conversion funnel by room type')
plt.ylim(0,1)
Booking rates do not differ strongly by room type, and room types probably cater to different needs and availabilities, so this does not seem to be a choice worth nudging.
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
sns.distplot(joined[joined['booked']==1]['m_interactions'], ax=ax, bins=range(0,22), kde=False, label='booked')
sns.distplot(joined[joined['booked']==0]['m_interactions'], ax=ax, bins=range(0,22), kde=False, label='not booked')
plt.legend()
ax.set_xlim(left=0, right=20)
m_interactions is not quite a predictive variable: it probably counts all interactions related to the booking, not just those prior to the booking decision, so it leaks outcome information.
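A quick way to quantify that suspicion (comparing interaction counts for booked vs. not-booked enquiries):
joined.groupby('booked')['m_interactions'].agg(['mean', 'median'])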
# exclude auto booking
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
sns.distplot(not_instantbooked[not_instantbooked['booked']==1]['m_first_message_length_in_characters'], ax=ax, bins=100, kde=False, label='booked')
sns.distplot(not_instantbooked[not_instantbooked['booked']==0]['m_first_message_length_in_characters'], ax=ax, bins=100, kde=False, label='not booked')
plt.legend()
plt.title("'book_it' or 'contact_me' channels")
ax.set_xlim(left=0, right=500)
# 'book_it' channel only
book_it = joined[joined['contact_channel_first']=='book_it'].copy()
# exclude auto booking
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
sns.distplot(book_it[book_it['booked']==1]['m_first_message_length_in_characters'], ax=ax, bins=100, kde=False, label='booked')
sns.distplot(book_it[book_it['booked']==0]['m_first_message_length_in_characters'], ax=ax, bins=100, kde=False, label='not booked')
plt.legend()
plt.title('book it medium')
ax.set_xlim(left=0, right=500)
country_conv = joined.groupby('country')[['contact', 'booked']].sum()
country_conv['booking_rate'] = country_conv['booked'] / country_conv['contact']
sns.distplot(country_conv['booking_rate'], kde=False, bins=5)
sns.distplot(country_conv[country_conv['contact']>=50]['booking_rate'], kde=False, bins=5)
country_conv[country_conv['contact']>=50].sort_values(by='booking_rate', ascending=False).head()
neighborhood_conv = joined.groupby('listing_neighborhood')[['contact', 'booked']].sum()
neighborhood_conv['booking_rate'] = neighborhood_conv['booked'] / neighborhood_conv['contact']
sns.distplot(neighborhood_conv[neighborhood_conv['contact']>=50]['booking_rate'], kde=False, bins=5)
neighborhood_conv[neighborhood_conv['contact']>=50].sort_values(by='booking_rate', ascending=False).head()
--> perhaps some nationalities have better booking rates in some areas than others, so the recommendation engine could try to match those nationalities to those areas?
How should we structure this?
country_neighborhood = joined.groupby(['country', 'listing_neighborhood'])[['contact', 'booked']].sum().reset_index()
country_neighborhood['booking_rate%'] = country_neighborhood['booked'] / country_neighborhood['contact']
country_neighborhood.sort_values(by='booking_rate%', ascending=False).head()
country_neighborhood_filtered = country_neighborhood[country_neighborhood['contact']>20]
# keep only countries and neighborhoods that appear in at least 2 of the filtered pairs
countries_count = country_neighborhood_filtered.groupby(['country'])[['listing_neighborhood']].count()
common_countries = countries_count[countries_count['listing_neighborhood']>=2].reset_index().drop(columns='listing_neighborhood')
neighborhood_count = country_neighborhood_filtered.groupby(['listing_neighborhood'])[['country']].count()
common_neighborhood = neighborhood_count[neighborhood_count['country']>=2].reset_index().drop(columns='country')
country_neighborhood_filtered2 = country_neighborhood_filtered.merge(common_countries, on='country', how='inner')
country_neighborhood_filtered2 = country_neighborhood_filtered2.merge(common_neighborhood, on='listing_neighborhood', how='inner')
country_neighborhood_pivot = country_neighborhood_filtered2.pivot(index='country', columns='listing_neighborhood', values='booking_rate%')
# heatmap
with sns.plotting_context('notebook', font_scale=1.3):
    plt.figure(figsize=(18,8))
    sns.heatmap(country_neighborhood_pivot.dropna(how='all'), cmap="Greens")
    plt.title('Booking rates by guest country and listing neighborhood (where enough data is available)')
    plt.tight_layout()
    plt.xlabel('listing neighborhood')
    plt.show()
Several problems with the data: the country x neighborhood matrix is sparse, many cells rest on only a handful of contacts, and the correlations below are computed on few overlapping pairs, so they should be read with caution.
plt.figure(figsize=(18,10))
sns.heatmap(country_neighborhood_pivot.corr().dropna(how='all', axis=0).dropna(how='all', axis=1), cmap="RdYlGn")
plt.title('Correlation between listing neighborhoods, in terms of conversion rates across guest countries')
plt.show()
plt.figure(figsize=(18,10))
sns.clustermap(country_neighborhood_pivot.transpose().corr().dropna(how='all', axis=0).dropna(how='all', axis=1).dropna(), cmap='RdYlGn')
plt.title('Correlation between guest nationalities, in terms of conversion rates across neighborhoods')
plt.show()
Cluster countries, neighborhoods and (country, neighborhood) pairs if time allows.
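A minimal sketch of that clustering idea (assumptions: missing rates imputed with the overall mean, and k=4 chosen arbitrarily):
from sklearn.cluster import KMeans
profile = country_neighborhood_pivot.fillna(country_neighborhood_pivot.stack().mean())
km = KMeans(n_clusters=4, random_state=42, n_init=10).fit(profile)
pd.Series(km.labels_, index=profile.index).sort_values()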
with sns.plotting_context('notebook', font_scale=1.5):
    g = sns.FacetGrid(joined, row='contact_channel_first', hue='booked', height=3.5, aspect=4)
    g.map(sns.distplot, 'nights_booked', kde=False, bins=range(0,15,1))
    g.set(xlim=(0, 14), ylim=(0, 2500))
    g.set_axis_labels("Nights booked", "Number of enquiries")
    g.add_legend()
There seems to be an optimal booking rate around 3-night stays --> not obvious how to act on that.
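A quick way to quantify that (booking rate by stay length):
nights_conv = joined.groupby('nights_booked')[['contact', 'booked']].sum()
nights_conv['booking_rate'] = nights_conv['booked'] / nights_conv['contact']
nights_conv.head(10)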
with sns.plotting_context('notebook', font_scale=1.5):
    g = sns.FacetGrid(joined, row='contact_channel_first', hue='booked', height=3.5, aspect=4)
    g.map(sns.distplot, 'm_guests', kde=False, bins=range(1,11,1))
    g.set(xlim=(1, 10), ylim=(0, 2500))
    g.set_axis_labels("Number of guests", "Number of enquiries")
    g.add_legend()
sns.barplot(y='total_reviews', x='booked', data=joined)
with sns.plotting_context('notebook', font_scale=1.5):
    g = sns.FacetGrid(joined, row='contact_channel_first', hue='booked', height=3.5, aspect=4)
    g.map(sns.distplot, 'total_reviews', kde=False)
    g.set(xlim=(0, 100), ylim=(0, 2500))
    g.set_axis_labels("Total reviews to date on listing", "Number of enquiries")
    g.add_legend()
reviews_bins = np.concatenate(([0, 1, 2, 3], np.arange(5,44,5)), axis=None)
book_it['reviews_bins'] = pd.cut(book_it['total_reviews'], bins=reviews_bins, labels=False)
bookit_binned = book_it.groupby('reviews_bins')[['booked']].agg(['count', 'sum'])
bookit_binned.columns = bookit_binned.columns.droplevel(0)
#bookit_binned.reset_index(inplace=True)
bookit_binned['booking_rate'] = bookit_binned['sum'] / bookit_binned['count']
bookit_binned_aug = pd.concat([bookit_binned, pd.DataFrame({'reviews_bins':reviews_bins[0:-1]})], axis=1)
bookit_binned_aug.set_index(keys='reviews_bins', inplace=True)
bookit_binned_aug
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
g = sns.lineplot(x="reviews_bins", y="booking_rate", data=bookit_binned_aug.reset_index(), ax=ax)
#g.set_axis_labels("response time (s)", "Conversion rate")
g.set(ylim=(0, 1))
log_reviews_bins = np.log(np.concatenate(([1, 2, 3], np.arange(5,44,5)), axis=None))
book_it['log_reviews_bins'] = pd.cut(book_it['log_reviews'], bins=log_reviews_bins, labels=False)
bookit_binned = book_it.groupby('log_reviews_bins')[['booked']].agg(['count', 'sum'])
bookit_binned.columns = bookit_binned.columns.droplevel(0)
#bookit_binned.reset_index(inplace=True)
bookit_binned['booking_rate'] = bookit_binned['sum'] / bookit_binned['count']
bookit_binned_aug = pd.concat([bookit_binned, pd.DataFrame({'log_reviews_bins':log_reviews_bins[0:-1]})], axis=1)
bookit_binned_aug.set_index(keys='log_reviews_bins', inplace=True)
bookit_binned_aug
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
g = sns.regplot(x="log_reviews_bins", y="booking_rate", data=bookit_binned_aug.reset_index(), ax=ax)
#g.set_axis_labels("response time (s)", "Conversion rate")
g.set(ylim=(0, 1))
with sns.plotting_context('notebook', font_scale=1.5):
    g = sns.FacetGrid(joined, row='contact_channel_first', hue='booked', height=3.5, aspect=4)
    g.map(sns.distplot, 'enquiry_to_checkin', kde=False)
    g.set(xlim=(0, 300), ylim=(0, 2500))
    g.set_axis_labels("Days between enquiry and prospective check-in", "Number of enquiries")
    g.add_legend()
log_enquiry2checkin_bins = np.log(np.concatenate(([1, 2, 3, 4, 7, 14], np.arange(21,200,20)), axis=None))
book_it['log_enquiry2checkin_bins'] = pd.cut(book_it['log_enquiry2checkin'], bins=log_enquiry2checkin_bins, labels=False)
bookit_binned = book_it.groupby('log_enquiry2checkin_bins')[['booked']].agg(['count', 'sum'])
bookit_binned.columns = bookit_binned.columns.droplevel(0)
#bookit_binned.reset_index(inplace=True)
bookit_binned['booking_rate'] = bookit_binned['sum'] / bookit_binned['count']
bookit_binned_aug = pd.concat([bookit_binned, pd.DataFrame({'log_enquiry2checkin_bins':log_enquiry2checkin_bins[0:-1]})], axis=1)
bookit_binned_aug.set_index(keys='log_enquiry2checkin_bins', inplace=True)
bookit_binned_aug
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
g = sns.regplot(x="log_enquiry2checkin_bins", y="booking_rate", data=bookit_binned_aug.reset_index(), ax=ax)
#g.set_axis_labels("response time (s)", "Conversion rate")
g.set(ylim=(0, 1))
With hindsight, I suspect this dataset may just be a random sample of the real data, which would understate repeat activity per guest and host.
joined.groupby('id_listing_anon')[['id_guest_anon']].count().describe([0.5, 0.75, 0.9, 0.95, 0.99, 0.999])
--> there are probably fixed costs to set up a host (taking pictures etc.). After how many bookings does Airbnb recoup those costs? (This will likely vary with listing price and number of nights, if revenue is a percentage of the booking price.)
joined.groupby('id_host_anon')[['id_listing_anon']].count().describe([0.5, 0.75, 0.9, 0.95, 0.99, 0.999])
joined.groupby('id_guest_anon')[['id_host_anon']].count().describe([0.5, 0.75, 0.9, 0.95, 0.99, 0.999])
joined.groupby('id_guest_anon')[['booked']].agg(['count', 'sum']).describe([0.5, 0.75, 0.9, 0.95, 0.98])
There isn't really enough data here to answer this question.
Regression 1
instant_book converts at ~100% and messaging ('contact_me') carries a big penalty, so we treat channels separately and model the 'book_it' channel only:
booked (binary) ~ intercept + log(first reply delay) + isBrazilian + log(reviews) + log(enquiry-to-checkin) + same-day check-in + next-day check-in + first message length
book_it2 = joined[joined['contact_channel_first'] == 'book_it'].copy()
book_it2['intercept'] = 1
X = book_it2[[
'isbrazilian',
#'frequent_country_neigh',
#'frequent_countries', 'frequent_neighborhoods',
#'guest_user_stage_first', # new users seem to struggle mainly on the contact_me channel, not book_it
'log_delay_firstreply', 'log_reviews', 'log_enquiry2checkin',
'enquiry_sameday_checkin', 'enquiry_nextday_checkin',
'm_first_message_length_in_characters',
'intercept'
]]
#X = pd.get_dummies(X, columns=[
#'guest_user_stage_first'
# 'frequent_country_neigh'
# 'frequent_countries', 'frequent_neighborhoods'
# ],drop_first=True)
y = book_it2['booked']
logit = sm.Logit(y, X)
logit_fitted = logit.fit()
print(logit_fitted.summary())
print()
number_rows = len(y)
print(logit_fitted.pred_table(threshold=0.5))
print(logit_fitted.pred_table(threshold=0.5)/number_rows)
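To make the fitted coefficients easier to interpret, they can be converted to odds ratios by exponentiating the params of the fitted logit:
# odds ratio associated with a one-unit change in each regressor
print(np.exp(logit_fitted.params))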