Improving matching performance in a two-sided market

The aim of this notebook is to identify 2-3 actionable levers to increase the number of successful bookings.

As the dataset pertains mostly to matching, the focus of this notebook will be on booking rates. Further analysis with access to more data should consider ratings and Customer Lifetime Value optimization (for guests and hosts).

Approach

  • (0) Data importation

  • (1) Feature engineering

    • Important confounding variables
    • Likely actionable levers
  • (2) Get a first intuition of what features matter

    • Use a random forest classifier with many trees, and look at feature importances
  • (3) Visualization of key features

    • and how to transform them so they provide a linear response
  • (4) Logistic regression to attribute impact of significant variables, controlling for confounding factors

    • Focused on the ‘book it’ medium here, as the recommendation is to steer away from the ‘contact me’ medium

Assumptions

  • Dataset
    • Assumed that the dataset is a random, unbiased sample of the actual market
    • No multi-user analysis was made on that basis. If this is the entire dataset, the high number of hosts with only one booking in the 6-month period may require more attention. Given setup cost considerations, better targeting and active retention would be needed (better matching is only part of it).

(0) Data importation

In [90]:
# data manipulation
import numpy as np
import pandas as pd

# plots
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# machine-assisted exploration
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# statistical analysis and regression model
import statsmodels.discrete.discrete_model as sm
In [7]:
contacts = pd.read_csv('contacts.csv')
contacts.head(2)
Out[7]:
id_guest_anon id_host_anon id_listing_anon ts_interaction_first ts_reply_at_first ts_accepted_at_first ts_booking_at ds_checkin_first ds_checkout_first m_guests m_interactions m_first_message_length_in_characters contact_channel_first guest_user_stage_first
0 da8656a1-51af-4f38-b1c4-94be1f585157 5426897d-960d-4013-9e38-606ae746793c a408a8b2-0d44-4513-a611-3736d0409bb2 2016-04-21 02:55:53 2016-04-21 03:15:00.0 2016-04-21 03:15:00.0 2016-04-21 03:15:00.0 2016-08-02 2016-08-06 4.0 30 31.0 book_it past_booker
1 8590d6f1-8bc9-4e8b-bdfb-de78f69fcf37 f30417c5-6df4-45ac-bfc2-6ad1cce398ab e387c705-0aeb-464c-9375-ece63be6f006 2016-02-16 22:14:01 2016-02-16 23:37:36.0 NaN NaN 2016-08-11 2016-08-22 5.0 2 312.0 contact_me past_booker
In [8]:
contacts.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27887 entries, 0 to 27886
Data columns (total 14 columns):
id_guest_anon                           27887 non-null object
id_host_anon                            27887 non-null object
id_listing_anon                         27887 non-null object
ts_interaction_first                    27887 non-null object
ts_reply_at_first                       25855 non-null object
ts_accepted_at_first                    16415 non-null object
ts_booking_at                           11587 non-null object
ds_checkin_first                        27887 non-null object
ds_checkout_first                       27887 non-null object
m_guests                                27886 non-null float64
m_interactions                          27887 non-null int64
m_first_message_length_in_characters    27887 non-null float64
contact_channel_first                   27887 non-null object
guest_user_stage_first                  27887 non-null object
dtypes: float64(2), int64(1), object(11)
memory usage: 3.0+ MB

get a feel for the context of this dataset

In [3]:
contacts['id_guest_anon'].agg(['nunique', 'count'])
Out[3]:
nunique    22566
count      27887
Name: id_guest_anon, dtype: int64
In [4]:
contacts['id_host_anon'].agg(['nunique', 'count'])
Out[4]:
nunique     8959
count      27887
Name: id_host_anon, dtype: int64
In [5]:
contacts['id_listing_anon'].agg(['nunique', 'count'])
Out[5]:
nunique    12819
count      27887
Name: id_listing_anon, dtype: int64
In [6]:
contacts['ts_interaction_first'].agg(['min', 'max'])
Out[6]:
min    2016-01-01 00:46:47
max    2016-06-30 23:58:14
Name: ts_interaction_first, dtype: object
In [10]:
users = pd.read_csv('users.csv')
users.head(2)
Out[10]:
id_user_anon country words_in_user_profile
0 1d16a001-31a2-494c-a101-1f7b308adc62 FR 0
1 42607e0a-86c0-472e-b633-9e192114e93c AR 0
In [14]:
listings = pd.read_csv('listings.csv')
listings.head(2)
Out[14]:
id_listing_anon room_type listing_neighborhood total_reviews
0 71582793-e5f8-46d7-afdf-7a31d2341c79 Private room -unknown- 0.0
1 a1a3f728-e21f-4432-96aa-361d28e2b319 Entire home/apt Copacabana 0.0
In [13]:
print(listings['listing_neighborhood'].nunique())
listings['listing_neighborhood'].value_counts()[:10]
68
Out[13]:
-unknown-          6221
Copacabana         2531
Ipanema            1041
Barra da Tijuca     593
Leblon              458
Botafogo            345
Santa Teresa        243
Flamengo            171
Lapa                171
Leme                168
Name: listing_neighborhood, dtype: int64

It would be useful to visualize booking volumes and conversion rates across neighborhoods. However, a large share of the listings have no neighborhood affiliation.

map with the 'Bairro' (neighborhoods) http://portalgeo-pcrj.opendata.arcgis.com/datasets/8454eb0454b7424d89c61b67742286a1_15?geometry=-77.293%2C5.937%2C-73.723%2C6.756
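A possible sketch for that map, assuming geopandas is installed and the 'Bairro' layer from the link above has been downloaded locally (the file name and the 'NOME' column are assumptions):

import geopandas as gpd  # assumption: geopandas available

# per-neighborhood enquiry counts and booking rates, from the raw tables loaded above
cl = contacts.merge(listings, on='id_listing_anon', how='left')
cl['booked'] = cl['ts_booking_at'].notnull().astype(int)
neigh_rates = cl.groupby('listing_neighborhood')[['booked']].agg(['count', 'sum'])
neigh_rates.columns = ['enquiries', 'booked']
neigh_rates['booking_rate'] = neigh_rates['booked'] / neigh_rates['enquiries']
neigh_rates = neigh_rates.reset_index()

# choropleth of booking rate by Bairro
bairros = gpd.read_file('bairros_rio.shp')  # hypothetical local export of the portalgeo-pcrj layer
bairros = bairros.merge(neigh_rates, left_on='NOME', right_on='listing_neighborhood', how='left')
bairros.plot(column='booking_rate', cmap='Greens', legend=True, figsize=(12, 8))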

brainstorm how to approach the problem

KPIs: what do people care about?

  • Listing: how full is it? (cost efficiency)
  • Listing & Guest: how many interactions are needed to get a successful transaction (time efficiency)
  • Listing & Guest: lead time to get a booking finalized (time efficiency); the last two are sketched below
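Occupancy cannot be measured from this extract (no listing calendar), but the other two KPIs can be roughly proxied from the contacts table loaded above (a sketch):

# successful bookings only
booked_rows = contacts[contacts['ts_booking_at'].notnull()].copy()

# interactions needed to reach a confirmed booking
print(booked_rows['m_interactions'].describe())

# lead time from first interaction to confirmed booking, in hours
lead_time_h = (pd.to_datetime(booked_rows['ts_booking_at'])
               - pd.to_datetime(booked_rows['ts_interaction_first'])).dt.total_seconds() / 3600
print(lead_time_h.describe())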

What can be done differently

  • encourage auto booking
  • tweak recommendation engine to maximise conversion rates, or rather the number of overall successful bookings
    • e.g. match nationality to entire apartments in a certain neighborhood
  • direct marketing campaigns to
    • some nationalities or neighborhoods
    • at certain times of the year, if there is a supply vs demand mismatch
  • suggest faster response time or manage expectations
  • set up reminders for bookings? or follow-up emails with other suggestions?
  • incentives to develop the profile and write a longer first message
  • etc

Potential ways to identify inefficiencies

  • conversion funnel of users and listings
  • geographical or temporal mismatch
  • regression on performance variables e.g. faster response

Potential segments

  • area of the city
  • nationality of the person who booked
  • message length and number
  • first contact channel
  • number of guests
  • number of bookings made (experienced guests and hosts)

Machine-assisted exploration

  • the dataset looks at only part of the matching process: it is partial, and captures what happens for people who have already expressed an interest in each other
  • in that situation, what we can learn is:
    • what interaction modalities are associated with a conversion --> that'd be useful to know so Airbnb can encourage the most effective behaviors
    • what context is more auspicious to a match, so Airbnb can:
      • invest more in that context (e.g. a part of the city), advertise to certain types of customers
      • encourage users to shift to that context / segment

However,

  • There are a lot of potential insightful variables
  • Let's build a random forest classifier to predict booking, and see what features matter

pre-processing

In [15]:
# --------------- join datasets ---------------
joined = contacts.merge(users, left_on = 'id_guest_anon', right_on = 'id_user_anon', how='left')
joined = joined.merge(listings, on = 'id_listing_anon', how='left')
# checked the match: 100%

# --------------- data type conversion ---------------
joined['ts_interaction_first'] = pd.to_datetime(joined['ts_interaction_first'])
joined['ts_reply_at_first'] = pd.to_datetime(joined['ts_reply_at_first'])
joined['ts_accepted_at_first'] = pd.to_datetime(joined['ts_accepted_at_first'])
joined['ts_booking_at'] = pd.to_datetime(joined['ts_booking_at'])

joined['ds_checkin_first'] = pd.to_datetime(joined['ds_checkin_first'])
joined['ds_checkout_first'] = pd.to_datetime(joined['ds_checkout_first'])

feature engineering (filter later)

In [16]:
# --------------- feature engineering ---------------
# boolean variables for the conversion funnel: reply, accepted, booking
joined['contact'] = 1 #non predictive feature that's just convenient for funnel analysis
joined['replied'] = joined['ts_reply_at_first'].apply(lambda x: 0 if x is pd.NaT else 1)
joined['accepted'] = joined['ts_accepted_at_first'].apply(lambda x: 0 if x is pd.NaT else 1)
joined['booked'] = joined['ts_booking_at'].apply(lambda x: 0 if x is pd.NaT else 1)

# booking speed (seconds)
# caveat: .seconds keeps only the seconds component of the timedelta (whole days are dropped);
# .total_seconds() would be more robust if replies ever take longer than a day
joined['delay_firstreply'] = joined.apply(lambda x: (x['ts_reply_at_first'] - x['ts_interaction_first']).seconds, axis=1)
joined['delay_firstaccepted'] = joined.apply(lambda x: (x['ts_accepted_at_first'] - x['ts_interaction_first']).seconds, axis=1)
joined['delay_booked'] = joined.apply(lambda x: (x['ts_booking_at'] - x['ts_interaction_first']).seconds, axis=1)

# deal with nA
joined['delay_firstreply'] = joined['delay_firstreply'].fillna(value=3600*40)

joined['log_delay_firstreply'] = np.log(joined['delay_firstreply']+1)
joined['response_within_15min'] = joined['delay_firstreply'].apply(lambda x: 1 if x<=15*60 else 0)
joined['response_between_15_and_45min'] = joined['delay_firstreply'].apply(lambda x: 1 if ((x>15*60)&(x<=45*60)) else 0)

# ------ deal with NaN for response time and Cie
# it seems that NaN corresponds to an expired response time

# regression on conversion rate vs response time suggests that a ~40h response time would have 0% booking rate
# so for our attribution model, it seems that attributing 40h to the response time NaN could do the trick
#joined['delay_firstreply'] = joined['delay_firstreply'].fillna(value=3600*40)

# previous approach
#no_answer_infinitedelay = 6*30*24*3600 #arbitrary number for now: potentially improve with
#joined[['delay_firstreply', 'delay_firstaccepted', 'delay_booked']] = joined[['delay_firstreply', 'delay_firstaccepted', 'delay_booked']].fillna(value=no_answer_infinitedelay, axis=1)

# only 35 people have an unknown 'guest_user_stage_first' status
# for statistical significance we'll get rid of that category, and initially assign it to new users
joined['guest_user_stage_first'] = joined['guest_user_stage_first'].apply(lambda x: 'new' if x=='-unknown-' else x)


# --> potentially create another variable that's the time to answer normalized by enquiry_to_checkin

# how far in advance is enquiry made (days)
joined['enquiry_to_checkin'] = joined.apply(lambda x: (x['ds_checkin_first'] - x['ts_interaction_first']).days, axis=1)
joined['enquiry_sameday_checkin'] = joined['enquiry_to_checkin'].apply(lambda x: 1 if x==0 else 0)
joined['enquiry_nextday_checkin'] = joined['enquiry_to_checkin'].apply(lambda x: 1 if x<=1 else 0)
joined['enquiry_sameweek_checkin'] = joined['enquiry_to_checkin'].apply(lambda x: 1 if x<=7 else 0)

# how many nights booked
joined['nights_booked'] = joined.apply(lambda x: (x['ds_checkout_first'] - x['ds_checkin_first']).days, axis=1)

# random forest suggests that being Brazilian vs not makes a difference
# we'll create a specific feature for this
joined['isbrazilian'] = joined['country'].apply(lambda x: 1 if x=='BR' else 0)

# keep only most frequent nationalities and neighborhoods
country_freq = joined[['id_guest_anon', 'country']].groupby('country').count()
frequent_country = set(country_freq[country_freq['id_guest_anon']>50].index)
joined['frequent_countries'] = joined['country'].apply(lambda x: x if (x in frequent_country) else 'other_country')

neighborhood_freq = joined[['id_guest_anon', 'listing_neighborhood']].groupby('listing_neighborhood').count()
frequent_neighborhood = set(neighborhood_freq[neighborhood_freq['id_guest_anon']>50].index)
joined['frequent_neighborhoods'] = joined['listing_neighborhood'].apply(lambda x: x if (x in frequent_neighborhood) else 'other_neighborhood')

not_unknown_neigh = joined[joined['listing_neighborhood']!='-unknown-']
country_neighborhood_freq = not_unknown_neigh[['id_listing_anon', 'listing_neighborhood', 'country']].groupby(['listing_neighborhood', 'country']).count()
frequent_country_neigh = set(country_neighborhood_freq[country_neighborhood_freq['id_listing_anon']>50].index)
joined['frequent_country_neigh'] = joined[['listing_neighborhood', 'country']].apply(lambda x: 
     x['listing_neighborhood']+'_'+x['country'] if ((x['listing_neighborhood'], x['country']) in frequent_country_neigh) 
                                             else 'other_country_neighborhood', axis=1)

# identify the very top super hosts

# -------- transform variable so response is more linear ----------
# a few values were negatives... interpreted as error
median_reviews = joined['total_reviews'].median()
joined['total_reviews'] = joined['total_reviews'].apply(lambda x: x if x>=0 else median_reviews) 
joined['log_reviews'] = np.log(joined['total_reviews']+1)

# a few negative values.. interpreted as error
median_enquiry2checkin = joined['enquiry_to_checkin'].median()
joined['enquiry_to_checkin'] = joined['enquiry_to_checkin'].apply(lambda x: x if x>=0 else median_enquiry2checkin)
joined['log_enquiry2checkin'] = np.log(joined['enquiry_to_checkin']+1)


# hour of the day, day of the week and month of the year, for both enquiry and checkin
joined['enquiry_h'] = joined['ts_interaction_first'].apply(lambda x: x.hour)
joined['enquiry_dow'] = joined['ts_interaction_first'].apply(lambda x: x.weekday())
joined['enquiry_month'] = joined['ts_interaction_first'].apply(lambda x: x.month)

joined['enquiry_night'] = joined['enquiry_h'].apply(lambda x: 0 if ((x>=7)&(x<=23)) else 1)
joined['enquiry_middleday'] = joined['enquiry_h'].apply(lambda x: 1 if ((x>=9)&(x<=18)) else 0)
joined['enquiry_weekend'] = joined['enquiry_dow'].apply(lambda x: 1 if x>=4 else 0)
joined['enquiry_likely_unavailabletime'] = joined.apply(lambda x: 
        1 if ((x['enquiry_night']+(1-x['enquiry_weekend'])*x['enquiry_middleday'])>=1) else 0, axis=1)

# all those indicators are useful for host behaviors, as they are in Rio's local time
# if one wanted to infer information about guests from timing, time zone correction may need to be made (from nationality isn't great --> from ip location)

joined['checkin_dow'] = joined['ds_checkin_first'].apply(lambda x: x.weekday())
joined['checkin_month'] = joined['ds_checkin_first'].apply(lambda x: x.month)

joined['checkin_weekend'] = joined['checkin_dow'].apply(lambda x: 1 if x>=4 else 0)

# complete or substitute these seasonal numerical variables with
# --> Fourier transform coefficients, or other methods (ARIMA, prophet)
# --> perhaps some dummy variables where it could make sense (week vs weekend, working hours , sleeping hours etc)

joined.head()
Out[16]:
id_guest_anon id_host_anon id_listing_anon ts_interaction_first ts_reply_at_first ts_accepted_at_first ts_booking_at ds_checkin_first ds_checkout_first m_guests ... enquiry_h enquiry_dow enquiry_month enquiry_night enquiry_middleday enquiry_weekend enquiry_likely_unavailabletime checkin_dow checkin_month checkin_weekend
0 da8656a1-51af-4f38-b1c4-94be1f585157 5426897d-960d-4013-9e38-606ae746793c a408a8b2-0d44-4513-a611-3736d0409bb2 2016-04-21 02:55:53 2016-04-21 03:15:00 2016-04-21 03:15:00 2016-04-21 03:15:00 2016-08-02 2016-08-06 4.0 ... 2 3 4 1 0 0 1 1 8 0
1 8590d6f1-8bc9-4e8b-bdfb-de78f69fcf37 f30417c5-6df4-45ac-bfc2-6ad1cce398ab e387c705-0aeb-464c-9375-ece63be6f006 2016-02-16 22:14:01 2016-02-16 23:37:36 NaT NaT 2016-08-11 2016-08-22 5.0 ... 22 1 2 0 0 0 0 3 8 0
2 ebcd83ba-bda1-47eb-9680-2dd04ccf3988 13cbf50a-3272-45d4-9866-a06b6ea1b99a d1eb1960-938f-4305-a353-51e224414dd4 2016-01-27 23:33:38 2016-01-28 02:12:47 NaT NaT 2016-03-14 2016-03-23 1.0 ... 23 2 1 0 0 0 0 0 3 0
3 b0af8848-fe2a-4ef1-991e-26ab3066feb3 01614601-d5a4-4776-ab9b-c10d3b865bf0 855f6779-346c-45fc-a64b-ea133250ca7a 2016-05-05 14:42:52 2016-05-05 15:17:40 2016-05-05 15:17:40 NaT 2016-05-27 2016-05-29 2.0 ... 14 3 5 0 1 0 1 4 5 1
4 5ddbbcc3-ac1a-4d8a-b6f8-0f675b76d1b2 f2fed6f3-4c5c-453d-9e64-37c62b8bd06d f2928a59-c5e7-42b2-9c37-240a40f38dc0 2016-06-23 03:09:25 2016-06-23 03:09:26 2016-06-23 03:09:33 2016-06-23 03:09:33 2016-08-19 2016-08-21 3.0 ... 3 3 6 1 0 0 1 4 8 1

5 rows × 51 columns
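One way to act on the seasonal-variable comment at the end of the cell above is a cyclical sin/cos encoding, so that hour 23 and hour 0 (or December and January) end up close together. A minimal sketch on the columns created above:

# cyclical encoding of hour of day and month of year
joined['enquiry_h_sin'] = np.sin(2 * np.pi * joined['enquiry_h'] / 24)
joined['enquiry_h_cos'] = np.cos(2 * np.pi * joined['enquiry_h'] / 24)
joined['checkin_month_sin'] = np.sin(2 * np.pi * (joined['checkin_month'] - 1) / 12)
joined['checkin_month_cos'] = np.cos(2 * np.pi * (joined['checkin_month'] - 1) / 12)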

(2) Get a first intuition of what features matter

prepare for the random forest, and filter a bit

In [22]:
#joined.columns
In [20]:
# ------------- prepare for exploratory model (random forest) -------------

# drop variables that aren't generalizable
df_model_1 = joined.drop(columns=['id_guest_anon', 'id_host_anon', 'id_listing_anon', 'ts_interaction_first', 'ts_reply_at_first',
                          'ts_accepted_at_first', 'ts_booking_at', 'ds_checkin_first', 'ds_checkout_first',
                          'id_user_anon', 'contact'], axis=1)

# variables that are partial success factors in the conversion funnel
# we'll keep them in the model initially, just to check that the funnel indeed progresses in stages
# replied, accepted, (booked), as well as time to do those things --> they lead to 100% prediction indeed

df_model_1.drop(columns=['replied', 'accepted', 'delay_firstreply', 'delay_firstaccepted', 
                         'delay_booked', 'm_interactions'], inplace=True)
# caution also about log_delay_firstreply
# keep delay_firstreply only if interested in attributing the impact of response time

# drop variables that do not seem to have a high predictive power, or are redundant
# (method: looked at feature importances in a random forest + business judgement)
df_model_1.drop(columns=['country', 'listing_neighborhood', 'enquiry_h', 'enquiry_dow',
                        'enquiry_to_checkin', 'total_reviews',
                        'frequent_countries', 'frequent_neighborhoods'], inplace=True)


# convert categorical variables into dummy variables
df_model_1 = pd.get_dummies(df_model_1, columns=['contact_channel_first', 'guest_user_stage_first', 'room_type'
                                                 , 'frequent_country_neigh'
                                                ],drop_first=True)

# because we're using a random forest initially, no need to convert seasonal numerical variables into dummies


# ----- caution: a lot of features: high risk of overfit --------
# look indicatively at: rows / features, and then learning curves etc

# ------ drop NaN rows for now; fine for a first pass, but not a good approach to keep (it throws data away)
df_model_1.dropna(inplace=True)
In [21]:
y = df_model_1['booked']
X = df_model_1.drop(columns='booked')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = RandomForestClassifier(n_jobs=-1, random_state=42, n_estimators=1000)
model.fit(X_train, y_train)


from sklearn.model_selection import cross_val_score, StratifiedKFold

scores = cross_val_score(model, X_train, y_train, cv=5)

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1_macro')
print("F1 macro: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.89 (+/- 0.01)
F1 macro: 0.88 (+/- 0.01)

hyperparameter optimization and learning curves if time allows
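Neither is done here; a minimal sketch of what they could look like, reusing X_train / y_train from the cell above (the grid values are arbitrary assumptions, not tuned choices):

from sklearn.model_selection import GridSearchCV, learning_curve

param_grid = {'n_estimators': [200, 500],
              'max_depth': [None, 10, 20],
              'min_samples_leaf': [1, 5, 20]}
search = GridSearchCV(RandomForestClassifier(n_jobs=-1, random_state=42),
                      param_grid, cv=5, scoring='f1_macro')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

# learning curve to eyeball over/underfitting
train_sizes, train_scores, val_scores = learning_curve(
    search.best_estimator_, X_train, y_train, cv=5, scoring='f1_macro',
    train_sizes=np.linspace(0.1, 1.0, 5))
plt.plot(train_sizes, train_scores.mean(axis=1), label='train')
plt.plot(train_sizes, val_scores.mean(axis=1), label='cross-validation')
plt.xlabel('training set size')
plt.ylabel('f1_macro')
plt.legend()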

get a feel for what features matter and how the problem works

In [23]:
importances = model.feature_importances_
std = np.std([tree.feature_importances_ for tree in model.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking and contribution
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d %s (%f)" % (f + 1, indices[f],X.columns.tolist()[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
#plt.xlim([-1, X.shape[1]])
plt.show()
Feature ranking:
1. feature 21 contact_channel_first_contact_me (0.181782)
2. feature 22 contact_channel_first_instant_book (0.162868)
3. feature 3 log_delay_firstreply (0.151128)
4. feature 11 log_reviews (0.071082)
5. feature 4 response_within_15min (0.054449)
6. feature 1 m_first_message_length_in_characters (0.052770)
7. feature 12 log_enquiry2checkin (0.047524)
8. feature 9 nights_booked (0.034063)
9. feature 19 checkin_month (0.029637)
10. feature 13 enquiry_month (0.022598)
11. feature 0 m_guests (0.021247)
12. feature 18 checkin_dow (0.020823)
13. feature 2 words_in_user_profile (0.019977)
14. feature 10 isbrazilian (0.015266)
15. feature 23 guest_user_stage_first_past_booker (0.011331)
16. feature 15 enquiry_middleday (0.007610)
17. feature 16 enquiry_weekend (0.007589)
18. feature 17 enquiry_likely_unavailabletime (0.007193)
19. feature 24 room_type_Private room (0.007091)
20. feature 72 frequent_country_neigh_other_country_neighborhood (0.006972)
21. feature 20 checkin_weekend (0.006303)
22. feature 5 response_between_15_and_45min (0.005906)
23. feature 8 enquiry_sameweek_checkin (0.005892)
24. feature 14 enquiry_night (0.005867)
25. feature 35 frequent_country_neigh_Copacabana_BR (0.003727)
26. feature 7 enquiry_nextday_checkin (0.003610)
27. feature 52 frequent_country_neigh_Ipanema_BR (0.002325)
28. feature 25 room_type_Shared room (0.002180)
29. feature 6 enquiry_sameday_checkin (0.002135)
30. feature 46 frequent_country_neigh_Copacabana_US (0.002020)
31. feature 63 frequent_country_neigh_Leblon_BR (0.001740)
32. feature 26 frequent_country_neigh_Barra da Tijuca_BR (0.001604)
33. feature 33 frequent_country_neigh_Copacabana_AR (0.001568)
34. feature 58 frequent_country_neigh_Ipanema_US (0.001497)
35. feature 28 frequent_country_neigh_Botafogo_BR (0.001295)
36. feature 43 frequent_country_neigh_Copacabana_GB (0.001164)
37. feature 48 frequent_country_neigh_Flamengo_BR (0.001080)
38. feature 42 frequent_country_neigh_Copacabana_FR (0.001026)
39. feature 57 frequent_country_neigh_Ipanema_GB (0.000918)
40. feature 64 frequent_country_neigh_Leblon_US (0.000854)
41. feature 60 frequent_country_neigh_Lapa_BR (0.000845)
42. feature 50 frequent_country_neigh_Ipanema_AR (0.000789)
43. feature 68 frequent_country_neigh_Santa Teresa_BR (0.000789)
44. feature 34 frequent_country_neigh_Copacabana_AU (0.000772)
45. feature 67 frequent_country_neigh_Recreio dos Bandeirantes_BR (0.000696)
46. feature 40 frequent_country_neigh_Copacabana_DE (0.000653)
47. feature 70 frequent_country_neigh_Santa Teresa_US (0.000605)
48. feature 56 frequent_country_neigh_Ipanema_FR (0.000593)
49. feature 36 frequent_country_neigh_Copacabana_CA (0.000580)
50. feature 66 frequent_country_neigh_Leme_US (0.000515)
51. feature 38 frequent_country_neigh_Copacabana_CL (0.000513)
52. feature 47 frequent_country_neigh_Copacabana_UY (0.000487)
53. feature 71 frequent_country_neigh_Tijuca_BR (0.000458)
54. feature 61 frequent_country_neigh_Laranjeiras_BR (0.000407)
55. feature 30 frequent_country_neigh_Botafogo_US (0.000385)
56. feature 65 frequent_country_neigh_Leme_BR (0.000379)
57. feature 29 frequent_country_neigh_Botafogo_FR (0.000368)
58. feature 55 frequent_country_neigh_Ipanema_DE (0.000366)
59. feature 37 frequent_country_neigh_Copacabana_CH (0.000343)
60. feature 69 frequent_country_neigh_Santa Teresa_FR (0.000340)
61. feature 41 frequent_country_neigh_Copacabana_ES (0.000336)
62. feature 45 frequent_country_neigh_Copacabana_NL (0.000329)
63. feature 62 frequent_country_neigh_Leblon_AR (0.000322)
64. feature 49 frequent_country_neigh_Glória_BR (0.000312)
65. feature 51 frequent_country_neigh_Ipanema_AU (0.000287)
66. feature 27 frequent_country_neigh_Barra da Tijuca_US (0.000279)
67. feature 54 frequent_country_neigh_Ipanema_CL (0.000268)
68. feature 59 frequent_country_neigh_Lagoa_BR (0.000259)
69. feature 31 frequent_country_neigh_Catete_BR (0.000242)
70. feature 39 frequent_country_neigh_Copacabana_CO (0.000239)
71. feature 44 frequent_country_neigh_Copacabana_IT (0.000197)
72. feature 32 frequent_country_neigh_Centro_BR (0.000183)
73. feature 53 frequent_country_neigh_Ipanema_CA (0.000155)
In [24]:
y = df_model_1['booked']
#X = df_model_1[['contact_channel_first_instant_book', 'm_interactions', 'contact_channel_first_contact_me', 'total_reviews', 'm_first_message_length_in_characters']]
X = df_model_1.drop(columns='booked')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

scores = cross_val_score(model, X_train, y_train, cv=5)

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1_macro')
print("F1 macro: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


import graphviz
from sklearn import tree

# inline rendering, suitable for small trees
dot_data = tree.export_graphviz(model, out_file=None, 
                         feature_names=X.columns.tolist(),
                                class_names=["No booking","Booked"],
                         filled=True, rounded=False, precision=2, label='root', impurity=False,
                         special_characters=True, max_depth=4) 
graph = graphviz.Source(dot_data)
graph
#graph.format = 'png'
#graph.render("tree")
Accuracy: 0.84 (+/- 0.01)
F1 macro: 0.83 (+/- 0.01)
Out[24]:
[Rendered decision tree (first 4 levels): the root splits on contact_channel_first_instant_book (instant bookings are all booked), then on contact_channel_first_contact_me and log_reviews, then log_delay_firstreply, and deeper down on log_enquiry2checkin, checkin_month and m_first_message_length_in_characters.]

(3) visualization of key features, and how to linearize them

(3.1) funnel analysis by contact medium

In [25]:
funnel_contactchannel = joined.groupby('contact_channel_first')[['contact', 'replied', 'accepted', 'booked']].sum()
funnel_contactchannel['replied%'] = funnel_contactchannel['replied'] / funnel_contactchannel['contact']
funnel_contactchannel['accepted%'] = funnel_contactchannel['accepted'] / funnel_contactchannel['contact']
funnel_contactchannel['booked%'] = funnel_contactchannel['booked'] / funnel_contactchannel['contact']
funnel_contactchannel
Out[25]:
contact replied accepted booked replied% accepted% booked%
contact_channel_first
book_it 8382 7517 4249 3991 0.896803 0.506920 0.476139
contact_me 12891 11712 5499 911 0.908541 0.426577 0.070669
instant_book 6706 6706 6706 6706 1.000000 1.000000 1.000000
In [26]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
funnel_contactchannel[['contact','replied', 'accepted', 'booked']].plot(kind='bar', ax=ax)
plt.title('Matching funnel by contact channel (Jan-Jun 2016)')
plt.tight_layout()
plt.xlabel('Initial contact channel')
plt.ylabel('Enquiries between Jan & Jun 2016')
Out[26]:
Text(83.5, 0.5, 'Enquiries between Jan & Jun 2016')
In [27]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
funnel_contactchannel[['replied%', 'accepted%', 'booked%']].plot(kind='bar', ax=ax)
plt.title('conversion funnel by contact channel')
Out[27]:
Text(0.5, 1.0, 'conversion funnel by contact channel')

(3.2) impact of prompt responses

In [28]:
print('instant book')
print(joined[joined['contact_channel_first'] == 'instant_book'][['delay_firstreply', 'delay_firstaccepted', 'delay_booked']].agg(['mean', 'median']))
print()
print('contact_me or book_it')
print(joined[joined['contact_channel_first'] != 'instant_book'][['delay_firstreply', 'delay_firstaccepted', 'delay_booked']].agg(['mean', 'median']))
instant book
        delay_firstreply  delay_firstaccepted  delay_booked
mean           885.39159           970.999105    974.526693
median           0.00000             0.000000      0.000000

contact_me or book_it
        delay_firstreply  delay_firstaccepted  delay_booked
mean        32522.763644         23081.724969  24021.865157
median      12203.000000         10013.000000  11230.000000
In [41]:
# .copy() so later column assignments don't trigger SettingWithCopyWarning
not_instantbooked = joined[joined['contact_channel_first'] != 'instant_book'].copy()
book_it = joined[joined['contact_channel_first'] == 'book_it'].copy()
In [32]:
with sns.plotting_context('notebook', font_scale=1.5):
    fig, ax = plt.subplots()
    fig.set_size_inches(16, 7)
    bins = np.arange(0, 48, 0.5)
    g = sns.distplot(book_it[book_it['booked']==1]['delay_firstreply']/3600, label='booked', ax=ax, kde=False, bins=bins)
    g = sns.distplot(book_it[book_it['booked']==0]['delay_firstreply']/3600, label='not booked', ax=ax, kde=False, bins=bins)
    g.set_xlim(left=0, right=23)
    g.legend()
    plt.title("Host response time delay, and impact on booking rates (Jan-Jun 2016, 'book it' medium)")
    plt.xlabel('Delay of first reply (h)')
    plt.ylabel('Number enquiries')
In [34]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
g = sns.distplot(np.log(book_it[book_it['booked']==1]['delay_firstreply']+1), label='booked', ax=ax, kde=False)
g = sns.distplot(np.log(book_it[book_it['booked']==0]['delay_firstreply']+1), label='not booked', ax=ax, kde=False)
g.set_xlim(left=0, right=12)
g.legend()
plt.xlabel('Delay of first reply (log of response time in seconds)')
plt.ylabel('Number enquiries')
Out[34]:
Text(0, 0.5, 'Number enquiries')
In [35]:
firstreply_h_bin = np.concatenate(([0, 0.25, 0.5], np.arange(1,24,1)), axis=None)
book_it['firstreply_bins'] = pd.cut(book_it['delay_firstreply']/3600, bins=firstreply_h_bin, labels=False)
bookit_binned = book_it.groupby('firstreply_bins')[['booked']].agg(['count', 'sum'])
bookit_binned.columns = bookit_binned.columns.droplevel(0)
#bookit_binned.reset_index(inplace=True)
bookit_binned['booking_rate']  = bookit_binned['sum'] / bookit_binned['count'] 
bookit_binned_aug = pd.concat([bookit_binned, pd.DataFrame({'firstreply_h_bin':firstreply_h_bin[0:-1]})], axis=1)
bookit_binned_aug.set_index(keys='firstreply_h_bin', inplace=True)
bookit_binned_aug
Out[35]:
count sum booking_rate
firstreply_h_bin
0.00 1446 947 0.654910
0.25 665 441 0.663158
0.50 760 456 0.600000
1.00 808 422 0.522277
2.00 475 241 0.507368
3.00 351 177 0.504274
4.00 310 153 0.493548
5.00 213 97 0.455399
6.00 238 114 0.478992
7.00 192 93 0.484375
8.00 209 96 0.459330
9.00 207 96 0.463768
10.00 138 63 0.456522
11.00 143 58 0.405594
12.00 183 65 0.355191
13.00 143 60 0.419580
14.00 111 43 0.387387
15.00 95 33 0.347368
16.00 88 29 0.329545
17.00 93 26 0.279570
18.00 90 27 0.300000
19.00 92 28 0.304348
20.00 81 31 0.382716
21.00 70 16 0.228571
22.00 66 15 0.227273
In [36]:
with sns.plotting_context('notebook', font_scale=1.5):
    fig, ax = plt.subplots()
    fig.set_size_inches(14, 7)

    g = sns.regplot(x="firstreply_h_bin", y="booking_rate", data=bookit_binned_aug.reset_index(), ax=ax)
    #g.set_axis_labels("Response time (h)", "Average booking rate")
    g.set(ylim=(0, 0.8))
    plt.xlabel('First response delay (h)')
    plt.title("Impact of host response delay on booking rate (Jan-Jun 2016, 'book it' medium)")
  • initially, a big loss in booking rate within the first hour (~5% within the first 45 min, ~13% within the first hour)
  • the next drop seems to start after 2h (from the log curves)
  • roughly 0.3 / 20 = 1.5% booking rate loss per hour in the long run (these figures are backed out from the binned table in the sketch below)
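A quick check of those figures against the binned table computed above (bookit_binned_aug, indexed by the left edge of each response-time bin):

rates = bookit_binned_aug['booking_rate']
print('fastest bin vs 0.5-1h bin : %.1f points drop' % (100 * (rates.loc[0.00] - rates.loc[0.50])))
print('fastest bin vs 1-2h bin   : %.1f points drop' % (100 * (rates.loc[0.00] - rates.loc[1.00])))
# rough long-run slope between the 1-2h bin and the 22-23h bin
print('long-run loss per hour    : %.1f points' % (100 * (rates.loc[1.00] - rates.loc[22.00]) / 21))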
In [37]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
g = sns.regplot(x="firstreply_h_bin", y="booking_rate", data=bookit_binned_aug.reset_index(), ax=ax)
g.set(xlim=(0, 48), ylim=(0, 1))
Out[37]:
[(0, 1), (0, 48)]

the linear regression seems to cross y = 0 at roughly 40 h

--> we'll use that value to fill the NaN response times when attributing the impact of response time on booking rate
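The ~40 h figure can be sanity-checked with a quick straight-line fit on the same binned data (a sketch):

# fit booking_rate ~ slope * hours + intercept on the binned points and solve for the x-intercept
tmp = bookit_binned_aug.reset_index()
slope, intercept = np.polyfit(tmp['firstreply_h_bin'], tmp['booking_rate'], 1)
print('booking rate extrapolates to 0 at roughly %.0f hours' % (-intercept / slope))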

transform so response is more linear

In [38]:
firstreply_log_bin = range(0,13)
book_it['firstreply_bins'] = pd.cut(np.log(book_it['delay_firstreply']+1), bins=firstreply_log_bin, labels=False)
bookit_binned = book_it.groupby('firstreply_bins')[['booked']].agg(['count', 'sum'])
bookit_binned.columns = bookit_binned.columns.droplevel(0)
#bookit_binned.reset_index(inplace=True)
bookit_binned['booking_rate']  = bookit_binned['sum'] / bookit_binned['count'] 

firstreply_bin_h = (np.exp(firstreply_log_bin)/3600)[0:-1]
bookit_binned_aug = pd.concat([pd.DataFrame({'firstreply_bin_h':firstreply_bin_h}),bookit_binned], axis=1)
bookit_binned_aug.set_index('firstreply_bin_h', inplace=True)
bookit_binned_aug
Out[38]:
count sum booking_rate
firstreply_bin_h
0.000278 8 5 0.625000
0.000755 24 16 0.666667
0.002053 51 37 0.725490
0.005579 83 61 0.734940
0.015166 244 158 0.647541
0.041226 448 292 0.651786
0.112064 765 504 0.658824
0.304620 1035 651 0.628986
0.828044 1160 620 0.534483
2.250857 1239 605 0.488297
6.118463 1686 725 0.430012
16.631706 1635 314 0.192049
In [39]:
fig, ax = plt.subplots()
ax.set(xscale="log")
ax.set(xlim=(30/3600, 10))
fig.set_size_inches(14, 6)
ax.plot(bookit_binned_aug[['booking_rate']])
plt.xlabel('enquiry response time (h)')
plt.title('Response time influence on booking rates')
plt.show()

(3.3) impact of prompt responses on new guests vs experienced guests

In [43]:
with sns.plotting_context('notebook', font_scale=1.5):
    g = sns.FacetGrid(not_instantbooked[not_instantbooked['guest_user_stage_first']!='-unknown-']
      , row='guest_user_stage_first', col='contact_channel_first', hue='booked',  height=5, aspect=2)
    g.map(sns.distplot, 'log_delay_firstreply', kde=False)
    #g.set(xlim=(0, 300), ylim=(0, 2500))
    g.set_axis_labels("response time (s)", "Number of enquiries")
    g.add_legend()

new guests seem to have booking rates similar to returning guests on the 'book_it' channel; however, they perform markedly worse than past bookers on the 'contact_me' channel.

(3.4) room type

In [44]:
joined['room_type'].value_counts()
Out[44]:
Entire home/apt    21056
Private room        6331
Shared room          592
Name: room_type, dtype: int64
In [45]:
funnel_roomtype = joined.groupby('room_type')[['contact', 'replied', 'accepted', 'booked']].sum()
funnel_roomtype['replied%'] = funnel_roomtype['replied'] / funnel_roomtype['contact']
funnel_roomtype['accepted%'] = funnel_roomtype['accepted'] / funnel_roomtype['contact']
funnel_roomtype['booked%'] = funnel_roomtype['booked'] / funnel_roomtype['contact']
funnel_roomtype
Out[45]:
contact replied accepted booked replied% accepted% booked%
room_type
Entire home/apt 21056 19488 12135 8434 0.925532 0.576320 0.400551
Private room 6331 5909 3994 2981 0.933344 0.630864 0.470858
Shared room 592 538 325 193 0.908784 0.548986 0.326014
In [46]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
funnel_roomtype[['contact', 'replied', 'accepted', 'booked']].plot(kind='bar', ax=ax)
plt.title('matching funnel by room type')
plt.tight_layout()
In [47]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
funnel_roomtype[['replied%', 'accepted%', 'booked%']].plot(kind='bar', ax=ax)
plt.title('conversion funnel by room type')
plt.ylim(0,1)
Out[47]:
(0, 1)

booking rates don't differ strongly by room type, and room types probably cater to different needs and availabilities, so this is probably not a choice that needs to be nudged

(3.5) number of interactions (contains posterior information)

In [48]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
sns.distplot(joined[joined['booked']==1]['m_interactions'], ax=ax, bins=range(0,22), kde=False, label='booked')
sns.distplot(joined[joined['booked']==0]['m_interactions'], ax=ax, bins=range(0,22), kde=False, label='not booked')
plt.legend()
ax.set_xlim(left=0, right=20)
Out[48]:
(0, 20)

m_interactions is not really a predictive variable, in the sense that it probably counts all interactions related to the booking, not just those prior to the booking

(3.6) importance of initial message

In [54]:
# exclude auto booking
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
sns.distplot(not_instantbooked[not_instantbooked['booked']==1]['m_first_message_length_in_characters'], ax=ax, bins=100, kde=False, label='booked')
sns.distplot(not_instantbooked[not_instantbooked['booked']==0]['m_first_message_length_in_characters'], ax=ax, bins=100, kde=False, label='not booked')
plt.legend()
plt.title('book it or contact me media')
ax.set_xlim(left=0, right=500)
Out[54]:
(0, 500)
In [55]:
# 'book it' medium only
book_it = joined[joined['contact_channel_first']=='book_it'].copy()


# exclude auto booking
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
sns.distplot(book_it[book_it['booked']==1]['m_first_message_length_in_characters'], ax=ax, bins=100, kde=False, label='booked')
sns.distplot(book_it[book_it['booked']==0]['m_first_message_length_in_characters'], ax=ax, bins=100, kde=False, label='not booked')
plt.legend()
plt.title('book it medium')
ax.set_xlim(left=0, right=500)
Out[55]:
(0, 500)
  • for the book it medium, the quantitative importance of message length seems limited
  • message quality may still matter though

(3.7) for guest or host acquisition, do some segments convert better?

In [56]:
country_conv = joined.groupby('country')[['contact', 'booked']].sum()
country_conv['booking_rate'] = country_conv['booked'] / country_conv['contact'] 
sns.distplot(country_conv['booking_rate'], kde=False, bins=5)
Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2c4bfd68>
In [57]:
sns.distplot(country_conv[country_conv['contact']>=50]['booking_rate'], kde=False, bins=5)
Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x11be07320>
In [58]:
country_conv[country_conv['contact']>=50].sort_values(by='booking_rate', ascending=False).head()
Out[58]:
contact booked booking_rate
country
BE 112 81 0.723214
NO 78 53 0.679487
DK 92 62 0.673913
SE 83 53 0.638554
CH 262 166 0.633588
In [60]:
neighborhood_conv = joined.groupby('listing_neighborhood')[['contact', 'booked']].sum()
neighborhood_conv['booking_rate'] = neighborhood_conv['booked'] / neighborhood_conv['contact']
sns.distplot(neighborhood_conv[neighborhood_conv['contact']>=50]['booking_rate'], kde=False, bins=5)
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x11bb365f8>
In [61]:
neighborhood_conv[neighborhood_conv['contact']>=50].sort_values(by='booking_rate', ascending=False).head()
Out[61]:
contact booked booking_rate
listing_neighborhood
Urca 78 48 0.615385
Santa Teresa 590 359 0.608475
Lapa 425 245 0.576471
Vidigal 105 56 0.533333
Centro 122 63 0.516393

(3.8) correlation between guest countries, listing neighborhoods and room type for recommendation engine

-> perhaps some nationalities have better booking rates in some areas than others? and so the recommendation engine could try to match those nationalities to these areas?

How shall we structure this?

  • calculate conversion rates for the country & neighborhood pairs present in the dataset
    • this could be used to make recommendations by itself
    • if some countries tend to be associated with some neighborhoods, that could be a meaningful way to cluster?
In [62]:
country_neighborhood = joined.groupby(['country', 'listing_neighborhood'])[['contact', 'booked']].sum().reset_index()
country_neighborhood['booking_rate%'] = country_neighborhood['booked'] / country_neighborhood['contact']
country_neighborhood.sort_values(by='booking_rate%', ascending=False).head()
Out[62]:
country listing_neighborhood contact booked booking_rate%
986 ZA Todos os Santos 1 1 1.0
664 MU Botafogo 1 1 1.0
271 CN Leblon 2 2 1.0
651 MA Ipanema 2 2 1.0
652 MA Leblon 1 1 1.0
In [65]:
country_neighborhood_filtered = country_neighborhood[country_neighborhood['contact']>20]

# keep only countries and neighborhoods that appear in at least 2 of the filtered (country, neighborhood) pairs
countries_count = country_neighborhood_filtered.groupby(['country'])[['listing_neighborhood']].count()
common_countries = countries_count[countries_count['listing_neighborhood']>=2].reset_index().drop(columns='listing_neighborhood')

neigborhood_count = country_neighborhood_filtered.groupby(['listing_neighborhood'])[['country']].count()
common_neighborhood = neigborhood_count[neigborhood_count['country']>=2].reset_index().drop(columns='country')


country_neighborhood_filtered2 = country_neighborhood_filtered.merge(common_countries, on='country', how='inner')
country_neighborhood_filtered2 = country_neighborhood_filtered2.merge(common_neighborhood, on='listing_neighborhood', how='inner')

country_neighborhood_pivot = country_neighborhood_filtered2.pivot(index='country', columns='listing_neighborhood', values='booking_rate%')
In [66]:
# heatmap
with sns.plotting_context('notebook', font_scale=1.3):
    plt.figure(figsize=(18,8))
    sns.heatmap(country_neighborhood_pivot.dropna(how='all'), cmap="Greens")
    plt.title('Booking rates grouped by guest country and listing neighborhood (where enough data is available)')
    plt.tight_layout()
    plt.xlabel('listing neighborhood')
    plt.show()

Several problems with the data:

  • many listings don't indicate the neighborhood ('-unknown-')
  • the small dataset leaves many (country, neighborhood) cells with few contacts, and so may give an unrepresentative picture.
In [67]:
plt.figure(figsize=(18,10))
sns.heatmap(country_neighborhood_pivot.corr().dropna(how='all', axis=0).dropna(how='all', axis=1), cmap="RdYlGn")
plt.title('Correlation between listing neighborhoods, in terms of conversion rates for different countries')
plt.show()
In [69]:
plt.figure(figsize=(18,10))
sns.clustermap(country_neighborhood_pivot.transpose().corr().dropna(how='all', axis=0).dropna(how='all', axis=1).dropna(), cmap='RdYlGn')
plt.title('Correlation between guest nationalities, in terms of conversion rates for different neighborhoods')
plt.show()
<Figure size 1296x720 with 0 Axes>

cluster countries, neighborhoods and pairs if time allows
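Not done here; a sketch of how cluster labels could be pulled out of the same correlation matrix used by the clustermap above, with scipy's hierarchical clustering (the choice of 3 clusters is arbitrary):

from scipy.cluster.hierarchy import linkage, fcluster

# cluster guest countries by how similarly their booking rates vary across neighborhoods
corr = (country_neighborhood_pivot.transpose().corr()
        .dropna(how='all', axis=0).dropna(how='all', axis=1).dropna())
Z = linkage(corr.values, method='average')
labels = fcluster(Z, t=3, criterion='maxclust')  # 3 clusters: an arbitrary choice
print(pd.Series(labels, index=corr.index).sort_values())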

(3.9) number of nights booked (not sure how this can be acted upon)

In [70]:
with sns.plotting_context('notebook', font_scale=1.5):
    g = sns.FacetGrid(joined, row='contact_channel_first', hue='booked',  height=3.5, aspect=4)
    g.map(sns.distplot, 'nights_booked', kde=False, bins=range(0,15,1))
    g.set(xlim=(0, 14), ylim=(0, 2500))
    g.set_axis_labels("Nights booked", "Number of enquiries")
    g.add_legend()

there seems to be an optimal booking rate around 3 nights --> not sure how to act on that

  • the logistics challenge of keeping a long period available without bookings in the middle might benefit from some extra tools?

(3.10) number of guests

In [72]:
with sns.plotting_context('notebook', font_scale=1.5):
    g = sns.FacetGrid(joined, row='contact_channel_first', hue='booked',  height=3.5, aspect=4)
    g.map(sns.distplot, 'm_guests', kde=False, bins=range(1,11,1))
    g.set(xlim=(1, 10), ylim=(0, 2500))
    g.set_axis_labels("Number of guests", "Number of enquiries")
    g.add_legend()
  • doesn't seem to be a particularly strong predictor, nor especially insightful
  • not quite sure how I would act on it either...

(3.11) total reviews impact on booking rates

In [73]:
sns.barplot(y='total_reviews', x='booked', data=joined)
Out[73]:
<matplotlib.axes._subplots.AxesSubplot at 0x11bb2a320>
In [74]:
with sns.plotting_context('notebook', font_scale=1.5):
    g = sns.FacetGrid(joined, row='contact_channel_first', hue='booked',  height=3.5, aspect=4)
    g.map(sns.distplot, 'total_reviews', kde=False)
    g.set(xlim=(0, 100), ylim=(0, 2500))
    g.set_axis_labels("Total reviews to date on listing", "Number of enquiries")
    g.add_legend()
In [76]:
reviews_bins = np.concatenate(([0, 1, 2, 3], np.arange(5,44,5)), axis=None)
book_it['reviews_bins'] = pd.cut(book_it['total_reviews'], bins=reviews_bins, labels=False)
bookit_binned = book_it.groupby('reviews_bins')[['booked']].agg(['count', 'sum'])
bookit_binned.columns = bookit_binned.columns.droplevel(0)
#bookit_binned.reset_index(inplace=True)
bookit_binned['booking_rate']  = bookit_binned['sum'] / bookit_binned['count'] 
bookit_binned_aug = pd.concat([bookit_binned, pd.DataFrame({'reviews_bins':reviews_bins[0:-1]})], axis=1)
bookit_binned_aug.set_index(keys='reviews_bins', inplace=True)
bookit_binned_aug
Out[76]:
count sum booking_rate
reviews_bins
0 777 240 0.308880
1 433 162 0.374134
2 390 165 0.423077
3 561 290 0.516934
5 830 469 0.565060
10 616 397 0.644481
15 494 297 0.601215
20 300 209 0.696667
25 235 147 0.625532
30 263 160 0.608365
35 170 131 0.770588
In [77]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)

g = sns.lineplot(x="reviews_bins", y="booking_rate", data=bookit_binned_aug.reset_index(), ax=ax)
#g.set_axis_labels("response time (s)", "Conversion rate")
g.set(ylim=(0, 1))
Out[77]:
[(0, 1)]
In [78]:
log_reviews_bins = np.log(np.concatenate(([1, 2, 3], np.arange(5,44,5)), axis=None))
book_it['log_reviews_bins'] = pd.cut(book_it['log_reviews'], bins=log_reviews_bins, labels=False)
bookit_binned = book_it.groupby('log_reviews_bins')[['booked']].agg(['count', 'sum'])
bookit_binned.columns = bookit_binned.columns.droplevel(0)
#bookit_binned.reset_index(inplace=True)
bookit_binned['booking_rate']  = bookit_binned['sum'] / bookit_binned['count'] 
bookit_binned_aug = pd.concat([bookit_binned, pd.DataFrame({'log_reviews_bins':log_reviews_bins[0:-1]})], axis=1)
bookit_binned_aug.set_index(keys='log_reviews_bins', inplace=True)
bookit_binned_aug
Out[78]:
count sum booking_rate
log_reviews_bins
0.000000 777 240 0.308880
0.693147 433 162 0.374134
1.098612 691 317 0.458755
1.609438 899 493 0.548387
2.302585 696 438 0.629310
2.708050 495 287 0.579798
2.995732 368 262 0.711957
3.218876 238 148 0.621849
3.401197 270 166 0.614815
3.555348 178 137 0.769663
  • try to linearize
In [79]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)

g = sns.regplot(x="log_reviews_bins", y="booking_rate", data=bookit_binned_aug.reset_index(), ax=ax)
#g.set_axis_labels("response time (s)", "Conversion rate")
g.set(ylim=(0, 1))
Out[79]:
[(0, 1)]

(3.12) number of days between enquiry and checkin

In [80]:
with sns.plotting_context('notebook', font_scale=1.5):
    g = sns.FacetGrid(joined, row='contact_channel_first', hue='booked',  height=3.5, aspect=4)
    g.map(sns.distplot, 'enquiry_to_checkin', kde=False)
    g.set(xlim=(0, 300), ylim=(0, 2500))
    g.set_axis_labels("Days between enquiry and prospective check-in", "Number of enquiries")
    g.add_legend()
In [81]:
log_enquiry2checkin_bins = np.log(np.concatenate(([1, 2, 3, 4, 7, 14], np.arange(21,200,20)), axis=None))
book_it['log_enquiry2checkin_bins'] = pd.cut(book_it['log_enquiry2checkin'], bins=log_enquiry2checkin_bins, labels=False)
bookit_binned = book_it.groupby('log_enquiry2checkin_bins')[['booked']].agg(['count', 'sum'])
bookit_binned.columns = bookit_binned.columns.droplevel(0)
#bookit_binned.reset_index(inplace=True)
bookit_binned['booking_rate']  = bookit_binned['sum'] / bookit_binned['count'] 
bookit_binned_aug = pd.concat([bookit_binned, pd.DataFrame({'log_enquiry2checkin_bins':log_enquiry2checkin_bins[0:-1]})], axis=1)
bookit_binned_aug.set_index(keys='log_enquiry2checkin_bins', inplace=True)
bookit_binned_aug
Out[81]:
count sum booking_rate
log_enquiry2checkin_bins
0.000000 278 164 0.589928
0.693147 164 102 0.621951
1.098612 165 108 0.654545
1.386294 403 248 0.615385
1.945910 707 429 0.606789
2.639057 693 370 0.533911
3.044522 1621 827 0.510179
3.713572 711 417 0.586498
4.110874 568 268 0.471831
4.394449 450 173 0.384444
4.615121 369 138 0.373984
4.795791 314 109 0.347134
4.948760 304 91 0.299342
5.081404 320 107 0.334375
In [82]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
g = sns.regplot(x="log_enquiry2checkin_bins", y="booking_rate", data=bookit_binned_aug.reset_index(), ax=ax)
#g.set_axis_labels("response time (s)", "Conversion rate")
g.set(ylim=(0, 1))
Out[82]:
[(0, 1)]
  • the log transformation hasn't really linearized this - the first few days and bookings a month ahead of time seem to be outliers --> perhaps work on another transformation if time allows
  • initially, make dummy variables for same-day and next-day check-in
  • same-day bookings could also correlate with response time: it is particularly important to be fast for those... --> should I build an interaction variable? (a sketch follows below)
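A minimal sketch of that interaction term (the column name is made up), which could then be added to the regressors in section (4):

# hypothetical interaction term: same-day check-in enquiries x (log) host response delay
joined['sameday_x_logdelay'] = joined['enquiry_sameday_checkin'] * joined['log_delay_firstreply']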

(3.13) booking per host, and booking per guest

with hindsight, I consider that the dataset may just be a random sample of the real data

  • kept exploration shallow
In [84]:
joined.groupby('id_listing_anon')[['id_guest_anon']].count().describe([0.5, 0.75, 0.9, 0.95, 0.99, 0.999])
Out[84]:
id_guest_anon
count 12819.000000
mean 2.182620
std 2.266554
min 1.000000
50% 1.000000
75% 2.000000
90% 4.000000
95% 6.000000
99% 12.000000
99.9% 20.000000
max 31.000000

--> there are probably fixed costs to set up a host (taking pictures etc). After how many bookings does Airbnb recoup its costs? (this will vary with listing price & # nights, if income is a percentage of the booking price)
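This cannot be answered from this dataset (no prices or cost data); purely as an illustration of the break-even arithmetic, with made-up numbers:

# all numbers below are made-up placeholders, purely to illustrate the arithmetic
host_setup_cost = 100.0      # photography, onboarding, support (assumed)
avg_booking_value = 120.0    # average value of one booking (assumed)
take_rate = 0.12             # platform fee as a share of booking value (assumed)

bookings_to_break_even = host_setup_cost / (take_rate * avg_booking_value)
print('break-even after ~%.1f bookings' % bookings_to_break_even)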

In [85]:
joined.groupby('id_host_anon')[['id_listing_anon']].count().describe([0.5, 0.75, 0.9, 0.95, 0.99, 0.999])
Out[85]:
id_listing_anon
count 8959.000000
mean 3.123005
std 6.718118
min 1.000000
50% 1.000000
75% 3.000000
90% 6.000000
95% 9.000000
99% 24.000000
99.9% 103.000000
max 216.000000
  • Some hosts seem to have many listings: are those agencies?
  • Should Airbnb deal with them differently from individual hosts (product, pricing, recommendations, etc.)?
In [86]:
joined.groupby('id_guest_anon')[['id_host_anon']].count().describe([0.5, 0.75, 0.9, 0.95, 0.99, 0.999])
Out[86]:
id_host_anon
count 22566.000000
mean 1.239874
std 0.705666
min 1.000000
50% 1.000000
75% 1.000000
90% 2.000000
95% 2.000000
99% 4.000000
99.9% 7.000000
max 40.000000

(3.14) are unsuccessful bookings impacting guest retention?

  • limited data, and self-selection bias..
In [87]:
joined.groupby('id_guest_anon')[['booked']].agg(['count', 'sum']).describe([0.5, 0.75, 0.9, 0.95, 0.98])
Out[87]:
booked
count sum
count 22566.000000 22566.000000
mean 1.239874 0.514402
std 0.705666 0.565686
min 1.000000 0.000000
50% 1.000000 0.000000
75% 1.000000 1.000000
90% 2.000000 1.000000
95% 2.000000 1.000000
98% 3.000000 2.000000
max 40.000000 6.000000

there is not really enough data to answer this question

  • very few users booked more than once, so the data would likely not be representative of the entire population...
  • next step, on a larger dataset: try to get a better sense of causality, and look at the time sequence of failed bookings and their effect on retention

(4) attribution of selected variables to conversion rates

regression 1

instant_book gets a 100% booking rate, and the messaging ('contact me') channel has a big penalty; we'll treat those separately

booked (binary) ~ intercept + isBrazilian + log(first_reply_delay) + log(reviews) + log(enquiry_to_checkin) + enquiry_sameday_checkin + enquiry_nextday_checkin + first_message_length

  • model with few variables, both actionable levers and major confounding factors
In [89]:
book_it2 = joined[joined['contact_channel_first'] == 'book_it'].copy()  # .copy() to avoid SettingWithCopyWarning
book_it2['intercept'] = 1

X = book_it2[[
        'isbrazilian',
        #'frequent_country_neigh',
        #'frequent_countries', 'frequent_neighborhoods',
            #'guest_user_stage_first', # new users seem to struggle mainly on the contact me channel, not book it
                'log_delay_firstreply', 'log_reviews', 'log_enquiry2checkin',
        'enquiry_sameday_checkin', 'enquiry_nextday_checkin',
        'm_first_message_length_in_characters',
               'intercept'
               ]]
#X = pd.get_dummies(X, columns=[
    #'guest_user_stage_first'
#    'frequent_country_neigh'
    # 'frequent_countries', 'frequent_neighborhoods'
#   ],drop_first=True)

y = book_it2['booked']

logit = sm.Logit(y, X)
logit_fitted = logit.fit()
print(logit_fitted.summary())
print()
number_rows = len(y)
print(logit_fitted.pred_table(threshold=0.5))
print(logit_fitted.pred_table(threshold=0.5)/number_rows)
Optimization terminated successfully.
         Current function value: 0.577636
         Iterations 5
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 booked   No. Observations:                 8382
Model:                          Logit   Df Residuals:                     8374
Method:                           MLE   Df Model:                            7
Date:                Mon, 03 Dec 2018   Pseudo R-squ.:                  0.1653
Time:                        09:58:02   Log-Likelihood:                -4841.7
converged:                       True   LL-Null:                       -5800.4
                                        LLR p-value:                     0.000
========================================================================================================
                                           coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
isbrazilian                             -0.2758      0.053     -5.165      0.000      -0.381      -0.171
log_delay_firstreply                    -0.2500      0.012    -20.662      0.000      -0.274      -0.226
log_reviews                              0.4599      0.018     26.091      0.000       0.425       0.494
log_enquiry2checkin                     -0.2786      0.021    -13.303      0.000      -0.320      -0.238
enquiry_sameday_checkin                 -0.5307      0.153     -3.470      0.001      -0.830      -0.231
enquiry_nextday_checkin                 -0.2586      0.098     -2.647      0.008      -0.450      -0.067
m_first_message_length_in_characters    -0.0003      0.000     -2.033      0.042      -0.001   -1.07e-05
intercept                                2.5070      0.149     16.779      0.000       2.214       2.800
========================================================================================================

[[3282. 1109.]
 [1291. 2700.]]
[[0.39155333 0.13230733]
 [0.15402052 0.32211883]]
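To make those coefficients easier to interpret, one could convert them to odds ratios and average marginal effects (a sketch on the fitted model above):

# odds ratio associated with a one-unit increase of each regressor
# (for the log-transformed variables, one unit means multiplying the raw quantity by e)
print(np.exp(logit_fitted.params))

# average marginal effects: change in booking probability per unit change of each regressor
print(logit_fitted.get_margeff().summary())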