Improving matching performance in a two-sided market

The aim of this notebook is to identify 2-3 actionable levers to increase the number of successful bookings.

As the dataset pertains mostly to matching, the focus of this notebook will be on booking rates. Further analysis with access to more data should consider ratings and Customer Lifetime Value optimization (for guests and hosts).

Approach

  • (0) Data importation

  • (1) Feature engineering

    • Important confounding variables
    • Likely actionable levers
  • (2) Get a first intuition of what features matter

    • Use a random forest classifier with many trees, and look at feature importances
  • (3) Visualization of key features

    • and how to transform them so they provide a linear response
  • (4) Logistic regression to attribute impact of significant variables, controlling for confounding factors

    • Focused on the ‘book it’ medium here, as the recommendation is to steer away from the ‘contact me’ medium

Assumptions

  • Dataset
    • Assumed that the dataset is a random, unbiased sample of the actual market
    • No multi-user analysis was made on that basis. If this is the entire dataset, the high number of hosts with only one booking in the 6-month period may require more attention. Given setup cost considerations, better targeting and active retention would be needed (better matching is only part of it).

(0) Data importation

In [90]:
# data manipulation
import numpy as np
import pandas as pd

# plots
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# machine-assisted exploration
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# statistical analysis and regression model
import statsmodels.discrete.discrete_model as sm
In [7]:
contacts = pd.read_csv('contacts.csv')
contacts.head(2)
Out[7]:
id_guest_anon id_host_anon id_listing_anon ts_interaction_first ts_reply_at_first ts_accepted_at_first ts_booking_at ds_checkin_first ds_checkout_first m_guests m_interactions m_first_message_length_in_characters contact_channel_first guest_user_stage_first
0 da8656a1-51af-4f38-b1c4-94be1f585157 5426897d-960d-4013-9e38-606ae746793c a408a8b2-0d44-4513-a611-3736d0409bb2 2016-04-21 02:55:53 2016-04-21 03:15:00.0 2016-04-21 03:15:00.0 2016-04-21 03:15:00.0 2016-08-02 2016-08-06 4.0 30 31.0 book_it past_booker
1 8590d6f1-8bc9-4e8b-bdfb-de78f69fcf37 f30417c5-6df4-45ac-bfc2-6ad1cce398ab e387c705-0aeb-464c-9375-ece63be6f006 2016-02-16 22:14:01 2016-02-16 23:37:36.0 NaN NaN 2016-08-11 2016-08-22 5.0 2 312.0 contact_me past_booker
In [8]:
contacts.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27887 entries, 0 to 27886
Data columns (total 14 columns):
id_guest_anon                           27887 non-null object
id_host_anon                            27887 non-null object
id_listing_anon                         27887 non-null object
ts_interaction_first                    27887 non-null object
ts_reply_at_first                       25855 non-null object
ts_accepted_at_first                    16415 non-null object
ts_booking_at                           11587 non-null object
ds_checkin_first                        27887 non-null object
ds_checkout_first                       27887 non-null object
m_guests                                27886 non-null float64
m_interactions                          27887 non-null int64
m_first_message_length_in_characters    27887 non-null float64
contact_channel_first                   27887 non-null object
guest_user_stage_first                  27887 non-null object
dtypes: float64(2), int64(1), object(11)
memory usage: 3.0+ MB

get a feel for the context of this dataset

In [3]:
contacts['id_guest_anon'].agg(['nunique', 'count'])
Out[3]:
nunique    22566
count      27887
Name: id_guest_anon, dtype: int64
In [4]:
contacts['id_host_anon'].agg(['nunique', 'count'])
Out[4]:
nunique     8959
count      27887
Name: id_host_anon, dtype: int64
In [5]:
contacts['id_listing_anon'].agg(['nunique', 'count'])
Out[5]:
nunique    12819
count      27887
Name: id_listing_anon, dtype: int64
In [6]:
contacts['ts_interaction_first'].agg(['min', 'max'])
Out[6]:
min    2016-01-01 00:46:47
max    2016-06-30 23:58:14
Name: ts_interaction_first, dtype: object
In [10]:
users = pd.read_csv('users.csv')
users.head(2)
Out[10]:
id_user_anon country words_in_user_profile
0 1d16a001-31a2-494c-a101-1f7b308adc62 FR 0
1 42607e0a-86c0-472e-b633-9e192114e93c AR 0
In [14]:
listings = pd.read_csv('listings.csv')
listings.head(2)
Out[14]:
id_listing_anon room_type listing_neighborhood total_reviews
0 71582793-e5f8-46d7-afdf-7a31d2341c79 Private room -unknown- 0.0
1 a1a3f728-e21f-4432-96aa-361d28e2b319 Entire home/apt Copacabana 0.0
In [13]:
print(listings['listing_neighborhood'].nunique())
listings['listing_neighborhood'].value_counts()[:10]
68
Out[13]:
-unknown-          6221
Copacabana         2531
Ipanema            1041
Barra da Tijuca     593
Leblon              458
Botafogo            345
Santa Teresa        243
Flamengo            171
Lapa                171
Leme                168
Name: listing_neighborhood, dtype: int64

It would be useful to visualize booking volumes and conversion rates across neighborhoods. However, a large share of the listings have no neighborhood affiliation.

map with the 'Bairro' (neighborhoods) http://portalgeo-pcrj.opendata.arcgis.com/datasets/8454eb0454b7424d89c61b67742286a1_15?geometry=-77.293%2C5.937%2C-73.723%2C6.756
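A possible sketch for that map, assuming geopandas is installed and the 'Bairro' layer from the link above has been downloaded locally (the file name and the 'NOME' column are assumptions):

import geopandas as gpd  # assumption: geopandas available

# per-neighborhood enquiry counts and booking rates, from the raw tables loaded above
cl = contacts.merge(listings, on='id_listing_anon', how='left')
cl['booked'] = cl['ts_booking_at'].notnull().astype(int)
neigh_rates = cl.groupby('listing_neighborhood')[['booked']].agg(['count', 'sum'])
neigh_rates.columns = ['enquiries', 'booked']
neigh_rates['booking_rate'] = neigh_rates['booked'] / neigh_rates['enquiries']
neigh_rates = neigh_rates.reset_index()

# choropleth of booking rate by Bairro
bairros = gpd.read_file('bairros_rio.shp')  # hypothetical local export of the portalgeo-pcrj layer
bairros = bairros.merge(neigh_rates, left_on='NOME', right_on='listing_neighborhood', how='left')
bairros.plot(column='booking_rate', cmap='Greens', legend=True, figsize=(12, 8))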

brainstorm how to approach the problem

KPIs: what do people care about?

  • Listing: how full is it? (cost efficiency)
  • Listing & Guest: how many interactions are needed to get a successful transaction (time efficiency)
  • Listing & Guest: lead time to get a booking finalized (time efficiency); the last two are sketched below
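Occupancy cannot be measured from this extract (no listing calendar), but the other two KPIs can be roughly proxied from the contacts table loaded above (a sketch):

# successful bookings only
booked_rows = contacts[contacts['ts_booking_at'].notnull()].copy()

# interactions needed to reach a confirmed booking
print(booked_rows['m_interactions'].describe())

# lead time from first interaction to confirmed booking, in hours
lead_time_h = (pd.to_datetime(booked_rows['ts_booking_at'])
               - pd.to_datetime(booked_rows['ts_interaction_first'])).dt.total_seconds() / 3600
print(lead_time_h.describe())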

What can be done differently

  • encourage auto booking
  • tweak recommendation engine to maximise conversion rates, or rather the number of overall successful bookings
    • e.g. match nationality to entire apartments in a certain neighborhood
  • direct marketing campaigns to
    • some nationalities or neighborhoods
    • at certain times of the year, if there is a supply vs demand mismatch
  • suggest faster response time or manage expectations
  • set up reminders for bookings? or follow-up emails with other suggestions?
  • incentives to develop the profile and write a longer first message
  • etc

Potential ways to identify inefficiencies

  • conversion funnel of users and listings
  • geographical or temporal mismatch
  • regression on performance variables e.g. faster response

Potential segments

  • area of the city
  • nationality of the person who booked
  • message length and number
  • first contact channel
  • number of guests
  • number of bookings made (experienced guests and hosts)

Machine-assisted exploration

  • the dataset looks at only part of the matching process: it is partial, and captures what happens for people who have already expressed an interest in each other
  • in that situation, what we can learn is:
    • what interaction modalities are associated with a conversion --> that'd be useful to know so Airbnb can encourage the most effective behaviors
    • what context is more auspicious to a match, so Airbnb can:
      • invest more in that context (e.g. a part of the city), advertise to certain types of customers
      • encourage users to shift to that context / segment

However,

  • There are a lot of potential insightful variables
  • Let's build a random forest classifier to predict booking, and see what features matter

pre-processing

In [15]:
# --------------- join datasets ---------------
joined = contacts.merge(users, left_on = 'id_guest_anon', right_on = 'id_user_anon', how='left')
joined = joined.merge(listings, on = 'id_listing_anon', how='left')
# checked the match: 100%

# --------------- data type conversion ---------------
joined['ts_interaction_first'] = pd.to_datetime(joined['ts_interaction_first'])
joined['ts_reply_at_first'] = pd.to_datetime(joined['ts_reply_at_first'])
joined['ts_accepted_at_first'] = pd.to_datetime(joined['ts_accepted_at_first'])
joined['ts_booking_at'] = pd.to_datetime(joined['ts_booking_at'])

joined['ds_checkin_first'] = pd.to_datetime(joined['ds_checkin_first'])
joined['ds_checkout_first'] = pd.to_datetime(joined['ds_checkout_first'])

feature engineering (filter later)

In [16]:
# --------------- feature engineering ---------------
# boolean variables for the conversion funnel: reply, accepted, booking
joined['contact'] = 1 #non predictive feature that's just convenient for funnel analysis
joined['replied'] = joined['ts_reply_at_first'].apply(lambda x: 0 if x is pd.NaT else 1)
joined['accepted'] = joined['ts_accepted_at_first'].apply(lambda x: 0 if x is pd.NaT else 1)
joined['booked'] = joined['ts_booking_at'].apply(lambda x: 0 if x is pd.NaT else 1)

# booking speed (seconds)
# caveat: .seconds keeps only the seconds component of the timedelta (whole days are dropped);
# .total_seconds() would be more robust if replies ever take longer than a day
joined['delay_firstreply'] = joined.apply(lambda x: (x['ts_reply_at_first'] - x['ts_interaction_first']).seconds, axis=1)
joined['delay_firstaccepted'] = joined.apply(lambda x: (x['ts_accepted_at_first'] - x['ts_interaction_first']).seconds, axis=1)
joined['delay_booked'] = joined.apply(lambda x: (x['ts_booking_at'] - x['ts_interaction_first']).seconds, axis=1)

# deal with nA
joined['delay_firstreply'] = joined['delay_firstreply'].fillna(value=3600*40)

joined['log_delay_firstreply'] = np.log(joined['delay_firstreply']+1)
joined['response_within_15min'] = joined['delay_firstreply'].apply(lambda x: 1 if x<=15*60 else 0)
joined['response_between_15_and_45min'] = joined['delay_firstreply'].apply(lambda x: 1 if ((x>15*60)&(x<=45*60)) else 0)

# ------ deal with NaN for response time and Cie
# it seems that NaN corresponds to an expired response time

# regression on conversion rate vs response time suggests that a ~40h response time would have 0% booking rate
# so for our attribution model, it seems that attributing 40h to the response time NaN could do the trick
#joined['delay_firstreply'] = joined['delay_firstreply'].fillna(value=3600*40)

# previous approach
#no_answer_infinitedelay = 6*30*24*3600 #arbitrary number for now: potentially improve with
#joined[['delay_firstreply', 'delay_firstaccepted', 'delay_booked']] = joined[['delay_firstreply', 'delay_firstaccepted', 'delay_booked']].fillna(value=no_answer_infinitedelay, axis=1)

# only 35 people have an unknown 'guest_user_stage_first' status
# for statistical significance we'll get rid of that category, and initially assign it to new users
joined['guest_user_stage_first'] = joined['guest_user_stage_first'].apply(lambda x: 'new' if x=='-unknown-' else x)


# --> potentially create another variable that's the time to answer normalized by enquiry_to_checkin

# how far in advance is enquiry made (days)
joined['enquiry_to_checkin'] = joined.apply(lambda x: (x['ds_checkin_first'] - x['ts_interaction_first']).days, axis=1)
joined['enquiry_sameday_checkin'] = joined['enquiry_to_checkin'].apply(lambda x: 1 if x==0 else 0)
joined['enquiry_nextday_checkin'] = joined['enquiry_to_checkin'].apply(lambda x: 1 if x<=1 else 0)
joined['enquiry_sameweek_checkin'] = joined['enquiry_to_checkin'].apply(lambda x: 1 if x<=7 else 0)

# how many nights booked
joined['nights_booked'] = joined.apply(lambda x: (x['ds_checkout_first'] - x['ds_checkin_first']).days, axis=1)

# random forest suggests that being Brazilian vs not makes a difference
# we'll create a specific feature for this
joined['isbrazilian'] = joined['country'].apply(lambda x: 1 if x=='BR' else 0)

# keep only most frequent nationalities and neighborhoods
country_freq = joined[['id_guest_anon', 'country']].groupby('country').count()
frequent_country = set(country_freq[country_freq['id_guest_anon']>50].index)
joined['frequent_countries'] = joined['country'].apply(lambda x: x if (x in frequent_country) else 'other_country')

neighborhood_freq = joined[['id_guest_anon', 'listing_neighborhood']].groupby('listing_neighborhood').count()
frequent_neighborhood = set(neighborhood_freq[neighborhood_freq['id_guest_anon']>50].index)
joined['frequent_neighborhoods'] = joined['listing_neighborhood'].apply(lambda x: x if (x in frequent_neighborhood) else 'other_neighborhood')

not_unknown_neigh = joined[joined['listing_neighborhood']!='-unknown-']
country_neighborhood_freq = not_unknown_neigh[['id_listing_anon', 'listing_neighborhood', 'country']].groupby(['listing_neighborhood', 'country']).count()
frequent_country_neigh = set(country_neighborhood_freq[country_neighborhood_freq['id_listing_anon']>50].index)
joined['frequent_country_neigh'] = joined[['listing_neighborhood', 'country']].apply(lambda x: 
     x['listing_neighborhood']+'_'+x['country'] if ((x['listing_neighborhood'], x['country']) in frequent_country_neigh) 
                                             else 'other_country_neighborhood', axis=1)

# identify the very top super hosts

# -------- transform variable so response is more linear ----------
# a few values were negatives... interpreted as error
median_reviews = joined['total_reviews'].median()
joined['total_reviews'] = joined['total_reviews'].apply(lambda x: x if x>=0 else median_reviews) 
joined['log_reviews'] = np.log(joined['total_reviews']+1)

# a few negative values.. interpreted as error
median_enquiry2checkin = joined['enquiry_to_checkin'].median()
joined['enquiry_to_checkin'] = joined['enquiry_to_checkin'].apply(lambda x: x if x>=0 else median_enquiry2checkin)
joined['log_enquiry2checkin'] = np.log(joined['enquiry_to_checkin']+1)


# hour of the day, day of the week and month of the year, for both enquiry and checkin
joined['enquiry_h'] = joined['ts_interaction_first'].apply(lambda x: x.hour)
joined['enquiry_dow'] = joined['ts_interaction_first'].apply(lambda x: x.weekday())
joined['enquiry_month'] = joined['ts_interaction_first'].apply(lambda x: x.month)

joined['enquiry_night'] = joined['enquiry_h'].apply(lambda x: 0 if ((x>=7)&(x<=23)) else 1)
joined['enquiry_middleday'] = joined['enquiry_h'].apply(lambda x: 1 if ((x>=9)&(x<=18)) else 0)
joined['enquiry_weekend'] = joined['enquiry_dow'].apply(lambda x: 1 if x>=4 else 0)
joined['enquiry_likely_unavailabletime'] = joined.apply(lambda x: 
        1 if ((x['enquiry_night']+(1-x['enquiry_weekend'])*x['enquiry_middleday'])>=1) else 0, axis=1)

# all those indicators are useful for host behaviors, as they are in Rio's local time
# if one wanted to infer information about guests from timing, time zone correction may need to be made (from nationality isn't great --> from ip location)

joined['checkin_dow'] = joined['ds_checkin_first'].apply(lambda x: x.weekday())
joined['checkin_month'] = joined['ds_checkin_first'].apply(lambda x: x.month)

joined['checkin_weekend'] = joined['checkin_dow'].apply(lambda x: 1 if x>=4 else 0)

# complete or substitute these seasonal numerical variables with
# --> Fourier transform coefficients, or other methods (ARIMA, prophet)
# --> perhaps some dummy variables where it could make sense (week vs weekend, working hours , sleeping hours etc)

joined.head()
Out[16]:
id_guest_anon id_host_anon id_listing_anon ts_interaction_first ts_reply_at_first ts_accepted_at_first ts_booking_at ds_checkin_first ds_checkout_first m_guests ... enquiry_h enquiry_dow enquiry_month enquiry_night enquiry_middleday enquiry_weekend enquiry_likely_unavailabletime checkin_dow checkin_month checkin_weekend
0 da8656a1-51af-4f38-b1c4-94be1f585157 5426897d-960d-4013-9e38-606ae746793c a408a8b2-0d44-4513-a611-3736d0409bb2 2016-04-21 02:55:53 2016-04-21 03:15:00 2016-04-21 03:15:00 2016-04-21 03:15:00 2016-08-02 2016-08-06 4.0 ... 2 3 4 1 0 0 1 1 8 0
1 8590d6f1-8bc9-4e8b-bdfb-de78f69fcf37 f30417c5-6df4-45ac-bfc2-6ad1cce398ab e387c705-0aeb-464c-9375-ece63be6f006 2016-02-16 22:14:01 2016-02-16 23:37:36 NaT NaT 2016-08-11 2016-08-22 5.0 ... 22 1 2 0 0 0 0 3 8 0
2 ebcd83ba-bda1-47eb-9680-2dd04ccf3988 13cbf50a-3272-45d4-9866-a06b6ea1b99a d1eb1960-938f-4305-a353-51e224414dd4 2016-01-27 23:33:38 2016-01-28 02:12:47 NaT NaT 2016-03-14 2016-03-23 1.0 ... 23 2 1 0 0 0 0 0 3 0
3 b0af8848-fe2a-4ef1-991e-26ab3066feb3 01614601-d5a4-4776-ab9b-c10d3b865bf0 855f6779-346c-45fc-a64b-ea133250ca7a 2016-05-05 14:42:52 2016-05-05 15:17:40 2016-05-05 15:17:40 NaT 2016-05-27 2016-05-29 2.0 ... 14 3 5 0 1 0 1 4 5 1
4 5ddbbcc3-ac1a-4d8a-b6f8-0f675b76d1b2 f2fed6f3-4c5c-453d-9e64-37c62b8bd06d f2928a59-c5e7-42b2-9c37-240a40f38dc0 2016-06-23 03:09:25 2016-06-23 03:09:26 2016-06-23 03:09:33 2016-06-23 03:09:33 2016-08-19 2016-08-21 3.0 ... 3 3 6 1 0 0 1 4 8 1

5 rows × 51 columns
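One way to act on the seasonal-variable comment at the end of the cell above is a cyclical sin/cos encoding, so that hour 23 and hour 0 (or December and January) end up close together. A minimal sketch on the columns created above:

# cyclical encoding of hour of day and month of year
joined['enquiry_h_sin'] = np.sin(2 * np.pi * joined['enquiry_h'] / 24)
joined['enquiry_h_cos'] = np.cos(2 * np.pi * joined['enquiry_h'] / 24)
joined['checkin_month_sin'] = np.sin(2 * np.pi * (joined['checkin_month'] - 1) / 12)
joined['checkin_month_cos'] = np.cos(2 * np.pi * (joined['checkin_month'] - 1) / 12)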

(2) Get a first intuition of what features matter

prepare for the random forest, and filter a bit

In [22]:
#joined.columns
In [20]:
# ------------- prepare for exploratory model (random forest) -------------

# drop variables that aren't generalizable
df_model_1 = joined.drop(columns=['id_guest_anon', 'id_host_anon', 'id_listing_anon', 'ts_interaction_first', 'ts_reply_at_first',
                          'ts_accepted_at_first', 'ts_booking_at', 'ds_checkin_first', 'ds_checkout_first',
                          'id_user_anon', 'contact'], axis=1)

# variables that are partial success factors in the conversion funnel
# we'll keep them in the model initially, just to check that the funnel indeed progresses in stages
# replied, accepted, (booked), as well as time to do those things --> they lead to 100% prediction indeed

df_model_1.drop(columns=['replied', 'accepted', 'delay_firstreply', 'delay_firstaccepted', 
                         'delay_booked', 'm_interactions'], inplace=True)
# caution also about log_delay_firstreply
# keep delay_firstreply only if interested in attributing the impact of response time

# drop variables that do not seem to have a high predictive power, or are redundant
# (method: looked at feature importances in a random forest + business judgement)
df_model_1.drop(columns=['country', 'listing_neighborhood', 'enquiry_h', 'enquiry_dow',
                        'enquiry_to_checkin', 'total_reviews',
                        'frequent_countries', 'frequent_neighborhoods'], inplace=True)


# convert categorical variables into dummy variables
df_model_1 = pd.get_dummies(df_model_1, columns=['contact_channel_first', 'guest_user_stage_first', 'room_type'
                                                 , 'frequent_country_neigh'
                                                ],drop_first=True)

# because we're using a random forest initially, no need to convert seasonal numerical variables into dummies


# ----- caution: a lot of features: high risk of overfit --------
# look indicatively at: rows / features, and then learning curves etc

# ------ drop NaN rows for now; fine for a first pass, but not a good approach to keep (it throws data away)
df_model_1.dropna(inplace=True)
In [21]:
y = df_model_1['booked']
X = df_model_1.drop(columns='booked')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = RandomForestClassifier(n_jobs=-1, random_state=42, n_estimators=1000)
model.fit(X_train, y_train)


from sklearn.model_selection import cross_val_score, StratifiedKFold

scores = cross_val_score(model, X_train, y_train, cv=5)

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1_macro')
print("F1 macro: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.89 (+/- 0.01)
F1 macro: 0.88 (+/- 0.01)

hyperparameter optimization and learning curves if time allows
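Neither is done here; a minimal sketch of what they could look like, reusing X_train / y_train from the cell above (the grid values are arbitrary assumptions, not tuned choices):

from sklearn.model_selection import GridSearchCV, learning_curve

param_grid = {'n_estimators': [200, 500],
              'max_depth': [None, 10, 20],
              'min_samples_leaf': [1, 5, 20]}
search = GridSearchCV(RandomForestClassifier(n_jobs=-1, random_state=42),
                      param_grid, cv=5, scoring='f1_macro')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

# learning curve to eyeball over/underfitting
train_sizes, train_scores, val_scores = learning_curve(
    search.best_estimator_, X_train, y_train, cv=5, scoring='f1_macro',
    train_sizes=np.linspace(0.1, 1.0, 5))
plt.plot(train_sizes, train_scores.mean(axis=1), label='train')
plt.plot(train_sizes, val_scores.mean(axis=1), label='cross-validation')
plt.xlabel('training set size')
plt.ylabel('f1_macro')
plt.legend()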

get a feel for what features matter and how the problem works

In [23]:
importances = model.feature_importances_
std = np.std([tree.feature_importances_ for tree in model.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking and contribution
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d %s (%f)" % (f + 1, indices[f],X.columns.tolist()[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
#plt.xlim([-1, X.shape[1]])
plt.show()
Feature ranking:
1. feature 21 contact_channel_first_contact_me (0.181782)
2. feature 22 contact_channel_first_instant_book (0.162868)
3. feature 3 log_delay_firstreply (0.151128)
4. feature 11 log_reviews (0.071082)
5. feature 4 response_within_15min (0.054449)
6. feature 1 m_first_message_length_in_characters (0.052770)
7. feature 12 log_enquiry2checkin (0.047524)
8. feature 9 nights_booked (0.034063)
9. feature 19 checkin_month (0.029637)
10. feature 13 enquiry_month (0.022598)
11. feature 0 m_guests (0.021247)
12. feature 18 checkin_dow (0.020823)
13. feature 2 words_in_user_profile (0.019977)
14. feature 10 isbrazilian (0.015266)
15. feature 23 guest_user_stage_first_past_booker (0.011331)
16. feature 15 enquiry_middleday (0.007610)
17. feature 16 enquiry_weekend (0.007589)
18. feature 17 enquiry_likely_unavailabletime (0.007193)
19. feature 24 room_type_Private room (0.007091)
20. feature 72 frequent_country_neigh_other_country_neighborhood (0.006972)
21. feature 20 checkin_weekend (0.006303)
22. feature 5 response_between_15_and_45min (0.005906)
23. feature 8 enquiry_sameweek_checkin (0.005892)
24. feature 14 enquiry_night (0.005867)
25. feature 35 frequent_country_neigh_Copacabana_BR (0.003727)
26. feature 7 enquiry_nextday_checkin (0.003610)
27. feature 52 frequent_country_neigh_Ipanema_BR (0.002325)
28. feature 25 room_type_Shared room (0.002180)
29. feature 6 enquiry_sameday_checkin (0.002135)
30. feature 46 frequent_country_neigh_Copacabana_US (0.002020)
31. feature 63 frequent_country_neigh_Leblon_BR (0.001740)
32. feature 26 frequent_country_neigh_Barra da Tijuca_BR (0.001604)
33. feature 33 frequent_country_neigh_Copacabana_AR (0.001568)
34. feature 58 frequent_country_neigh_Ipanema_US (0.001497)
35. feature 28 frequent_country_neigh_Botafogo_BR (0.001295)
36. feature 43 frequent_country_neigh_Copacabana_GB (0.001164)
37. feature 48 frequent_country_neigh_Flamengo_BR (0.001080)
38. feature 42 frequent_country_neigh_Copacabana_FR (0.001026)
39. feature 57 frequent_country_neigh_Ipanema_GB (0.000918)
40. feature 64 frequent_country_neigh_Leblon_US (0.000854)
41. feature 60 frequent_country_neigh_Lapa_BR (0.000845)
42. feature 50 frequent_country_neigh_Ipanema_AR (0.000789)
43. feature 68 frequent_country_neigh_Santa Teresa_BR (0.000789)
44. feature 34 frequent_country_neigh_Copacabana_AU (0.000772)
45. feature 67 frequent_country_neigh_Recreio dos Bandeirantes_BR (0.000696)
46. feature 40 frequent_country_neigh_Copacabana_DE (0.000653)
47. feature 70 frequent_country_neigh_Santa Teresa_US (0.000605)
48. feature 56 frequent_country_neigh_Ipanema_FR (0.000593)
49. feature 36 frequent_country_neigh_Copacabana_CA (0.000580)
50. feature 66 frequent_country_neigh_Leme_US (0.000515)
51. feature 38 frequent_country_neigh_Copacabana_CL (0.000513)
52. feature 47 frequent_country_neigh_Copacabana_UY (0.000487)
53. feature 71 frequent_country_neigh_Tijuca_BR (0.000458)
54. feature 61 frequent_country_neigh_Laranjeiras_BR (0.000407)
55. feature 30 frequent_country_neigh_Botafogo_US (0.000385)
56. feature 65 frequent_country_neigh_Leme_BR (0.000379)
57. feature 29 frequent_country_neigh_Botafogo_FR (0.000368)
58. feature 55 frequent_country_neigh_Ipanema_DE (0.000366)
59. feature 37 frequent_country_neigh_Copacabana_CH (0.000343)
60. feature 69 frequent_country_neigh_Santa Teresa_FR (0.000340)
61. feature 41 frequent_country_neigh_Copacabana_ES (0.000336)
62. feature 45 frequent_country_neigh_Copacabana_NL (0.000329)
63. feature 62 frequent_country_neigh_Leblon_AR (0.000322)
64. feature 49 frequent_country_neigh_Glória_BR (0.000312)
65. feature 51 frequent_country_neigh_Ipanema_AU (0.000287)
66. feature 27 frequent_country_neigh_Barra da Tijuca_US (0.000279)
67. feature 54 frequent_country_neigh_Ipanema_CL (0.000268)
68. feature 59 frequent_country_neigh_Lagoa_BR (0.000259)
69. feature 31 frequent_country_neigh_Catete_BR (0.000242)
70. feature 39 frequent_country_neigh_Copacabana_CO (0.000239)
71. feature 44 frequent_country_neigh_Copacabana_IT (0.000197)
72. feature 32 frequent_country_neigh_Centro_BR (0.000183)
73. feature 53 frequent_country_neigh_Ipanema_CA (0.000155)
In [24]:
y = df_model_1['booked']
#X = df_model_1[['contact_channel_first_instant_book', 'm_interactions', 'contact_channel_first_contact_me', 'total_reviews', 'm_first_message_length_in_characters']]
X = df_model_1.drop(columns='booked')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

scores = cross_val_score(model, X_train, y_train, cv=5)

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1_macro')
print("F1 macro: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


import graphviz
from sklearn import tree

# inline rendering, suitable for small trees
dot_data = tree.export_graphviz(model, out_file=None, 
                         feature_names=X.columns.tolist(),
                                class_names=["No booking","Booked"],
                         filled=True, rounded=False, precision=2, label='root', impurity=False,
                         special_characters=True, max_depth=4) 
graph = graphviz.Source(dot_data)
graph
#graph.format = 'png'
#graph.render("tree")
Accuracy: 0.84 (+/- 0.01)
F1 macro: 0.83 (+/- 0.01)
Out[24]:
[Rendered decision tree (first 4 levels): the root splits on contact_channel_first_instant_book (instant bookings are all booked), then on contact_channel_first_contact_me and log_reviews, then log_delay_firstreply, and deeper down on log_enquiry2checkin, checkin_month and m_first_message_length_in_characters.]

(3) visualization of key features, and how to linearize them

(3.1) funnel analysis by contact medium

In [25]:
funnel_contactchannel = joined.groupby('contact_channel_first')[['contact', 'replied', 'accepted', 'booked']].sum()
funnel_contactchannel['replied%'] = funnel_contactchannel['replied'] / funnel_contactchannel['contact']
funnel_contactchannel['accepted%'] = funnel_contactchannel['accepted'] / funnel_contactchannel['contact']
funnel_contactchannel['booked%'] = funnel_contactchannel['booked'] / funnel_contactchannel['contact']
funnel_contactchannel
Out[25]:
contact replied accepted booked replied% accepted% booked%
contact_channel_first
book_it 8382 7517 4249 3991 0.896803 0.506920 0.476139
contact_me 12891 11712 5499 911 0.908541 0.426577 0.070669
instant_book 6706 6706 6706 6706 1.000000 1.000000 1.000000
In [26]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
funnel_contactchannel[['contact','replied', 'accepted', 'booked']].plot(kind='bar', ax=ax)
plt.title('Matching funnel by contact channel (Jan-Jun 2016)')
plt.tight_layout()
plt.xlabel('Initial contact channel')
plt.ylabel('Enquiries between Jan & Jun 2016')
Out[26]:
Text(83.5, 0.5, 'Enquiries between Jan & Jun 2016')
In [27]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
funnel_contactchannel[['replied%', 'accepted%', 'booked%']].plot(kind='bar', ax=ax)
plt.title('conversion funnel by contact channel')
Out[27]:
Text(0.5, 1.0, 'conversion funnel by contact channel')

(3.2) impact of prompt responses

In [28]:
print('instant book')
print(joined[joined['contact_channel_first'] == 'instant_book'][['delay_firstreply', 'delay_firstaccepted', 'delay_booked']].agg(['mean', 'median']))
print()
print('contact_me or book_it')
print(joined[joined['contact_channel_first'] != 'instant_book'][['delay_firstreply', 'delay_firstaccepted', 'delay_booked']].agg(['mean', 'median']))
instant book
        delay_firstreply  delay_firstaccepted  delay_booked
mean           885.39159           970.999105    974.526693
median           0.00000             0.000000      0.000000

contact_me or book_it
        delay_firstreply  delay_firstaccepted  delay_booked
mean        32522.763644         23081.724969  24021.865157
median      12203.000000         10013.000000  11230.000000
In [41]:
# .copy() so later column assignments don't trigger SettingWithCopyWarning
not_instantbooked = joined[joined['contact_channel_first'] != 'instant_book'].copy()
book_it = joined[joined['contact_channel_first'] == 'book_it'].copy()
In [32]:
with sns.plotting_context('notebook', font_scale=1.5):
    fig, ax = plt.subplots()
    fig.set_size_inches(16, 7)
    bins = np.arange(0, 48, 0.5)
    g = sns.distplot(book_it[book_it['booked']==1]['delay_firstreply']/3600, label='booked', ax=ax, kde=False, bins=bins)
    g = sns.distplot(book_it[book_it['booked']==0]['delay_firstreply']/3600, label='not booked', ax=ax, kde=False, bins=bins)
    g.set_xlim(left=0, right=23)
    g.legend()
    plt.title("Host response time delay, and impact on booking rates (Jan-Jun 2016, 'book it' medium)")
    plt.xlabel('Delay of first reply (h)')
    plt.ylabel('Number enquiries')
In [34]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
g = sns.distplot(np.log(book_it[book_it['booked']==1]['delay_firstreply']+1), label='booked', ax=ax, kde=False)
g = sns.distplot(np.log(book_it[book_it['booked']==0]['delay_firstreply']+1), label='not booked', ax=ax, kde=False)
g.set_xlim(left=0, right=12)
g.legend()
plt.xlabel('Delay of first reply (log of response time in seconds)')
plt.ylabel('Number enquiries')
Out[34]:
Text(0, 0.5, 'Number enquiries')
In [35]:
firstreply_h_bin = np.concatenate(([0, 0.25, 0.5], np.arange(1,24,1)), axis=None)
book_it['firstreply_bins'] = pd.cut(book_it['delay_firstreply']/3600, bins=firstreply_h_bin, labels=False)
bookit_binned = book_it.groupby('firstreply_bins')[['booked']].agg(['count', 'sum'])
bookit_binned.columns = bookit_binned.columns.droplevel(0)
#bookit_binned.reset_index(inplace=True)
bookit_binned['booking_rate']  = bookit_binned['sum'] / bookit_binned['count'] 
bookit_binned_aug = pd.concat([bookit_binned, pd.DataFrame({'firstreply_h_bin':firstreply_h_bin[0:-1]})], axis=1)
bookit_binned_aug.set_index(keys='firstreply_h_bin', inplace=True)
bookit_binned_aug
Out[35]:
count sum booking_rate
firstreply_h_bin
0.00 1446 947 0.654910
0.25 665 441 0.663158
0.50 760 456 0.600000
1.00 808 422 0.522277
2.00 475 241 0.507368
3.00 351 177 0.504274
4.00 310 153 0.493548
5.00 213 97 0.455399
6.00 238 114 0.478992
7.00 192 93 0.484375
8.00 209 96 0.459330
9.00 207 96 0.463768
10.00 138 63 0.456522
11.00 143 58 0.405594
12.00 183 65 0.355191
13.00 143 60 0.419580
14.00 111 43 0.387387
15.00 95 33 0.347368
16.00 88 29 0.329545
17.00 93 26 0.279570
18.00 90 27 0.300000
19.00 92 28 0.304348
20.00 81 31 0.382716
21.00 70 16 0.228571
22.00 66 15 0.227273
In [36]:
with sns.plotting_context('notebook', font_scale=1.5):
    fig, ax = plt.subplots()
    fig.set_size_inches(14, 7)

    g = sns.regplot(x="firstreply_h_bin", y="booking_rate", data=bookit_binned_aug.reset_index(), ax=ax)
    #g.set_axis_labels("Response time (h)", "Average booking rate")
    g.set(ylim=(0, 0.8))
    plt.xlabel('First response delay (h)')
    plt.title("Impact of host response delay on booking rate (Jan-Jun 2016, 'book it' medium)")
  • initially, a big loss in booking rate within the first hour (~5% within the first 45 min, ~13% within the first hour)
  • the next drop seems to start after 2h (from the log curves)
  • roughly 0.3 / 20 = 1.5% booking rate loss per hour in the long run (these figures are backed out from the binned table in the sketch below)
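A quick check of those figures against the binned table computed above (bookit_binned_aug, indexed by the left edge of each response-time bin):

rates = bookit_binned_aug['booking_rate']
print('fastest bin vs 0.5-1h bin : %.1f points drop' % (100 * (rates.loc[0.00] - rates.loc[0.50])))
print('fastest bin vs 1-2h bin   : %.1f points drop' % (100 * (rates.loc[0.00] - rates.loc[1.00])))
# rough long-run slope between the 1-2h bin and the 22-23h bin
print('long-run loss per hour    : %.1f points' % (100 * (rates.loc[1.00] - rates.loc[22.00]) / 21))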
In [37]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
g = sns.regplot(x="firstreply_h_bin", y="booking_rate", data=bookit_binned_aug.reset_index(), ax=ax)
g.set(xlim=(0, 48), ylim=(0, 1))
Out[37]:
[(0, 1), (0, 48)]

the linear regression seems to cross y = 0 at roughly 40 h

--> we'll use that value to fill the NaN response times when attributing the impact of response time on booking rate
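The ~40 h figure can be sanity-checked with a quick straight-line fit on the same binned data (a sketch):

# fit booking_rate ~ slope * hours + intercept on the binned points and solve for the x-intercept
tmp = bookit_binned_aug.reset_index()
slope, intercept = np.polyfit(tmp['firstreply_h_bin'], tmp['booking_rate'], 1)
print('booking rate extrapolates to 0 at roughly %.0f hours' % (-intercept / slope))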

transform so response is more linear

In [38]:
firstreply_log_bin = range(0,13)
book_it['firstreply_bins'] = pd.cut(np.log(book_it['delay_firstreply']+1), bins=firstreply_log_bin, labels=False)
bookit_binned = book_it.groupby('firstreply_bins')[['booked']].agg(['count', 'sum'])
bookit_binned.columns = bookit_binned.columns.droplevel(0)
#bookit_binned.reset_index(inplace=True)
bookit_binned['booking_rate']  = bookit_binned['sum'] / bookit_binned['count'] 

firstreply_bin_h = (np.exp(firstreply_log_bin)/3600)[0:-1]
bookit_binned_aug = pd.concat([pd.DataFrame({'firstreply_bin_h':firstreply_bin_h}),bookit_binned], axis=1)
bookit_binned_aug.set_index('firstreply_bin_h', inplace=True)
bookit_binned_aug
Out[38]:
count sum booking_rate
firstreply_bin_h
0.000278 8 5 0.625000
0.000755 24 16 0.666667
0.002053 51 37 0.725490
0.005579 83 61 0.734940
0.015166 244 158 0.647541
0.041226 448 292 0.651786
0.112064 765 504 0.658824
0.304620 1035 651 0.628986
0.828044 1160 620 0.534483
2.250857 1239 605 0.488297
6.118463 1686 725 0.430012
16.631706 1635 314 0.192049
In [39]:
fig, ax = plt.subplots()
ax.set(xscale="log")
ax.set(xlim=(30/3600, 10))
fig.set_size_inches(14, 6)
ax.plot(bookit_binned_aug[['booking_rate']])
plt.xlabel('enquiry response time (h)')
plt.title('Response time influence on booking rates')
plt.show()

(3.3) impact of prompt responses on new guests vs experienced guests

In [43]:
with sns.plotting_context('notebook', font_scale=1.5):
    g = sns.FacetGrid(not_instantbooked[not_instantbooked['guest_user_stage_first']!='-unknown-']
      , row='guest_user_stage_first', col='contact_channel_first', hue='booked',  height=5, aspect=2)
    g.map(sns.distplot, 'log_delay_firstreply', kde=False)
    #g.set(xlim=(0, 300), ylim=(0, 2500))
    g.set_axis_labels("response time (s)", "Number of enquiries")
    g.add_legend()

new guests seem to have booking rates similar to returning guests on the 'book_it' channel; however, they perform markedly worse than past bookers on the 'contact_me' channel.

(3.4) room type

In [44]:
joined['room_type'].value_counts()
Out[44]:
Entire home/apt    21056
Private room        6331
Shared room          592
Name: room_type, dtype: int64
In [45]:
funnel_roomtype = joined.groupby('room_type')[['contact', 'replied', 'accepted', 'booked']].sum()
funnel_roomtype['replied%'] = funnel_roomtype['replied'] / funnel_roomtype['contact']
funnel_roomtype['accepted%'] = funnel_roomtype['accepted'] / funnel_roomtype['contact']
funnel_roomtype['booked%'] = funnel_roomtype['booked'] / funnel_roomtype['contact']
funnel_roomtype
Out[45]:
contact replied accepted booked replied% accepted% booked%
room_type
Entire home/apt 21056 19488 12135 8434 0.925532 0.576320 0.400551
Private room 6331 5909 3994 2981 0.933344 0.630864 0.470858
Shared room 592 538 325 193 0.908784 0.548986 0.326014
In [46]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
funnel_roomtype[['contact', 'replied', 'accepted', 'booked']].plot(kind='bar', ax=ax)
plt.title('matching funnel by room type')
plt.tight_layout()
In [47]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
funnel_roomtype[['replied%', 'accepted%', 'booked%']].plot(kind='bar', ax=ax)
plt.title('conversion funnel by room type')
plt.ylim(0,1)
Out[47]:
(0, 1)

booking rates don't differ strongly by room type, and room types probably cater to different needs and availabilities, so this is probably not a choice that needs to be nudged

(3.5) number of interactions (contains posterior information)

In [48]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
sns.distplot(joined[joined['booked']==1]['m_interactions'], ax=ax, bins=range(0,22), kde=False, label='booked')
sns.distplot(joined[joined['booked']==0]['m_interactions'], ax=ax, bins=range(0,22), kde=False, label='not booked')
plt.legend()
ax.set_xlim(left=0, right=20)
Out[48]:
(0, 20)

m_interactions is not really a predictive variable, in the sense that it probably counts all interactions related to the booking, not just those prior to the booking

(3.6) importance of initial message

In [54]:
# exclude auto booking
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
sns.distplot(not_instantbooked[not_instantbooked['booked']==1]['m_first_message_length_in_characters'], ax=ax, bins=100, kde=False, label='booked')
sns.distplot(not_instantbooked[not_instantbooked['booked']==0]['m_first_message_length_in_characters'], ax=ax, bins=100, kde=False, label='not booked')
plt.legend()
plt.title('book it or contact me media')
ax.set_xlim(left=0, right=500)
Out[54]:
(0, 500)
In [55]:
# 'book it' medium only
book_it = joined[joined['contact_channel_first']=='book_it'].copy()


# exclude auto booking
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
sns.distplot(book_it[book_it['booked']==1]['m_first_message_length_in_characters'], ax=ax, bins=100, kde=False, label='booked')
sns.distplot(book_it[book_it['booked']==0]['m_first_message_length_in_characters'], ax=ax, bins=100, kde=False, label='not booked')
plt.legend()
plt.title('book it medium')
ax.set_xlim(left=0, right=500)
Out[55]:
(0, 500)
  • for the book it medium, the quantitative importance of message length seems limited
  • message quality may still matter though

(3.7) for guest or host acquisition, do some segments convert better?

In [56]:
country_conv = joined.groupby('country')[['contact', 'booked']].sum()
country_conv['booking_rate'] = country_conv['booked'] / country_conv['contact'] 
sns.distplot(country_conv['booking_rate'], kde=False, bins=5)
Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2c4bfd68>
In [57]:
sns.distplot(country_conv[country_conv['contact']>=50]['booking_rate'], kde=False, bins=5)
Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x11be07320>
In [58]:
country_conv[country_conv['contact']>=50].sort_values(by='booking_rate', ascending=False).head()
Out[58]:
contact booked booking_rate
country
BE 112 81 0.723214
NO 78 53 0.679487
DK 92 62 0.673913
SE 83 53 0.638554
CH 262 166 0.633588
In [60]:
neighborhood_conv = joined.groupby('listing_neighborhood')[['contact', 'booked']].sum()
neighborhood_conv['booking_rate'] = neighborhood_conv['booked'] / neighborhood_conv['contact']
sns.distplot(neighborhood_conv[neighborhood_conv['contact']>=50]['booking_rate'], kde=False, bins=5)
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x11bb365f8>
In [61]:
neighborhood_conv[neighborhood_conv['contact']>=50].sort_values(by='booking_rate', ascending=False).head()
Out[61]:
contact booked booking_rate
listing_neighborhood
Urca 78 48 0.615385
Santa Teresa 590 359 0.608475
Lapa 425 245 0.576471
Vidigal 105 56 0.533333
Centro 122 63 0.516393

(3.8) correlation between guest countries, listing neighborhoods and room type for recommendation engine

-> perhaps some nationalities have better booking rates in some areas than others? and so the recommendation engine could try to match those nationalities to these areas?

How shall we structure this?

  • calculate conversion rates for the country & neighborhood pairs present in the dataset
    • this could be used to make recommendations by itself
    • if some countries tend to be associated with some neighborhoods, that could be a meaningful way to cluster?
In [62]:
country_neighborhood = joined.groupby(['country', 'listing_neighborhood'])[['contact', 'booked']].sum().reset_index()
country_neighborhood['booking_rate%'] = country_neighborhood['booked'] / country_neighborhood['contact']
country_neighborhood.sort_values(by='booking_rate%', ascending=False).head()
Out[62]:
country listing_neighborhood contact booked booking_rate%
986 ZA Todos os Santos 1 1 1.0
664 MU Botafogo 1 1 1.0
271 CN Leblon 2 2 1.0
651 MA Ipanema 2 2 1.0
652 MA Leblon 1 1 1.0
In [65]:
country_neighborhood_filtered = country_neighborhood[country_neighborhood['contact']>20]

# keep only countries and neighborhoods that appear in at least 2 of the filtered (country, neighborhood) pairs
countries_count = country_neighborhood_filtered.groupby(['country'])[['listing_neighborhood']].count()
common_countries = countries_count[countries_count['listing_neighborhood']>=2].reset_index().drop(columns='listing_neighborhood')

neigborhood_count = country_neighborhood_filtered.groupby(['listing_neighborhood'])[['country']].count()
common_neighborhood = neigborhood_count[neigborhood_count['country']>=2].reset_index().drop(columns='country')


country_neighborhood_filtered2 = country_neighborhood_filtered.merge(common_countries, on='country', how='inner')
country_neighborhood_filtered2 = country_neighborhood_filtered2.merge(common_neighborhood, on='listing_neighborhood', how='inner')

country_neighborhood_pivot = country_neighborhood_filtered2.pivot(index='country', columns='listing_neighborhood', values='booking_rate%')
In [66]:
# heatmap
with sns.plotting_context('notebook', font_scale=1.3):
    plt.figure(figsize=(18,8))
    sns.heatmap(country_neighborhood_pivot.dropna(how='all'), cmap="Greens")
    plt.title('Booking rates grouped by guest country and listing neighborhood (where enough data is available)')
    plt.tight_layout()
    plt.xlabel('listing neighborhood')
    plt.show()

Several problems with the data:

  • many listings don't indicate the neighborhood ('-unknown-')
  • the small dataset leaves many (country, neighborhood) cells with few contacts, and so may give an unrepresentative picture.
In [67]:
plt.figure(figsize=(18,10))
sns.heatmap(country_neighborhood_pivot.corr().dropna(how='all', axis=0).dropna(how='all', axis=1), cmap="RdYlGn")
plt.title('Correlation between listing neighborhoods, in terms of conversion rates for different countries')
plt.show()
In [69]:
plt.figure(figsize=(18,10))
sns.clustermap(country_neighborhood_pivot.transpose().corr().dropna(how='all', axis=0).dropna(how='all', axis=1).dropna(), cmap='RdYlGn')
plt.title('Correlation between guest nationalities, in terms of conversion rates for different neighborhoods')
plt.show()
<Figure size 1296x720 with 0 Axes>

cluster countries, neighborhoods and pairs if time allows
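Not done here; a sketch of how cluster labels could be pulled out of the same correlation matrix used by the clustermap above, with scipy's hierarchical clustering (the choice of 3 clusters is arbitrary):

from scipy.cluster.hierarchy import linkage, fcluster

# cluster guest countries by how similarly their booking rates vary across neighborhoods
corr = (country_neighborhood_pivot.transpose().corr()
        .dropna(how='all', axis=0).dropna(how='all', axis=1).dropna())
Z = linkage(corr.values, method='average')
labels = fcluster(Z, t=3, criterion='maxclust')  # 3 clusters: an arbitrary choice
print(pd.Series(labels, index=corr.index).sort_values())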

(3.9) number of nights booked (not sure how this can be acted upon)

In [70]:
with sns.plotting_context('notebook', font_scale=1.5):
    g = sns.FacetGrid(joined, row='contact_channel_first', hue='booked',  height=3.5, aspect=4)
    g.map(sns.distplot, 'nights_booked', kde=False, bins=range(0,15,1))
    g.set(xlim=(0, 14), ylim=(0, 2500))
    g.set_axis_labels("Nights booked", "Number of enquiries")
    g.add_legend()

there seems to be an optimal booking rate around 3 nights --> not sure how to act on that

  • the logistics challenge of keeping a long period available without bookings in the middle might benefit from some extra tools?

(3.10) number of guests

In [72]:
with sns.plotting_context('notebook', font_scale=1.5):
    g = sns.FacetGrid(joined, row='contact_channel_first', hue='booked',  height=3.5, aspect=4)
    g.map(sns.distplot, 'm_guests', kde=False, bins=range(1,11,1))
    g.set(xlim=(1, 10), ylim=(0, 2500))
    g.set_axis_labels("Number of guests", "Number of enquiries")
    g.add_legend()
  • doesn't seem to be a particularly strong predictor, nor especially insightful
  • not quite sure how I would act on it either...

(3.11) total reviews impact on booking rates

In [73]:
sns.barplot(y='total_reviews', x='booked', data=joined)
Out[73]:
<matplotlib.axes._subplots.AxesSubplot at 0x11bb2a320>
In [74]:
with sns.plotting_context('notebook', font_scale=1.5):
    g = sns.FacetGrid(joined, row='contact_channel_first', hue='booked',  height=3.5, aspect=4)
    g.map(sns.distplot, 'total_reviews', kde=False)
    g.set(xlim=(0, 100), ylim=(0, 2500))
    g.set_axis_labels("Total reviews to date on listing", "Number of enquiries")
    g.add_legend()
In [76]:
reviews_bins = np.concatenate(([0, 1, 2, 3], np.arange(5,44,5)), axis=None)
book_it['reviews_bins'] = pd.cut(book_it['total_reviews'], bins=reviews_bins, labels=False)
bookit_binned = book_it.groupby('reviews_bins')[['booked']].agg(['count', 'sum'])
bookit_binned.columns = bookit_binned.columns.droplevel(0)
#bookit_binned.reset_index(inplace=True)
bookit_binned['booking_rate']  = bookit_binned['sum'] / bookit_binned['count'] 
bookit_binned_aug = pd.concat([bookit_binned, pd.DataFrame({'reviews_bins':reviews_bins[0:-1]})], axis=1)
bookit_binned_aug.set_index(keys='reviews_bins', inplace=True)
bookit_binned_aug
Out[76]:
count sum booking_rate
reviews_bins
0 777 240 0.308880
1 433 162 0.374134
2 390 165 0.423077
3 561 290 0.516934
5 830 469 0.565060
10 616 397 0.644481
15 494 297 0.601215
20 300 209 0.696667
25 235 147 0.625532
30 263 160 0.608365
35 170 131 0.770588
In [77]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)

g = sns.lineplot(x="reviews_bins", y="booking_rate", data=bookit_binned_aug.reset_index(), ax=ax)
#g.set_axis_labels("response time (s)", "Conversion rate")
g.set(ylim=(0, 1))
Out[77]:
[(0, 1)]
In [78]:
log_reviews_bins = np.log(np.concatenate(([1, 2, 3], np.arange(5,44,5)), axis=None))
book_it['log_reviews_bins'] = pd.cut(book_it['log_reviews'], bins=log_reviews_bins, labels=False)
bookit_binned = book_it.groupby('log_reviews_bins')[['booked']].agg(['count', 'sum'])
bookit_binned.columns = bookit_binned.columns.droplevel(0)
#bookit_binned.reset_index(inplace=True)
bookit_binned['booking_rate']  = bookit_binned['sum'] / bookit_binned['count'] 
bookit_binned_aug = pd.concat([bookit_binned, pd.DataFrame({'log_reviews_bins':log_reviews_bins[0:-1]})], axis=1)
bookit_binned_aug.set_index(keys='log_reviews_bins', inplace=True)
bookit_binned_aug
Out[78]:
count sum booking_rate
log_reviews_bins
0.000000 777 240 0.308880
0.693147 433 162 0.374134
1.098612 691 317 0.458755
1.609438 899 493 0.548387
2.302585 696 438 0.629310
2.708050 495 287 0.579798
2.995732 368 262 0.711957
3.218876 238 148 0.621849
3.401197 270 166 0.614815
3.555348 178 137 0.769663
  • try to linearize
In [79]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)

g = sns.regplot(x="log_reviews_bins", y="booking_rate", data=bookit_binned_aug.reset_index(), ax=ax)
#g.set_axis_labels("response time (s)", "Conversion rate")
g.set(ylim=(0, 1))
Out[79]:
[(0, 1)]

(3.12) number of days between enquiry and checkin

In [80]:
with sns.plotting_context('notebook', font_scale=1.5):
    g = sns.FacetGrid(joined, row='contact_channel_first', hue='booked',  height=3.5, aspect=4)
    g.map(sns.distplot, 'enquiry_to_checkin', kde=False)
    g.set(xlim=(0, 300), ylim=(0, 2500))
    g.set_axis_labels("Days between enquiry and prospective check-in", "Number of enquiries")
    g.add_legend()
In [81]:
log_enquiry2checkin_bins = np.log(np.concatenate(([1, 2, 3, 4, 7, 14], np.arange(21,200,20)), axis=None))
book_it['log_enquiry2checkin_bins'] = pd.cut(book_it['log_enquiry2checkin'], bins=log_enquiry2checkin_bins, labels=False)
bookit_binned = book_it.groupby('log_enquiry2checkin_bins')[['booked']].agg(['count', 'sum'])
bookit_binned.columns = bookit_binned.columns.droplevel(0)
#bookit_binned.reset_index(inplace=True)
bookit_binned['booking_rate']  = bookit_binned['sum'] / bookit_binned['count'] 
bookit_binned_aug = pd.concat([bookit_binned, pd.DataFrame({'log_enquiry2checkin_bins':log_enquiry2checkin_bins[0:-1]})], axis=1)
bookit_binned_aug.set_index(keys='log_enquiry2checkin_bins', inplace=True)
bookit_binned_aug
Out[81]:
count sum booking_rate
log_enquiry2checkin_bins
0.000000 278 164 0.589928
0.693147 164 102 0.621951
1.098612 165 108 0.654545
1.386294 403 248 0.615385
1.945910 707 429 0.606789
2.639057 693 370 0.533911
3.044522 1621 827 0.510179
3.713572 711 417 0.586498
4.110874 568 268 0.471831
4.394449 450 173 0.384444
4.615121 369 138 0.373984
4.795791 314 109 0.347134
4.948760 304 91 0.299342
5.081404 320 107 0.334375
In [82]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
g = sns.regplot(x="log_enquiry2checkin_bins", y="booking_rate", data=bookit_binned_aug.reset_index(), ax=ax)
#g.set_axis_labels("response time (s)", "Conversion rate")
g.set(ylim=(0, 1))
Out[82]:
[(0, 1)]
  • the log transformation hasn't really linearized this - the first few days and bookings a month ahead of time seem to be outliers --> perhaps work on another transformation if time allows
  • initially, make dummy variables for same-day and next-day check-in
  • same-day bookings could also correlate with response time: it is particularly important to be fast for those... --> should I build an interaction variable? (a sketch follows below)
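A minimal sketch of that interaction term (the column name is made up), which could then be added to the regressors in section (4):

# hypothetical interaction term: same-day check-in enquiries x (log) host response delay
joined['sameday_x_logdelay'] = joined['enquiry_sameday_checkin'] * joined['log_delay_firstreply']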

(3.13) booking per host, and booking per guest

with hindsight, I consider that the dataset may just be a random sample of the real data

  • kept exploration shallow
In [84]:
joined.groupby('id_listing_anon')[['id_guest_anon']].count().describe([0.5, 0.75, 0.9, 0.95, 0.99, 0.999])
Out[84]:
id_guest_anon
count 12819.000000
mean 2.182620
std 2.266554
min 1.000000
50% 1.000000
75% 2.000000
90% 4.000000
95% 6.000000
99% 12.000000
99.9% 20.000000
max 31.000000

--> there are probably fixed costs to set up a host (taking pictures etc). After how many bookings does Airbnb recoup its costs? (this will vary with listing price & # nights, if income is a percentage of the booking price)
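This cannot be answered from this dataset (no prices or cost data); purely as an illustration of the break-even arithmetic, with made-up numbers:

# all numbers below are made-up placeholders, purely to illustrate the arithmetic
host_setup_cost = 100.0      # photography, onboarding, support (assumed)
avg_booking_value = 120.0    # average value of one booking (assumed)
take_rate = 0.12             # platform fee as a share of booking value (assumed)

bookings_to_break_even = host_setup_cost / (take_rate * avg_booking_value)
print('break-even after ~%.1f bookings' % bookings_to_break_even)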

In [85]:
joined.groupby('id_host_anon')[['id_listing_anon']].count().describe([0.5, 0.75, 0.9, 0.95, 0.99, 0.999])
Out[85]:
id_listing_anon
count 8959.000000
mean 3.123005
std 6.718118
min 1.000000
50% 1.000000
75% 3.000000
90% 6.000000
95% 9.000000
99% 24.000000
99.9% 103.000000
max 216.000000
  • Some hosts seem to have many listings: are those agencies?
  • Should Airbnb deal with them differently from individual hosts (product, pricing, recommendations, etc.)?
In [86]:
joined.groupby('id_guest_anon')[['id_host_anon']].count().describe([0.5, 0.75, 0.9, 0.95, 0.99, 0.999])
Out[86]:
id_host_anon
count 22566.000000
mean 1.239874
std 0.705666
min 1.000000
50% 1.000000
75% 1.000000
90% 2.000000
95% 2.000000
99% 4.000000
99.9% 7.000000
max 40.000000

(3.14) are unsuccessful bookings impacting guest retention?

  • limited data, and self-selection bias..
In [87]:
joined.groupby('id_guest_anon')[['booked']].agg(['count', 'sum']).describe([0.5, 0.75, 0.9, 0.95, 0.98])
Out[87]:
booked
count sum
count 22566.000000 22566.000000
mean 1.239874 0.514402
std 0.705666 0.565686
min 1.000000 0.000000
50% 1.000000 0.000000
75% 1.000000 1.000000
90% 2.000000 1.000000
95% 2.000000 1.000000
98% 3.000000 2.000000
max 40.000000 6.000000

there is not really enough data to answer this question

  • very few users booked more than once, so the data would likely not be representative of the entire population...
  • next step, on a larger dataset: try to get a better sense of causality, and look at the time sequence of failed bookings and their effect on retention

(4) attribution of selected variables to conversion rates

regression 1

instant_book gets a 100% booking rate, and the messaging ('contact me') channel has a big penalty; we'll treat those separately

booked (binary) ~ intercept + isBrazilian + log(first_reply_delay) + log(reviews) + log(enquiry_to_checkin) + enquiry_sameday_checkin + enquiry_nextday_checkin + first_message_length

  • model with few variables, both actionable levers and major confounding factors
In [89]:
book_it2 = joined[joined['contact_channel_first'] == 'book_it'].copy()  # .copy() to avoid SettingWithCopyWarning
book_it2['intercept'] = 1

X = book_it2[[
        'isbrazilian',
        #'frequent_country_neigh',
        #'frequent_countries', 'frequent_neighborhoods',
            #'guest_user_stage_first', # new users seem to struggle mainly on the contact me channel, not book it
                'log_delay_firstreply', 'log_reviews', 'log_enquiry2checkin',
        'enquiry_sameday_checkin', 'enquiry_nextday_checkin',
        'm_first_message_length_in_characters',
               'intercept'
               ]]
#X = pd.get_dummies(X, columns=[
    #'guest_user_stage_first'
#    'frequent_country_neigh'
    # 'frequent_countries', 'frequent_neighborhoods'
#   ],drop_first=True)

y = book_it2['booked']

logit = sm.Logit(y, X)
logit_fitted = logit.fit()
print(logit_fitted.summary())
print()
number_rows = len(y)
print(logit_fitted.pred_table(threshold=0.5))
print(logit_fitted.pred_table(threshold=0.5)/number_rows)
Optimization terminated successfully.
         Current function value: 0.577636
         Iterations 5
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 booked   No. Observations:                 8382
Model:                          Logit   Df Residuals:                     8374
Method:                           MLE   Df Model:                            7
Date:                Mon, 03 Dec 2018   Pseudo R-squ.:                  0.1653
Time:                        09:58:02   Log-Likelihood:                -4841.7
converged:                       True   LL-Null:                       -5800.4
                                        LLR p-value:                     0.000
========================================================================================================
                                           coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
isbrazilian                             -0.2758      0.053     -5.165      0.000      -0.381      -0.171
log_delay_firstreply                    -0.2500      0.012    -20.662      0.000      -0.274      -0.226
log_reviews                              0.4599      0.018     26.091      0.000       0.425       0.494
log_enquiry2checkin                     -0.2786      0.021    -13.303      0.000      -0.320      -0.238
enquiry_sameday_checkin                 -0.5307      0.153     -3.470      0.001      -0.830      -0.231
enquiry_nextday_checkin                 -0.2586      0.098     -2.647      0.008      -0.450      -0.067
m_first_message_length_in_characters    -0.0003      0.000     -2.033      0.042      -0.001   -1.07e-05
intercept                                2.5070      0.149     16.779      0.000       2.214       2.800
========================================================================================================

[[3282. 1109.]
 [1291. 2700.]]
[[0.39155333 0.13230733]
 [0.15402052 0.32211883]]
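To make those coefficients easier to interpret, one could convert them to odds ratios and average marginal effects (a sketch on the fitted model above):

# odds ratio associated with a one-unit increase of each regressor
# (for the log-transformed variables, one unit means multiplying the raw quantity by e)
print(np.exp(logit_fitted.params))

# average marginal effects: change in booking probability per unit change of each regressor
print(logit_fitted.get_margeff().summary())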