This anonymised dataset contains information about US TV commercials that ran for 5 movies. The columns are:
We are trying to determine which creatives perform better than others.
Document structure:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv('ads_sample_data.csv')
print(df.shape)
df.head()
However, creatives are not the only factor impacting ad performance. Indeed, independently of the qualities of the artist, the number of views (impressions) will linearly scale the number of actions for a fixed performance. Similarly, for a given budget, since ad cost is largely linear in duration, one can show three 10s ads for the cost of one 30s ad.
The first-order impact of these confounding variables has already been treated in the dataset by computing ad_score_rate. At first order, this seems to work well for impressions, as shown in the figure below. We note that for small numbers of impressions ad_score_rate diverges, but this could naturally be kept under control by weighting by impressions.
fig = plt.figure()
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
sns.regplot(x='impressions', y='ad_score', data=df, ax=ax1)
sns.regplot(x='impressions', y='ad_score_rate', data=df, ax=ax2)
ax1.set_ylim(0,50)
ax2.set_ylim(0,5)
Although the argument to normalise ad_score by the ad length can be made on a cost basis, there is also a (noisy) upward trend of ad_score with ad length, which is kept under control at first order.
fig = plt.figure()
ax1 = fig.add_subplot(311)
ax2 = fig.add_subplot(312)
ax3 = fig.add_subplot(313)
sns.boxplot(x='length', y='ad_score', data=df, ax=ax1)
sns.boxplot(x='length', y='ad_score_rate', data=df, ax=ax2)
sns.boxplot(x='length', y='impressions', data=df, ax=ax3)
ax1.set_ylim(0,10)
ax2.set_ylim(0,4)
ax3.set_ylim(0,2*1E6)
Beyond those first-order corrections, are there any reasons for impressions and length to influence the performance of an ad? There could be. That being said, the rationale for it isn't clear, and modelling this could lead to overfitting noise.
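As a quick, purely exploratory check of this point, we can look at the residual rank correlation between ad_score_rate and these two variables after the first-order normalization:
# exploratory check: Spearman correlation of ad_score_rate with impressions and length
print(df[['ad_score_rate', 'impressions', 'length']].corr(method='spearman'))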
Additional confounding variables to control for:
Computing the mean ad_score_rate for different creatives, averaging over
# (a)
df.groupby('creative')['ad_score_rate'].agg(['mean', 'median', 'count']).nlargest(5, 'mean')
Some of the top-performing creatives have very few channel-movie observations. This means the individual specificities of those channel-movie combinations will strongly impact the mean estimator. This is on top of potential bias from an artist running on a well-performing network, etc.
# (b)
df.groupby('creative').apply(lambda x: np.average(x['ad_score_rate'], weights=x['impressions'])).nlargest(5)
# (c)
# weighted median (the `weighted` module, e.g. as provided by the wquantiles package)
import weighted
df.groupby('creative').apply(lambda x: weighted.median(x['ad_score_rate'], x['impressions'])).nlargest(5)
# (d) mean ad_score_rate of creatives for a given movie - here movie #51
df[df['movie_id']==51].groupby('creative').apply(lambda x: np.average(x['ad_score_rate'], weights=x['impressions'])).nlargest(5)
# better approaches would probably use all the data from all movies,
# correct for the movie factor, and then restrict to creatives having worked on that one movie
The advantage of these methods is good scalability, with a complexity that scales linearly with the number of samples, O(N_samples).
If one wants to identify creative performance within one movie, the complexity probably wouldn't be far from O(N_movie_samples) if the list is sorted. As mentioned, this may not be the best estimate of the top-performing creatives for one movie, as it does not leverage the rest of the dataset.
All in all, we think it is worth knowing these raw rankings of performance, as any further ranking will be based on various assumptions about the behavior of the counterfactual (confounding variables).
Indeed, although correlation doesn't imply causation, it is not trivial to disentangle creative performance from pure chance. An example could be good channels working only with great creatives and vice versa (good creatives working only with great channels - is that chance or selectivity?).
One way to control for confounding variables would be to compare artists in the same context. This is called matching. The advantage of matching is that it brings us closer to a balanced controlled experiment, without the need to model how the variables interact with each other (is it a linear relation, polynomial, are there interactions, etc.). When the match isn't perfect, some modelling becomes necessary, but this is then a local model, less likely to lead to large extrapolation errors.
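As a minimal sketch of what exact matching could look like here (the choice of movie_id and network as matching keys is just for illustration):
# sketch: compare creatives only within identical movie-network contexts
matched = (df.groupby(['movie_id', 'network', 'creative'])['ad_score_rate']
             .mean()
             .unstack('creative'))
# rows (contexts) containing at least two creatives allow a direct within-context comparison
matched[matched.notna().sum(axis=1) >= 2].head()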
Once matched, one could rank artists directly, or e.g. normalize the data. However, there are many combinations of the counterfactual variables.
Let's get a sense for the number of combinations between variables
print(df['creative'].nunique())
print(df['movie_id'].nunique())
print(df['network'].nunique())
print('')
print(df.groupby(['movie_id','creative','network'])['ad_score_rate'].count().describe())
print(df.groupby(['movie_id','creative'])['network'].nunique().describe())
print(df.groupby(['network'])['creative'].nunique().describe())
print(df.groupby(['network', 'movie_id'])['creative'].nunique().describe())
combined into:
If we were to match / normalize on just the network
If we were to match / normalize on both the network and the movies:
If we were to match on network, movies, dbr and time, this becomes very sparse, with no real overlap in each particular case.
One typical context in which confounding variables are controlled for is the assessment of treatment effects. One complication compared to that framework is that our treatment has many levels (86 creatives) and no clear base case.
When there are only 2 cases (treatment / no treatment), matching on a high-dimensional counterfactual can be collapsed onto a new one-dimensional variable, the propensity score.
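For reference, a binary-treatment propensity score could be sketched as below for a single creative (here 'CH' is just an example label and the covariates are an arbitrary subset); it is not used further in this exercise:
# hypothetical one-vs-all propensity score for a single creative
from sklearn.linear_model import LogisticRegression
treated = (df['creative'] == 'CH').astype(int)
covariates = pd.get_dummies(df[['network', 'movie_id', 'dbr', 'hour']],
                            columns=['network', 'movie_id'], drop_first=True)
propensity = LogisticRegression(max_iter=1000).fit(covariates, treated).predict_proba(covariates)[:, 1]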
I do not know how this concept generalises to multiple-level treatments, and this may take more than a few hours to properly digest, so I will use another approach for this exercise. However, given more time this could be another option.
Another way to use this concept could have been a one-vs-all approach for each creative. However, quite a few creatives have fewer treated units than there are variables in the counterfactual (68 dummies for the networks + the other variables).
print(df.groupby(['creative'])['ad_score_rate'].count().describe())
One approach could be to normalize by the networks, and then do a one-vs-all matching with the rest of the counterfactual variables.
This could work for most creatives, but some would not be assessed by this method, as they have run too few ads (movie-network combinations) - as low as 2.
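A rough sketch of the first step (normalizing by the impression-weighted network mean; exploratory, not used in the models below):
# divide each ad's score rate by its network's impression-weighted mean
network_mean = df.groupby('network').apply(lambda x: np.average(x['ad_score_rate'], weights=x['impressions']))
ad_score_rate_net = df['ad_score_rate'] / df['network'].map(network_mean)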
The model we are considering is: $$ y_{it} = Z_{it}\,\beta + \alpha_i + u_{it} $$
where:
The $\alpha_i$ would be a performance measure of each creative $i$, and could be used to rank them.
(Actually, given the nature of networks and movie_id, they will also be treated as fixed effects, but this is a special case of the notation above, and I chose to emphasize creatives as this is the variable we are interested in. In this notebook, explanatory variables are grouped in X, whether that is $\alpha_i$ or the counterfactual variables.)
One of the key potential problems with this model is that variables can only act additively, both for the counterfactual and for the creatives' impact. We'll try to improve on this later.
A simple way to implement this is to run a linear regression with dummy variables for both the creatives and the networks. This works on this data sample, and may or may not scale so well depending on the total number of networks and creatives - roughly O(n_samples * n_variables^2).
import statsmodels.api as sm
import statsmodels.formula.api as smf
effect = df['ad_score_rate']
X = df.drop(['ad_score_rate', 'ad_score', 'impressions', 'length'], axis=1)
X = pd.get_dummies(X, columns=['creative', 'network', 'movie_id'],drop_first=True)
# one of the dummies is dropped to avoid multicollinearity
# ad_score isn't the effect we chose in this model
# it isn't clear if one should control for impressions or length further
# for this first model we kick them out, as their first-order impact has already been taken into account
results = sm.WLS(effect, X, weights=df['impressions']).fit()
print(results.summary())
This is not the best model, with residuals not normally distributed, potentially some multicollinearity, and only 36% of the variance being explained. But it could well do the job of ranking our creatives. If the dataset can indeed be modelled as the fixed-effect linear model described above, then the noise on the creative coefficients often isn't catastrophic, typically 25% of the assessed value. Let's see if the predicted coefficients are plausible.
results.params.nlargest(17)
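As a rough sanity check of that noise level, we can look at the relative standard errors on the creative coefficients (the `creative_` prefix is the one produced by get_dummies above):
# relative uncertainty on the creative coefficients: std. error / |coefficient|
creative_cols = [c for c in X.columns if c.startswith('creative_')]
(results.bse[creative_cols] / results.params[creative_cols].abs()).describe()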
The top creatives are estimated to be CH, BT, BS, CG, BY. This makes some sense, as:
Similarly, the network ranking isn't too far from a simple average approach:
df.groupby('network').apply(lambda x: np.average(x['ad_score_rate'], weights=x['impressions'])).nlargest(5)
The fact that all movies have a negative coefficient is surprising. The dummy for the best-performing movie has been excluded to reduce redundancy (if it's not one of the other 4, it's this one), so at least the ranking seems to hold. It feels like the mean of the movie coefficients should come out close to 0, with the intercept modelled as a separate variable. However, this may be a human preference, and may not actually reflect an issue.
df.groupby('movie_id').apply(lambda x: np.average(x['ad_score_rate'], weights=x['impressions']))
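As a sketch of that point (exploratory, not used below), one could refit with an explicit intercept via sm.add_constant, so that the dummy coefficients read as deviations from a baseline:
# refit with an explicit constant; category coefficients then become deviations from the baseline
X_const = sm.add_constant(X)
results_const = sm.WLS(effect, X_const, weights=df['impressions']).fit()
print(results_const.params['const'])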
Interestingly, 'dbr' is found to have little effect, and so do 'hour' and 'minutegroup'. We should check whether this is a good reflection of reality, or whether the model is too simple to capture those effects.
dbr
sns.jointplot(x='dbr', y='ad_score_rate', data=df, kind='kde', ylim=(0,10))
The bulk of the movies do not seem to be affected by the release date, which would be in line with the very small coefficients in the regression.
If anything, it looks like the standard deviation increases closer to the release date. As the score is bounded below by zero, this asymmetry creates an increase of the mean score.
Data 30 days before the release date looks more sparse and potentially less reliable.
def group_and_plot(df, X, Y, n_rolling=5):
    # rolling, impression-weighted summaries of Y as a function of X
    grouped = df.groupby(X)
    g_25 = grouped[Y].apply(lambda x: np.percentile(x, 25))
    g_25 = g_25.rolling(n_rolling).mean()
    g_median_weighted = grouped.apply(lambda x: weighted.median(x[Y], x['impressions']))
    g_median_weighted = g_median_weighted.rolling(n_rolling).mean()
    g_mean_weighted = grouped.apply(lambda x: np.average(x[Y], weights=x['impressions']))
    g_mean_weighted = g_mean_weighted.rolling(n_rolling).mean()
    g_75 = grouped[Y].apply(lambda x: np.percentile(x, 75))
    g_75 = g_75.rolling(n_rolling).mean()
    # plot the smoothed weighted median/mean and shade the interquartile band
    plt.figure()
    plt.plot(g_median_weighted.index, g_median_weighted.values, 'b--')
    plt.plot(g_mean_weighted.index, g_mean_weighted.values, 'k--')
    plt.fill_between(g_25.index, g_25.values, g_75.values, color='g', alpha=0.2)
    plt.legend(['weighted median', 'weighted mean', '25-75th percentile'])
group_and_plot(df, 'dbr', 'ad_score_rate', n_rolling=5)
These opposite trends would not be well captured by a linear model; we may need to create new variables to transform this - a squared term, perhaps?
df['dbr2'] = (df['dbr']-50)**2
group_and_plot(df, 'dbr2', 'ad_score_rate', n_rolling=5)
If anything, this seems to have created a more linear relationship - it could help.
time
ax = sns.boxplot(x='minutegroup', y='ad_score_rate', data=df)
ax.set_ylim([0, 8])
It doesn't look like there are global trends.
ax = sns.boxplot(x='hour', y='ad_score_rate', data=df)
ax.set_ylim([0, 8])
group_and_plot(df, 'hour', 'ad_score_rate', n_rolling=5)
sns.jointplot(x='hour', y='ad_score_rate', data=df, kind='kde', ylim=(0,10))
Ad score rate seems to increase in the middle of the day. This is when the fewest ads run, though, so it may just be less reliable data.
So it isn't so surprising for the regression coefficient to be quasi-null.
What's surprising is that there is so much ad activity between 0 and 4 am - is it all in the same time zone? In a real project I'd want to check that.
It could be that people genuinely have more time to take actions at this point of the day, or it could be a reflection that ads reaching few people are actually more effective for some reason. Or it could be a purely spurious correlation. The middle-of-the-day hypothesis is plausible enough that I'll create a binary variable to capture it.
df['middle_day'] = df['hour'].apply(lambda x: 1 if (x>=9 and x<=18) else 0)
interactions, and other non-linearities: a few thoughts
Adding complexity for its own sake could lead to overfitting. Looking at learning curves and complexity curves can help find the right balance between bias and variance. Given that this model is expected to receive a lot more data, leaning towards a more complex model would probably make the most of the extra data.
One way to control the proliferation of these extra variables is to use regularization, which penalizes all coefficients so that only the useful ones remain.
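As a minimal sketch of what that could look like on the current design matrix (ridge penalty; the alpha value is illustrative, not tuned):
# L2-regularized fit on the same weighted design matrix (sketch only)
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X, effect, sample_weight=df['impressions'])
pd.Series(ridge.coef_, index=X.columns).nlargest(5)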
So far, the ad score rate has been corrected with multiplicative rather than additive factors. This may or may not be for good reasons. One way to switch from one to the other is to work with logarithms of the variables, as log(a*b) = log(a) + log(b). I have also noticed there is quite a bit of skew in the data, and taking the log could help.
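A minimal sketch of that idea (the small offset is an assumption to guard against zero scores; it is kept as a standalone series so it doesn't leak into the feature matrices built below):
# log-transform the target so multiplicative effects become additive, and to reduce skew
log_ad_score_rate = np.log(df['ad_score_rate'] + 1e-3)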
To capture more complexity in the model, I'd start from domain knowledge, or graph the data and try to look at plausible causal effects. This can take a bit of time, and I may try another model instead given time restrictions.
# we'll just capture one interaction that could make sense given ad_score_rate
df['impressions*length'] = df['impressions'] * df['length']
df['1e7/(impressions*length)'] = 1e7/(df['impressions'] * df['length'])
I'd like to try a different approach that is a lot more experimental: leverage some of the power of complex machine learning algorithms to capture counterfactual complexity, while keeping the easy interpretability of the fixed effects model for creatives ranking.
The model we are considering is: $$ y_{it} = f(Z_{it})\,\gamma + \alpha_i + u_{it} $$
where:
At the moment I am training f to predict y, which is a way to collapse the counterfactual into one variable. It relies on the counterfactual having enough information to predict the main effects $y_{it}$, and the creatives are treated as fixed adjustments $\alpha_i$ to this main effect.
This isn't a conventional model, so it would need improvements through iterations and some more thoughts.
In terms of tools, we'll do a bit of a paradigm shift and use predictive modelling to assess model performance. We will use scikit-learn for its machine learning libraries, and in this notebook we'll explore a Random Forest Regressor, but given more time it may be worth exploring more models.
For this section:
In both cases we'll compare the random forest regressor to a linear model.
from sklearn.model_selection import train_test_split
# the effect y is ad_score_rate
# X2 contains all the explanatory variables
# treatment_creatives are the creative dummy variables
# X2 = counterfactual + treatment_creatives
y = df['ad_score_rate']
X2 = df.drop(['ad_score_rate', 'ad_score'], axis=1)
counterfactual = X2.drop('creative', axis=1)
treatment_creatives = df[['creative', 'impressions']]
X2 = pd.get_dummies(X2, columns=['network', 'movie_id', 'creative'],drop_first=True)
counterfactual = pd.get_dummies(counterfactual, columns=['network', 'movie_id'],drop_first=True)
treatment_creatives = pd.get_dummies(treatment_creatives, columns=['creative'],drop_first=True)
# the shared random_state keeps the three splits row-aligned
X2_train, X2_test, y_train, y_test = train_test_split(X2, y, test_size=0.3, random_state=42)
counterfactual_train, counterfactual_test, y_train, y_test = train_test_split(counterfactual, y, test_size=0.3, random_state=42)
treatment_train, treatment_test, y_train, y_test = train_test_split(treatment_creatives, y, test_size=0.3, random_state=42)
first, predictions on the entire dataset
# step 1: first a random forest regressor to model the counterfactual
from sklearn.ensemble import RandomForestRegressor
RF_parameters = {'n_estimators':100, 'min_samples_split':6, 'min_samples_leaf':6, 'bootstrap':True, 'n_jobs':4}
#RF_parameters = {'n_estimators':100}
RFR = RandomForestRegressor(**RF_parameters)
RFR.fit(counterfactual, y, sample_weight=counterfactual['impressions'])
print('RFR counterfactual - train score: ', RFR.score(counterfactual, y, sample_weight=counterfactual['impressions']))
# -------- plot feature importance in the random forest -------------
importances = RFR.feature_importances_
std = np.std([tree.feature_importances_ for tree in RFR.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking and contribution
print("Feature ranking:")
for f in range(counterfactual.shape[1]):
    print("%d. feature %d %s (%f)" % (f + 1, indices[f], counterfactual.columns.tolist()[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(counterfactual.shape[1]), importances[indices], color="r", yerr=std[indices], align="center")
plt.xticks(range(counterfactual.shape[1]), indices)
plt.xlim([-1, counterfactual.shape[1]])
plt.show()
The few interactions and new variables I have created are deemed the most important features in the random forest regressor: more of those may well be the way to improve the model.
Note: usually I'd use GridSearchCV to optimize the models' hyperparameters at this stage. However, I've noticed that the cross-validation scoring in GridSearchCV doesn't take sample weighting into account, so I'll keep it simple for this exercise.
y_counterfactual_est = RFR.predict(counterfactual)
# step 2: linear regression from f(counterfactual) and treatment
treatment_creatives['y_counterfactual_est'] = y_counterfactual_est
from sklearn.linear_model import LinearRegression
LR_2stepmodel = LinearRegression()
LR_2stepmodel.fit(treatment_creatives.drop('impressions', axis=1), y, sample_weight=treatment_creatives['impressions'])
print('LR second step of the model - train score: ', LR_2stepmodel.score(treatment_creatives.drop('impressions', axis=1), y, sample_weight=treatment_creatives['impressions']))
plt.scatter(y, y_counterfactual_est)
est_coeff_2steps = pd.DataFrame({'features':treatment_creatives.drop('impressions', axis=1).columns.values,'est. coeffs.':LR_2stepmodel.coef_})
est_coeff_2steps.sort_values('est. coeffs.', ascending=False).head(10)
First, we can notice that the $\gamma$ corresponding to y_counterfactual_est is close to 1 (which is good), but not quite 1.
The top 5 creatives are estimated to be CH, CG, BS, CE, and BQ; 3 of the 5 match our previous linear regression with fewer variables. Domain knowledge could help determine whether this is an improvement on the previous model.
# see how a simple linear model performs with these new features
LR = LinearRegression()
LR.fit(X2, y, sample_weight=X2['impressions'])
print('LR only - train score: ', LR.score(X2, y, sample_weight=X2['impressions']))
est_coeff_pure_LR = pd.DataFrame({'features':X2.columns.values,'est. coeffs.':LR.coef_})
est_coeff_pure_LR.sort_values('est. coeffs.', ascending=False).head(14)
For the straight linear regression with these new variables, the top creatives are CH, BS, BT, CG and CE.
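As a quick way to quantify the agreement between the two rankings (exploratory), here is a Spearman rank correlation over the creative coefficients shared by both models:
# rank agreement between the two-step model and the plain linear model on the creative coefficients
two_step = est_coeff_2steps.set_index('features')['est. coeffs.']
pure_lr = est_coeff_pure_LR.set_index('features')['est. coeffs.']
common = two_step.index.intersection(pure_lr.index)
print(two_step[common].corr(pure_lr[common], method='spearman'))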
results2 = sm.WLS(y, X2, weights=X2['impressions']).fit()
print(results2.summary())
This is probably a small improvement compared to the previous model - now 42% of the variance is explained.
This can be further improved, as previously discussed.
second, assess model performance on separate train and test datasets
# step 1: first a random forest regressor to model the counterfactual
RFR_split = RandomForestRegressor(**RF_parameters)
RFR_split.fit(counterfactual_train, y_train, sample_weight=counterfactual_train['impressions'])
print('RFR counterfactual - train score: ', RFR_split.score(counterfactual_train, y_train, sample_weight=counterfactual_train['impressions']))
print('RFR counterfactual - test score: ', RFR_split.score(counterfactual_test, y_test, sample_weight=counterfactual_test['impressions']))
y_counterfactual_est_train = RFR_split.predict(counterfactual_train)
y_counterfactual_est_test = RFR_split.predict(counterfactual_test)
# step 2: linear regression from f(counterfactual) and treatment
treatment_train['y_counterfactual_est'] = y_counterfactual_est_train
treatment_test['y_counterfactual_est'] = y_counterfactual_est_test
LR_2stepmodel_split = LinearRegression()
LR_2stepmodel_split.fit(treatment_train.drop('impressions', axis=1), y_train, sample_weight=treatment_train['impressions'])
print('LR second step of the model - train score: ', LR_2stepmodel_split.score(treatment_train.drop('impressions', axis=1), y_train, sample_weight=treatment_train['impressions']))
print('LR second step of the model - test score: ', LR_2stepmodel_split.score(treatment_test.drop('impressions', axis=1), y_test, sample_weight=treatment_test['impressions']))
# see how a simple linear model performs with these new features
LR_split = LinearRegression()
LR_split.fit(X2_train, y_train, sample_weight=X2_train['impressions'])
print('LR only - train score: ', LR_split.score(X2_train, y_train, sample_weight=X2_train['impressions']))
print('LR only - test score: ', LR_split.score(X2_test, y_test, sample_weight=X2_test['impressions']))
Usually at this stage I'd be looking at the learning curves to know whether more data would help. However, the sklearn learning_curve tool doesn't seem to support sample_weight, so I'll skip it this time.
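A rough workaround is to compute a weighted learning curve by hand; the train-set fractions below are illustrative:
# manual weighted learning curve for the linear model
for frac in [0.2, 0.5, 1.0]:
    n = int(frac * len(X2_train))
    w = X2_train['impressions'].iloc[:n]
    lr = LinearRegression().fit(X2_train.iloc[:n], y_train.iloc[:n], sample_weight=w)
    print(frac,
          lr.score(X2_train.iloc[:n], y_train.iloc[:n], sample_weight=w),
          lr.score(X2_test, y_test, sample_weight=X2_test['impressions']))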
What we can see already is that the linear model's test score is close to its train score, so I wouldn't expect that model to improve much once we scale to the full dataset. Additional features may help, but at the moment it looks like a high-bias model.
The two-step model seems to be high-variance, with a big gap between the train and test scores. More data would probably help improve its performance. If the model were still high-variance with the full dataset, we should work on avoiding overfitting.
A thought about scaling: the complexity of a random forest is probably something like O(n_trees * n_features_per_split * n_samples * log(n_samples)). n log n looks pretty scalable.