This anonymised dataset contains information about US TV commercials that ran for 5 movies. Columns are:

  • movie_id - identifies which movie the commercial was for
  • creative - identifies which creative made the given commercial. The same code across different movies indicates different commercials, but the same creative.
  • network - The TV network that the commercial aired on
  • length - How long the commercial was, in seconds
  • impressions - estimate for the number of people who saw the commercial (views)
  • ad_score - estimated score for how many people performed an action that we track as a result of the commercial (absolute number of people but scaled by a constant factor)
  • ad_score_rate - Computed as [ad_score * 1e7 / (impressions * length)]
  • dbr - The number of days before the movie's release that the commercial aired
  • hour - Hour of day when the commercial ran on the East Coast
  • minutegroup - Number of minutes past the hour when the commercial ran: 0 = 0-9, 1 = 10-19, 2 = 20-29, 3=30-39, 4=40-49, 5=50-59

We are trying to determine which creative is better than another creative.

Document structure:

  • (0) naive approach - straight averages
  • (1) matching would suffer from sparsity issues
  • (2) a simple fixed-effect linear-regression model
  • (3) understanding better the confounding variables dbr, hour and minutegroup
  • (4) hybrid model: more complexity for the counterfactual, and fixed effects for creatives
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [3]:
df = pd.read_csv('ads_sample_data.csv')
print(df.shape)
df.head()
(19911, 10)
Out[3]:
movie_id creative network length impressions ad_score ad_score_rate dbr hour minutegroup
0 51 0B MTV 15 94000.0 1.49936 10.633759 135 9 1
1 51 0B VH1 15 39000.0 1.07447 18.367009 135 11 1
2 51 0B Comedy Central 15 134000.0 3.57074 17.764876 135 12 0
3 51 0B Comedy Central 15 153000.0 4.82151 21.008758 135 13 0
4 51 0B VH1 15 47000.0 1.54256 21.880284 135 13 5

However, creatives are not the only factor impacting ad performance. Indeed, independently of the qualities of the artist, the number of views (impressions) will linearly scale the number of actions for a fixed creative quality. Similarly, for a given budget, since ad cost is largely linear in duration, one can show three 10s ads for the cost of one 30s ad.

The first-order impact of these confounding variables has already been treated in the dataset by computing ad_score_rate. At first order, this seems to work well for impressions, as shown in the figure below. We note that ad_score_rate diverges for small numbers of impressions, but this could naturally be kept under control by weighting by impressions.
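As a quick sanity check (a sketch, not run in this notebook), ad_score_rate can be recomputed from the other columns using the formula given in the column description above.

# sanity check (sketch): does ad_score_rate match ad_score * 1e7 / (impressions * length)?
recomputed = df['ad_score'] * 1e7 / (df['impressions'] * df['length'])
print(np.allclose(recomputed, df['ad_score_rate'], rtol=1e-3))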

In [3]:
fig = plt.figure()
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
sns.regplot(x='impressions',  y='ad_score', data=df, ax=ax1)
sns.regplot(x='impressions',  y='ad_score_rate', data=df, ax=ax2)
ax1.set_ylim(0,50)
ax2.set_ylim(0,5)
Out[3]:
(0, 5)

Although the argument for normalising ad_score by the ad length can be made on a cost basis, there is also a (noisy) upward trend between ad length and ad_score, which is kept under control at first order.

In [4]:
fig = plt.figure()
ax1 = fig.add_subplot(311)
ax2 = fig.add_subplot(312)
ax3 = fig.add_subplot(313)
sns.boxplot(x='length',  y='ad_score', data=df, ax=ax1)
sns.boxplot(x='length',  y='ad_score_rate', data=df, ax=ax2)
sns.boxplot(x='length',  y='impressions', data=df, ax=ax3)
ax1.set_ylim(0,10)
ax2.set_ylim(0,4)
ax3.set_ylim(0,2*1E6)
Out[4]:
(0, 2000000.0)

Beyond those first-order corrections, are there any reasons for impressions and length to influence the performance of an ad? There could be. That being said, the rationale isn't clear, and modelling this could lead to overfitting noise.

Additional confounding variables to control for:

  • network: some networks create more actions than others
  • dbr: number of Days Before Release may impact ad performance
  • hour: time of the day
  • minutegroup: could be a way to get more precision on the time of the day, or potentially be a separate information in itself if people's actions are impacted by how close to the next hour they are (e.g. news on the hour etc)

0. naive approach - straight averages

Computing the mean ad_score_rate for different creatives, averaging over

  • (a) the channel-movie combinations over which the creative's ads ran
  • (b) the number of impressions (views) that the creative generated, across all channel-movie combinations
  • (c) like (b), but using the median instead of the mean, as a way to be more robust to extremes and to get a 'typical' score for the creative (rather than an average effect)
  • (d) the number of impressions (views) that the creative generated, across all channels, but for one movie (one specific ad)
In [5]:
# (a)
df.groupby('creative')['ad_score_rate'].agg(['mean', 'median', 'count']).nlargest(5, 'mean')
Out[5]:
mean median count
creative
BT 10.495160 4.367288 44
CH 10.172789 5.610796 4
BR 9.414887 12.211601 3
BN 7.376727 2.174129 64
AS 7.263093 1.678383 297

Some of the top-performing creatives have very low channel-movie counts. This means the individual specificities of those channel-movie combinations will strongly affect the mean estimator. This is on top of potential bias from an artist running mostly on a well-performing network, etc.

In [6]:
# (b)
df.groupby('creative').apply(lambda x: np.average(x['ad_score_rate'], weights=x['impressions'])).nlargest(5)
Out[6]:
creative
CH    14.393372
BS     5.335156
CG     4.680067
BT     4.666555
BQ     3.294564
dtype: float64
In [7]:
# (c)
import weighted
df.groupby('creative').apply(lambda x: weighted.median(x['ad_score_rate'], x['impressions'])).nlargest(5)
Out[7]:
creative
CH    9.463857
CE    2.090763
BY    2.021227
BQ    2.004876
BZ    1.994444
dtype: float64
In [8]:
# (d) mean ad_score_rate of creatives for a given movie - here movie #51
df[df['movie_id']==51].groupby('creative').apply(lambda x: np.average(x['ad_score_rate'], weights=x['impressions'])).nlargest(5)

# better approaches would probably use all the data from all movies,
# correct for the movie factor, and then restrict to creatives having worked on that one movie
Out[8]:
creative
AJ    12.143115
AI    11.292443
0C     8.846082
AH     8.096525
AO     7.872271
dtype: float64

The advantage of these methods is good scalability, with a complexity that scales linearly with the number of samples, O(N_samples).

In the case where one wants to identify creative performance within one movie, the complexity probably wouldn't be far from O(N_movie_samples) if the data is sorted. As mentioned, this may not be the best estimate of the top-performing creatives for one movie, as it does not leverage the rest of the dataset.

All in all, we think it is worth knowing these raw rankings of performance, as any further ranking will be based on various assumptions about the behavior of the counterfactual (confounding variables).

Indeed, although correlation doesn't imply causation, it is not trivial to disentangle creative performance from pure chance. An example could be good channels working only with great creatives and vice versa (good creatives working only with great channels - is that chance or selectivity?).

1. matching would suffer from sparsity issues

One way to control for confounding variables would be to compare artists in the same context. This is called matching. The advantage of matching is that it brings us closer to a balanced controlled experiment, without the need to model how the variables interact with each other (is the relation linear, polynomial, are there interactions, etc.). When the match isn't perfect, some modelling becomes necessary, but it is then a local model, less likely to lead to large extrapolation errors.

Once matched, one could rank artists directly, or e.g. normalize the data. However, there are many possible combinations of the counterfactual variables.
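To make this concrete, here is a minimal sketch (not run in this notebook) of what exact matching cells would look like if we required identical movie, network, dbr, hour and minutegroup; the share of cells containing more than one creative gives a feel for how often a within-cell comparison is even possible.

# sketch: exact-matching cells on the full counterfactual
cells = df.groupby(['movie_id', 'network', 'dbr', 'hour', 'minutegroup'])['creative'].nunique()
print(cells.describe())
print('share of cells allowing a comparison:', (cells > 1).mean())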

Let's get a sense for the number of combinations between variables

In [9]:
print(df['creative'].nunique())
print(df['movie_id'].nunique())
print(df['network'].nunique())
print('')
print(df.groupby(['movie_id','creative','network'])['ad_score_rate'].count().describe())
print(df.groupby(['movie_id','creative'])['network'].nunique().describe())
print(df.groupby(['network'])['creative'].nunique().describe())
print(df.groupby(['network', 'movie_id'])['creative'].nunique().describe())
86
5
69

count    2561.000000
mean        7.774697
std        11.411155
min         1.000000
25%         2.000000
50%         4.000000
75%        10.000000
max       151.000000
Name: ad_score_rate, dtype: float64
count    279.000000
mean       9.179211
std        9.323722
min        1.000000
25%        1.000000
50%        5.000000
75%       15.000000
max       37.000000
Name: network, dtype: float64
count    69.000000
mean     27.782609
std      17.042222
min       1.000000
25%      15.000000
50%      26.000000
75%      41.000000
max      61.000000
Name: creative, dtype: float64
count    188.000000
mean      13.622340
std        7.924843
min        1.000000
25%        7.000000
50%       14.000000
75%       19.000000
max       33.000000
Name: creative, dtype: float64
  • 86 creatives
  • 5 movies
  • 69 networks

combined into:

  • 2561 combinations of movie_id-creative-network with a mean of 7.8 repeats of an ad on each network (median 4)
  • 279 movie-creative ads, each aired on 9 networks on average (median 5) => this gives $279 \times 9 \times 7.8 \approx 19911$, which is the number of rows in this file

If we were to match / normalize on just the network

  • each network has a mean of 27 creatives that have worked for a diversity of movies

If we were to match / normalize on both the network on the movies:

  • there are about 14 creatives per movie-network combination, which isn't too bad but isn't quite enough to invoke the law of large numbers (for a Gaussian approximation)

If we were to match on network, movie, dbr and time, the data becomes very sparse, with no real overlap in each particular case.

One typical context in which confounding variables are controlled for is for the assessment of treatment effects. One complication compared to this framework is that the treatments would have many levels (86 creatives) and no clear base case.

When there are only 2 cases (treatment / no treatment), matching on a high-dimensional counterfactual can be collapsed onto a new one-dimensional variable, the propensity score.

I do not know how this concept generalises to multiple-level treatments, and this may take more than a few hours to properly digest, so I will use another approach for this exercise. However, given more time this could be another option.

Another way to use this concept could have been a one-vs-all approach for each creative - however, quite a few creatives have fewer treated units than there are variables in the counterfactual (68 dummies for the networks + the other variables).

In [10]:
print(df.groupby(['creative'])['ad_score_rate'].count().describe())
count      86.000000
mean      231.523256
std       253.455634
min         2.000000
25%        40.250000
50%       122.500000
75%       368.750000
max      1109.000000
Name: ad_score_rate, dtype: float64

One approach could be to normalize by the networks, and then do a one vs all matching with the rest of the counterfactual variables.

This could work for most creatives, but some would not be assessed by this method, as they have run too few ads (movie-network combinations) - as low as 2.
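As an illustration of what that one-vs-all route could look like, here is a minimal sketch (not run in this notebook) of a propensity-score step for a single creative, picking 'AS' purely as an example; matching or weighting would then be done on the resulting one-dimensional score.

# sketch: one-vs-all propensity score for one creative (here 'AS', chosen arbitrarily)
from sklearn.linear_model import LogisticRegression
treated = (df['creative'] == 'AS').astype(int)
covariates = pd.get_dummies(df[['network', 'movie_id', 'dbr', 'hour', 'minutegroup']],
                            columns=['network', 'movie_id'], drop_first=True)
propensity = LogisticRegression(max_iter=1000).fit(covariates, treated).predict_proba(covariates)[:, 1]
# treated and control rows could then be matched or weighted on this score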

2. a simple fixed effects linear regression model

The model we are considering is: $$ y_{it} = Z_{it} * \beta + \alpha_i + u_{it}$$

where:

  • $i$ corresponds to each artist
  • $t$ corresponds to the combinations in our counterfactual, in practice movie-network-dbr-time combinations
  • $y_{it}$ is the effect, here ad_score_rate
  • $Z_{it}$ is the $1 \times k$ vector of counterfactual independent variables
  • $\beta$ is the $k \times 1$ vector of parameters for the counterfactual
  • $\alpha_i$ is the unobserved innate (fixed-effect) ability of the creators to create good ads
  • $u_{it}$ is the error term

The $\alpha_i$ would be a performance measure of each creator $i$, and this could be used to rank them.

(Actually, given the nature of network and movie_id, they will also be treated as fixed effects, but this is a special case of the notation above, and I chose to emphasize creatives as this is the variable we are interested in. In the code below, explanatory variables are grouped in X, whether that is the $\alpha_i$ dummies or the counterfactual variables.)

One of the key potential problems with this model is that each variable's impact enters only additively, both for the counterfactual and for the artists. We'll try to improve on this later.

A simple way to implement this is to run a linear regression with dummy variables for both the creators and the networks. This works on this data sample, and may or may not scale well depending on the total number of networks and creators - roughly O(n_samples * n_variables^2).

In [11]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
In [12]:
effect = df['ad_score_rate']

X = df.drop(['ad_score_rate', 'ad_score', 'impressions', 'length'], axis=1)
X = pd.get_dummies(X, columns=['creative', 'network', 'movie_id'],drop_first=True)
# one of the dummies is dropped to avoid multi-collinearity
# ad_score isn't the effect we chose in this model

# it isn't clear if one should control for impressions or length further
# for this first model we leave them out, as their first-order impact has already been taken into account
In [13]:
results = smf.WLS(effect, X, weights=df['impressions']).fit()
print(results.summary())
                            WLS Regression Results                            
==============================================================================
Dep. Variable:          ad_score_rate   R-squared:                       0.360
Model:                            WLS   Adj. R-squared:                  0.355
Method:                 Least Squares   F-statistic:                     69.57
Date:                Mon, 11 Jun 2018   Prob (F-statistic):               0.00
Time:                        09:00:25   Log-Likelihood:                -49834.
No. Observations:               19911   AIC:                         9.999e+04
Df Residuals:                   19751   BIC:                         1.013e+05
Df Model:                         160                                         
Covariance Type:            nonrobust                                         
======================================================================================================
                                         coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------
dbr                                    0.0040      0.002      1.963      0.050    5.44e-06       0.008
hour                                   0.0006      0.002      0.257      0.797      -0.004       0.005
minutegroup                            0.0016      0.010      0.154      0.878      -0.018       0.021
creative_0B                            1.1294      0.206      5.477      0.000       0.725       1.534
creative_0C                            1.4757      0.388      3.802      0.000       0.715       2.236
creative_0D                            1.0716      0.307      3.492      0.000       0.470       1.673
creative_0E                            0.3974      0.291      1.364      0.173      -0.174       0.969
creative_0F                            0.9891      0.212      4.677      0.000       0.575       1.404
creative_0G                            1.0100      0.214      4.731      0.000       0.592       1.429
creative_0H                            1.0159      0.211      4.806      0.000       0.602       1.430
creative_0I                            0.6442      0.196      3.279      0.001       0.259       1.029
creative_0J                            0.5895      0.208      2.832      0.005       0.181       0.998
creative_0K                            0.7294      0.197      3.705      0.000       0.344       1.115
creative_0L                            0.8491      0.197      4.304      0.000       0.462       1.236
creative_0M                            1.3761      0.210      6.550      0.000       0.964       1.788
creative_0N                            0.9750      0.224      4.351      0.000       0.536       1.414
creative_0O                            1.0781      0.210      5.128      0.000       0.666       1.490
creative_0P                            0.6167      0.214      2.881      0.004       0.197       1.036
creative_0Q                            1.3614      0.209      6.520      0.000       0.952       1.771
creative_0R                            1.1485      0.206      5.570      0.000       0.744       1.553
creative_0S                            1.1449      0.226      5.055      0.000       0.701       1.589
creative_0T                            1.1186      0.229      4.882      0.000       0.669       1.568
creative_0U                            0.9904      0.201      4.926      0.000       0.596       1.384
creative_0V                            1.3300      0.217      6.119      0.000       0.904       1.756
creative_0W                            1.2732      0.222      5.727      0.000       0.837       1.709
creative_0X                            1.0723      0.215      4.994      0.000       0.651       1.493
creative_0Y                            1.7684      0.216      8.187      0.000       1.345       2.192
creative_0Z                            1.4357      0.219      6.549      0.000       1.006       1.865
creative_AA                            1.0598      0.227      4.668      0.000       0.615       1.505
creative_AB                            1.0163      0.207      4.906      0.000       0.610       1.422
creative_AC                            1.3624      0.219      6.230      0.000       0.934       1.791
creative_AD                            1.5332      0.203      7.560      0.000       1.136       1.931
creative_AE                            1.1586      0.239      4.854      0.000       0.691       1.626
creative_AF                            1.2081      0.232      5.197      0.000       0.753       1.664
creative_AG                            1.2028      0.236      5.093      0.000       0.740       1.666
creative_AH                            2.6073      0.309      8.429      0.000       2.001       3.214
creative_AI                            2.4668      0.284      8.690      0.000       1.910       3.023
creative_AJ                            2.3594      0.266      8.864      0.000       1.838       2.881
creative_AK                            1.6876      0.272      6.215      0.000       1.155       2.220
creative_AL                            0.7703      0.468      1.646      0.100      -0.147       1.688
creative_AM                            1.8161      0.294      6.177      0.000       1.240       2.392
creative_AN                            0.8868      0.346      2.560      0.010       0.208       1.566
creative_AO                            2.4027      0.313      7.678      0.000       1.789       3.016
creative_AP                            2.0663      0.563      3.672      0.000       0.963       3.169
creative_AQ                            1.6498      0.227      7.273      0.000       1.205       2.094
creative_AR                            1.7152      0.270      6.343      0.000       1.185       2.245
creative_AS                            2.3446      0.241      9.734      0.000       1.872       2.817
creative_AT                            1.2957      0.320      4.051      0.000       0.669       1.923
creative_AU                            1.6523      0.337      4.898      0.000       0.991       2.313
creative_AV                            1.4086      0.321      4.391      0.000       0.780       2.037
creative_AW                            2.2227      0.235      9.469      0.000       1.763       2.683
creative_AX                            2.6423      0.286      9.246      0.000       2.082       3.203
creative_AY                            1.8428      0.268      6.888      0.000       1.318       2.367
creative_AZ                            2.2701      0.241      9.410      0.000       1.797       2.743
creative_BA                            1.8858      0.285      6.626      0.000       1.328       2.444
creative_BB                            2.3671      0.247      9.602      0.000       1.884       2.850
creative_BC                            2.1852      0.240      9.108      0.000       1.715       2.655
creative_BD                            1.1896      0.376      3.163      0.002       0.452       1.927
creative_BE                            2.2697      0.232      9.762      0.000       1.814       2.725
creative_BF                            1.8436      0.283      6.523      0.000       1.290       2.398
creative_BG                            1.4224      0.317      4.493      0.000       0.802       2.043
creative_BH                            1.9460      0.278      7.010      0.000       1.402       2.490
creative_BI                            2.0676      0.621      3.332      0.001       0.851       3.284
creative_BJ                            2.4243      0.367      6.603      0.000       1.705       3.144
creative_BK                            1.5685      1.149      1.365      0.172      -0.684       3.821
creative_BL                            1.7223      0.537      3.206      0.001       0.669       2.775
creative_BM                            0.6616      0.374      1.768      0.077      -0.072       1.395
creative_BN                            2.6370      0.340      7.749      0.000       1.970       3.304
creative_BO                            1.7028      0.546      3.116      0.002       0.632       2.774
creative_BP                            2.4464      0.319      7.666      0.000       1.821       3.072
creative_BQ                            2.7111      0.454      5.972      0.000       1.821       3.601
creative_BR                            2.3767      0.531      4.478      0.000       1.336       3.417
creative_BS                            5.0250      0.537      9.351      0.000       3.972       6.078
creative_BT                            5.2009      0.457     11.370      0.000       4.304       6.097
creative_BU                            1.2251      0.586      2.090      0.037       0.076       2.374
creative_BV                            1.9757      0.709      2.788      0.005       0.587       3.365
creative_BW                            2.1310      0.435      4.898      0.000       1.278       2.984
creative_BX                            2.2170      0.946      2.342      0.019       0.362       4.072
creative_BY                            3.4462      0.303     11.368      0.000       2.852       4.040
creative_BZ                            3.1595      0.343      9.221      0.000       2.488       3.831
creative_CA                            2.7051      0.344      7.864      0.000       2.031       3.379
creative_CB                            2.3852      0.537      4.438      0.000       1.332       3.439
creative_CC                            2.2072      0.592      3.730      0.000       1.047       3.367
creative_CD                            1.8769      0.409      4.585      0.000       1.075       2.679
creative_CE                            3.2628      1.139      2.864      0.004       1.029       5.496
creative_CF                            1.4456      0.945      1.531      0.126      -0.406       3.297
creative_CG                            4.6622      0.815      5.719      0.000       3.064       6.260
creative_CH                           13.4731      1.453      9.275      0.000      10.626      16.320
network_ABC                            0.8296      0.217      3.819      0.000       0.404       1.255
network_AMC                            0.8092      0.244      3.313      0.001       0.330       1.288
network_Adult Swim                     0.9515      0.210      4.535      0.000       0.540       1.363
network_American Heroes Channel        5.5110      0.924      5.966      0.000       3.700       7.322
network_Animal Planet                  1.5432      0.294      5.250      0.000       0.967       2.119
network_BBC America                    1.6559      0.497      3.330      0.001       0.681       2.631
network_BET                            1.7804      0.211      8.446      0.000       1.367       2.194
network_BRAVO                          1.3365      0.220      6.082      0.000       0.906       1.767
network_CBS                            0.4986      0.220      2.268      0.023       0.068       0.930
network_CMTV                           3.7847      2.386      1.586      0.113      -0.893       8.462
network_CNN                            3.4372      0.504      6.822      0.000       2.450       4.425
network_CW                             1.1048      0.287      3.848      0.000       0.542       1.668
network_Cartoon Network                2.4262      0.237     10.254      0.000       1.962       2.890
network_Comedy Central                 2.8393      0.218     13.040      0.000       2.413       3.266
network_Discovery Channel              1.0409      0.221      4.717      0.000       0.608       1.473
network_Disney Channel                 1.4022      0.205      6.827      0.000       1.000       1.805
network_Disney XD                      2.2149      0.229      9.665      0.000       1.766       2.664
network_E!                             1.7424      0.226      7.703      0.000       1.299       2.186
network_ESPN                           1.4396      0.219      6.561      0.000       1.010       1.870
network_ESPN Deportes                  6.4523      1.130      5.709      0.000       4.237       8.667
network_ESPN News                      4.0467      0.473      8.554      0.000       3.119       4.974
network_ESPN2                          3.8176      0.536      7.122      0.000       2.767       4.868
network_ESPNU                         24.1525      4.762      5.072      0.000      14.819      33.486
network_Esquire                        3.6812      0.713      5.166      0.000       2.285       5.078
network_FOX                            1.0764      0.230      4.675      0.000       0.625       1.528
network_FX                             1.6431      0.210      7.826      0.000       1.232       2.055
network_FX Movie Channel               2.2056      0.560      3.937      0.000       1.108       3.304
network_FXX                            2.8531      0.258     11.080      0.000       2.348       3.358
network_Food Network                   1.1807      0.646      1.827      0.068      -0.086       2.447
network_Fox News                       0.0823      0.325      0.253      0.800      -0.555       0.720
network_Fox Sports 1                   2.9700      0.760      3.910      0.000       1.481       4.459
network_Freeform                       1.7073      0.225      7.595      0.000       1.267       2.148
network_Galavision                     1.2836      4.281      0.300      0.764      -7.108       9.675
network_Game Show                      2.0573      0.632      3.253      0.001       0.818       3.297
network_HGTV                           1.1323      0.270      4.193      0.000       0.603       1.662
network_History Channel                0.9899      0.256      3.874      0.000       0.489       1.491
network_Independent Film (IFC)         3.1984      0.494      6.478      0.000       2.231       4.166
network_Investigation Discovery        1.1715      0.239      4.895      0.000       0.702       1.641
network_Lifetime                       1.1619      0.301      3.854      0.000       0.571       1.753
network_Lifetime Movie Network         1.5797      0.444      3.559      0.000       0.710       2.450
network_MLB Network                    0.7764      3.801      0.204      0.838      -6.674       8.227
network_MSNBC                          2.1560      0.473      4.558      0.000       1.229       3.083
network_MTV                            2.9391      0.220     13.352      0.000       2.508       3.371
network_MTV2                           6.2273      0.351     17.740      0.000       5.539       6.915
network_NBA TV                         5.9966      0.642      9.344      0.000       4.739       7.255
network_NBC                            0.4672      0.227      2.061      0.039       0.023       0.911
network_National Geographic            1.6793      0.341      4.922      0.000       1.011       2.348
network_Nick                           1.1782      0.218      5.412      0.000       0.751       1.605
network_Nick Toons                     2.3992      0.285      8.432      0.000       1.841       2.957
network_Nick at Nite                   1.3168      0.237      5.553      0.000       0.852       1.782
network_OXYGEN                         2.5587      0.289      8.869      0.000       1.993       3.124
network_Oprah Winfrey Network          1.3651      0.274      4.988      0.000       0.829       1.902
network_SYFY                           0.7538      0.234      3.219      0.001       0.295       1.213
network_Spike                          1.7067      0.232      7.349      0.000       1.252       2.162
network_TBS                            1.5802      0.208      7.592      0.000       1.172       1.988
network_TLC                            1.4977      0.227      6.606      0.000       1.053       1.942
network_TNT                            0.7783      0.215      3.622      0.000       0.357       1.199
network_TV LAND                        0.2589      1.803      0.144      0.886      -3.274       3.792
network_TV ONE                         2.4962      0.427      5.848      0.000       1.660       3.333
network_Teen Nick                      3.6126      0.430      8.408      0.000       2.770       4.455
network_Telemundo                      0.4704      0.294      1.602      0.109      -0.105       1.046
network_USA Network                    0.9248      0.211      4.383      0.000       0.511       1.338
network_Univision                      0.6561      0.275      2.389      0.017       0.118       1.194
network_Univision Deportes Network    15.1291      3.153      4.799      0.000       8.950      21.308
network_VH1                            2.0765      0.217      9.580      0.000       1.652       2.501
network_WE TV                          1.1015      0.435      2.535      0.011       0.250       1.953
network_WGN America                    4.1123      1.398      2.942      0.003       1.373       6.852
network_truTV                          1.7076      0.252      6.770      0.000       1.213       2.202
movie_id_73                           -2.0752      0.069    -29.860      0.000      -2.211      -1.939
movie_id_151                          -1.3293      0.077    -17.222      0.000      -1.481      -1.178
movie_id_165                          -2.1811      0.082    -26.660      0.000      -2.341      -2.021
movie_id_437                          -1.3091      0.093    -14.054      0.000      -1.492      -1.127
==============================================================================
Omnibus:                    25242.442   Durbin-Watson:                   1.561
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         10371374.634
Skew:                           6.652   Prob(JB):                         0.00
Kurtosis:                     114.015   Cond. No.                     6.98e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.98e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

This is not the best model: the residuals are not normally distributed, there is potentially some multicollinearity, and only 36% of the variance is explained. But it could well do the job of ranking our creatives. If the dataset can indeed be modelled by the fixed-effect linear model described above, then the noise on the creative coefficients often isn't catastrophic, typically around 25% of the estimated value. Let's see if the estimated coefficients are plausible.
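One way to check that claim (a sketch, not run above) is to compare the standard errors of the creative coefficients to the coefficients themselves:

# sketch: relative uncertainty (std err / |coef|) on the creative coefficients
creative_cols = [c for c in results.params.index if c.startswith('creative_')]
rel_err = results.bse[creative_cols] / results.params[creative_cols].abs()
print(rel_err.describe())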

In [14]:
results.params.nlargest(17)
Out[14]:
network_ESPNU                         24.152504
network_Univision Deportes Network    15.129112
creative_CH                           13.473137
network_ESPN Deportes                  6.452261
network_MTV2                           6.227304
network_NBA TV                         5.996572
network_American Heroes Channel        5.510958
creative_BT                            5.200879
creative_BS                            5.024987
creative_CG                            4.662222
network_WGN America                    4.112304
network_ESPN News                      4.046706
network_ESPN2                          3.817588
network_CMTV                           3.784665
network_Esquire                        3.681232
network_Teen Nick                      3.612571
creative_BY                            3.446162
dtype: float64

The top creatives are estimated to be CH, BT, BS, CG, BY. This makes some sense, as:

  • the top 5 by channel-movie count were BT, CH, BR, BN, AS (2/5 overlap)
  • the top 5 by impressions-weighted average were CH, BS, CG, BT, BQ (4/5 overlap)
  • the top 5 by impressions-weighted median were CH, CE, BY, BQ, BZ (2/5 overlap)

Similarly the network ranking isn't too far from a simple average approach:

In [15]:
df.groupby('network').apply(lambda x: np.average(x['ad_score_rate'], weights=x['impressions'])).nlargest(5)
Out[15]:
network
ESPNU                         25.933228
Univision Deportes Network    16.346817
MTV2                           6.569976
ESPN Deportes                  6.541738
American Heroes Channel        6.364487
dtype: float64

The fact that all movies have a negative coefficient is surprising. The dummy for the best-performing movie has been excluded to avoid redundancy (if it's not one of the other 4, it's this one), so at least the ranking seems to hold. It feels like the movie coefficients should average out to something close to 0, with the overall level captured by a separate intercept. However, this may just be a human preference, and may not actually reflect an issue.
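One way to get movie effects centred around an explicit intercept (a sketch, not run here) would be to refit the same model through the formula interface with sum-to-zero (deviation) coding for movie_id:

# sketch: deviation coding for movie_id, so the movie effects average to zero around an intercept
results_sum = smf.wls('ad_score_rate ~ C(movie_id, Sum) + C(creative) + C(network) + dbr + hour + minutegroup',
                      data=df, weights=df['impressions']).fit()
print(results_sum.params.filter(like='movie_id'))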

In [16]:
df.groupby('movie_id').apply(lambda x: np.average(x['ad_score_rate'], weights=x['impressions']))
Out[16]:
movie_id
51     2.629552
73     0.917531
151    1.686156
165    0.720432
437    1.257026
dtype: float64

Interestingly, 'dbr' is found to have little effect, and so are 'hour' and 'minutegroup'. We should check if this is a good reflection of reality, or if the model is too simple to capture those effects.

3. understanding better the confounding variables dbr, hour and minutegroup

dbr

In [17]:
sns.jointplot(x='dbr', y='ad_score_rate', data=df, kind='kde', ylim=(0,10))
Out[17]:
<seaborn.axisgrid.JointGrid at 0x2da0df0a4e0>

The bulk of the data does not seem to be affected by the release date, which would be in line with the very small coefficient in the regression.

If anything, it looks like the standard deviation increases closer to the release date. As the score cannot go below zero, this asymmetry seems to pull up the mean score.

Data 30 days before the release date looks more sparse and potentially less reliable.
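That claim about dispersion can be checked directly (a sketch, not run here):

# sketch: rolling-smoothed standard deviation of ad_score_rate by days before release
dispersion = df.groupby('dbr')['ad_score_rate'].std().rolling(5).mean()
dispersion.plot()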

In [18]:
def group_and_plot(df, X, Y, n_rolling=5):
    # group df by X, then plot (against X) the rolling-smoothed impressions-weighted
    # median and mean of Y, with the 25th-75th percentile band shaded
    grouped = df.groupby(X)
    
    g_25 = grouped[Y].apply(lambda x: np.percentile(x, 25))
    g_25 = g_25.rolling(n_rolling).mean()

    g_median_weighted = grouped.apply(lambda x: weighted.median(x[Y],x['impressions']))
    g_median_weighted = g_median_weighted.rolling(n_rolling).mean()

    g_mean_weighted = grouped.apply(lambda x: np.average(x[Y],weights=x['impressions']))
    g_mean_weighted = g_mean_weighted.rolling(n_rolling).mean()

    g_75 = grouped[Y].apply(lambda x: np.percentile(x, 75))
    g_75 = g_75.rolling(n_rolling).mean()
    

    plt.figure()

    plt.plot(g_median_weighted.index, g_median_weighted.values, 'b--')
    plt.plot(g_mean_weighted.index, g_mean_weighted.values, 'k--')
    plt.fill_between(g_25.index, g_25.values, g_75.values, color='g', alpha=0.2)
    plt.legend(['weighted median', 'weighted mean', '25-75th percentile'])
In [19]:
group_and_plot(df, 'dbr', 'ad_score_rate', n_rolling=5)

These opposing trends would not be well captured by a linear model - we may need to create new, transformed variables - a squared term?

In [20]:
df['dbr2'] = (df['dbr']-50)**2
In [21]:
group_and_plot(df, 'dbr2', 'ad_score_rate', n_rolling=5)

If anything, this seems to have created a more linear relationship - it could help.

time

In [22]:
ax = sns.boxplot(x='minutegroup', y='ad_score_rate', data=df)
ax.set_ylim([0, 8])
Out[22]:
(0, 8)

It doesn't look like there are any global trends.

In [23]:
ax = sns.boxplot(x='hour', y='ad_score_rate', data=df)
ax.set_ylim([0, 8])
Out[23]:
(0, 8)
In [24]:
group_and_plot(df, 'hour', 'ad_score_rate', n_rolling=5)
In [25]:
sns.jointplot(x='hour', y='ad_score_rate', data=df, kind='kde', ylim=(0,10))
Out[25]:
<seaborn.axisgrid.JointGrid at 0x2da12b155c0>

Ad score rate seems to increase in the middle of the day. This is also when the fewest ads air, though, so it may just be less reliable data.

So it isn't so surprising that the regression coefficient is close to zero.

What's surprising is that there is so much ad activity between 0 and 4 am - is it all in the same time zone? In a real project I'd want to check that.
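The volume claim itself is easy to verify (a sketch, not run here):

# sketch: number of ads aired per hour of day (East Coast)
df['hour'].value_counts().sort_index().plot(kind='bar')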

It could be that people genuinely have more time to take actions at that time of day, or it could be a sign that ads reaching fewer people are, for some reason, more effective. Or the correlation could be entirely spurious. The middle-of-the-day hypothesis is plausible enough that I'll create a binary variable to capture it.

In [26]:
df['middle_day'] = df['hour'].apply(lambda x: 1 if (x>=9 and x<=18) else 0)

interactions, and other non-linearities: a few thoughts

Adding complexity for its own sake could lead to overfitting. Looking at the learning curves and complexity curves can help find the right balance between bias and variance. Given that this model is expected to receive a lot more data, leaning on a more complex model would probably make the most of the extra data.

One way to control the expansion of these extra variables is to use regularization, which would penalize all coefficients so only useful ones remain.
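As an illustration (a sketch, not run here, reusing the X and effect variables from the model above), an L1 penalty would shrink the coefficients of uninformative dummies towards exactly zero; the penalty strength and the scaling of the continuous variables would both need tuning.

# sketch: L1-regularized fit of the same design matrix (alpha is an arbitrary illustrative value)
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X, effect)
print((lasso.coef_ != 0).sum(), 'non-zero coefficients out of', X.shape[1])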

So far, the ad score rate has been corrected by multiplicative rather than additive adjustments (the division by impressions and length). This may or may not be for good reasons. One way to switch from one to the other is to work with logarithms of the variables, since log(a*b) = log(a) + log(b). I have also noticed there is quite a bit of skew in the data, and taking the log could help with that too.
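A minimal sketch of that idea (not run here); log1p is used so the transform stays defined when the rate is zero:

# sketch: moving to log space turns multiplicative effects into additive ones
df['log_ad_score_rate'] = np.log1p(df['ad_score_rate'])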

To capture more complexity in the model, I'd start from domain knowledge, or graph the data and try to look at plausible causal effects. This can take a bit of time, and I may try another model instead given time restrictions.

In [27]:
# we'll just capture one interaction that could make sense given ad_score_rate
In [28]:
df['impressions*length'] = df['impressions'] * df['length']
df['1e7/(impressions*length)'] = 1e7/(df['impressions'] * df['length'])

4. hybrid model: more complexity for the counterfactual, and fixed effects for creatives

I'd like to try a different approach that is a lot more experimental: leverage some of the power of complex machine learning algorithms to capture counterfactual complexity, while keeping the easy interpretability of the fixed effects model for creatives ranking.

The model we are considering is: $$ y_{it} = f(Z_{it})*\gamma + \alpha_i + u_{it}$$

where:

  • $i$ corresponds to each artist
  • $t$ corresponds to the combinations in our counterfactual, in practice movie-network-dbr-time combinations
  • $y_{it}$ is the effect, here ad_score_rate
  • $Z_{it}$ is the $1 \times k$ vector of counterfactual independent variables
  • $x \rightarrow f(x)$ is a function that hopefully captures more of the counterfactual's complexity, here a Random Forest Regressor. In this first version $f$ is trained to predict $y$.
  • $\gamma$ is a scalar parameter (further improvements might want to transform this gamma into something that captures more complexity).
  • $\alpha_i$ is the unobserved innate (fixed-effect) ability of the creators to create good ads
  • $u_{it}$ is the error term

At the moment I am training f to predict y, which is a way to collapse the counterfactual into one variable. It relies on the counterfactual having enough information to predict the main effects $y_{it}$, and the creatives are treated as fixed adjustments $\alpha_i$ to this main effect.

This isn't a conventional model, so it would need improvements through iterations and some more thoughts.

In terms of tools, we'll do a bit of a paradigm shift and use predictive modelling to assess model performance. We will use scikit-learn for its machine learning algorithms, and in this notebook we'll explore a Random Forest Regressor, but given more time it may be worth exploring more models.

For this section:

  • first, make predictions on the entire data set (i.e. evaluate on the training data itself)
  • second, separately assess how the models do when fitted on a training set and tested on an unseen test set

In both cases we'll compare the random forest regressor to a linear model.

In [29]:
from sklearn.model_selection import train_test_split

# the effect is y ad_score_rate
# X are all the explanatory variables
# treatment_creatives are the creatives dummy variables
# X = counterfactual and treatment_creatives

y = df['ad_score_rate']

X2 = df.drop(['ad_score_rate', 'ad_score'], axis=1)
counterfactual = X2.drop('creative', axis=1)
treatment_creatives = df[['creative', 'impressions']]

X2 = pd.get_dummies(X2, columns=['network', 'movie_id', 'creative'],drop_first=True)
counterfactual = pd.get_dummies(counterfactual, columns=['network', 'movie_id'],drop_first=True)
treatment_creatives = pd.get_dummies(treatment_creatives, columns=['creative'],drop_first=True)

X2_train, X2_test, y_train, y_test = train_test_split(X2, y, test_size=0.3, random_state=42)
counterfactual_train, counterfactual_test, y_train, y_test = train_test_split(counterfactual, y, test_size=0.3, random_state=42)
treatment_train, treatment_test, y_train, y_test = train_test_split(treatment_creatives, y, test_size=0.3, random_state=42)

first, predictions on the entire dataset

In [30]:
# step 1: first a random forest regressor to model the counterfactual

from sklearn.ensemble import RandomForestRegressor
RF_parameters = {'n_estimators':100, 'min_samples_split':6, 'min_samples_leaf':6, 'bootstrap':True, 'n_jobs':4}
#RF_parameters = {'n_estimators':100}
RFR = RandomForestRegressor(**RF_parameters)
RFR.fit(counterfactual, y, sample_weight=counterfactual['impressions'])
print('RFR counterfactual - train score: ', RFR.score(counterfactual, y, sample_weight=counterfactual['impressions']))
RFR counterfactual - train score:  0.5850560887547009
In [31]:
# -------- plot feature importance in the random forest -------------

importances = RFR.feature_importances_
std = np.std([tree.feature_importances_ for tree in RFR.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking and contribution
print("Feature ranking:")

for f in range(counterfactual.shape[1]):
    print("%d. feature %d %s (%f)" % (f + 1, indices[f],counterfactual.columns.tolist()[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(counterfactual.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(counterfactual.shape[1]), indices)
plt.xlim([-1, counterfactual.shape[1]])
plt.show()
Feature ranking:
1. feature 8 1e7/(impressions*length) (0.202590)
2. feature 7 impressions*length (0.156974)
3. feature 5 dbr2 (0.113894)
4. feature 3 hour (0.089319)
5. feature 1 impressions (0.070000)
6. feature 77 movie_id_73 (0.066546)
7. feature 2 dbr (0.055944)
8. feature 79 movie_id_165 (0.047562)
9. feature 4 minutegroup (0.040616)
10. feature 80 movie_id_437 (0.039508)
11. feature 6 middle_day (0.030532)
12. feature 78 movie_id_151 (0.025953)
13. feature 22 network_Comedy Central (0.013680)
14. feature 73 network_VH1 (0.007308)
15. feature 51 network_MTV (0.007085)
16. feature 15 network_BET (0.003180)
17. feature 52 network_MTV2 (0.002714)
18. feature 63 network_TBS (0.002691)
19. feature 34 network_FX (0.002540)
20. feature 0 length (0.002295)
21. feature 25 network_Disney XD (0.001749)
22. feature 21 network_Cartoon Network (0.001645)
23. feature 29 network_ESPN News (0.001515)
24. feature 62 network_Spike (0.001484)
25. feature 40 network_Freeform (0.001324)
26. feature 36 network_FXX (0.001189)
27. feature 61 network_SYFY (0.000937)
28. feature 11 network_Adult Swim (0.000909)
29. feature 26 network_E! (0.000791)
30. feature 10 network_AMC (0.000588)
31. feature 27 network_ESPN (0.000499)
32. feature 76 network_truTV (0.000483)
33. feature 24 network_Disney Channel (0.000477)
34. feature 59 network_OXYGEN (0.000433)
35. feature 56 network_Nick (0.000402)
36. feature 53 network_NBA TV (0.000393)
37. feature 57 network_Nick Toons (0.000384)
38. feature 16 network_BRAVO (0.000382)
39. feature 65 network_TNT (0.000258)
40. feature 68 network_Teen Nick (0.000244)
41. feature 23 network_Discovery Channel (0.000241)
42. feature 28 network_ESPN Deportes (0.000229)
43. feature 70 network_USA Network (0.000217)
44. feature 44 network_History Channel (0.000208)
45. feature 19 network_CNN (0.000203)
46. feature 20 network_CW (0.000200)
47. feature 58 network_Nick at Nite (0.000182)
48. feature 64 network_TLC (0.000178)
49. feature 45 network_Independent Film (IFC) (0.000169)
50. feature 9 network_ABC (0.000142)
51. feature 30 network_ESPN2 (0.000136)
52. feature 32 network_Esquire (0.000124)
53. feature 46 network_Investigation Discovery (0.000111)
54. feature 60 network_Oprah Winfrey Network (0.000108)
55. feature 17 network_CBS (0.000081)
56. feature 54 network_NBC (0.000078)
57. feature 67 network_TV ONE (0.000076)
58. feature 35 network_FX Movie Channel (0.000061)
59. feature 55 network_National Geographic (0.000039)
60. feature 43 network_HGTV (0.000029)
61. feature 12 network_American Heroes Channel (0.000024)
62. feature 48 network_Lifetime Movie Network (0.000021)
63. feature 39 network_Fox Sports 1 (0.000020)
64. feature 42 network_Game Show (0.000019)
65. feature 71 network_Univision (0.000018)
66. feature 38 network_Fox News (0.000016)
67. feature 13 network_Animal Planet (0.000015)
68. feature 47 network_Lifetime (0.000015)
69. feature 14 network_BBC America (0.000008)
70. feature 33 network_FOX (0.000007)
71. feature 74 network_WE TV (0.000007)
72. feature 69 network_Telemundo (0.000001)
73. feature 18 network_CMTV (0.000000)
74. feature 41 network_Galavision (0.000000)
75. feature 49 network_MLB Network (0.000000)
76. feature 50 network_MSNBC (0.000000)
77. feature 66 network_TV LAND (0.000000)
78. feature 37 network_Food Network (0.000000)
79. feature 72 network_Univision Deportes Network (0.000000)
80. feature 75 network_WGN America (0.000000)
81. feature 31 network_ESPNU (0.000000)

The few interactions and new variables I have created are deemed the most important features by the random forest regressor: more of those may well be the way to improve the model.

Note: usually I'd use GridSearchCV to optimize the models' hyperparameters at this stage. However, I've noticed that the cross-validation scoring in GridSearchCV doesn't take the sample weighting into account, so I'll keep it simple for this exercise.
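For reference, a sketch (not run here) of how sample weights could at least be forwarded to each fold's fit; the fold scoring itself would still be unweighted, which is the limitation noted above, so a custom weighted scorer would be needed for a fully weighted comparison.

# sketch: hyperparameter search with weights forwarded to fit (scoring remains unweighted)
from sklearn.model_selection import GridSearchCV
param_grid = {'min_samples_leaf': [2, 6, 12], 'min_samples_split': [2, 6, 12]}
gs = GridSearchCV(RandomForestRegressor(n_estimators=100, n_jobs=4), param_grid, cv=3)
gs.fit(counterfactual, y, sample_weight=counterfactual['impressions'].values)
print(gs.best_params_)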

In [32]:
y_counterfactual_est = RFR.predict(counterfactual)
In [33]:
# step 2: linear regression from f(counterfactual) and treatment
treatment_creatives['y_counterfactual_est'] = y_counterfactual_est

from sklearn.linear_model import LinearRegression
LR_2stepmodel = LinearRegression()

LR_2stepmodel.fit(treatment_creatives.drop('impressions', axis=1), y, sample_weight=treatment_creatives['impressions'])

print('LR second step of the model - train score: ', LR_2stepmodel.score(treatment_creatives.drop('impressions', axis=1), y, sample_weight=treatment_creatives['impressions']))
LR second step of the model - train score:  0.6029259201545611
In [34]:
plt.scatter(y, y_counterfactual_est)
Out[34]:
<matplotlib.collections.PathCollection at 0x2da140749b0>
In [35]:
est_coeff_2steps = pd.DataFrame({'features':treatment_creatives.drop('impressions', axis=1).columns.values,'est. coeffs.':LR_2stepmodel.coef_})
est_coeff_2steps.sort_values('est. coeffs.', ascending=False).head(10)
Out[35]:
est. coeffs. features
84 5.506407 creative_CH
83 1.718383 creative_CG
85 1.195326 y_counterfactual_est
69 0.491258 creative_BS
67 0.337823 creative_BQ
39 0.336068 creative_AO
81 0.333625 creative_CE
33 0.263871 creative_AI
60 0.216154 creative_BJ
73 0.175318 creative_BW

First we can notice that $\gamma$ corresponding to y_counterfactual_est is close to 1 (which is good), but not quite 1.

The top 5 creatives are estimated to be CH, CG, BS, CE, and BQ, which overlaps 3/5 with our previous linear regression with fewer variables. Domain knowledge could help determine whether this is an improvement on the previous model.

In [36]:
# see how a simple linear model performs with these new features

LR = LinearRegression()
LR.fit(X2, y, sample_weight=X2['impressions'])
print('LR only - train score: ', LR.score(X2, y, sample_weight=X2['impressions']))
LR only - train score:  0.27280315226110285
In [37]:
est_coeff_pure_LR = pd.DataFrame({'features':X2.columns.values,'est. coeffs.':LR.coef_})
est_coeff_pure_LR.sort_values('est. coeffs.', ascending=False).head(14)
Out[37]:
est. coeffs. features
31 20.182542 network_ESPNU
165 10.625547 creative_CH
150 2.716570 creative_BS
151 2.609615 creative_BT
75 2.363441 network_WGN America
12 2.247993 network_American Heroes Channel
164 2.124627 creative_CG
52 1.964643 network_MTV2
19 1.809392 network_CNN
18 1.666614 network_CMTV
30 1.631032 network_ESPN2
22 1.475888 network_Comedy Central
51 1.310492 network_MTV
162 1.185197 creative_CE

For the straight linear regression with these new variables, the top creatives are CH, BS, BT, CG and CE.

In [38]:
results2 = smf.WLS(y, X2, weights=X2['impressions']).fit()
print(results2.summary())
                            WLS Regression Results                            
==============================================================================
Dep. Variable:          ad_score_rate   R-squared:                       0.426
Model:                            WLS   Adj. R-squared:                  0.421
Method:                 Least Squares   F-statistic:                     88.21
Date:                Mon, 11 Jun 2018   Prob (F-statistic):               0.00
Time:                        09:01:27   Log-Likelihood:                -48760.
No. Observations:               19911   AIC:                         9.785e+04
Df Residuals:                   19745   BIC:                         9.916e+04
Df Model:                         166                                         
Covariance Type:            nonrobust                                         
======================================================================================================
                                         coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------
length                                -0.0066      0.001     -4.847      0.000      -0.009      -0.004
impressions                        -4.479e-08    1.9e-08     -2.355      0.019   -8.21e-08   -7.52e-09
dbr                                    0.0093      0.002      4.430      0.000       0.005       0.013
hour                                   0.0258      0.002     10.616      0.000       0.021       0.031
minutegroup                            0.0026      0.010      0.275      0.783      -0.016       0.021
dbr2                                   0.0004    3.9e-05      9.090      0.000       0.000       0.000
middle_day                             0.5792      0.038     15.302      0.000       0.505       0.653
impressions*length                  2.049e-09   6.13e-10      3.345      0.001    8.48e-10    3.25e-09
1e7/(impressions*length)               0.5869      0.015     39.076      0.000       0.557       0.616
network_ABC                            0.7561      0.223      3.393      0.001       0.319       1.193
network_AMC                            0.5748      0.246      2.336      0.020       0.092       1.057
network_Adult Swim                     0.8702      0.207      4.207      0.000       0.465       1.276
network_American Heroes Channel        2.7951      0.880      3.178      0.001       1.071       4.519
network_Animal Planet                  0.8502      0.285      2.984      0.003       0.292       1.409
network_BBC America                    0.0598      0.476      0.126      0.900      -0.874       0.994
network_BET                            0.9050      0.210      4.303      0.000       0.493       1.317
network_BRAVO                          0.7541      0.217      3.481      0.000       0.330       1.179
network_CBS                            0.5059      0.241      2.102      0.036       0.034       0.978
network_CMTV                           2.2462      2.263      0.993      0.321      -2.188       6.681
network_CNN                            2.3601      0.481      4.905      0.000       1.417       3.303
network_CW                             0.9821      0.279      3.526      0.000       0.436       1.528
network_Cartoon Network                1.2525      0.234      5.353      0.000       0.794       1.711
network_Comedy Central                 2.0289      0.216      9.392      0.000       1.605       2.452
network_Discovery Channel              0.6710      0.218      3.081      0.002       0.244       1.098
network_Disney Channel                 0.9304      0.211      4.404      0.000       0.516       1.344
network_Disney XD                      1.0935      0.228      4.789      0.000       0.646       1.541
network_E!                             0.7812      0.223      3.503      0.000       0.344       1.218
network_ESPN                           0.9166      0.216      4.237      0.000       0.493       1.341
network_ESPN Deportes                 -2.3525      1.095     -2.148      0.032      -4.499      -0.206
network_ESPN News                      0.2640      0.461      0.572      0.567      -0.640       1.168
network_ESPN2                          2.1791      0.513      4.247      0.000       1.173       3.185
network_ESPNU                         20.7270      4.514      4.592      0.000      11.880      29.574
network_Esquire                       -0.7401      0.686     -1.079      0.281      -2.085       0.605
network_FOX                            0.9161      0.232      3.952      0.000       0.462       1.370
network_FX                             1.1046      0.207      5.325      0.000       0.698       1.511
network_FX Movie Channel              -0.0578      0.537     -0.108      0.914      -1.110       0.994
network_FXX                            1.4181      0.253      5.613      0.000       0.923       1.913
network_Food Network                   0.9764      0.615      1.588      0.112      -0.229       2.182
network_Fox News                      -0.1807      0.313     -0.576      0.564      -0.795       0.434
network_Fox Sports 1                  -3.5464      0.740     -4.791      0.000      -4.997      -2.095
network_Freeform                       1.0238      0.222      4.617      0.000       0.589       1.458
network_Galavision                     0.2078      4.058      0.051      0.959      -7.745       8.161
network_Game Show                      1.2158      0.603      2.016      0.044       0.033       2.398
network_HGTV                           0.6726      0.263      2.561      0.010       0.158       1.187
network_History Channel                0.6300      0.250      2.523      0.012       0.141       1.119
network_Independent Film (IFC)         0.0781      0.477      0.164      0.870      -0.858       1.014
network_Investigation Discovery        0.5992      0.235      2.553      0.011       0.139       1.059
network_Lifetime                       0.7816      0.292      2.675      0.007       0.209       1.354
network_Lifetime Movie Network         0.5276      0.425      1.241      0.214      -0.305       1.361
network_MLB Network                   -0.6862      3.603     -0.190      0.849      -7.748       6.376
network_MSNBC                          1.3030      0.452      2.883      0.004       0.417       2.189
network_MTV                            1.8532      0.219      8.453      0.000       1.423       2.283
network_MTV2                           2.5079      0.350      7.172      0.000       1.823       3.193
network_NBA TV                         0.4852      0.626      0.775      0.438      -0.741       1.712
network_NBC                            0.4664      0.232      2.012      0.044       0.012       0.921
network_National Geographic            0.4974      0.330      1.508      0.132      -0.149       1.144
network_Nick                           0.5516      0.216      2.552      0.011       0.128       0.975
network_Nick Toons                     0.7525      0.279      2.695      0.007       0.205       1.300
network_Nick at Nite                   1.0397      0.232      4.476      0.000       0.584       1.495
network_OXYGEN                         0.3425      0.285      1.204      0.229      -0.215       0.900
network_Oprah Winfrey Network          0.5820      0.268      2.175      0.030       0.057       1.107
network_SYFY                           0.3247      0.229      1.416      0.157      -0.125       0.774
network_Spike                          1.1163      0.229      4.882      0.000       0.668       1.564
network_TBS                            1.1346      0.206      5.505      0.000       0.731       1.539
network_TLC                            0.9686      0.223      4.352      0.000       0.532       1.405
network_TNT                            0.5747      0.212      2.706      0.007       0.158       0.991
network_TV LAND                       -0.6431      1.710     -0.376      0.707      -3.994       2.708
network_TV ONE                        -0.0072      0.414     -0.017      0.986      -0.819       0.804
network_Teen Nick                      1.0248      0.416      2.464      0.014       0.210       1.840
network_Telemundo                      0.1731      0.285      0.606      0.544      -0.386       0.733
network_USA Network                    0.6106      0.209      2.919      0.004       0.201       1.021
network_Univision                      0.2913      0.268      1.088      0.276      -0.233       0.816
network_Univision Deportes Network    -3.4618      3.023     -1.145      0.252      -9.388       2.464
network_VH1                            1.1459      0.214      5.352      0.000       0.726       1.566
network_WE TV                          0.5173      0.417      1.240      0.215      -0.300       1.335
network_WGN America                    2.9262      1.326      2.206      0.027       0.326       5.526
network_truTV                          0.7688      0.247      3.110      0.002       0.284       1.253
movie_id_73                           -1.8068      0.067    -26.872      0.000      -1.939      -1.675
movie_id_151                          -0.9280      0.076    -12.229      0.000      -1.077      -0.779
movie_id_165                          -1.8179      0.080    -22.610      0.000      -1.975      -1.660
movie_id_437                          -1.4639      0.089    -16.472      0.000      -1.638      -1.290
creative_0B                            0.3681      0.197      1.868      0.062      -0.018       0.754
creative_0C                            0.3632      0.373      0.973      0.331      -0.369       1.095
creative_0D                            0.3463      0.300      1.156      0.248      -0.241       0.933
creative_0E                           -0.1396      0.281     -0.497      0.619      -0.690       0.411
creative_0F                            0.4214      0.202      2.083      0.037       0.025       0.818
creative_0G                            0.4006      0.204      1.961      0.050       0.000       0.801
creative_0H                            0.1918      0.203      0.943      0.346      -0.207       0.591
creative_0I                           -0.1361      0.188     -0.725      0.468      -0.504       0.232
creative_0J                           -0.2643      0.200     -1.325      0.185      -0.655       0.127
creative_0K                           -0.0015      0.188     -0.008      0.994      -0.371       0.368
creative_0L                            0.0415      0.190      0.218      0.827      -0.331       0.414
creative_0M                            0.2641      0.202      1.305      0.192      -0.133       0.661
creative_0N                            0.0338      0.215      0.157      0.875      -0.388       0.456
creative_0O                           -0.1165      0.204     -0.572      0.567      -0.516       0.283
creative_0P                           -0.1690      0.205     -0.823      0.411      -0.571       0.234
creative_0Q                            0.1789      0.203      0.882      0.378      -0.218       0.576
creative_0R                            0.0667      0.201      0.333      0.739      -0.327       0.460
creative_0S                            0.2704      0.219      1.235      0.217      -0.159       0.700
creative_0T                           -0.0812      0.222     -0.366      0.714      -0.516       0.354
creative_0U                           -0.0194      0.195     -0.099      0.921      -0.402       0.364
creative_0V                            0.3497      0.210      1.668      0.095      -0.061       0.761
creative_0W                            0.2026      0.215      0.942      0.346      -0.219       0.624
creative_0X                           -0.0685      0.209     -0.327      0.744      -0.479       0.342
creative_0Y                            0.2967      0.213      1.393      0.164      -0.121       0.714
creative_0Z                            0.0815      0.216      0.378      0.705      -0.341       0.504
creative_AA                           -0.1201      0.221     -0.544      0.587      -0.553       0.313
creative_AB                           -0.1629      0.204     -0.800      0.424      -0.562       0.236
creative_AC                            0.0562      0.214      0.262      0.793      -0.364       0.476
creative_AD                            0.0365      0.202      0.181      0.856      -0.359       0.432
creative_AE                           -0.0705      0.233     -0.303      0.762      -0.527       0.385
creative_AF                           -0.3328      0.228     -1.463      0.144      -0.779       0.113
creative_AG                           -0.4761      0.231     -2.058      0.040      -0.929      -0.023
creative_AH                            0.7478      0.300      2.490      0.013       0.159       1.336
creative_AI                            1.0125      0.274      3.689      0.000       0.475       1.551
creative_AJ                            0.5270      0.261      2.017      0.044       0.015       1.039
creative_AK                           -0.0903      0.266     -0.339      0.735      -0.612       0.432
creative_AL                           -0.1976      0.447     -0.442      0.658      -1.074       0.678
creative_AM                            0.4864      0.286      1.701      0.089      -0.074       1.047
creative_AN                           -0.3202      0.333     -0.961      0.336      -0.973       0.333
creative_AO                            1.0520      0.303      3.476      0.001       0.459       1.645
creative_AP                            0.3261      0.537      0.608      0.544      -0.726       1.378
creative_AQ                            0.0479      0.226      0.212      0.832      -0.394       0.490
creative_AR                            0.3054      0.265      1.154      0.248      -0.213       0.824
creative_AS                            0.7248      0.239      3.028      0.002       0.256       1.194
creative_AT                            0.1157      0.310      0.373      0.709      -0.492       0.723
creative_AU                            0.0252      0.327      0.077      0.938      -0.615       0.666
creative_AV                           -0.1535      0.311     -0.493      0.622      -0.763       0.456
creative_AW                            0.4774      0.235      2.033      0.042       0.017       0.938
creative_AX                            0.8574      0.281      3.050      0.002       0.306       1.408
creative_AY                            0.2293      0.264      0.868      0.385      -0.288       0.747
creative_AZ                            0.5218      0.241      2.163      0.031       0.049       0.995
creative_BA                            0.1140      0.280      0.407      0.684      -0.434       0.662
creative_BB                            0.4577      0.250      1.829      0.067      -0.033       0.948
creative_BC                            0.4560      0.243      1.875      0.061      -0.021       0.933
creative_BD                            0.0846      0.362      0.233      0.816      -0.626       0.795
creative_BE                            0.5230      0.234      2.231      0.026       0.064       0.982
creative_BF                            0.3503      0.277      1.265      0.206      -0.192       0.893
creative_BG                            0.0088      0.308      0.029      0.977      -0.595       0.612
creative_BH                            0.2029      0.274      0.741      0.458      -0.333       0.739
creative_BI                            0.0742      0.593      0.125      0.900      -1.088       1.236
creative_BJ                            0.8528      0.360      2.371      0.018       0.148       1.558
creative_BK                           -0.0842      1.091     -0.077      0.939      -2.223       2.055
creative_BL                            0.1109      0.514      0.216      0.829      -0.897       1.119
creative_BM                           -0.5761      0.362     -1.589      0.112      -1.287       0.134
creative_BN                            0.8783      0.332      2.645      0.008       0.228       1.529
creative_BO                            0.2446      0.523      0.467      0.640      -0.781       1.270
creative_BP                            0.8144      0.312      2.607      0.009       0.202       1.427
creative_BQ                            1.5434      0.437      3.531      0.000       0.687       2.400
creative_BR                            0.1293      0.568      0.228      0.820      -0.983       1.242
creative_BS                            3.2876      0.515      6.383      0.000       2.278       4.297
creative_BT                            3.1864      0.441      7.220      0.000       2.321       4.051
creative_BU                            0.0683      0.560      0.122      0.903      -1.030       1.167
creative_BV                            0.1200      0.677      0.177      0.859      -1.206       1.446
creative_BW                            0.6160      0.420      1.467      0.142      -0.207       1.439
creative_BX                            0.4885      0.900      0.543      0.587      -1.276       2.253
creative_BY                            1.6000      0.300      5.329      0.000       1.012       2.189
creative_BZ                            1.3078      0.339      3.860      0.000       0.644       1.972
creative_CA                            0.7577      0.340      2.229      0.026       0.091       1.424
creative_CB                            0.5853      0.519      1.128      0.259      -0.431       1.602
creative_CC                            0.2051      0.569      0.360      0.719      -0.911       1.321
creative_CD                            0.0062      0.401      0.016      0.988      -0.779       0.791
creative_CE                            1.7741      1.083      1.638      0.101      -0.348       3.897
creative_CF                            0.0660      0.901      0.073      0.942      -1.699       1.831
creative_CG                            2.6940      0.778      3.461      0.001       1.168       4.220
creative_CH                           11.1984      1.380      8.114      0.000       8.493      13.904
==============================================================================
Omnibus:                    26097.634   Durbin-Watson:                   1.678
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         14201944.787
Skew:                           6.990   Prob(JB):                         0.00
Kurtosis:                     133.089   Cond. No.                     2.63e+10
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.63e+10. This might indicate that there are
strong multicollinearity or other numerical problems.

A small improvement over the previous model: about 42% of the variance is now explained.
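Since the end goal is to rank creatives, the most useful part of this summary is the block of creative fixed effects. Here is a minimal sketch for pulling them out of the fitted model, assuming the statsmodels results object behind the summary above is named results (the name is an assumption; it is whatever variable the .fit() call returned):

# Hedged sketch: rank creative fixed effects from the fitted statsmodels OLS result.
# `results` is assumed to be the RegressionResults object behind the summary above.
creative_coefs = results.params.filter(like='creative_')
creative_ci = results.conf_int().loc[creative_coefs.index]
creative_table = pd.DataFrame({
    'coef': creative_coefs,
    'ci_low': creative_ci[0],
    'ci_high': creative_ci[1],
    'pvalue': results.pvalues.loc[creative_coefs.index],
}).sort_values('coef', ascending=False)
creative_table.head(10)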

This can be further improved, as previously discussed.

second, assess model performance on separate train and test datasets
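The cells below assume the features have already been split into a counterfactual block (the airing context: length, impressions, dbr, hour, minutegroup, network, movie) and a treatment block (the creative dummies, with impressions kept as a weight column), and then into aligned train and test sets. Here is a minimal sketch of how that split could look; the exact column lists and RF_parameters are assumptions, since the earlier preprocessing cells are not repeated here:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Assumed feature blocks: the counterfactual describes the airing context,
# the treatment carries the creative dummies (impressions is kept as a weight column).
X_counterfactual = pd.get_dummies(
    df[['length', 'impressions', 'dbr', 'hour', 'minutegroup', 'network', 'movie_id']],
    columns=['network', 'movie_id'])
X_treatment = pd.get_dummies(df[['impressions', 'creative']], columns=['creative'])
y = df['ad_score_rate']

# One random split keeps the counterfactual rows, treatment rows and targets aligned
(counterfactual_train, counterfactual_test,
 treatment_train, treatment_test,
 y_train, y_test) = train_test_split(X_counterfactual, X_treatment, y,
                                     test_size=0.3, random_state=0)

# Illustrative hyperparameters; the actual RF_parameters are defined earlier in the notebook
RF_parameters = {'n_estimators': 100, 'min_samples_leaf': 20, 'n_jobs': -1}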

In [39]:
# step 1: first a random forest regressor to model the counterfactual

RFR_split = RandomForestRegressor(**RF_parameters)
RFR_split.fit(counterfactual_train, y_train, sample_weight=counterfactual_train['impressions'])
print('RFR counterfactual - train score: ', RFR_split.score(counterfactual_train, y_train, sample_weight=counterfactual_train['impressions']))
print('RFR counterfactual - test score: ', RFR_split.score(counterfactual_test, y_test, sample_weight=counterfactual_test['impressions']))
RFR counterfactual - train score:  0.5730210643001432
RFR counterfactual - test score:  0.36709830054888004
In [40]:
y_counterfactual_est_train = RFR_split.predict(counterfactual_train)
y_counterfactual_est_test = RFR_split.predict(counterfactual_test)

# step 2: linear regression on f(counterfactual) and the treatment (creative) features
treatment_train = treatment_train.copy()  # explicit copies avoid the SettingWithCopyWarning
treatment_test = treatment_test.copy()
treatment_train['y_counterfactual_est'] = y_counterfactual_est_train
treatment_test['y_counterfactual_est'] = y_counterfactual_est_test

LR_2stepmodel_split = LinearRegression()

LR_2stepmodel_split.fit(treatment_train.drop('impressions', axis=1), y_train, sample_weight=treatment_train['impressions'])

print('LR second step of the model - train score: ', LR_2stepmodel_split.score(treatment_train.drop('impressions', axis=1), y_train, sample_weight=treatment_train['impressions']))
print('LR second step of the model - test score: ', LR_2stepmodel_split.score(treatment_test.drop('impressions', axis=1), y_test, sample_weight=treatment_test['impressions']))
LR second step of the model - train score:  0.5956589349640307
LR second step of the model - test score:  0.3351340314388038
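As with the pure fixed-effect model, the creative comparison ultimately lives in the second-stage coefficients: each creative dummy's coefficient is its estimated lift in ad_score_rate once the random forest has absorbed the airing context. A minimal sketch for reading them out, assuming the treatment columns are named creative_* as in the summary above:

# Hedged sketch: pair each second-stage feature with its fitted coefficient.
second_stage_cols = treatment_train.drop('impressions', axis=1).columns
coef_table = pd.Series(LR_2stepmodel_split.coef_, index=second_stage_cols)

# If the first stage is well calibrated, the coefficient on y_counterfactual_est should be close to 1
print(coef_table['y_counterfactual_est'])

# Rank creatives by their estimated lift over the counterfactual
print(coef_table.filter(like='creative_').sort_values(ascending=False).head(10))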
In [41]:
# see how a simple linear model performs with these new features

LR_split = LinearRegression()
LR_split.fit(X2_train, y_train, sample_weight=X2_train['impressions'])
print('LR only - train score: ', LR_split.score(X2_train, y_train, sample_weight=X2_train['impressions']))
print('LR only - test score: ', LR_split.score(X2_test, y_test, sample_weight=X2_test['impressions']))
LR only - train score:  0.273825626308408
LR only - test score:  0.258059128286668

Usually at this stage I'd look at learning curves to tell whether more data or more model capacity is needed. However, sklearn's learning_curve tool doesn't seem to support sample_weight, so I'll skip it for now.
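A workaround is to build the learning curve by hand: refit on growing subsamples of the training set and score train and test with the impression weights. A rough sketch, reusing the split and RF_parameters assumed above:

# Hedged sketch of a weighted learning curve, since sklearn's learning_curve
# does not forward sample_weight to fit and score.
def weighted_learning_curve(X_tr, y_tr, X_te, y_te, sizes, seed=0):
    rng = np.random.RandomState(seed)
    train_scores, test_scores = [], []
    for n in sizes:
        idx = rng.choice(len(X_tr), size=n, replace=False)
        model = RandomForestRegressor(**RF_parameters)
        model.fit(X_tr.iloc[idx], y_tr.iloc[idx],
                  sample_weight=X_tr['impressions'].iloc[idx])
        train_scores.append(model.score(X_tr.iloc[idx], y_tr.iloc[idx],
                                        sample_weight=X_tr['impressions'].iloc[idx]))
        test_scores.append(model.score(X_te, y_te, sample_weight=X_te['impressions']))
    return train_scores, test_scores

sizes = [1000, 2000, 5000, 10000, len(counterfactual_train)]
train_s, test_s = weighted_learning_curve(counterfactual_train, y_train,
                                          counterfactual_test, y_test, sizes)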

What we can already see is that the linear model's test score is close to its train score, so I wouldn't expect it to improve much once we scale to the full dataset. Additional features may help, but for now it looks like a high-bias model.

The two-step model, in contrast, looks high-variance, with a sizeable gap between train and test scores. More data would probably help. If the model were still high-variance on the full dataset, we should work on reducing overfitting.
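If the gap persists on the full dataset, the usual lever is to constrain the first-stage forest (shallower trees, larger leaves), validated with cross-validation on the counterfactual step. A minimal sketch with purely illustrative grid values; note that GridSearchCV forwards sample_weight to fit, but its internal CV scores stay unweighted:

from sklearn.model_selection import GridSearchCV

# Hedged sketch: tame first-stage variance by constraining tree growth.
param_grid = {'max_depth': [5, 10, 20], 'min_samples_leaf': [20, 50, 100]}
search = GridSearchCV(RandomForestRegressor(n_estimators=100, n_jobs=-1),
                      param_grid, cv=5, scoring='r2')
search.fit(counterfactual_train, y_train,
           sample_weight=counterfactual_train['impressions'])
print(search.best_params_, search.best_score_)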

A thought about scaling: the training complexity of a random forest is roughly O(n_trees × n_features_per_split × n_samples × log(n_samples)), so the n log n dependence on sample size should scale well to the full dataset.
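A quick way to sanity-check that claim on this data would be to time the first-stage fit on growing subsamples; a rough sketch, not a benchmark:

import time

# Rough timing check of how the first-stage fit scales with sample size
for n in [2000, 5000, 10000, len(counterfactual_train)]:
    subset = counterfactual_train.iloc[:n]
    start = time.time()
    RandomForestRegressor(**RF_parameters).fit(subset, y_train.iloc[:n],
                                               sample_weight=subset['impressions'])
    print(n, 'rows:', round(time.time() - start, 2), 's')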