Example: Train/test/evaluate pipeline with BCDict#

from pprint import pprint
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

import bcdict
from bcdict import BCDict

np.set_printoptions(precision=2)
pd.options.display.precision = 2

Generate random data#

Let’s start by generating some random data.

First of all, a function that returns a random DataFrame with 4 feature columns and one target column:

np.random.seed(42)

def get_random_data():
    """Just create some random data."""
    columns = list("ABCD") + ["target"]
    nrows = np.random.randint(10, 25)
    df = pd.DataFrame(
        np.random.random((nrows, len(columns))) + 0.01, 
        columns=columns,
    )
    return df

We will work with three different datasets:

keys = ["apples", "pears", "bananas"]

First BCDict magic#

Now, generate a dictionary with 3 entries of random data.

The bootstrap() function calls a given function once for every item in a list and returns a BCDict mapping each item to the result:

dfs = bcdict.bootstrap(keys, get_random_data)
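Conceptually, bootstrap() behaves like the following plain dict comprehension wrapped in a BCDict (a sketch for illustration, not the library's actual implementation):

# sketch: call the factory function once per key
# (the values differ from dfs, since fresh random numbers are drawn)
dfs_sketch = BCDict({key: get_random_data() for key in keys})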

dfs is a broadcast dict with the keys apples, pears and bananas.

Its values are DataFrames of random values.

We can now call arbitrary methods on the BCDict.

The method is called on every value of the dictionary, and a new BCDict with the results of the calls is returned.

Let’s try with the head() function:

pprint(dfs.head(3))
{'apples':       A     B     C     D  target
0  0.81  0.19  0.79  0.61    0.46
1  0.11  0.47  0.34  0.15    0.66
2  0.07  0.73  0.95  0.01    1.00,
 'bananas':       A     B     C     D  target
0  0.72  0.82  0.36  0.11    0.95
1  0.41  0.53  0.85  0.69    0.75
2  0.22  0.55  0.71  0.24    0.18,
 'pears':       A     B     C     D  target
0  0.63  0.34  0.07  0.32    0.34
1  0.74  0.65  0.90  0.48    0.13
2  0.72  0.77  0.57  0.78    0.50}

We can also access attributes the same way. The following line returns the shape attribute of every value in the dictionary:

dfs.shape
BCDict({'apples': (16, 5), 'pears': (18, 5), 'bananas': (12, 5)})
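Both the method call and the attribute access are simply broadcast over the values. A hand-written equivalent would look roughly like this (a sketch for illustration):

# sketch: what the two broadcasts above do, conceptually
heads = {k: df.head(3) for k, df in dfs.items()}
shapes = {k: df.shape for k, df in dfs.items()}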

Indexing and column selection#

We can also slice all values in the dictionary at once.

We’ll use this to get a dictionary of Series holding the target column, and a dictionary of DataFrames holding all feature columns (y and X in sklearn terminology).

Here we select the ‘target’ column and save it in y:

y = dfs['target']
y.shape
BCDict({'apples': (16,), 'pears': (18,), 'bananas': (12,)})

And we get all X dataframes by dropping the target column:

X = dfs.drop(columns="target")
X.shape
BCDict({'apples': (16, 4), 'pears': (18, 4), 'bananas': (12, 4)})
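As before, the indexing is simply forwarded to every value (a sketch for illustration):

# sketch: item access and drop() are forwarded to each DataFrame
y_sketch = {k: df["target"] for k, df in dfs.items()}
X_sketch = {k: df.drop(columns="target") for k, df in dfs.items()}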

Split the data into train and test#

Using the apply() function, we can apply an arbitrary function to the values of the dictionaries:

from sklearn.model_selection import train_test_split

splits = bcdict.apply(train_test_split, X, y)
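Conceptually, apply() calls the function once per key with the matching values (a sketch; the exact broadcasting rules are explained further below):

# sketch: one train_test_split call per key
splits_sketch = {k: train_test_split(X[k], y[k]) for k in X}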

Each entry in the dictionary now contains a list with X_train, X_test, y_train, y_test:

splits['apples']
[       A     B     C     D
 9   0.77  0.44  0.22  0.58
 6   0.69  0.46  0.02  0.95
 13  0.29  0.31  0.18  0.03
 7   0.40  0.03  0.24  0.25
 10  0.85  0.46  0.41  0.94
 3   0.63  0.62  0.02  0.03
 2   0.07  0.73  0.95  0.01
 14  0.40  0.30  0.02  0.21
 4   0.41  0.06  0.98  0.24
 15  0.80  0.62  0.94  0.66
 11  0.34  0.58  0.53  0.97
 0   0.81  0.19  0.79  0.61,
        A     B     C     D
 8   0.62  0.84  0.18  0.40
 5   0.63  0.39  0.99  0.48
 1   0.11  0.47  0.34  0.15
 12  0.76  0.55  0.60  0.98,
 9     0.04
 6     0.57
 13    0.43
 7     0.69
 10    0.74
 3     0.53
 2     1.00
 14    0.72
 4     0.10
 15    0.92
 11    0.85
 0     0.46
 Name: target, dtype: float64,
 8     0.19
 5     0.87
 1     0.66
 12    0.62
 Name: target, dtype: float64]

Unpacking dictionaries#

A dictionary whose values are tuples or lists can be unpacked.

So instead of one dictionary holding 4-element lists, we get 4 separate dictionaries:

X_train, X_test, y_train, y_test = splits.unpack()
X_train.shape, y_train.shape, X_test.shape, y_test.shape
(BCDict({'apples': (12, 4), 'pears': (13, 4), 'bananas': (9, 4)}),
 BCDict({'apples': (12,), 'pears': (13,), 'bananas': (9,)}),
 BCDict({'apples': (4, 4), 'pears': (5, 4), 'bananas': (3, 4)}),
 BCDict({'apples': (4,), 'pears': (5,), 'bananas': (3,)}))
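Conceptually, unpack() transposes the dictionary of 4-element lists into 4 separate BCDicts, one per position (a sketch for illustration):

# sketch: transpose {key: [a, b, c, d]} into four dicts
X_train_sketch = BCDict({k: v[0] for k, v in splits.items()})
X_test_sketch = BCDict({k: v[1] for k, v in splits.items()})
y_train_sketch = BCDict({k: v[2] for k, v in splits.items()})
y_test_sketch = BCDict({k: v[3] for k, v in splits.items()})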

Create models#

Let us now create an (unfitted) linear regression model for each key. We use the bootstrap() function again:

models = bcdict.bootstrap(keys, LinearRegression)
models
BCDict({'apples': LinearRegression(), 'pears': LinearRegression(), 'bananas': LinearRegression()})

… and train all three models:

models.fit(X_train, y_train)
pprint(models.coef_)
{'apples': array([-0.48,  0.63,  0.07,  0.22]),
 'bananas': array([ 0.15,  0.08, -0.33,  0.49]),
 'pears': array([0.62, 0.2 , 0.12, 0.24])}

We have just fitted 3 models without a for loop or any code repetition!

Make predictions…#

…and demonstrate argument broadcasting

Apply each model to the correct dataset:

preds = models.predict(X_test)
preds
BCDict({'apples': array([0.8 , 0.58, 0.77, 0.7 ]), 'pears': array([0.63, 0.43, 0.21, 0.37, 0.63]), 'bananas': array([0.45, 0.58, 0.62])})

models is a BCDict.

X_test is a dictionary with the same keys as models.

When calling the predict() function, the X_test argument gets broadcast.

The above line is equivalent to:

preds = {k: model.predict(X_test[k]) for k, model in models.items()}

Evaluate the predictions#

# score each model's predictions on its held-out test set
scores = bcdict.apply(r2_score, y_test, preds)
pprint(scores)
{'apples': -0.9429580126630714,
 'bananas': -0.9640958793433909,
 'pears': -1.3930314279359615}

The apply() function applies a callable (in this case, r2_score) to each element of a BCDict.

The above line is equivalent to:

scores = {k: r2_score(y_test[k], preds[k]) for k in preds}

The first broadcast dictionary in the arguments determines the keys of the output dictionary. All other arguments are either passed on unmodified, or they are broadcast if they are also a BCDict with the same keys.
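For example, a plain keyword argument is passed unchanged to every call, while BCDict arguments are looked up per key. A sketch (multioutput is a regular r2_score parameter, not part of the original example):

# the plain string is passed as-is to every call;
# y_test and preds are BCDicts with matching keys, so they are broadcast
scores2 = bcdict.apply(r2_score, y_test, preds, multioutput="uniform_average")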

Conclusion: not a single for loop or dict comprehension was used to train, predict with, and evaluate 3 models :)

Cross validation#

Of course, we can also run cross validation on all our datasets:

from sklearn.model_selection import cross_val_score
models = bcdict.bootstrap(keys, LinearRegression)
res = bcdict.apply(cross_val_score, models, X, y, cv=3)
pprint(res)
{'apples': array([-1.99, -1.96, -0.38]),
 'bananas': array([-0.91, -2.28, -1.55]),
 'pears': array([-6.94, -2.62, -0.59])}
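Since the values of res are plain NumPy arrays, broadcast method calls work on them too. For instance, to average the fold scores per dataset (a small add-on, not part of the original example):

# broadcast .mean() over each array of fold scores
res.mean()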

Conclusion#

We just created a pipeline to train a model, generate predictions, and validate the model for each of three datasets.

And we did that without writing a single for-loop!