{ "cells": [ { "cell_type": "markdown", "id": "f5d984cd-a355-45af-96b5-e63ae8b9cda8", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "# Example: Train/test/evaluate pipeline with `BCDict`" ] }, { "cell_type": "code", "execution_count": 17, "id": "917781ee-58af-4bf6-918b-372e9e130f09", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from pprint import pprint\n", "import math\n", "import pandas as pd\n", "import numpy as np\n", "from typing import Collection\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.metrics import r2_score\n", "\n", "import bcdict\n", "from bcdict import BCDict\n", "\n", "np.set_printoptions(precision=2)\n", "pd.options.display.precision = 2" ] }, { "cell_type": "markdown", "id": "a539d853-9557-444d-81e1-f33589ac19bc", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "# Generate random data\n", "\n", "Let's start by generating some random data.\n", "\n", "First, we define a function that returns a random DataFrame with 4 feature columns and one target column:" ] }, { "cell_type": "code", "execution_count": 18, "id": "d7011ec8-f9d1-46b7-a45e-448788b2eeb2", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "np.random.seed(42)\n", "\n", "def get_random_data():\n", "    \"\"\"Return a DataFrame of random values with 10 to 24 rows.\"\"\"\n", "    columns = list(\"ABCD\") + [\"target\"]\n", "    nrows = np.random.randint(10, 25)\n", "    df = pd.DataFrame(\n", "        np.random.random((nrows, len(columns))) + 0.01,\n", "        columns=columns,\n", "    )\n", "    return df" ] }, { "cell_type": "markdown", "id": "e126f051-5486-4b8a-80e4-d6b694af9707", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "We will work with three different datasets:" ] }, { "cell_type": "code", "execution_count": 19, "id": "7dfede4f-7424-4f44-956f-0204da1e653d", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "keys = [\"apples\", \"pears\", \"bananas\"]" ] }, { "cell_type": "markdown", 
"id": "7f365bdd-6037-4bef-9f0e-190fea273683", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## First BCDict magic\n", "\n", "Now, generate a dictionary with 3 entries of random data.\n", "\n", "The `bootstrap()` function calls a function once for every key in a list and returns a BCDict that maps each key to the result:" ] }, { "cell_type": "code", "execution_count": 20, "id": "4e5ac9b6-a029-4e0a-81e5-8b50ecbe26be", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "dfs = bcdict.bootstrap(keys, get_random_data)" ] }, { "cell_type": "markdown", "source": [ "`dfs` is a broadcast dict with keys apples, pears and bananas.\n", "\n", "Its values are DataFrames of random values.\n", "\n", "We can now call the values' methods directly on the BCDict.\n", "\n", "The method is called on every value of the dictionary, and the results are returned as a new dictionary with the same keys.\n", "\n", "Let's try with the `head()` method:" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 21, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'apples': A B C D target\n", "0 0.81 0.19 0.79 0.61 0.46\n", "1 0.11 0.47 0.34 0.15 0.66\n", "2 0.07 0.73 0.95 0.01 1.00,\n", " 'bananas': A B C D target\n", "0 0.72 0.82 0.36 0.11 0.95\n", "1 0.41 0.53 0.85 0.69 0.75\n", "2 0.22 0.55 0.71 0.24 0.18,\n", " 'pears': A B C D target\n", "0 0.63 0.34 0.07 0.32 0.34\n", "1 0.74 0.65 0.90 0.48 0.13\n", "2 0.72 0.77 0.57 0.78 0.50}\n" ] } ], "source": [ "pprint(dfs.head(3))" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "We can also access attributes the same way. 
The following line returns the `shape` attribute of every value in the dictionary:" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 22, "outputs": [ { "data": { "text/plain": "BCDict({'apples': (16, 5), 'pears': (18, 5), 'bananas': (12, 5)})" }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfs.shape" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "# Indexing and column selection\n", "\n", "We can also index or slice all values in the dictionary at once.\n", "\n", "We'll use this to get a dictionary of Series holding the target column and a dictionary of DataFrames holding all features (`y` and `X` in sklearn terminology).\n", "\n", "Here we select the 'target' column and save it in `y`:\n" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 23, "outputs": [ { "data": { "text/plain": "BCDict({'apples': (16,), 'pears': (18,), 'bananas': (12,)})" }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y = dfs['target']\n", "y.shape" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "And we get all `X` DataFrames by dropping the target column:" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 24, "outputs": [ { "data": { "text/plain": "BCDict({'apples': (16, 4), 'pears': (18, 4), 'bananas': (12, 4)})" }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X = dfs.drop(columns=\"target\")\n", "X.shape" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "# Split the data into train and test\n", "\n", "Using the `apply()` function we can apply arbitrary functions on the dictionaries:" ], "metadata": 
{ "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 25, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "splits = bcdict.apply(train_test_split, X, y)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Each entry in the dictionary now contains a list with X_train, X_test, y_train, y_test:" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 26, "outputs": [ { "data": { "text/plain": "[ A B C D\n 9 0.77 0.44 0.22 0.58\n 6 0.69 0.46 0.02 0.95\n 13 0.29 0.31 0.18 0.03\n 7 0.40 0.03 0.24 0.25\n 10 0.85 0.46 0.41 0.94\n 3 0.63 0.62 0.02 0.03\n 2 0.07 0.73 0.95 0.01\n 14 0.40 0.30 0.02 0.21\n 4 0.41 0.06 0.98 0.24\n 15 0.80 0.62 0.94 0.66\n 11 0.34 0.58 0.53 0.97\n 0 0.81 0.19 0.79 0.61,\n A B C D\n 8 0.62 0.84 0.18 0.40\n 5 0.63 0.39 0.99 0.48\n 1 0.11 0.47 0.34 0.15\n 12 0.76 0.55 0.60 0.98,\n 9 0.04\n 6 0.57\n 13 0.43\n 7 0.69\n 10 0.74\n 3 0.53\n 2 1.00\n 14 0.72\n 4 0.10\n 15 0.92\n 11 0.85\n 0 0.46\n Name: target, dtype: float64,\n 8 0.19\n 5 0.87\n 1 0.66\n 12 0.62\n Name: target, dtype: float64]" }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splits['apples']" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "## Unpacking dictionaries\n", "\n", "A dictionary with a tuple or a list in each value can be unpacked.\n", "\n", "So instead of one dictionary with tuples of 4 values we get 4 separate dictionaries:" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 27, "outputs": [ { "data": { "text/plain": "(BCDict({'apples': (12, 4), 'pears': (13, 4), 'bananas': (9, 4)}),\n BCDict({'apples': (12,), 'pears': (13,), 'bananas': (9,)}),\n BCDict({'apples': (4, 4), 'pears': (5, 4), 
'bananas': (3, 4)}),\n BCDict({'apples': (4,), 'pears': (5,), 'bananas': (3,)}))" }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train, X_test, y_train, y_test = splits.unpack()\n", "X_train.shape, y_train.shape, X_test.shape, y_test.shape" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "# Create models\n", "\n", "Let us now create an (unfitted) linear regression model for each key. We use the `bootstrap()` function again:" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 28, "outputs": [ { "data": { "text/plain": "BCDict({'apples': LinearRegression(), 'pears': LinearRegression(), 'bananas': LinearRegression()})" }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "models = bcdict.bootstrap(keys, LinearRegression)\n", "models" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "... and train all three models:" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 29, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'apples': array([-0.48, 0.63, 0.07, 0.22]),\n", " 'bananas': array([ 0.15, 0.08, -0.33, 0.49]),\n", " 'pears': array([0.62, 0.2 , 0.12, 0.24])}\n" ] } ], "source": [ "models.fit(X_train, y_train)\n", "pprint(models.coef_)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "We have just fitted 3 models without a for loop or any code repetition!" 
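, "\n",
"\n",
"For comparison, the broadcast `fit()` call replaces roughly the explicit loop sketched below, written with plain dictionaries. This is an illustration only, not bcdict code: `MeanModel` and the toy data are hypothetical stand-ins, chosen so that the snippet runs without scikit-learn.\n",
"\n",
"```python\n",
"from statistics import mean\n",
"\n",
"\n",
"class MeanModel:\n",
"    # Hypothetical stand-in for an estimator: it always predicts the training mean.\n",
"\n",
"    def fit(self, X, y):\n",
"        self.mean_ = mean(y)\n",
"        return self\n",
"\n",
"    def predict(self, X):\n",
"        return [self.mean_ for _ in X]\n",
"\n",
"\n",
"keys = ['apples', 'pears', 'bananas']\n",
"\n",
"# Toy per-key training data, made up for this illustration.\n",
"X_train = {k: [[1.0], [2.0], [3.0]] for k in keys}\n",
"y_train = {\n",
"    'apples': [1.0, 2.0, 3.0],\n",
"    'pears': [2.0, 4.0, 6.0],\n",
"    'bananas': [0.0, 0.0, 3.0],\n",
"}\n",
"\n",
"# One model per key, then an explicit loop that fits each model on its own data.\n",
"models = {k: MeanModel() for k in keys}\n",
"for k, model in models.items():\n",
"    model.fit(X_train[k], y_train[k])\n",
"\n",
"# Collecting a fitted attribute also needs its own comprehension.\n",
"fitted_means = {k: model.mean_ for k, model in models.items()}\n",
"```"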
], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "markdown", "source": [ "# Make predictions...\n", "\n", "*...and demonstrate argument broadcast*" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "markdown", "source": [ "Apply each model to the correct dataset:" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 30, "outputs": [ { "data": { "text/plain": "BCDict({'apples': array([0.8 , 0.58, 0.77, 0.7 ]), 'pears': array([0.63, 0.43, 0.21, 0.37, 0.63]), 'bananas': array([0.45, 0.58, 0.62])})" }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preds = models.predict(X_test)\n", "preds" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "`models` is a BCDict.\n", "\n", "`X_test` is a BCDict with the same keys as `models`.\n", "\n", "When calling the `predict()` method, the `X_test` argument gets *broadcast*: each model receives the entry of `X_test` stored under its own key.\n", "\n", "The above line is equivalent to:\n", "\n", "```python\n", "preds = {k: model.predict(X_test[k]) for k, model in models.items()}\n", "```" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "markdown", "source": [ "# Evaluate the predictions" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 31, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'apples': -0.9429580126630726,\n", " 'bananas': -0.9640958793433909,\n", " 'pears': -1.393031427935962}\n" ] } ], "source": [ "# score each dataset's predictions against its test targets\n", "scores = bcdict.apply(r2_score, y_test, preds)\n", "pprint(scores)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "The `apply()` function applies a callable (in this case, `r2_score`) on each element 
of a BCDict.\n", "\n", "The above line is equivalent to:\n", "\n", "```python\n", "scores = {k: r2_score(y_test[k], preds[k]) for k in y_test}\n", "```\n", "\n", "The *first* broadcast dictionary in the arguments determines the keys of the output dictionary. All other arguments are either passed on unmodified, or they are broadcast if they are also a BCDict with the same keys.\n", "\n", "\n", "Conclusion: we trained 3 models, generated predictions and evaluated them on 3 datasets without a single for loop or dict comprehension :)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "markdown", "id": "92f3860f-e25d-4f21-8cd9-51ff63032d41", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Cross validation\n", "\n", "Of course, we can also run cross-validation on all our datasets:" ] }, { "cell_type": "code", "execution_count": 32, "id": "b5e77ca2-c303-48ef-bdb6-1c581f8a1504", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'apples': array([-1.99, -1.96, -0.38]),\n", " 'bananas': array([-0.91, -2.28, -1.55]),\n", " 'pears': array([-6.94, -2.62, -0.59])}\n" ] } ], "source": [ "from sklearn.model_selection import cross_val_score\n", "models = bcdict.bootstrap(keys, LinearRegression)\n", "res = bcdict.apply(cross_val_score, models, X, y, cv=3)\n", "pprint(res)" ] }, { "cell_type": "markdown", "id": "51519eea-e34f-415e-a76f-92f3c66c940f", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "# Conclusion\n", "\n", "We just created a pipeline to train a model, generate predictions *and* validate the model for three datasets.\n", "\n", "And we did that without writing a single for-loop!" 
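, "\n",
"\n",
"To make the broadcasting idea concrete, here is a minimal pure-Python sketch of how such a dictionary could work. It is an illustration under simplifying assumptions, not the bcdict implementation: `TinyBroadcastDict`, `pick` and `broadcast_apply` are hypothetical names, and the sketch does not check that broadcast arguments share the same keys.\n",
"\n",
"```python\n",
"def pick(arg, key):\n",
"    # Broadcast rule of this sketch: a TinyBroadcastDict argument is looked up\n",
"    # per key, anything else is passed on unchanged.\n",
"    return arg[key] if isinstance(arg, TinyBroadcastDict) else arg\n",
"\n",
"\n",
"def broadcast_apply(func, first, *args, **kwargs):\n",
"    # The first dictionary determines the keys of the result.\n",
"    return TinyBroadcastDict({\n",
"        k: func(v, *(pick(a, k) for a in args),\n",
"                **{n: pick(a, k) for n, a in kwargs.items()})\n",
"        for k, v in first.items()\n",
"    })\n",
"\n",
"\n",
"class TinyBroadcastDict(dict):\n",
"    # Hypothetical, simplified illustration; not the real BCDict class.\n",
"\n",
"    def __getattr__(self, name):\n",
"        # Only reached when normal attribute lookup on the dict itself fails.\n",
"        if name.startswith('__'):\n",
"            raise AttributeError(name)\n",
"        attrs = {k: getattr(v, name) for k, v in self.items()}\n",
"        if attrs and all(callable(a) for a in attrs.values()):\n",
"            # Method access: return a function that calls the method per key.\n",
"            return lambda *args, **kwargs: broadcast_apply(\n",
"                lambda v, *a, **kw: getattr(v, name)(*a, **kw),\n",
"                self, *args, **kwargs\n",
"            )\n",
"        # Plain attribute access: collect the attribute of every value.\n",
"        return TinyBroadcastDict(attrs)\n",
"\n",
"\n",
"words = TinyBroadcastDict({'apples': 'red fruit', 'pears': 'green fruit'})\n",
"suffix = TinyBroadcastDict({'apples': 'fruit', 'pears': 'green'})\n",
"\n",
"upper = words.upper()                  # method call on every value\n",
"ends = words.endswith(suffix)          # dict argument is broadcast per key\n",
"lengths = broadcast_apply(len, words)  # plain function on every value\n",
"```"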
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }