
Scikit-Learn Training Cheat Sheet


Paris Saclay Center for Data Science

RAMP on predicting the number of air passengers

Balázs Kégl (LAL/CNRS), Alex Gramfort (Inria), Djalel Benbouzid (UPMC), Mehdi Cherti (LAL/CNRS)

Introduction

The data set was donated to us by an unnamed company handling flight ticket reservations. The data is thin; it contains:

  • the date of departure
  • the departure airport
  • the arrival airport
  • the mean and standard deviation of the number of weeks before departure at which the reservations were made
  • a field called log_PAX which is related to the number of passengers (the actual numbers were changed for privacy reasons)

The goal is to predict the log_PAX column. The prediction quality is measured by RMSE.
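
For reference, a minimal sketch of the RMSE we will report throughout (assuming NumPy arrays y_true and y_pred of equal length):

In [ ]:
import numpy as np

def rmse(y_true, y_pred):
    # root mean squared error: square root of the average squared residual
    return np.sqrt(np.mean((y_true - y_pred) ** 2))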

The data is obviously limited, but since date and location information is available, it can be joined to external data sets. The challenge in this RAMP is to find good data that can be correlated to flight traffic.

In [ ]:
%matplotlib inline
import os
import importlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)

Load the dataset using pandas

The training and testing data are located in the folder data. They are compressed CSV files (i.e. csv.bz2). We can load the dataset using pandas.

In [ ]:
data = pd.read_csv(
    os.path.join('data', 'train.csv.bz2'), parse_dates=[0]
)
In [ ]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8902 entries, 0 to 8901
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   DateOfDeparture   8902 non-null   datetime64[ns]
 1   Departure         8902 non-null   object        
 2   Arrival           8902 non-null   object        
 3   WeeksToDeparture  8902 non-null   float64       
 4   log_PAX           8902 non-null   float64       
 5   std_wtd           8902 non-null   float64       
dtypes: datetime64[ns](1), float64(3), object(2)
memory usage: 417.4+ KB

So as stated earlier, the column log_PAX is the target for our regression problem and the other columns are the features used for prediction. If we focus on the dtypes, we can see that Departure and Arrival are of object dtype, meaning they are strings, while DateOfDeparture was already parsed as datetime64[ns] because we passed parse_dates=[0] to read_csv.

In [ ]:
data[['DateOfDeparture', 'Departure', 'Arrival']].head()
Out[ ]:
DateOfDeparture Departure Arrival
0 2012-06-19 ORD DFW
1 2012-09-10 LAS DEN
2 2012-10-05 DEN LAX
3 2011-10-09 ATL ORD
4 2012-02-21 DEN SFO

While Departure and Arrival are simply airport codes, DateOfDeparture needs to be a date rather than a string. Had we loaded the file without parse_dates, we could convert the column with pandas:

In [ ]:
data.loc[:, 'DateOfDeparture'] = pd.to_datetime(data.loc[:, 'DateOfDeparture'])
In [ ]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8902 entries, 0 to 8901
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   DateOfDeparture   8902 non-null   datetime64[ns]
 1   Departure         8902 non-null   object        
 2   Arrival           8902 non-null   object        
 3   WeeksToDeparture  8902 non-null   float64       
 4   log_PAX           8902 non-null   float64       
 5   std_wtd           8902 non-null   float64       
dtypes: datetime64[ns](1), float64(3), object(2)
memory usage: 417.4+ KB

When you create a submission, ramp-workflow will load the data for you and split it into a data matrix X and a target vector y. It will also take care of splitting the data into training and testing sets. These utilities are available in the module problem.py, which we now import.

In [ ]:
import problem

The function get_train_data() loads the training data and returns a pandas dataframe X and a numpy vector y.

In [ ]:
X, y = problem.get_train_data()
In [ ]:
type(X)
Out[ ]:
pandas.core.frame.DataFrame
In [ ]:
type(y)
Out[ ]:
numpy.ndarray

We can check the information of the DataFrame X:

In [ ]:
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8902 entries, 0 to 8901
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   DateOfDeparture   8902 non-null   object 
 1   Departure         8902 non-null   object 
 2   Arrival           8902 non-null   object 
 3   WeeksToDeparture  8902 non-null   float64
 4   std_wtd           8902 non-null   float64
dtypes: float64(2), object(3)
memory usage: 347.9+ KB

It is important to note that ramp-workflow does not convert the DateOfDeparture column to datetime format, so keep in mind that you might need to make this conversion when prototyping your machine learning pipeline later on. Let's check some statistics regarding our dataset.

In [ ]:
print(min(X['DateOfDeparture']))
print(max(X['DateOfDeparture']))
2011-09-01
2013-03-05
In [ ]:
X['Departure'].unique()
Out[ ]:
array(['ORD', 'LAS', 'DEN', 'ATL', 'SFO', 'EWR', 'IAH', 'LAX', 'DFW',
       'SEA', 'JFK', 'PHL', 'MIA', 'DTW', 'BOS', 'MSP', 'CLT', 'MCO',
       'PHX', 'LGA'], dtype=object)
In [ ]:
_ = plt.hist(y, bins=50)
In [ ]:
_ = X.hist('std_wtd', bins=50)
In [ ]:
_ = X.hist('WeeksToDeparture', bins=50)
In [ ]:
X.describe()
Out[ ]:
WeeksToDeparture std_wtd
count 8902.000000 8902.000000
mean 11.446469 8.617773
std 2.787140 2.139604
min 2.625000 2.160247
25% 9.523810 7.089538
50% 11.300000 8.571116
75% 13.240000 10.140521
max 23.163265 15.862216
In [ ]:
X.shape
Out[ ]:
(8902, 5)
In [ ]:
print(y.mean())
print(y.std())
10.99904767212102
0.9938894125318564

Preprocessing dates

Getting dates into numerical columns is a common operation when time series data is analyzed with non-parametric predictors. The code below makes the following transformations:

  • numerical columns for year (2011-2013), month (1-12), day of the month (1-31), day of the week (0-6), and week of the year (1-52)
  • number of days since 1970-01-01
In [ ]:
# Make a copy of the original data to avoid writing on the original data
X_encoded = X.copy()

# following http://stackoverflow.com/questions/16453644/regression-with-date-variable-using-scikit-learn
X_encoded['DateOfDeparture'] = pd.to_datetime(X_encoded['DateOfDeparture'])
X_encoded['year'] = X_encoded['DateOfDeparture'].dt.year
X_encoded['month'] = X_encoded['DateOfDeparture'].dt.month
X_encoded['day'] = X_encoded['DateOfDeparture'].dt.day
X_encoded['weekday'] = X_encoded['DateOfDeparture'].dt.weekday
X_encoded['week'] = X_encoded['DateOfDeparture'].dt.isocalendar().week
X_encoded['n_days'] = X_encoded['DateOfDeparture'].apply(lambda date: (date - pd.to_datetime("1970-01-01")).days)
In [ ]:
X_encoded.tail(5)
Out[ ]:
DateOfDeparture Departure Arrival WeeksToDeparture std_wtd year month day weekday week n_days
8897 2011-10-02 DTW ATL 9.263158 7.316967 2011 10 2 6 39 15249
8898 2012-09-25 DFW ORD 12.772727 10.641034 2012 9 25 1 39 15608
8899 2012-01-19 SFO LAS 11.047619 7.908705 2012 1 19 3 3 15358
8900 2013-02-03 ORD PHL 6.076923 4.030334 2013 2 3 6 5 15739
8901 2011-11-26 DTW ATL 9.526316 6.167733 2011 11 26 5 47 15304

We will perform all preprocessing steps within a scikit-learn pipeline, which chains together transformation and estimator steps. This offers convenience and safety (it helps avoid leaking statistics from your test data into the trained model during cross-validation), and the whole pipeline can be evaluated with cross_val_score.

To perform the above encoding within a scikit-learn pipeline, we write a function and wrap it in a FunctionTransformer to make it compatible with the scikit-learn API. The function below also adds a binary hol_days column that flags US public holidays, using the holidays package.

In [ ]:
from sklearn.preprocessing import FunctionTransformer
import holidays

def _encode_dates(X):
    # With pandas < 1.0, we will get a SettingWithCopyWarning
    # In our case, we will avoid this warning by triggering a copy
    # More information can be found at:
    # https://github.com/scikit-learn/scikit-learn/issues/16191
    X_encoded = X.copy()
    us_holidays = holidays.US()
    
    # Make sure that DateOfDeparture is of datetime format
    X_encoded.loc[:, 'DateOfDeparture'] = pd.to_datetime(X_encoded['DateOfDeparture'])
    # Encode the DateOfDeparture
    X_encoded.loc[:, 'year'] = X_encoded['DateOfDeparture'].dt.year
    X_encoded.loc[:, 'month'] = X_encoded['DateOfDeparture'].dt.month
    X_encoded.loc[:, 'day'] = X_encoded['DateOfDeparture'].dt.day
    X_encoded.loc[:, 'weekday'] = X_encoded['DateOfDeparture'].dt.weekday
    X_encoded.loc[:, 'week'] = X_encoded['DateOfDeparture'].dt.isocalendar().week
    X_encoded.loc[:, 'n_days'] = X_encoded['DateOfDeparture'].apply(
        lambda date: (date - pd.to_datetime("1970-01-01")).days
    )
    X_encoded.loc[:, 'hol_days'] = X_encoded['DateOfDeparture'].apply(
        lambda date: 1 if date in us_holidays else 0)
    # Once we did the encoding, we will not need DateOfDeparture
    return X_encoded.drop(columns=["DateOfDeparture"])

date_encoder = FunctionTransformer(_encode_dates)
In [ ]:
tmp = date_encoder.fit_transform(X)
In [ ]:
tmp[tmp["hol_days"]==1]
Out[ ]:
Departure Arrival WeeksToDeparture std_wtd year month day weekday week n_days hol_days
24 EWR BOS 11.000000 8.711759 2012 11 12 0 46 15656 1
54 EWR LAX 12.818182 8.671993 2011 12 25 6 51 15333 1
68 DEN LAX 15.034483 11.890821 2012 12 25 1 52 15699 1
77 DEN ORD 9.631579 7.158604 2013 1 21 0 4 15726 1
119 DEN ORD 10.176471 7.307832 2011 11 24 3 47 15302 1
... ... ... ... ... ... ... ... ... ... ... ...
8817 JFK MIA 15.733333 12.564134 2011 12 25 6 51 15333 1
8819 SFO JFK 16.096774 11.249755 2012 5 28 0 22 15488 1
8843 SFO LAX 17.705882 11.645078 2012 12 25 1 52 15699 1
8844 MCO PHL 9.526316 6.432283 2011 12 25 6 51 15333 1
8857 PHX ORD 8.888889 6.182412 2011 10 10 0 41 15257 1

363 rows × 11 columns

Random Forests

Tree-based algorithms require less complex preprocessing than linear models. We will first present a machine-learning pipeline using a random forest. In this pipeline, we will need to:

  • encode the date to numerical values (as presented in the section above);
  • ordinal-encode the other categorical columns to get numerical values;
  • keep numerical features as they are.

Thus, we want to perform three different processes on different columns of the original data X. In scikit-learn, we can use make_column_transformer to perform such processing.

In [ ]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer

categorical_encoder = OrdinalEncoder()
categorical_cols = ["Arrival", "Departure"]

preprocessor = make_column_transformer(
    (categorical_encoder, categorical_cols),
    remainder='passthrough',  # passthrough numerical columns as they are
)

We can combine our preprocessor with an estimator (RandomForestRegressor in this case), allowing us to make predictions.

In [ ]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

n_estimators = 10
max_depth = 10
max_features = 10

regressor = RandomForestRegressor(
    n_estimators=n_estimators, max_depth=max_depth, max_features=max_features#,min_samples_leaf=5
)

pipeline = make_pipeline(date_encoder, preprocessor, regressor)

We can cross-validate our pipeline using cross_val_score. Below we specify cv=5, meaning KFold cross-validation with 5 folds will be used. The (negative) mean squared error is calculated for each split, giving an array of 5 scores; we convert them to RMSE values and print their mean and standard deviation at the end.

In [ ]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    pipeline, X, y, cv=5, scoring='neg_mean_squared_error'
)
rmse_scores = np.sqrt(-scores)

print(
    f"RMSE: {np.mean(rmse_scores):.4f} +/- {np.std(rmse_scores):.4f}"
)
RMSE: 0.6283 +/- 0.0250
In [ ]:
#!ramp-test --submission starting_kit

Linear regressor

When dealing with a linear model, we need to one-hot encode the categorical variables (instead of ordinal-encoding them) and standardize the numerical variables. Thus we will:

  • encode the date;
  • then, one-hot encode all categorical columns, including the encoded date columns;
  • standardize the numerical columns.
In [ ]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

categorical_encoder = OneHotEncoder(handle_unknown="ignore")
categorical_cols = [
    "Arrival", "Departure", "year", "month", "day",
    "weekday", "week", "n_days"
]

numerical_scaler = StandardScaler()
numerical_cols = ["WeeksToDeparture", "std_wtd"]

preprocessor = make_column_transformer(
    (categorical_encoder, categorical_cols),
    (numerical_scaler, numerical_cols)
)

We can now combine our preprocessor with the LinearRegression estimator in a Pipeline:

In [ ]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()

pipeline = make_pipeline(date_encoder, preprocessor, regressor)

And we can evaluate our linear-model pipeline:

In [ ]:
scores = cross_val_score(
    pipeline, X, y, cv=5, scoring='neg_mean_squared_error'
)
rmse_scores = np.sqrt(-scores)

print(
    f"RMSE: {np.mean(rmse_scores):.4f} +/- {np.std(rmse_scores):.4f}"
)
RMSE: 0.6117 +/- 0.0149

Grid search and Lasso

We can also tune hyperparameters with GridSearchCV. Here we reuse the one-hot preprocessor from the linear pipeline, search over the Lasso regularization strength alpha, and hold out 10% of the data as a test set.

In [ ]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split

lasso = Lasso()
pipL = make_pipeline(date_encoder, preprocessor, lasso)
pipL_gs = GridSearchCV(
    pipL,
    {"lasso__alpha": [0.00001, 0.0001, 0.01]}
)
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10)
In [ ]:
pipL_gs.fit(X_train, y_train)
/home/peter/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_coordinate_descent.py:512: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 320.22821354759185, tolerance: 0.6387185401787862
  model = cd_fast.sparse_enet_coordinate_descent(
/home/peter/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_coordinate_descent.py:512: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 331.01463346245384, tolerance: 0.6371123962129355
  model = cd_fast.sparse_enet_coordinate_descent(
/home/peter/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_coordinate_descent.py:512: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 316.01636597892013, tolerance: 0.6285972686841081
  model = cd_fast.sparse_enet_coordinate_descent(
/home/peter/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_coordinate_descent.py:512: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 319.04254662657195, tolerance: 0.6494295581882294
  model = cd_fast.sparse_enet_coordinate_descent(
/home/peter/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_coordinate_descent.py:512: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 319.1290676773982, tolerance: 0.6395733622019132
  model = cd_fast.sparse_enet_coordinate_descent(
/home/peter/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_coordinate_descent.py:512: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 14.942635417884276, tolerance: 0.6371123962129355
  model = cd_fast.sparse_enet_coordinate_descent(
Out[ ]:
GridSearchCV(estimator=Pipeline(steps=[('functiontransformer',
                                        FunctionTransformer(func=<function _encode_dates at 0x7fec3149caf0>)),
                                       ('columntransformer',
                                        ColumnTransformer(transformers=[('onehotencoder',
                                                                         OneHotEncoder(handle_unknown='ignore'),
                                                                         ['Arrival',
                                                                          'Departure',
                                                                          'year',
                                                                          'month',
                                                                          'day',
                                                                          'weekday',
                                                                          'week',
                                                                          'n_days']),
                                                                        ('standardscaler',
                                                                         StandardScaler(),
                                                                         ['WeeksToDeparture',
                                                                          'std_wtd'])])),
                                       ('lasso', Lasso())]),
             param_grid={'lasso__alpha': [1e-05, 0.0001, 0.01]})
In [ ]:
pipL_gs.best_params_
Out[ ]:
{'lasso__alpha': 0.0001}
In [ ]:
pipL_gs.best_estimator_.steps[-1][1].alpha
Out[ ]:
0.0001
In [ ]:
pipL_gs.cv_results_
Out[ ]:
{'mean_fit_time': array([7.13713937, 6.03071251, 1.12473645]),
 'std_fit_time': array([0.03533106, 0.28460367, 0.01887271]),
 'mean_score_time': array([0.25607538, 0.2563077 , 0.25720119]),
 'std_score_time': array([0.0008828 , 0.00213678, 0.00135228]),
 'param_lasso__alpha': masked_array(data=[1e-05, 0.0001, 0.01],
              mask=[False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'lasso__alpha': 1e-05},
  {'lasso__alpha': 0.0001},
  {'lasso__alpha': 0.01}],
 'split0_test_score': array([0.62011337, 0.62706323, 0.41771701]),
 'split1_test_score': array([0.64666052, 0.65529856, 0.46811574]),
 'split2_test_score': array([0.60801378, 0.61410272, 0.42689959]),
 'split3_test_score': array([0.60262507, 0.61203399, 0.43874222]),
 'split4_test_score': array([0.61859979, 0.62804545, 0.44002117]),
 'mean_test_score': array([0.61920251, 0.62730879, 0.43829915]),
 'std_test_score': array([0.01519957, 0.01543859, 0.0170045 ]),
 'rank_test_score': array([2, 1, 3], dtype=int32)}
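
The ConvergenceWarning messages above come from the coordinate-descent solver hitting its iteration limit for the smallest alpha values. One possible remedy (a sketch on our part, not something the starting kit does) is to raise max_iter on the Lasso step before re-running the grid search:

In [ ]:
# Sketch: give the coordinate-descent solver more iterations so that the
# smallest alpha values can converge; loosening `tol` or dropping the
# tiniest alphas from the grid would be alternatives.
lasso = Lasso(max_iter=10000)
pipL = make_pipeline(date_encoder, preprocessor, lasso)
pipL_gs = GridSearchCV(pipL, {"lasso__alpha": [0.00001, 0.0001, 0.01]})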
In [ ]:
#!ramp-test --submission starting_kit_lin
In [ ]:
#!ramp-test --submission starting_kit_Ridge

Merging external data

The objective in this RAMP data challenge is to find good data that can be correlated to flight traffic. We will use some weather data (saved in submissions/starting_kit) to provide an example of how to merge external data in a scikit-learn pipeline.

Your external data will need to be included in your submissions folder - see RAMP submissions for more details.

First we will define a function that merges the external data to our feature data.

In [ ]:
# when submitting a kit, the `__file__` variable corresponds to the
# path to `estimator.py`. However, this variable is not defined in the
# notebook and thus we must define the `__file__` variable to imitate
# how a submission `.py` would work.
__file__ = os.path.join('submissions', 'starting_kit', 'estimator.py')
filepath = os.path.join(os.path.dirname(__file__), 'external_data.csv')
filepath
Out[ ]:
'submissions/starting_kit/external_data.csv'
In [ ]:
pd.read_csv(filepath).tail(20)
Out[ ]:
Date AirPort Max TemperatureC Mean TemperatureC Min TemperatureC Dew PointC MeanDew PointC Min DewpointC Max Humidity Mean Humidity Min Humidity Max Sea Level PressurehPa Mean Sea Level PressurehPa Min Sea Level PressurehPa Max VisibilityKm Mean VisibilityKm Min VisibilitykM Max Wind SpeedKm/h Mean Wind SpeedKm/h Max Gust SpeedKm/h Precipitationmm CloudCover Events WindDirDegrees
11020 2013-02-14 LGA 8 4 1 1 -2 -6 82 60 37 1016 1012 1006 16 16 10 26 13 29.0 T 4 Rain-Snow 295
11021 2013-02-15 LGA 12 7 3 2 0 -3 76 60 44 1016 1015 1013 16 15 11 37 13 42.0 0.00 4 NaN 210
11022 2013-02-16 LGA 6 3 0 1 -4 -10 82 64 45 1014 1012 1007 16 16 13 39 20 52.0 0.51 8 Rain-Snow 2
11023 2013-02-17 LGA 0 -3 -7 -11 -14 -17 49 40 31 1016 1008 1004 16 16 16 58 40 72.0 0.00 5 NaN 321
11024 2013-02-18 LGA 2 -2 -7 -12 -15 -17 51 39 26 1024 1022 1016 16 16 16 50 26 64.0 0.00 1 NaN 278
11025 2013-02-19 LGA 8 4 1 5 -2 -12 89 62 35 1023 1012 1005 16 15 6 40 18 56.0 4.06 6 Rain 195
11026 2013-02-20 LGA 4 1 -3 -7 -11 -14 52 43 33 1015 1011 1006 16 16 16 50 33 64.0 0.00 4 NaN 284
11027 2013-02-21 LGA 2 -1 -4 -10 -11 -13 55 48 41 1025 1018 1015 16 16 16 50 32 64.0 0.00 2 NaN 306
11028 2013-02-22 LGA 4 1 -3 -1 -7 -9 70 63 56 1029 1027 1025 16 16 16 29 14 39.0 T 6 NaN 50
11029 2013-02-23 LGA 4 3 2 2 1 -1 92 84 76 1025 1017 1009 16 7 0 34 18 42.0 6.86 8 Rain 53
11030 2013-02-24 LGA 8 5 2 2 -1 -6 92 73 53 1014 1009 1007 16 14 5 45 22 56.0 0.76 7 Rain-Snow 327
11031 2013-02-25 LGA 7 4 1 -2 -4 -6 69 57 45 1024 1020 1015 16 16 16 29 12 35.0 0.00 4 NaN 335
11032 2013-02-26 LGA 7 4 1 1 -1 -3 92 73 53 1026 1022 1013 16 15 8 37 19 45.0 3.05 6 Rain 65
11033 2013-02-27 LGA 9 6 2 6 3 1 100 80 60 1012 1003 999 16 7 2 47 26 58.0 28.19 8 Rain 31
11034 2013-02-28 LGA 12 8 4 2 0 -2 76 61 46 1003 1002 1000 16 16 16 29 12 39.0 T 7 Rain 285
11035 2013-03-01 LGA 7 5 3 -1 -3 -4 76 63 49 1008 1005 1002 16 16 16 34 21 42.0 0.00 6 NaN 320
11036 2013-03-02 LGA 4 2 0 -2 -5 -6 82 65 48 1008 1007 1006 16 15 6 34 20 40.0 T 8 Snow 317
11037 2013-03-03 LGA 4 2 -1 -5 -8 -9 69 55 40 1008 1006 1004 16 15 8 39 24 50.0 T 6 Snow 314
11038 2013-03-04 LGA 5 2 -2 -7 -8 -9 63 54 44 1012 1009 1008 16 16 16 47 31 60.0 0.00 3 NaN 313
11039 2013-03-05 LGA 9 5 1 -3 -5 -7 61 49 37 1016 1015 1013 16 16 16 39 16 48.0 0.00 2 NaN 5
In [ ]:
def _merge_external_data(X):
    filepath = os.path.join(
        os.path.dirname(__file__), 'external_data.csv'
    )
    
    X = X.copy()  # to avoid raising a SettingWithCopyWarning
    # Make sure that DateOfDeparture is of dtype datetime
    X.loc[:, "DateOfDeparture"] = pd.to_datetime(X['DateOfDeparture'])
    # Parse date to also be of dtype datetime
    data_weather = pd.read_csv(filepath, parse_dates=["Date"])

    X_weather = data_weather[['Date', 'AirPort', 'Max TemperatureC', 'Events']]
    X_weather = X_weather.rename(
        columns={'Date': 'DateOfDeparture', 'AirPort': 'Arrival', 'Max TemperatureC': 'MaxTempArriv', 'Events':'EventsArr'})
    X_merged = pd.merge(
        X, X_weather, how='left', on=['DateOfDeparture', 'Arrival'], sort=False
    )
    # add the min temperature and weather events at the departure airport
    X_weather = data_weather[['Date', 'AirPort', 'Min TemperatureC', 'Events']]
    X_weather = X_weather.rename(
        columns={'Date': 'DateOfDeparture', 'AirPort': 'Departure', 'Min TemperatureC': 'MinTempArriv', 'Events':'EventsDep'})
    X_merged = pd.merge(
        X_merged, X_weather, how='left', on=['DateOfDeparture', 'Departure'], sort=False
    )
    X_merged.loc[:,["EventsArr","EventsDep"]]=X_merged.loc[:,["EventsArr","EventsDep"]].fillna(value="good")
    return X_merged

data_merger = FunctionTransformer(_merge_external_data)

Double check that our function works:

In [ ]:
data_merger.fit_transform(X).head()
Out[ ]:
DateOfDeparture Departure Arrival WeeksToDeparture std_wtd MaxTempArriv EventsArr MinTempArriv EventsDep
0 2012-06-19 ORD DFW 12.875000 9.812647 34 good 26 good
1 2012-09-10 LAS DEN 14.285714 9.466734 33 good 27 good
2 2012-10-05 DEN LAX 10.863636 9.035883 22 Fog -1 Rain-Snow
3 2011-10-09 ATL ORD 11.480000 7.990202 27 good 16 good
4 2012-02-21 DEN SFO 11.450000 9.517159 16 good -4 good

We can now assemble our pipeline using data_merger together with an updated preprocessor that also ordinal-encodes the weather Events columns. Below we additionally fit a standalone OrdinalEncoder on the merged Events columns to inspect which categories occur in the full data set:

In [ ]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer

categorical_encoder = OrdinalEncoder()
categorical_cols = ["Arrival", "Departure", "EventsArr", "EventsDep"]

mymd=data_merger.fit_transform(X)[["EventsArr", "EventsDep"]]
enc = OrdinalEncoder()
enc.fit(mymd)
print(enc.categories_)

preprocessor = make_column_transformer(
    (categorical_encoder, categorical_cols),
    remainder='passthrough',  # passthrough numerical columns as they are
)
[array(['Fog', 'Fog-Rain', 'Fog-Rain-Hail-Thunderstorm', 'Fog-Rain-Snow',
       'Fog-Rain-Snow-Thunderstorm', 'Fog-Rain-Thunderstorm', 'Fog-Snow',
       'Rain', 'Rain-Hail-Thunderstorm', 'Rain-Snow',
       'Rain-Snow-Thunderstorm', 'Rain-Thunderstorm', 'Snow',
       'Thunderstorm', 'good'], dtype=object), array(['Fog', 'Fog-Rain', 'Fog-Rain-Hail-Thunderstorm', 'Fog-Rain-Snow',
       'Fog-Rain-Snow-Thunderstorm', 'Fog-Rain-Thunderstorm', 'Fog-Snow',
       'Rain', 'Rain-Hail-Thunderstorm', 'Rain-Snow',
       'Rain-Snow-Thunderstorm', 'Rain-Thunderstorm',
       'Rain-Thunderstorm-Tornado', 'Snow', 'Thunderstorm', 'good'],
      dtype=object)]
In [ ]:
n_estimators = 10
max_depth = 10
max_features = 10

regressor = RandomForestRegressor(
    n_estimators=n_estimators, max_depth=max_depth, max_features=max_features
)

pipeline = make_pipeline(data_merger, date_encoder, preprocessor, regressor)
In [ ]:
scores = cross_val_score(
    pipeline, X, y, cv=5, scoring='neg_mean_squared_error'
)
rmse_scores = np.sqrt(-scores)

print(
    f"RMSE: {np.mean(rmse_scores):.4f} +/- {np.std(rmse_scores):.4f}"
)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/workspace/dataScience/ca_2020/air_passengers-master/submissions/starting_kit/estimator.py in <module>
----> 1 scores = cross_val_score(
      2     pipeline, X, y, cv=5, scoring='neg_mean_squared_error'
      3 )
      4 rmse_scores = np.sqrt(-scores)
      5 

~/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~/anaconda3/lib/python3.8/site-packages/sklearn/model_selection/_validation.py in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score)
    399     scorer = check_scoring(estimator, scoring=scoring)
    400 
--> 401     cv_results = cross_validate(estimator=estimator, X=X, y=y, groups=groups,
    402                                 scoring={'score': scorer}, cv=cv,
    403                                 n_jobs=n_jobs, verbose=verbose,

~/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~/anaconda3/lib/python3.8/site-packages/sklearn/model_selection/_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
    240     parallel = Parallel(n_jobs=n_jobs, verbose=verbose,
    241                         pre_dispatch=pre_dispatch)
--> 242     scores = parallel(
    243         delayed(_fit_and_score)(
    244             clone(estimator), X, y, scorers, train, test, verbose, None,

~/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
   1046             # remaining jobs.
   1047             self._iterating = False
-> 1048             if self.dispatch_one_batch(iterator):
   1049                 self._iterating = self._original_iterator is not None
   1050 

~/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    864                 return False
    865             else:
--> 866                 self._dispatch(tasks)
    867                 return True
    868 

~/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in _dispatch(self, batch)
    782         with self._lock:
    783             job_idx = len(self._jobs)
--> 784             job = self._backend.apply_async(batch, callback=cb)
    785             # A job can complete so quickly than its callback is
    786             # called before we get here, causing self._jobs to

~/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    206     def apply_async(self, func, callback=None):
    207         """Schedule a func to be run"""
--> 208         result = ImmediateResult(func)
    209         if callback:
    210             callback(result)

~/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    570         # Don't delay the application, to avoid keeping the input
    571         # arguments in memory
--> 572         self.results = batch()
    573 
    574     def get(self):

~/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in __call__(self)
    260         # change the default number of processes to -1
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262             return [func(*args, **kwargs)
    263                     for func, args, kwargs in self.items]
    264 

~/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in <listcomp>(.0)
    260         # change the default number of processes to -1
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262             return [func(*args, **kwargs)
    263                     for func, args, kwargs in self.items]
    264 

~/anaconda3/lib/python3.8/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
    558     else:
    559         fit_time = time.time() - start_time
--> 560         test_scores = _score(estimator, X_test, y_test, scorer)
    561         score_time = time.time() - start_time - fit_time
    562         if return_train_score:

~/anaconda3/lib/python3.8/site-packages/sklearn/model_selection/_validation.py in _score(estimator, X_test, y_test, scorer)
    605         scores = scorer(estimator, X_test)
    606     else:
--> 607         scores = scorer(estimator, X_test, y_test)
    608 
    609     error_msg = ("scoring must return a number, got %s (%s) "

~/anaconda3/lib/python3.8/site-packages/sklearn/metrics/_scorer.py in __call__(self, estimator, *args, **kwargs)
     85         for name, scorer in self._scorers.items():
     86             if isinstance(scorer, _BaseScorer):
---> 87                 score = scorer._score(cached_call, estimator,
     88                                       *args, **kwargs)
     89             else:

~/anaconda3/lib/python3.8/site-packages/sklearn/metrics/_scorer.py in _score(self, method_caller, estimator, X, y_true, sample_weight)
    204         """
    205 
--> 206         y_pred = method_caller(estimator, "predict", X)
    207         if sample_weight is not None:
    208             return self._sign * self._score_func(y_true, y_pred,

~/anaconda3/lib/python3.8/site-packages/sklearn/metrics/_scorer.py in _cached_call(cache, estimator, method, *args, **kwargs)
     51     """Call estimator with method and args and kwargs."""
     52     if cache is None:
---> 53         return getattr(estimator, method)(*args, **kwargs)
     54 
     55     try:

~/anaconda3/lib/python3.8/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    117 
    118         # lambda, but not partial, allows help() to work with update_wrapper
--> 119         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    120         # update the docstring of the returned function
    121         update_wrapper(out, self.fn)

~/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
    405         Xt = X
    406         for _, name, transform in self._iter(with_final=False):
--> 407             Xt = transform.transform(Xt)
    408         return self.steps[-1][-1].predict(Xt, **predict_params)
    409 

~/anaconda3/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py in transform(self, X)
    602         # TODO: also call _check_n_features(reset=False) in 0.24
    603         self._validate_features(X.shape[1], X_feature_names)
--> 604         Xs = self._fit_transform(X, None, _transform_one, fitted=True)
    605         self._validate_output(Xs)
    606 

~/anaconda3/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py in _fit_transform(self, X, y, func, fitted)
    456             self._iter(fitted=fitted, replace_strings=True))
    457         try:
--> 458             return Parallel(n_jobs=self.n_jobs)(
    459                 delayed(func)(
    460                     transformer=clone(trans) if not fitted else trans,

~/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
   1046             # remaining jobs.
   1047             self._iterating = False
-> 1048             if self.dispatch_one_batch(iterator):
   1049                 self._iterating = self._original_iterator is not None
   1050 

~/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    864                 return False
    865             else:
--> 866                 self._dispatch(tasks)
    867                 return True
    868 

~/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in _dispatch(self, batch)
    782         with self._lock:
    783             job_idx = len(self._jobs)
--> 784             job = self._backend.apply_async(batch, callback=cb)
    785             # A job can complete so quickly than its callback is
    786             # called before we get here, causing self._jobs to

~/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    206     def apply_async(self, func, callback=None):
    207         """Schedule a func to be run"""
--> 208         result = ImmediateResult(func)
    209         if callback:
    210             callback(result)

~/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    570         # Don't delay the application, to avoid keeping the input
    571         # arguments in memory
--> 572         self.results = batch()
    573 
    574     def get(self):

~/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in __call__(self)
    260         # change the default number of processes to -1
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262             return [func(*args, **kwargs)
    263                     for func, args, kwargs in self.items]
    264 

~/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in <listcomp>(.0)
    260         # change the default number of processes to -1
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262             return [func(*args, **kwargs)
    263                     for func, args, kwargs in self.items]
    264 

~/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py in _transform_one(transformer, X, y, weight, **fit_params)
    717 
    718 def _transform_one(transformer, X, y, weight, **fit_params):
--> 719     res = transformer.transform(X)
    720     # if we have a weight for this transformer, multiply output
    721     if weight is None:

~/anaconda3/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py in transform(self, X)
    698             Transformed input.
    699         """
--> 700         X_int, _ = self._transform(X)
    701         return X_int.astype(self.dtype, copy=False)
    702 

~/anaconda3/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py in _transform(self, X, handle_unknown)
    122                     msg = ("Found unknown categories {0} in column {1}"
    123                            " during transform".format(diff, i))
--> 124                     raise ValueError(msg)
    125                 else:
    126                     # Set the problematic rows to an acceptable value and

ValueError: Found unknown categories ['Rain-Thunderstorm-Tornado'] in column 3 during transform
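
The cross-validation fails because some validation folds contain weather Events categories (here 'Rain-Thunderstorm-Tornado') that never appear in the corresponding training fold, so the fitted OrdinalEncoder cannot transform them. A possible workaround, assuming scikit-learn >= 0.24, is to let the encoder map unseen categories to a sentinel value:

In [ ]:
# Sketch: OrdinalEncoder can map categories unseen during fit to a fixed
# value instead of raising (requires scikit-learn >= 0.24).
categorical_encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
preprocessor = make_column_transformer(
    (categorical_encoder, ["Arrival", "Departure", "EventsArr", "EventsDep"]),
    remainder='passthrough',
)
pipeline = make_pipeline(data_merger, date_encoder, preprocessor, regressor)

Alternatively, the Events columns could be one-hot encoded with handle_unknown="ignore", as was done for the linear pipeline above.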

Feature importances

We can check the feature importances using the function sklearn.inspection.permutation_importance. Since the first step of our pipeline merges in the external weather features, we compute the importances after this step so that these new features are included. Indeed, we can apply permutation_importance at any stage of the pipeline, as we will see later on.

The code below:

  • transforms the data with the first step of the pipeline (pipeline[0]), producing the augmented train (X_train_augmented) and test (X_test_augmented) data
  • fits the pipeline from the second step onwards (pipeline[1:]) on this transformed data

Note that pipelines can be indexed and sliced: pipeline[0] returns the transformer of the first step, while a slice such as pipeline[1:] returns a new Pipeline made of the remaining steps. The underlying (name, estimator) tuples are available through pipeline.steps, e.g. pipeline.steps[0].
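
As a small illustration of these indexing rules (using the pipeline defined above):

In [ ]:
# Indexing returns the estimator of a step; slicing returns a sub-Pipeline.
first_transformer = pipeline[0]     # the data-merging FunctionTransformer
remaining_steps = pipeline[1:]      # Pipeline: date encoder, preprocessor, regressor
step_name, step_estimator = pipeline.steps[0]  # the raw (name, estimator) tuple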

In [ ]:
print(pipeline)
Pipeline(steps=[('functiontransformer-1',
                 FunctionTransformer(func=<function _merge_external_data at 0x7fec2cc9e280>)),
                ('functiontransformer-2',
                 FunctionTransformer(func=<function _encode_dates at 0x7fec3149caf0>)),
                ('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('ordinalencoder',
                                                  OrdinalEncoder(),
                                                  ['Arrival', 'Departure'])])),
                ('randomforestregressor',
                 RandomForestRegressor(max_depth=10, max_features=10,
                                       n_estimators=10))])
In [ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42
)

merger = pipeline[0]
X_train_augmented = merger.transform(X_train)
X_test_augmented = merger.transform(X_test)

predictor = pipeline[1:]
predictor.fit(X_train_augmented, y_train).score(X_test_augmented, y_test)
Out[ ]:
0.617276332720553

With the fitted pipeline, we can now use permutation_importance to calculate feature importances:

In [ ]:
from sklearn.inspection import permutation_importance

feature_importances = permutation_importance(
    predictor, X_train_augmented, y_train, n_repeats=5
)

Here, we plot the permutation importances computed on the training set. The higher the value, the more important the feature.

In [ ]:
sorted_idx = feature_importances.importances_mean.argsort()

fig, ax = plt.subplots()
ax.boxplot(feature_importances.importances[sorted_idx].T,
           vert=False, labels=X_train_augmented.columns[sorted_idx])
ax.set_title("Permutation Importances (train set)")
fig.tight_layout()
plt.show()

We can replicate the same processing on the test set and see if we can observe the same trend.

In [ ]:
from sklearn.inspection import permutation_importance

feature_importances = permutation_importance(
    predictor, X_test_augmented, y_test, n_repeats=10
)
In [ ]:
sorted_idx = feature_importances.importances_mean.argsort()

fig, ax = plt.subplots()
ax.boxplot(feature_importances.importances[sorted_idx].T,
           vert=False, labels=X_test_augmented.columns[sorted_idx])
ax.set_title("Permutation Importances (test set)")
fig.tight_layout()
plt.show()

With the current version of scikit-learn it is not convenient, but still possible, to check the feature importances at the last stage of the pipeline (once all features have been preprocessed).

The difficult part is recovering the feature names.

In [ ]:
preprocessor = pipeline[:-1]
predictor = pipeline[-1]

X_train_augmented = preprocessor.transform(X_train)
X_test_augmented = preprocessor.transform(X_test)

Let's work out the feature names by hand (newer scikit-learn releases provide a get_feature_names_out method that handles this case).
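
If such a release (>= 1.0) is installed, a minimal sketch of that shortcut would be:

In [ ]:
# Assumes scikit-learn >= 1.0: the fitted ColumnTransformer (third pipeline
# step) can report its output feature names directly; names are prefixed by
# the transformer that produced them (e.g. 'remainder__WeeksToDeparture').
feature_names = pipeline[2].get_feature_names_out()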

In [ ]:
categorical_cols_name = categorical_cols
passthrough_cols_name = (
    pipeline[:2].transform(X_train[:1])  # Take only one sample to go fast
    .columns[pipeline[2].transformers_[-1][-1]]
    .tolist()
)
feature_names = np.array(
    categorical_cols_name + passthrough_cols_name
)
feature_names
Out[ ]:
array(['Arrival', 'Departure', 'WeeksToDeparture', 'std_wtd',
       'Mean TemperatureC', 'year', 'month', 'day', 'weekday', 'week',
       'n_days', 'hol_days'], dtype='<U17')

We can repeat the previous processing at this finer grain, where the transformed date columns are included.

In [ ]:
from sklearn.inspection import permutation_importance

feature_importances = permutation_importance(
    predictor, X_train_augmented, y_train, n_repeats=5
)

Here, we plot the permutation importances computed on the training set. Again, the higher the value, the more important the feature.

In [ ]:
sorted_idx = feature_importances.importances_mean.argsort()

fig, ax = plt.subplots()
ax.boxplot(feature_importances.importances[sorted_idx].T,
           vert=False, labels=feature_names[sorted_idx])
ax.set_title("Permutation Importances (train set)")
fig.tight_layout()
plt.show()

We can replicate the same processing on the test set and see if we can observe the same trend.

In [ ]:
from sklearn.inspection import permutation_importance

feature_importances = permutation_importance(
    predictor, X_test_augmented, y_test, n_repeats=10
)
In [ ]:
sorted_idx = feature_importances.importances_mean.argsort()

fig, ax = plt.subplots()
ax.boxplot(feature_importances.importances[sorted_idx].T,
           vert=False, labels=feature_names[sorted_idx])
ax.set_title("Permutation Importances (test set)")
fig.tight_layout()
plt.show()

Submission

To submit your code, you can refer to the online documentation.

Scikit-Learn Cheat Sheet

A Basic Example

In [ ]:
from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
accuracy_score(y_test, y_pred)

Loading The Data

Your data needs to be numeric and stored as NumPy arrays or SciPy sparse matrices. Other types that are convertible to numeric arrays, such as Pandas DataFrame, are also acceptable.

In [ ]:
from sklearn import datasets

iris = datasets.load_iris()
# print(iris)
X, y = iris.data, iris.target

Split The Data Into Training And Test Sets : sklearn.model_selection

In [ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

Preprocessing The Data : sklearn.preprocessing

Standardization

Standardize features by removing the mean and scaling to unit variance: z = (x - u) / s

In [ ]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
standardized_X = scaler.transform(X_train)
standardized_X_test = scaler.transform(X_test)
standardized_X[:3]

Normalization

In [ ]:
from sklearn.preprocessing import Normalizer

scaler = Normalizer().fit(X_train)
normalized_X = scaler.transform(X_train)
normalized_X_test = scaler.transform(X_test)

Binarization

In [ ]:
from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=0.0).fit(X)
binary_X = binarizer.transform(X)

Encoding Categorical Features

In [ ]:
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
y = enc.fit_transform(y)
y

Imputing Missing Values

In [ ]:
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=0, strategy='mean')
imp.fit_transform(X_train)

Generating Polynomial Features

In [ ]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(5)
poly.fit_transform(X)

Create Your Model


Supervised Learning Estimators

Linear Regression

In [ ]:
from sklearn.linear_model import LinearRegression

# the `normalize` option was removed from LinearRegression in scikit-learn 1.2;
# scale features with StandardScaler in a Pipeline instead
lr = LinearRegression()

Ridge regression

In [ ]:
from sklearn import linear_model

reg = linear_model.Ridge(alpha=.5)
reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])

print(reg.coef_)
print(reg.intercept_)

Support Vector Machines (SVM)

In [ ]:
from sklearn.svm import SVC

svc = SVC(kernel='linear')

Naive Bayes

In [ ]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

KNN

In [ ]:
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

Unsupervised Learning Estimators

Principal Component Analysis (PCA)

In [ ]:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)

K Means

In [ ]:
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)

Model Fitting


Supervised learning

In [ ]:
lr.fit(X, y)
knn.fit(X_train, y_train)
svc.fit(X_train, y_train)

Unsupervised Learning

In [ ]:
k_means.fit(X_train)
pca_model = pca.fit_transform(X_train)

Prediction


Supervised Estimators

In [ ]:
y_pred = svc.predict(np.random.random((2,4)))
y_pred = lr.predict(X_test)
y_pred = knn.predict_proba(X_test)

Unsupervised Estimators

In [ ]:
y_pred = k_means.predict(X_test)

Evaluate Your Model's Performance


Classification Metrics

Accuracy Score

In [ ]:
knn.score(X_test, y_test)

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

Classification Report

In [ ]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

Confusion Matrix

In [ ]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

Regression Metrics

Mean Absolute Error

In [ ]:
from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2]
y_pred = [2.5, 0.0, 2]  # illustrative predictions with the same length as y_true
mean_absolute_error(y_true, y_pred)

Mean Squared Error

In [ ]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

R2 Score

In [ ]:
from sklearn.metrics import r2_score
r2_score(y_true, y_pred)

Clustering Metrics

Adjusted Rand Index

In [ ]:
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y_true, y_pred)

Homogeneity

In [ ]:
from sklearn.metrics import homogeneity_score
homogeneity_score(y_true, y_pred)

V-measure

In [ ]:
from sklearn.metrics import v_measure_score
v_measure_score(y_true, y_pred)

Cross-Validation

In [ ]:
from sklearn.model_selection import cross_val_score
print(cross_val_score(knn, X_train, y_train, cv=4))
print(cross_val_score(lr, X, y, cv=2))

Tune Your Model


Grid Search

In [ ]:
from sklearn.model_selection import GridSearchCV
params = {"n_neighbors": np.arange(1, 3), "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=knn, param_grid=params)
grid.fit(X_train, y_train)
print(grid.best_score_)
print(grid.best_estimator_.n_neighbors)

Randomized Parameter Optimization

In [ ]:
from sklearn.model_selection import RandomizedSearchCV
params = {"n_neighbors": range(1,5), "weights": ["uniform", "distance"]}
rsearch = RandomizedSearchCV(
            estimator=knn,
            param_distributions=params,
            cv=4,
            n_iter=8,
            random_state=5)

rsearch.fit(X_train, y_train)
print(rsearch.best_score_)