Machine learning has become an indispensable tool in recent years. It is used in many domains, in industry, research laboratories and business alike. Compared with deep learning, which is often treated as a black box, machine learning methods expose a certain physical logic in the interpretation of results. Indeed, a good prediction with machine learning algorithms requires a number of logical steps that help establish cause-and-effect relationships and allow a better interpretation. Among these steps, one is the key to interpretability: feature selection. Several techniques exist for finding the features most relevant to the phenomenon under study. In this article, we walk through some of them.
By the end of this tutorial, you will be able to:
- Program feature selection with scikit-learn (a minimal sketch follows this list),
- Understand why you should never rely on a single technique,
- Select the best features by combining methods.
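Before working on real data, here is a minimal, self-contained sketch of feature selection with scikit-learn, using univariate selection (SelectKBest with f_regression) on synthetic data. The dataset, the value of k and the variable names are purely illustrative; the rest of the tutorial uses other, complementary techniques.

# Minimal sketch: univariate feature selection on synthetic data (illustrative only).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)
selector = SelectKBest(score_func=f_regression, k=3)
X_new = selector.fit_transform(X, y)
print(selector.get_support(indices=True))   # indices of the 3 highest-scoring features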
Set directory and imports
import os
import pandas as pd
from numpy import *
import numpy as np
import timeit as tm

os.chdir("D:/Cours_ESI/Evaluation")

from sklearn.linear_model import (LinearRegression, Ridge, Lasso)
from sklearn.feature_selection import RFE, f_regression
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
Problem statement
The data for this tutorial come from weather stations in France. The goal is to predict the flu rate in France by region and by week. In total, we have 19 features and 11,484 observations.
Data pre-processing
train = pd.read_csv('_data/train_set.csv', delimiter=',', decimal=',', low_memory=False)
train.drop(["Unnamed: 0"], axis=1, inplace=True)
train = train.astype({"ff": "float64", "t": "float64", "u": "float64", "n": "float64",
                      "pression": "float64", "precipitation": "float64",
                      "[0-19 ans]": "float64", "[20-39 ans]": "float64", "[40-59 ans]": "float64",
                      "[60-74 ans]": "float64", "[75 ans plus]": "float64",
                      "Prop H": "float64", "Prop F": "float64"})
train = train.rename(columns={"[0-19 ans]": "0_19ans", "[20-39 ans]": "20_39ans",
                              "[40-59 ans]": "40_59ans", "[60-74 ans]": "60_74ans",
                              "[75 ans plus]": "75etplus", "Prop H": "Prop_h", "Prop F": "Prop_f"})
train = train.loc[:, ["week", "region_code", "ff", "t", "u", "n", "pression", "precipitation",
                      "Year", "0_19ans", "20_39ans", "40_59ans", "60_74ans", "75etplus",
                      "Prop_h", "Prop_f", "reqgoo1", "reqgoo2", "reqgoo3", "TauxGrippe"]]
train.shape
(11484, 20)
train.isnull().any()
week             False
region_code      False
ff               False
t                False
u                False
n                False
pression         False
precipitation    False
Year             False
0_19ans          False
20_39ans         False
40_59ans         False
60_74ans         False
75etplus         False
Prop_h           False
Prop_f           False
reqgoo1          False
reqgoo2          False
reqgoo3          False
TauxGrippe       False
dtype: bool
sns.pairplot(train)
<seaborn.axisgrid.PairGrid at 0x1c15eb53a48>
Define features and target
delete=["TauxGrippe"] features= train.drop(delete,axis=1) target=train.TauxGrippe
Define the feature selection methods
- Predictive power score (PPS)
import ppscore as pps
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt

matrix_df = pps.matrix(train)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
plt.figure(figsize=(16, 12))
sns.heatmap(matrix_df, cmap=plt.get_cmap("coolwarm"), annot=True, fmt='.2f')
<matplotlib.axes._subplots.AxesSubplot at 0x1c17b06bb48>
You can see that with this technique, only one feature appears relevant to the model, namely 'week'. However, it is unlikely that this feature alone is enough to predict the flu rate in France properly, especially since it is hard to draw a cause-and-effect relationship between it and the flu rate. Still, it puts us on alert: we can say that the flu rate varies from week to week.
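As a side note, when only the scores toward the target matter, computing the full matrix is not strictly necessary. The sketch below assumes a recent ppscore version that provides pps.predictors; if your version does not, you can filter the output of pps.matrix on y == "TauxGrippe" instead.

# Hedged sketch: assumes pps.predictors exists in the installed ppscore version.
import ppscore as pps

pps_target = pps.predictors(train, y="TauxGrippe")   # one row per candidate feature
print(pps_target[["x", "ppscore"]].head(10))         # ten best features according to PPS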
- Correlation coefficient
from matplotlib import pyplot

train.corr(method='kendall').style.format("{:.2}").background_gradient(cmap=pyplot.get_cmap('coolwarm'))
| | week | region_code | ff | t | u | n | pression | precipitation | Year | 0_19ans | 20_39ans | 40_59ans | 60_74ans | 75etplus | Prop_h | Prop_f | reqgoo1 | reqgoo2 | reqgoo3 | TauxGrippe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
week | 1.0 | 0.0 | -0.069 | 0.014 | -0.027 | 0.092 | -0.017 | 0.024 | 0.95 | -0.12 | -0.32 | -0.22 | 0.32 | 0.2 | -0.021 | 0.021 | -0.11 | -0.0019 | -0.042 | 0.028 |
region_code | 0.0 | 1.0 | -0.054 | 0.1 | -0.22 | -0.22 | -0.1 | -0.037 | 0.0 | -0.49 | -0.28 | 0.083 | 0.46 | 0.39 | -0.06 | 0.06 | 0.0087 | -0.14 | -0.14 | 0.07 |
ff | -0.069 | -0.054 | 1.0 | -0.11 | 0.042 | 0.095 | 0.076 | 0.14 | -0.066 | 0.14 | 0.0088 | -0.18 | -0.048 | -0.053 | -0.11 | 0.11 | 0.036 | 0.067 | 0.077 | 0.068 |
t | 0.014 | 0.1 | -0.11 | 1.0 | -0.37 | -0.27 | 0.046 | -0.022 | -0.0012 | -0.058 | -0.034 | -0.013 | 0.065 | 0.042 | -0.0074 | 0.0074 | -0.23 | -0.25 | -0.28 | -0.41 |
u | -0.027 | -0.22 | 0.042 | -0.37 | 1.0 | 0.47 | 0.033 | 0.24 | -0.038 | 0.13 | 0.056 | 0.056 | -0.14 | -0.074 | 0.066 | -0.066 | 0.13 | 0.14 | 0.16 | 0.19 |
n | 0.092 | -0.22 | 0.095 | -0.27 | 0.47 | 1.0 | -0.16 | 0.33 | 0.097 | 0.061 | -0.0056 | 0.03 | -0.051 | -0.032 | 0.045 | -0.045 | 0.047 | 0.11 | 0.11 | 0.11 |
pression | -0.017 | -0.1 | 0.076 | 0.046 | 0.033 | -0.16 | 1.0 | -0.18 | -0.018 | 0.22 | 0.071 | -0.19 | -0.096 | -0.13 | 0.017 | -0.017 | -0.031 | -0.094 | -0.097 | -0.011 |
precipitation | 0.024 | -0.037 | 0.14 | -0.022 | 0.24 | 0.33 | -0.18 | 1.0 | 0.021 | 0.017 | 0.0017 | 0.0079 | -0.022 | 0.0017 | 0.032 | -0.032 | -0.0071 | -0.004 | -0.007 | -0.016 |
Year | 0.95 | 0.0 | -0.066 | -0.0012 | -0.038 | 0.097 | -0.018 | 0.021 | 1.0 | -0.13 | -0.34 | -0.23 | 0.34 | 0.21 | -0.022 | 0.022 | -0.11 | 0.0073 | -0.035 | 0.044 |
0_19ans | -0.12 | -0.49 | 0.14 | -0.058 | 0.13 | 0.061 | 0.22 | 0.017 | -0.13 | 1.0 | 0.61 | -0.38 | -0.74 | -0.75 | 0.17 | -0.17 | 0.00094 | 0.13 | 0.14 | -0.036 |
20_39ans | -0.32 | -0.28 | 0.0088 | -0.034 | 0.056 | -0.0056 | 0.071 | 0.0017 | -0.34 | 0.61 | 1.0 | -0.086 | -0.82 | -0.83 | 0.26 | -0.26 | 0.026 | 0.036 | 0.057 | -0.013 |
40_59ans | -0.22 | 0.083 | -0.18 | -0.013 | 0.056 | 0.03 | -0.19 | 0.0079 | -0.23 | -0.38 | -0.086 | 1.0 | 0.088 | 0.19 | 0.12 | -0.12 | 0.086 | -0.09 | -0.086 | -0.025 |
60_74ans | 0.32 | 0.46 | -0.048 | 0.065 | -0.14 | -0.051 | -0.096 | -0.022 | 0.34 | -0.74 | -0.82 | 0.088 | 1.0 | 0.79 | -0.2 | 0.2 | -0.02 | -0.096 | -0.12 | 0.038 |
75etplus | 0.2 | 0.39 | -0.053 | 0.042 | -0.074 | -0.032 | -0.13 | 0.0017 | 0.21 | -0.75 | -0.83 | 0.19 | 0.79 | 1.0 | -0.28 | 0.28 | -0.0011 | -0.048 | -0.064 | 0.021 |
Prop_h | -0.021 | -0.06 | -0.11 | -0.0074 | 0.066 | 0.045 | 0.017 | 0.032 | -0.022 | 0.17 | 0.26 | 0.12 | -0.2 | -0.28 | 1.0 | -1.0 | -0.027 | -0.086 | -0.094 | 0.007 |
Prop_f | 0.021 | 0.06 | 0.11 | 0.0074 | -0.066 | -0.045 | -0.017 | -0.032 | 0.022 | -0.17 | -0.26 | -0.12 | 0.2 | 0.28 | -1.0 | 1.0 | 0.027 | 0.086 | 0.094 | -0.007 |
reqgoo1 | -0.11 | 0.0087 | 0.036 | -0.23 | 0.13 | 0.047 | -0.031 | -0.0071 | -0.11 | 0.00094 | 0.026 | 0.086 | -0.02 | -0.0011 | -0.027 | 0.027 | 1.0 | 0.66 | 0.57 | 0.32 |
reqgoo2 | -0.0019 | -0.14 | 0.067 | -0.25 | 0.14 | 0.11 | -0.094 | -0.004 | 0.0073 | 0.13 | 0.036 | -0.09 | -0.096 | -0.048 | -0.086 | 0.086 | 0.66 | 1.0 | 0.89 | 0.3 |
reqgoo3 | -0.042 | -0.14 | 0.077 | -0.28 | 0.16 | 0.11 | -0.097 | -0.007 | -0.035 | 0.14 | 0.057 | -0.086 | -0.12 | -0.064 | -0.094 | 0.094 | 0.57 | 0.89 | 1.0 | 0.32 |
TauxGrippe | 0.028 | 0.07 | 0.068 | -0.41 | 0.19 | 0.11 | -0.011 | -0.016 | 0.044 | -0.036 | -0.013 | -0.025 | 0.038 | 0.021 | 0.007 | -0.007 | 0.32 | 0.3 | 0.32 | 1.0 |
Unlike the previous technique, here we have three new features that appear relevant to the model. These features are completely different from the one detected by the first method. It therefore becomes rather difficult to make a decision, given that the intersection of the two results is empty.
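To turn this visual reading into code, one simple option is to keep the features whose absolute Kendall correlation with TauxGrippe exceeds a cutoff. The 0.3 threshold below is a heuristic choice added for illustration, not part of the original analysis, and the resulting list depends entirely on it.

# Select features by absolute Kendall correlation with the target (arbitrary 0.3 cutoff).
corr_target = train.corr(method='kendall')["TauxGrippe"].drop("TauxGrippe")
selected_corr = corr_target[corr_target.abs() > 0.3].index.tolist()
print(selected_corr)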
- Ensemble approach (combining methods)
ranks = {}

def ranking(ranks, names, order=1):
    ranks = MinMaxScaler().fit_transform(order * np.array([ranks]).T).T[0]
    ranks = map(lambda x: round(x, 2), ranks)
    return dict(zip(names, ranks))

colnames = features.columns
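To see what this helper does, here is a tiny standalone illustration (toy values only, not part of the pipeline): each method's raw scores are min-max scaled to [0, 1] and rounded, so that every method contributes on the same scale before averaging.

# Toy example: 10 -> 1.0, 2 -> 0.0, 5 -> 0.38
demo = ranking([10.0, 2.0, 5.0], ["a", "b", "c"])
print(demo)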
# -- Construct our decision tree
from sklearn.feature_selection import RFE, f_regression
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor()
Dtree = RFE(tree, n_features_to_select=1, verbose=3)
Dtree.fit(features, target)
ranks["Dtree"] = ranking(list(map(float, Dtree.ranking_)), colnames, order=-1)
Fitting estimator with 19 features. Fitting estimator with 18 features. Fitting estimator with 17 features. Fitting estimator with 16 features. Fitting estimator with 15 features. Fitting estimator with 14 features. Fitting estimator with 13 features. Fitting estimator with 12 features. Fitting estimator with 11 features. Fitting estimator with 10 features. Fitting estimator with 9 features. Fitting estimator with 8 features. Fitting estimator with 7 features. Fitting estimator with 6 features. Fitting estimator with 5 features. Fitting estimator with 4 features. Fitting estimator with 3 features. Fitting estimator with 2 features.
# -- Construct our ExtraTreesRegressor
from sklearn.ensemble import ExtraTreesRegressor

Xtree = ExtraTreesRegressor()
EXtree = RFE(Xtree, n_features_to_select=1, verbose=3)
EXtree.fit(features, target)
ranks["EXtree"] = ranking(list(map(float, EXtree.ranking_)), colnames, order=-1)
Fitting estimator with 19 features. Fitting estimator with 18 features. Fitting estimator with 17 features. Fitting estimator with 16 features. Fitting estimator with 15 features. Fitting estimator with 14 features. Fitting estimator with 13 features. Fitting estimator with 12 features. Fitting estimator with 11 features. Fitting estimator with 10 features. Fitting estimator with 9 features. Fitting estimator with 8 features. Fitting estimator with 7 features. Fitting estimator with 6 features. Fitting estimator with 5 features. Fitting estimator with 4 features. Fitting estimator with 3 features. Fitting estimator with 2 features.
# -- Construct our RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

RF = RandomForestRegressor()
RandF = RFE(RF, n_features_to_select=1, verbose=3)
RandF.fit(features, target)
ranks["RandF"] = ranking(list(map(float, RandF.ranking_)), colnames, order=-1)
Fitting estimator with 19 features. Fitting estimator with 18 features. Fitting estimator with 17 features. Fitting estimator with 16 features. Fitting estimator with 15 features. Fitting estimator with 14 features. Fitting estimator with 13 features. Fitting estimator with 12 features. Fitting estimator with 11 features. Fitting estimator with 10 features. Fitting estimator with 9 features. Fitting estimator with 8 features. Fitting estimator with 7 features. Fitting estimator with 6 features. Fitting estimator with 5 features. Fitting estimator with 4 features. Fitting estimator with 3 features. Fitting estimator with 2 features.
# -- Construct our AdaBoostRegressor
from sklearn.ensemble import AdaBoostRegressor

Adb = AdaBoostRegressor()
AdaBoost = RFE(Adb, n_features_to_select=1, verbose=3)
AdaBoost.fit(features, target)
ranks["AdaBoost"] = ranking(list(map(float, AdaBoost.ranking_)), colnames, order=-1)
Fitting estimator with 19 features. Fitting estimator with 18 features. Fitting estimator with 17 features. Fitting estimator with 16 features. Fitting estimator with 15 features. Fitting estimator with 14 features. Fitting estimator with 13 features. Fitting estimator with 12 features. Fitting estimator with 11 features. Fitting estimator with 10 features. Fitting estimator with 9 features. Fitting estimator with 8 features. Fitting estimator with 7 features. Fitting estimator with 6 features. Fitting estimator with 5 features. Fitting estimator with 4 features. Fitting estimator with 3 features. Fitting estimator with 2 features.
# -- Construct our GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor

GBT = GradientBoostingRegressor(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
GradBoost = RFE(GBT, n_features_to_select=1, verbose=3)
GradBoost.fit(features, target)
ranks["GradBoost"] = ranking(list(map(float, GradBoost.ranking_)), colnames, order=-1)
Fitting estimator with 19 features. Fitting estimator with 18 features. Fitting estimator with 17 features. Fitting estimator with 16 features. Fitting estimator with 15 features. Fitting estimator with 14 features. Fitting estimator with 13 features. Fitting estimator with 12 features. Fitting estimator with 11 features. Fitting estimator with 10 features. Fitting estimator with 9 features. Fitting estimator with 8 features. Fitting estimator with 7 features. Fitting estimator with 6 features. Fitting estimator with 5 features. Fitting estimator with 4 features. Fitting estimator with 3 features. Fitting estimator with 2 features.
# Construct our Linear Regression model
lr = LinearRegression(normalize=True)
lr.fit(features, target)

# Stop the search when only the last feature is left
LinReg = RFE(lr, n_features_to_select=1, verbose=3)
LinReg.fit(features, target)
ranks["LinReg"] = ranking(list(map(float, LinReg.ranking_)), colnames, order=-1)
Fitting estimator with 19 features. Fitting estimator with 18 features. Fitting estimator with 17 features. Fitting estimator with 16 features. Fitting estimator with 15 features. Fitting estimator with 14 features. Fitting estimator with 13 features. Fitting estimator with 12 features. Fitting estimator with 11 features. Fitting estimator with 10 features. Fitting estimator with 9 features. Fitting estimator with 8 features. Fitting estimator with 7 features. Fitting estimator with 6 features. Fitting estimator with 5 features. Fitting estimator with 4 features. Fitting estimator with 3 features. Fitting estimator with 2 features.
# Using Ridge
ridge = Ridge(alpha=7)
ridge.fit(features, target)
ranks['Ridge'] = ranking(np.abs(ridge.coef_), colnames)

# Using Lasso
lasso = Lasso(max_iter=100000, alpha=.05)
lasso.fit(features, target)
ranks["Lasso"] = ranking(np.abs(lasso.coef_), colnames)
xgb = XGBRegressor()
xgb.fit(features, target)
ranks["Xgbt"] = ranking(xgb.feature_importances_, colnames)
# Create empty dictionary to store the mean value calculated from all the scores
r = {}
for name in colnames:
    r[name] = round(np.mean([ranks[method][name] for method in ranks.keys()]), 2)

methods = sorted(ranks.keys())
ranks["Mean"] = r
methods.append("Mean")
Ridge = [ranks['Ridge'][name] for name in colnames]
Ridge = pd.DataFrame(Ridge, columns=['Ridge'])
LinReg = [ranks['LinReg'][name] for name in colnames]
LinReg = pd.DataFrame(LinReg, columns=['LinReg'])
Xgbt = [ranks['Xgbt'][name] for name in colnames]
Xgbt = pd.DataFrame(Xgbt, columns=['Xgbt'])
Dtree = [ranks['Dtree'][name] for name in colnames]
Dtree = pd.DataFrame(Dtree, columns=['Dtree'])
EXtree = [ranks['EXtree'][name] for name in colnames]
EXtree = pd.DataFrame(EXtree, columns=['EXtree'])
RandF = [ranks['RandF'][name] for name in colnames]
RandF = pd.DataFrame(RandF, columns=['RandF'])
AdaBoost = [ranks['AdaBoost'][name] for name in colnames]
AdaBoost = pd.DataFrame(AdaBoost, columns=['AdaBoost'])
GradBoost = [ranks['GradBoost'][name] for name in colnames]
GradBoost = pd.DataFrame(GradBoost, columns=['GradBoost'])
Mean = [ranks['Mean'][name] for name in colnames]
Mean = pd.DataFrame(Mean, columns=['Mean'])
cols = pd.DataFrame(colnames, columns=['Features'])

ranking_score = pd.concat([cols, Ridge, LinReg, Xgbt, Dtree, EXtree, RandF, AdaBoost, GradBoost, Mean], axis=1)
ranking_score.sort_values(by="Mean", ascending=False, inplace=True)
ranking_score
| | Features | Ridge | LinReg | Xgbt | Dtree | EXtree | RandF | AdaBoost | GradBoost | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
18 | reqgoo3 | 0.19 | 0.61 | 1.00 | 0.89 | 0.89 | 0.89 | 0.94 | 0.94 | 0.73 |
3 | t | 0.03 | 0.56 | 0.12 | 0.94 | 0.94 | 0.94 | 1.00 | 0.89 | 0.61 |
0 | week | 0.01 | 0.33 | 0.15 | 1.00 | 1.00 | 1.00 | 0.89 | 1.00 | 0.60 |
10 | 20_39ans | 0.37 | 0.94 | 0.04 | 0.39 | 0.33 | 0.39 | 0.56 | 0.56 | 0.50 |
16 | reqgoo1 | 0.03 | 0.50 | 0.09 | 0.50 | 0.83 | 0.44 | 0.72 | 0.83 | 0.44 |
1 | region_code | 0.00 | 0.17 | 0.07 | 0.83 | 0.61 | 0.78 | 0.67 | 0.78 | 0.43 |
9 | 0_19ans | 0.01 | 0.83 | 0.16 | 0.61 | 0.78 | 0.56 | 0.83 | 0.00 | 0.42 |
6 | pression | 0.00 | 0.00 | 0.04 | 0.72 | 0.72 | 0.83 | 0.61 | 0.72 | 0.40 |
8 | Year | 1.00 | 0.39 | 0.00 | 0.11 | 0.67 | 0.22 | 0.11 | 0.06 | 0.40 |
2 | ff | 0.02 | 0.44 | 0.02 | 0.44 | 0.39 | 0.50 | 0.78 | 0.50 | 0.34 |
4 | u | 0.01 | 0.22 | 0.02 | 0.78 | 0.44 | 0.72 | 0.06 | 0.67 | 0.33 |
13 | 75etplus | 0.22 | 1.00 | 0.08 | 0.22 | 0.17 | 0.28 | 0.50 | 0.28 | 0.31 |
5 | n | 0.00 | 0.06 | 0.02 | 0.56 | 0.28 | 0.61 | 0.39 | 0.61 | 0.28 |
7 | precipitation | 0.00 | 0.28 | 0.03 | 0.67 | 0.50 | 0.67 | 0.28 | 0.11 | 0.28 |
12 | 60_74ans | 0.29 | 0.89 | 0.03 | 0.17 | 0.22 | 0.33 | 0.00 | 0.22 | 0.24 |
11 | 40_59ans | 0.06 | 0.78 | 0.02 | 0.28 | 0.11 | 0.11 | 0.44 | 0.17 | 0.22 |
17 | reqgoo2 | 0.00 | 0.11 | 0.08 | 0.33 | 0.56 | 0.17 | 0.22 | 0.44 | 0.21 |
15 | Prop_f | 0.04 | 0.72 | 0.00 | 0.06 | 0.06 | 0.06 | 0.17 | 0.39 | 0.17 |
14 | Prop_h | 0.04 | 0.67 | 0.04 | 0.00 | 0.00 | 0.00 | 0.33 | 0.33 | 0.16 |
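As an aside, since ranks is a dictionary of dictionaries, pandas can build essentially the same table in a few lines. This is only a sketch of an alternative: the column order differs from the concatenation above, and Lasso (whose scores were computed but left out of the concat) appears as well.

# Alternative construction of the ranking table directly from the dict of dicts.
ranking_alt = (pd.DataFrame(ranks)
                 .rename_axis("Features")
                 .reset_index()
                 .sort_values(by="Mean", ascending=False))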
list(ranking_score[ranking_score['Mean']>0.5].Features)
['reqgoo3', 't', 'week']
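From here, restricting the training matrix to the retained features takes one line; the snippet below is a suggested next step rather than part of the original notebook.

# Keep only the columns whose mean score exceeds the 0.5 threshold used above.
best = list(ranking_score[ranking_score['Mean'] > 0.5].Features)
features_selected = features[best]
features_selected.shape    # expected: (11484, 3)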
# Put the mean scores into a Pandas dataframe
meanplot = pd.DataFrame(list(r.items()), columns=['Feature', 'Mean Ranking'])

# Sort the dataframe
meanplot = meanplot.sort_values('Mean Ranking', ascending=False)
# Let's plot the ranking of the features
import warnings
warnings.filterwarnings("ignore")

plot = sns.factorplot(x="Mean Ranking", y="Feature", data=meanplot, kind="bar",
                      size=14, aspect=1.9, palette='coolwarm')
This last technique is in fact a combination of several methods: it averages their results and thus takes advantage of a large number of approaches. It remains heuristic in the sense that you have to choose a threshold value, but it is nevertheless safer, helping you avoid dropping relevant features or keeping unimportant ones.
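To make that threshold explicit and easy to tune, the final decision rule can be wrapped in a small helper; the function name and its default value below are illustrative, not part of the original code.

# Hypothetical helper: return the features whose averaged score exceeds `threshold`.
def select_features(ranking_df, threshold=0.5, score_col="Mean", name_col="Features"):
    kept = ranking_df.loc[ranking_df[score_col] > threshold, name_col]
    return kept.tolist()

select_features(ranking_score, threshold=0.5)    # -> ['reqgoo3', 't', 'week']

Sweeping this threshold and checking downstream model performance (for instance by cross-validation) is a natural way to choose its value rather than fixing it a priori.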