Python pipeline series
Pipelines are probably one of the most convenient tools in scikit-learn: they offer a simple way to write reusable models, in which all the hyperparameters, of both the learning and the preprocessing steps, live in the exact same place. However, I do not see them that often in code snippets or in data science competitions.
The problem
A good practice, when working with factors or categories in a dataframe, is to replace values that appear only a limited number of times.
A simple reason to do so is that a category appearing just once will be hard to generalize from. Decision-tree-based methods will probably ignore it anyway (as long as min_samples_leaf
is larger than the number of occurrences of this value), so why bother keeping such values?
A simple solution
As stressed in this Stack Overflow answer, a simple one-liner does the job:
df.loc[df[col].value_counts()[df[col]].values < 10, col] = "RARE_VALUE"
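A quick check of the one-liner on a toy frame (the column name is made up, and the threshold is lowered to 2 so the effect is visible on just a few rows):

```python
import pandas as pd

df = pd.DataFrame({"col": ["a", "a", "b"]})
col = "col"
# value_counts()[df[col]] maps each row to the count of its value,
# so rows whose value appears fewer than 2 times get replaced
df.loc[df[col].value_counts()[df[col]].values < 2, col] = "RARE_VALUE"
print(df[col].tolist())  # ['a', 'a', 'RARE_VALUE']
```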
One-liners are good: easy to copy and paste. Also easy to make mistakes with.
Imagine you are working with a messy dataset and figure out that it would be nice to have a function that takes care of the cleaning.
Copy-pasting the above, you end up writing:
def clean_variables(data):
    columns = ['Gender', 'Car_Category', 'Subject_Car_Colour',
               'Subject_Car_Make', 'LGA_Name', 'State']
    for column in columns:
        data[column].fillna("empty", inplace=True)
        data.loc[data[column].value_counts()[data[column]].values < 10,
                 column] = "RARE_VALUE"
    data["Age"] = data["Age"].apply(clip_age)
    [...]  # other stuff you may do
    return data
The issue
And then you forget about it. Some day a test set comes along and you blindly apply the clean_variables function to it. That's what functions are for, after all: reusing!
So you write:
train = clean_variables(train)
test = clean_variables(test)
And who knows what may happen from there. If the test set is too small (fewer than 10 rows), all the factors will be turned into "RARE_VALUE". Depending on the importance given to these features by the learning algorithm you apply later, the performance on the test set could be good, or very bad.
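A minimal illustration of the pitfall (the column name is made up): applying the same value_counts-based replacement independently to a tiny test set flags every category as rare, since no value can possibly reach 10 occurrences in 3 rows.

```python
import pandas as pd

# a tiny "test set" of only 3 rows
test = pd.DataFrame({"Gender": ["M", "F", "M"]})
col = "Gender"
# counts are computed on the test set itself, so every count is < 10
test.loc[test[col].value_counts()[test[col]].values < 10, col] = "RARE_VALUE"
print(test[col].tolist())  # ['RARE_VALUE', 'RARE_VALUE', 'RARE_VALUE']
```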
A better solution
Instead, I would recommend putting all this in a pipeline. As far as I know, there is no simple class in scikit-learn that enables this kind of replacement, so I ended up writing the following class, which does the job:
class RemoveScarceValuesFeatureEngineer:

    def __init__(self, min_occurrences):
        self._min_occurrences = min_occurrences
        self._column_value_counts = {}

    def fit(self, X, y):
        for column in X.columns:
            self._column_value_counts[column] = X[column].value_counts()
        return self

    def transform(self, X):
        for column in X.columns:
            # reindex + fillna(0) so that values never seen during fit
            # count as 0 occurrences instead of raising a KeyError
            occurrences = self._column_value_counts[column].reindex(
                X[column]).fillna(0).values
            X.loc[occurrences < self._min_occurrences, column] = "RARE_VALUE"
        return X

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)
if __name__ == "__main__":
    import pandas as pd

    sample_train = pd.DataFrame(
        [{"a": 1, "s": "a"}, {"a": 1, "s": "a"}, {"a": 1, "s": "b"}])
    rssfe = RemoveScarceValuesFeatureEngineer(2)
    print(sample_train)
    print(rssfe.fit_transform(sample_train, None))
    print(20 * "=")

    sample_test = pd.DataFrame([{"a": 1, "s": "a"}, {"a": 1, "s": "b"}])
    print(sample_test)
    print(rssfe.transform(sample_test))
And executing the code:
a s
0 1 a
1 1 a
2 1 b
a s
0 1 a
1 1 a
2 1 RARE_VALUE
====================
a s
0 1 a
1 1 b
a s
0 1 a
1 1 RARE_VALUE
You have the desired behavior: a is not replaced with RARE_VALUE in the test set!
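To actually use such a transformer inside a scikit-learn Pipeline, here is a sketch, assuming scikit-learn is installed; the compact transformer below (RareValueGrouper is a made-up name) mirrors the class from this post, and inheriting from BaseEstimator and TransformerMixin gives it fit_transform and parameter handling for free.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class RareValueGrouper(BaseEstimator, TransformerMixin):
    def __init__(self, min_occurrences=10):
        self.min_occurrences = min_occurrences

    def fit(self, X, y=None):
        # store per-column value counts from the training data only
        self.counts_ = {c: X[c].value_counts() for c in X.columns}
        return self

    def transform(self, X):
        X = X.copy()
        for c in X.columns:
            # unseen values get a count of 0 and are flagged as rare
            counts = self.counts_[c].reindex(X[c]).fillna(0).values
            X.loc[counts < self.min_occurrences, c] = "RARE_VALUE"
        return X


pipe = Pipeline([("rare", RareValueGrouper(min_occurrences=2))])
train = pd.DataFrame({"s": ["a", "a", "b"]})
test = pd.DataFrame({"s": ["a", "b", "c"]})
pipe.fit(train)
print(pipe.transform(test)["s"].tolist())  # ['a', 'RARE_VALUE', 'RARE_VALUE']
```

In a real model you would append an encoder and an estimator as further steps, so that the rarity threshold sits next to the learning hyperparameters and the counts are always fitted on the training data alone.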