Python pipeline series
Pipelines are probably one of the most convenient tools in scikit-learn: they offer a simple way to write reusable models, in which all the hyperparameters, of both the learning and the preprocessing steps, live in the exact same place. However, I do not see them that often in code snippets or in data science competitions.
The problem
A good practice, when working with factors or categories in a dataframe, is to replace values that appear only a limited number of times.
A simple reason to do so is that a category appearing just once will be hard to generalize from. Decision-tree-based methods will probably ignore it anyway (as long as min_samples_leaf
is larger than the number of occurrences of this value), so why bother keeping such values?
A simple solution
As stressed in this Stack Overflow answer, a simple one-liner does the job:
df.loc[df[col].value_counts()[df[col]].values < 10, col] = "RARE_VALUE"
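A quick check of the one-liner on a toy frame (the column name is made up, and the threshold is lowered to 2 so the effect is visible on just a few rows):

```python
import pandas as pd

df = pd.DataFrame({"col": ["a", "a", "b"]})
col = "col"
# value_counts()[df[col]] maps each row to the count of its value,
# so rows whose value appears fewer than 2 times get replaced
df.loc[df[col].value_counts()[df[col]].values < 2, col] = "RARE_VALUE"
print(df[col].tolist())  # ['a', 'a', 'RARE_VALUE']
```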
One-liners are good: easy to copy and paste. Also easy to make mistakes with.
Imagine you are working with a messy dataset and figure out that it would be nice to have a function that takes care of the cleaning.
Copy-pasting the above, you end up writing:
def clean_variables(data):
    columns = ['Gender', 'Car_Category', 'Subject_Car_Colour',
               'Subject_Car_Make', 'LGA_Name', 'State']
    for column in columns:
        data[column].fillna("empty", inplace=True)
        data.loc[data[column].value_counts()[data[column]].values < 10,
                 column] = "RARE_VALUE"
    data["Age"] = data["Age"].apply(clip_age)
    [...]  # other stuff you may do
    return data
The issue
And then you forget about it. Some day a test set comes along and you blindly apply the clean_variables function to it. That's what functions are for, after all: reusing!
So you write:
train = clean_variables(train)
test = clean_variables(test)
And who knows what may happen from there. If the test set is too small (fewer than 10 rows), all the factors will be turned into "RARE_VALUE". Depending on the importance given to these features by the learning algorithm you apply later, the performance on the test set could be good, or very bad.
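A minimal illustration of the pitfall (the column name is made up): applying the same value_counts-based replacement independently to a tiny test set flags every category as rare, since no value can possibly reach 10 occurrences in 3 rows.

```python
import pandas as pd

# a tiny "test set" of only 3 rows
test = pd.DataFrame({"Gender": ["M", "F", "M"]})
col = "Gender"
# counts are computed on the test set itself, so every count is < 10
test.loc[test[col].value_counts()[test[col]].values < 10, col] = "RARE_VALUE"
print(test[col].tolist())  # ['RARE_VALUE', 'RARE_VALUE', 'RARE_VALUE']
```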
A better solution
Instead, I would recommend putting all this in a pipeline. As far as I know, there is no simple class in scikit-learn that enables this kind of replacement, so I ended up writing the following class, which does the job:
class RemoveScarceValuesFeatureEngineer:

    def __init__(self, min_occurrences):
        self._min_occurrences = min_occurrences
        self._column_value_counts = {}

    def fit(self, X, y):
        for column in X.columns:
            self._column_value_counts[column] = X[column].value_counts()
        return self

    def transform(self, X):
        for column in X.columns:
            # reindex + fillna(0) so that values never seen during fit
            # count as 0 occurrences instead of raising a KeyError
            occurrences = self._column_value_counts[column].reindex(
                X[column]).fillna(0).values
            X.loc[occurrences < self._min_occurrences, column] = "RARE_VALUE"
        return X

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)
if __name__ == "__main__":
    import pandas as pd

    sample_train = pd.DataFrame(
        [{"a": 1, "s": "a"}, {"a": 1, "s": "a"}, {"a": 1, "s": "b"}])
    rssfe = RemoveScarceValuesFeatureEngineer(2)
    print(sample_train)
    print(rssfe.fit_transform(sample_train, None))
    print(20 * "=")

    sample_test = pd.DataFrame([{"a": 1, "s": "a"}, {"a": 1, "s": "b"}])
    print(sample_test)
    print(rssfe.transform(sample_test))
And executing the code:
a s
0 1 a
1 1 a
2 1 b
a s
0 1 a
1 1 a
2 1 RARE_VALUE
====================
a s
0 1 a
1 1 b
a s
0 1 a
1 1 RARE_VALUE
You have the desired behavior: a is not replaced with RARE_VALUE in the test set!
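To actually use such a transformer inside a scikit-learn Pipeline, here is a sketch, assuming scikit-learn is installed; the compact transformer below (RareValueGrouper is a made-up name) mirrors the class from this post, and inheriting from BaseEstimator and TransformerMixin gives it fit_transform and parameter handling for free.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class RareValueGrouper(BaseEstimator, TransformerMixin):
    def __init__(self, min_occurrences=10):
        self.min_occurrences = min_occurrences

    def fit(self, X, y=None):
        # store per-column value counts from the training data only
        self.counts_ = {c: X[c].value_counts() for c in X.columns}
        return self

    def transform(self, X):
        X = X.copy()
        for c in X.columns:
            # unseen values get a count of 0 and are flagged as rare
            counts = self.counts_[c].reindex(X[c]).fillna(0).values
            X.loc[counts < self.min_occurrences, c] = "RARE_VALUE"
        return X


pipe = Pipeline([("rare", RareValueGrouper(min_occurrences=2))])
train = pd.DataFrame({"s": ["a", "a", "b"]})
test = pd.DataFrame({"s": ["a", "b", "c"]})
pipe.fit(train)
print(pipe.transform(test)["s"].tolist())  # ['a', 'RARE_VALUE', 'RARE_VALUE']
```

In a real model you would append an encoder and an estimator as further steps, so that the rarity threshold sits next to the learning hyperparameters and the counts are always fitted on the training data alone.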