Introduction
Zindi is hosting the Fossil Demand Forecasting Challenge, where competitors have to predict the number of units sold for various products.
Note that the rules state that the metric to optimize is not the usual squared error but the absolute error:
The evaluation metric for this challenge is Mean Absolute Error.
All the models relying on the minimization of least squares (usual regressions, random forests with default parameters) are likely to perform poorly, since they return means over subsamples, whereas the absolute error is minimized by the median of the sample.
In mathematical terms:
\[\arg\min_x \sum_{j=1}^n (x_j - x)^2 = \bar{x},\]
\[\arg\min_x \sum_{j=1}^n |x_j - x| = \mathrm{med}(x_1, \dots, x_n).\]
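As a quick sanity check of these identities, the small snippet below (an illustration on synthetic data, not part of the benchmark) searches over constant predictions by brute force: the constant minimizing the squared error lands on the sample mean, while the constant minimizing the absolute error lands on the sample median.

import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(lam=5, size=1_000)  # synthetic, skewed "units sold" sample

# Brute-force search over candidate constant predictions.
candidates = np.linspace(x.min(), x.max(), 10_001)
sse = np.array([np.sum((x - c) ** 2) for c in candidates])
sae = np.array([np.sum(np.abs(x - c)) for c in candidates])

print(candidates[np.argmin(sse)], x.mean())      # squared error -> mean
print(candidates[np.argmin(sae)], np.median(x))  # absolute error -> median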
A simple benchmark
With that knowledge, the benchmark below simply returns, for each product, the median of its units sold over the year 2021. The score should be around 192xxx.
import numpy as np
import pandas as pd
import random

# Fix the seeds for reproducibility.
random.seed(0)
np.random.seed(0)

# Load the training data and build a proper datetime column
# from the year and month fields.
train = pd.read_csv("../raw_data/Train.csv")
train["year_month"] = train["year"].astype(str) + "/" + train["month"].astype(str)
train["date"] = pd.to_datetime(train["year_month"])

# Keep only 2021 and compute the median units sold per product.
train_recent = train[train["date"] >= "2021/01"]
medians = train_recent.groupby("sku_name")["sellin"].median().to_dict()

# Default prediction of 0 for test products never seen in `medians`.
test = pd.read_csv("../raw_data/Test.csv")
missing = {sku_name: 0 for sku_name in test["sku_name"].unique()}

# The first replace maps known products to their median; the second
# zeroes out the sku names left untouched by the first one.
test["Target"] = test["sku_name"].replace(medians).replace(missing).astype(int)

# Build the submission in the expected Item_ID / Target format.
test["Item_ID"] = (
    test["sku_name"] + "_" + test["month"].astype(str) + "_" + test["year"].astype(str)
)
test[["Item_ID", "Target"]].to_csv("./submission_.csv", index=False)
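As a side note, the double replace can be written more directly with map: sku names absent from the medians dictionary become NaN, which fillna then zeroes out. This is a sketch of an equivalent one-liner, not what was submitted:

# Equivalent, arguably clearer: unknown products become NaN, then 0.
test["Target"] = test["sku_name"].map(medians).fillna(0).astype(int)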