Introduction
Zindi is hosting the Fossil Demand Forecasting Challenge, where competitors have to predict the number of units sold for various products.
Note that the rules state that the metric to optimize is not the usual squared error but the absolute error:
The evaluation metric for this challenge is Mean Absolute Error.
All the models relying on the minimization of least squares (usual regressions, random forests with default parameters) are likely to perform poorly, since they return means over subsamples, whereas the absolute error is minimized by the median of the sample.
In mathematical terms:
\[\arg\min_x \sum_{j=1}^n (x_j - x)^2 = \bar{x},\]
\[\arg\min_x \sum_{j=1}^n |x_j - x| = \mathrm{med}(x_1, \dots, x_n).\]
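As a quick sanity check of these identities, the small snippet below (an illustration on synthetic data, not part of the benchmark) searches over constant predictions by brute force: the constant minimizing the squared error lands on the sample mean, while the constant minimizing the absolute error lands on the sample median.

import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(lam=5, size=1_000)  # synthetic, skewed "units sold" sample

# Brute-force search over candidate constant predictions.
candidates = np.linspace(x.min(), x.max(), 10_001)
sse = np.array([np.sum((x - c) ** 2) for c in candidates])
sae = np.array([np.sum(np.abs(x - c)) for c in candidates])

print(candidates[np.argmin(sse)], x.mean())      # squared error -> mean
print(candidates[np.argmin(sae)], np.median(x))  # absolute error -> median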
A simple benchmark
With that knowledge, the benchmark below simply returns, for each product, the median of its units sold over the year 2021. The score should be around 192xxx.
import numpy as np
import pandas as pd
import random

# Fix the seeds for reproducibility.
random.seed(0)
np.random.seed(0)

# Load the training data and build a proper datetime column
# from the year and month fields.
train = pd.read_csv("../raw_data/Train.csv")
train["year_month"] = train["year"].astype(str) + "/" + train["month"].astype(str)
train["date"] = pd.to_datetime(train["year_month"])

# Keep only 2021 and compute the median units sold per product.
train_recent = train[train["date"] >= "2021/01"]
medians = train_recent.groupby("sku_name")["sellin"].median().to_dict()

# Default prediction of 0 for test products never seen in `medians`.
test = pd.read_csv("../raw_data/Test.csv")
missing = {sku_name: 0 for sku_name in test["sku_name"].unique()}

# The first replace maps known products to their median; the second
# zeroes out the sku names left untouched by the first one.
test["Target"] = test["sku_name"].replace(medians).replace(missing).astype(int)

# Build the submission in the expected Item_ID / Target format.
test["Item_ID"] = (
    test["sku_name"] + "_" + test["month"].astype(str) + "_" + test["year"].astype(str)
)
test[["Item_ID", "Target"]].to_csv("./submission_.csv", index=False)
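As a side note, the double replace can be written more directly with map: sku names absent from the medians dictionary become NaN, which fillna then zeroes out. This is a sketch of an equivalent one-liner, not what was submitted:

# Equivalent, arguably clearer: unknown products become NaN, then 0.
test["Target"] = test["sku_name"].map(medians).fillna(0).astype(int)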