Introduction
Most of the regression problems I have dealt with focused on minimizing the L2 norm of the difference between a model's predictions and the true values. However, from a business point of view, it may make more sense to directly minimize the L1 norm.
Indeed, a least squares approach mainly penalizes large errors. But if the cost of an error is roughly constant (once again, from the business perspective), then the L1 penalty makes sense as well.
Note, however, that the resulting estimators can be vastly different: for a constant model, the optimal value under the L2 penalty is the mean, while under the L1 penalty it is the median.
More formally:
\[\arg\min_c \sum_{j=1}^n (x_j - c)^2 = \bar{x},\]
\[\arg\min_c \sum_{j=1}^n |x_j - c| = \mathrm{med}(x_1, \dots, x_n).\]
So far, so good.
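As a quick numerical illustration (a sketch that is not part of the original derivation), we can brute-force the best constant under each penalty and compare it with numpy's mean and median:

import numpy as np

# Brute-force the best constant c under each penalty, on skewed data
# where the mean and the median clearly differ.
rng = np.random.default_rng(0)
x = rng.exponential(size=500)
candidates = np.linspace(x.min(), x.max(), 5000)

l2_cost = ((x[None, :] - candidates[:, None]) ** 2).sum(axis=1)
l1_cost = np.abs(x[None, :] - candidates[:, None]).sum(axis=1)

print(candidates[l2_cost.argmin()], np.mean(x))    # both close to the mean
print(candidates[l1_cost.argmin()], np.median(x))  # both close to the median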
However, things become complex quite quickly: a theoretical advantage of the L2 penalty is that the objective is differentiable, and we enjoy closed-form solutions for many problems relying on L2 penalties. For ordinary least squares, for instance:
\[\hat{\beta} = \underset{\beta}{\arg\min}\, L\left(D, \beta\right) = \underset{\beta}{\arg\min} \sum_{i=1}^{n} \left(\beta \cdot x_i - y_i\right)^2\]
\[\hat{\beta} = \left(X^\textsf{T}X\right)^{-1}X^\textsf{T}Y\]
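This closed form can be checked directly with numpy; below is a minimal sketch (the data and the three made-up coefficients are only for illustration), solving the normal equations with np.linalg.solve rather than forming the inverse explicitly:

import numpy as np

# Closed-form least squares via the normal equations given above.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
beta_true = np.array([0.5, -1.0, 2.0])
Y = X @ beta_true + 0.1 * rng.normal(size=100)

# Solve (X^T X) beta = X^T Y instead of inverting X^T X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)  # close to beta_true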
Algorithm
I will focus on implementing the univariate case, without an intercept. Including an intercept, or handling the multivariate case, requires a (much) more complex optimization.
\[\hat{b} = \arg\min_b \sum_{j=1}^n |y_j - b \cdot x_j|\]
Calling \(L(b) = \sum_{j=1}^n |y_j - b \cdot x_j|\), we note that \(L\) is a convex function whose derivative is not continuous.
Its derivative, where it is defined, is:
\[l(b) = - \sum_{j=1}^n x_j \cdot \operatorname{sgn}(y_j - b \cdot x_j)\]
Given that \(L\) is convex, \(l\) must be non-decreasing. Moreover, \(l(-\infty) = - \sum_{j=1}^n |x_j| < 0\) and \(l(+\infty) = \sum_{j=1}^n |x_j| > 0\).
Therefore, we will look for a zero of \(l\) using dichotomy (bisection).
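As a side note (standard bisection analysis, not spelled out in the original), if the initial bracket containing a zero of \(l\) has width \(w\), then after \(n\) halvings the zero is located to within
\[\frac{w}{2^n},\]
so the 20 iterations used below pin the slope down to roughly \(10^{-6}\) of the initial bracket width.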
Implementation
As detailed above, once we have the penalty and the dichotomy algorithm implemented, there is nothing else to do:
import numpy as np


def dichotomy(n, f, l, r):
    """Approximate a zero of the non-decreasing function f on [l, r]
    with n bisection steps."""
    c = (l + r) / 2
    if n == 0:
        return c
    elif f(c) < 0:
        # The zero lies to the right of c.
        return dichotomy(n - 1, f, c, r)
    else:
        # The zero lies to the left of c (or at c).
        return dichotomy(n - 1, f, l, c)


def penalty(x, y, b):
    """Derivative l(b) of the L1 loss, where it is defined."""
    return -np.sum(x * np.sign(y - b * x))


class L1Regressor:

    def __init__(self, n_iterations=20):
        self.n_iterations = n_iterations
        self.b = None

    def fit(self, x, y):
        # The optimal slope is bracketed by the extreme ratios y_j / x_j.
        ratios = y / x
        l, r = np.min(ratios), np.max(ratios)
        self.b = dichotomy(self.n_iterations, lambda b: penalty(x, y, b), l, r)
        return self
Tests
If we append the following:
if __name__ == "__main__":
    import matplotlib.pyplot as plt

    x = np.random.rand(100)
    y = x * 0.5 + 0.1 * np.random.normal(size=100)
    slope = L1Regressor().fit(x, y).b

    plt.scatter(x, y)
    plt.plot(x, x * slope, c='r')
    plt.savefig("./illustration.svg")
We obtain:
Comparison with L2
We can contaminate the observations with a few noisy points (outliers); we expect the regression with the L1 penalty to be more robust to them.
Below, the two slopes are plotted: in red, the slope minimizing the L1 penalty; in green, the one minimizing the L2 penalty.
from sklearn.linear_model import LinearRegression

x = np.random.rand(100)
y = x * 0.5 + 0.1 * np.random.normal(size=100)

# Turn the first ten observations into outliers.
x[:10] = 0.9 + 0.1 * np.random.rand(10)
y[:10] = 2 + x[:10] * 0.5 + np.random.normal(size=10)

slope = L1Regressor().fit(x, y).b
slopel2 = LinearRegression(fit_intercept=False).fit(x.reshape(-1, 1), y).coef_[0]

plt.figure()  # start a fresh figure instead of drawing over the previous plot
plt.scatter(x, y)
plt.plot(x, x * slope, c='r')
plt.plot(x, x * slopel2, c='g')
plt.savefig("./illustration_noise.svg")