What is complexity? Good question, and one I should have addressed right away. It is a notion often covered in algorithms classes, but rarely in machine learning classes… Simply put, say you have a model and training it on \(n\) points takes \(x\) minutes. What happens if you train it on \(kn\) points? If the new training time is \(kx\), the training time is linear in the number of points. Sometimes it grows faster: if the new training time is \(k^2 x\), the training time is quadratic in the number of points. If training already takes long on a few thousand points, do not expect to run the method on millions of points (kernel SVMs, I miss you in large samples).
It is harder than one would think to evaluate the complexity of a machine learning algorithm: it may depend on the implementation, properties of the data may lead to different algorithms being used, and the training time often depends on the parameters passed to the algorithm. Another caveat is that learning algorithms are complex and rely on other algorithms.
A theoretical point of view
Some bounds
Here, I propose upper bounds for the case of dense data (they are upper bounds in the sense that an implementation achieving each of them exists).
Calling \(n\) the number of training samples, \(p\) the number of features, \(n_{trees}\) the number of trees (for tree-based methods), \(n_{sv}\) the number of support vectors and \(n_{l_i}\) the number of neurons at layer \(i\) of a neural network, we have the following approximations.
Algorithm | Classification/Regression | Training | Prediction |
---|---|---|---|
Decision Tree | C+R | \(O(n^2p)\) | \(O(p)\) |
Random Forest | C+R | \(O(n^2pn_{trees})\) | \(O(pn_{trees})\) |
Random Forest | R (Breiman implementation) | \(O(n^2p n_{trees})\) | \(O(pn_{trees})\) |
Random Forest | C (Breiman implementation) | \(O(n^2 \sqrt p n_{trees})\) | \(O(pn_{trees})\) |
Extremely Randomized Trees | C+R | \(O(npn_{trees})\) | \(O(npn_{trees})\) |
Gradient Boosting (\(n_{trees}\)) | C+R | \(O(npn_{trees})\) | \(O(pn_{trees})\) |
Linear Regression | R | \(O(p^2n+p^3)\) | \(O(p)\) |
SVM (Kernel) | C+R | \(O(n^2p+n^3)\) | \(O(n_{sv}p)\) |
k-Nearest Neighbours (naive) | C+R | \(-\) | \(O(np)\) |
Nearest centroid | C | \(O(np)\) | \(O(p)\) |
Neural Network | C+R | ? | \(O (pn_{l_1} + n_{l_1} n_{l_2} + ... )\) |
Naive Bayes | C | \(O(np)\) | \(O(p)\) |
Justifications
Decision Tree based models
Obviously, ensemble methods multiply the complexity of the original model by the number of “voters” in the model, and replace the training size by the size of each bag.
When training a decision tree, a split has to be found at each node until a maximum depth \(d\) has been reached.
The strategy for finding this split is to look, for each variable (there are \(p\) of them), at the different thresholds (there are up to \(n\) of them) and at the information gain achieved by each of them (an evaluation in \(O(n)\)), as the sketch below makes explicit.
In the Breiman implementation, and for classification, it is recommended to consider \(\sqrt p\) features at each split.
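As a concrete illustration, here is a minimal sketch of this exhaustive split search (the function names and the choice of variance as the impurity measure are mine, for the regression case): the loop over the \(p\) features, the loop over the up-to-\(n\) thresholds and the \(O(n)\) evaluation of each candidate make the cost per node explicit.

import numpy as np

def variance_reduction(y, left_mask):
    # Gain of a candidate split, measured by the reduction in variance (regression case).
    left, right = y[left_mask], y[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return -np.inf
    return np.var(y) - (len(left) * np.var(left) + len(right) * np.var(right)) / len(y)

def best_split(X, y):
    # p features x up to n thresholds x O(n) per evaluation.
    best_feature, best_threshold, best_gain = None, None, -np.inf
    for j in range(X.shape[1]):
        for threshold in np.unique(X[:, j]):
            gain = variance_reduction(y, X[:, j] <= threshold)
            if gain > best_gain:
                best_feature, best_threshold, best_gain = j, threshold, gain
    return best_feature, best_threshold, best_gain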
Extremely randomized trees
In ERTs, the search for an optimal split simply does not take place: thresholds are drawn at random, which makes finding a (weaker) split much cheaper.
However, in my experience, ERT implementations are not much faster than those of random forests.
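For contrast, here is a sketch of an extremely randomized split (reusing the variance_reduction helper from the previous sketch, so again only an illustration): one threshold is drawn at random per feature, and the exhaustive scan over the \(n\) observed thresholds disappears.

def random_split(X, y, rng=np.random):
    # One random threshold per feature: p candidates instead of up to n * p.
    best_feature, best_threshold, best_gain = None, None, -np.inf
    for j in range(X.shape[1]):
        threshold = rng.uniform(X[:, j].min(), X[:, j].max())
        gain = variance_reduction(y, X[:, j] <= threshold)
        if gain > best_gain:
            best_feature, best_threshold, best_gain = j, threshold, gain
    return best_feature, best_threshold, best_gain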
Linear regressions
The problem of finding the vector of weights \(\beta\) in a linear regression boils down to evaluating the following equation: \(\beta=(X'X)^{-1}X'Y\).
The most computationally intensive parts are evaluating the product \(X'X\), which takes \(O(p^2n)\) operations, and inverting it, which takes \(O(p^3)\) operations.
Most implementations actually solve the system of equations \((X'X)\beta = X'Y\) (or a least-squares problem on \(X\) directly) rather than inverting the matrix explicitly, but the complexity remains of the same order.
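A minimal NumPy sketch of these two steps (the function name is mine): forming \(X'X\) costs roughly \(p^2 n\) operations, and solving the resulting \(p \times p\) system costs roughly \(p^3\).

import numpy as np

def fit_linear_regression(X, y):
    # Forming the p x p Gram matrix X'X: about p^2 * n operations.
    gram = X.T @ X
    moment = X.T @ y
    # Solving (X'X) beta = X'y: about p^3 operations
    # (np.linalg.solve is preferable to computing the inverse explicitly).
    return np.linalg.solve(gram, moment)

beta = fit_linear_regression(np.random.rand(1000, 10), np.random.rand(1000))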
Support Vector Machine
For the training part, the classical algorithms require evaluating the kernel matrix \(K\), the matrix whose general term is \(K(x_i,x_j)\), where \(K\) is the specified kernel.
It is assumed that \(K\) can be evaluated in \(O(p)\), as is the case for the common kernels (Gaussian, polynomial, sigmoid…). This assumption may be wrong for other kernels.
Then, solving the constrained quadratic program is “morally equivalent to” inverting a square matrix of size \(n\), whose complexity is assumed to be \(O(n^3)\).
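The sketch below only illustrates where the \(O(n^2p)\) term comes from, with a Gaussian kernel; it says nothing about how actual SVM solvers then handle the \(O(n^3)\) optimization part.

import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    # n^2 pairs of points, each kernel evaluation in O(p): O(n^2 p) overall.
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            diff = X[i] - X[j]
            K[i, j] = np.exp(-gamma * np.dot(diff, diff))
    return K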
k-Nearest Neighbours
In its simplest form, given a new data point \(x\), the kNN algorithm looks for the \(k\) closest points to \(x\) in the training data and returns the most common label (or the average of the targets, for a regression problem).
To achieve this, the distance between \(x\) and every point in the data set has to be computed, which amounts to \(n\) distance evaluations. For the common distances (Euclidean, Manhattan…), each evaluation is performed in \(O(p)\) operations. Note that kernel k-nearest neighbours has the same complexity (provided the kernels enjoy the same property).
However, many implementations pre-process the training set, indexing the points with space-partitioning structures such as quadtrees or k-d trees, which dramatically lowers the number of points that need to be compared at query time.
Likewise, in the case of a sparse data set, with an inverted index over the rows, it is not necessary to compute all the pairwise distances.
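Before any of these indexing tricks, a naive prediction therefore looks like the following sketch (names are mine), with one \(O(p)\) distance per training point.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    # n Euclidean distances, each computed in O(p): O(n p) per query.
    distances = np.linalg.norm(X_train - x, axis=1)
    # Keep the k closest points (a full sort is used here for simplicity).
    nearest = np.argsort(distances)[:k]
    # Majority vote over their labels (take the mean of y_train[nearest] for regression).
    return Counter(y_train[nearest]).most_common(1)[0][0]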
The practical point of view
All this is nice, but what about real-life examples? Below, we can focus on the scikit-learn implementations.
The assumption will be that the complexities take the form \(O(n^\alpha p^\beta)\); \(\alpha\) and \(\beta\) are then estimated from randomly generated samples, with \(n\) and \(p\) varying, using a log-log regression.
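In other words, the model fitted on the timings is the following log-log regression, where \(c\) absorbs the constant factors:

\[ \log(\text{Time}) \approx c + \alpha \log n + \beta \log p \]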
Though this assumption is wrong strictly speaking, it should give a better idea of how the algorithms behave, and it reveals some implementation details and differences between the default settings of the same algorithm that one may overlook.
The results
Method | \(\alpha\) | \(\beta\) |
---|---|---|
RandomForestRegressor | 1.21 | 0.89 |
ExtraTreesRegressor | 1.03 | 0.88 |
AdaBoostRegressor | 0.71 | 1.01 |
LinearRegression | 0.72 | 1.3 |
SVR | 1.96 | 0.42 |
RandomForestClassifier | 1.09 | 0.38 |
ExtraTreesClassifier | 0.81 | 0.31 |
AdaBoostClassifier | 0.89 | 0.79 |
SVC | 2.05 | 0.52 |
LogisticRegression(solver=liblinear) | 0.9 | 0.88 |
LogisticRegression(solver=sag) | 1.03 | 0.95 |
Surprisingly, some methods appear sublinear in \(n\); perhaps the sample sizes were too small. As expected, the support vector machines show an exponent in \(n\) close to 2, which does not scale well with the sample size.
Another interesting point is the complexity in \(p\) of the random forests and extra trees: the exponent in \(p\) differs depending on whether we are solving a regression or a classification problem. A quick look at the documentation explains it: the default behaviors differ between the two problems!
For the regression:
max_features : int, float, string or None, optional (default=”auto”)
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split.
If “auto”, then max_features=n_features.
If “sqrt”, then max_features=sqrt(n_features).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.
Whereas the classification default behavior is
If “auto”, then max_features=sqrt(n_features).
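This difference can be checked directly (a small sketch on random data, using the default settings quoted above): once fitted, each tree of the forest exposes the number of features it actually considered at each split through its max_features_ attribute.

import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

X = np.random.rand(100, 16)

# Regression: all 16 features are considered at each split.
reg = RandomForestRegressor(n_estimators=10).fit(X, np.random.rand(100))
print(reg.estimators_[0].max_features_)   # 16

# Classification: only sqrt(16) = 4 features are considered at each split.
clf = RandomForestClassifier(n_estimators=10).fit(X, np.random.binomial(1, 0.5, 100))
print(clf.estimators_[0].max_features_)   # 4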
The code
For those who would like to run the code on other algorithms, here is the method I used.
import numpy as np
import pandas as pd
import time
from sklearn.linear_model import LinearRegression
import math
class ComplexityEvaluator:

    def __init__(self, nrow_samples, ncol_samples):
        # Lists of sample sizes (n) and numbers of features (p) to benchmark.
        self._nrow_samples = nrow_samples
        self._ncol_samples = ncol_samples

    def _time_samples(self, model, random_data_generator):
        # Time model.fit() on random data for every (n, p) pair.
        rows_list = []
        for nrow in self._nrow_samples:
            for ncol in self._ncol_samples:
                train, labels = random_data_generator(nrow, ncol)

                start_time = time.time()
                model.fit(train, labels)
                elapsed_time = time.time() - start_time

                result = {"N": nrow, "P": ncol, "Time": elapsed_time}
                rows_list.append(result)
        return rows_list

    def Run(self, model, random_data_generator):
        # Log-log regression of the fitting time against n and p:
        # the coefficients are the estimated exponents alpha and beta.
        data = pd.DataFrame(self._time_samples(model, random_data_generator))
        print(data)
        data = data.applymap(math.log)
        linear_model = LinearRegression(fit_intercept=True)
        linear_model.fit(data[["N", "P"]], data[["Time"]])
        return linear_model.coef_
And a small unit test (which can simply be appended at the bottom of the previous file).
if __name__ == "__main__":

    class TestModel:

        def __init__(self):
            pass

        def fit(self, x, y):
            # Fitting time proportional to the number of rows:
            # the estimated exponents should be close to (1, 0).
            time.sleep(x.shape[0] / 1000.)

    def random_data_generator(n, p):
        return np.random.rand(n, p), np.random.rand(n, 1)

    model = TestModel()
    complexity_evaluator = ComplexityEvaluator(
        [200, 500, 1000, 2000, 3000], [1, 5, 10])
    res = complexity_evaluator.Run(model, random_data_generator)
    print(res)
After this small unit test, everything seems consistent: the estimated exponent in \(n\) is close to 1 and the exponent in \(p\) is close to 0, as expected for a fitting time proportional to the number of rows.
N P Time
0 200 1 0.200263
1 200 5 0.200421
2 200 10 0.200520
3 500 1 0.500928
4 500 5 0.500711
5 500 10 0.501064
6 1000 1 1.001396
7 1000 5 1.001420
8 1000 10 1.000629
9 2000 1 2.002225
10 2000 5 2.002096
11 2000 10 2.000759
12 3000 1 3.003331
13 3000 5 3.003340
14 3000 10 3.003350
[[ 9.99583664e-01 1.13487892e-05]]
So let’s enjoy the number of algorithms offered by sklearn. The following list may be updated as new algorithms are tested.
import numpy as np
import ComplexityEvaluator
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.svm import SVR, SVC
from sklearn.linear_model import LogisticRegression
def random_data_regression(n, p):
    return np.random.rand(n, p), np.random.rand(n)


def random_data_classification(n, p):
    return np.random.rand(n, p), np.random.binomial(1, 0.5, n)


regression_models = [RandomForestRegressor(),
                     ExtraTreesRegressor(),
                     AdaBoostRegressor(),
                     LinearRegression(),
                     SVR()]

classification_models = [RandomForestClassifier(),
                         ExtraTreesClassifier(),
                         AdaBoostClassifier(),
                         SVC(),
                         LogisticRegression(),
                         LogisticRegression(solver='sag')]

names = ["RandomForestRegressor",
         "ExtraTreesRegressor",
         "AdaBoostRegressor",
         "LinearRegression",
         "SVR",
         "RandomForestClassifier",
         "ExtraTreesClassifier",
         "AdaBoostClassifier",
         "SVC",
         "LogisticRegression(solver=liblinear)",
         "LogisticRegression(solver=sag)"]

complexity_evaluator = ComplexityEvaluator.ComplexityEvaluator(
    [500, 1000, 2000, 5000, 10000, 15000, 20000],
    [5, 10, 20, 50, 100, 200])

i = 0
for model in regression_models:
    res = complexity_evaluator.Run(model, random_data_regression)[0]
    print(names[i] + ' | ' + str(round(res[0], 2)) +
          ' | ' + str(round(res[1], 2)))
    i = i + 1

for model in classification_models:
    res = complexity_evaluator.Run(model, random_data_classification)[0]
    print(names[i] + ' | ' + str(round(res[0], 2)) +
          ' | ' + str(round(res[1], 2)))
    i = i + 1
Learning more
Machine learning
The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani and Jerome Friedman, is a brilliant introduction to the topic and will help you gain a better understanding of most of the algorithms presented in this article!
Foundations of Machine Learning, by Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar, provides a theoretical framework for various machine learning algorithms and a detailed study of some of them.
Algorithms and computational complexity
If you want a thorough understanding of computational complexity theory, these books are great resources.
Computational Complexity: A Modern Approach is a clear, detailed analysis of the topic, also covering cryptography and quantum computation.
Randomized Algorithms is more specialized than the first one, but I like the interplay between probabilities and algorithms presented there.