Acronyms of data science

Every field has its acronyms, machine learning does not avoid this rule. I am not a big fan of heavy acronyms usage, but if you happen to be lost in the middle of meeting, this may help!

Below is a list of acronyms, if I missed any, or if you have nice references regarding each topic, feel free to comment, I will do my best to keep this list as up-to-date as possible!

AE Auto encoder

AD Automatic differentiation

ANN Artificial Neural Network

API Application Programming Interface

ARD Automatic Relevance Determination

ASR Automatic Speech Recognition ASR

BPTT Back Propagation Through Time

BPTS Back Propagation Through Structure

BNN Binary Neural Net

COCO Common Objects in Context

CPPN Compositional Pattern-Producing Network

CTC Connectionist Temporal Classification

CNN Convolutional Neural network

CRF Conditional Random Field

CV Cross Validation

DBN Deep Belief Network

DCGAN deep convolutional generative adversarial networks

DNN Deep Neural Network

DT Decision tree

EBM Energy Based Model

ESP Enforced SubPopulations

ELU Exponential Linear Unit

ERF Extremly random forest

GAN Generative Adversarial Network

GBM Gradient Boosting Machine

GMM Gaussian Mixture Model

GRU Gated Recurrent Unit GRU

HMM Hidden Markov Model

NB Naive Bayes

NN Neural Network

KPCA Kernel Principal Component Analysis

KSVM Kernel Support Vector Machine

GA Genetic algorithm GA

HTM Heirarchal temporal memory

HMM Hidden Markov Model

HAM Hierarchical Attentive Memory

KNN k-Nearest Neighbors

LOOCV Leave one out cross validation

LReLU Leaky ReLU

LTU Linear Threshold Unit

LSTM Long Short Term memory

MCMC Markov chain Monte Carlo

MDP Markov Decision Processes

ML Machine Learning

MLP Multi-layer Perceptrons

NLP Natural Language Processing

NTM Neural Turing Machine

NEAT NeuroEvolution of Augmenting Topologies

OLS Ordinary Least Squares Regression

PReLU Paramaterized ReLU

OCR Optical Character Recognition.

PCA Principal Component Analysis.

PAC-MDP Probably Approximately Correct in Markov Decision Processes

RTRL Real Time Recurrent Learning

ReLU Rectified Linear Unit

RNN Recurrent Neural Network

RNTNRecursive Neural Tensor Network

RL Reinforcement Learning

RVM Relevance Vector Machine

ResNet Residual Neural Network

RF Random Forest

RBM Restricted Boltzmann Machines

SIFT Scale-Invariant Feature Transform

SRN Simple Recurrent Network

SVD singular value decomposition

SGD Stochastic Gradient Descent

SVM Support Vector Machine

SANE Symbiotic Adaptive NeuroEvolution

TDA Topological Data Analysis

TF TensorFlow

TFIDF Term Frequency Inverse Document Frequency

VLAD Vector of Locally Aggregated Descriptors

WFST Weighted Finite-State Transducers

Updating Manjaro

A short while ago, I decided to jump from Ubuntu to Manjaro. Ubuntu kept showing me textboxes like “an issue has been detected, close or report ?”, was taking a huge amount of resources and most of the system updates did not solve these. Time for a change !

What is Manjaro ?

It is a distribution of Linux based on the Arch Linux distribution. Manjaro Linux has a focus on user friendliness and accessibility and the system itself is designed to work fully ‘straight out of the box’ with its variety of pre-installed software.

If you are a big fan of comparisons, which I will not detail here, you can find some on slant.co or in french, on citizenz or a top 7 reasons why….

Time for a system update on Manjaro

What could possibly go wrong on a system based on user-friendliness ? Well, the system update can be a pain. As here or here or if you installed VLC/

[user@user-pc ~]$ sudo pacman -Syu
[sudo] password for user: 
:: Synchronizing package databases...
core
extra
community
multilib

[...]

error: unresolvable package conflicts detected
error: failed to prepare transaction (conflicting dependencies)
:: python-nautilus and python2-nautilus are in conflict

Removing packages

To remove a single package, leaving all of its dependencies installed:

pacman -R package_name

To remove a package and its dependencies which are not required by any other installed package:

pacman -Rs package_name

Let’s try getting rid of python2-nautilus. Just in case you wonder what it does.

[user@user-pc ~]$ sudo pacman -Rs python2-nautilus
checking dependencies...
error: failed to prepare transaction (could not satisfy dependencies)
:: nautilus-admin: removing python2-nautilus breaks dependency 'python-nautilus'

Not exactly what I needed… It was actually called nautilus-admin.

[user@user-pc ~]$ sudo pacman -Rs nautilus-admin
checking dependencies...

Packages (3) python2-gobject-3.26.1-1  python2-nautilus-1.1-4  nautilus-admin-1.1.1-1

Total Removed Size:  1.27 MiB

:: Do you want to remove these packages? [Y/n] Y
:: Processing package changes...
(1/3) removing nautilus-admin
(2/3) removing python2-nautilus
(3/3) removing python2-gobject
:: Running post-transaction hooks...
(1/1) Arming ConditionNeedsUpdate...

Let’s try the update once again ! Pay attention not to answer yes to every question…

[user@user-pc ~]$ sudo pacman -Syu

[...]

:: Starting full system upgrade...
:: Replace compositeproto with extra/xorgproto? [Y/n] Y
:: Replace damageproto with extra/xorgproto? [Y/n] y
:: Replace fixesproto with extra/xorgproto? [Y/n] y
:: Replace fontsproto with extra/xorgproto? [Y/n] y
:: Replace gnome-themes-standard with extra/gnome-themes-extra? [Y/n] y
:: Replace gnome-tweak-tool with extra/gnome-tweaks? [Y/n] y
:: Replace inputproto with extra/xorgproto? [Y/n] y
:: Replace kbproto with extra/xorgproto? [Y/n] y
:: Replace manjaro-gnome-extension-settings with community/manjaro-gnome-extension-settings-17.0? [Y/n] n
:: Replace manjaro-gnome-extension-settings with community/manjaro-gnome-extension-settings-18.0? [Y/n] y
:: Replace manjaro-gnome-settings with community/manjaro-gnome-settings-17.0? [Y/n] n
:: Replace manjaro-gnome-settings with community/manjaro-gnome-settings-18.0? [Y/n] y
:: Replace pkg-config with core/pkgconf? [Y/n] y
:: Replace randrproto with extra/xorgproto? [Y/n] y
:: Replace recordproto with extra/xorgproto? [Y/n] y
:: Replace renderproto with extra/xorgproto? [Y/n] y
:: Replace scrnsaverproto with extra/xorgproto? [Y/n] y
:: Replace videoproto with extra/xorgproto? [Y/n] y
:: Replace xextproto with extra/xorgproto? [Y/n] y
:: Replace xf86vidmodeproto with extra/xorgproto? [Y/n] y
:: Replace xineramaproto with extra/xorgproto? [Y/n] y
:: Replace xproto with extra/xorgproto? [Y/n] y

The following can take a while:

zenity-3.28.1-1-x86_64     3.8 MiB   586K/s 00:07 [######################] 100% 
pipewire-0.1.9-3-x86_64 1143.8 KiB   397K/s 00:03 [######################] 100% 
mutter-3.28.2-1-x86_64     2.2 MiB   342K/s 00:07 [######################] 100%

You thought it was done ? No. Too bad.

(729/729) checking keys in keyring                                                                                                                             [##################################################################################################] 100%
(729/729) checking package integrity                                                                                                                           [##################################################################################################] 100%
(729/729) loading package files                                                                                                                                [##################################################################################################] 100%
(729/729) checking for file conflicts                                                                                                                          [##################################################################################################] 100%
error: failed to commit transaction (conflicting files)
python-pip: /usr/lib/python3.6/site-packages/pip/_internal/__init__.py exists in filesystem

[...]

python-pip: /usr/lib/python3.6/site-packages/pip/_internal/vcs/mercurial.py exists in filesystem
python-pip: /usr/lib/python3.6/site-packages/pip/_internal/vcs/subversion.py exists in filesystem
python-pip: /usr/lib/python3.6/site-packages/pip/_internal/wheel.py exists in filesystem
Errors occurred, no packages were upgraded.

There is an issue with pip now… Using :

[user@user-pc ~]$ sudo pacman -S python-pip --force

Enables to update Python. And now, the following should work!

[user@user-pc ~]$ sudo pacman -Syu  

Hope this helped :)

Packages for machine learning

I hope to provide more packages and more informations to this list from times to times. If you have some specific questions regarding a package or have some recommendations, feel free to leave a comment, I will have a look!

Machine learning means so many possible tasks, comes with so many packages and tools that it is hard to have an idea of which one to use. I am just listing the one I find really useful. They are not better than the packages I do not use and I cannot guarantee they are absolutely bug free, but they are robust enough to work with!

Linux tools

A good terminal will be your best friend.

i="1"
for filename in ./pdfs/*.pdf; do
  i="$((i+1))"
  if [ "$i" -gt 20 ]; then
    break
  fi
  echo "Processing $filename file..."
  pdf2txt.py -m 2 "$filename" >> "txts/$(basename "$filename" .pdf).txt"
done
How amazing is that ?

tabview

Probably the best tool to navigate through a CSV file, in the terminal. It is really light, fast, supports many VIM commands. Here is the repo.

csvkit

Install it from pip. It comes with a lot of handy tools:

  • csvstat

  • csvlook though I prefer tabview, csvlook my_data.csv > my_data.md allows to display a csv file in markdown.

Combined with head you can navigate through various files really fast. There actually is whole website dedicated to this.

glances

This allows you to see how busy your machine is when running an algorithm.

PDF to text files

pdf2txt.py

Useful to extract text from pdf files. There is no perfect solution (as far as I know) for this task, but this one is a good starting point.

Tesseract (and ImageMagick)

Another approach to extracting text from pdf files is using OCR (Optical Character Recognition). Tesseract does a great job but importing pdf directly can lead to errors. However, ImageMagick does a great job at turning pdfs to pngs.

echo "Processing $filename file..."
convert -density 300 "$filename[0-1]" -quality 90 "output.png"
tesseract -l fra "output-0.png" "output-0"
tesseract -l fra "output-1.png" "output-1"
cat "output-0.txt" "output-1.txt" > "ocr$(basename "$filename" .pdf).txt"
rm "output-0.txt" "output-1.txt" "output-0.png" "output-1.png"

R

When installing R, and it happened to me many times, I love running the following script. It feels like coming home. I will not go through all the details of each package, it is just that it will be useful for me to have this code around :)


load.lib<-function(libT,l=NULL)
{
  lib.loc <- l
    print(lib.loc)

    if (length(which(installed.packages(lib.loc=lib.loc)[,1]==libT))==0)
    {
      install.packages(libT, lib=lib.loc,repos='http://cran.us.r-project.org')
    }
}

data_reading_libraries <- c(
    "readr",
    "readstata13"
    )

machine_learning_libraries <- c(
    "xgboost",
    "glmnet",
    "randomForest",
    "Rtsne",
    "e1071"
    )

data_libraries <- c(
    "data.table",
    "FeatureHashing",
    "dplyr",
    "Matrix"
    )

string_libraries <- c(
    "stringdist",
    "tm"
    )

plot_libraries <- c(
    "ggplot2",
    "RColorBrewer",
    "fields",
    "akima"
    )

favorite_libs <- c(data_reading_libraries,
    machine_learning_libraries,
    data_libraries,
    string_libraries,
    plot_libraries)

  for(lib in favorite_libs){load.lib(lib)}

General stuff

Reading data

readr

If you have been using the default csv reader in R read.csv, you must be familiar with its drawbacks : slow, many parameters, a parser which sometimes fails… readr on the other hand is super fast, robust and comes with read_csv and read_csv2 depending on the csv standard your file relies on. (The good thing with standard being that there always are many versions of them…)

XML

It allows to read XML files (obviously) but also HTML tables (yes, some people actually use this to transfer data, though it makes the whole file much bigger because of so many HTML tags…)

Machine learning libraries

glmnet

A library that enables to perform elastic net regressions. Has a cross validation method which enjoys nice properties of the path of optimization, which allows to evaluate a path of solutions as fast as a single model.

randomForest

The standard if you want to use random forests with R. Link to the cran page

e1071

I tried its “competitor” (kernlab), but prefered this one.

Rtsne

Wrapper for the C++ implementation of Barnes-Hut t-Distributed Stochastic Neighbor Embedding. Was the fastest tSNE implementation when I tried them.

Data viz

coefplot

Visualizing linear regressions is now simple.

Coefplot

require(coefplot)

N <- 10
P <- 3

X <- matrix(rnorm(n = N*P), nrow = N)

w <- rnorm(P)

Y <- X %*% w + rnorm(P)

my_data <- cbind.data.frame(Y,X)
colnames(my_data) <- c("Y",paste0("Var",1:3))
model <- lm(Y ~ ., data = my_data)

coefplot(model)

corplot

A matrix of correlation can be quite ugly. This one just makes it easier to read, with colors…

forestFloor

Wouldn’t it be great to have something that tells you a little bit more about your random forests models ? This package can.

Python

General stuff

tqdm

tqdm is one of the most useful package I discovered. Though it actually does not perform any operation or handles your dataframes smartly, it shows a progress bar for the loops you want it to. Still not convinced ? With it, you can keep track on every feature engineering job you launch, see which ones are never going to end without the pain of writing all these bars yourself.

pandas

The industry standard for dataframes.

numpy

The industry standard for numeric operations

csv

Easy manipulation of csv files. The method DictReader is particularly useful when one needs to stream from a csv file.

unicodecsv

import unicodecsv as csv

Solves so many issues.

Machine learning libraries

sk-learn

A collection of robust and well optimized methods. A must have.

xgboost, catboost, gbm light

Libraries dedicated to gradient boosting. Amazing performances, amazingly robust.

– layout: post title: “R Kernlab (caret) VS e1071” date: 2018-06-10 14:05:43 +0100 categories: statistics tags: R svm image: feature: kernlab-vs-e1071.jpg comments: true —

A little bit of context

After seeing plenty of “Python VS Java for data science”, questions that seemed ways to broad, I decided to focus on various benchmarks for different tasks. Today will be about Support Vector Machines, in R.

There are two main packages for SVMs in R : kernlab and e1071.

While kernlab implements kernel-based machine learning methods for classification, regression, clustering, e1071 seems to tackle various problems like support vector machines, shortest path computation, bagged clustering, naive Bayes classifier.

Difference between the SVM implementations

Proposed kernels

As kernlab focuses on kernel methods, many of them are implemented:

  • rbfdot Radial Basis kernel function “Gaussian”
  • polydot Polynomial kernel function
  • vanilladot Linear kernel function
  • tanhdot Hyperbolic tangent kernel function
  • laplacedot Laplacian kernel function
  • besseldot Bessel kernel function
  • anovadot ANOVA RBF kernel function
  • splinedot Spline kernel
  • stringdot String kernel

While e1071 proposes the following:

  • linear
  • radial basis
  • polynomial
  • sigmoid

Solver

e1071 relies on libsvm, which last update was released on December 22, 2016.

On the other hand, ksvm uses John Platt’s SMO algorithm for solving the SVM QP problem an most SVM formulations.

These may have an impact on the training time of these models (hopefully, the solutions should be the same).

Testing it

Unfortunately, the names of the parameters are quite different between the two libaries are not exactly the same.

This has been an issue for some users.

The radial basis kernel in e1071 is defined as \(k(u,v) = \exp(-\gamma \Vert u-v \Vert ^2)\) and as \(k(x,x") = \exp(-\sigma \Vert x-x" \Vert ^2)\). This is good. However, note that the \(C\) (cost) parameters are called C and cost.

Besides, there are many models which are available (epsilon regressions, nu regressions…) and the default behavior is not always obvious.

At last, there have been many heuristics developed to chose the best “bandwith” (referred to as \(\gamma\) and \(\sigma\) depending on the package), and the proposed heuristics are not always the same. The code below makes sure the methods match when enough parameters are provided.

require("kernlab")
require("e1071")

N <- 1000
P <- 20
noise <- 3

X <- as.matrix(matrix(rnorm(N * P), nrow = N))
Y <- as.vector(X[, 1] + X[, 2] * X[, 2] + noise * rnorm(P))

model_kernlab <-
  kernlab::ksvm(
      x = X,
      y = Y,
      scaled = TRUE,
      C = 5,
      kernel = "rbfdot",
      kpar = list(sigma = 1),
      type = "eps-svr",
      epsilon = 0.1
      )

model_e1071 <- e1071::svm(x = X,
      y = Y,
      cost = 5,
      scale = TRUE, 
      kernel = "radial",
      gamma = 1,
      type = "eps-regression",
      epsilon = 0.1)

Let”s make sure the model match.

> mean(abs(
+   predict(object = model_e1071, newdata = X) - predict(object = model_kernlab, newdata = X)
+ ))
[1] 1.254188e-14
> 
> sum(abs(predict(object = model_e1071, newdata = X) - Y))
[1] 380.0338
> sum(abs(predict(object = model_kernlab, newdata = X) - Y))
[1] 380.0338

Benchmarking

Results

Now that we know the algorithms propose the same results, we can (safely) compare the time of execution.

Comparison with 20 features

```e1071``` is in red

Comparison with 200 features

Given the success of libsvm, I expected e1071 to be faster than kernlab. The heuristics implementend to optimize the quadratic form with its constraints are not the same, (see [1] and [2]) and they may actually lead to different results on other data sets.

Learning more

[1] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 1–27, Apr. 2011.

[2] J. C. Platt, “Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines,” p. 21.

Applied predictive modelling is also good introduction to predictive modelling, with R.

The elements of statistical learning by Trevor Hastie, Robert Tibshirani, Jerome Friedman

Code

If you want to reproduce the results, the whole code is below:


require("kernlab")
require("e1071")

fit_e1071 <- function(X, Y) {
  e1071::svm(
    x = X,
    y = Y,
    cost = 5,
    scale = TRUE,
    kernel = "radial",
    gamma = 1,
    type = "eps-regression",
    epsilon = 0.1
  )
}

fit_kernlab <- function(X, Y) {
  kernlab::ksvm(
    x = X,
    y = Y,
    scaled = TRUE,
    C = 5,
    kernel = "rbfdot",
    kpar = list(sigma = 1),
    type = "eps-svr",
    epsilon = 0.1
  )
}

time_e1071 <- c()
time_kernlab <- c()

steps <- c(10, 20, 50, 100, 200, 500, 1000, 2000)

for (n in steps)
{
  noise <- 3
  P <- 20
  
  X <- as.matrix(matrix(rnorm(n * P), nrow = n))
  Y <- as.vector(X[, 1] + X[, 2] * X[, 2] + noise * rnorm(n))
  
  time_e1071 <- c(time_e1071, system.time(fit_e1071(X, Y))[1])
  time_kernlab <- c(time_kernlab, system.time(fit_kernlab(X, Y))[1])
}

plot(
  steps,
  time_e1071,
  type = "l",
  col = "red",
  xlab = "n",
  ylab = "Time",
  main = paste0("Execution time for ", P, " features"))
lines(steps, time_kernlab)
OCaml vs PyPy

Introduction

Following the article about machine learning and ocaml I present another learning algorithm in OCaml and its comparison to a Python (compiled with pypy) equivalent.

First of all, PyPy is a fast, compliant alternative implementation of the Python language (2.7.13 and 3.5.3). As fast, it means orders of magnitude faster than a simple Python execution. As many libraries are not (directly) compatible with it, it remains quite unused. However, when using the default libraries, PyPy works like a charm.

Online logistic regression

A common algorithm, often met on Kaggle, is the online logistic regression. Note that the data used will as well come from the Avazu CTR competition. To make the experience shorter, I will only work with the first 1 000 000 lines of the train.csv file (later referred to as train_small.csv).

This algorithm allows to train data in streaming which means that one does not need to load all the data in memory (which is sometimes infeasible). These methods are usually quite fast, and being simple to implement, this allows a lot of tuning and feature engineering on the fly.

On top of that, it automatically “one-hot encodes” the data. Which means that even if the number of categories per variable is important (which results in an underlying high dimensional space), this will not be an issue for this method (and the memory used by this program will remain constant, randomly projecting some components).

The Python code

I do not take any credit for the awesomeness of this code. I simply paste it for the experiment to be easier to reproduce. Some modifications have been performed to make it more comparable to the ocaml implementation (most of them are commented).

'''
           DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
                   Version 2, December 2004

Copyright (C) 2004 Sam Hocevar <sam@hocevar.net>

Everyone is permitted to copy and distribute verbatim or modified
copies of this license document, and changing it is allowed as long
as the name is changed.

           DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
  TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

 0. You just DO WHAT THE FUCK YOU WANT TO.
'''


from datetime import datetime
from csv import DictReader
from math import exp, log, sqrt


##############################################################################
# parameters #################################################################
##############################################################################

# A, paths
train = 'train_small.csv'               # path to training file
# B, model
alpha = .01  # learning rate
# C, feature/hash trick
D = 2 ** 20              # number of weights to use
do_interactions = False  # whether to enable poly2 feature interactions

# D, training/validation
epoch = 1      # learn training data for N passes
holdout = 100  # use every N training instance for holdout validation


##############################################################################
# class, function, generator definitions #####################################
##############################################################################

class logistic_regression(object):
    ''' Classical logistic regression
    
        This class (algorithm) is not used in this code, it is putted here
        for a quick reference in hope to make the following (more complex)
        algorithm more understandable.
    '''

    def __init__(self, alpha, D, interaction=False):
        # parameters
        self.alpha = alpha

        # model
        self.w = [0.] * D

    def predict(self, x):
        # parameters
        alpha = self.alpha

        # model
        w = self.w

        # wTx is the inner product of w and x
        wTx = sum(w[i] for i in x)

        # bounded sigmoid function, this is the probability of being clicked
        return 1. / (1. + exp(-max(min(wTx, 35.), -35.)))

    def update(self, x, p, y):
        # parameter
        alpha = self.alpha

        # model
        w = self.w

        # gradient under logloss
        g = p - y

        # update w
        for i in x:
            w[i] -= g * alpha


def logloss(p, y):
    ''' FUNCTION: Bounded logloss

        INPUT:
            p: our prediction
            y: real answer

        OUTPUT:
            logarithmic loss of p given y
    '''

    # p = max(min(p, 1. - 10e-15), 10e-15)
    return -log(p) if y == 1. else -log(1. - p)


def data(path, D):
    ''' GENERATOR: Apply hash-trick to the original csv row
                   and for simplicity, we one-hot-encode everything

        INPUT:
            path: path to training or testing file
            D: the max index that we can hash to

        YIELDS:
            ID: id of the instance, mainly useless
            x: a list of hashed and one-hot-encoded 'indices'
               we only need the index since all values are either 0 or 1
            y: y = 1 if we have a click, else we have y = 0
    '''

    for t, row in enumerate(DictReader(open(path))):
        # process id
        # ID = row['id'] -> not needed
        del row['id']

        # process clicks
        y = 0.
        #if 'click' in row: -> always true in the train set
        if row['click'] == '1':
            y = 1.
        del row['click']

        # turn hour really into hour, it was originally YYMMDDHH
        #row['hour'] = row['hour'][6:]

        # build x
        x = [0]  # 0 is the index of the bias term
        #for key in sorted(row):  # sort is for preserving feature ordering
        for key in row:
	    value = row[key]

            # one-hot encode everything with hash trick
            index = abs(hash(key + '_' + value)) % D
            x.append(index)

        yield t, x, y


##############################################################################
# start training #############################################################
##############################################################################

start = datetime.now()

# initialize ourselves a learner
learner = logistic_regression(alpha, D)

# start training
for e in xrange(epoch):
    loss = 0.
    count = 0

    for t, x, y in data(train, D):  # data is a generator
        #  t: just a instance counter
        # ID: id provided in original data
        #  x: features
        #  y: label (click)

        # step 1, get prediction from learner
        p = learner.predict(x)

        if t % holdout == 0:
            # step 2-1, calculate holdout validation loss
            #           we do not train with the holdout data so that our
            #           validation loss is an accurate estimation of
            #           the out-of-sample error
            loss += logloss(p, y)
            count += 1
        else:
            # step 2-2, update learner with label (click) information
            learner.update(x, p, y)

        if t % 100000 == 0 and t > 1:
            print('encountered: %d\tcurrent logloss: %f' % (
                t, loss/count))


OCaml implementation

open Maths
open Read_tools


let get_indices dict n = Hashtbl.fold (fun k v acc -> ((Hashtbl.hash (k^" "^v)) mod n)  :: acc) dict [] 

let predict indices weights = sigmoid (dot_product indices weights) 

let rec update indices weights p y alpha =  match indices with
	| [] -> ()
	| h::tail -> weights.(h) <- (weights.(h) -. (p -. y) *. alpha) ; update tail weights p y alpha 
	

let n = pow 2 20 
let weights = Array.make n 0. 
let dict_stream = dict_reader "train_small.csv" 
let updater indices weights p y = update indices weights p y 0.01 
let refresh_loss = 100000 

let train weights updater dict_stream = 
	let rec aux weights updater dict_stream t loss n = match (try Some( Stream.next dict_stream) 
								  with _ -> None) with
	| Some dict -> Hashtbl.remove dict "id"; 
			  let y = float_of_string (Hashtbl.find dict "click") in
			  Hashtbl.remove dict "click";
			  let indices = get_indices dict n in
			  let p = predict indices weights in
			  updater indices weights p y;
			  
			  if ((t mod refresh_loss) == 0) && t > 0 then begin 
          print_string "encountered: ";
          print_int t;
          print_string "\t logloss:";
          print_float (loss /. float_of_int(t)); 
          print_endline " "; 
			  end;
			  
			  aux weights updater dict_stream (t + 1) (loss +. (log_loss p y)) n
			  
  | None -> () in aux weights updater dict_stream 0 0. (Array.length weights) ;;	

train weights updater dict_stream;

Where the maths.ml contains few math functions.

let dot_product indices weights =
    let rec aux indices weights acc = 
		match indices with
		| [] -> acc
		| h::tail -> aux tail weights (acc +. weights.(h)) in
	aux indices weights 0.
	
let sigmoid x = 1. /. (1. +. exp(0. -. (max (min x 35.) (-35.))))

let log_loss p y = match y with 1. -> -. log(p) | _ -> -. log(1. -. p)

let rec pow a = function
  | 0 -> 1
  | 1 -> a
  | n -> 
    let b = pow a (n / 2) in
    b * b * (if n mod 2 == 0 then 1 else a) 

And read_tools.ml implements various tools to stream from a csv file.

open Str

let csv_separator = ","

let err_lists_sizes = "Incompatible lists size"

let line_stream_of_channel channel =
    Stream.from (fun _ -> try Some (input_line channel) with End_of_file -> None)
	
let read_lines file_path = line_stream_of_channel (open_in file_path)
	
let read_first_line file_path = Stream.next (read_lines file_path)
	
let split_line line = Str.split (Str.regexp csv_separator) line

let concat_elements list1 list2 = 
	let rec aux list1 list2 acc = match list1,list2 with
	| [],[] -> acc
	| a,[] -> failwith err_lists_sizes
	| [],a -> failwith err_lists_sizes
	| h1::t1,h2::t2 -> aux t1 t2 ((h1^h2)::acc) in List.rev (aux list1 list2 [])
	
let to_dict list1 list2 =
	let rec aux list1 list2 my_hash = match list1,list2 with
	| [],[] -> my_hash
	| a,[] -> failwith err_lists_sizes
	| [],a -> failwith err_lists_sizes
	| h1::t1,h2::t2 -> Hashtbl.add my_hash h1 h2; aux t1 t2 my_hash in aux list1 list2 (Hashtbl.create 15)
	
let dict_reader file_path = 
	let line_stream = read_lines file_path in
	let header = split_line (Stream.next line_stream) in
	Stream.from
      (fun _ ->
         try Some (to_dict header (split_line (Stream.next line_stream))) with End_of_file -> None)

Results

$ time pypy main.py 
encountered: 100000 current logloss: 0.435212
encountered: 200000 current logloss: 0.433022
encountered: 300000 current logloss: 0.416148
encountered: 400000 current logloss: 0.411722
encountered: 500000 current logloss: 0.403127
encountered: 600000 current logloss: 0.405477
encountered: 700000 current logloss: 0.405665
encountered: 800000 current logloss: 0.400225
encountered: 900000 current logloss: 0.397522

real  0m24.649s
user  0m23.964s
sys 0m0.540s
$ ocamlfind opt -linkpkg str.cmxa maths.ml read_tools.ml main.ml -o main.byte
$ time ./main.byte 
encountered: 100000  logloss:0.422601941149 
encountered: 200000  logloss:0.418270062183 
encountered: 300000  logloss:0.409809392923 
encountered: 400000  logloss:0.40026898568 
encountered: 500000  logloss:0.396667944914 
encountered: 600000  logloss:0.398242961764 
encountered: 700000  logloss:0.397423028349 
encountered: 800000  logloss:0.394274547682 
encountered: 900000  logloss:0.391874034523 

real  0m23.187s
user  0m22.980s
sys 0m0.176s

The logloss seems lower with the OCaml code. Don’t be fooled by this. This stems from the fact that Python and OCaml have different hashing functions, therefore, the collisions happening between the features (with one hot encoding) are not the same from one source to the other. As for the training time, there is no sensible difference ! I guess that concludes the story… For this problem, just pick the language you feel most comfortable with :)

Improvements

In terms of code

Actually, OCaml as well has a csv library, available with opam.

In terms of accuracy

There are many variations of this algorithm. The “most famous” being the FTRL

Learning more

Foundations of Machine Learning by Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar and Francis Bach provides a theoretical framework to various machine learning algorithms and a detailed implementation of some of them. The section about streaming algorithms is particularly useful in the context of this article.

Real World OCaml: Functional programming for the masses by Yaron Minsky, Anil Madhavapeddy and Jason Hickey is a good introduction to OCaml.

Implementing k Nearest Neighbours in OCaml

Introduction

Here and there, people wonder about the possibility to use functionnal programming to approach machine learning. I decided to give it a try with some learning algorithms and noted that there are actually various options to use external libraries (unfortunately, nothing getting close to scikit learn’s maturity)

A quick reminder about the k-nearest neighbours. It was first described in the early 1950s and is often referred to as a “lazy learner”, as it merely stores the data, waiting to be provided with points to classify.

Then, in a metric space \((X,d)\) given a point \(y \in X\), and \(n\) labelled points \(S = (x_i,l_i) \in (\mathbb{R} \times \{0,1\}) ^ n\) it will return the most common labels among the \(k\) closest point to \(y\). Simple, isn’t it ? Actually, one does not need more information to implement it. So let’s get started.

The expressivity of ocaml

For fun, let’s see how easy it is to implement a k-nearest neighbours in ocaml. Note that we only need to retrieve the closest points from one point in an array of points. The method find_nearest_neighbours does this. Note how generic it is : the point can have any type (float array, string list…) as long as the distance operates on this type. Think about all the templates that should be written in other languages. And the compiler will tell me if types are incompatible (when Python would wait until an error appears).

(* Returns the k smallest elements of an array *)
let get_smallest_elements_i input_array k = 
  let n = Array.length input_array in
  let indices = Array.init n (fun x -> x) in
  for i = 0 to (k-1) do
    for j = (n-1) downto 1 do
      if input_array.(indices.(j-1)) > input_array.(indices.(j)) then begin
        let b = indices.(j-1) in
        indices.(j-1) <- indices.(j);
        indices.(j) <- b;
      end
    done;
  done;
  Array.sub indices 0 k

(* Returns the k closest points from current_point in all_points *)
let find_nearest_neighbours current_point all_points k distance = 
  let distances = Array.map (fun x -> distance x current_point) all_points in
  get_smallest_elements_i distances k

(* Returns the most common labels among the neihbours *)
let predict nearest_neighbours labels = 
  let sum a b = a +. b in
  let k = Array.length nearest_neighbours in
  if Array.fold_left sum 0. (Array.init k (fun i -> labels.(nearest_neighbours.(i)))) > 0. then 1. else ~-.1.

Now we need a dataset to try the algorithm. Nothing really funny there.

(* Toy data *)
let max_length = 1.

let chessboard_boundary x y = if ((mod_float x  0.5) -. 0.25) *. ((mod_float y 0.5) -. 0.25) > 0. then 1. else ~-.1.

let circle_boundary x y = if (x**2. +. y**2.) > 0.5 then 1. else ~-.1.

let unlabelled_boundary x y = 2. ;;

(* Given a decision boundary, returns a data set and the associated labels *)
let make_data n_points decision_boundary =
  let output_data = Array.init n_points (fun _ -> (Array.make 2 0.)) in
  let output_label = Array.make n_points 0. in
  for i = 0 to (n_points-1) do
    output_data.(i).(0) <- Random.float max_length;
    output_data.(i).(1) <- Random.float max_length;
    output_label.(i) <- decision_boundary output_data.(i).(0) output_data.(i).(1)
  done;
  output_data, output_label

Now that we defined the points as arrays of floats, we need to implement distances on it.

let sum a b = a +. b in

(* Usual Euclide Distance *)
let euclide_distance x y =
  let squares_diff = Array.init (Array.length x) (fun i -> (x.(i) -. y.(i))**2.) in
  Array.fold_left sum 0. squares_diff

let manhattan_distance x y =
  let squares_diff = Array.init (Array.length x) (fun i -> abs (x.(i) -. y.(i)) ) in
  Array.fold_left sum 0. squares_diff

Gluing up all the pieces together :

open Knn
open Distances
open ToyDataset

(* Number of points in the training set*)
let n_points = int_of_string Sys.argv.(1)  ;; 

(* Parameter k of the kNN algorithm *)
let k = int_of_string(Sys.argv.(2)) ;;

(* Number of points in the training set *)
let n_test_points = 50 ;;

(* Train and test data*)
let train_data, labels = make_data n_points circle_boundary;;
let test_data, pseudo_labels = make_data n_test_points unlabelled_boundary ;;

(* For each point in the test set, stores the indices of the nearest neighbours *)
let nearest_neighbours = Array.map (fun x -> find_nearest_neighbours x train_data k euclide_distance) test_data;;

(* Evaluates and prints the accuracy of the model *)
let mismatches = ref 0. ;;

for l = 0 to (n_test_points-1) do
  pseudo_labels.(l) <- predict nearest_neighbours.(l) labels ; 
  if pseudo_labels.(l) <> (circle_boundary test_data.(l).(0) test_data.(l).(1)) then (mismatches := !mismatches +. 1.) else (); 
done;

print_string ("Error rate : "^string_of_float(100. *. !mismatches /. (float_of_int n_test_points))^"%\n");

Now I recommend using ocamlbuild. It will save you loads of time. Especially with large projects. Assuming the latest part is called main.ml simply enter this in the terminal:

me$ ls
distances.ml  knn.ml  main.ml  toyDataset.ml

me$ ocamlbuild main.byte
Finished, 9 targets (1 cached) in 00:00:00.

Now, you just have to call the produced byte file with the first argument being the number of points to generate and the second one, the parameter \(k\).

me$ ./main.byte 100 5
Error rate : 4.%
me$ ./main.byte 1000 5
Error rate : 2.%
me$ ./main.byte 3000 5
Error rate : 0.%

What about performance ?

I leave this to another post : pypy vs ocaml for streaming learning, coming soon :)

More about knn

If you are interested in this method and further developments, you may find the following articles interesting:

[1]S. Cost and S. Salzberg, “A weighted nearest neighbor algorithm for learning with symbolic features,” Machine Learning, vol. 10, no. 1, pp. 57–78, Jan. 1993.

[2]J. Wang, P. Neskovic, and L. N. Cooper, “Improving nearest neighbor rule with a simple adaptive distance measure,” Pattern Recognition Letters, vol. 28, no. 2, pp. 207–213, Jan. 2007.

[3]K. Yu, L. Ji, and X. Zhang, “Kernel Nearest-Neighbor Algorithm,” Neural Processing Letters, vol. 15, no. 2, pp. 147–156, Apr. 2002.

Speeding up cross-validation

Generic cross-validation

Definitions

k-fold cross-validation

Cross-validation is a process that enables to estimate the out-of-sample performance of a model. There exist many types of cross-validation, but the most common method consists in splitting the training-set in \(k\) “folds” (\(k\) samples of approximately \(n/k\) lines) and train the model \(k\)-times, each time over samples of \(n-n/k\) points. The prediction error is then measured on the predictions of the remaining \(n/k\) points.

The cross-validation process

Illustration of the cross-validation. From Wikipedia.

Leave one out cross-validation (LOOCV)

The leave one out cross-validation is a specialization of the \(k\)-fold cross-validation, with \(k=n\). For each point, you train the model over the \(n-1\) other points, predict the label (or value, in the case of a regression problem) and average the errors.

Hyperparameters tuning

The reason to perform cross-validation is to enable the tuning of hyperparameters (or doing feature selection, but I will focus on hyperparameters tuning). They are parameters that are not directly infered from the data (this definition is quite lose). Per example, in the case of a linear regression, the coefficients of the model are directly infered. In the case of a penalized regression, the penalty parameter is not infered during the training procedure (if there is a way to select it, please let me know in the comments), and the usual way to chose the best penalty parameter is by cross validating many models, with different penalty parameters and chosing the one which achieves the best score (the highest accuracy, the lowest error…).

This is where cross-validation becomes costly : suppose you want to try \(10\) values of your hyper-parameter (this can be much higher), and evaluate the performance of each hyperparameter with a \(5\) fold cross-validation. This will be \(10 \cdot 5 = 50\) times longer than training a single model !

Usually, there is not one single hyper parameter. In the case of the Support Vector Machines with Radial Basis Functions, there are usually two of them which are tuned \(C\) and \(\gamma\). Trying \(10\) values for each hyperparameter now leads to \(100\) models to train (which is in turn multiplied by the number of folds).

This explains the need for smarter approaches to cross-validations.

Generic implementation

Cross-validation is a long process. And it needs to be repeated. Many times. Maybe even more when the number of parameters is large. The leave-one-out cross-validation is the longest one, as you have to train the model as many times as you have data points.

You have two options. Find the best implementation of your algorithm given your problem. Buy more computation power.

Assuming you have already achieved the first step and that the second one is not really satisfactory, there is another option. Change your cross-validation method! Though it needs some efforts (the usual cross-validation pipelines/loops have to be specialized), it will be fun!

Following the article about time complexity and enjoying the fact that, when you train a model on a specific fold during a cross-validation, you may reuse part of your calculations, I will present some tips to make this cross-validation faster.

As always, this improvement has a price : genericity. When using scikit learn, the models have similar signatures. The cross-validation procedure has to be written only once and works for every model. See per example the code below.

import pandas as pd
import numpy as np

from time import time
from sklearn import cross_validation

class Stacker:

    def __init__(self, penalty, n_folds, verbose=True, random_state=1):
        self._penalty = penalty
        self._n_folds = n_folds
        self._verbose = verbose
        self._random_state = random_state

    def Run(self, X, y, model, predict_method):
        kf = cross_validation.KFold(
            y.shape[0], n_folds=self._n_folds, shuffle=True, random_state=self._random_state)
        trscores, cvscores, times = [], [], []
        i = 0
        stack_train = np.zeros(len(y))  # stacked predictions
        for i, (train_fold, validation_fold) in enumerate(kf):
            i = i + 1
            t = time()
            model.fit(X.iloc[train_fold], y.iloc[train_fold])

            tr_pred = predict_method(model, X.iloc[train_fold])

            trscore = self._penalty(y.iloc[train_fold], tr_pred)

            validation_prediction = predict_method(
                model, X.iloc[validation_fold])
            cvscore = self._penalty(
                y.iloc[validation_fold], validation_prediction)

            trscores.append(trscore)
            cvscores.append(cvscore)
            times.append(time() - t)

            stack_train[validation_fold] = validation_prediction

        if self._verbose:
            print("TRAIN %.5f | TEST %.5f | TEST-STD %5f | TIME %.2fm (1-fold)" %
                  (np.mean(trscores), np.mean(cvscores), np.std(cvscores), np.mean(times) / 60))
            print(model.get_params(deep=True))
            print("\n")

        return np.mean(cvscores), stack_train

Not only it performs a cross-validation, but keeps the out-of-sample predictions for each fold. This allows to do model stacking, another topic I will probably discuss in antoher post.

As you see in this code, there is no information regarding the model needed. It will just take a model, that has a fit method, and predict it (the predict method must be passed as a function, it allows post processing of the predictions, per example).

Tailor made implementations

Linear regression

Here, the specialization works so well that one can even use the following closed formula for a leave one out cross-validation

Following the equation of a linear model : \(y = X\beta + \mathbf{e}.\) it is well known that (for an OLS estimate) \(\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \| y-X \beta \|^2\) we have \(\hat{\beta} = (X'X)^{-1}Xy\).

So we can write :

\[\hat{Y} = X\hat{\beta} = X(X'X)^{-1}X'y = \mathbf{H}y\]

Where \(\mathbf{H} = X(X'X)^{-1}X'\)

If the diagonal values of \(\mathbf{H}\) are denoted by \(h_{1},\dots,h_{n}\), then the cross-validation statistic can be computed using:

\[\text{CV} = \frac{1}{n}\sum_{i=1}^n [e_{i}/(1-h_{i})]^2,\]

Where \(e_{i}\) is the residual obtained from fitting the model to all \(n\) observations. See [4] for more details.

Elastic-net

The famous library glmnet [3] solves the following problem, over a whole “path” (i.e. having \(\lambda\) varying) while being as fast as one would normally compute the solution for a single value of \(\lambda\).

\[\hat{\beta} = \underset{\beta}{\operatorname{argmin}} (\| y-X \beta \|^2 + \alpha \lambda \|\beta\|^2 + (1-\alpha) \lambda \|\beta\|_1)\]

The least angle regression (LARS) algorithm, on the other hand, is unique in that it solves the minimzation problem for all \(\lambda\) ∈ [0,∞] […] This is possible because the lasso solution is piecewise linear with respect to \(\lambda\).

The library comes with a cross-validation method supporting parallelization:

cv.glmnet(x, y, weights, offset, lambda, type.measure, nfolds, foldid, grouped, keep,
parallel, ...)

SVMs

The same authors proposed a similar method for calculating solutions to SVMs calibrations, where a path of solution depending on \(C\), the cost parameter is proposed.

An R package svmpath is available. Refering to the article:

It exploits the fact that the “hinge” loss-function is piecewise linear, and the penalty term is quadratic. This means that in the dual space, the lagrange multipliers will be pieceise linear (c.f. lars).

require("svmpath")

N <- 500
P <- 2
sigma <- 0.1

X <- matrix(rnorm(N * P), nrow = N)
Y <- 2 * (X[, 1] + X[, 2] * X[, 2] + sigma * rnorm(P) > 0) - 1

svm_model <-
  svmpath(
    x = X,
    y = Y,
    kernel.function = radial.kernel,
    plot = F
  )
plot(svm_model)

Xtest <- matrix(rnorm(N * P), nrow = N)
Ytest <-
  2 * (Xtest[, 1] + Xtest[, 2] * Xtest[, 2] + sigma * rnorm(P)  > 0) - 1

pred <-
  svmpath::predict.svmpath(svm_model, newx = Xtest, type = "class")

eval_error <- function(predicted) {
  return(mean(abs(Ytest - predicted)))
}

errors <- apply(X = pred, MARGIN = 2, FUN = eval_error)
plot(
  svm_model$lambda,
  errors,
  main = "Out Of Sample error rate",
  xlab = expression(lambda),
  ylab = "Error rate"
)

Illustration Unfortunately, it has a very large memory consumption event for small data sets.

Gradient boosting

With gradient boosting, when cross validating over the number of trees, a simple observation is to note that models are trained sequentially. If \(k\) models are trained over \(k\) folds, each model will predict a new point based on the following equation:

\[\hat{y}=\sum_{i}^{n_{trees}}\gamma_iT_i(x)\]

Now, by simply truncating the sum, one can create the sequence:

\[\big(\hat{y_T}=\sum_{i}^{T}\gamma_iT_i(x)\big)_{T<n_{trees}}\]

For each fold, this step can be performed, so that the performance at each step can be evaluated. XGBoost implements something along these lines (with callbacks), the method can be found on their repository [6].

cv(params, dtrain, num_boost_round=10, nfold=3, stratified=False, folds=None,
    metrics=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None,
    fpreproc=None, as_pandas=True, verbose_eval=None, show_stdv=True,
    seed=0, callbacks=None, shuffle=True)

A generic framework

All this is nice, but it means that each method requires a specific cross-validation routine, depending on the parameter one is focusing on. It makes it quite complex, forces to write more code (and make mistakes). It would be amazing if there were something that is generic, wouldn’t it ?

There is. All the ideas summarized here come from the paper [1] of M. Izbicki. I really recommand going through it (especially if you have a taste for algebra).

The idea is to consider :

  • the data, being a monoid with a concatenation operation : \(A \clubsuit B\) is just the bindings of the rows of \(A\) with the rows of \(B\) (the neutral element being an empty set).

  • the learner \(m\) which, given some input data, returns a model \(T : A \ \rightarrow (f:x\rightarrow \text{label})\)

  • a morphism \(\diamondsuit\), so that \(m(A \clubsuit B) = m(A) \diamondsuit m(B)\)

Now, if \(\diamondsuit\) can be evaluated in a constant time (independant of the length of \(A\) and \(B\)), one can have cross-validations in \(O(n+k)\) instead of \(O(kn)\). This works for Naive Bayes, nearest centroids, and other methods.

Now the algorithm presented is the Monoid Cross Validation. Keeping \(n\) the number of points and \(k\) the number of folds for the notations, and \(G_i\) the i-th fold for the k-fold cross-validation, \(k\) models are trained on each fold. Now the models are merged in \(k-1\) operations (using the morphism relationship) and the prediction is performed over the last fold. If the prefixes and suffixes are evaluated beforehand (i.e. building the sequences of models trained on the various folds in the two orders) a method enables to obtain the cross-validation in \(O(n+k)\) operations.

This topic is really interesting and I probably will dedicate it a full article. What about a toy ocaml implementation ?

References

[1] M. Izbicki, “Algebraic classifiers: a generic approach to fast cross-validation, online training, and parallel training,” p. 9.

[2] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu, “The Entire Regularization Path for the Support Vector Machine,” p. 25.

[3] R. J. Tibshirani and J. Taylor, “The solution path of the generalized lasso,” The Annals of Statistics, vol. 39, no. 3, pp. 1335–1371, Jun. 2011.

[4] “Plane Answers to Complex Questions: The theory of linear models” Ronald Christensen

[5] XGBoost source on Github

Learning more

For those interested about ideas in statistics stemming from algebra (and not only matrix operations) and geometry, I am only aware of two books covering this topic : Algebraic and Geometric Methods in Statistics and Lectures on Algebraic Statistics

Computational complexity of machine learning algorithms

What is complexity ? Good question. I should have addressed it right away. It is a notion which is often addressed in algorithmic classes, but not in machine learning classes… Simply put, say you have a model. Training it on \(n\) points takes \(x\) minutes. Now what happens if you train it on \(kn\) points ? If the training time is now \(kx\) then the training time is linear. Sometimes, it is more. The new training time may be \(k^2 x\). In this case, the training time would be called quadratic in the number of points. If you have a long training time for few thousands points, do not expect to be able to run this method on millions of points (kernel SVMs, I miss you in large samples).

It is harder than one would think to evaluate the complexity of a machine learning algorithm, especially as it may be implementation dependent, properties of the data may lead to other algorithms or the training time often depends on some parameters passed to the algorithm. Another caveat is that the learning algorithms are complex and rely on other algorithms.

A theoretical point of view

Some bounds

Here, I propose upper bounds (as the implementation achieving this bound will be described) when the data is dense.

Calling \(n\) the number of training sample, \(p\) the number of features, \(n_{trees}\) the number of trees (for methods based on various trees), \(n_{sv}\), the number of support vectors and \(n_{l_i}\) the number of neurons at layer \(i\) in a neural network, we have the following approximations.

Algorithm Classification/Regression Training Prediction
Decision Tree C+R \(O(n^2p)\) \(O(p)\)
Random Forest C+R \(O(n^2pn_{trees})\) \(O(pn_{trees})\)
Random Forest R Breiman implementation \(O(n^2p n_{trees})\) \(O(pn_{trees})\)
Random Forest C Breiman implementation \(O(n^2 \sqrt p n_{trees})\) \(O(pn_{trees})\)
Extremly Random Trees  C+R \(O(npn_{trees})\) \(O(npn_{trees})\)
Gradient Boosting (\(n_{trees}\)) C+R \(O(npn_{trees})\) \(O(pn_{trees})\)
Linear Regression R \(O(p^2n+p^3)\) \(O(p)\)
SVM (Kernel) C+R \(O(n^2p+n^3)\) \(O(n_{sv}p)\)
k-Nearest Neighbours (naive) C+R \(-\) \(O(np)\)
Nearest centroid C \(O(np)\) \(O(p)\)
Neural Network C+R  ? \(O (pn_{l_1} + n_{l_1} n_{l_2} + ... )\)
Naive Bayes C \(O(np)\)  \(O(p)\)

Justifications

Decision Tree based models

Obviously, ensemble methods multiply the complexity of the original model by the number of “voters” in the model, and replace the training size by the size of each bag.

When training a decision tree, a split has to be found until a maximum depth \(d\) has been reached.

The strategy for finding this split is to look for each variable (there are \(p\) of them) to the different thresholds (there are up to \(n\) of them) and the information gain that is achieved (evaluation in \(O(n)\))

In the Breiman implementation, and for classification, it is recommanded to use \(\sqrt p\) predictors for each (weak) classifier.

Extremly random trees

The search strategy for the optimal split simply does not take place in the case of ERTs. This makes it much simpler to find a (weaker) split.

However (in my experience), ERTs implementation are not much faster than RFs.

Linear regressions

The problem of finding the vector of weights \(\beta\) in a linear regression boils down to evaluating the following equation: \(\beta=(X'X)^{-1}X'Y\).

The most computationnaly intensive part is to evaluate the product \(X'X\), which is done in \(p^2n\) operations, and then inverting it, which is done in \(p^3\) operations.

Though most implementations prefer to use a gradient descent to solve the system of equations \((X'X)\beta = X'Y\), the complexity remains the same.

Support Vector Machine

For the training part, the classical algorithms require to evaluate the kernel matrix \(K\), the matrix whose general term is \(K(x_i,x_j)\) where \(K\) is the specified kernel.

It is assumed that K can be evaluated with a \(O(p)\) complexity, as it is true for common kernels (Gaussian, polynomials, sigmoid…). This assumption may be wrong for other kernels.

Then, solving the constrained quadratic programm is “morally equivalent to” inverting a square matrix of size \(n\), whose complexity is assumed to be \(O(n^3)\)

k-Nearest Neighbours

In its simplest form, given a new data point \(x\), the kNN algorithm looks for the k closest points to \(x\) in the training data and returns the most common label (or the averaged values of targets for a regression problem).

To achieve this, it is necessary to compare the distance between \(x\) and every point in the data set. This amounts to \(n\) operations. For the common distances (Euclide, Manhattan…) this operation is performed in a \(O(p)\) operations. Not that kernel k Nearest Neighbours have the same complexity (provided the kernels enjoy the same property).

However, many efforts pre-train the kNN, indexing the points with quadtrees, which enable to lower dramatically the number of comparisons to the points in the training set.

Likewise, in the case of a sparse data set, with an inverted indexation of the rows, it is not necessary to compute all the pairwise distances.

The practical point of view

All this is nice, but what about real life examples ? We can focus on sk-learn implementations below.

The assumptions will be that the complexities take the form of \(O(n^\alpha p^\beta)\) and \(\alpha\) and \(\beta\) will be estimated using randomly generated samples with \(n\) and \(p\) varying. Then, using a log-log regression, the complexities are estimated.

Though this assumption is wrong, it should help to have a better idea of how the algorithms work and it will reveal some implementation details / difference between the default settings of the same algorithm that one may overlook.

The results

Method \(\alpha\) \(\beta\)
RandomForestRegressor 1.21 0.89
ExtraTreesRegressor 1.03 0.88
AdaBoostRegressor 0.71 1.01
LinearRegression 0.72 1.3
SVR 1.96 0.42
RandomForestClassifier 1.09 0.38
ExtraTreesClassifier 0.81 0.31
AdaBoostClassifier 0.89 0.79
SVC 2.05 0.52
LogisticRegression(solver=liblinear) 0.9 0.88
LogisticRegression(solver=sag) 1.03 0.95

Surprisingly, some methods are sublinear in \(n\). Perhaps the sample sizes were too small. As expected, the Support Vector show a complexity in \(n\) that does not scale well with the sample size (close to 2).

Another interesting point to note are the complexities in \(p\) for the random forest and extra trees, the component in \(p\) varies according to the fact that we are performing a regression or a classification problem. A short look at the documentation explains it, they have different behaviors for each problem!

For the regression:

max_features : int, float, string or None, optional (default=”auto”)

The number of features to consider when looking for the best split:

If int, then consider max_features features at each split.
If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split.
If “auto”, then max_features=n_features.
If “sqrt”, then max_features=sqrt(n_features).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.

Whereas the classification default behavior is

If “auto”, then max_features=sqrt(n_features).

The code

Fore those who would like to run the code over other algorithms, here is the method I used.

import numpy as np
import pandas as pd
import time
from sklearn.linear_model import LinearRegression
import math


class ComplexityEvaluator:

    def __init__(self, nrow_samples, ncol_samples):
        self._nrow_samples = nrow_samples
        self._ncol_samples = ncol_samples

    def _time_samples(self, model, random_data_generator):
        rows_list = []
        for nrow in self._nrow_samples:
            for ncol in self._ncol_samples:
                train, labels = random_data_generator(nrow, ncol)

                start_time = time.time()
                model.fit(train, labels)
                elapsed_time = time.time() - start_time

                result = {"N": nrow, "P": ncol, "Time": elapsed_time}
                rows_list.append(result)

        return rows_list

    def Run(self, model, random_data_generator):
        data = pd.DataFrame(self._time_samples(model, random_data_generator))
        print(data)
        data = data.applymap(math.log)
        linear_model = LinearRegression(fit_intercept=True)
        linear_model.fit(data[["N", "P"]], data[["Time"]])
        return linear_model.coef_

And, some unit tests (that can just be appended at the bottom of the previous class).

if __name__ == "__main__":
    class TestModel:

        def __init__(self):
            pass

        def fit(self, x, y):
            time.sleep(x.shape[0] / 1000.)

    def random_data_generator(n, p):
        return np.random.rand(n, p), np.random.rand(n, 1)

    model = TestModel()

    complexity_evaluator = ComplexityEvaluator(
            [200, 500, 1000, 2000, 3000], [1,5,10])

    res = complexity_evaluator.Run(model, random_data_generator)

    print(res)

After a small unit test, everything seems consistent.

N   P      Time
0    200   1  0.200263
1    200   5  0.200421
2    200  10  0.200520
3    500   1  0.500928
4    500   5  0.500711
5    500  10  0.501064
6   1000   1  1.001396
7   1000   5  1.001420
8   1000  10  1.000629
9   2000   1  2.002225
10  2000   5  2.002096
11  2000  10  2.000759
12  3000   1  3.003331
13  3000   5  3.003340
14  3000  10  3.003350
[[  9.99583664e-01   1.13487892e-05]]

So let’s enjoy the number of algorithms offered by sklearn. The following list may be updated as new algorithms are tested.

import numpy as np
import ComplexityEvaluator
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.svm import SVR, SVC
from sklearn.linear_model import LogisticRegression


def random_data_regression(n, p):
    return np.random.rand(n, p), np.random.rand(n)


def random_data_classification(n, p):
    return np.random.rand(n, p), np.random.binomial(1, 0.5, n)


regression_models = [RandomForestRegressor(),
                     ExtraTreesRegressor(),
                     AdaBoostRegressor(),
                     LinearRegression(),
                     SVR()]

classification_models = [RandomForestClassifier(),
                         ExtraTreesClassifier(),
                         AdaBoostClassifier(),
                         SVC(),
                         LogisticRegression(),
                         LogisticRegression(solver='sag')]

names = ["RandomForestRegressor",
         "ExtraTreesRegressor",
         "AdaBoostRegressor",
         "LinearRegression",
         "SVR",
         "RandomForestClassifier",
         "ExtraTreesClassifier",
         "AdaBoostClassifier",
         "SVC",
         "LogisticRegression(solver=liblinear)",
         "LogisticRegression(solver=sag)"]

complexity_evaluator = ComplexityEvaluator.ComplexityEvaluator(
    [500, 1000, 2000, 5000, 10000, 15000, 20000],
    [5, 10, 20, 50, 100, 200])

i = 0
for model in regression_models:
    res = complexity_evaluator.Run(model, random_data_regression)[0]
    print(names[i] + ' | ' + str(round(res[0], 2)) +
          ' | ' + str(round(res[1], 2)))
    i = i + 1

for model in classification_models:
    res = complexity_evaluator.Run(model, random_data_classification)[0]
    print(names[i] + ' | ' + str(round(res[0], 2)) +
          ' | ' + str(round(res[1], 2)))
    i = i + 1

Learning more

Machine learning

The elements of statistical learning by Trevor Hastie, Robert Tibshirani, Jerome Friedman is a brilliant introduction to the topic and will help you have a better understanding of most of the algorithms presented in this article !

Foundations of Machine Learning by Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar and Francis Bach provides a theoretical framework to various machine learning algorithms and a detailed implementation of some of them.

Algorithms and computational complexity

If you want a thorough understanding of computational complexity theory, these books are great resources.

Computational Complexity: A Modern Approach is a clear, detailed analysis of the topic, also covering cryptography and quantum computation.

Randomized Algorithms Though more specialized than the first one, I like the interplay between probabilities and algorithms presented here.

Introduction to q-methodology

“In Q methodology there is little more reason to understand the mathematics involved than there is to understand mechanics in order to drive a car.” S.R. Brown

Context

This methodology is particularly relevant when one wants to assess the order of importance of variables as seen by a sample of individuals. Per example “A good computer… must be fast, must be beautiful, must be resistant to shocks …”. One approach could be the usual Likert scale analysis. However, the issue is that some (many) individuals may think that every factor is very important.

This is why, somehow, q methodology was born. It simply forces the respondant to order their preferences. How ? The assertions have to be placed in a pyramid. Then the position of each element for each participant constitutes the dataset. Each individual will fill a pyramid with different assertions, like the one below.

Illustration of a qsort

Fig. 1: A q sort

Who uses Q-methodology ?

Some will use q-methdology to study the conceptualization of democracy by the individuals [1] the quality of education as seen by teachers [2] or the goals and values of beef farmers in Brazil [3], to name a few published applications. But it can also be used to improve customer behavior analysis, or define relevant (at last!) opinion surveys!

A detailed example

Assertions

Imagine one is asking which assertions describe a good runner.

  • A good runner must have a steady speed (speed)
  • A good runner eats well (food)
  • A good runner trains regularly (training)
  • A good runner does not drink alcohol (alcohol)

In parenthesis are the abbreviations that will be used in the next descriptions.

Encoding the data

Once the data gathered it looks like:

Individual speed food training alcohol
1. 0 0 +1 -1
2. +1 0 0 -1
3. +1 0 0 -1
     
n. 0 +1 0 -1

Per example the row number one would be the encoding of the responses on the first figure. Each line is called a q sort.

The method

Once the data gathered and encoded, a PCA (principal component analysis) is performed on this dataset, followed by an orthogonal rotation (usually varimax). The algebraic details can be found in [5].

In the end, the data looks like:

Factor 1st factor 2nd factor
speed +1 0  
food 0 +1  
training 0 -1  
alcohol -1 0  

Analysis

The analysis is then performed observing the axis after the rotation. Quoting [7]

The most important aspect of the study file will nonetheless be the factor arrays themselves. These will be found in a table, ‘Item or Factor Scores’. […] The interpretative task in Q methodology involves the production of a series of summarizing accounts, each of which explicates the viewpoint being expressed by a particular factor.

Basically each factor will be a point of view (or will represent different similar points of view), shared by many individuals in the sample. This is a very first step to factor analysis and I recommend reading [7] in more details for an analysis with actual data.

Implementations

The “Q sort” data collection procedure is traditionally done using a paper template and the sample of statements or other stimuli printed on individual cards. Wikipedia.

Setting up an online survey

Unless you have a lot of time to write down the assertions on small cardboards, and take pictures of the way they are disposed on a blackboard, I recommend using an online survey ! This github repository does a nice job : easy-htmlq. You really do not need a lot of knowledge of web development to get it running :)

If you do not want to go through the hassle of using your own server, you can use a firebase service (possible with this version of the code) to keep the data and deploy the pages on Netlify or on Github pages (depending on your preferences).

Let’s have a look at the contents.

├── fonts
│   ├── glyphicons-halflings-regular.eot
│   ├── [...] 
│   └── glyphicons-halflings-regular.woff
├── index.html
├── LICENSE
├── logo_center.jpg
├── logo.jpg
├── logo_right.jpg
├── README.md
├── settings
│   ├── config.xml
│   ├── firebaseInfo.js
│   ├── language.xml 
│   ├── map.xml 
│   └── statements.xml 
├── src
│   ├── angular.min.js
│   ├── [...]
│   └── xml2json.min.js
├── stylesheets
│   ├── bootstrap.min.css
│   └── htmlq.css
└── templates
    ├── dropEventText.html
    ├── [...] 
    └── thanks.html

The only things that need to be changed are in settings, which are just xml files.

statements.xml and map.xml

Corresponds to the statements to be sorted by the respondant. In the first figure, we would have used the following configuration.

<?xml version="1.0" encoding="UTF-8"?>
<statements version="1.0" htmlParse="false">
  <statement id="1">
    A good runner must have a steady speed 
  </statement>
  <statement id="2">
    A good runner eats well 
  </statement>
  <statement id="3">
    A good runner trains regularly 
  </statement>
  <statement id="4">
    A good runner does not drink alcohol 
  </statement>
</statements>

Note that map.xml must be modified accordingly:

<map version="1.0" htmlParse="false">
  <column id="-1" colour="FFD5D5">1</column>
  <column id=" 0" colour="E9E9E9">2</column>
  <column id="+1" colour="9FDFBF">1</column>
</map>

config.xml

Corresponds to other questions that can be asked (checkboxes…)

firebaseInfo.js

Where to put you firebase tokens.

// Initialize Firebase
var config = {
  apiKey: "",
  authDomain: "",
  databaseURL: "",
  projectId: "",
  storageBucket: "",
  messagingSenderId: ""
};
firebase.initializeApp(config);
var rootRef = firebase.database().ref();

language.xml

Enables you to change various elements of language.

Analysis and simulations, in R

I do not have data that I can publicly disclose, so the analysis will be on simulated data.

Let’s assume you collected the data and want to analyze it. There is an R package taking charge of the analysis [4]. Let’s have a look at it with simulated data. N will be the number of individuals, target_sort a distribution that is the “actual order of preferences” of the individuals, on which we will swap some elements accross individuals, randomly. The following code does the job.

require("qmethod")

N <- 15
target_sort <- c(-2, 2, 1, -1, 1, 0, 0, -1, 1, 0)

data <- t(replicate(N, target_sort))

for (i in 1:nrow(data))
{
  switch_indices <- sample(x = ncol(data), 2)
    tmp <- data[i, switch_indices[1]]
    data[i, switch_indices[1]] <- data[i, switch_indices[2]]
    data[i, switch_indices[2]] <- tmp
}

data <- t(data)
  rownames(data) <- paste0("assertion_", 1:10)
  colnames(data) <- paste0("individual_", 1:N)

qmethod(data, nfactors = 2)

Now let’s look at the output of qmethod. It is basically a very long console output divided in several blocks. The first block is just a summary of the parameters and the data provided to the method.

Q-method analysis.
Finished on:             Mon Apr 16 18:10:10 2018
Original data:           10 statements, 15 Q-sorts
Forced distribution:     TRUE
Number of factors:       2
Rotation:                varimax
Flagging:                automatic
Correlation coefficient: pearson
Q-method analysis.
Finished on:             Mon Apr 16 18:10:10 2018
Original data:           10 statements, 15 Q-sorts
Forced distribution:     TRUE
Number of factors:       2
Rotation:                varimax
Flagging:                automatic
Correlation coefficient: pearson

Then we have more details about the data sent to the qmethod function.

Original data :
             individual_1 individual_2 individual_3 individual_4 individual_5
assertion_1            -2            0           -2           -2           -2
assertion_2             2            2            2            0            2
assertion_3             1            1            1            1            1
assertion_4            -1           -1           -1           -1            1
assertion_5            -1            1           -1            1            1
assertion_6             0            0            0            0            0
assertion_7             0            0            0            0            0
assertion_8             1           -1            1           -1           -1
assertion_9             1            1            1            1           -1
assertion_10            0           -2            0            2            0
             individual_6 individual_7 individual_8 individual_9 individual_10
assertion_1            -2            1           -2            0            -2
assertion_2             2            2            2            2             2
assertion_3             1            1            1            1            -1
assertion_4            -1           -1           -1           -1             1
assertion_5             1            1           -1            1             1
assertion_6             0            0            0           -2             0
assertion_7            -1            0            0            0             0
assertion_8             0           -1            1           -1            -1
assertion_9             1           -2            1            1             1
assertion_10            0            0            0            0             0

The loadings. In a nutshell, they represent how close someone is to the factor column at the end of the qsort. Note that the individuals where the values -2, 2 were untouched by the random swap are the one with the highest loadings with respect to f1. This may be particularly interesting if the individual are heterogeneous, and to test wether one of them (or some of them) are actually really close to the “consensual preferences” (i.e. the principal component).

Q-sort factor loadings :
                 f1    f2
individual_1   0.93 0.085
individual_2   0.26 0.762
individual_3   0.93 0.085
individual_4   0.56 0.286
individual_5   0.43 0.553
individual_6   0.80 0.517
individual_7  -0.21 0.732
individual_8   0.93 0.085
individual_9   0.27 0.849
individual_10  0.54 0.406
 (...) See item '...$loa' for the full data.

This part is seldomly reported, I ommited some lines on purpose.

Flagged Q-sorts :
              flag_f1 flag_f2
individual_1  " TRUE" "FALSE"
[...]
individual_10 "FALSE" "FALSE"
 (...) See item '...$flagged' for the full data.

This part is seldomly reported, I ommited some lines on purpose.

Statement z-scores :
             zsc_f1 zsc_f2
assertion_1  -1.841 -0.290
[...]
assertion_10 -0.088 -0.382

At last we have the figures that will usually be reported in most papers relying on qsorts. Note that the first column corresponds (almost) to c(-2, 2, 1, -1, 1, 0, 0, -1, 1, 0) as expected! Increasing N in the code allows to recover exactly the original vector.

Statement factor scores :
             fsc_f1 fsc_f2
assertion_1      -2      0
assertion_2       2      2
assertion_3       1      1
assertion_4      -1     -1
assertion_5      -1      1
assertion_6       0     -2
assertion_7       0      0
assertion_8       1     -1
assertion_9       1      1
assertion_10      0      0
Factor characteristics:
   General factor characteristics: 
   av_rel_coef nload eigenvals expl_var reliability se_fscores
f1         0.8     6       6.1       41        0.96        0.2
f2         0.8     6       5.2       35        0.96        0.2

   Correlation between factor z-scores: 
       zsc_f1 zsc_f2
zsc_f1   1.00   0.52
zsc_f2   0.52   1.00

   Standard error of differences between factors: 
     f1   f2
f1 0.28 0.28
f2 0.28 0.28
Distinguishing and consensus statements :
              dist.and.cons f1_f2 sig_f1_f2
assertion_1  Distinguishing -1.55      ****
assertion_2       Consensus -0.17          
assertion_3       Consensus -0.08          
assertion_4       Consensus  0.17          
assertion_5  Distinguishing -1.50      ****
assertion_6  Distinguishing  1.07       ***
assertion_7       Consensus -0.11          
assertion_8  Distinguishing  1.59      ****
assertion_9       Consensus  0.29          
assertion_10      Consensus  0.29          

Final words

That’s it! Q methodology is a vast topic and deserves a whole book ! Covering the theory of principal components, rotations, possibly the mathematic underlying the method, like *varimax** and the other options, detailing various surveys and how the analysis was performed, setting up tests… As I was writing this post I realized how optimistic I was when I thought I could describe the method. Anyway, I hope this will be enough for a reader to set up one’s survey and analyze it.

Sources and external sites

[1] Rune Holmgaard Andersen, Jennie L. Schulze, and Külliki Seppel, “Pinning Down Democracy: A Q-Method Study of Lived Democracy,” Polity 50, no. 1 (January 2018): 4-42.

[2] Grover, Vijay Kumar (2015, August). Developing indicators of quality school education as perceived by teachers using Q-methodology approach. Zenith International Journal of Multidisciplinary Research, 5(8), 54-65.

[3] Pereira, Mariana A., John R. Fairweather, Keith B. Woodford, & Peter L. Nuthall (2016, April). Assessing the diversity of values and goals amongst Brazilian commercial-scale progressive beef farmers using Q-methodology. Agricultural Systems, 144, 1-8.

[4] Aiora Zabala. qmethod: A Package to Explore Human Perspectives Using Q Methodology. The R Journal, 6(2):163-173, Dec 2014.

[5] H. Abdi, “Factor Rotations in Factor Analyses”

[6] A primer on Q methodology - SR Brown - Operant subjectivity, 1993 - researchgate.net

[7] S. Watts and P. Stenner, “Doing Q methodology: theory, method and interpretation,” p. 26.

What to expect from a model when there is nothing to learn ?

An imbalanced binary classification problem

Did it ever happen to you to have a model that have a lower accuracy than a constant guessing (the one that predicts the most common class) ? It happened to me recently and I was quitte puzzled: 200 data points, two classes, 60% of the sample belonged to class one, the remaining part, to class two.

After running a random forest, I observed an accuracy of 50% on the out of bag predictions. This seemed really low, if there were nothing to learn from the data, then why did the model did not predict the most common class and achieved an accuracy of 60% ?

Playing with the parameters (increasing min_sample_leaf) improved the performance (but it was forcing the trees to have a ridiculously low depth).

So I decided to simulate the distribution of the out of sample accuracy of my model with the following snippet! (in R)

library("randomForest")

N <- 100
P <- 0.6
TRIALS <- 10000

random_response <- as.factor(rbinom(n = N, size = 1, p = P))

evaluate_error_rate <- function(blob)
{
  model <-
    randomForest(x = matrix(rnorm(n = N * 5), nrow = N), y = random_response)
    model$err.rate[nrow(model$err.rate), 1]
}

res <- sapply(1:TRIALS, evaluate_error_rate)

hist(res, main = 'Distribution of the out of sample error',
      xlab = 'Out of sample error (percent of mistmatches)')
mean(res) # 0.417235 seems good...
sum(res[res>0.5])/length(res) # 0.010754
sum(res[res>0.4])/length(res) # 0.272972

Illustration

What were the results ? The mean of the sample seems to converge to the accuracy of a constant predictor, which is good. However, the probability that a model makes more mistake than a random guess was not that low.

What is event better is that, given a training procedure, a number of points, a number of variables and an imbalance between classes, this distribution should not change. So now, one could even imagine to create a statistical test “did my model actually learned something”, and give a probability to the rejection of the null hypothesis. I leave this to the reader :)

Learning more

The elements of statistical learning by Trevor Hastie, Robert Tibshirani, Jerome Friedman is a brilliant introduction to the topic and will help you have a better understanding of most of the algorithms presented in this article !

Applied predictive modelling is also good introduction to predictive modelling, with R (used in the code snippets) and machine learning.