Introduction
Following the previous article about machine learning and OCaml, I present another learning algorithm written in OCaml, together with a comparison to an equivalent Python implementation (run with PyPy).
First of all, PyPy is a fast, compliant alternative implementation of the Python language (2.7.13 and 3.5.3). By fast, I mean orders of magnitude faster than a standard Python execution. As many libraries are not (directly) compatible with it, it remains fairly little used; however, when sticking to the standard library, PyPy works like a charm.
Online logistic regression
A common algorithm, often encountered on Kaggle, is online logistic regression. The data used here comes, once again, from the Avazu CTR competition. To keep the experiment short, I will only work with the first 1 000 000 lines of the train.csv file (referred to below as train_small.csv).
This algorithm allows training on streaming data, which means that one does not need to load all the data in memory (which is sometimes infeasible). These methods are usually quite fast and simple to implement, which allows a lot of tuning and feature engineering on the fly.
On top of that, it automatically “one-hot encodes” the data via the hashing trick. Even if the number of categories per variable is large (which results in an underlying high-dimensional space), this is not an issue for this method, and the memory used by the program remains constant: some components are simply projected onto the same index when their hashes collide.
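To make this concrete, here is a toy OCaml sketch of the hashing trick. The column name, value and bucket count are only illustrative; the actual code used in the experiment appears further below.

(* Toy sketch of the hashing trick: any (column, value) pair is hashed
   to one of d buckets, so the weight vector keeps a fixed size
   however many distinct categories appear in the data. *)
let d = 1 lsl 20  (* 2^20 buckets *)

let index key value = Hashtbl.hash (key ^ " " ^ value) mod d

let () = Printf.printf "site_id=abc123 -> bucket %d\n" (index "site_id" "abc123")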
The Python code
I do not take any credit for the awesomeness of this code. I simply paste it here so that the experiment is easier to reproduce. Some modifications have been made so that it is more directly comparable to the OCaml implementation (most of them are commented).
'''
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
Version 2, December 2004
Copyright (C) 2004 Sam Hocevar <sam@hocevar.net>
Everyone is permitted to copy and distribute verbatim or modified
copies of this license document, and changing it is allowed as long
as the name is changed.
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. You just DO WHAT THE FUCK YOU WANT TO.
'''
from datetime import datetime
from csv import DictReader
from math import exp, log, sqrt
##############################################################################
# parameters #################################################################
##############################################################################
# A, paths
train = 'train_small.csv' # path to training file
# B, model
alpha = .01 # learning rate
# C, feature/hash trick
D = 2 ** 20 # number of weights to use
do_interactions = False # whether to enable poly2 feature interactions
# D, training/validation
epoch = 1 # learn training data for N passes
holdout = 100 # use every N training instance for holdout validation
##############################################################################
# class, function, generator definitions #####################################
##############################################################################
class logistic_regression(object):
    ''' Classical logistic regression

        In the original script this class was only given as a reference
        next to a more complex algorithm; here it is the learner that is
        actually used, to keep the comparison with OCaml simple.
    '''

    def __init__(self, alpha, D, interaction=False):
        # parameters
        self.alpha = alpha

        # model
        self.w = [0.] * D

    def predict(self, x):
        # model
        w = self.w

        # wTx is the inner product of w and x
        wTx = sum(w[i] for i in x)

        # bounded sigmoid function, this is the probability of a click
        return 1. / (1. + exp(-max(min(wTx, 35.), -35.)))

    def update(self, x, p, y):
        # parameter
        alpha = self.alpha

        # model
        w = self.w

        # gradient under logloss
        g = p - y

        # update w
        for i in x:
            w[i] -= g * alpha


def logloss(p, y):
    ''' FUNCTION: Bounded logloss

        INPUT:
            p: our prediction
            y: real answer

        OUTPUT:
            logarithmic loss of p given y
    '''
    # p = max(min(p, 1. - 10e-15), 10e-15)
    return -log(p) if y == 1. else -log(1. - p)


def data(path, D):
    ''' GENERATOR: Apply hash-trick to the original csv row
           and for simplicity, we one-hot-encode everything

        INPUT:
            path: path to training or testing file
            D: the max index that we can hash to

        YIELDS:
            t: instance counter
            x: a list of hashed and one-hot-encoded 'indices'
               we only need the index since all values are either 0 or 1
            y: y = 1 if we have a click, else y = 0
    '''
    for t, row in enumerate(DictReader(open(path))):
        # process id
        # ID = row['id'] -> not needed
        del row['id']

        # process clicks
        y = 0.
        # if 'click' in row: -> always true in the train set
        if row['click'] == '1':
            y = 1.
        del row['click']

        # turn hour really into hour, it was originally YYMMDDHH
        # row['hour'] = row['hour'][6:]

        # build x
        x = [0]  # 0 is the index of the bias term
        # for key in sorted(row):  # sort is for preserving feature ordering
        for key in row:
            value = row[key]

            # one-hot encode everything with hash trick
            index = abs(hash(key + '_' + value)) % D
            x.append(index)

        yield t, x, y
##############################################################################
# start training #############################################################
##############################################################################
start = datetime.now()
# initialize ourselves a learner
learner = logistic_regression(alpha, D)
# start training
for e in xrange(epoch):
    loss = 0.
    count = 0
    for t, x, y in data(train, D):  # data is a generator
        # t: instance counter
        # x: features
        # y: label (click)

        # step 1, get prediction from learner
        p = learner.predict(x)

        if t % holdout == 0:
            # step 2-1, calculate holdout validation loss
            # we do not train with the holdout data so that our
            # validation loss is an accurate estimation of
            # the out-of-sample error
            loss += logloss(p, y)
            count += 1
        else:
            # step 2-2, update learner with label (click) information
            learner.update(x, p, y)

        if t % 100000 == 0 and t > 1:
            print('encountered: %d\tcurrent logloss: %f' % (
                t, loss/count))
OCaml implementation
open Maths
open Read_tools

(* hash every (column, value) pair of the row into one of n indices *)
let get_indices dict n =
  Hashtbl.fold (fun k v acc -> ((Hashtbl.hash (k ^ " " ^ v)) mod n) :: acc) dict []

let predict indices weights = sigmoid (dot_product indices weights)

(* gradient step under logloss: the gradient for each active index is p - y *)
let rec update indices weights p y alpha = match indices with
  | [] -> ()
  | h :: tail ->
    weights.(h) <- weights.(h) -. (p -. y) *. alpha;
    update tail weights p y alpha

let n = pow 2 20
let weights = Array.make n 0.
let dict_stream = dict_reader "train_small.csv"
let updater indices weights p y = update indices weights p y 0.01
let refresh_loss = 100000

let train weights updater dict_stream =
  let rec aux weights updater dict_stream t loss n =
    match (try Some (Stream.next dict_stream) with _ -> None) with
    | Some dict ->
      Hashtbl.remove dict "id";
      let y = float_of_string (Hashtbl.find dict "click") in
      Hashtbl.remove dict "click";
      let indices = get_indices dict n in
      let p = predict indices weights in
      updater indices weights p y;
      if (t mod refresh_loss) = 0 && t > 0 then begin
        print_string "encountered: ";
        print_int t;
        print_string "\t logloss:";
        print_float (loss /. float_of_int t);
        print_endline " "
      end;
      aux weights updater dict_stream (t + 1) (loss +. log_loss p y) n
    | None -> ()
  in
  aux weights updater dict_stream 0 0. (Array.length weights)

let () = train weights updater dict_stream
The file maths.ml contains a few math functions.
(* sum of the weights at the active indices (all feature values are 0 or 1) *)
let dot_product indices weights =
  let rec aux indices weights acc =
    match indices with
    | [] -> acc
    | h :: tail -> aux tail weights (acc +. weights.(h))
  in
  aux indices weights 0.

(* bounded sigmoid, as in the Python version *)
let sigmoid x = 1. /. (1. +. exp (0. -. (max (min x 35.) (-35.))))

(* bounded logloss *)
let log_loss p y = match y with 1. -> -. log p | _ -> -. log (1. -. p)

(* fast integer exponentiation *)
let rec pow a = function
  | 0 -> 1
  | 1 -> a
  | n ->
    let b = pow a (n / 2) in
    b * b * (if n mod 2 = 0 then 1 else a)
And read_tools.ml implements various tools to stream rows from a csv file.
open Str

let csv_separator = ","
let err_lists_sizes = "Incompatible lists size"

(* lazily stream the lines of a channel *)
let line_stream_of_channel channel =
  Stream.from (fun _ -> try Some (input_line channel) with End_of_file -> None)

let read_lines file_path = line_stream_of_channel (open_in file_path)

let read_first_line file_path = Stream.next (read_lines file_path)

let split_line line = Str.split (Str.regexp csv_separator) line

(* element-wise concatenation of two string lists of equal length *)
let concat_elements list1 list2 =
  let rec aux list1 list2 acc = match list1, list2 with
    | [], [] -> acc
    | _, [] | [], _ -> failwith err_lists_sizes
    | h1 :: t1, h2 :: t2 -> aux t1 t2 ((h1 ^ h2) :: acc)
  in
  List.rev (aux list1 list2 [])

(* build a hash table mapping the header elements to the row elements *)
let to_dict list1 list2 =
  let rec aux list1 list2 my_hash = match list1, list2 with
    | [], [] -> my_hash
    | _, [] | [], _ -> failwith err_lists_sizes
    | h1 :: t1, h2 :: t2 -> Hashtbl.add my_hash h1 h2; aux t1 t2 my_hash
  in
  aux list1 list2 (Hashtbl.create 15)

(* stream the csv file as one hash table (column -> value) per row;
   note that an exhausted stream raises Stream.Failure, not End_of_file,
   hence the two cases below *)
let dict_reader file_path =
  let line_stream = read_lines file_path in
  let header = split_line (Stream.next line_stream) in
  Stream.from
    (fun _ ->
       try Some (to_dict header (split_line (Stream.next line_stream)))
       with Stream.Failure | End_of_file -> None)
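As a quick sanity check, dict_reader can be consumed like this (a toy snippet, assuming train_small.csv is in the current directory):

(* print the "click" value of the first row streamed from the file *)
let () =
  let rows = dict_reader "train_small.csv" in
  let first_row = Stream.next rows in
  print_endline (Hashtbl.find first_row "click")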
Results
$ time pypy main.py
encountered: 100000 current logloss: 0.435212
encountered: 200000 current logloss: 0.433022
encountered: 300000 current logloss: 0.416148
encountered: 400000 current logloss: 0.411722
encountered: 500000 current logloss: 0.403127
encountered: 600000 current logloss: 0.405477
encountered: 700000 current logloss: 0.405665
encountered: 800000 current logloss: 0.400225
encountered: 900000 current logloss: 0.397522
real 0m24.649s
user 0m23.964s
sys 0m0.540s
$ ocamlfind opt -linkpkg str.cmxa maths.ml read_tools.ml main.ml -o main.byte
$ time ./main.byte
encountered: 100000 logloss:0.422601941149
encountered: 200000 logloss:0.418270062183
encountered: 300000 logloss:0.409809392923
encountered: 400000 logloss:0.40026898568
encountered: 500000 logloss:0.396667944914
encountered: 600000 logloss:0.398242961764
encountered: 700000 logloss:0.397423028349
encountered: 800000 logloss:0.394274547682
encountered: 900000 logloss:0.391874034523
real 0m23.187s
user 0m22.980s
sys 0m0.176s
The logloss seems lower with the OCaml code. Don’t be fooled by this: Python and OCaml have different hash functions, so the collisions between (one-hot encoded) features are not the same on each side. As for the training time, there is no noticeable difference! I guess that concludes the story… For this problem, just pick the language you feel most comfortable with :)
Improvements
In terms of code
Actually, OCaml also has a csv library, available via opam.
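For instance, dict_reader could presumably be rewritten on top of it. A minimal sketch, assuming the package’s Csv.of_channel and Csv.next functions (to be checked against the installed version) and reusing to_dict from read_tools.ml:

(* sketch only: replaces the hand-rolled line splitting with the csv
   package, which also handles quoted fields correctly *)
let dict_reader file_path =
  let csv_in = Csv.of_channel (open_in file_path) in
  let header = Csv.next csv_in in
  Stream.from
    (fun _ ->
       try Some (to_dict header (Csv.next csv_in))
       with End_of_file -> None)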
In terms of accuracy
There are many variations of this algorithm, the “most famous” probably being FTRL (follow the regularized leader).
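As a pointer, here is a minimal, untested sketch of the per-coordinate FTRL-Proximal update (after McMahan et al., 2013). The hyper-parameter values and the names alpha, beta, l1, l2 are illustrative assumptions, not part of the code above.

(* per-coordinate FTRL-Proximal: weights are stored lazily through the
   accumulators z (shifted gradients) and n (squared gradients) *)
let alpha = 0.1   (* learning rate scale *)
let beta  = 1.0   (* smoothing term *)
let l1    = 1.0   (* L1 regularization strength *)
let l2    = 1.0   (* L2 regularization strength *)

let d = 1 lsl 20
let z = Array.make d 0.
let n = Array.make d 0.

(* weight for coordinate i, derived from z and n; L1 induces sparsity *)
let weight i =
  if abs_float z.(i) <= l1 then 0.
  else
    let sign = if z.(i) < 0. then -1. else 1. in
    -. (z.(i) -. sign *. l1) /. ((beta +. sqrt n.(i)) /. alpha +. l2)

let sigmoid x = 1. /. (1. +. exp (-. (max (min x 35.) (-35.))))

let predict indices =
  sigmoid (List.fold_left (fun acc i -> acc +. weight i) 0. indices)

(* for each active index, the gradient of the logloss is g = p - y *)
let update indices p y =
  let g = p -. y in
  List.iter
    (fun i ->
       let sigma = (sqrt (n.(i) +. g *. g) -. sqrt n.(i)) /. alpha in
       z.(i) <- z.(i) +. g -. sigma *. weight i;
       n.(i) <- n.(i) +. g *. g)
    indices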
Learning more
Foundations of Machine Learning by Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar provides a theoretical framework for various machine learning algorithms and a detailed implementation of some of them. The section about streaming algorithms is particularly useful in the context of this article.
Real World OCaml: Functional programming for the masses by Yaron Minsky, Anil Madhavapeddy and Jason Hickey is a good introduction to OCaml.