Introduction
Python has an amazing progress bar library tqdm. In a for loop or a map statement, tqdm
allows the user to know how much time is needed for the loop to end. Though simple, this happens to be very useful when performing treatment on a dataset with hundreds of thousands of lines.
76%|████████████████████████████ | 7568/10000 [00:33<00:10, 229.00it/s]
[##############-------------------------] 38.00% ending in : 3.21s
Regarding machine learning pipelines, Python does a great job. However, it lacks the power of compiler. First, pre-treatment of the data is sometimes done using functions written in pure Python, not particularly fast.
But more important, imagine the following pipeline in OCaml:
let new_text = my_text
|> Str.split
|> List.map fix_spelling
|> remove_stopwords
|> List.map vectorize
|> FloatArray.average
Maybe these functions do not really exist, but a sentence (my_text
) is splitted in a list of words, the spelling of each word is fixed (List.map fix_spelling
), the stopwords are removed, the words are vectorized (with something like word2vec) and at last, the list of vectorized words is averaged. It is pretty clear that we input a sentence (a string
) and return a float array
.
Now imagine that for some reason, the pipeline is misspelled in something like this:
let new_text = my_text
|> Str.split
|> List.map fix_spelling
|> List.map vectorize
|> remove_stopwords
|> FloatArray.average
The compiler would immediatly complain that remove_stopwords
does not apply to float array list
(but was expecting a string list
). On the other hand, depending on the implementation, it may take a while for the same program in Pyhton to realize that some steps cannot be performed.
Therefore, I believe that OCaml could be an amazing fit for machine learning: the computing intensive libraries (random forests, SVMs…) can be implemented in another language (which is the case with scikit-learn) and the pipeline would be much more robust in OCaml.
As I love to use tqdm in Python, I wanted an OCaml equivalent.
OCaml functors
A functor is simply a module, which accepts another module as a parameter. Let’s look at the following module signature, a finite collection. It has a map
function, which transforms every element and a length
function. Note that length
could be inferred from map and a ref
. However, modules such as Seq allow infinite collections: they have a map function, but cannot evaluate the length of a sequence.
module type FINITECOLLECTION = sig
type 'a t
val map : ('a -> 'b) -> 'a t -> 'b t
val length : 'a t -> int
end
Given this signature we can now define a module which will operate on finite collections. Without paying much attention to the contents of map
, simply note that the only functions called in map are the ones specified in the signature: Collection.length
and Collection.map
.
(** Creates a new map function, reporting the progress in the terminal *)
module Prokope(Collection : FINITECOLLECTION) = struct
(** Applies f to a finite collection l, while reporting progress in the terminal *)
let map f l =
let open ProkopeUtils in
let starting_time = Unix.gettimeofday() in
let wrapped_f = progress_wrapper default_progress_bar (Collection.length l) starting_time f in
let res = add_indices_to_collection Collection.map l
|> Collection.map wrapped_f in
print_string "\n";
res
end
Now, we would like to be able to use this new map method on array
and list
types. For this, we just have to reproduce the signature of the FINITECOLLECTION
. The following does it.
module UsualArray = struct
type 'a t = 'a array
include Array
end
module Array = Prokope(UsualArray)
module UsualList = struct
type 'a t = 'a list
include List
end
module List = Prokope(UsualList)
Now this function can simply be invoked appending Prokope before
let dummy a =
ignore (Unix.select [] [] [] 0.1);
a
let double x = 2 * x
let id x = x
let () =
let sample = List.init 50 id in
assert(List.map double sample = Prokope.List.map double sample);
let _ = Prokope.List.map dummy sample in
()
And you would see:
[#######################################] 100.00% ending in : 0.00s
[##############-------------------------] 38.00% ending in : 3.21s
The implementation
A couple of tricks
A couple of things to know is that OCaml may not flush to the standard output as you may expect. But this can be forced by calling
flush stdout;
Now to write a line on the actual line of the terminal, the only thing to do is to call
print_string ("\r"^(progress_bar_string config percentage)^"ending in : "^remaining_time_str^"s");
You may find more informations on \n
and \r
on StackOverflow though.
The implementation
I decided to call this library Prokope. Maybe on opam someday ;)
Prokope utils
Let’s start with the helpers needed to display and configure a progress bar.
prokopeUtils.ml
(** Various helpers for Prokope *)
(** Graphical parameters for the progress bar *)
type progress_bar_config = {
width: int;
cell_filled: string;
cell_empty: string
}
(** The width of the terminal *)
let default_width =
match Terminal_size.get_columns () with
| Some(a) -> a
| None -> 80
(** A default progress bar *)
let default_progress_bar = {
width = default_width;
cell_filled = "#";
cell_empty = "-"
}
(** For a collection 'a t returns a collection of type (int, 'a) t *)
let add_indices_to_collection map collection =
let index = ref (-1) in
let add_index_to_element e =
index := !index + 1;
(!index, e)
in
map add_index_to_element collection
(** Repeats the string s n times *)
let repeat_string s n =
List.init n (fun _ -> ())
|> List.fold_left (fun e _ -> s^e) ""
(** Returns the string represented by the progress bar configuration and the percentage *)
let progress_bar_string config percentage =
let gauge_string n_steps inner_width =
let remaining_steps = inner_width - n_steps in
"["^
(repeat_string config.cell_filled n_steps)^
(repeat_string config.cell_empty remaining_steps)^
"]"
in
let inner_width = max (config.width - 40) 0 in
let n_steps = int_of_float (percentage *. (float_of_int inner_width)) in
let rounded_percentage_str = Printf.sprintf "%.2f" (100. *. percentage) in
(gauge_string n_steps inner_width)^" "^rounded_percentage_str^"% "
(** For a fonction ('a -> 'b) returns a function ( ('a, int) -> 'b )
* "aware" of its starting time and operations to perform *)
let progress_wrapper config list_length starting_time f a =
let i, x = a in
let percentage = (float_of_int i +. 1.) /. (float_of_int list_length) in
let time_taken = (Unix.gettimeofday() -. starting_time) in
let speed = (float_of_int i) /. time_taken in
let remaining_time_str = (float_of_int (list_length - i)) /. speed
|> Printf.sprintf "%.2f"
in
print_string ("\r"^(progress_bar_string config percentage)^"ending in : "^remaining_time_str^"s");
flush stdout;
f x
The functor
Why a functor ? To enable a user to generate a progress bar for every finite collection ! Imagine that your data is stored on a binary tree for some reason. Having a map
and a length
function over trees is trivial (and it may also belong to your tree module). Calling Prokope(MyTree)
would therefore return a new module, for which map
shows a progress bar.
prokope.ml
(** User intended functor *)
(** Any collection whose elements can be counted and mapped.
Note that length has to be implemented (this disqualifies
modules such as Seq). *)
module type FINITECOLLECTION = sig
type 'a t
val map : ('a -> 'b) -> 'a t -> 'b t
val length : 'a t -> int
end
(** Creates a new map function, reporting the progress in the terminal *)
module Prokope(Collection : FINITECOLLECTION) = struct
(** Applies f to a finite collection l, while reporting progress in the terminal *)
let map f l =
let open ProkopeUtils in
let starting_time = Unix.gettimeofday() in
let wrapped_f = progress_wrapper default_progress_bar (Collection.length l) starting_time f in
let res = add_indices_to_collection Collection.map l
|> Collection.map wrapped_f in
print_string "\n";
res
end
module UsualArray = struct
type 'a t = 'a array
include Array
end
module Array = Prokope(UsualArray)
module UsualList = struct
type 'a t = 'a list
include List
end
module List = Prokope(UsualList)
Testing and displaying
run.ml
let dummy a =
ignore (Unix.select [] [] [] 0.1);
a
let double x = 2 * x
let id x = x
let () =
let sample = List.init 50 id in
assert(List.map double sample = Prokope.List.map double sample);
let _ = Prokope.List.map dummy sample in
()
run.sh
ocamlfind opt -package unix,terminal_size -linkpkg -o run.byte prokopeUtils.ml prokope.ml run.ml
./run.byte
Hope you liked it! If you feel you may need it some day, or have improvements suggestions, please let me know, I would improve it and publish it!