============================
Chapter 1: Related work
============================

.. toctree::
   :maxdepth: 2

.. admonition:: Summary
   :class: hint

   In this chapter, we recall some concepts of machine learning. First, we review the principles, the learning framework and the evaluation protocol in supervised learning. Then, we present the algorithms used in our work: :math:`k`-Nearest Neighbors (:math:`k`-NN) and Support Vector Machines (SVM).

1.1 Classification and Regression
=========================================

In this section, we review some terminology used in machine learning. First, we recall the principle of machine learning. Then, we detail how to design a framework for supervised learning. After that, we present model evaluation. Finally, we review data normalization.

1.1.1 Machine learning principle
----------------------------------------

The idea of machine learning (also known as pattern learning or pattern recognition) is to imitate, with algorithms executed on computers, the ability of living beings to learn from examples. For instance, to teach a child how to read letters, we show him, during a training phase, labeled examples of letters ('A', 'B', 'C', etc.) written in different styles and fonts. We do not give him a complete and analytic description of the topology of the characters, but labeled examples. Then, during a testing phase, we want the child to be able to recognize and correctly label the letters that have been seen during training, and also to generalize to new instances :cite:`Mei2015`.

Let :math:`X=\{\textbf{x}_i,y_i\}_{i=1}^n` be a training set of :math:`n` vector samples :math:`\textbf{x}_i \in \mathbb{R}^p` and :math:`y_i` their corresponding labels. The aim of supervised machine learning is to learn a relationship (model) :math:`f` between the samples :math:`\textbf{x}_i` and their labels :math:`y_i` based on examples :cite:`Bishop2006,Dreyfus2006,Duda1973`. After the training phase based on labeled examples :math:`(\textbf{x}_i,y_i)`, the model :math:`f` has to be able to generalize in the testing phase, *i.e.*, to give a correct prediction :math:`y_j` for new instances :math:`\textbf{x}_j` that have not been seen during training.

When :math:`y_i` are class labels (*e.g.*, classes 'A', 'B', 'C' in the case of the child's reading), learning the model :math:`f` is a classification problem; when :math:`y_i` is a continuous value (*e.g.*, the energy consumption of a building), learning :math:`f` is a regression problem. For both problems, when a part of the labels :math:`y_i` is known and another part is unknown during training, learning :math:`f` is a semi-supervised problem :cite:`Zhu2007`. Note that when the labels :math:`y_i` are totally unknown, learning :math:`f` refers to a clustering problem (unsupervised learning) :cite:`Jain1999,Chen1996`. Semi-supervised and unsupervised learning problems are out of the scope of this work.

1.1.2 Model selection in supervised learning
--------------------------------------------------

A key objective of supervised learning algorithms is to build models :math:`f` with good generalization capabilities, *i.e.*, models :math:`f` that correctly predict the labels :math:`y_j` of new unknown samples :math:`\textbf{x}_j`. There exist two types of errors committed by a classification or regression model :math:`f`: the training error and the generalization error. The training error is the error on the training set and the generalization error is the error on the testing set.
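To make these two errors concrete, the following minimal sketch fits a :math:`k`-NN classifier and reports both errors. It assumes the scikit-learn library, a synthetic dataset and the choice :math:`k=1`, none of which are part of the framework described here; they serve only as an illustration.

.. code-block:: python

   # Minimal illustration of training error vs. generalization error.
   # The synthetic dataset and k=1 are assumptions made only for this sketch.
   from sklearn.datasets import make_classification
   from sklearn.model_selection import train_test_split
   from sklearn.neighbors import KNeighborsClassifier

   # Build a toy dataset X = {(x_i, y_i)} and split it into training/testing sets.
   X, y = make_classification(n_samples=500, n_features=10, random_state=0)
   X_train, X_test, y_train, y_test = train_test_split(
       X, y, test_size=0.3, random_state=0)

   # Learn the model f on the training set.
   f = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

   # Training error: error on the samples seen during training.
   train_error = 1.0 - f.score(X_train, y_train)
   # Generalization error: estimated on samples never seen during training.
   test_error = 1.0 - f.score(X_test, y_test)
   print(f"training error = {train_error:.3f}, test error = {test_error:.3f}")

With :math:`k=1` the training error is zero while the test error is not, which already hints at the overfitting phenomenon discussed next.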
A good supervised model :math:`f` must not only fit the training data :math:`X` well, it must also accurately classify records it has never seen before (test set :math:`X_{Test}`). In other words, a good model :math:`f` must have a low training error as well as a low generalization error. This is important because a model that fits the training data too closely can have a poorer generalization error than a model with a higher training error. Such a situation is known as model overfitting (:numref:`Overfitting`).

In general, the complexity in learning can be measured through two measures: the **information complexity** and the **computational complexity**. The information complexity concerns the generalization performance of the learner: how many samples are needed? How much time will the learner take to converge to its optimal solution? Etc. The computational complexity deals with the computational resources needed to make a new prediction based on the training data.

.. _Overfitting:

.. figure:: /docs/related_work/images/Overfitting.png
   :width: 300px
   :align: center

   An example of overfitting in the case of classification. The objective is to separate the blue points from the red points. The black line shows a classifier :math:`f_1` with low complexity, whereas the green line illustrates a classifier :math:`f_2` with high complexity. On the training examples (blue and red points), the model :math:`f_2` separates the classes perfectly but may lead to poor generalization on new unseen examples. Model :math:`f_1` is often preferred.

In most cases, learning algorithms require the tuning of some hyper-parameters. A first approach could consist in trying all the possible combinations of hyper-parameter values and keeping the one with the lowest training error. However, as discussed above, the model with the lowest training error is not always the one with the best generalization error. To avoid overfitting, the training set can be divided into two sets: a learning set and a validation set. Suppose that we have two hyper-parameters to tune: :math:`C` and :math:`\gamma`. We perform a grid search over each combination :math:`(C,\gamma)` of the hyper-parameters, which in this case forms a 2-dimensional grid (:numref:`GridSearch`). For each combination (a cell of the grid), the model is learned on the learning set and evaluated on the validation set. At the end, the hyper-parameters with the lowest error on the validation set are retained and the model :math:`f` is learned on all the training data using these optimal hyper-parameters. This process is referred to as **model selection**.
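The sketch below illustrates this grid search with a single held-out validation set. It assumes scikit-learn and an RBF-kernel SVM (whose hyper-parameters are precisely :math:`C` and :math:`\gamma`); the dataset and the grid values are arbitrary choices made for the example, not part of the procedure itself.

.. code-block:: python

   # Sketch of hyper-parameter selection with a held-out validation set.
   # The data, the grid values and the SVM estimator are illustrative assumptions.
   from itertools import product

   from sklearn.datasets import make_classification
   from sklearn.model_selection import train_test_split
   from sklearn.svm import SVC

   X_train, y_train = make_classification(n_samples=400, n_features=10,
                                          random_state=0)

   # Split the training set into a learning set and a validation set.
   X_learn, X_val, y_learn, y_val = train_test_split(
       X_train, y_train, test_size=0.25, random_state=0)

   grid_C = [0.1, 1, 10, 100]
   grid_gamma = [0.001, 0.01, 0.1, 1]

   best = None
   for C, gamma in product(grid_C, grid_gamma):      # one cell of the 2-D grid
       model = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_learn, y_learn)
       val_error = 1.0 - model.score(X_val, y_val)   # error on the validation set
       if best is None or val_error < best[0]:
           best = (val_error, C, gamma)

   # Relearn f on all the training data with the retained hyper-parameters.
   _, C_opt, gamma_opt = best
   f = SVC(kernel="rbf", C=C_opt, gamma=gamma_opt).fit(X_train, y_train)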
.. _GridSearch:

.. figure:: /docs/related_work/images/GridSearch.png
   :width: 250px

   Example of a 2-dimensional grid search for the parameters :math:`C` and :math:`\gamma`. It defines a grid where each cell contains a combination :math:`(C,\gamma)`. Each combination is used to learn the model and is evaluated on the validation set.

An alternative is **cross-validation** with :math:`v` folds, illustrated in :numref:`Cross_validation2`. In this approach, we partition the training data into :math:`v` equal-sized subsets. The objective is to evaluate the error for each combination of hyper-parameters. For each run, one fold is chosen for validation, while the :math:`v-1` remaining folds are used as the learning set. We repeat the process for each fold, thus :math:`v` times. Each fold gives one validation error and thus we obtain :math:`v` errors. The total error for the current combination of hyper-parameters is obtained by summing up the errors over all :math:`v` folds. When :math:`v=n`, the size of the training set, this approach is called leave-one-out or jackknife: each validation set contains only one sample. The advantage is that as much data as possible is used for training. Moreover, the validation sets are mutually exclusive and they cover the entire data set. The drawback is that it is computationally expensive to repeat the procedure :math:`n` times. Furthermore, since each validation set contains only one record, the variance of the estimated performance metric is usually high. This procedure is often used when :math:`n`, the size of the training set, is small. There exist other methods such as sub-sampling or the bootstrap :cite:`Duda1973,Dreyfus2006`. We only use cross-validation in our experiments; a small code sketch of the :math:`v`-fold procedure is given at the end of this section.

.. _Cross_validation2:

.. figure:: /docs/related_work/images/Cross_validation2.jpg
   :width: 800px

   :math:`v`-fold cross-validation for one combination of parameters. For each of the :math:`v` experiments, :math:`v-1` folds are used for learning and a different fold for validation; the error for this combination of parameters is the mean of all validation errors. The procedure is illustrated for :math:`v=4`.

.. figure:: /docs/related_work/images/LearningFramework.png
   :width: 800px

   General framework for building a supervised (classification/regression) model. Example with 3 features and 2 classes ('Yes' and 'No').

.. figure:: /docs/related_work/images/Dataset.png
   :width: 800px

   Division of a dataset into 3 datasets: training, test and operational.
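As announced above, the following sketch runs the :math:`v`-fold cross-validation for each combination :math:`(C,\gamma)` and relearns the model on the whole training set with the best combination. It again assumes scikit-learn; the dataset, the grid and the choice :math:`v=4` are illustrative assumptions.

.. code-block:: python

   # Sketch of v-fold cross-validation for hyper-parameter selection.
   # Dataset, grid values and v=4 are assumptions made for illustration only.
   from itertools import product

   import numpy as np
   from sklearn.datasets import make_classification
   from sklearn.model_selection import KFold
   from sklearn.svm import SVC

   X_train, y_train = make_classification(n_samples=400, n_features=10,
                                          random_state=0)

   v = 4
   folds = KFold(n_splits=v, shuffle=True, random_state=0)

   best = None
   for C, gamma in product([0.1, 1, 10, 100], [0.001, 0.01, 0.1, 1]):
       fold_errors = []
       # One run per fold: v-1 folds for learning, one fold for validation.
       for learn_idx, val_idx in folds.split(X_train):
           model = SVC(kernel="rbf", C=C, gamma=gamma)
           model.fit(X_train[learn_idx], y_train[learn_idx])
           fold_errors.append(1.0 - model.score(X_train[val_idx], y_train[val_idx]))
       cv_error = np.mean(fold_errors)   # error for this (C, gamma) combination
       if best is None or cv_error < best[0]:
           best = (cv_error, C, gamma)

   # Relearn the final model on all training data with the retained hyper-parameters.
   _, C_opt, gamma_opt = best
   f = SVC(kernel="rbf", C=C_opt, gamma=gamma_opt).fit(X_train, y_train)

The manual loop is only meant to mirror the description above; scikit-learn also offers this procedure directly through its ``GridSearchCV`` utility.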