/*!

@file kmeans.txt
@author Ryan Curtin
@brief Tutorial for how to use k-means in mlpack.

@page kmtutorial K-Means tutorial (kmeans)

@section intro_kmtut Introduction

The popular k-means algorithm for clustering has been around since the late
1950s, and the standard algorithm was proposed by Stuart Lloyd in 1957.  Given a
set of points \f$ X \f$, k-means clustering aims to partition each point \f$ x_i
\f$ into a cluster \f$ c_j \f$ (where \f$ j \le k \f$ and \f$ k \f$, the number
of clusters, is a parameter).  The partitioning is done to minimize the
objective function

\f[
\sum_{j = 1}^{k} \sum_{x_i \in c_j} \| x_i - \mu_j \|^2
\f]

where \f$\mu_j\f$ is the centroid of cluster \f$c_j\f$.  The standard algorithm
is a two-step algorithm:

 - \b Assignment \b step.  Each point \f$x_i\f$ in \f$X\f$ is assigned to the
   cluster whose centroid it is closest to.

 - \b Update \b step.  Using the new cluster assignments, the centroids of each
   cluster are recalculated.

The algorithm has converged when no more assignment changes are happening with
each iteration.  However, this algorithm can get stuck in local minima of the
objective function and is particularly sensitive to the initial cluster
assignments.  Also, situations can arise where the algorithm will never converge
but reaches steady state -- for instance, one point may be changing between two
cluster assignments.

There is vast literature on the k-means algorithm and its uses, as well as
strategies for choosing initial points effectively and keeping the algorithm
from converging in local minima.  \b mlpack does implement some of these,
notably the Bradley-Fayyad algorithm (see the reference below) for choosing
refined initial points.  Importantly, the C++ \c KMeans class makes it very easy
to improve the k-means algorithm in a modular way.

@code
@inproceedings{bradley1998refining,
  title={Refining initial points for k-means clustering},
  author={Bradley, Paul S. and Fayyad, Usama M.},
  booktitle={Proceedings of the Fifteenth International Conference on Machine
      Learning (ICML 1998)},
  volume={66},
  year={1998}
}
@endcode

\b mlpack provides:

 - a \ref cli_kmtut "simple command-line executable" to run k-means
 - a \ref kmeans_kmtut "simple C++ interface" to run k-means
 - a \ref kmeans_template_kmtut "generic, extensible, and powerful C++ class"
   for complex usage

@section toc_kmtut Table of Contents

A list of all the sections this tutorial contains.

 - \ref intro_kmtut
 - \ref toc_kmtut
 - \ref cli_kmtut
   - \ref cli_ex1_kmtut 
   - \ref cli_ex2_kmtut 
   - \ref cli_ex3_kmtut 
   - \ref cli_ex4_kmtut 
   - \ref cli_ex5_kmtut 
   - \ref cli_ex6_kmtut 
 - \ref kmeans_kmtut
   - \ref kmeans_ex1_kmtut 
   - \ref kmeans_ex2_kmtut 
   - \ref kmeans_ex3_kmtut 
   - \ref kmeans_ex4_kmtut 
   - \ref kmeans_ex5_kmtut 
   - \ref kmeans_ex6_kmtut 
   - \ref kmeans_ex7_kmtut 
 - \ref kmeans_template_kmtut
   - \ref kmeans_metric_kmtut 
   - \ref kmeans_initial_partition_kmtut 
   - \ref kmeans_empty_cluster_kmtut 
 - \ref further_doc_kmtut

@section cli_kmtut Command-Line 'kmeans'

\b mlpack provides a command-line executable, \c kmeans, to allow easy execution
of the k-means algorithm on data.  Complete documentation of the executable can
be found by typing

@code
$ kmeans --help
@endcode

Below are several examples demonstrating simple use of the \c kmeans executable.

@subsection cli_ex1_kmtut Simple k-means clustering

We want to find 5 clusters using the points in the file dataset.csv.  By
default, if any of the clusters end up empty, that cluster will be reinitialized
to contain the point furthest from the cluster with maximum variance.  The
cluster assignments of each point will be stored in assignments.csv.  Each row
in assignments.csv will correspond to the row in dataset.csv.

@code
$ kmeans -c 5 -i dataset.csv -v -o assignments.csv
@endcode

@subsection cli_ex2_kmtut Saving the resulting centroids

Sometimes it is useful to save the centroids of the clusters found by k-means;
one example might be for plotting the points.  The \c -C (\c --centroid_file)
option allows specification of a file into which the centroids will be saved
(one centroid per line, if it is a CSV or other text format).

@code
$ kmeans -c 5 -i dataset.csv -v -o assignments.csv -C centroids.csv
@endcode

@subsection cli_ex3_kmtut Allowing empty clusters

If you would like to allow empty clusters to exist, instead of reinitializing
them, simply specify the \c -e (\c --allow_empty_clusters) option.  Note that
when you save your clusters, some of the clusters may be filled with NaNs.  This
is expected behavior -- if a cluster has no points, the concept of a centroid
makes no sense.

@code
$ kmeans -c 5 -i dataset.csv -v -o assignments.csv -C centroids.csv
@endcode

@subsection cli_ex4_kmtut Limiting the maximum number of iterations

As mentioned earlier, the k-means algorithm can often fail to converge.  In such
a situation, it may be useful to stop the algorithm by way of limiting the
maximum number of iterations.  This can be done with the \c -m (\c
--max_iterations) parameter, which is set to 1000 by default.  If the maximum
number of iterations is 0, the algorithm will run until convergence -- or
potentially forever.  The example below sets a maximum of 250 iterations.

@code
$ kmeans -c 5 -i dataset.csv -v -o assignments.csv -m 250
@endcode

@subsection cli_ex5_kmtut Setting the overclustering factor

The \b mlpack k-means implementation allows "overclustering", which is when the
k-means algorithm is run with more than the requested number of clusters.  Upon
convergence, the clusters with the nearest centroids are merged until only the
requested number of centroids remain.  This can provide better clustering
results.  The overclustering factor, specified with \c -O or \c
--overclustering, determines how many more clusters are found than were
requested.  For instance, with \c k set to 5 and an overclustering factor of 2,
10 clusters will be found.  Note that the overclustering factor does not need to
be an integer.

The following code snippet finds 5 clusters, but with an overclustering factor
of 2.4 (so 12 clusters are found and then merged together to produce 5 final
clusters).

@code
$ kmeans -c 5 -O 2.4 -i dataset.csv -v -o assignments.csv
@endcode

@subsection cli_ex6_kmtut Using Bradley-Fayyad "refined start"

The method proposed by Bradley and Fayyad in their paper "Refining initial
points for k-means clustering" is implemented in \b mlpack.  This strategy
samples points from the dataset and runs k-means clustering on those points
multiple times, saving the resulting clusters.  Then, k-means clustering is run
on those clusters, yielding the original number of clusters.  The centroids of
those resulting clusters are used as initial centroids for k-means clustering on
the entire dataset.

This technique generally gives better initial points than the default random
partitioning, but depending on the parameters, it can take much longer.  This
initialization technique is enabled with the \c -r (\c --refined_start) option.
The \c -S (\c --samplings) parameter controls how many samplings of the dataset
are performed, and the \c -p (\c --percentage) parameter controls how much of
the dataset is randomly sampled for each sampling (it must be between 0.0 and
1.0).  For more information on the refined start technique, see the paper
referenced in the introduction of this tutorial.

The example below performs k-means clustering, giving 5 clusters, using the
refined start technique, sampling 10% of the dataset 25 times to produce the
initial centroids.

@code
$ kmeans -c 5 -i dataset.csv -v -o assignments.csv -r -S 25 -p 0.2
@endcode

@section kmeans_kmtut The 'KMeans' class

The \c KMeans<> class (with default template parameters) provides a simple way
to run k-means clustering using \b mlpack in C++.  The default template
parameters for \c KMeans<> will initialize cluster assignments randomly and
disallow empty clusters.  When an empty cluster is encountered, the point
furthest from the cluster with maximum variance is set to the centroid of the
empty cluster.

@subsection kmeans_ex1_kmtut Running k-means and getting cluster assignments

The simplest way to use the \c KMeans<> class is to pass in a dataset and a
number of clusters, and receive the cluster assignments in return.  Note that
the dataset must be column-major -- that is, one column corresponds to one
point.  See \ref matrices "the matrices guide" for more information.

@code
#include <mlpack/methods/kmeans/kmeans.hpp>

using namespace mlpack::kmeans;

// The dataset we are clustering.
extern arma::mat data;
// The number of clusters we are getting.
extern size_t clusters;

// The assignments will be stored in this vector.
arma::Col<size_t> assignments;

// Initialize with the default arguments.
KMeans<> k;
k.Cluster(data, clusters, assignments);
@endcode

Now, the vector \c assignments holds the cluster assignments of each point in
the dataset.

@subsection kmeans_ex2_kmtut Running k-means and getting centroids of clusters

Often it is useful to not only have the cluster assignments, but the centroids
of each cluster.  Another overload of \c Cluster() makes this easily possible:

@code
#include <mlpack/methods/kmeans/kmeans.hpp>

using namespace mlpack::kmeans;

// The dataset we are clustering.
extern arma::mat data;
// The number of clusters we are getting.
extern size_t clusters;

// The assignments will be stored in this vector.
arma::Col<size_t> assignments;
// The centroids will be stored in this matrix.
arma::mat centroids;

// Initialize with the default arguments.
KMeans<> k;
k.Cluster(data, clusters, assignments, centroids);
@endcode

Note that the centroids matrix has columns equal to the number of clusters and
rows equal to the dimensionality of the dataset.  Each column represents the
centroid of the according cluster -- \c centroids.col(0) represents the
centroid of the first cluster.

@subsection kmeans_ex3_kmtut Limiting the maximum number of iterations

The first argument to the constructor allows specification of the maximum number
of iterations.  This is useful because often, the k-means algorithm does not
converge, and is terminated after a number of iterations.  Setting this
parameter to 0 indicates that the algorithm will run until convergence -- note
that in some cases, convergence may never happen.  The default maximum number of
iterations is 1000.

@code
// The first argument is the maximum number of iterations.  Here we set it to
// 500 iterations.
KMeans<> k(500);
@endcode

Then you can run \c Cluster() as normal.

@subsection kmeans_ex4_kmtut Setting the overclustering factor

For a description of what overclustering is, see
@ref cli_ex5_kmtut "the command-line interface tutorial about overclustering".

The overclustering factor, which by default is 1.0 (this indicates that no
overclustering is happening), is specified in the second argument to the
constructor.

@code
// We will keep the default maximum iterations of 1000, but set the
// overclustering factor to 2.5.
KMeans<> k(1000, 2.5);
@endcode

Then you can run \c Cluster() as normal.

@subsection kmeans_ex5_kmtut Setting initial cluster assignments

If you have an initial guess for the cluster assignments for each point, you can
fill the assignments vector with the guess and then pass an extra boolean
(initialAssignmentGuess) as true to the \c Cluster() method.  Below are examples
for either overload of \c Cluster().

@code
#include <mlpack/methods/kmeans/kmeans.hpp>

using namespace mlpack::kmeans;

// The dataset we are clustering on.
extern arma::mat dataset;
// The number of clusters we are obtaining.
extern size_t clusters;

// A vector pre-filled with initial assignment guesses.
extern arma::Col<size_t> assignments;

KMeans<> k;

// The boolean set to true indicates that our assignments vector is filled with
// initial guesses.
k.Cluster(dataset, clusters, assignments, true);
@endcode

@code
#include <mlpack/methods/kmeans/kmeans.hpp>

using namespace mlpack::kmeans;

// The dataset we are clustering on.
extern arma::mat dataset;
// The number of clusters we are obtaining.
extern size_t clusters;

// A vector pre-filled with initial assignment guesses.
extern arma::Col<size_t> assignments;

// This will hold the centroids of the finished clusters.
arma::mat centroids;

KMeans<> k;

// The boolean set to true indicates that our assignments vector is filled with
// initial guesses.
k.Cluster(dataset, clusters, assignments, centroids, true);
@endcode

@note
If you have a heuristic or algorithm which makes initial guesses, a
more elegant solution is to create a new class fulfilling the
InitialPartitionPolicy template policy.  See \ref kmeans_initial_partition_kmtut
"the section about changing the initial partitioning strategy" for more details.

@par

@note
If you set the InitialPartitionPolicy parameter to something other than the
default but give an initial cluster assignment guess, the InitialPartitionPolicy
will not be used to initialize the algorithm.  See \ref kmeans_initial_partition_kmtut
"the section about changing the initial partitioning strategy"
for more details.

@subsection kmeans_ex6_kmtut Setting initial cluster centroids

An equally important option to being able to make initial cluster assignment
guesses is to make initial cluster centroid guesses without having to assign
each point in the dataset to an initial cluster.  This is similar to the
previous section, but now you must pass two extra booleans -- the first
(initialAssignmentGuess) as false, indicating that there are not initial cluster
assignment guesses, and the second (initialCentroidGuess) as true, indicating
that the centroids matrix is filled with initial centroid guesses.

This, of course, only works with the overload of \c Cluster() that takes a
matrix to put the resulting centroids in.  Below is an example.

@code
#include <mlpack/methods/kmeans/kmeans.hpp>

using namespace mlpack::kmeans;

// The dataset we are clustering on.
extern arma::mat dataset;
// The number of clusters we are obtaining.
extern size_t clusters;

// A matrix pre-filled with guesses for the initial cluster centroids.
extern arma::mat centroids;

// This will be filled with the final cluster assignments for each point.
arma::Col<size_t> assignments;

KMeans<> k;

// Remember, the first boolean indicates that we are not giving initial
// assignment guesses, and the second boolean indicates that we are giving
// initial centroid guesses.
k.Cluster(dataset, clusters, assignments, centroids, false, true);
@endcode

@note
If you have a heuristic or algorithm which makes initial guesses, a
more elegant solution is to create a new class fulfilling the
InitialPartitionPolicy template policy.  See \ref kmeans_initial_partition_kmtut
"the section about changing the initial partitioning strategy" for more details.

@par

@note
If you set the InitialPartitionPolicy parameter to something other than the
default but give an initial cluster centroid guess, the InitialPartitionPolicy
will not be used to initialize the algorithm.  See \ref kmeans_initial_partition_kmtut
"the section about changing the initial partitioning strategy" for more details.

@subsection kmeans_ex7_kmtut Running sparse k-means

The \c Cluster() function can work on both sparse and dense matrices, so all of
the above examples can be used with sparse matrices instead.  Below is a simple
example.  Note that the centroids are returned as a sparse matrix also.

@code
// The sparse dataset.
extern arma::sp_mat sparseDataset;
// The number of clusters.
extern size_t clusters;

// The assignments will be stored in this vector.
arma::Col<size_t> assignments;
// The centroids of each cluster will be stored in this sparse matrix.
arma::sp_mat sparseCentroids;

// No template parameter modification is necessary.
KMeans<> k;
k.Cluster(sparseDataset, clusters, assignments, sparseCentroids);
@endcode

@section kmeans_template_kmtut Template parameters for the 'KMeans' class

The \c KMeans<> class also takes three template parameters, which can be
modified to change the behavior of the k-means algorithm.  There are three
template parameters:

 - \c MetricType: controls the distance metric used for clustering (by
   default, the squared Euclidean distance is used)
 - \c InitialPartitionPolicy: the method by which initial clusters are set; by
   default, \ref mlpack::kmeans::RandomPartition "RandomPartition" is used
 - \c EmptyClusterPolicy: the action taken when an empty cluster is encountered;
   by default, \ref mlpack::kmeans::MaxVarianceNewCluster "MaxVarianceNewCluster"
   is used

The class is defined like below:

@code
template<
  typename DistanceMetric = mlpack::metric::SquaredEuclideanDistance,
  typename InitialPartitionPolicy = RandomPartition,
  typename EmptyClusterPolicy = MaxVarianceNewCluster
>
class KMeans;
@endcode

In the following sections, each policy is described further, with examples of
how to modify them.

@subsection kmeans_metric_kmtut Changing the distance metric used for k-means

Most machine learning algorithms in \b mlpack support modifying the distance
metric, and \c KMeans<> is no exception.  Similar to \ref
mlpack::neighbor::NeighborSearch "NeighborSearch" (see \ref
metric_type_doc_nstut "the section in the NeighborSearch tutorial"), any class
in mlpack::metric can be given as an argument.  The mlpack::metric::LMetric
class is a good example implementation.

A class fulfilling the MetricType policy must provide the following two
functions:

@code
// Empty constructor is required.
MetricType();

// Computer the distance between two points.
template<typename VecType>
double Evaluate(const VecType& a, const VecType& b);
@endcode

Most of the standard metrics that could be used are stateless and therefore the
\c Evaluate() method is implemented statically.  However, there are metrics,
such as the Mahalanobis distance (mlpack::metric::MahalanobisDistance), that
store state.  To this end, an instantiated MetricType object is stored within the
\c KMeans class.  The example below shows how to pass an instantiated
MahalanobisDistance in the constructor.

@code
// The initialized Mahalanobis distance.
extern mlpack::metric::MahalanobisDistance distance;

// We keep the default arguments for the maximum number of iterations and
// overclustering factor, but pass our instantiated metric.
KMeans<mlpack::metric::MahalanobisDistance> k(1000, 1.0, distance);
@endcode

@note
While the MetricType policy only requires two methods, one of which is an empty
constructor, more can always be added.  mlpack::metric::MahalanobisDistance also
has constructors with parameters, because it is a stateful metric.

@subsection kmeans_initial_partition_kmtut Changing the initial partitioning strategy used for k-means

There have been many initial cluster strategies for k-means proposed in the
literature.  Fortunately, the \c KMeans<> class makes it very easy to implement
one of these methods and plug it in without needing to modify the existing
algorithm code at all.

By default, the \c KMeans<> class uses mlpack::kmeans::RandomPartition, which
randomly partitions points into clusters.  However, writing a new policy is
simple; it needs to only implement the following functions:

@code
// Empty constructor is required.
InitialPartitionPolicy();

// This function is called to initialize the clusters.
template<typename MatType>
void Cluster(MatType& data,
             const size_t clusters,
             arma::Col<size_t> assignments);
@endcode

The templatization of the \c Cluster() function allows both dense and sparse
matrices to be passed in.  If the desired policy does not work with sparse (or
dense) matrices, then the method can be written specifically for one type of
matrix -- however, be warned that if you try to use \c KMeans with that policy
and the wrong type of matrix, you will get many ugly compilation errors!

@code
// The Cluster() function specialized for dense matrices.
void Cluster(arma::mat& data,
             const size_t clusters,
             arma::Col<size_t> assignments);
@endcode

One alternate to the default RandomPartition policy is the RefinedStart policy,
which is an implementation of the Bradley and Fayyad approach for finding
initial points detailed in "Refined initial points for k-means clustering" and
other places in this document.  Also see the documentation for
mlpack::kmeans::RefinedStart for more information.

The \c Cluster() method must return valid initial assignments for every point in
the dataset.

As with the MetricType template parameter, an initialized InitialPartitionPolicy
can be passed to the constructor of \c KMeans as a fourth argument.

@subsection kmeans_empty_cluster_kmtut Changing the action taken when an empty cluster is encountered

Sometimes, during clustering, a situation will arise where a cluster has no
points in it.  The \c KMeans class allows easy customization of the action to be
taken when this occurs.  By default, the point furthest from the centroid of the
cluster with maximum variance is taken as the centroid of the empty cluster;
this is implemented in the mlpack::kmeans::MaxVarianceNewCluster class.  Another
alternate choice is the mlpack::kmeans::AllowEmptyClusters class, which simply
allows empty clusters to persist.

A custom policy can be written and it must implement the following methods:

@code
// Empty constructor is required.
EmptyClusterPolicy();

// This function is called when an empty cluster is encountered.  emptyCluster
// indicates the cluster which is empty, and then the clusterCounts and
// assignments are meant to be modified by the function.  The function should
// return the number of modified points.
template<typename MatType>
size_t EmptyCluster(const MatType& data,
                    const size_t emptyCluster,
                    const MatType& centroids,
                    arma::Col<size_t>& clusterCounts,
                    arma::Col<size_t>& assignments);
@endcode

The \c EmptyCluster() function is called for each cluster that is empty at each
iteration of the algorithm.  As with InitialPartitionPolicy, the \c
EmptyCluster() function does not need to be generalized to support both dense
and sparse matrices -- but usage with the wrong type of matrix will cause
compilation errors.

Like the other template parameters to \c KMeans, EmptyClusterPolicy
implementations that have state can be passed to the constructor of \c KMeans as
a fifth argument.  See the kmeans::KMeans documentation for further details.

@section further_doc_kmtut Further documentation

For further documentation on the KMeans class, consult the \ref
mlpack::kmeans::KMeans "complete API documentation".

*/
