idl.tmt.clusterers
Class EnhancedEM

java.lang.Object
  |
  +--weka.clusterers.Clusterer
        |
        +--weka.clusterers.DensityBasedClusterer
              |
              +--idl.tmt.clusterers.EnhancedEM
All Implemented Interfaces:
java.lang.Cloneable, NumberOfClustersRequestable, OptionHandler, java.io.Serializable, WeightedInstancesHandler

public class EnhancedEM
extends DensityBasedClusterer
implements NumberOfClustersRequestable, OptionHandler, WeightedInstancesHandler

Simple EM (expectation maximisation) class.

EM assigns a probability distribution to each instance, which indicates the probability of it belonging to each of the clusters. EM can decide how many clusters to create by cross validation, or you may specify a priori how many clusters to generate.


The cross validation performed to determine the number of clusters is done in the following steps:
1. the number of clusters is set to 1
2. the training set is split randomly into 10 folds.
3. EM is performed 10 times using the 10 folds the usual CV way.
4. the loglikelihood is averaged over all 10 results.
5. if loglikelihood has increased the number of clusters is increased by 1 and the program continues at step 2.

The number of folds is fixed at 10 as long as the training set contains at least 10 instances. If it contains fewer than 10, the number of folds is set equal to the number of instances.
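
The selection loop described above can be sketched as follows. This is a minimal illustration, not the class's actual code: the callback cvLogLikelihood (which stands in for steps 2 to 4) and the safety cap maxClusters are hypothetical names, and keeping the last cluster count that improved the log likelihood is an assumed stopping rule.

```java
import java.util.function.IntToDoubleFunction;

/** Sketch of choosing the number of clusters by cross-validated log likelihood. */
class CvClusterCountSketch {

    /** Folds are fixed at 10 unless there are fewer than 10 training instances. */
    static int numFolds(int numInstances) {
        return Math.min(10, numInstances);
    }

    /**
     * cvLogLikelihood is a hypothetical callback: given a cluster count k, it
     * runs EM over the folds and returns the averaged log likelihood.
     * maxClusters is a hypothetical cap so the loop always terminates.
     */
    static int selectNumClusters(IntToDoubleFunction cvLogLikelihood, int maxClusters) {
        int k = 1;                                        // step 1
        double best = cvLogLikelihood.applyAsDouble(k);   // steps 2 to 4
        while (k < maxClusters) {
            double next = cvLogLikelihood.applyAsDouble(k + 1);
            if (next <= best) {
                break;  // no improvement: keep the current k (assumed rule)
            }
            best = next;
            k++;                                          // step 5
        }
        return k;
    }
}
```

A caller would pass a function that builds and evaluates the clusterer for each candidate k; the sketch only captures the control flow.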

Valid options are:

-V
Verbose.

-N
Specify the number of clusters to generate. If omitted, EM will use cross validation to select the number of clusters automatically.

-I
Terminate after this many iterations if EM has not converged.

-S
Specify random number seed.

-M
Set the minimum allowable standard deviation for normal density calculation.
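
As an illustration of how these flags might be laid out in an option array, here is a small standalone sketch. It assumes the usual Weka convention that a flag taking a value is followed by that value as a separate array element; the helper buildOptions is hypothetical and is not part of this class.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: assembling an option array in the flag/value layout listed above. */
class EmOptionsSketch {

    /**
     * Hypothetical helper. A numClusters value of 0 or less omits -N, which
     * leaves the choice of cluster count to cross validation.
     */
    static String[] buildOptions(boolean verbose, int numClusters,
                                 int maxIterations, int seed, double minStdDev) {
        List<String> opts = new ArrayList<>();
        if (verbose) {
            opts.add("-V");
        }
        if (numClusters > 0) {
            opts.add("-N");
            opts.add(Integer.toString(numClusters));
        }
        opts.add("-I");
        opts.add(Integer.toString(maxIterations));
        opts.add("-S");
        opts.add(Integer.toString(seed));
        opts.add("-M");
        opts.add(Double.toString(minStdDev));
        return opts.toArray(new String[0]);
    }
}
```

An array built this way has the shape expected by setOptions(java.lang.String[]).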

Version:
$Revision: 1.4 $
Author:
Mark Hall (mhall@cs.waikato.ac.nz), Eibe Frank (eibe@cs.waikato.ac.nz)
See Also:
Serialized Form

Field Summary
private  EMInitializer m_initializer
          Initializer object responsible for initializing EM model
private  int m_initialNumClusters
          the initial number of clusters requested by the user (-1 if cross validation is to be used to find the number of clusters)
private  double m_loglikely
          the loglikelihood of the data
private  int m_max_clusterers_to_build
          maximum clusterers to build when seeking the best one.
private  int m_max_iterations
          maximum iterations to perform
private  double[] m_maxValues
          attribute max values
private  double m_minStdDev
          default minimum standard deviation
private  double[] m_minValues
          attribute min values
private  double[][][] m_modelNormal
          hold the normal estimators for each cluster
private static double m_normConst
          Constant for normal distribution.
private  int m_num_attribs
          number of attributes
private  int m_num_clusters
          number of clusters selected by the user or cross validation
private  int m_num_instances
          number of training instances
private  double[] m_priors
          the prior probabilities for clusters
private  java.util.Random m_rr
          random numbers and seed
private  int m_rseed
           
private  Instances m_theInstances
          training instances
private  boolean m_verbose
          Verbose?
private  double[][] m_weights
          hold the weights of each instance for each cluster
 
Constructor Summary
EnhancedEM()
          Constructor.
 
Method Summary
 void buildClusterer(Instances data)
          Generates a clusterer.
 double[] clusterPriors()
          Returns the cluster priors.
private  void CVClusters()
          estimate the number of clusters by cross validation on the training data.
private  void doEM()
          Perform the EM algorithm
private  double E(Instances inst, boolean change_weights)
          The E step of the EM algorithm.
private  void EM_Init(Instances inst)
          Initialise estimators and storage.
private  void EM_Report(Instances inst)
          verbose output for debugging
private  void estimate_priors(Instances inst)
          calculate prior probabilities for the clusters
 double[][][] getClusterModelsNumericAtts()
          Return the normal distributions for the cluster models
 double[] getClusterPriors()
          Return the priors for the clusters
 boolean getDebug()
          Get debug mode
 EMInitializer getInitializer()
           
 double getLogLikely()
           
 int getMaxClusterersToBuild()
          Get the maximum number of clusterers to build when seeking the best one.
 int getMaxIterations()
          Get the maximum number of iterations
 double getMinStdDev()
          Get the minimum allowable standard deviation.
 int getNumClusters()
          Get the number of clusters
 java.lang.String[] getOptions()
          Gets the current settings of EM.
 int getSeed()
          Get the random number seed
 java.lang.String globalInfo()
          Returns a string describing this clusterer
private  double iterate(Instances inst, boolean report)
          iterates the E and M steps until the log likelihood of the data converges.
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
 double[] logDensityPerClusterForInstance(Instance inst)
          Computes the log of the conditional density (per cluster) for a given instance.
private  double logNormalDens(double x, double mean, double stdDev)
          Log density function of normal distribution.
private  void M(Instances inst)
          The M step of the EM algorithm.
static void main(java.lang.String[] argv)
          Main method for testing this class.
 java.lang.String maxIterationsTipText()
          Returns the tip text for this property
 java.lang.String minStdDevTipText()
          Returns the tip text for this property
private  void new_estimators()
          New probability estimators for an iteration
 int numberOfClusters()
          Returns the number of clusters.
 java.lang.String numClustersTipText()
          Returns the tip text for this property
protected  void resetOptions()
          Reset to default options
 java.lang.String seedTipText()
          Returns the tip text for this property
 void setDebug(boolean v)
          Set debug mode - verbose output
 void setInitializer(EMInitializer m_initializer)
           
 void setInitializerName(java.lang.String initializerClassName)
           
 void setMaxClusterersToBuild(int i)
          Set the maximum number of clusterers to build when seeking the best one
 void setMaxIterations(int i)
          Set the maximum number of iterations to perform
 void setMinStdDev(double m)
          Set the minimum value for standard deviation when calculating normal density.
 void setNumClusters(int n)
          Set the number of clusters (-1 to select by CV).
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 void setSeed(int s)
          Set the random number seed
 java.lang.String toString()
          Outputs the generated clusters into a string.
private  void updateMinMax(Instance instance)
          Updates the minimum and maximum values for all the attributes based on a new instance.
 
Methods inherited from class weka.clusterers.DensityBasedClusterer
distributionForInstance, logDensityForInstance, logJointDensitiesForInstance
 
Methods inherited from class weka.clusterers.Clusterer
clusterInstance, forName, makeCopies
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, registerNatives, wait, wait, wait
 

Field Detail

m_modelNormal

private double[][][] m_modelNormal
hold the normal estimators for each cluster

m_minStdDev

private double m_minStdDev
default minimum standard deviation

m_weights

private double[][] m_weights
hold the weights of each instance for each cluster

m_priors

private double[] m_priors
the prior probabilities for clusters

m_loglikely

private double m_loglikely
the loglikelihood of the data

m_theInstances

private Instances m_theInstances
training instances

m_num_clusters

private int m_num_clusters
number of clusters selected by the user or cross validation

m_initialNumClusters

private int m_initialNumClusters
the initial number of clusters requested by the user (-1 if cross validation is to be used to find the number of clusters)

m_num_attribs

private int m_num_attribs
number of attributes

m_num_instances

private int m_num_instances
number of training instances

m_max_iterations

private int m_max_iterations
maximum iterations to perform

m_max_clusterers_to_build

private int m_max_clusterers_to_build
maximum clusterers to build when seeking the best one. Defaults to 1, in which case only a single clusterer is built

m_minValues

private double[] m_minValues
attribute min values

m_maxValues

private double[] m_maxValues
attribute max values

m_rr

private java.util.Random m_rr
random numbers and seed

m_rseed

private int m_rseed

m_verbose

private boolean m_verbose
Verbose?

m_initializer

private EMInitializer m_initializer
Initializer object responsible for initializing EM model

m_normConst

private static double m_normConst
Constant for normal distribution.
Constructor Detail

EnhancedEM

public EnhancedEM()
Constructor.
Method Detail

globalInfo

public java.lang.String globalInfo()
Returns a string describing this clusterer
Returns:
a description of the clusterer suitable for displaying in the explorer/experimenter gui

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Valid options are:

-V
Verbose.

-N
Specify the number of clusters to generate. If omitted, EM will use cross validation to select the number of clusters automatically.

-I
Terminate after this many iterations if EM has not converged.

-S
Specify random number seed.

-M
Set the minimum allowable standard deviation for normal density calculation.

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options.

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options.
Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

minStdDevTipText

public java.lang.String minStdDevTipText()
Returns the tip text for this property
Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setMinStdDev

public void setMinStdDev(double m)
Set the minimum value for standard deviation when calculating normal density. Raising this value can help prevent arithmetic overflow resulting from multiplying large densities (arising from small standard deviations) when there are many singleton or near singleton values.
Parameters:
m - minimum value for standard deviation
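
To see how a floor on the standard deviation bounds the density, consider the following sketch. It is a standalone illustration under stated assumptions: the helper names are hypothetical, and the floor is applied with a plain Math.max, which may differ from how this class applies it.

```java
/** Sketch: effect of a minimum standard deviation on products of densities. */
class MinStdDevSketch {

    /** Peak of the normal density (at x == mean) after flooring stdDev. */
    static double peakDensity(double stdDev, double minStdDev) {
        double s = Math.max(stdDev, minStdDev);
        return 1.0 / (Math.sqrt(2.0 * Math.PI) * s);
    }

    /** Product of n equal peak densities, e.g. across n attributes. */
    static double productOfPeaks(int n, double stdDev, double minStdDev) {
        return Math.pow(peakDensity(stdDev, minStdDev), n);
    }
}
```

With a near-singleton attribute (stdDev around 1e-20) and 20 such factors, the product overflows to infinity when no floor is applied, while a floor of 1e-6 keeps it finite.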

getMinStdDev

public double getMinStdDev()
Get the minimum allowable standard deviation.
Returns:
the minimum allowable standard deviation

seedTipText

public java.lang.String seedTipText()
Returns the tip text for this property
Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setSeed

public void setSeed(int s)
Set the random number seed
Parameters:
s - the seed

getSeed

public int getSeed()
Get the random number seed
Returns:
the seed

numClustersTipText

public java.lang.String numClustersTipText()
Returns the tip text for this property
Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setNumClusters

public void setNumClusters(int n)
                    throws java.lang.Exception
Set the number of clusters (-1 to select by CV).
Specified by:
setNumClusters in interface NumberOfClustersRequestable
Parameters:
n - the number of clusters
Throws:
java.lang.Exception - if n is 0

getNumClusters

public int getNumClusters()
Get the number of clusters
Returns:
the number of clusters.

maxIterationsTipText

public java.lang.String maxIterationsTipText()
Returns the tip text for this property
Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setMaxIterations

public void setMaxIterations(int i)
                      throws java.lang.Exception
Set the maximum number of iterations to perform
Parameters:
i - the number of iterations
Throws:
java.lang.Exception - if i is less than 1

getMaxIterations

public int getMaxIterations()
Get the maximum number of iterations
Returns:
the number of iterations

setMaxClusterersToBuild

public void setMaxClusterersToBuild(int i)
                             throws java.lang.Exception
Set the maximum number of clusterers to build when seeking the best one
Parameters:
i - the number of clusterers to build
Throws:
java.lang.Exception - if i is less than 1

getMaxClusterersToBuild

public int getMaxClusterersToBuild()
Get the maximum number of clusterers to build when seeking the best one.
Returns:
the number of clusterers to build

setDebug

public void setDebug(boolean v)
Set debug mode - verbose output
Parameters:
v - true for verbose output

getDebug

public boolean getDebug()
Get debug mode
Returns:
true if debug mode is set

getOptions

public java.lang.String[] getOptions()
Gets the current settings of EM.
Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions()

EM_Init

private void EM_Init(Instances inst)
              throws java.lang.Exception
Initialise estimators and storage.
Parameters:
inst - the instances

estimate_priors

private void estimate_priors(Instances inst)
                      throws java.lang.Exception
calculate prior probabilities for the clusters
Parameters:
inst - the instances
Throws:
java.lang.Exception - if priors can't be calculated
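
One common way to estimate cluster priors in EM is to sum each cluster's membership weights and normalise. The sketch below assumes a weights[instance][cluster] layout, matching the description of m_weights, and is not taken from this class's source.

```java
/** Sketch: cluster priors as normalised sums of membership weights. */
class PriorsSketch {

    /** weights[i][c] is the membership weight of instance i in cluster c. */
    static double[] estimatePriors(double[][] weights, int numClusters) {
        double[] priors = new double[numClusters];
        double total = 0.0;
        for (double[] row : weights) {
            for (int c = 0; c < numClusters; c++) {
                priors[c] += row[c];
                total += row[c];
            }
        }
        if (total <= 0.0) {
            // Mirrors the documented failure case: priors cannot be calculated.
            throw new IllegalArgumentException("weights sum to zero");
        }
        for (int c = 0; c < numClusters; c++) {
            priors[c] /= total;
        }
        return priors;
    }
}
```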

logNormalDens

private double logNormalDens(double x,
                             double mean,
                             double stdDev)
Log density function of normal distribution.
Parameters:
x - input value
mean - mean of distribution
stdDev - standard deviation of distribution
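
For reference, the log of the normal density is -(x - mean)^2 / (2 * stdDev^2) - log(stdDev) - log(sqrt(2 * pi)). A minimal sketch of that formula follows; the constant NORM_CONST is an assumption about what m_normConst holds, not something this page confirms.

```java
/** Sketch: log density of a normal distribution. */
class LogNormalDensSketch {

    /** log(sqrt(2 * pi)); assumed role of the normalising constant. */
    static final double NORM_CONST = Math.log(Math.sqrt(2.0 * Math.PI));

    static double logNormalDens(double x, double mean, double stdDev) {
        double diff = x - mean;
        return -(diff * diff) / (2.0 * stdDev * stdDev)
                - NORM_CONST
                - Math.log(stdDev);
    }
}
```

Working in log space turns products of densities into sums, which avoids the overflow and underflow that raw densities are prone to.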

new_estimators

private void new_estimators()
New probability estimators for an iteration
Parameters:
num_cl - the number of clusters

M

private void M(Instances inst)
        throws java.lang.Exception
The M step of the EM algorithm.
Parameters:
inst - the training instances
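
In a Gaussian mixture, the M step typically re-estimates each cluster's mean and standard deviation from the membership weights. The sketch below handles one attribute for one cluster; the helper name and the use of a Math.max floor are assumptions, not this class's code.

```java
/** Sketch: weighted mean and standard deviation for one cluster and attribute. */
class MStepSketch {

    /**
     * values[i] is the attribute value of instance i and weights[i] is that
     * instance's membership weight in the cluster. Returns {mean, stdDev}.
     */
    static double[] weightedNormal(double[] values, double[] weights, double minStdDev) {
        double sumW = 0.0;
        double sumWX = 0.0;
        for (int i = 0; i < values.length; i++) {
            sumW += weights[i];
            sumWX += weights[i] * values[i];
        }
        if (sumW <= 0.0) {
            throw new IllegalArgumentException("cluster has no weight");
        }
        double mean = sumWX / sumW;
        double sumSq = 0.0;
        for (int i = 0; i < values.length; i++) {
            double d = values[i] - mean;
            sumSq += weights[i] * d * d;
        }
        double stdDev = Math.sqrt(sumSq / sumW);
        // Floor the estimate so later density calculations stay bounded.
        return new double[] { mean, Math.max(stdDev, minStdDev) };
    }
}
```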

E

private double E(Instances inst,
                 boolean change_weights)
          throws java.lang.Exception
The E step of the EM algorithm. Estimate cluster membership probabilities.
Parameters:
inst - the training instances
Returns:
the average log likelihood
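
A minimal sketch of an E step, assuming the per-cluster log densities have already been computed. It combines them with the priors, normalises with the log-sum-exp trick, and returns the average log likelihood. Instance weights are ignored here, and the array layouts are assumptions.

```java
/** Sketch: E step over precomputed per-cluster log densities. */
class EStepSketch {

    /**
     * logDens[i][c] is the log density of instance i under cluster c.
     * When changeWeights is true, weights[i][c] receives the membership
     * probability of instance i in cluster c.
     */
    static double eStep(double[][] logDens, double[] priors,
                        double[][] weights, boolean changeWeights) {
        int k = priors.length;
        double total = 0.0;
        for (int i = 0; i < logDens.length; i++) {
            double[] logJoint = new double[k];
            double max = Double.NEGATIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                logJoint[c] = Math.log(priors[c]) + logDens[i][c];
                max = Math.max(max, logJoint[c]);
            }
            // log-sum-exp: subtract the maximum before exponentiating.
            double sum = 0.0;
            for (int c = 0; c < k; c++) {
                sum += Math.exp(logJoint[c] - max);
            }
            double logLik = max + Math.log(sum);
            if (changeWeights) {
                for (int c = 0; c < k; c++) {
                    weights[i][c] = Math.exp(logJoint[c] - logLik);
                }
            }
            total += logLik;
        }
        return total / logDens.length;
    }
}
```

Passing changeWeights as false evaluates the log likelihood without touching the weights, which is useful when scoring held-out data.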

resetOptions

protected void resetOptions()
Reset to default options

getClusterModelsNumericAtts

public double[][][] getClusterModelsNumericAtts()
Return the normal distributions for the cluster models
Returns:
a double[][][] value

getClusterPriors

public double[] getClusterPriors()
Return the priors for the clusters
Returns:
a double[] value

toString

public java.lang.String toString()
Outputs the generated clusters into a string.
Overrides:
toString in class java.lang.Object

EM_Report

private void EM_Report(Instances inst)
verbose output for debugging
Parameters:
inst - the training instances

CVClusters

private void CVClusters()
                 throws java.lang.Exception
estimate the number of clusters by cross validation on the training data.

numberOfClusters

public int numberOfClusters()
                     throws java.lang.Exception
Returns the number of clusters.
Overrides:
numberOfClusters in class Clusterer
Returns:
the number of clusters generated for a training dataset.
Throws:
java.lang.Exception - if number of clusters could not be returned successfully

updateMinMax

private void updateMinMax(Instance instance)
Updates the minimum and maximum values for all the attributes based on a new instance.
Parameters:
instance - the new instance
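
A sketch of an incremental min/max update is shown below. It represents an instance as a plain double[] and treats NaN both as a missing value and as the "not yet seen" marker in the min/max arrays; both conventions are assumptions made for illustration.

```java
/** Sketch: widening per-attribute min/max ranges with a new instance. */
class MinMaxSketch {

    static void updateMinMax(double[] instance, double[] minValues, double[] maxValues) {
        for (int j = 0; j < instance.length; j++) {
            double v = instance[j];
            if (Double.isNaN(v)) {
                continue;               // missing value: leave the range alone
            }
            if (Double.isNaN(minValues[j])) {
                minValues[j] = v;       // first observed value for attribute j
                maxValues[j] = v;
            } else {
                if (v < minValues[j]) {
                    minValues[j] = v;
                }
                if (v > maxValues[j]) {
                    maxValues[j] = v;
                }
            }
        }
    }
}
```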

buildClusterer

public void buildClusterer(Instances data)
                    throws java.lang.Exception
Generates a clusterer. Has to initialize all fields of the clusterer that are not being set via options.
Overrides:
buildClusterer in class Clusterer
Parameters:
data - set of instances serving as training data
Throws:
java.lang.Exception - if the clusterer has not been generated successfully

clusterPriors

public double[] clusterPriors()
Returns the cluster priors.
Overrides:
clusterPriors in class DensityBasedClusterer

logDensityPerClusterForInstance

public double[] logDensityPerClusterForInstance(Instance inst)
                                         throws java.lang.Exception
Computes the log of the conditional density (per cluster) for a given instance.
Overrides:
logDensityPerClusterForInstance in class DensityBasedClusterer
Parameters:
inst - the instance to compute the density for
Returns:
the log density of the instance for each cluster.
Throws:
java.lang.Exception - if the density could not be computed successfully
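
The per-cluster log density can be sketched as a sum of per-attribute normal log densities. This assumes all attributes are numeric and independent within a cluster, and that the model array is laid out as model[cluster][attribute] = {mean, stdDev}; the actual layout of m_modelNormal is not documented on this page.

```java
/** Sketch: log density of an instance under each cluster's normal model. */
class PerClusterLogDensitySketch {

    /** model[c][j] = {mean, stdDev} for cluster c and attribute j (assumed layout). */
    static double[] logDensityPerCluster(double[] instance, double[][][] model) {
        double halfLog2Pi = 0.5 * Math.log(2.0 * Math.PI);
        double[] out = new double[model.length];
        for (int c = 0; c < model.length; c++) {
            for (int j = 0; j < instance.length; j++) {
                if (Double.isNaN(instance[j])) {
                    continue;           // NaN marks a missing value (assumption)
                }
                double mean = model[c][j][0];
                double stdDev = model[c][j][1];
                double diff = instance[j] - mean;
                out[c] += -(diff * diff) / (2.0 * stdDev * stdDev)
                        - Math.log(stdDev) - halfLog2Pi;
            }
        }
        return out;
    }
}
```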

doEM

private void doEM()
           throws java.lang.Exception
Perform the EM algorithm

iterate

private double iterate(Instances inst,
                       boolean report)
                throws java.lang.Exception
iterates the E and M steps until the log likelihood of the data converges.
Parameters:
inst - the training instances.
num_cl - the number of clusters.
report - be verbose.
Returns:
the log likelihood of the data
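
The convergence loop can be sketched as follows. The callback emStep (one E step plus one M step, returning the current log likelihood) and the tolerance parameter are hypothetical; the actual convergence test used by this class is not documented here.

```java
import java.util.function.DoubleSupplier;

/** Sketch: repeat EM steps until the log likelihood stops improving. */
class IterateSketch {

    /**
     * emStep is a hypothetical callback that performs one E step and one
     * M step and returns the resulting log likelihood.
     */
    static double iterate(DoubleSupplier emStep, int maxIterations, double tolerance) {
        double previous = Double.NEGATIVE_INFINITY;
        double current = previous;
        for (int i = 0; i < maxIterations; i++) {
            current = emStep.getAsDouble();
            // Converged when the improvement over the last pass is negligible.
            if (i > 0 && current - previous < tolerance) {
                break;
            }
            previous = current;
        }
        return current;
    }
}
```

The maxIterations bound plays the role of the -I option: the loop stops there even if the log likelihood has not converged.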

main

public static void main(java.lang.String[] argv)
Main method for testing this class.
Parameters:
argv - should contain the following arguments:

-t training file [-T test file] [-N number of clusters] [-S random seed]


getLogLikely

public double getLogLikely()

setInitializer

public void setInitializer(EMInitializer m_initializer)

setInitializerName

public void setInitializerName(java.lang.String initializerClassName)
                        throws java.lang.Exception

getInitializer

public EMInitializer getInitializer()