Clustering Program

Introduction
Usage
History
File Formats

There is a tutorial on how to use XCluster.

Introduction:

XCluster grew out of the desire to make clustering software that was far less memory intensive, faster, and smarter when joining two nodes together, such that most similar outermost expression patterns of said nodes are placed next to each other.

This software, and the underlying source, are freely available at cluster.tar.gz.

Please email if you have any questions/feature requests etc.

Most of the files that are output by the clustering program are readable by TreeView (.cdt, .gtr, .atr), which allows you to browse clusters. TreeView is available free to academic users here. Commercial users should direct enquiries to Jessica Smith. Some additional files created by XCluster, eg in generating SOMs, and k-means clustering are for informational purposes. The overall SOM and k-means structures are not viewable in TreeView, but the individual clusters, which comprise .cdt, .gtr, and .atr files are.

Usage:

Mac or PC

Simply double-click the application, and answer the questions as they come up. Your input file must conform to the expected file format, described below.

UNIX

You can either simply type cluster, and answer the questions as they come up, or use the following command line arguments:


 * -f filename
 * -g 0|1|2 ; 0 indicates no gene clustering, 1 indicates non-centered metric, 2 indicates centered metric when clustering genes.  1 is default.
 * -e 0|1|2 ; 0 indicates no experiment clustering, see above for 1 and 2.  0 is the default.
 * -p 0|1 ; whether to use pearson correlation (1), or Euclidean distance (0).  1 is the default.
 * -s 0|1 ; whether to make a SOM (1). 0 is the default. 
 * -x specify x dimension of SOM
 * -y specify y dimension of SOM
 * -r 0|1 ; whether to seed the random number generator with the time when making a SOM.  1 is the default.
 * -k num ; how many k-means clusters to make.  num is an integer, greater than 1, and preferably less than a gazillion. 
 * -l 0|1 ; whether to log transform the data.
 * -u string ; a unique identifier by which to name the output files, instead of basing their names on the input file.  eg -u 888 will generate 888.cdt as an output file.


History:

9-21-99

- Finished version 2.51
  Minor changes to code, to streamline parsing of user input.
  
- Made it so that genes don't have to be clustered.

7-19-99

- Finished version 2.5
  Has the ability to generate k-means clusters.

- Added in code to log transform the data (in base 2) in case it
  wasn't already log transformed.

- Optimized SOM code somewhat.

- Changed parsing code so that it will compile and run unmodified on
  all platforms.

5-13-99

- Finished Version 2.
  Has the ability to generate Self Organizing Maps.

- The Input data is filtered when making a SOM, such that all
  expression profiles that do not vary at least 2 fold from their max
  to min are removed.  These are printed out into a file with a .fil
  extension

- The user has the ability to decide whether to not seed the random
  number generator with the current time.  If they want to compare
  SOMs with and without clustering by experiment, this is useful,
  because you can guarantee the same organization if you don't seed
  the random number generator.
 
- Fixed bug where using relative pathways gave the wrong output
  filenames.  Now can use:

  1. /home/users/sherlock/sample.txt  OR,
  2. ../../sample.txt

  equally well.

- Did some optimisations to the CalculateCorrelation function to speed
  it up.

5-6-99

- Added in the ability to choose between using Euclidean distance, or
  pearson correlation for use as a similarity metric.  Can be set via
  the command line option -p, where 0 means use Euclidean, and 1 means
  use pearson.  1 is the default.

5-1-99

- Found stupid and nasty bug such that even though the code was
  correctly switching the node orders to bring together similar
  genes/experiments, these switches were not reflected in the tree
  files.  This bug was fixed, and the fix tested, and it's now better!

4-16-99

- Was storing each correlation in two places, which Paul pointed out
  was unnecessary, so now only store it in one place.  Makes it
  slightly faster.

4-12-99

- Fixed a couple of small bugs eg. forgetting to malloc the extra
  byte for the '\0' in ** all ** the strings!

- made the code somewhat more readable, since:

  *(darray+x*y) is equivalent to darray[x*y]

  the second version is a lot more readable I think.

- Made linked list implementation more efficient, such that when
  a correlation is to be inserted into a full list, then the memory
  of the lowest correlationRec is reused, and the pointers manipulated
  accordingly.  Saves a call to malloc and free.

- Added the ability to take command line arguments, so can make automatable:

  Usage: cluster -f filename /* filename (duh!) */
                 -g 1|2      /* genes: 2=use centered metric
                                       1 is default */
                 -e 0|1|2    /* experiments: 0=don't cluster
                                             1=cluster         
                                             2=use centered metric
                                             0 is default */ 

  eg. cluster -f sample.txt -g 2 -e 2

      cluster -f sample.txt

      cluster -e 2 -f sample.txt  ...you get the idea

  If you supply any command line arguments, you must supply the filename.

4-12-99

The clustering program is implemented in C.  To use resources more
efficiently it only stores the 13 highest correlations for any
particular gene (I played around with this number on a set of ~6200
genes by 80 experiments, and 13 was the fastest).  Also, since sqrt(a) *
sqrt(b)=sqrt(a*b), I was able to save a call to sqrt.  The program,
was compiled using:

gcc -Wall latest_cluster.c -o cluster -lm -O3 -ansi

As a couple of benchmarking exercises, cluster takes:

11 seconds: 800 genes x 82 experiments (alberich)
8 minutes:  6200 genes x 82 experiments (alberich)
1 hour:     16384 genes x 250 experiments (daisy)

The program prompts the user for an input file, asks whether the data
should be clustered by experiments, in addition to genes, and asks
whether to cluster the data using a centered correlation metric.  The
progress is given by indicating every 100 correlations that have been
calculated, and every 100 nodes that have been clustered (for genes)
or every node that has been clustered (for experiments).  The program
then outputs 2 or 3 files, identically named as the input file, but
with the extensions .cdt, .gtr, and .atr (if experiments were
clustered).

File formats:

The program expects a specific input file format.  If this is violated
the program will give unexpected results.  If the input format
specified here is unsatisfactory, or too limiting, we can discuss
making changes.

Line 1: UID\tNAME\tGWEIGHT\tExp1\tExp2...Expn
Line 2: EWEIGHT\t\t\t1\t1\t...1
Line 3: ORF1\tNAME1\t1\t0.23\t0.45\t...0.43
Line 4: ORF2\tNAME2\t1\t0.56\t-0.10\t..0.34

ie: The first line contains three columns (they can be named
anything), followed by the names of all the experiments (tab
delimited).  The second lines contains 3 columns (they can be named
anything), followed by the EWEIGHTS for each experiment (usually have
the value 1, but treated internally as a float).  The third, and
subsequent lines have the ORF name as the first column, the gene name
as the second column, and the GWEIGHT as the third column (again,
whose value is usually 1, but is treated internally as a float).  The
ORF name and GENE name fields can be up to 1024 characters in length,
so can be descriptive data instead.  The remainder of the line is made
up of tab delimited experimental data.  Where there is a null data
point, there will be two tabs next to each other.

an example file is :

UID	NAME	GWEIGHT	spo0	spo30	spo2	spo5	spo7	spo9	spo11
EWEIGHT			1	1	1	1	1	1	1
YAL003W	EFB1	1	0.23	-1.79	-1.29	-1.56		-0.27	
YAL004W	YAL004W	1	0.41	-0.38	-0.89	-1.06	-1.6	-1.84	-1.6
YAL005C	SSA1	1	0.61	-0.07	-1.29	-1.29	-2	-1.84	-2.25
YAL010C	MDM10	1	0.16	-0.15	-0.76	-1.25	-1.89	-1.74	-1.6
YAL012W	CYS3	1	0.03	1.39	-0.84	-1.64	-2.84	-2.47	-2.4
YAL015C	NTG1	1	-0.18	-0.18	-0.62	-1.32	-1.69	-1.43	-1.79
...etc

Ouput files:

The .cdt file looks like this:

GID	YORF	NAME	GWEIGHT	spo0	spo2	spo30	spo11	spo9	spo7	spo5
AID				ARRY0X	ARRY2X	ARRY1X	ARRY6X	ARRY5X	ARRY4X	ARRY3X
GENE37X	YBR025C	YBR025C	1	0.50	-1.94	-2.64	0.73	0.51	-0.18	-1.00	
GENE26X	YBL042C	YBL042C	1	0.26	-1.06	-2.06	1.58	1.33	0.54	-0.43
...etc

where everything is tab delimited.  The second line is only printed if
experiments, as well as genes were clustered.  The ARRY0X, and GENE37X
are used by the TreeViewer program to determine how to draw the binary
trees, in conjunction with the .gtr and .atr files (below):

.gtr:

NODE1X	GENE51X	GENE24X	0.997965
NODE2X	GENE41X	GENE40X	0.997813
NODE3X	GENE90X	GENE89X	0.997782
NODE4X	GENE84X	GENE32X	0.996293
NODE5X	GENE8X	GENE3X	0.995884
NODE6X	GENE91X	GENE54X	0.995747
NODE7X	GENE80X	GENE42X	0.995286
NODE8X	GENE70X	GENE31X	0.994681
NODE9X	GENE63X	GENE36X	0.994233
...etc

.atr

NODE1X	ARRY4X	ARRY3X	0.948027
NODE2X	ARRY6X	ARRY5X	0.940996
NODE3X	NODE2X	NODE1X	0.902631
NODE4X	ARRY2X	ARRY1X	0.785001
NODE5X	NODE4X	NODE3X	0.662225
NODE6X	NODE5X	ARRY0X	-0.082621

If there's any questions over:

File formats
Software
Anything else!

Please ask me.

The program doesn't appear to have any bugs per se, but it probably won't cope well with badly formatted files. It also expects every gene to have an Orf, or UID in the first column. It will error out if this is not the case. For the second (name) column, this is not fatal, as long as there is an empty column. The code itself is pretty well commented, though some of the comments are slightly out of date, as I changed functions during development.

I'm also well aware that each function has probably too many arguments, and that I should probably subsume a lot of things into the clusterRec, to only have to pass one thing around, although the increased over head of having to dereference an extra pointer each time may or may not be trivial. I also realize that the ClusterNodes and ClusterExperiments functions should be unified to make code maintenance easier. I'll try and get round to this. The source code is available and the same code should essentially run unmodified under any system (UNIX, Mac, PC, Linux) with a simple recompilation. Let me know if you run into problems. Gavin