About the authors

  Zeehasham Rasheed
  Department of Computer Science
  George Mason University
  Email: zrasheed [at] gmu.edu

  Huzefa Rangwala
  Department of Computer Science
  George Mason University
  Email: rangwala [at] cs.gmu.edu

MC-MinH Algorithm

The MC-MinH algorithm uses min-wise hashing together with a greedy clustering algorithm to group 16S and whole-metagenome sequences. Sequences of unequal length are represented by their contiguous subsequences (k-mers), and pairwise similarity is then approximated using independent min-wise hash functions. The algorithm is written in C and is available under the GNU GPL license.
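To make the idea concrete, here is a minimal Python sketch of min-wise hashing over k-mers. It is not the MC-MinH C code: the function names, the base-4 k-mer encoding, the hash form h(x) = (a*x + b) mod prime, and the restriction to the letters A, C, G and T are all assumptions made for illustration.

```python
import random

def kmers(seq, k):
    """Return the set of contiguous length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_to_int(kmer):
    """Encode a DNA k-mer as a base-4 integer (assumes only A, C, G, T)."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    value = 0
    for base in kmer:
        value = value * 4 + code[base]
    return value

def make_hashes(num_hashes, prime, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod prime."""
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(0, prime))
            for _ in range(num_hashes)]

def signature(seq, k, hashes, prime):
    """Min-hash signature: the minimum hash value per hash function."""
    ids = [kmer_to_int(km) for km in kmers(seq, k)]
    if not ids:
        raise ValueError("sequence is shorter than k")
    return [min((a * x + b) % prime for x in ids) for a, b in hashes]

def estimated_similarity(sig1, sig2):
    """Fraction of signature positions on which two sequences agree."""
    matches = sum(1 for x, y in zip(sig1, sig2) if x == y)
    return matches / len(sig1)
```

The fraction of matching signature positions approximates the Jaccard similarity of the two k-mer sets, and the approximation generally tightens as the number of hash functions grows.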


Linux Version (download package)

Windows Version (download package)

Input Parameters

Sequence File: The input file must contain one sequence per line. A sample file can be downloaded from here (sequence file and tag file). FASTA input is not yet supported; we are working on it. Datatype is string.

Output File: File to which the output is redirected and in which the cluster labels are written. Datatype is string.

k-mer size: Length of the subsequences used for the k-mer representation of each sequence. Datatype is integer.

Number of Hash Functions: Number of hash functions used to calculate the min-hash values. Datatype is integer.

Threshold: Similarity cutoff used when comparing sequences. Sequences passing this threshold are placed in the same cluster. Datatype is decimal/float.

Div: A value used in the calculation of the universal hashing function. For correct results, this value must be a prime number greater than the number of sequences in the input file.
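Two of the requirements above lend themselves to a small preparation step: converting FASTA records into the one-sequence-per-line format, and choosing a prime Div larger than the number of sequences. The Python sketch below shows one way to do both. The helper names are hypothetical and are not part of the MC-MinH package.

```python
def fasta_to_lines(fasta_text):
    """Join each FASTA record's sequence lines into one string per record."""
    sequences = []
    current = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line closes the previous record, if any.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences

def is_prime(n):
    """Trial-division primality test, adequate for small inputs."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True

def next_prime_above(n):
    """Smallest prime strictly greater than n."""
    candidate = n + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate
```

For example, for an input file holding 1000 sequences, next_prime_above(1000) gives 1009 as a candidate value for Div.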

Output Files

<Output File> contains the cluster label for each sequence. Each row in the output file corresponds to the same row in the sequence file.

<Output File>.out contains a summary of the clustering results. This file captures the output that would otherwise be printed to your screen/terminal/console.
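As an illustration of how one label per input row can arise from a similarity threshold, here is a minimal Python sketch of a greedy clustering pass over min-hash signatures. The scheme shown (compare each sequence with the first member of every existing cluster and open a new cluster when none passes the threshold) is an assumption about the general approach, not a description of the MC-MinH implementation, and the function names are hypothetical.

```python
def signature_similarity(sig1, sig2):
    """Fraction of positions at which two min-hash signatures agree."""
    return sum(1 for x, y in zip(sig1, sig2) if x == y) / len(sig1)

def greedy_cluster(signatures, threshold):
    """Assign each signature a cluster label, in input order."""
    representatives = []  # first signature seen in each cluster
    labels = []
    for sig in signatures:
        for cluster_id, rep in enumerate(representatives):
            if signature_similarity(sig, rep) >= threshold:
                labels.append(cluster_id)
                break
        else:
            # No existing cluster passed the threshold: open a new one.
            representatives.append(sig)
            labels.append(len(representatives) - 1)
    return labels
```

The returned list has one label per input signature, in the same order as the input, which mirrors the row-for-row correspondence between the sequence file and the label file.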

How to Run

Linux Version

./bin/mw1.0 <input file> <output file> <kmer> <num hash> <threshold> <div>

Windows Version

MC-MinH.exe <input file> <output file> <kmer> <num hash> <threshold> <div>

Possible Errors

To avoid errors, pass the parameters in exactly the order shown above.

Wrong or missing parameters produce errors such as Segmentation Fault, Null, and so on.
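Because the program reports bad arguments only through crashes, it can help to check the values and fix their order before launching it. The Python sketch below assembles the argument list in the documented order and rejects obviously wrong types. The function name and the specific checks are assumptions, and the sketch does not run the binary itself.

```python
def build_command(binary, input_file, output_file, kmer, num_hash,
                  threshold, div):
    """Return the argument list in the order MC-MinH expects."""
    if not isinstance(kmer, int) or kmer <= 0:
        raise ValueError("kmer must be a positive integer")
    if not isinstance(num_hash, int) or num_hash <= 0:
        raise ValueError("num_hash must be a positive integer")
    if not isinstance(threshold, (int, float)):
        raise ValueError("threshold must be a number")
    if not isinstance(div, int) or div <= 0:
        raise ValueError("div must be a positive integer")
    return [binary, input_file, output_file,
            str(kmer), str(num_hash), str(threshold), str(div)]
```

The resulting list can be passed to subprocess.run, for example build_command("./bin/mw1.0", "seqs.txt", "clusters.txt", 3, 100, 0.9, 1009), where the file names and parameter values are placeholders.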

Species Diversity Estimation

We use the MC-MinH clustering results to estimate several species richness metrics. Clusters (groups) represent taxonomy-specific groups called operational taxonomic units (OTUs). MC-MinH supports the Chao1 index, the Shannon diversity index, and the ACE (Abundance-based Coverage Estimator) index.
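For orientation, two of these indices can be computed directly from a list of cluster labels, as in the Python sketch below. This is not the SpeciesDiversity.py script: the function names are hypothetical, the Chao1 form used is the classic S_obs + F1^2 / (2 * F2) with the bias-corrected form as a fallback when there are no doubletons, Shannon uses the natural logarithm, and ACE is omitted.

```python
import math
from collections import Counter

def otu_sizes(labels):
    """Number of sequences in each cluster (OTU)."""
    return list(Counter(labels).values())

def chao1(sizes):
    """Chao1 richness estimate from OTU sizes."""
    s_obs = len(sizes)
    f1 = sum(1 for n in sizes if n == 1)  # singletons
    f2 = sum(1 for n in sizes if n == 2)  # doubletons
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # Bias-corrected form, used here when there are no doubletons.
    return s_obs + f1 * (f1 - 1) / 2

def shannon(sizes):
    """Shannon diversity index using the natural logarithm."""
    total = sum(sizes)
    return -sum((n / total) * math.log(n / total) for n in sizes)
```

Here the labels are assumed to be one hashable value per sequence, for example the rows of the cluster label file read into a list.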

The Python script for species diversity estimation can be downloaded from here (download file).


SpeciesDiversity.py -i <cluster label file>

where <cluster label file> is the output file generated by the MC-MinH algorithm in the previous steps.


<cluster label file>.estimate contains all the species diversity estimates.
