MC-MinH Algorithm
The MC-MinH algorithm combines min-wise hashing with a greedy clustering algorithm to group 16S and whole metagenomic sequences. Sequences of unequal length are represented by their contiguous subsequences (k-mers), and pairwise similarity is then approximated using independent min-wise hash functions. The algorithm is written in C and is available under the GNU GPL license.
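The idea described above can be illustrated with a minimal Python sketch. This is not the MC-MinH C implementation: the function names, the base-4 k-mer encoding, the hash form (a*x + b) mod prime, and the rule of comparing each sequence against one representative per cluster are all assumptions made for illustration. Sequences are assumed to be at least k characters long.

```python
import random

def kmers(seq, k):
    """Return the set of contiguous k-mers of seq (assumes len(seq) >= k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_to_int(kmer):
    """Encode a k-mer as a base-4 integer; unknown characters map to 0 (assumption)."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    value = 0
    for ch in kmer:
        value = value * 4 + code.get(ch, 0)
    return value

def make_hashes(num_hash, prime, seed=0):
    """Draw num_hash (a, b) pairs for hash functions of the form (a*x + b) mod prime."""
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(num_hash)]

def signature(seq, k, hashes, prime):
    """Min-hash signature: the minimum hash value over all k-mers, per hash function."""
    ids = [kmer_to_int(km) for km in kmers(seq, k)]
    return [min((a * x + b) % prime for x in ids) for a, b in hashes]

def estimate_similarity(sig1, sig2):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(1 for u, v in zip(sig1, sig2) if u == v) / len(sig1)

def greedy_cluster(seqs, k, num_hash, threshold, prime):
    """Assign each sequence to the first cluster whose representative is similar enough."""
    hashes = make_hashes(num_hash, prime)
    labels = []
    representatives = []  # one signature per cluster, indexed by cluster label
    for seq in seqs:
        sig = signature(seq, k, hashes, prime)
        for label, rep in enumerate(representatives):
            if estimate_similarity(sig, rep) >= threshold:
                labels.append(label)
                break
        else:
            # No existing cluster matched: this sequence starts a new cluster.
            labels.append(len(representatives))
            representatives.append(sig)
    return labels
```

For example, greedy_cluster(seqs, 4, 50, 0.9, 2147483647) returns one integer label per input sequence, in input order.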
Downloads
Linux Version (download package)
Windows Version (download package)
Input Parameters
Sequence File: The input file must contain one sequence per line. A sample file can be downloaded from here (sequence file and tag file). We are working on support for FASTA input. Datatype is string.
Output File: Name of the file to which the output is redirected and in which the cluster file is generated. Datatype is string.
k-mer size: Length of the subsequences used for the k-mer representation of each complete sequence. Datatype is integer.
Number of Hash Functions: Number of hash functions used to calculate the min-hash values. Datatype is integer.
Threshold: Cutoff for comparing sequence similarity. Sequences passing this threshold are placed in the same cluster. Datatype is decimal/float.
Div: A value used in the calculation of the universal hashing function. For correct results, this value must be a prime number greater than the number of sequences in the input file.
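Choosing Div by hand is error-prone, so a small helper can pick it. The sketch below counts the non-empty lines of a sequence file and returns the smallest prime above that count. The function names are hypothetical and this helper is not part of the MC-MinH package.

```python
def is_prime(n):
    """Trial-division primality test, adequate for values the size of a sequence count."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True

def next_prime_above(n):
    """Return the smallest prime strictly greater than n."""
    candidate = n + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate

def count_sequences(path):
    """Count non-empty lines, assuming one sequence per line as required above."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())

def suggest_div(path):
    """Suggest a Div value: the first prime above the number of sequences."""
    return next_prime_above(count_sequences(path))
```

For a file with 1000 sequences, this suggests 1009.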
Output Files
<Output File> contains the cluster label for each sequence. Each row in the output file corresponds to the same row in the sequence file.
<Output File>.out contains a summary of the clustering results. This file holds the redirected screen/terminal/console output.
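Because the label file is row-aligned with the sequence file, cluster sizes can be tallied with a few lines of Python. This sketch assumes each row holds exactly one label and treats labels as opaque strings. The exact file layout is not specified here, so adjust the parsing if your file differs.

```python
from collections import Counter

def cluster_sizes(label_lines):
    """Count how many sequences carry each cluster label (one label per row assumed)."""
    labels = [line.strip() for line in label_lines if line.strip()]
    return Counter(labels)

def group_sequences(sequence_lines, label_lines):
    """Pair row i of the sequence file with row i of the label file."""
    groups = {}
    for seq, label in zip(sequence_lines, label_lines):
        groups.setdefault(label.strip(), []).append(seq.strip())
    return groups
```

Both functions accept any iterable of lines, for example an open file handle.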
How to Run
Linux Version
./bin/mw1.0 <input file> <output file> <kmer> <num hash> <threshold> <div>
Windows Version
MC-MinH.exe <input file> <output file> <kmer> <num hash> <threshold> <div>
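Since the input file must hold one sequence per line, FASTA data has to be flattened before running the commands above. The following is a minimal sketch of that conversion. The function name is hypothetical, and the layout of the tag file is not documented here, so the headers are simply returned alongside the sequences.

```python
def fasta_to_lines(fasta_text):
    """Split FASTA text into parallel lists of headers and single-line sequences."""
    headers, sequences, parts = [], [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if headers:
                # Close the previous record before starting a new one.
                sequences.append("".join(parts))
            headers.append(line[1:])
            parts = []
        else:
            parts.append(line.upper())
    if headers:
        sequences.append("".join(parts))
    return headers, sequences
```

Writing "\n".join(sequences) to a file then gives one sequence per row, in the same order as the headers.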
Possible Errors
To avoid errors, pass the parameters in exactly the order shown above.
Wrong parameters will cause errors such as Segmentation Fault, Null, and so on.
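One way to catch such mistakes before the program runs is to build the argument list in a small wrapper that checks types and fixes the order. This is a sketch only: build_command is a hypothetical helper, the checks cover just what is stated in the parameter descriptions, and the default binary path is the Linux one from the run instructions.

```python
def build_command(input_file, output_file, kmer, num_hash, threshold, div,
                  binary="./bin/mw1.0"):
    """Return the argument list in the documented order, or raise ValueError."""
    for name, value in (("kmer", kmer), ("num_hash", num_hash), ("div", div)):
        if not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    if not isinstance(threshold, (int, float)):
        raise ValueError(f"threshold must be a number, got {threshold!r}")
    # Order matters: input, output, k-mer size, hash count, threshold, div.
    return [binary, input_file, output_file,
            str(kmer), str(num_hash), str(threshold), str(div)]
```

The returned list can be passed to subprocess.run from the Python standard library.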
Species Diversity Estimation
We use the MC-MinH clustering results to estimate different species richness metrics. Clusters (groups) represent taxonomy-specific groups called operational taxonomic units (OTUs). MC-MinH supports the Chao1 index, the Shannon diversity index, and the ACE (Abundance-based Coverage Estimator) index.
The Python script for species diversity estimation can be downloaded from here (download file).
Usage:
SpeciesDiversity.py -i <cluster label file>
where <cluster label file> is the output file generated by the MC-MinH algorithm in the previous steps.
Output:
<cluster label file>.estimate contains all the species diversity estimates.
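To show what two of these estimates measure, the sketch below computes the Chao1 and Shannon indices from a list of cluster labels. It is not the code of SpeciesDiversity.py. It assumes the bias-corrected form of Chao1, S_obs + F1(F1 - 1) / (2(F2 + 1)), and the natural logarithm for Shannon; the script may use other variants. ACE is omitted for brevity.

```python
import math
from collections import Counter

def chao1(labels):
    """Bias-corrected Chao1 richness: S_obs + F1*(F1-1) / (2*(F2+1))."""
    sizes = Counter(labels).values()
    s_obs = len(sizes)
    f1 = sum(1 for n in sizes if n == 1)  # clusters (OTUs) seen exactly once
    f2 = sum(1 for n in sizes if n == 2)  # clusters (OTUs) seen exactly twice
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

def shannon(labels):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over cluster proportions p_i."""
    sizes = Counter(labels).values()
    total = sum(sizes)
    return -sum((n / total) * math.log(n / total) for n in sizes)
```

Here labels is the list of cluster labels read from the cluster label file, one per sequence.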