MrMC-MinH Algorithm

MrMC-MinH is a MapReduce-based algorithm for metagenome clustering using minwise hashing. It extends our previously developed greedy clustering algorithm, MC-MinH (http://www.cs.gmu.edu/~mlbio/MC-MinH/). The algorithm is implemented in Java and Pig Latin (the Apache Pig scripting language).

The key contributions are (i) a distributed MapReduce implementation of the clustering algorithm and (ii) the ability to perform agglomerative hierarchical clustering instead of greedy clustering.
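For readers new to minwise hashing, the following is a minimal Java sketch of the general idea, not the MrMC-MinH source. Each sequence is reduced to its set of k-mers, every hash function contributes the minimum hash value over that set, and the fraction of matching minima between two sequences estimates their Jaccard similarity. The class name MinHashSketch, the hash family (a*x + b) mod div, the use of String.hashCode() to turn a k-mer into an integer, and the prime 1000003 are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Illustrative only: names and the hash family are assumptions, not the project's code.
class MinHashSketch {

    /** Distinct k-mers (contiguous substrings of length k) of a sequence. */
    static Set<String> kmers(String seq, int k) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            out.add(seq.substring(i, i + k));
        }
        return out;
    }

    /** Random coefficients: row 0 holds a_i in [1, div-1], row 1 holds b_i in [0, div-1]. */
    static long[][] coefficients(int numHash, int div, long seed) {
        Random rnd = new Random(seed);
        long[][] ab = new long[2][numHash];
        for (int i = 0; i < numHash; i++) {
            ab[0][i] = 1 + rnd.nextInt(div - 1);
            ab[1][i] = rnd.nextInt(div);
        }
        return ab;
    }

    /** Signature entry i is the minimum of (a_i * x + b_i) mod div over all k-mers x. */
    static long[] signature(Set<String> kmers, long[][] ab, int div) {
        long[] sig = new long[ab[0].length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (String kmer : kmers) {
            // Assumed mapping from k-mer to a non-negative integer below div.
            long x = Math.floorMod(kmer.hashCode(), div);
            for (int i = 0; i < sig.length; i++) {
                long h = (ab[0][i] * x + ab[1][i]) % div;
                sig[i] = Math.min(sig[i], h);
            }
        }
        return sig;
    }

    /** Fraction of matching signature positions, an estimate of Jaccard similarity. */
    static double similarity(long[] s1, long[] s2) {
        int match = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                match++;
            }
        }
        return (double) match / s1.length;
    }

    public static void main(String[] args) {
        int kmer = 7, numHash = 15, div = 1000003; // div: an illustrative prime
        long[][] ab = coefficients(numHash, div, 42L);
        long[] s1 = signature(kmers("ACGTACGTTAGCCGATAGGCTA", kmer), ab, div);
        long[] s2 = signature(kmers("ACGTACGTTAGCCGATAGGCTT", kmer), ab, div);
        System.out.println("estimated similarity = " + similarity(s1, s2));
    }
}
```

More hash functions give a less noisy estimate at the cost of longer signatures, and a larger k makes k-mer matches more specific.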


Downloads

Pig Script Files (download)

Jar Files, containing both source and class files (download)

S-Space Package for Hierarchical Clustering in Java (download)


Requirements


Hadoop 0.20.2 or higher

Pig 0.10.0 or higher


How To Run

Register all the jar files with their correct paths in the Pig script.

Greedy Version:
pig -param INPUT=<fasta file> -param OUTPUT=<output dir> -param P=5 -param KMER=7 -param NUMHASH=15 -param DIV=<prime number> -param CUTOFF=0.8 MrMCMinHGreedy.pig

Agglomerative Hierarchical Clustering Version:
pig -param INPUT=<fasta file> -param OUTPUT=<output dir> -param P=5 -param KMER=7 -param NUMHASH=15 -param DIV=<prime number> -param CUTOFF=0.8 -param LINK=1 MrMCMinHCutoffHierarchical.pig

INPUT: Fasta file containing sequences
OUTPUT: Path of HDFS directory to store clustering output
P: Sets the number of reducers for all MapReduce jobs generated by Pig
KMER: Size of contiguous subsequence for feature extraction
NUMHASH: Number of hash functions
DIV: A prime number greater than the number of records; required for the minhash calculation
CUTOFF: Sequence similarity threshold
LINK: Linkage method for Agglomerative Hierarchical Clustering
    1 - Complete Linkage
    2 - Mean Linkage
    3 - Median Linkage
    4 - Single Linkage
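
The four LINK options differ only in how the similarity between two clusters is summarized from the pairwise similarities of their members. The Java sketch below illustrates the standard definitions and is not taken from the MrMC-MinH or S-Space sources. The class name LinkageSketch and its method names are hypothetical, and how the resulting score is compared against CUTOFF is not shown.

```java
import java.util.Arrays;

// Illustrative only: standard linkage definitions over pairwise similarities.
class LinkageSketch {

    /** Similarities between every member of cluster a and every member of cluster b. */
    static double[] pairwise(double[][] sim, int[] a, int[] b) {
        double[] out = new double[a.length * b.length];
        int n = 0;
        for (int i : a) {
            for (int j : b) {
                out[n++] = sim[i][j];
            }
        }
        return out;
    }

    /** Cluster-to-cluster similarity for a LINK code (1 to 4) as listed above. */
    static double score(int link, double[] s) {
        switch (link) {
            case 1: // complete linkage: the least similar pair decides
                return Arrays.stream(s).min().getAsDouble();
            case 2: // mean linkage: average over all pairs
                return Arrays.stream(s).average().getAsDouble();
            case 3: // median linkage: middle value of the sorted pairwise similarities
                double[] c = s.clone();
                Arrays.sort(c);
                int m = c.length / 2;
                return c.length % 2 == 1 ? c[m] : (c[m - 1] + c[m]) / 2.0;
            case 4: // single linkage: the most similar pair decides
                return Arrays.stream(s).max().getAsDouble();
            default:
                throw new IllegalArgumentException("LINK must be 1, 2, 3 or 4");
        }
    }
}
```

Because the scores here are similarities rather than distances, complete linkage takes the minimum and single linkage the maximum. Complete linkage is therefore the strictest choice, since every pair of members across the two clusters must be similar for the clusters to score highly.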