MrMC-MinH is a MapReduce-based algorithm for metagenome clustering using minwise hashing. It extends our previously developed greedy clustering algorithm, MC-MinH (http://www.cs.gmu.edu/~mlbio/MC-MinH/). The algorithm is implemented in Java and Pig.

The key contributions are (i) a distributed MapReduce implementation of the clustering algorithm and (ii) the ability to perform agglomerative hierarchical clustering instead of a greedy clustering approach.

Downloads

Pig Script Files (download)

Jar Files containing both source and class files (download)

S-Space Package for Hierarchical Clustering in Java (download)

Requirements

Hadoop 0.20.2 or higher

Pig 0.10.0 or higher

How To Run

Register all the jar files with their correct paths in the Pig script.

Greedy Version:

pig -param INPUT=<fasta file> -param OUTPUT=<output dir> -param P=5 -param KMER=7 -param NUMHASH=15 -param DIV=<prime number> -param CUTOFF=0.8 MrMCMinHGreedy.pig

Agglomerative Hierarchical Clustering Version:

pig -param INPUT=<fasta file> -param OUTPUT=<output dir> -param P=5 -param KMER=7 -param NUMHASH=15 -param DIV=<prime number> -param CUTOFF=0.8 -param LINK=1 MrMCMinHCutoffHierarchical.pig

INPUT: Fasta file containing sequences

OUTPUT: Path of HDFS directory to store clustering output

P: Sets the number of reducers for all MapReduce jobs generated by Pig

KMER: Size of contiguous subsequence for feature extraction

NUMHASH: Number of hash functions

DIV: A prime number greater than the number of records; required for the minhash calculation

CUTOFF: Sequence similarity threshold

LINK: Linkage method for Agglomerative Hierarchical Clustering

1 - Complete Linkage

2 - Mean Linkage

3 - Median Linkage

4 - Single Linkage
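
To make the roles of KMER, NUMHASH, DIV, CUTOFF and LINK more concrete, here is a minimal Java sketch of a generic minhash pipeline. It is not taken from the MrMC-MinH jar files. The class name MinHashSketch, the base-4 k-mer encoding, the hash family h(x) = (a*x + b) mod DIV, the coefficient arrays a and b, and the idea of applying the linkage to pairwise similarities are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MinHashSketch {

    // KMER: encode every contiguous substring of length k as a base-4 integer.
    // Assumption: sequences contain only A, C, G, T and k is at most 15.
    static Set<Integer> kmerIds(String seq, int k) {
        Set<Integer> ids = new HashSet<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            int id = 0;
            for (int j = i; j < i + k; j++) {
                id = id * 4 + "ACGT".indexOf(seq.charAt(j));
            }
            ids.add(id);
        }
        return ids;
    }

    // NUMHASH and DIV: for each assumed hash function
    // h_i(x) = (a[i] * x + b[i]) mod div, keep the minimum over all k-mer ids.
    static long[] signature(Set<Integer> ids, long[] a, long[] b, long div) {
        long[] sig = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            long min = Long.MAX_VALUE;
            for (int x : ids) {
                min = Math.min(min, (a[i] * x + b[i]) % div);
            }
            sig[i] = min;
        }
        return sig;
    }

    // CUTOFF is compared against this value: the fraction of signature
    // positions on which two sequences agree.
    static double similarity(long[] s1, long[] s2) {
        int match = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) match++;
        }
        return (double) match / s1.length;
    }

    // LINK: combine the pairwise similarities between two clusters.
    // Assumption: linkage is expressed on similarities, so complete linkage
    // uses the least similar pair and single linkage the most similar pair.
    static double linkage(double[] pairSims, int link) {
        double[] s = pairSims.clone();
        Arrays.sort(s);
        int n = s.length;
        switch (link) {
            case 1: return s[0];                                    // complete
            case 2: return Arrays.stream(s).average().orElse(0.0);  // mean
            case 3: return n % 2 == 1 ? s[n / 2]
                                      : (s[n / 2 - 1] + s[n / 2]) / 2.0; // median
            case 4: return s[n - 1];                                // single
            default: throw new IllegalArgumentException("LINK must be 1-4");
        }
    }

    public static void main(String[] args) {
        int kmer = 3;                 // small value for a short demo
        long div = 65537;             // a prime, standing in for DIV
        long[] a = {3, 7, 11, 13, 17}; // hypothetical coefficients, NUMHASH = 5
        long[] b = {1, 2, 5, 8, 9};
        double cutoff = 0.8;

        long[] s1 = signature(kmerIds("ACGTACGTTGCA", kmer), a, b, div);
        long[] s2 = signature(kmerIds("ACGTACGTTGCC", kmer), a, b, div);
        double sim = similarity(s1, s2);
        System.out.println("estimated similarity = " + sim);
        System.out.println("same cluster at cutoff " + cutoff + ": " + (sim >= cutoff));
    }
}
```

In this sketch, a larger NUMHASH gives a finer-grained similarity estimate at the cost of longer signatures, and the linkage choice only matters for the hierarchical version, where whole clusters rather than single sequences are compared.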