The LSHDiv algorithm
groups sequences into Operational Taxonomic Units (OTUs) using a
locality-sensitive hashing (LSH) function within a greedy, iterative
clustering framework. After assigning the sequences within a sample to
different OTUs (or clusters), LSHDiv reports standard species richness
metrics such as the Chao1 Index, the Shannon Diversity Index and the
Abundance-based Coverage Estimator (ACE) Index.
Availability and Implementation
LSHDiv is currently
distributed as a set of Python scripts. The source code is available
under the GNU GPL license.
Description of Source Code
LSHDiv_SouceCode.zip contains
the following scripts.
StatsFasta.py
Displays statistics for a fasta file: the number of sequences, the
minimum and maximum sequence lengths, and the mean and standard
deviation of the sequence lengths.
FilterFasta.py
Generates an output fasta file containing only the sequences whose
lengths fall within a given minimum and maximum sequence length.
EqualLengthFasta.py
Generates an output fasta file of equal-length sequences. Every
sequence in the output fasta file has the same length, equal to the
minimum sequence length in the input file.
LSHDIV.py
Estimates the OTUs in a given sample and reports the standard species
richness metrics.
How To Use
Here are some examples of how to use the LSHDiv scripts.
StatsFasta.py
Usage: StatsFasta.py -i <inputfile.fasta>
Input:
<inputfile.fasta> is any sequence file in fasta format
Output: Displays statistics about the fasta file
Reading time: 1.17 seconds
Number of Sequences: 55592
Minimum Sequence Length: 53
Maximum Sequence Length: 100
Mean Sequence Length: 61
Standard Deviation: 2
Done 
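To illustrate what these statistics involve, here is a minimal Python sketch. It is not the StatsFasta.py source: the function names are hypothetical, the input is a small in-memory example rather than a real file, and the population form of the standard deviation is an assumption (the script may use the sample form).

```python
import io
import math


def read_fasta_lengths(handle):
    """Return the length of every sequence in a FASTA stream."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)  # sequences may span several lines
    if current is not None:
        lengths.append(current)
    return lengths


def length_stats(lengths):
    """Summarise sequence lengths (population standard deviation assumed)."""
    n = len(lengths)
    mean = sum(lengths) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in lengths) / n)
    return {"count": n, "min": min(lengths), "max": max(lengths),
            "mean": mean, "std": std}


# Tiny made-up example: three records with lengths 6, 8 and 6.
example = io.StringIO(">a\nACGT\nAC\n>b\nACGTACGT\n>c\nACGTAC\n")
stats = length_stats(read_fasta_lengths(example))
print(stats)
```

To read a real file, pass an open file handle in place of the StringIO object.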
FilterFasta.py
Usage: FilterFasta.py -i <inputfile.fasta> -o <outputfile.fasta> -l <min length> -u <max length>
Input: 
<inputfile.fasta> is any sequence file in fasta format 

<outputfile.fasta> is the name of your output file (can be any name) 

<min length> is the minimum length of a sequence in the filtered output file 

<max length> is the maximum length of a sequence in the filtered output file 
Output: Generates a fasta file that contains only the sequences whose
lengths lie between <min length> and <max length>
Number of Sequences: 55592
Minimum Sequence Length: 53
Maximum Sequence Length: 100
Mean Sequence Length: 61
Standard Deviation: 2
Writing Output to a fasta file
Writing Time: 1.21 seconds
Number of Sequences: 372
Minimum Sequence Length: 70
Maximum Sequence Length: 100
Mean Sequence Length: 72
Standard Deviation: 5
Done 
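The filtering step can be sketched in a few lines of Python. This is an illustration only, not the FilterFasta.py source: the helper names are hypothetical, and treating both bounds as inclusive is an assumption based on the parameter descriptions above.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len, max_len):
    """Keep records whose length is within [min_len, max_len].

    Inclusive bounds are an assumption, not confirmed by the script.
    """
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]


# Made-up records with lengths 4, 8 and 12.
records = list(parse_fasta([">r1", "ACGT", ">r2", "ACGTACGT",
                            ">r3", "ACGTACGTACGT"]))
kept = filter_by_length(records, 5, 12)
print([h for h, _ in kept])  # ['r2', 'r3']
```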
EqualLengthFasta.py
Usage: EqualLengthFasta.py -i <inputfile.fasta> -o <outputfile.fasta>
Input: 
<inputfile.fasta> is any sequence file in fasta format 

<outputfile.fasta> is the name of your output file (can be any name) 
Output: Generates a fasta file that contains sequences of equal length.
That length is the minimum sequence length in <inputfile.fasta>
Number of Sequences: 372
Minimum Sequence Length: 70
Maximum Sequence Length: 100
Mean Sequence Length: 72
Standard Deviation: 5
Writing Fasta File for Equal Length
Writing Time: 0.14 seconds
Number of Sequences: 372
Minimum Sequence Length: 70
Maximum Sequence Length: 70
Mean Sequence Length: 70
Standard Deviation: 0
Done 
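The sample run above keeps all 372 sequences and brings the maximum length down to the minimum, which suggests that longer sequences are shortened rather than discarded. A minimal Python sketch of that idea follows. It is not the EqualLengthFasta.py source: the function name is hypothetical, and keeping the leading characters of each sequence (rather than trimming from the start or both ends) is an assumption.

```python
def truncate_to_min_length(records):
    """Cut every sequence down to the shortest length in the input.

    Keeping the prefix of each sequence is an assumption; the actual
    script may trim differently.
    """
    min_len = min(len(seq) for _, seq in records)
    return [(header, seq[:min_len]) for header, seq in records]


# Made-up records with lengths 6, 10 and 7.
records = [("r1", "ACGTAC"), ("r2", "ACGTACGTAC"), ("r3", "TTGACAG")]
equal = truncate_to_min_length(records)
for header, seq in equal:
    print(header, seq, len(seq))
```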
LSHDIV.py
Usage: LSHDIV.py -i <inputfile.fasta> -o <outputfile> -l <min length> -s <num sampled indices> -w <wmer length> -p <percentage mismatch> -n <num iterations>
Input: 
<inputfile.fasta> is any sequence file in fasta format in which all sequences have equal length 

<outputfile> is the name of your output file (can be any name) 

<min length> is the minimum length of the sequence 

<num sampled indices> is the number of indices to sample from the
sequences. This number must be less than or equal to <min length> 

<wmer length> is the length of the w-mer at each index 

<percentage mismatch> is the maximum percentage of mismatches allowed
for sequences to be assigned to the same OTU 

<num iterations> is the number of iterations the LSHDiv algorithm runs,
each time with a different set of sampled indices 
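To show how these parameters might interact, here is a rough Python sketch of LSH-style clustering. It is not the LSHDIV.py implementation. The function names, the use of Hamming mismatch, the choice of the first sequence in each bucket as the seed, and the union-find merging across iterations are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def mismatch_pct(a, b):
    """Percentage of positions at which two equal-length sequences differ."""
    return 100.0 * sum(x != y for x, y in zip(a, b)) / len(a)


def signature(seq, indices, w):
    """Hash key: the w-mers starting at each sampled index, concatenated."""
    return "".join(seq[i:i + w] for i in indices)


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def lsh_pass(seqs, indices, w, max_pct, parent):
    """One pass: bucket by signature, then greedily merge within buckets."""
    buckets = defaultdict(list)
    for pos, seq in enumerate(seqs):
        buckets[signature(seq, indices, w)].append(pos)
    for members in buckets.values():
        seed = members[0]  # assumption: first sequence in the bucket is the seed
        for pos in members[1:]:
            if mismatch_pct(seqs[seed], seqs[pos]) <= max_pct:
                parent[find(parent, pos)] = find(parent, seed)


def cluster(seqs, num_indices, w, max_pct, iterations, rng_seed=0):
    """Run several passes, each with a freshly sampled set of indices."""
    rng = random.Random(rng_seed)
    parent = list(range(len(seqs)))
    starts = range(len(seqs[0]) - w + 1)  # valid w-mer start positions
    for _ in range(iterations):
        indices = sorted(rng.sample(starts, num_indices))
        lsh_pass(seqs, indices, w, max_pct, parent)
    return [find(parent, i) for i in range(len(seqs))]


# Made-up sequences: two pairs that each differ at one of ten positions.
seqs = ["ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC", "TTTTGGGGCA"]
parent = list(range(len(seqs)))
lsh_pass(seqs, indices=[0, 4], w=2, max_pct=10.0, parent=parent)
labels = [find(parent, i) for i in range(len(seqs))]
print(labels)  # [0, 0, 2, 2]
```

Sequences land in the same bucket only when their w-mers agree at every sampled index, so the expensive mismatch comparison is limited to likely neighbours. Repeating the pass with different indices gives near-identical sequences further chances to share a bucket.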
Output: 
<outputfile_log.txt> is a log file containing all the parameters, the
number of OTUs, the species richness metrics and other statistics 

<outputfile_OTUFasta> is a fasta file in which each sequence tag
contains the sequence's OTU label 

<outputfile_OTULabels> is an index file containing the OTU labels. The
number in row one is the OTU label of the first sequence in the input
file, and so on for each subsequent row.
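Because the labels file holds one OTU label per input sequence, OTU sizes and the singleton and doubleton counts can be recovered from it with a simple tally. The Python sketch below assumes one label per line and uses made-up labels in a string in place of a real file.

```python
from collections import Counter

# Made-up stand-in for the contents of a labels file (one label per row).
labels_text = "1\n1\n2\n3\n3\n3\n4\n"

labels = [line.strip() for line in labels_text.splitlines() if line.strip()]
otu_sizes = Counter(labels)                 # OTU label -> number of sequences
size_counts = Counter(otu_sizes.values())   # OTU size -> number of OTUs

num_otus = len(otu_sizes)
singletons = size_counts[1]   # OTUs containing exactly one sequence
doubletons = size_counts[2]   # OTUs containing exactly two sequences
print(num_otus, singletons, doubletons)  # 4 2 1
```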

Reading time: 0.04 seconds
Mean Sequence Length: 70
Standard Deviation: 0
Initializing LSHDiv
Clustering Sequences and Estimating OTUs
Total Number of Sequences: 372
Number of OTUs in Iteration 1 is 214
Time taken by LSHDiv is 0.65 seconds
Total Time upto iteration 1 is 0.65 seconds
Number of Singleton OTUs are 177
Number of Doubleton OTUs are 16
Chao1 Estimate is 1130.24
Chao1 LCI 95% is 747.37
Chao1 UCI 95% is 1787.93
Shannon Index is 4.87
Shannon LCI 95% is 4.74
Shannon UCI 95% is 5.00
ACE is 2088.20
Writing Output to a fasta file
Done 
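The Chao1 value in this log is consistent with the bias-corrected form of the Chao1 estimator, S_obs + F1(F1 - 1) / (2(F2 + 1)), where S_obs is the number of OTUs, F1 the number of singletons and F2 the number of doubletons. The Python sketch below reproduces it from the logged counts. The function names are hypothetical, and the use of the natural logarithm in the Shannon index is an assumption; the Shannon value in the log cannot be recomputed here because the full list of OTU sizes is not shown.

```python
import math


def chao1_bias_corrected(s_obs, singletons, doubletons):
    """Bias-corrected Chao1 richness estimate."""
    return s_obs + singletons * (singletons - 1) / (2 * (doubletons + 1))


def shannon_index(otu_sizes):
    """Shannon diversity from a list of OTU sizes (natural log assumed)."""
    total = sum(otu_sizes)
    return -sum((n / total) * math.log(n / total) for n in otu_sizes)


# Counts taken from the log above: 214 OTUs, 177 singletons, 16 doubletons.
chao1 = chao1_bias_corrected(214, 177, 16)
print(round(chao1, 2))  # 1130.24

# Made-up OTU sizes, only to show how shannon_index is called.
print(shannon_index([2, 1, 3, 1]))
```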
Supplementary Paper
The supplementary paper for LSHDiv is available for download.
It contains results that are not included in the main paper.
Copyright and License Information
The source code for LSHDiv is available under the GNU General Public
License (GNU GPL).
© COPYRIGHT 2012 by Zeehasham Rasheed and Huzefa Rangwala (George Mason
University). All Rights Reserved
