The LSH-Div algorithm groups sequences into Operational Taxonomic Units
(OTUs) using a locality-sensitive hashing (LSH) function within a
greedy, iterative clustering framework. After assigning the sequences in
a sample to OTUs (or clusters), LSH-Div reports standard species
richness and diversity metrics: the Chao1 Index, the Shannon Diversity
Index and the Abundance-based Coverage Estimator (ACE) Index.
Availability and Implementation
LSH-Div is currently distributed as Python scripts. The source code is
available under the GNU GPL license.
Description of Source Code
LSHDiv_SouceCode.zip contains the following scripts.
StatsFasta.py | Displays statistics of a fasta file: number of sequences, minimum sequence length, maximum sequence length, and mean and standard deviation of sequence lengths
FilterFasta.py | Generates an output fasta file containing only the sequences whose lengths fall within a given minimum and maximum
EqualLengthFasta.py | Generates an output fasta file of equal-length sequences. Every sequence in the output file has the same length, equal to the minimum sequence length in the input file
LSHDIV.py | Estimates the OTUs in a given sample and reports standard species richness metrics
How To Use
Here are some examples of how to use the LSH-Div scripts.
StatsFasta.py
Usage: StatsFasta.py -i <inputfile.fasta>
Input: <inputfile.fasta> is any sequence file in fasta format
Output: Displays statistics about the fasta file
Reading time: 1.17 seconds
Number of Sequences: 55592
Minimum Sequence Length: 53
Maximum Sequence Length: 100
Mean Sequence Length: 61
Standard Deviation: 2
Done
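For readers who want to see how statistics like these can be derived, here is a minimal sketch. It is not the StatsFasta.py source. The helper names read_fasta and length_stats are hypothetical, and the use of the population standard deviation is an assumption, since the page does not say which form the script reports.

```python
import io
import statistics


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open fasta handle.

    Hypothetical helper: sequences may span several lines.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_stats(lengths):
    """Summarise a list of sequence lengths.

    Assumption: population standard deviation (statistics.pstdev).
    """
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "stdev": statistics.pstdev(lengths),
    }


# Tiny in-memory example instead of a real input file.
example = io.StringIO(">s1\nACGT\n>s2\nACGTAC\nGT\n>s3\nAC\n")
lengths = [len(seq) for _, seq in read_fasta(example)]
stats = length_stats(lengths)
```

To run this on a real file, pass an open file object to read_fasta in place of the in-memory example.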
FilterFasta.py
Usage: FilterFasta.py -i <inputfile.fasta> -o <outputfile> -l <min length> -u <max length>
Input:
<inputfile.fasta> is any sequence file in fasta format
<outputfile.fasta> is the name of your output file (can be any name)
<min length> is the minimum sequence length in the filtered output file
<max length> is the maximum sequence length in the filtered output file
Output: Generates a fasta file containing the sequences whose lengths
are greater than <min length> and less than <max length>
Number of Sequences: 55592
Minimum Sequence Length: 53
Maximum Sequence Length: 100
Mean Sequence Length: 61
Standard Deviation: 2
Writing Output to a fasta file
Writing Time: 1.21 seconds
Number of Sequences: 372
Minimum Sequence Length: 70
Maximum Sequence Length: 100
Mean Sequence Length: 72
Standard Deviation: 5
Done
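The length filter described above can be sketched in a few lines. This is an illustration only, not the FilterFasta.py source. The function names and the inclusive parameter are hypothetical. The page does not state whether the bounds themselves are kept, so the sketch exposes both readings and defaults to inclusive bounds as an assumption.

```python
def filter_by_length(records, min_len, max_len, inclusive=True):
    """Keep (header, sequence) records whose length is within the bounds.

    Assumption: whether the bounds are inclusive is not stated on the
    page, so the caller chooses; inclusive is the default here.
    """
    if inclusive:
        def keep(n):
            return min_len <= n <= max_len
    else:
        def keep(n):
            return min_len < n < max_len
    return [(h, s) for h, s in records if keep(len(s))]


def format_fasta(records):
    """Render records as fasta text, one sequence per line."""
    return "".join(f">{h}\n{s}\n" for h, s in records)


records = [("s1", "ACGT"), ("s2", "ACGTACGT"), ("s3", "AC")]
kept = filter_by_length(records, 4, 8)
strict = filter_by_length(records, 4, 8, inclusive=False)
fasta_text = format_fasta(kept)
```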
EqualLengthFasta.py
Usage: EqualLengthFasta.py -i <inputfile.fasta> -o <outputfile>
Input:
<inputfile.fasta> is any sequence file in fasta format
<outputfile.fasta> is the name of your output file (can be any name)
Output: Generates a fasta file that contains sequences of equal length.
The common length is the minimum sequence length in <inputfile.fasta>
Number of Sequences: 372
Minimum Sequence Length: 70
Maximum Sequence Length: 100
Mean Sequence Length: 72
Standard Deviation: 5
Writing Fasta File for Equal Length
Writing Time: 0.14 seconds
Number of Sequences: 372
Minimum Sequence Length: 70
Maximum Sequence Length: 70
Mean Sequence Length: 70
Standard Deviation: 0
Done
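One way to obtain equal-length sequences like those in the output above is to cut every sequence down to the shortest length in the input. The sketch below assumes that the leading characters are kept and the tail is discarded; the page does not say which end EqualLengthFasta.py trims, and the function name is hypothetical.

```python
def truncate_to_min_length(records):
    """Trim every sequence to the shortest sequence length in the input.

    Assumption: the prefix of each sequence is kept and the tail dropped.
    """
    if not records:
        return []
    min_len = min(len(seq) for _, seq in records)
    return [(header, seq[:min_len]) for header, seq in records]


records = [("s1", "ACGTAC"), ("s2", "ACGT"), ("s3", "ACGTACGT")]
equal = truncate_to_min_length(records)
```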
LSHDIV.py
Usage: LSHDIV.py -i <inputfile.fasta> -o <outputfile> -l <min length> -s <num sampled indices> -w <wmer length> -p <percentage mismatch> -n <num iterations>
Input:
<inputfile.fasta> is any sequence file in fasta format in which all sequences have equal length
<outputfile> is the name of your output file (can be any name)
<min length> is the minimum length of the sequence
<num sampled indices> is the number of sampled indices to be chosen from the sequences. This number must be less than or equal to <min length>
<wmer length> is the length of the w-mer taken at each index
<percentage mismatch> is the percentage of mismatches allowed between sequences assigned to the same OTU
<num iterations> is the number of iterations the LSH-Div algorithm runs, each time with a different set of sampled indices
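To give a feel for how the sampled indices, the w-mer length and the mismatch percentage could interact, here is a rough sketch of one hashing pass. It is an assumption-laden illustration and not the LSHDIV.py implementation: the function names, the way the key is built by concatenating w-mers, and the position-wise mismatch measure are all hypothetical.

```python
import random
from collections import defaultdict


def sample_indices(seq_len, num_indices, wmer_len, rng):
    """Pick sorted start positions so that every w-mer fits in the sequence."""
    return sorted(rng.sample(range(seq_len - wmer_len + 1), num_indices))


def lsh_key(seq, indices, wmer_len):
    """Assumed hash key: the w-mers at the sampled positions, concatenated."""
    return "".join(seq[i:i + wmer_len] for i in indices)


def mismatch_fraction(a, b):
    """Fraction of positions that differ between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def bucket_sequences(seqs, indices, wmer_len):
    """Group sequence positions by key; sequences sharing a key are candidates
    for the same OTU and could then be checked against the mismatch threshold."""
    buckets = defaultdict(list)
    for pos, seq in enumerate(seqs):
        buckets[lsh_key(seq, indices, wmer_len)].append(pos)
    return dict(buckets)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
seqs = ["ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC", "TTTTGGGGCC"]
indices = sample_indices(len(seqs[0]), 3, 2, rng)
buckets = bucket_sequences(seqs, indices, 2)
```

Under this reading, each additional iteration would draw a fresh set of indices, so pairs of similar sequences separated by one unlucky sample get another chance to share a bucket.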
Output:
<outputfile_log.txt> is a log file containing all the parameters, the number of OTUs, the species richness metrics and other statistics
<outputfile_OTUFasta> is a fasta file in which each sequence tag contains its OTU label
<outputfile_OTULabels> is an index file containing the OTU labels. The number in row one is the OTU label of sequence one in the input file
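Because the labels file holds one OTU label per row, OTU abundances can be recovered from it with a simple count. The sketch below reads such a file and computes a Shannon index from the counts. The function names are hypothetical, and the natural logarithm is an assumption; the page does not state which base LSH-Div uses.

```python
import io
import math
from collections import Counter


def read_otu_labels(handle):
    """Return the OTU label on each non-empty row, in input order."""
    return [line.strip() for line in handle if line.strip()]


def shannon_index(abundances):
    """Shannon index H = -sum(p * ln p) over OTU relative abundances.

    Assumption: natural logarithm.
    """
    counts = [n for n in abundances if n > 0]
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts)


# In-memory stand-in for an <outputfile_OTULabels> file.
labels = read_otu_labels(io.StringIO("1\n1\n2\n3\n3\n3\n"))
abundances = Counter(labels)
h = shannon_index(abundances.values())
```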
Reading time: 0.04 seconds
Mean Sequence Length: 70
Standard Deviation: 0
Initializing LSH-Div
Clustering Sequences and Estimating OTUs
Total Number of Sequences: 372
Number of OTUs in Iteration 1 is 214
Time taken by LSH-Div is 0.65 seconds
Total Time upto iteration 1 is 0.65 seconds
Number of Singleton OTUs are 177
Number of Doubleton OTUs are 16
Chao1 Estimate is 1130.24
Chao1 LCI 95% is 747.37
Chao1 UCI 95% is 1787.93
Shannon Index is 4.87
Shannon LCI 95% is 4.74
Shannon UCI 95% is 5.00
ACE is 2088.20
Writing Output to a fasta file
Done
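The Chao1 estimate in the log above can be checked by hand. The sketch below assumes the bias-corrected form of the estimator, S_obs + F1 (F1 - 1) / (2 (F2 + 1)), where S_obs is the number of OTUs, F1 the number of singleton OTUs and F2 the number of doubleton OTUs. The page does not say which form LSH-Div uses, but this form reproduces the logged estimate from the logged counts. The function name is hypothetical.

```python
def chao1_bias_corrected(s_obs, singletons, doubletons):
    """Bias-corrected Chao1 richness estimate.

    Assumption: this is the form behind the logged value; it is chosen
    because it matches the numbers in the sample log.
    """
    return s_obs + singletons * (singletons - 1) / (2 * (doubletons + 1))


# Counts taken from the sample log: 214 OTUs, 177 singletons, 16 doubletons.
estimate = chao1_bias_corrected(214, 177, 16)
print(f"Chao1 Estimate is {estimate:.2f}")
```

The confidence limits and the ACE value in the log depend on details not given on this page, so they are not reproduced here.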
Supplementary Paper
The supplementary paper for LSH-Div can be downloaded from here.
It contains results that are not included in the main paper.
Copyright and License Information
The source code for LSH-Div is available under the GNU General Public
License (GNU GPL).
© COPYRIGHT 2012 by Zeehasham Rasheed and Huzefa Rangwala (George Mason
University). All Rights Reserved