TAC-ELM (supplementary material)
TAC-ELM (Taxonomic Classification with Extreme Learning Machines) is
a new taxonomic classification scheme that extracts composition-based
features (oligonucleotide frequencies and GC content) from short sequence
reads and develops a neural network-based model. The parameters of the
model are learned with an analytical framework called the extreme learning
machine (ELM).
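As an illustration of how an ELM learns its parameters analytically, the following Python sketch draws the hidden-layer weights at random and then solves for the output weights in closed form. It is not the TAC-ELM Matlab code: the function names, the sigmoid activation, the number of hidden nodes and the small ridge term are all assumptions made for this example.

```python
import math
import random


def solve(a, b):
    """Solve a * x = b (b may have several columns) by Gauss-Jordan elimination."""
    n = len(a)
    m = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0.0:
                f = m[r][col]
                m[r] = [v - f * pv for v, pv in zip(m[r], m[col])]
    return [row[n:] for row in m]


def hidden_layer(x, weights, biases):
    """Sigmoid activations of the random hidden layer for one input vector."""
    return [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(ws, x)) + b)))
            for ws, b in zip(weights, biases)]


def train_elm(xs, labels, n_classes, n_hidden=20, ridge=1e-3, seed=0):
    """Return (weights, biases, beta); only beta is fitted to the data."""
    rng = random.Random(seed)
    n_in = len(xs[0])
    # Hidden-layer parameters are drawn at random and never updated.
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    biases = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    h = [hidden_layer(x, weights, biases) for x in xs]
    # One-hot target matrix T, one row per training sample.
    t = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in labels]
    # Regularised normal equations: (H^T H + ridge * I) beta = H^T T.
    # The ridge term is an assumption that keeps the system well conditioned.
    hth = [[sum(row[i] * row[j] for row in h) + (ridge if i == j else 0.0)
            for j in range(n_hidden)] for i in range(n_hidden)]
    htt = [[sum(h[k][i] * t[k][c] for k in range(len(xs)))
            for c in range(n_classes)] for i in range(n_hidden)]
    beta = solve(hth, htt)
    return weights, biases, beta


def predict_elm(model, x):
    """Return the index of the class with the highest output score."""
    weights, biases, beta = model
    h = hidden_layer(x, weights, biases)
    scores = [sum(h[i] * beta[i][c] for i in range(len(h)))
              for c in range(len(beta[0]))]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because the hidden layer is fixed after random initialisation, training reduces to one linear solve, which is what makes the approach analytical rather than iterative.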
Data Files
The following data files were used in the TAC-ELM experiments. You can download the compressed file here.
1- Complete Information: spreadsheet and text file containing the information for all taxonomic levels (Phylum, Class, Order, Family, Genus, Species) used in the experiments.
2- Complete Accession: spreadsheet containing all the class labels used to train the model. It also contains the accession number for each sample.
3- 100 bp Train Files: MetaSim-generated 100 bp fasta files from the original genomes in the NCBI RefSeq repository.
4- 100 bp Test Files: A set of 10 test files from the PhymmBL paper.
Source Code
TAC-ELM is written in Matlab. The compressed source code of TAC-ELM and other files can be downloaded from here. Each file is described below.
1- Generate_Data.m: This file takes three input arguments: an information text file, the path of a folder that contains fasta files, and an integer that defines the taxonomy level (for example, 1 is for the Phylum level, 2 is for the Class level, and so on). It generates one output file per fasta file in the folder and appends a class label to each sequence.
Example: Generate_Data 'Complete_Info.txt' 'FastaFiles\' '2'.
2- Feature_Extraction.m: This file takes two input arguments: the path of a folder that contains labeled files (from Generate_Data.m) and an integer that selects the feature type. For example, 1 calculates the simple tetra-oligonucleotide frequency (k-mers with k = 4) and 2 calculates the Markov chain tetra-oligonucleotide frequency. It generates a .mat file for each labeled file in the folder.
Example: Features_Extraction 'LabelledSequenceFiles\' '1'.
3- Train_TACELM_All.m: This file takes three arguments: an information text file, the path of a folder of train mat files, and an integer value that defines the taxonomy level. It generates a model file for TAC-ELM.
Example: Train_TACELM_All 'Complete_Info.txt' 'TrainFeatures\' '1'
4- Test_TACELM_All.m: This file takes six arguments: an information text file, a train mat file folder path, a test mat file folder path, a true label text file, a level information file that contains the phylum, class, order, family and genus names, and an integer that defines the taxonomy level. It generates a prediction file for TAC-ELM.
Example: Test_TACELM_All 'Complete_Info.txt' 'ModelFiles\' 'TestFeatures\' 'TestTrueLabels.txt' 'Level_Info.txt' '1'.
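To make the composition-based features more concrete, here is a minimal Python sketch of the simple tetra-oligonucleotide frequency (option 1 of Feature_Extraction.m) together with the GC content. It is an illustration only, not a port of the Matlab code: the function names, the ordering of the feature vector and the handling of ambiguous bases are assumptions, and the Markov chain variant (option 2) is not covered.

```python
from itertools import product


def kmer_frequencies(seq, k=4):
    """Relative frequency of every k-mer over A/C/G/T, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        # Assumption: windows containing N or other ambiguity codes are skipped.
        if word in counts:
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of the read."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def composition_features(seq, k=4):
    """Feature vector: 4**k k-mer frequencies followed by the GC content."""
    return kmer_frequencies(seq, k) + [gc_content(seq)]
```

For k = 4 this yields 256 frequencies plus one GC value per read, so a 100 bp read is reduced to a fixed-length vector of 257 numbers regardless of its exact length.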
Example
A simple example of TAC-ELM execution is as follows.
matlab -r Generate_Data 'Complete_Info.txt' '<FastaFilesDir\>' '1'
matlab -r Features_Extraction '<LabelledSequenceFilesDir\>' '1'
matlab -r Train_TACELM_All 'Complete_Info.txt' '<TrainFeaturesDir\>' '1'
matlab -r Test_TACELM_All 'Complete_Info.txt' '<ModelFilesDir\>' '<TestFeaturesDir\>' 'TestTrueLabels.txt' 'Level_Info.txt' '1'
Evaluation code for BLAST and TAC-ELM
The following Python and Matlab scripts evaluate the BLAST and TAC-ELM results reported in the paper. The compressed file can be downloaded from here.
1- Run BLAST on your local machine as "blastn -db nt -query all.5x100bp.1.txt -evalue 0.001 -outfmt 6 -out all_100_1_blast.txt". You can download the output file.
2- Parse_Blast_Result.py: This script takes two arguments: an input file (the output from BLAST) and an output file. It generates the taxonomy of each sample.
3- Sort_Accession_BLAST.m: This script takes three arguments: a parsed BLAST result file produced by the Parse_Blast_Result.py script, a TAC-ELM result output file, and an integer value that defines the taxonomy level. It sorts the accession numbers of the TAC-ELM and BLAST results to make them consistent.
Example: matlab -r Sort_Accession_BLAST '<ParseBlastResult>' '<TacElmResults>' '1'
4- Evaluate_All.m: This script takes three arguments: a TAC-ELM result output file, a sorted BLAST result output file from the Sort_Accession_BLAST.m script, and an integer value that defines the taxonomy level. It outputs the classification performance of all the methods.
Example: matlab -r Evaluate_All '<TacElmResults>' '<SortedBlastResults>' '1'.
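The evaluation steps above can be sketched in Python as follows. The sketch reads the tabular output produced by "-outfmt 6" (whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), keeps the highest-scoring hit per read, and computes an accuracy against true labels. The function names are hypothetical, and both the best-hit rule and the choice to count unclassified reads as errors are assumptions; the distributed scripts may score results differently.

```python
def best_hits(blast_lines):
    """Map each query id to the subject id of its highest-bitscore hit."""
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def accuracy(predicted, truth):
    """Fraction of reads in `truth` whose predicted label matches.

    Assumption: a read with no prediction counts as misclassified.
    """
    if not truth:
        return 0.0
    correct = sum(1 for read, label in truth.items()
                  if predicted.get(read) == label)
    return correct / len(truth)
```

In use, the subject ids returned by best_hits would still have to be mapped to a taxon name at the chosen level (for example through the accession spreadsheet) before being passed to accuracy alongside the TAC-ELM predictions.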
Supplementary paper
The supplementary paper is available here.
Contact
If you have any questions, please contact zrasheed[at]gmu.edu or hrangwal@gmu.edu.
Cite
Please use the following reference in citing TAC-ELM: (BibTex)
Zeehasham Rasheed and Huzefa Rangwala. Metagenomic taxonomic classification with extreme learning machines. (Under Review)