TAC-ELM (supplementary material)

TAC-ELM (Taxonomic Classification with Extreme Learning Machines) is a new taxonomic classification scheme that extracts composition-based features (oligonucleotide frequencies and GC content) from short sequence reads and builds a neural network-based model on them. The parameters of the model are learned with an analytical framework called the extreme learning machine (ELM).
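
To make the "analytical" training step concrete, here is a minimal Python sketch of an ELM as it is usually described: the hidden-layer weights are drawn at random and kept fixed, and only the output weights are obtained in closed form by (regularized) least squares. This is not the TAC-ELM Matlab code. The function names, the sigmoid activation, the hidden-layer size and the small ridge term are all assumptions made for illustration.

```python
import math
import random


def solve_linear(a, b):
    """Solve a * x = b (a square, b a matrix) by Gauss-Jordan elimination."""
    n = len(a)
    m = [list(row_a) + list(row_b) for row_a, row_b in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def hidden_layer(x, weights, biases):
    """Sigmoid activations of the fixed random hidden layer."""
    return [
        [1.0 / (1.0 + math.exp(-(sum(xi * wi for xi, wi in zip(row, w)) + b)))
         for w, b in zip(weights, biases)]
        for row in x
    ]


def train_elm(x, labels, n_classes, n_hidden=20, ridge=1e-3, seed=0):
    """Return (weights, biases, beta); only beta is fitted to the data."""
    rng = random.Random(seed)
    n_features = len(x[0])
    weights = [[rng.uniform(-1, 1) for _ in range(n_features)]
               for _ in range(n_hidden)]
    biases = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    h = hidden_layer(x, weights, biases)
    # One-hot target matrix T.
    t = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in labels]
    # Closed-form output weights: beta = (H^T H + ridge * I)^-1 H^T T.
    hth = [[sum(row[i] * row[j] for row in h) + (ridge if i == j else 0.0)
            for j in range(n_hidden)] for i in range(n_hidden)]
    htt = [[sum(hr[i] * tr[c] for hr, tr in zip(h, t))
            for c in range(n_classes)] for i in range(n_hidden)]
    beta = solve_linear(hth, htt)
    return weights, biases, beta


def predict_elm(model, x):
    """Predict the class index with the highest output score for each row."""
    weights, biases, beta = model
    h = hidden_layer(x, weights, biases)
    n_classes = len(beta[0])
    scores = [[sum(hj * beta[j][c] for j, hj in enumerate(row))
               for c in range(n_classes)] for row in h]
    return [max(range(n_classes), key=s.__getitem__) for s in scores]
```

In this sketch a call such as train_elm(features, labels, n_classes=2) returns the model tuple, and predict_elm(model, features) returns one class index per input row. No iterative optimization is involved, which is the property the description above refers to.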


Data Files


The following data files were used in the TAC-ELM experiments. You can download the compressed file here.

1- Complete Information: spreadsheet and text file containing the information for all levels (Phylum, Class, Order, Family, Genus, Species) used in the experiments.

2- Complete Accession: spreadsheet containing all the class labels used to train the model. It also contains the accession number for each sample.

3- 100 bp Train Files: 100 bp fasta files generated with MetaSim from the original genomes in the NCBI RefSeq repository.

4- 100 bp Test Files: A set of 10 test files from the PhymmBL paper.


Source Code

TAC-ELM is written in Matlab. The compressed source code of TAC-ELM and other files can be downloaded from here. Each file is described below.

1- Generate_Data.m: This file takes three input arguments: an information text file, the path of a folder that contains fasta files, and an integer that defines the taxonomy level (1 for the Phylum level, 2 for the Class level, and so on). It generates one output file for each file in the fasta folder and appends a class label to each sequence.
Example: Generate_Data 'Complete_Info.txt' 'FastaFiles\' '2'
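
The labelling step can be pictured with the following Python sketch. It is only an illustration of mapping the integer level to a taxonomic rank and attaching the resulting label to each sequence. The layout of the information file, the "|" separator in the header line and the accession "ACC_0001" are hypothetical and are not taken from Generate_Data.m.

```python
# Rank order as listed for the data files: level 1 = Phylum, 2 = Class, ...
LEVELS = ["Phylum", "Class", "Order", "Family", "Genus", "Species"]


def label_for(taxonomy, accession, level):
    """Return the taxon name of `accession` at the integer `level`."""
    if not 1 <= level <= len(LEVELS):
        raise ValueError("level must be between 1 and %d" % len(LEVELS))
    return taxonomy[accession][level - 1]


def label_fasta(fasta_text, taxonomy, accession, level):
    """Append the class label to every header line of a fasta text."""
    label = label_for(taxonomy, accession, level)
    labelled = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Assumed convention: the label is added as a suffix to the header.
            labelled.append(line + "|" + label)
        else:
            labelled.append(line)
    return "\n".join(labelled)


# Hypothetical accession with the lineage of Escherichia coli.
taxonomy = {
    "ACC_0001": ["Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
                 "Enterobacteriaceae", "Escherichia", "Escherichia coli"],
}
print(label_fasta(">read1\nACGT", taxonomy, "ACC_0001", 2))
```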


2- Feature_Extraction.m: This file takes two input arguments: the path of a folder that contains labeled files (from Generate_Data.m) and an integer that selects the feature type (1 for the simple tetranucleotide frequency, i.e. k-mers with k = 4, and 2 for the Markov chain tetranucleotide frequency). It generates a .mat file for each labeled file in the folder.
Example: Features_Extraction 'LabelledSequenceFiles\' '1'
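
As an illustration of the simple composition features (option 1), the Python sketch below computes the normalized tetranucleotide frequency vector and the GC content of a single read. It is not the Matlab implementation. The function names and the choice to skip windows containing characters other than A, C, G and T are assumptions, and the Markov chain variant is not shown.

```python
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return the 4**k normalized k-mer frequencies in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # windows with N or other symbols are skipped
            counts[window] += 1
    total = sum(counts.values())
    return [counts[km] / total if total else 0.0 for km in kmers]


def gc_content(seq):
    """Fraction of G and C among the A, C, G and T characters of the read."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


read = "ACGTACGT"
features = kmer_frequencies(read) + [gc_content(read)]
print(len(features))  # 256 tetranucleotide frequencies plus GC content -> 257
```

A 100 bp read contains 97 overlapping 4-mers, so each such vector is sparse. Whether the original code normalizes the counts in the same way is not stated here.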


3- Train_TACELM_All.m: This file takes three arguments: an information text file, the path of a folder containing the training .mat files, and an integer that defines the taxonomy level. It generates a model file for TAC-ELM.
Example: Train_TACELM_All 'Complete_Info.txt' 'TrainFeatures\' '1'


4- Test_TACELM_All.m: This file takes six arguments: an information text file, the path of the train .mat file folder, the path of the test .mat file folder, a true label text file, a level information file that contains the phylum, class, order, family and genus names, and an integer that defines the taxonomy level. It generates a prediction file for TAC-ELM.
Example: Test_TACELM_All 'Complete_Info.txt' 'ModelFiles\' 'TestFeatures\' 'TestTrueLabels.txt' 'Level_Info.txt' '1'



Example

A simple example of TAC-ELM execution is as follows.

matlab -r "Generate_Data 'Complete_Info.txt' '<FastaFilesDir\>' '1'"

matlab -r "Features_Extraction '<LabelledSequenceFilesDir\>' '1'"

matlab -r "Train_TACELM_All 'Complete_Info.txt' '<TrainFeaturesDir\>' '1'"

matlab -r "Test_TACELM_All 'Complete_Info.txt' '<ModelFilesDir\>' '<TestFeaturesDir\>' 'TestTrueLabels.txt' 'Level_Info.txt' '1'"


Evaluation code for BLAST and TAC-ELM

The following scripts were used to evaluate the BLAST and TAC-ELM results in the paper. The compressed file can be downloaded from here.

1- Run BLAST on your local machine as "blastn -db nt -query all.5x100bp.1.txt -evalue 0.001 -outfmt 6 -out all_100_1_blast.txt". You can download the output file.

2- Parse_Blast_Result.py: This script takes two arguments: an input file (the output from BLAST) and an output file. It generates the taxonomy of each sample.
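
For orientation, the tabular output produced by "-outfmt 6" has twelve tab-separated columns by default. The Python sketch below reads such output and keeps the highest-scoring hit per query read. It is not Parse_Blast_Result.py. The function name and the "best bit score" selection rule are assumptions, and the step that maps a subject id to its taxonomy is not shown.

```python
# Default column order of BLAST tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(blast_text):
    """Map each query id to the subject id of its highest bit-score hit."""
    best = {}
    for line in blast_text.splitlines():
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(BLAST6_COLUMNS):
            continue  # ignore blank or malformed lines
        hit = dict(zip(BLAST6_COLUMNS, fields))
        score = float(hit["bitscore"])
        query = hit["qseqid"]
        if query not in best or score > best[query][1]:
            best[query] = (hit["sseqid"], score)
    return {query: subject for query, (subject, _) in best.items()}


# Hypothetical hits: two for read1 and one for read2.
example = "\n".join([
    "read1\tsubjA\t95.0\t100\t5\t0\t1\t100\t201\t300\t1e-40\t100",
    "read1\tsubjB\t99.0\t100\t1\t0\t1\t100\t501\t600\t1e-50\t150",
    "read2\tsubjC\t90.0\t100\t10\t0\t1\t100\t11\t110\t1e-30\t80.5",
])
print(best_hits(example))
```

Reads for which BLAST reports no hit do not appear in the output file at all, so they are absent from the returned dictionary.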

3- Sort_Accession_BLAST.m: This script takes three arguments: a parsed BLAST result output file from the Parse_Blast_Result.py script, a TAC-ELM result output file, and an integer that defines the taxonomy level. It sorts the accession numbers of the TAC-ELM and BLAST results to make them consistent.
Example: matlab -r "Sort_Accession_BLAST '<ParseBlastResult>' '<TacElmResults>' '1'"


4- Evaluate_All.m: This script takes three arguments: a TAC-ELM result output file, a sorted BLAST result output file from the Sort_Accession_BLAST.m script, and an integer that defines the taxonomy level. It outputs the classification performance of all the methods.
Example: matlab -r "Evaluate_All '<TacElmResults>' '<SortedBlastResults>' '1'"
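
The exact measures reported by Evaluate_All.m are not listed here. As a rough Python illustration of how two methods can be compared on the same sorted reads, the sketch below computes sensitivity (correct predictions over all reads) and precision (correct predictions over the reads that received a prediction). These metric definitions, the use of None for an unclassified read and the function name are assumptions.

```python
def sensitivity_precision(true_labels, predicted_labels):
    """Return (sensitivity, precision); None marks an unclassified read."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    classified = [(t, p) for t, p in zip(true_labels, predicted_labels)
                  if p is not None]
    correct = sum(1 for t, p in classified if t == p)
    sensitivity = correct / len(true_labels) if true_labels else 0.0
    precision = correct / len(classified) if classified else 0.0
    return sensitivity, precision


# Hypothetical labels for four reads, listed in the same order for both methods.
truth = ["A", "A", "B", "B"]
method_one = ["A", "A", "B", "A"]   # every read classified
method_two = ["A", None, "B", "A"]  # one read left unclassified
for name, preds in [("method one", method_one), ("method two", method_two)]:
    print(name, sensitivity_precision(truth, preds))
```

Both prediction lists must be in the same read order as the true labels, which is why the accession numbers are sorted beforehand.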



Supplementary paper

The supplementary paper is available here.


Contact

If you have any questions, please contact zrasheed[at]gmu.edu or hrangwal@gmu.edu.


Cite

Please use the following reference in citing TAC-ELM: (BibTex)
Zeehasham Rasheed and Huzefa Rangwala. Metagenomic taxonomic classification with extreme learning machines. (Under Review)