Huzefa Rangwala @ Computer Science, George Mason University

My research has led to the development of several software packages and web servers, which are made available to the academic research community.

DMGrader: Data Mining Grader

DMGrader is a customizable, web-based framework for hosting data analytics competitions, written in Python using Django. Rangwala has used it to teach data science at George Mason University and to host data analytics hackathons. This software is under continuous development.

NTSGP: Next-Term Student Grade Prediction Toolkit

NTSGP, funded by NSF BIG Data project No. 1447489, is a comprehensive set of Python-based programs for implementing an educational recommender system. Please see the paper here for details.

HierCost: Hierarchical Cost Sensitive Learning

The HierCost toolkit is a set of programs, written in Python, for supervised single-label and multi-label hierarchical classification using a cost-sensitive logistic regression classifier.

MC-MinH and MrMC-MinH: Metagenome Clustering using Minwise based Hashing

MC-MinH is a metagenome analysis toolkit. The MC-MinH algorithm uses min-wise hashing together with a greedy clustering algorithm to group 16S and whole-metagenome sequences. We represent sequences of unequal length by their contiguous subsequences (k-mers), and then approximate the pairwise similarity using independent min-wise hash functions. The algorithm is written in C and is available under the GNU GPL license.

MrMC-MinH is a MapReduce-based algorithm for metagenome clustering using min-wise hashing. It extends our previously developed greedy clustering algorithm, MC-MinH (http://www.cs.gmu.edu/~mlbio/MC-MinH/). The algorithm is written in Java and Pig.
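The min-wise hashing idea described above can be sketched in a few lines of Python. This is only an illustration, not the MC-MinH code (which is written in C): the function names, the default k-mer length, the number of hash functions, and the use of salted BLAKE2 as a stand-in for a family of independent hash functions are all assumptions made for the example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of contiguous length-k subsequences (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _seeded_hash(kmer, seed):
    # Salted BLAKE2 stands in for one member of a family of hash functions
    # (an assumption for this sketch, not the hash used by MC-MinH).
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(seq, k=5, num_hashes=64):
    """Keep, for each hash function, the minimum hash value over all k-mers."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_seeded_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """The fraction of matching signature entries estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def jaccard(seq_a, seq_b, k=5):
    """Exact Jaccard similarity of the two k-mer sets, for comparison."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)
```

Because signatures have a fixed length regardless of sequence length, two sequences of unequal length can be compared in time proportional to the number of hash functions. A greedy clustering step could then assign each sequence to the first cluster whose representative exceeds a chosen similarity threshold.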

LSH-Div: Species Diversity Estimation using Locality Sensitive Hashing

LSH-Div is a metagenome analysis toolkit. The clustering algorithm groups sequences into Operational Taxonomic Units (OTUs) using a locality sensitive hashing (LSH) function within a greedy, iterative clustering framework. After assigning the sequences within a sample to different OTUs (or clusters), LSH-Div reports standard species richness metrics such as the Chao1 Index, the Shannon Diversity Index, and the Abundance-based Coverage Estimator (ACE) Index.
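As a rough illustration of the metrics mentioned above, the following Python sketch computes the Shannon Diversity Index and the bias-corrected form of Chao1 from per-OTU sequence counts. It is not taken from LSH-Div; the function names are hypothetical, the choice of the bias-corrected Chao1 variant is an assumption, and ACE is omitted for brevity.

```python
import math
from collections import Counter


def otu_abundances(assignments):
    """Count the sequences assigned to each OTU, given one OTU label per sequence."""
    return list(Counter(assignments).values())


def shannon_index(abundances):
    """Shannon diversity H = -sum(p_i * ln p_i) over OTU proportions p_i."""
    total = sum(abundances)
    return -sum((n / total) * math.log(n / total) for n in abundances if n > 0)


def chao1(abundances):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)),
    where F1 and F2 are the numbers of singleton and doubleton OTUs."""
    observed = sum(1 for n in abundances if n > 0)
    singletons = sum(1 for n in abundances if n == 1)
    doubletons = sum(1 for n in abundances if n == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))
```

For example, `chao1(otu_abundances(labels))` estimates total richness for a sample whose sequences carry the OTU labels in `labels`; many singleton OTUs relative to doubletons push the estimate above the observed OTU count.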

TAC-ELM: Taxonomic Classification with Extreme Learning Machines

TAC-ELM is a metagenome analysis toolkit. It is a new taxonomic classification scheme that extracts composition-based features (oligonucleotides and GC content) from short sequence reads and develops a neural network-based model. The parameters of the model are learned using an analytical framework called the extreme learning machine (ELM).
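The composition-based features mentioned above might be computed along the following lines. This is a hedged sketch rather than the TAC-ELM implementation: the function names, the use of normalized oligonucleotide frequencies, the default k = 2, and the handling of ambiguous bases are assumptions, and the ELM training step itself is not shown.

```python
from itertools import product


def gc_content(read):
    """Fraction of G and C bases in the read."""
    read = read.upper()
    return (read.count("G") + read.count("C")) / len(read) if read else 0.0


def oligo_frequencies(read, k=2):
    """Normalized counts of all 4**k oligonucleotides, in lexicographic order."""
    read = read.upper()
    keys = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in counts:  # skip k-mers containing ambiguous bases such as N
            counts[kmer] += 1
            total += 1
    return [counts[key] / total if total else 0.0 for key in keys]


def composition_features(read, k=2):
    """Fixed-length feature vector: oligonucleotide frequencies plus GC content."""
    return oligo_frequencies(read, k) + [gc_content(read)]
```

Every read maps to a vector of length 4**k + 1 regardless of its length, which is what allows short reads of varying size to be fed to a fixed-size network input layer.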

svmPRAT: svm-Based Protein Residue Annotation Toolkit

svmPRAT is a general-purpose protein residue annotation toolkit that allows biologists to formulate residue-wise prediction problems. svmPRAT formulates an annotation problem as a classification or regression problem using support vector machines. Its key feature is the ease with which it incorporates any user-provided information in the form of feature matrices. For every residue, svmPRAT captures local information around the residue to create fixed-length feature vectors. svmPRAT implements accurate and fast kernel functions, and also introduces a flexible window-based encoding scheme that better captures the signals relevant to certain prediction problems.
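A window-based encoding of the kind described above can be sketched as follows. This is an illustrative Python example, not svmPRAT's own code: the function name, the symmetric window of half-width `w`, and the zero padding at the sequence termini are assumptions.

```python
def window_encode(features, w, pad_value=0.0):
    """Build one fixed-length vector per residue from a per-residue feature matrix.

    features: list of L rows, each a list of d feature values for one residue.
    w: half-width of the window; each output vector has (2 * w + 1) * d entries.
    Positions that fall outside the sequence are filled with pad_value
    (an assumption made for this sketch).
    """
    if not features:
        return []
    d = len(features[0])
    pad = [pad_value] * d
    encoded = []
    for i in range(len(features)):
        vec = []
        for j in range(i - w, i + w + 1):
            vec.extend(features[j] if 0 <= j < len(features) else pad)
        encoded.append(vec)
    return encoded
```

For instance, a protein of length L with a 20-column profile and `w = 3` yields L vectors of 140 values each, which can be passed directly to a support vector machine for classification or regression.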

MONSTER: Minnesota prOteiN Sequence annoTation servER

MONSTER is a server for predicting the local structure and function properties of protein residues. MONSTER provides residue-wise annotation services that include secondary structure, transmembrane-helix region, disorder region, protein-DNA binding site, ligand-binding site, local structure alphabet, solvent accessibility surface area, and residue-wise contact order prediction. MONSTER uses sequence-derived information (in the form of PSI-BLAST profiles) and a window-based encoding scheme with an accurate kernel function to perform the classification or estimation. The user provides an amino acid sequence, selects the desired predictions, and submits a job to the MONSTER server. The results are emailed to the user as a link to a well-formatted HTML output page.

MARINER: MinnesotA pRotein modelINg servER

MARINER is a server for predicting the three-dimensional structure of proteins using homology modeling techniques. The server is under continuous development and was used to participate in the CASP8 protein structure prediction competition. Watch this space for a future version of this server. Students at George Mason who are interested in the competition are welcome to get in touch with me.

Profile-based Kernel Compute Package

kernel-compute is a package that computes a pairwise profile-based similarity matrix. This matrix can then be converted into a valid kernel matrix with an eigenvalue transformation. This scoring matrix has been shown to be the best-performing method for building remote homology detection and fold recognition models.