MLBio+Laboratory

# Bioinformatics & Data Mining

### PrePrint: Parsing Facades with Shape Grammars and Reinforcement Learning

TPAMI - Sun, 03/31/2013 - 18:59
In this work we use Shape Grammars for Facade Parsing, which amounts to segmenting 2D building facades into balconies, walls, windows and doors in an architecturally meaningful manner. The main thrust of our work is the introduction of Reinforcement Learning (RL) techniques to deal with the computational complexity of the problem. RL provides us with efficient tools such as Q-learning and state aggregation which we exploit to accelerate facade parsing. We initially phrase the 1D parsing problem in terms of Markov Decision Processes, paving the way for the application of RL-based tools. We then develop novel techniques for the 2D shape parsing problem that take into account the specificities of the facade parsing problem. Specifically, we use state aggregation to enforce the symmetry of facade floors and also demonstrate that RL can seamlessly exploit bottom-up, image-based guidance during optimization. We provide systematic results on the Paris building dataset and obtain state-of-the-art results in a fraction of the time required by previous methods. We validate our method under diverse imaging conditions and make our software and results available online.
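
As a rough illustration of the Q-learning tool mentioned in the abstract, the following Python sketch applies one generic tabular Q-learning update. The state and action names, the reward, and the `alpha` and `gamma` values are hypothetical and are not taken from the paper; this is the textbook update rule, not the authors' facade parser.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q-value."""
    # Value of the best action available in the next state (0 if terminal).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    # Move the current estimate toward the bootstrapped target.
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]

# Hypothetical toy example: choosing where to split a 1D facade strip.
q = defaultdict(float)
q[("floor", "split_at_3")] = 1.0
new_value = q_update(q, "wall", "split_at_1", reward=2.0,
                     next_state="floor",
                     next_actions=["split_at_3", "split_at_5"])
print(new_value)
```

Storing Q-values in a dictionary keyed by (state, action) keeps the sketch short; state aggregation, as described in the abstract, would amount to mapping several raw states to the same key.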

### PrePrint: Color Invariants for Person Re-Identification

TPAMI - Sun, 03/31/2013 - 18:59
We revisit the problem of specific object recognition using color distributions. In some applications - such as specific person identification - it is highly likely that the color distributions will be multimodal and hence contain a special structure. Although the color distribution changes under different lighting conditions, some aspects of its structure turn out to be invariants. We refer to this structure as an intra-distribution structure, and show that it is invariant under a wide range of imaging conditions while being discriminative enough to be practical. Our signature uses shape context descriptors to represent the intra-distribution structure. Assuming the widely used diagonal model, we validate that our signature is invariant under certain illumination changes. Experimentally, we use color information as the only cue to obtain good recognition performance on publicly available databases covering both indoors and outdoors conditions. Combining our approach with the complementary covariance descriptor, we demonstrate results exceeding the state of the art performance on the challenging VIPeR and CAVIAR4REID databases.

### PrePrint: Bayesian Estimation of Turbulent Motion

TPAMI - Sun, 03/31/2013 - 18:59
Based on physical laws describing the multi-scale structure of turbulent flows, this article proposes a regularizer for fluid motion estimation from an image sequence. Regularization is achieved by imposing some scale invariance property between histograms of motion increments computed at different scales. By reformulating this problem from a Bayesian perspective, an algorithm is proposed to jointly estimate motion, regularization hyper-parameters, and to select the most likely physical prior among a set of models. Hyper-parameter and model inference is conducted by posterior maximization, obtained by marginalizing out non-Gaussian motion variables. The Bayesian estimator is assessed on several image sequences depicting synthetic and real turbulent fluid flows. Results obtained with the proposed approach exceed the state of the art results in fluid flow estimation.

### PrePrint: Automatic Relevance Determination in Nonnegative Matrix Factorization with the $\beta$-Divergence

TPAMI - Sun, 03/31/2013 - 18:59
This paper addresses the estimation of the latent dimensionality in nonnegative matrix factorization (NMF) with the $\beta$-divergence. The $\beta$-divergence is a family of cost functions that includes the squared Euclidean distance, Kullback-Leibler and Itakura-Saito divergences as special cases. Learning the model order is important as it is necessary to strike the right balance between data fidelity and overfitting. We propose a Bayesian model based on *automatic relevance determination* in which the columns of the dictionary matrix and the rows of the activation matrix are tied together through a common scale parameter in their prior. A family of majorization-minimization algorithms is proposed for maximum a posteriori (MAP) estimation. A subset of scale parameters is driven to a small lower bound in the course of inference, with the effect of pruning the corresponding spurious components. We demonstrate the efficacy and robustness of our algorithms by performing extensive experiments on synthetic data, the "swimmer" dataset, a music decomposition example and a stock price prediction task.
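
To make the special cases concrete, here is a minimal Python sketch of the scalar $\beta$-divergence under one common normalization, which is an assumption here since the abstract does not state the formula. With this convention, $\beta = 2$ gives half the squared Euclidean distance, $\beta = 1$ the generalized Kullback-Leibler divergence, and $\beta = 0$ the Itakura-Saito divergence. The function name is illustrative.

```python
import math

def beta_divergence(x, y, beta):
    """Scalar beta-divergence d_beta(x | y) for x, y > 0."""
    if beta == 1:
        # Limit case: generalized Kullback-Leibler divergence.
        return x * math.log(x / y) - x + y
    if beta == 0:
        # Limit case: Itakura-Saito divergence.
        return x / y - math.log(x / y) - 1
    # General case, valid for any other real beta.
    numerator = x**beta + (beta - 1) * y**beta - beta * x * y**(beta - 1)
    return numerator / (beta * (beta - 1))

euclidean_case = beta_divergence(3.0, 1.0, 2)  # half of (3 - 1)^2
kl_case = beta_divergence(3.0, 1.0, 1)
is_case = beta_divergence(3.0, 1.0, 0)
print(euclidean_case, kl_case, is_case)
```

For matrices, the cost is usually the sum of this scalar divergence over all entries of the data and its approximation.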

### PrePrint: Coupled Gaussian Processes for Pose-Invariant Facial Expression Recognition

TPAMI - Sun, 03/31/2013 - 18:59
We propose a method for head-pose invariant facial expression recognition that is based on a set of characteristic facial points. To achieve head-pose invariance, we propose the Coupled Scaled Gaussian Process Regression (CSGPR) model for head-pose normalization. In this model, we first learn independently the mappings between the facial points in each pair of (discrete) non-frontal poses and the frontal pose, and then perform their coupling in order to capture dependencies between them. During inference, the outputs of the coupled functions from different poses are combined using a gating function, devised based on the head-pose estimation for the query points. The proposed model outperforms state-of-the-art regression-based approaches to head-pose normalization, 2D and 3D Point Distribution Models (PDMs), and Active Appearance Models (AAMs), especially in cases of unknown poses and imbalanced training data. To the best of our knowledge, the proposed method is the first one that is able to deal with expressive faces in the range from $-45^\circ$ to $+45^\circ$ pan rotation and $-30^\circ$ to $+30^\circ$ tilt rotation, and with continuous changes in head pose, despite the fact that training was conducted on a small set of discrete poses. We evaluate the proposed method on synthetic and real images depicting acted and spontaneously displayed facial expressions.

### PrePrint: Learning Hierarchical Features for Scene Labeling

TPAMI - Sun, 03/31/2013 - 18:59
Scene labeling consists in labeling each pixel in an image with the category of the object it belongs to. We propose a method that uses a multiscale convolutional network trained from raw pixels to extract dense feature vectors that encode regions of multiple sizes centered on each pixel. The method alleviates the need for engineered features, and produces a powerful representation that captures texture, shape and contextual information. We report results using multiple post-processing methods to produce the final labeling. Among those, we propose a technique to automatically retrieve, from a pool of segmentation components, an optimal set of components that best explain the scene; these components are arbitrary, e.g. they can be taken from a segmentation tree, or from any family of over-segmentations. The system yields record accuracies on the Sift Flow Dataset (33 classes) and the Barcelona Dataset (170 classes) and near-record accuracy on Stanford Background Dataset (8 classes), while being an order of magnitude faster than competing approaches, producing a 320x240 image labeling in less than a second, including feature extraction.

### PrePrint: Heterogeneous Face Recognition using Kernel Prototype Similarities

TPAMI - Sun, 03/31/2013 - 18:59
Heterogeneous face recognition (HFR) involves matching two face images from alternate imaging modalities, such as an infrared image to a photograph, or a sketch to a photograph. Accurate HFR systems are of great value in various applications (e.g., forensics and surveillance), where the gallery databases are populated with photographs (e.g., mug shot or passport photographs) but the probe images are often limited to some alternate modality. A generic HFR framework is proposed in which both probe and gallery images are represented in terms of non-linear similarities to a collection of prototype face images. The prototype subjects (i.e., the training set) have an image in each modality (probe and gallery), and the similarity of an image is measured against the prototype images from the corresponding modality. The accuracy of this non-linear prototype representation is improved by projecting the features into a linear discriminant subspace. Random sampling is introduced into the HFR framework to better handle challenges arising from the small sample size problem. The merits of the proposed approach, called Prototype Random Subspace (P-RS), are demonstrated on four different heterogeneous scenarios: (i) near infrared to photograph, (ii) thermal to photograph, (iii) viewed sketch to photograph, and (iv) forensic sketch to photograph.
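
The prototype representation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the RBF kernel, the `gamma` value, the two-dimensional feature vectors and all names are hypothetical, and the discriminant projection and random sampling steps are omitted.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """Non-linear similarity between two feature vectors (assumed RBF kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def prototype_representation(features, prototypes, gamma=0.5):
    """Describe an image by its similarity to each same-modality prototype."""
    return [rbf_kernel(features, p, gamma) for p in prototypes]

# Each prototype subject has one image per modality (toy 2D features).
sketch_prototypes = [[0.0, 0.0], [1.0, 1.0]]
photo_prototypes = [[5.0, 5.0], [6.0, 6.0]]

# A probe sketch is compared with sketch prototypes, a gallery photo with
# photo prototypes; the two resulting vectors share one coordinate per subject.
probe_rep = prototype_representation([0.0, 0.0], sketch_prototypes)
gallery_rep = prototype_representation([5.0, 5.0], photo_prototypes)
print(probe_rep, gallery_rep)
```

Because both vectors are indexed by prototype subject, they can be compared directly even though the raw images come from different modalities.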

### PrePrint: Fourier Lucas-Kanade Algorithm

TPAMI - Sun, 03/31/2013 - 18:59
In this paper we propose a framework for both gradient descent image and object alignment in the Fourier domain. Our method centers upon the classical Lucas & Kanade (LK) algorithm where we represent the source and template/model in the complex 2D Fourier domain rather than in the spatial 2D domain. We refer to our approach as the Fourier LK (FLK) algorithm. The FLK formulation is advantageous when one pre-processes the source image and template/model with a bank of filters (e.g. oriented edges, Gabor, etc.) as: (i) it can handle substantial illumination variations, (ii) the inefficient pre-processing filter bank step can be subsumed within the FLK algorithm as a sparse diagonal weighting matrix, (iii) unlike traditional LK the computational cost is invariant to the number of filters and as a result far more efficient, and (iv) this approach can be extended to the inverse compositional form of the LK algorithm where nearly all steps (including Fourier transform and filter bank pre-processing) can be pre-computed leading to an extremely efficient and robust approach to gradient descent image matching. Further, these computational savings translate to non-rigid object alignment tasks that are considered extensions of the LK algorithm such as those found in Active Appearance Models (AAMs).

### PrePrint: Guided Image Filtering

TPAMI - Sun, 03/31/2013 - 18:59
In this paper we propose a novel explicit image filter called guided filter. Derived from a local linear model, the guided filter computes the filtering output by considering the content of a guidance image, which can be the input image itself or another different image. The guided filter can be used as an edge-preserving smoothing operator like the popular bilateral filter, but has better behaviors near edges. The guided filter is also a more generic concept beyond smoothing: it can transfer the structures of the guidance image to the filtering output, enabling new filtering applications like dehazing and guided feathering. Moreover, the guided filter naturally has a fast and non-approximate linear time algorithm, regardless of the kernel size and the intensity range. Currently it is one of the fastest edge-preserving filters. Experiments show that the guided filter is both effective and efficient in a great variety of computer vision and computer graphics applications including edge-aware smoothing, detail enhancement, HDR compression, image matting/feathering, dehazing, joint upsampling, etc.
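
The local linear model can be illustrated with a one-dimensional, pure-Python simplification. The function names, the border handling (windows truncated at the ends) and the parameter values are assumptions made for this sketch, which is not the authors' reference implementation. Prefix sums keep the cost linear in the signal length regardless of the window radius.

```python
def box_mean(values, radius):
    """Mean over a sliding window of the given radius, truncated at the borders."""
    n = len(values)
    # Prefix sums make each window sum O(1), so the whole pass is O(n).
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return out

def guided_filter_1d(guide, src, radius, eps):
    """1D guided filter: the output is locally a linear function a * guide + b."""
    mean_i = box_mean(guide, radius)
    mean_p = box_mean(src, radius)
    corr_ii = box_mean([g * g for g in guide], radius)
    corr_ip = box_mean([g * p for g, p in zip(guide, src)], radius)
    a, b = [], []
    for mi, mp, cii, cip in zip(mean_i, mean_p, corr_ii, corr_ip):
        var_i = cii - mi * mi          # local variance of the guide
        cov_ip = cip - mi * mp         # local covariance of guide and source
        ak = cov_ip / (var_i + eps)    # eps controls the amount of smoothing
        a.append(ak)
        b.append(mp - ak * mi)
    # Average the coefficients of all windows that cover each sample.
    mean_a = box_mean(a, radius)
    mean_b = box_mean(b, radius)
    return [ma * g + mb for ma, mb, g in zip(mean_a, mean_b, guide)]

# Self-guided filtering of a step edge: with a small eps the edge is kept.
step = [0.0] * 5 + [1.0] * 5
filtered = guided_filter_1d(step, step, radius=1, eps=1e-6)
print(filtered)
```

In flat regions the local variance is near zero, so `a` is near zero and the output falls back to the local mean; near a strong edge `a` approaches one and the guide's structure passes through.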

### IEEE Transactions on Pattern Analysis and Machine Intelligence - May 2013 (Vol. 35, No. 5)

TPAMI - Sun, 03/31/2013 - 18:59
IEEE Transactions on Pattern Analysis and Machine Intelligence

### Fast simulation of reconstructed phylogenies under global, time-dependent birth-death processes

Bioinformatics - Sat, 03/30/2013 - 01:10

Motivation: Diversification rates and patterns may be inferred from reconstructed phylogenies. Both the time-dependent as well as the diversity-dependent birth-death process can produce the same observed patterns of diversity over time. To develop and test new models describing the macro-evolutionary process of diversification, generic and fast algorithms to simulate under these models are necessary. Simulations are not only important for testing and developing models but play an influential role in the assessment of model fit.

Results: In the present paper I consider as the model a global, time-dependent birth-death process where each species has the same rates but rates may vary over time. For this model I derive the likelihood of the speciation times from a reconstructed phylogenetic tree and show that each speciation event is independent and identically distributed. This fact can be used to simulate efficiently reconstructed phylogenetic trees when conditioning on the number of species, the time of the process or both. I show the usability of the simulation by approximating the posterior predictive distribution of a birth-death process with decreasing diversification rates applied on a published bird phylogeny (family Cettiidae).

Availability: The methods described in this manuscript are implemented in the R package TESS, available from the repository CRAN (http://cran.r-project.org/web/packages/TESS/).

Contact: hoehna@math.su.se

### FunFrame: functional gene ecological analysis pipeline

Bioinformatics - Fri, 03/29/2013 - 08:22

Summary: Pyrosequencing of 16S rDNA is widely used to study microbial communities, and a rich set of software tools support this analysis. Pyrosequencing of protein-coding genes, which can help elucidate functional differences among microbial communities, significantly lags behind 16S rDNA in availability of sequence analysis software. In both settings, frequent homopolymer read errors inflate the estimation of microbial diversity, and de-noising is required to reduce that bias. Here we describe FunFrame, an R-based data-analysis pipeline that uses recently described algorithms to de-noise functional gene pyrosequences and performs ecological analysis on de-noised sequence data. The novelty of this pipeline is that it provides users a unified set of tools, adapted from disparate sources and designed for different applications, that can be used to examine a particular protein coding gene of interest. We evaluated FunFrame on functional genes from four PCR-amplified clones with sequence depths ranging from 9084 to 14 494 sequences. FunFrame produced from one to nine OTUs for each clone, resulting in an error rate ranging from 0 to 0.18%. Importantly, FunFrame reduced spurious diversity while retaining more sequences than a commonly used de-noising method that discards sequences with frameshift errors.

Availability: Software, documentation and a complete set of sample data files are available at http://faculty.www.umb.edu/jennifer.bowen software/FunFrame.zip.

Contact: Jennifer.Bowen@umb.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

### NTFD--a stand-alone application for the non-targeted detection of stable isotope-labeled compounds in GC/MS data

Bioinformatics - Fri, 03/29/2013 - 08:22

Summary: Most current stable isotope-based methodologies are targeted and focus only on the well-described aspects of metabolic networks. Here, we present NTFD (non-targeted tracer fate detection), a software for the non-targeted analysis of all detectable compounds derived from a stable isotope-labeled tracer present in a GC/MS dataset. In contrast to traditional metabolic flux analysis approaches, NTFD does not depend on any a priori knowledge or library information. To obtain dynamic information on metabolic pathway activity, NTFD determines mass isotopomer distributions for all detected and labeled compounds. These data provide information on relative fluxes in a metabolic network. The graphical user interface allows users to import GC/MS data in netCDF format and export all information into a tab-separated format.

Availability: NTFD is C++- and Qt4-based, and it is freely available under an open-source license. Pre-compiled packages for the installation on Debian- and Redhat-based Linux distributions, as well as Windows operating systems, along with example data, are provided for download at http://ntfd.mit.edu/.

Contact: gregstep@mit.edu

### Enabling interspecies epigenomic comparison with CEpBrowser

Bioinformatics - Fri, 03/29/2013 - 08:22

Summary: We developed the Comparative Epigenome Browser (CEpBrowser) to allow the public to perform multi-species epigenomic analysis. The web-based CEpBrowser integrates, manages and visualizes sequencing-based epigenomic datasets. Five key features were developed to maximize the efficiency of interspecies epigenomic comparisons.

Availability: CEpBrowser is a web application implemented with PHP, MySQL, C and Apache. URL: http://www.cepbrowser.org/.

Contact: szhong@ucsd.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

### FishingCNV: a graphical software package for detecting rare copy number variations in exome sequencing data

Bioinformatics - Thu, 03/28/2013 - 08:49

Summary: Rare copy number variations (CNVs) are frequent causes of genetic diseases. We developed a graphical software package based on a novel approach that can consistently identify CNVs of all types (homozygous deletions, heterozygous deletions, heterozygous duplications) from exome sequencing data without the need of a paired control. The algorithm compares coverage depth in a test sample against a background distribution of control samples and uses principal component analysis to remove batch effects. It is user friendly and can be run on a personal computer.
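
The comparison against a background distribution can be pictured with a simple per-exon z-score, sketched below in Python. This is a hedged illustration only: the z-score statistic, the normalized coverage values and the function name are assumptions, the principal component step is omitted, and the actual package is implemented in R and may use a different test.

```python
from statistics import mean, stdev

def coverage_z_scores(test, controls):
    """Z-score of each exon's test coverage against the control samples."""
    scores = []
    for exon_index, depth in enumerate(test):
        background = [sample[exon_index] for sample in controls]
        # Assumes the controls vary at this exon (non-zero standard deviation).
        mu, sigma = mean(background), stdev(background)
        scores.append((depth - mu) / sigma)
    return scores

# Hypothetical normalized coverage: three control samples, three exons.
controls = [
    [1.0, 1.1, 0.9],
    [1.1, 0.9, 1.0],
    [0.9, 1.0, 1.1],
]
test_sample = [1.0, 0.5, 1.0]  # the second exon has about half the coverage
z = coverage_z_scores(test_sample, controls)
print(z)
```

A strongly negative score at an exon, as for the second exon here, is the kind of signal that would suggest a deletion; a strongly positive one would suggest a duplication.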

Availability and Implementation: The main scripts are implemented in R (2.15), and the GUI is created using Java 1.6. It can be run on all major operating systems. A non-GUI version for pipeline implementation is also available. The program is freely available online: https://sourceforge.net/projects/fishingcnv/

Contact: yuhao.shi@mail.mcgill.ca

Supplementary Information:

### MCScanX-transposed: detecting transposed gene duplications based on multiple colinearity scans

Bioinformatics - Thu, 03/28/2013 - 08:49

Summary: Gene duplication occurs via different modes such as segmental and single-gene duplications. Transposed gene duplication, a specific form of single-gene duplication, ‘copies’ a gene from an ancestral chromosomal location to a novel location. MCScanX is a toolkit for detection and evolutionary analysis of gene colinearity. We have developed MCScanX-transposed, a software package to detect transposed gene duplications that occurred within different epochs, based on execution of MCScanX within and between related genomes. MCScanX-transposed can be also used for integrative analysis of gene duplication modes for a genome and to annotate a gene family of interest with gene duplication modes.

Availability: MCScanX-transposed is freely available at http://chibba.pgml.uga.edu/mcscan2/transposed/

Contact: paterson@plantbio.uga.edu

### Density-based hierarchical clustering of pyro-sequences on a large scale - the case of fungal ITS1

Bioinformatics - Thu, 03/28/2013 - 08:49

Motivation: Analysis of millions of pyro-sequences is currently playing a crucial role in the advance of environmental microbiology. Taxonomy independent, i.e. unsupervised clustering of these sequences is essential for the definition of Operational Taxonomic Units. For this application, reproducibility and robustness should be the most sought after qualities, but have so far largely been overlooked.

Results: Over one million hyper-variable ITS1 sequences of fungal origin have been analyzed. The ITS1 sequences were first properly extracted from 454 reads using generalized profiles. Then, otupipe, cd-hit-454, ESPRIT-Tree and DBC454, a new algorithm presented here, were used to analyze the sequences. A numerical assay was developed to measure the reproducibility and robustness of these algorithms. DBC454 was the most robust, closely followed by ESPRIT-Tree. DBC454 features density-based hierarchical clustering, which complements the other methods by providing insights into the structure of the data.

Availability and Implementation: An executable is freely available for non-commercial users at ftp://ftp.vital-it.ch/tools/dbc454. It is designed to run under MPI on a cluster of 64-bit Linux machines running Red Hat 4.x, or on a multi-core OSX system.

Contact: dbc454@vital-it.ch

### An accessible database for mouse and human whole transcriptome qPCR primers

Bioinformatics - Thu, 03/28/2013 - 07:41

Motivation: Real time quantitative PCR (qPCR) is an important tool in quantitative studies of DNA and RNA molecules, especially in transcriptome studies, where different primer combinations allow identification of specific transcripts such as splice variants or precursor mRNA (pre-mRNA). Several software packages that implement various rules for optimal primer design are available. Nevertheless, since designing qPCR primers needs to be done manually, the repeated task is tedious, time consuming and prone to errors.

Results: We used a set of rules to automatically design all possible exon-exon and intron-exon junctions in the Human and Mouse transcriptomes. The resulting database is included as a track in the UCSC genome browser, making it widely accessible and easy to use.

Availability: The database is available from the UCSC genome browser (http://genome.ucsc.edu/), track name "Whole Transcriptome qPCR Primers" for the hg19 (Human) and mm10 (Mouse) genome versions. Batch query is available in: http://www.weizmann.ac.il/complex/compphys/software/Amit/primers/batchqueryqpcrprimers.htm

Contact: eytan.domany@weizmann.ac.il

### Improved ancestry inference using weights from external reference panels

Bioinformatics - Thu, 03/28/2013 - 07:41

Motivation: Inference of ancestry using genetic data is motivated by applications in genetic association studies, population genetics and personal genomics. Here, we provide methods and software for improved ancestry inference using genome-wide SNP weights from external reference panels. This approach makes it possible to leverage the rich ancestry information that is available from large external reference panels, without the administrative and computational complexities of re-analyzing the raw genotype data from the reference panel in subsequent studies.

Results: We extensively validate our approach in multiple African-American, Latino-American and European-American data sets, making use of genome-wide SNP weights derived from large reference panels, including HapMap 3 populations and 6,546 European Americans from the Framingham Heart Study. We show empirically that our approach provides much greater accuracy than either the prevailing Ancestry-Informative Markers (AIMs) approach or the analysis of genome-wide target genotypes without a reference panel. For example, in an independent set of 1,636 European American GWAS samples, we attained prediction accuracy ($R^2$) of 1.000 and 0.994 for the first two principal components (PCs) using our method, compared to 0.418 and 0.407 using 150 published AIMs or 0.955 and 0.003 by applying PCA directly to the target samples. We finally show that the higher accuracy in inferring ancestry using our method leads to more effective correction for population stratification in association studies.

Availability: The SNPweights software is available online at http://www.hsph.harvard.edu/faculty/alkes-price/software/.

### CytoHiC: a cytoscape plugin for visual comparison of Hi-C networks

Bioinformatics - Mon, 03/25/2013 - 05:32

Summary: With the introduction of the Hi-C method, new and fundamental properties of the nuclear architecture are emerging. The ability to interpret data generated by this method, which aims to capture the physical proximity between and within chromosomes, is crucial for uncovering the three dimensional structure of the nucleus. Providing researchers with tools for interactive visualization of Hi-C data can help in gaining new and important insights. Specifically, visual comparison can pinpoint changes in spatial organization between Hi-C datasets, originating from different cell lines or different species, or normalized by different methods. Here, we present CytoHiC, a Cytoscape plugin, which allows users to view and compare spatial maps of genomic landmarks, based on normalized Hi-C datasets. CytoHiC was developed to support intuitive visual comparison of Hi-C data and integration of additional genomic annotations.

Availability: The CytoHiC plugin, source code, user manual, example files and documentation are available at: http://apps.cytoscape.org/apps/cytohicplugin

Contact: yolisha@gmail.com or ys388@cam.ac.uk