GMU-CS-TR-2010-9
Advances in sequencing technologies have equipped researchers with the
ability to sequence the collective genome of an entire microbial
community, an approach commonly referred to as metagenomics. These
microbes are omnipresent within the human body and in environments
across the world. As such, characterizing and understanding their
roles is crucial for improving human health and the environment. The
problem of using short reads obtained from current next-generation
sequencing technologies to assemble the genomes within a community
sample is
challenging for several reasons. In this study we assess the
performance of a state-of-the-art Eulerian-based graph assembler on a
series of simulated datasets of varying complexity. We evaluate the
feasibility of metagenomic assembly with reads restricted to 36
base pairs, as obtained from the Solexa/Illumina platform. We developed a
pipeline to evaluate the quality of assembly based on contig length
statistics and accuracy. We studied the effect of overlap parameters
used for the metagenomic assembly and developed a clustering solution
to pool the contigs obtained from different runs of the assembly
algorithm, which allowed us to obtain longer contigs. We also computed
an entropy/impurity metric to assess how mixed the assembled contigs
were. Ideally a contig should be assembled from reads obtained from
the same organism. We also compared the metagenomic assemblies to the
best possible solution that could be obtained by assembling individual
source genomes. Our results show that accuracy was better than
expected for the metagenomic samples with a few dominant organisms,
but was especially poor in samples containing many closely related
strains.
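The entropy/impurity computation above admits a compact illustration. The following is a minimal sketch (the function name and input format are ours, not the pipeline's): given the source organism of each read placed into a contig, it computes the Shannon entropy of the organism distribution, so a pure contig scores 0 and an evenly mixed one scores highest.

    import math
    from collections import Counter

    def contig_entropy(read_sources):
        """Shannon entropy (bits) of the organisms contributing reads
        to a single contig; 0.0 means the contig is pure."""
        counts = Counter(read_sources)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total)
                    for c in counts.values())

    print(contig_entropy(["E.coli"] * 50))                      # 0.0
    print(contig_entropy(["E.coli"] * 25 + ["Shigella"] * 25))  # 1.0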
GMU-CS-TR-2010-8
As Cyber-Physical Systems (CPSs) evolve, they will be increasingly
relied upon to support time-critical monitoring and control
activities. Further, many CPSs that utilize Wireless Sensor
Networking (WSN) technologies will require the use of energy
harvesting methods to extend their lifetimes. For this application
class, there are currently few effective models that allow the
simulation and analysis of new algorithms or system performance. To
address this problem, we define a general-purpose WSN model to support
a time-critical CPS. We then present a set of Harvesting Aware
Speed Selection (HASS) algorithms. Our technique maximizes the
minimum energy reserve for all the nodes in the network, thus ensuring
highly resilient performance under emergency or fault-driven
situations. We present an optimal centralized solution, along with an
efficient, fully distributed solution. We propose a CPS-specific
experimental methodology, enabling us to evaluate our approach. Our
experiments show that our algorithms yield significantly higher energy
reserves than baseline methods.
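The objective can be stated compactly. In a hedged reading of the abstract (the symbols below are ours), with $N$ the set of nodes, $\mathcal{S}$ the set of speed assignments that meet the time-critical deadlines, and $E_i(s)$ the energy reserve of node $i$ under assignment $s$, HASS seeks

    \max_{s \in \mathcal{S}} \; \min_{i \in N} \; E_i(s)

so that the node closest to depletion is as well provisioned as possible when an emergency or fault occurs.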
GMU-CS-TR-2010-7
Proteins carry out many critical biological processes by
interacting with other macromolecules (DNA, RNA) and
small molecules. Several computational approaches have
been developed to determine the protein interaction sites
using sequence and structure features. Instead of building
another ad hoc prediction algorithm, the purpose of
this study is to understand the contribution of a
residue to an RNA-binding event and to compare it with the
DNA-binding process. We evaluate several sequence and
structure-based features using mutual information theory.
We show that solvent accessibility and profile-based
features can be used for developing good protein-RNA
binding site determination algorithms. We also recommend
features that could discriminate between RNA and
DNA binding sites. This work can be extended to understand
protein-protein and protein-ligand interactions
as well.
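The mutual-information analysis can be sketched briefly. The snippet below (names and the toy data are ours) estimates $I(X;Y)$ in bits between a discretized residue feature, such as binned solvent accessibility, and a binary binding label:

    import math
    from collections import Counter

    def mutual_information(feature_values, labels):
        """Plug-in estimate of I(X;Y) in bits from paired samples."""
        n = len(labels)
        joint = Counter(zip(feature_values, labels))
        px, py = Counter(feature_values), Counter(labels)
        return sum((nxy / n) * math.log2(nxy * n / (px[x] * py[y]))
                   for (x, y), nxy in joint.items())

    acc = [1, 1, 0, 0, 1, 0]   # e.g., exposed (1) vs. buried (0)
    lab = [1, 1, 0, 0, 1, 1]   # RNA-binding (1) vs. non-binding (0)
    print(mutual_information(acc, lab))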
GMU-CS-TR-2010-6
In this paper, we consider energy management algorithms for scheduling
jobs in power-scarce scenarios such as embedded computer systems and
sensor networks. We focus on investigating the impact of buffer
resources in minimizing the total energy cost in an online
setting. The online algorithms make no assumptions about job
arrivals; their worst-case performance is measured in terms of the
competitive ratio, comparing them against optimal clairvoyant
algorithms. We prove that, with appropriate extra buffer space,
an online algorithm can beat a weak optimal offline algorithm in
terms of the total energy required. Our results help to
quantitatively estimate the optimal on-chip buffer resources to
allocate in real-time systems with power constraints. We also present
a lower bound on the competitive ratio that no deterministic online
algorithm can beat.
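For reference, the competitive ratio used here follows the standard definition. Writing $E_{\mathrm{ALG}}(\sigma)$ and $E_{\mathrm{OPT}}(\sigma)$ for the total energy consumed by the online algorithm and by the clairvoyant optimum on a job sequence $\sigma$, the algorithm is $c$-competitive if, for every $\sigma$,

    E_{\mathrm{ALG}}(\sigma) \le c \cdot E_{\mathrm{OPT}}(\sigma)

The result above states that, given enough extra buffer space, an online algorithm can achieve $c < 1$ against an offline optimum restricted to the original buffer.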
GMU-CS-TR-2010-5
Over the past few years, virtualization has been employed in environments ranging from densely populated cloud computing clusters to home desktop computers. Security researchers have embraced virtual machine monitors (VMMs) as a new mechanism to guarantee deep isolation of untrusted software components. Unfortunately, their widespread adoption has promoted VMMs to a prime target for attackers. In this paper, we present HyperCheck, a hardware-assisted tampering detection framework designed to protect the integrity of VMMs and, for some classes of attacks, the underlying operating system (OS). HyperCheck leverages the CPU System Management Mode (SMM), present in x86 systems, to securely generate and transmit the full state of the protected machine to an external server. Using HyperCheck, we were able to ferret out rootkits that targeted the integrity of both the Xen hypervisor and traditional OSes. Moreover, HyperCheck is robust against attacks that aim to disable or block its operation. Our experimental results show that HyperCheck can produce and communicate a scan of the state of the protected software in less than 40ms.
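The detection principle is easy to state in miniature: hash a snapshot of the protected state and compare it to a known-good baseline. The sketch below is purely illustrative (the region contents are stand-ins; HyperCheck performs the acquisition from SMM and ships the state to an external analysis server rather than checking locally):

    import hashlib

    def snapshot_digest(regions):
        """SHA-256 over the monitored regions (modeled as byte strings)."""
        h = hashlib.sha256()
        for region in regions:
            h.update(region)
        return h.hexdigest()

    # Baseline taken when the hypervisor is known to be clean:
    baseline = snapshot_digest([b"hypervisor .text", b"IDT contents"])
    # Later, periodic snapshots are compared against it:
    current = snapshot_digest([b"hypervisor .text", b"IDT (patched)"])
    if current != baseline:
        print("possible hypervisor tampering detected")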
GMU-CS-TR-2010-4
In this paper, we present a new motion planning strategy for
shepherding in environments containing obstacles. This instance of the
group motion control problem is applicable to a wide variety of
real-life scenarios, such as animal herding, civil crowd control, and
oil-spill cleanup. However, the problem is challenging in terms of
scalability and robustness because it is dynamic, highly
underactuated, and involves multi-agent coordination. Our previous
work showed that high-level probabilistic motion planning algorithms
combined with simple shepherding behaviors can be beneficial in
situations where low-level behaviors alone are insufficient. However,
inconsistent results suggested a need for a method that performs well
across a wider range of environments. In this paper, we present a new
method, called Deform, in which shepherds view the flock as an
abstracted deformable shape. We show that our method is more robust
than our previous approach and that it scales more effectively to
larger teams of shepherds and larger flocks. We also show Deform to be
surprisingly robust despite increasing randomness in the motion of the
flock.
GMU-CS-TR-2010-3
Camera control is essential in both virtual and real-world environments.
Our work focuses on an instance of camera control called target
following, and offers an algorithm, based on
the ideas of monotonic tracking regions and ghost targets,
for following a large coherent group of targets with unknown
trajectories, among known obstacles. In multiple-target following, the
camera's primary objective is to follow and maximize visibility of
multiple moving targets. For example, in video games, a third-person
view camera may be controlled to follow a group of characters through
complicated virtual environments. In robotics, a camera attached to
a robotic manipulator could also be controlled to observe live performers
in a concert, monitor assembly of a mechanical system, or maintain task
visibility during teleoperated surgical procedures.
To the best of our knowledge, this work is the first to address this
particular instance of camera control.
GMU-CS-TR-2010-20
Additional Files: techreports/GMU-CS-TR-2010-20.zip
Traditionally, Network Intrusion Detection Systems (NIDS) inspect packet header and payload data for malicious content. While each system is different, most NIDS perform limited analysis on network streams and network protocols. Unfortunately, current NIDS are typically susceptible to evasion through network protocol encoding, such as base64 encoding of SMTP/MIME or gzip compression of HTTP. In addition, malicious desktop application payloads (e.g., PDF documents, Flash multimedia files) are beyond the inspection capabilities of popular NIDS. To address these limitations, we introduce Ruminate, a scalable object-centric traffic inspection and analysis architecture. Ruminate provides a distributed platform for deep analysis of network payload content, including full decoding of network protocols and recursive extraction of client application objects transferred over the network. While traditional NIDS utilize static packet load balancing to provide scalability, Ruminate employs dynamic load distribution of reassembled network streams and embedded objects, outsourcing the heavy processing to other processors or connected hosts. Therefore, high-latency or computationally expensive analysis can be performed on commodity servers. Furthermore, our approach empowers system administrators to provision resources and preferentially treat traffic based not only on packet headers but also on the data objects the traffic carries. To achieve this, each object inspection algorithm is implemented as a separate component or service offered through a highly scalable producer-consumer architecture. We demonstrate using real-world traffic that our load balancing is far superior to existing techniques, because its granularity depends on the reconstructed objects rather than on packets or simple streams. Unlike existing systems, Ruminate can prevent NIDS evasion that leverages encoding or compression of malicious objects in network protocols, desktop application file formats, or encapsulation within other objects.
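The object-granularity load distribution can be sketched in a few lines. The class below is a toy stand-in for Ruminate's producer-consumer tiers (names and structure are ours): each extracted object is routed to the least-loaded analyzer registered for its type, rather than being pinned to whichever node its packets happened to hash to.

    import heapq

    class ObjectDispatcher:
        """Route reassembled objects to the least-loaded analyzer
        for their type (toy model of object-level load balancing)."""

        def __init__(self, analyzers_by_type):
            # e.g., {"pdf": ["hostA", "hostB"], "flash": ["hostC"]}
            self.queues = {t: [(0, h) for h in hosts]
                           for t, hosts in analyzers_by_type.items()}
            for q in self.queues.values():
                heapq.heapify(q)

        def dispatch(self, obj_type, obj_size):
            load, host = heapq.heappop(self.queues[obj_type])
            heapq.heappush(self.queues[obj_type], (load + obj_size, host))
            return host

    d = ObjectDispatcher({"pdf": ["hostA", "hostB"]})
    print(d.dispatch("pdf", 2_000_000))  # hostA
    print(d.dispatch("pdf", 500_000))    # hostB, since hostA is busier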
GMU-CS-TR-2010-2
Fold recognition is a key problem in computational biology
that involves classifying proteins sharing structural similarities
into classes commonly known as folds. Recently, researchers
have developed several efficient kernel-based discriminative
methods for fold classification using sequence information.
These methods train one-versus-rest binary classifiers
using well-optimized kernels from different data sources
and techniques.
Integrating this vast amount of data in the form of kernel
matrices is an interesting and challenging problem. The
positive semidefinite property of the various kernel matrices
makes it attractive to cast the task of learning an optimal
weighting of several kernel matrices as a semi-definite programming
optimization problem. We experiment with a previously
introduced quadratically constrained quadratic optimization
problem for kernel integration using 1-norm and 2-norm
support vector machines. We integrate state-of-the-art profile-based
direct kernels to learn an optimal kernel matrix. Our
experimental results show a small but significant improvement in
terms of the classification accuracy across the different fold
classes. Our analysis illustrates the strength of two dominating
kernels across different fold classes, which suggests the redundant
nature of the kernel matrices being combined.
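The integration step has a standard compact form. Given base kernel matrices $K_1, \ldots, K_p$ (one per data source or technique), the goal is a weighted combination

    K = \sum_{j=1}^{p} \mu_j K_j, \qquad \mu_j \ge 0

which remains positive semidefinite because each $K_j$ is, and the weights $\mu_j$ are optimized jointly with the SVM dual variables, yielding the quadratically constrained quadratic program referenced above. (The notation is ours; the abstract does not fix symbols.)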
GMU-CS-TR-2010-19
Next-generation technologies have allowed researchers to determine the collective genomes of all organisms within specific environments or communities. Varying species abundances, lengths, and complexities within different communities, coupled with the discovery of new species, make the problem of taxonomic assignment of short DNA sequence reads extremely challenging. We have developed a new sequence composition-based taxonomic classifier, TAC-ELM, for metagenomic analysis. TAC-ELM uses the framework of extreme learning machines to quickly and accurately learn the weights for a neural network model, with input features consisting of GC content and oligonucleotides. TAC-ELM is evaluated on two standard metagenomic benchmarks with sequence read lengths reflecting traditional and current technologies. Our empirical results indicate the strength of the developed approach, which outperforms state-of-the-art taxonomic classifiers in terms of accuracy, training time, and implementation complexity. We also perform experiments that evaluate the pervasive case within metagenome analysis, where a species may not have been previously sequenced or discovered and thus will not exist in the reference genome databases.
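The extreme-learning-machine step that TAC-ELM builds on is simple enough to sketch. The code below is a generic ELM, not TAC-ELM itself (feature construction from GC content and oligonucleotides is omitted, and all names are ours): input weights are random and fixed, and only the output weights are solved for, in closed form.

    import numpy as np

    def train_elm(X, T, hidden=100, seed=0):
        """Generic ELM: random hidden layer, output weights via the
        Moore-Penrose pseudoinverse (closed form, hence fast training)."""
        rng = np.random.default_rng(seed)
        W = rng.normal(size=(X.shape[1], hidden))  # fixed, never trained
        b = rng.normal(size=hidden)
        H = np.tanh(X @ W + b)                     # hidden activations
        beta = np.linalg.pinv(H) @ T               # least-squares solve
        return W, b, beta

    def predict_elm(X, W, b, beta):
        return np.tanh(X @ W + b) @ beta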
GMU-CS-TR-2010-18
Social networks and discussion boards have become a significant outlet where people communicate and express their opinions freely. Although the social networks themselves are usually well-provisioned, the participating users frequently point to external links to substantiate their discussions. Unfortunately, the sudden heavy traffic load imposed on the external, linked web sites causes them to become unresponsive, leading to what people call the "Flash Crowd" effect. Flash crowds present a real challenge, as it is not possible to predict their intensity or time of occurrence. Moreover, although increasingly capable, most present-day web hosting servers and caching systems are designed to handle only a nominal load of requests before they become unresponsive, whether due to the limited bandwidth or the limited processing power allocated to the hosting site. In this paper, we quantify the prevalence of flash crowd events for a popular social discussion board (Digg). Using PlanetLab, we measured the response times of 1289 unique popular websites. We were able to verify that 89% of the popular URLs suffered variations in their response times. In an effort to identify flash crowds ahead of time, we evaluate and compare traffic forecasting mechanisms. We show that predicting network traffic using network measurements has very limited success and cannot be used for large-scale prediction. However, by analyzing the content and structure of the social discussions, we were able to classify 86% of the popular web sites within 5 minutes of their submission, and 95% of the sites when more social content (5 hours' worth) became available. Our work indicates that we can effectively leverage social activity to forecast network events that would otherwise be infeasible to anticipate.
GMU-CS-TR-2010-17
Users seeking information in distributed environments comprising large numbers of disparate information resources are often burdened with the task of repeating their queries for each and every resource. Invariably, some of the searched resources are more productive (yield more useful documents) than others, and it would undoubtedly be useful to try these resources first. If the environment is
GMU-CS-TR-2010-16
MASON is a free, open-source, Java-based discrete event multi-agent simulation toolkit that has been used to model network intrusions, unmanned aerial vehicles, nomadic migrations, and farmer/herder conflicts, among others. Many multi-agent models use georeferenced data representing such things as road networks, rivers, vegetation coverage, population, and topology. However, MASON does not directly support georeferenced data, so practitioners using MASON must hand-craft such support, which may be difficult and error-prone. In this paper we describe newly added geospatial functionality in MASON that addresses this problem. We discuss the design of this functionality, called GeoMASON, along with its use and limitations. Finally, we give examples of how to import and manipulate georeferenced data.
GMU-CS-TR-2010-15
Traditional graph-based clustering methods group vertices into discrete, non-intersecting clusters under the assumption that each vertex can belong to only a single cluster. On the other hand, recent research on graph-based clustering methods, applied to real-world networks (e.g., protein-protein interaction networks and social networks), shows overlapping patterns among the underlying clusters. For example, in social networks, an individual is expected to belong to multiple clusters (or communities), rather than strictly confining himself/herself to just one. As such, overlapping clusters enable better models of real-life phenomena. Soft clustering (e.g., fuzzy c-means) has been used with success for non-graph data, where objects are allowed to belong to multiple clusters with a certain degree of membership. In this paper, we propose a fuzzy-clustering-based approach for community detection in weighted graphical representations of social and biological networks for which the ground truth associated with the nodes is available. We compare our results with a baseline method on both multi-labeled and single-labeled datasets.
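For context, the fuzzy c-means machinery the approach builds on minimizes a standard objective. With membership degrees $u_{ik} \in [0,1]$, fuzzifier $m > 1$, and cluster centers $v_k$ (notation ours), it is

    J_m = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} \, \lVert x_i - v_k \rVert^2, \qquad \sum_{k=1}^{c} u_{ik} = 1 \ \text{for each } i

so each node $i$ carries a full membership vector across the $c$ communities rather than a single hard assignment.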
GMU-CS-TR-2010-14
Sequence classification is central to many practical problems within machine learning. Classification algorithms often center around the notion of a distance metric between examples. For vectors, the Euclidean distance metric often has an intuitive meaning that transfers naturally to the classification domain. Distance metrics between arbitrary pairs of sequences, however, can be harder to define, because sequences can vary in length and because the information contained in the order of sequence elements is lost when standard distance metrics are applied. We present a scheme that employs a Hidden Markov Model variant to produce a set of fixed-length vectors from a set of sequences. We then define three inference algorithms, a Baum-Welch variant, a Gibbs sampling algorithm, and a variational algorithm, to infer model parameters. Finally, we show experimentally that the fixed-length representation produced by these inference methods is useful for classifying proteins by structural taxonomy.
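The abstract does not spell out the fixed-length construction, but one plausible reading (ours, stated as an assumption) maps each sequence to its normalized expected state occupancies. With $\gamma_t(k) = P(z_t = k \mid x)$ from the forward-backward recursions of a $K$-state model, a length-$T$ sequence $x$ becomes the $K$-dimensional vector

    \phi(x) = \frac{1}{T} \left( \sum_{t=1}^{T} \gamma_t(1), \; \ldots, \; \sum_{t=1}^{T} \gamma_t(K) \right)

on which standard vector distance metrics, and hence standard classifiers, apply directly.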
GMU-CS-TR-2010-13
Ensuring effective communications during emergencies is an important issue for any functional government. One way to address this issue is to ensure the availability of the key personnel capable of making the appropriate decisions and taking timely actions with sufficient resources. Many XML-based languages, such as the Emergency Data Exchange Language (EDXL) and the associated Common Alert Protocol (CAP), have been designed to provide a basis for such communications. To ensure that messages are delivered in a timely manner, we propose role- and task-based ontological enhancements for these languages. We show by example how the ontological enhancements can be used to improve the availability of emergency personnel in case of need.
GMU-CS-TR-2010-12
Computing the Minkowski sums of rotating objects has traditionally been done naively by re-computing every Minkowski sum from scratch; the correspondences between successive Minkowski sums are typically ignored entirely. We propose a method, called DYMSUM, that can efficiently update the Minkowski sums of rotating convex polyhedra. We show that DYMSUM is significantly more efficient than the traditional approach, in particular when the input polyhedra are large and the rotation between frames is small. Our experimental results show that the computation time of the proposed method grows slowly with the size of the input, compared to the naive approach.
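Recall the definition at issue: the Minkowski sum of two point sets is

    A \oplus B = \{\, a + b \mid a \in A,\ b \in B \,\}

so when one operand rotates, $A \oplus R_{\theta}B$ changes with every frame; the naive approach recomputes it from scratch at each $\theta$, which is the cost DYMSUM's incremental updates avoid.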
GMU-CS-TR-2010-11
Configuration space (C-space) plays an important role not only in motion planning but also in geometric modeling and in shape and kinematic reasoning, and it is fundamental to several basic geometric operations, such as continuous collision detection and generalized penetration depth estimation, which in turn find applications in motion planning, animation, and simulation. In this paper, we develop a new method for constructing the boundary of the C-space obstacles (C-obst) of polygons. This method is simpler to implement and often more efficient than existing techniques. These advantages are provided by a new algorithm that extracts the Minkowski sum from the reduced convolution of the input polygons. We also develop a method for estimating the generalized penetration depth by computing the distance between the query point and the C-obst surface.
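For a translating robot $R$ and an obstacle $O$, the configuration-space obstacle is the standard Minkowski construction

    C\text{-obst} = O \oplus (-R) = \{\, o - r \mid o \in O,\ r \in R \,\}

which is why an efficient Minkowski-sum algorithm, here via the reduced convolution, directly yields the C-obst boundary.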
GMU-CS-TR-2010-10
Self-adaptation endows a software system with the ability to satisfy
certain objectives by automatically modifying its behavior. While
many promising approaches for the construction of self-adaptive
software systems have been developed, the majority of them ignore the
uncertainty underlying the adaptation decisions. This has been one of
the key inhibitors to the widespread adoption of self-adaptation
techniques in risk-averse, real-world settings. In this paper, we
describe an
approach, called POssIbilistic SElf-aDaptation (POISED), for tackling
the challenge posed by uncertainty in making adaptation
decisions. POISED builds on possibility theory to assess the positive
and negative consequences of uncertainty. It makes adaptation
decisions that result in the best range of potential behavior. We
demonstrate POISED's application to the problem of improving a software
system's quality attributes via runtime reconfiguration of its
customizable software components. We have extensively evaluated POISED
using a prototype of a robotic software system.
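For background, possibility theory (the formalism POISED builds on) attaches to each event both an optimistic and a pessimistic grade. Given a possibility distribution $\pi : \Omega \to [0,1]$, the possibility and necessity of an event $A$ are

    \Pi(A) = \sup_{\omega \in A} \pi(\omega), \qquad N(A) = 1 - \Pi(\overline{A})

and this pair is the standard vehicle for the "positive and negative consequences of uncertainty" mentioned above.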
GMU-CS-TR-2010-1
Component-based software development and deployment is based on
developing individual software modules that are composed on an
as-needed basis. Such modules expose the computations they provide and
the dependencies they have in providing these computations, resulting
in the well-known requires-provides specification for modules. This
paper provides a framework to combine modules that specify their
requires-provides interfaces in a policy-dependent way. Our framework
specifies policies as combinations of Constraint Logic Programming
(CLP) based rules, and our policies can cover multiple aspects
associated with compositions, such as security and quality of
service. We apply our framework to specify Quality of Protection (QoP)
and Quality of Service (QoS) policies. An example shows the
applicability of our
policy language to a teleconferencing application with multiple
security and resource usage policies.
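The requires-provides matching itself is easy to model in miniature. The sketch below is a toy illustration only (names, structure, and the policy predicate are ours; the framework expresses policies as CLP rules, not Python predicates): a composition is admitted when every requirement is provided by some module and the policy approves.

    def composable(modules, policy):
        """Toy requires-provides check plus a policy predicate."""
        provided = set().union(*(m["provides"] for m in modules))
        required = set().union(*(m["requires"] for m in modules))
        return required <= provided and policy(modules)

    # Hypothetical QoP policy: any module consuming "media" must
    # also provide encryption.
    policy = lambda ms: all("aes" in m["provides"]
                            for m in ms if "media" in m["requires"])
    audio = {"provides": {"audio", "aes"}, "requires": {"media"}}
    net   = {"provides": {"media"}, "requires": set()}
    print(composable([audio, net], policy))  # True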