GMU-CS-TR-2010-9
Advances in sequencing technologies have equipped researchers with the
ability to sequence the collective genome of an entire microbial
community, an approach commonly referred to as metagenomics. These
microbes are omnipresent within the human body and in environments
across the world. As such, characterizing and understanding their
roles is crucial for improving human health and the environment. The
problem of using short reads obtained from current next-generation
sequencing technologies to assemble the genomes within a community
sample is
challenging for several reasons. In this study we assess the
performance of a state-of-the-art Eulerian-based graph assembler on a
series of simulated datasets of varying complexity. We evaluate the
feasibility of metagenomic assembly with reads restricted to 36
base pairs, as obtained from the Solexa/Illumina platform. We developed a
pipeline to evaluate the quality of assembly based on contig length
statistics and accuracy. We studied the effect of overlap parameters
used for the metagenomic assembly and developed a clustering solution
to pool the contigs obtained from different runs of the assembly
algorithm, which allowed us to obtain longer contigs. We also computed
an entropy/impurity metric to assess how mixed the assembled contigs
were. Ideally a contig should be assembled from reads obtained from
the same organism. We also compared the metagenomic assemblies to the
best possible solution that could be obtained by assembling individual
source genomes. Our results show that accuracy was better than
expected for the metagenomic samples with a few dominant organisms,
but was especially poor in samples containing many closely related
strains.
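The entropy/impurity computation above admits a compact illustration. The following is a minimal sketch (the function name and input format are ours, not the pipeline's): given the source organism of each read placed into a contig, it computes the Shannon entropy of the organism distribution, so a pure contig scores 0 and an evenly mixed one scores highest.

    import math
    from collections import Counter

    def contig_entropy(read_sources):
        """Shannon entropy (bits) of the organisms contributing reads
        to a single contig; 0.0 means the contig is pure."""
        counts = Counter(read_sources)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total)
                    for c in counts.values())

    print(contig_entropy(["E.coli"] * 50))                      # 0.0
    print(contig_entropy(["E.coli"] * 25 + ["Shigella"] * 25))  # 1.0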
GMU-CS-TR-2010-8
As Cyber-Physical Systems (CPSs) evolve, they will be increasingly
relied upon to support time-critical monitoring and control
activities. Further, many CPSs that utilize Wireless Sensor
Networking (WSN) technologies will require the use of energy
harvesting methods to extend their lifetimes. For this application
class, there are currently few effective models that allow the
simulation and analysis of new algorithms or system performance. To
address this problem, we define a general-purpose WSN model to support
a time-critical CPS. We then present a set of Harvesting Aware
Speed Selection (HASS) algorithms. Our technique maximizes the
minimum energy reserve for all the nodes in the network, thus ensuring
highly resilient performance under emergency or fault-driven
situations. We present an optimal centralized solution, along with an
efficient, fully distributed solution. We propose a CPS-specific
experimental methodology, enabling us to evaluate our approach. Our
experiments show that our algorithms yield significantly higher energy
reserves than baseline methods.
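The objective can be stated compactly. In a hedged reading of the abstract (the symbols below are ours), with $N$ the set of nodes, $\mathcal{S}$ the set of speed assignments that meet the time-critical deadlines, and $E_i(s)$ the energy reserve of node $i$ under assignment $s$, HASS seeks

    \max_{s \in \mathcal{S}} \; \min_{i \in N} \; E_i(s)

so that the node closest to depletion is as well provisioned as possible when an emergency or fault occurs.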
GMU-CS-TR-2010-7
Proteins carry out many critical biological processes by
interacting with other macromolecules (DNA, RNA) and
small molecules. Several computational approaches have
been developed to determine the protein interaction sites
using sequence and structure features. Instead of building
another ad hoc prediction algorithm, the purpose of
this study is to understand the contribution of a
residue to an RNA-binding event and to compare it with the
DNA-binding process. We evaluate several sequence and
structure-based features using mutual information theory.
We show that solvent accessibility and profile-based
features can be used for developing good protein-RNA
binding site determination algorithms. We also recommend
features that could discriminate between RNA and
DNA binding sites. This work can be extended to understand
protein-protein and protein-ligand interactions
as well.
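The mutual-information analysis can be sketched briefly. The snippet below (names and the toy data are ours) estimates $I(X;Y)$ in bits between a discretized residue feature, such as binned solvent accessibility, and a binary binding label:

    import math
    from collections import Counter

    def mutual_information(feature_values, labels):
        """Plug-in estimate of I(X;Y) in bits from paired samples."""
        n = len(labels)
        joint = Counter(zip(feature_values, labels))
        px, py = Counter(feature_values), Counter(labels)
        return sum((nxy / n) * math.log2(nxy * n / (px[x] * py[y]))
                   for (x, y), nxy in joint.items())

    acc = [1, 1, 0, 0, 1, 0]   # e.g., exposed (1) vs. buried (0)
    lab = [1, 1, 0, 0, 1, 1]   # RNA-binding (1) vs. non-binding (0)
    print(mutual_information(acc, lab))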
GMU-CS-TR-2010-6
In this paper, we consider energy management algorithms for scheduling
jobs in power-scarce scenarios such as embedded computer systems and
sensor networks. We focus on investigating the impact of buffer
resources in minimizing the total energy cost in an online
setting. The online algorithms make no assumptions about job
arrivals; their worst-case performance is measured in terms of the
competitive ratio, comparing them against optimal clairvoyant
algorithms. We prove that, with appropriate extra buffer space,
an online algorithm can beat a weak optimal offline algorithm in
terms of the total energy required. Our results help to
quantitatively estimate the optimal on-chip buffer resources to
allocate in real-time systems with power constraints. We also present
a lower bound on the competitive ratio that no deterministic online
algorithm can beat.
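For reference, the competitive ratio used here follows the standard definition. Writing $E_{\mathrm{ALG}}(\sigma)$ and $E_{\mathrm{OPT}}(\sigma)$ for the total energy consumed by the online algorithm and by the clairvoyant optimum on a job sequence $\sigma$, the algorithm is $c$-competitive if, for every $\sigma$,

    E_{\mathrm{ALG}}(\sigma) \le c \cdot E_{\mathrm{OPT}}(\sigma)

The result above states that, given enough extra buffer space, an online algorithm can achieve $c < 1$ against an offline optimum restricted to the original buffer.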
GMU-CS-TR-2010-5
Over the past few years, virtualization has been employed in environments ranging from densely populated cloud computing clusters to home desktop computers. Security researchers have embraced virtual machine monitors (VMMs) as a new mechanism to guarantee deep isolation of untrusted software components. Unfortunately, their widespread adoption has promoted VMMs to a prime target for attackers. In this paper, we present HyperCheck, a hardware-assisted tampering detection framework designed to protect the integrity of VMMs and, for some classes of attacks, the underlying operating system (OS). HyperCheck leverages the CPU System Management Mode (SMM), present in x86 systems, to securely generate and transmit the full state of the protected machine to an external server. Using HyperCheck, we were able to ferret out rootkits that targeted the integrity of both the Xen hypervisor and traditional OSes. Moreover, HyperCheck is robust against attacks that aim to disable or block its operation. Our experimental results show that HyperCheck can produce and communicate a scan of the state of the protected software in less than 40ms.
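The detection principle is easy to state in miniature: hash a snapshot of the protected state and compare it to a known-good baseline. The sketch below is purely illustrative (the region contents are stand-ins; HyperCheck performs the acquisition from SMM and ships the state to an external analysis server rather than checking locally):

    import hashlib

    def snapshot_digest(regions):
        """SHA-256 over the monitored regions (modeled as byte strings)."""
        h = hashlib.sha256()
        for region in regions:
            h.update(region)
        return h.hexdigest()

    # Baseline taken when the hypervisor is known to be clean:
    baseline = snapshot_digest([b"hypervisor .text", b"IDT contents"])
    # Later, periodic snapshots are compared against it:
    current = snapshot_digest([b"hypervisor .text", b"IDT (patched)"])
    if current != baseline:
        print("possible hypervisor tampering detected")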
GMU-CS-TR-2010-4
In this paper, we present a new motion planning strategy for
shepherding in environments containing obstacles. This instance of the
group motion control problem is applicable to a wide variety of
real-life scenarios, such as animal herding, civil crowd control, and
oil-spill cleanup. However, the problem is challenging in terms of
scalability and robustness because it is dynamic, highly
underactuated, and involves multi-agent coordination. Our previous
work showed that high-level probabilistic motion planning algorithms
combined with simple shepherding behaviors can be beneficial in
situations where low-level behaviors alone are insufficient. However,
inconsistent results suggested a need for a method that performs well
across a wider range of environments. In this paper, we present a new
method, called Deform, in which shepherds view the flock as an
abstracted deformable shape. We show that our method is more robust
than our previous approach and that it scales more effectively to
larger teams of shepherds and larger flocks. We also show Deform to be
surprisingly robust despite increasing randomness in the motion of the
flock.
GMU-CS-TR-2010-3
Camera control is essential in both virtual and real-world environments.
Our work focuses on an instance of camera control called target
following, and offers an algorithm, based on
the ideas of monotonic tracking regions and ghost targets,
for following a large coherent group of targets with unknown
trajectories, among known obstacles. In multiple-target following, the
camera's primary objective is to follow and maximize visibility of
multiple moving targets. For example, in video games, a third-person
view camera may be controlled to follow a group of characters through
complicated virtual environments. In robotics, a camera attached to
a robotic manipulator could also be controlled to observe live performers
in a concert, monitor assembly of a mechanical system, or maintain task
visibility during teleoperated surgical procedures.
To the best of our knowledge, this work is the first to address this
particular instance of camera control.
GMU-CS-TR-2010-20
Additional Files: techreports/GMU-CS-TR-2010-20.zip
Traditionally, Network Intrusion Detection Systems (NIDS) inspect packet header and payload data for malicious content. While each system is different, most NIDS perform limited analysis on network streams and network protocols. Unfortunately, current NIDS are typically susceptible to evasion through network protocol encoding, such as base64 encoding of SMTP/MIME or gzip compression of HTTP. In addition, malicious desktop application payloads (e.g., PDF documents, Flash multimedia files) are beyond the inspection capabilities of popular NIDS. To address these limitations, we introduce Ruminate, a scalable object-centric traffic inspection and analysis architecture. Ruminate provides a distributed platform for deep analysis of network payload content, including full decoding of network protocols and recursive extraction of client application objects transferred over the network. While traditional NIDS utilize static packet load balancing to provide scalability, Ruminate employs dynamic load distribution of reassembled network streams and embedded objects, outsourcing the heavy processing to other processors or connected hosts. Therefore, high-latency or computationally expensive analysis can be performed on commodity servers. Furthermore, our approach empowers system administrators to provision resources and preferentially treat traffic based not only on packet headers but also on the data objects the traffic carries. To achieve this, each object inspection algorithm is implemented as a separate component or service offered through a highly scalable producer-consumer architecture. We demonstrate using real-world traffic that our load balancing is far superior to existing techniques, because its granularity depends on the reconstructed objects rather than on packets or simple streams. Unlike existing systems, Ruminate can prevent NIDS evasion that leverages encoding or compression of malicious objects in network protocols, desktop application file formats, or encapsulation within other objects.
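The object-granularity load distribution can be sketched in a few lines. The class below is a toy stand-in for Ruminate's producer-consumer tiers (names and structure are ours): each extracted object is routed to the least-loaded analyzer registered for its type, rather than being pinned to whichever node its packets happened to hash to.

    import heapq

    class ObjectDispatcher:
        """Route reassembled objects to the least-loaded analyzer
        for their type (toy model of object-level load balancing)."""

        def __init__(self, analyzers_by_type):
            # e.g., {"pdf": ["hostA", "hostB"], "flash": ["hostC"]}
            self.queues = {t: [(0, h) for h in hosts]
                           for t, hosts in analyzers_by_type.items()}
            for q in self.queues.values():
                heapq.heapify(q)

        def dispatch(self, obj_type, obj_size):
            load, host = heapq.heappop(self.queues[obj_type])
            heapq.heappush(self.queues[obj_type], (load + obj_size, host))
            return host

    d = ObjectDispatcher({"pdf": ["hostA", "hostB"]})
    print(d.dispatch("pdf", 2_000_000))  # hostA
    print(d.dispatch("pdf", 500_000))    # hostB, since hostA is busier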
GMU-CS-TR-2010-2
Fold recognition is a key problem in computational biology
that involves classifying proteins sharing structural similarities
into classes commonly known as folds. Recently, researchers
have developed several efficient kernel-based discriminative
methods for fold classification using sequence information.
These methods train one-versus-rest binary classifiers
using well-optimized kernels from different data sources
and techniques.
Integrating this vast amount of data in the form of kernel
matrices is an interesting and challenging problem. The
positive semidefinite property of the various kernel matrices
makes it attractive to cast the task of learning an optimal
weighting of several kernel matrices as a semi-definite programming
optimization problem. We experiment with a previously
introduced quadratically constrained quadratic optimization
problem for kernel integration using 1-norm and 2-norm
support vector machines. We integrate state-of-the-art profile-based
direct kernels to learn an optimal kernel matrix. Our
experimental results show a small but significant improvement in
terms of the classification accuracy across the different fold
classes. Our analysis illustrates the strength of two dominating
kernels across different fold classes, which suggests the redundant
nature of the kernel matrices being combined.
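The integration step has a standard compact form. Given base kernel matrices $K_1, \ldots, K_p$ (one per data source or technique), the goal is a weighted combination

    K = \sum_{j=1}^{p} \mu_j K_j, \qquad \mu_j \ge 0

which remains positive semidefinite because each $K_j$ is, and the weights $\mu_j$ are optimized jointly with the SVM dual variables, yielding the quadratically constrained quadratic program referenced above. (The notation is ours; the abstract does not fix symbols.)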
GMU-CS-TR-2010-19
Next-generation technologies have allowed researchers to determine the collective genomes of all organisms within specific environments or communities. Varying species abundances, lengths, and complexities within different communities, coupled with the discovery of new species, make the problem of taxonomic assignment of short DNA sequence reads extremely challenging. We have developed a new sequence composition-based taxonomic classifier, TAC-ELM, for metagenomic analysis. TAC-ELM uses the framework of extreme learning machines to quickly and accurately learn the weights for a neural network model, with input features consisting of GC content and oligonucleotides. TAC-ELM is evaluated on two standard metagenomic benchmarks with sequence read lengths reflecting traditional and current technologies. Our empirical results indicate the strength of the developed approach, which outperforms state-of-the-art taxonomic classifiers in terms of accuracy, training time, and implementation complexity. We also perform experiments that evaluate the pervasive case within metagenome analysis, where a species may not have been previously sequenced or discovered and thus will not exist in the reference genome databases.
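The extreme-learning-machine step that TAC-ELM builds on is simple enough to sketch. The code below is a generic ELM, not TAC-ELM itself (feature construction from GC content and oligonucleotides is omitted, and all names are ours): input weights are random and fixed, and only the output weights are solved for, in closed form.

    import numpy as np

    def train_elm(X, T, hidden=100, seed=0):
        """Generic ELM: random hidden layer, output weights via the
        Moore-Penrose pseudoinverse (closed form, hence fast training)."""
        rng = np.random.default_rng(seed)
        W = rng.normal(size=(X.shape[1], hidden))  # fixed, never trained
        b = rng.normal(size=hidden)
        H = np.tanh(X @ W + b)                     # hidden activations
        beta = np.linalg.pinv(H) @ T               # least-squares solve
        return W, b, beta

    def predict_elm(X, W, b, beta):
        return np.tanh(X @ W + b) @ beta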
GMU-CS-TR-2010-18
Social networks and discussion boards have become a significant outlet where people communicate and express their opinions freely. Although the social networks themselves are usually well-provisioned, the participating users frequently point to external links to substantiate their discussions. Unfortunately, the sudden heavy traffic load imposed on the external, linked web sites causes them to become unresponsive, leading to what people call the "Flash Crowd" effect. Flash crowds present a real challenge, as it is not possible to predict their intensity or time of occurrence. Moreover, although increasingly capable, most present-day web hosting servers and caching systems are designed to handle only a nominal load of requests before they become unresponsive, whether due to the limited bandwidth or the limited processing power allocated to the hosting site. In this paper, we quantify the prevalence of flash crowd events for a popular social discussion board (Digg). Using PlanetLab, we measured the response times of 1289 unique popular websites. We were able to verify that 89% of the popular URLs suffered variations in their response times. In an effort to identify flash crowds ahead of time, we evaluate and compare traffic forecasting mechanisms. We show that predicting network traffic using network measurements has very limited success and cannot be used for large-scale prediction. However, by analyzing the content and structure of the social discussions, we were able to classify 86% of the popular web sites within 5 minutes of their submission, and 95% of the sites when more social content (5 hours' worth) became available. Our work indicates that we can effectively leverage social activity to forecast network events that would otherwise be infeasible to anticipate.
GMU-CS-TR-2010-17
Users seeking information in distributed environments comprising large numbers of disparate information resources are often burdened with the task of repeating their queries for each and every resource. Invariably, some of the searched resources are more productive (yield more useful documents) than others, and it would undoubtedly be useful to try these resources first. If the environment is
GMU-CS-TR-2010-16
MASON is a free, open-source, Java-based discrete event multi-agent simulation toolkit that has been used to model network intrusions, unmanned aerial vehicles, nomadic migrations, and farmer/herder conflicts, among others. Many multi-agent models use georeferenced data representing such things as road networks, rivers, vegetation coverage, population, and topology. However, MASON does not directly support georeferenced data, so practitioners using MASON must hand-craft such support, which may be difficult and error-prone. In this paper we describe newly added geospatial functionality in MASON that addresses this problem. We discuss the design of this functionality, called GeoMASON, along with its use and limitations. Finally, we give examples of how to import and manipulate georeferenced data.
GMU-CS-TR-2010-15
Traditional graph-based clustering methods group vertices into discrete, non-intersecting clusters under the assumption that each vertex can belong to only a single cluster. On the other hand, recent research on graph-based clustering methods, applied to real-world networks (e.g., protein-protein interaction networks and social networks), shows overlapping patterns among the underlying clusters. For example, in social networks, an individual is expected to belong to multiple clusters (or communities), rather than strictly confining himself/herself to just one. As such, overlapping clusters enable better models of real-life phenomena. Soft clustering (e.g., fuzzy c-means) has been used with success for non-graph data, where objects are allowed to belong to multiple clusters with a certain degree of membership. In this paper, we propose a fuzzy-clustering-based approach for community detection in weighted graphical representations of social and biological networks for which the ground truth associated with the nodes is available. We compare our results with a baseline method on both multi-labeled and single-labeled datasets.
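For context, the fuzzy c-means machinery the approach builds on minimizes a standard objective. With membership degrees $u_{ik} \in [0,1]$, fuzzifier $m > 1$, and cluster centers $v_k$ (notation ours), it is

    J_m = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} \, \lVert x_i - v_k \rVert^2, \qquad \sum_{k=1}^{c} u_{ik} = 1 \ \text{for each } i

so each node $i$ carries a full membership vector across the $c$ communities rather than a single hard assignment.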
GMU-CS-TR-2010-14
Sequence classification is central to many practical problems within machine learning. Classification algorithms often center around the notion of a distance metric between examples. For vectors, the Euclidean distance metric often has an intuitive meaning that transfers naturally to the classification domain. Distance metrics between arbitrary pairs of sequences, however, can be harder to define, because sequences can vary in length and because the information contained in the order of sequence elements is lost when standard distance metrics are applied. We present a scheme that employs a Hidden Markov Model variant to produce a set of fixed-length vectors from a set of sequences. We then define three inference algorithms, a Baum-Welch variant, a Gibbs sampling algorithm, and a variational algorithm, to infer model parameters. Finally, we show experimentally that the fixed-length representation produced by these inference methods is useful for classifying proteins by structural taxonomy.
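The abstract does not spell out the fixed-length construction, but one plausible reading (ours, stated as an assumption) maps each sequence to its normalized expected state occupancies. With $\gamma_t(k) = P(z_t = k \mid x)$ from the forward-backward recursions of a $K$-state model, a length-$T$ sequence $x$ becomes the $K$-dimensional vector

    \phi(x) = \frac{1}{T} \left( \sum_{t=1}^{T} \gamma_t(1), \; \ldots, \; \sum_{t=1}^{T} \gamma_t(K) \right)

on which standard vector distance metrics, and hence standard classifiers, apply directly.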
GMU-CS-TR-2010-13
Ensuring effective communications during emergencies is an important issue for any functional government. One way to address this issue is to ensure the availability of the key personnel capable of making the appropriate decisions and taking timely actions with sufficient resources. Many XML-based languages, such as the Emergency Data Exchange Language (EDXL) and the associated Common Alert Protocol (CAP), have been designed to provide a basis for such communications. To ensure that messages are delivered in a timely manner, we propose role- and task-based ontological enhancements for these languages. We show by example how the ontological enhancements can be used to improve the availability of emergency personnel in case of need.
GMU-CS-TR-2010-12
Computing the Minkowski sums of rotating objects has traditionally been done naively by re-computing every Minkowski sum from scratch; the correspondences between successive Minkowski sums are typically ignored entirely. We propose a method, called DYMSUM, that can efficiently update the Minkowski sums of rotating convex polyhedra. We show that DYMSUM is significantly more efficient than the traditional approach, in particular when the input polyhedra are large and the rotation between frames is small. Our experimental results show that the computation time of the proposed method grows slowly with the size of the input, compared to the naive approach.
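Recall the definition at issue: the Minkowski sum of two point sets is

    A \oplus B = \{\, a + b \mid a \in A,\ b \in B \,\}

so when one operand rotates, $A \oplus R_{\theta}B$ changes with every frame; the naive approach recomputes it from scratch at each $\theta$, which is the cost DYMSUM's incremental updates avoid.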
GMU-CS-TR-2010-11
Configuration space (C-space) plays an important role not only in motion planning but also in geometric modeling and in shape and kinematic reasoning, and it is fundamental to several basic geometric operations, such as continuous collision detection and generalized penetration depth estimation, which in turn find applications in motion planning, animation, and simulation. In this paper, we develop a new method for constructing the boundary of the C-space obstacles (C-obst) of polygons. This method is simpler to implement and often more efficient than existing techniques. These advantages are provided by a new algorithm that extracts the Minkowski sum from the reduced convolution of the input polygons. We also develop a method for estimating the generalized penetration depth by computing the distance between the query point and the C-obst surface.
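For a translating robot $R$ and an obstacle $O$, the configuration-space obstacle is the standard Minkowski construction

    C\text{-obst} = O \oplus (-R) = \{\, o - r \mid o \in O,\ r \in R \,\}

which is why an efficient Minkowski-sum algorithm, here via the reduced convolution, directly yields the C-obst boundary.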
GMU-CS-TR-2010-10
Self-adaptation endows a software system with the ability to satisfy
certain objectives by automatically modifying its behavior. While
many promising approaches for the construction of self-adaptive
software systems have been developed, the majority of them ignore the
uncertainty underlying the adaptation decisions. This has been one of
the key inhibitors to the widespread adoption of self-adaptation
techniques in risk-averse, real-world settings. In this paper, we
describe an
approach, called POssIbilistic SElf-aDaptation (POISED), for tackling
the challenge posed by uncertainty in making adaptation
decisions. POISED builds on possibility theory to assess the positive
and negative consequences of uncertainty. It makes adaptation
decisions that result in the best range of potential behavior. We
demonstrate POISED's application to the problem of improving a software
system's quality attributes via runtime reconfiguration of its
customizable software components. We have extensively evaluated POISED
using a prototype of a robotic software system.
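For background, possibility theory (the formalism POISED builds on) attaches to each event both an optimistic and a pessimistic grade. Given a possibility distribution $\pi : \Omega \to [0,1]$, the possibility and necessity of an event $A$ are

    \Pi(A) = \sup_{\omega \in A} \pi(\omega), \qquad N(A) = 1 - \Pi(\overline{A})

and this pair is the standard vehicle for the "positive and negative consequences of uncertainty" mentioned above.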
GMU-CS-TR-2010-1
Component-based software development and deployment is based on
developing individual software modules that are composed on an
as-needed basis. Such modules expose the computations they provide and
the dependencies they have in providing these computations, resulting
in the well-known requires-provides specification for modules. This
paper provides a framework to combine modules that specify their
requires-provides interfaces in a policy-dependent way. Our framework
specifies policies as combinations of Constraint Logic Programming
(CLP) based rules, and our policies can cover multiple aspects
associated with compositions, such as security and quality of
service. We apply our framework to specify Quality of Protection (QoP)
and Quality of Service (QoS) policies. An example shows the
applicability of our
policy language to a teleconferencing application with multiple
security and resource usage policies.
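The requires-provides matching itself is easy to model in miniature. The sketch below is a toy illustration only (names, structure, and the policy predicate are ours; the framework expresses policies as CLP rules, not Python predicates): a composition is admitted when every requirement is provided by some module and the policy approves.

    def composable(modules, policy):
        """Toy requires-provides check plus a policy predicate."""
        provided = set().union(*(m["provides"] for m in modules))
        required = set().union(*(m["requires"] for m in modules))
        return required <= provided and policy(modules)

    # Hypothetical QoP policy: any module consuming "media" must
    # also provide encryption.
    policy = lambda ms: all("aes" in m["provides"]
                            for m in ms if "media" in m["requires"])
    audio = {"provides": {"audio", "aes"}, "requires": {"media"}}
    net   = {"provides": {"media"}, "requires": set()}
    print(composable([audio, net], policy))  # True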