GMU-CS-TR-2012-6
The rise of drug-resistant bacteria has brought attention to antimicrobial peptides (AMPs) as targets for novel antibacterial drug research. Many machine learning methods aim to improve recognition of AMPs. Sequence-derived features are often employed in the context of supervised learning through Support Vector Machines (SVMs). This can be useful for expediently screening databases for AMP-like peptides. However, AMPs are characterized by great sequence diversity. Moreover, biological studies focusing on AMP modification and de novo design stand to benefit from computational methods capable of exposing underlying features important for activity at the amino-acid level position. We take the first steps in this direction by considering an extensive list of amino-acid physico-chemical features. We gradually narrow this list down to relevant features in the context of SVM classification. We focus on a specific AMP class, cathelicidins, due to the abundance of documented sequences, to improve their recognition over carefully-designed decoy sequences. Analysis of the features important for the classification reveals interesting physico-chemical properties to preserve when modifying or designing novel AMPs in the wet laboratory.
GMU-CS-TR-2012-5
Owed to their versatile functionality and size, PDF documents have become a popular avenue for exploitation ranging from broadcasted phishing attacks to targeted attacks. In this paper, we present a method for efficiently detecting malicious documents based on feature extraction of document metadata and structure. We demonstrate that by carefully extracting a wide-range of feature sets we can offer a robust malware classifier. Indeed, using multiple datasets containing an aggregate of 5,000 malicious documents and 100,000 benign ones, our classification rates are well above 99% while maintaining low false negatives 0.01% or less for different classification parameters and scenarios. Surprisingly, we also discovered that by artificially reducing the influence of the top features we can achieve robust detection even in an adversarial setting where the attacker is aware of the top features and our normality model. Therefore, due to the large number of the extracted features, a defense posture that employs detectors trained on random sets of features, is robust even against mimicry attacks with intimate knowledge of the training set.
GMU-CS-TR-2012-4
The results of a parameter sweep over a multidimensional parameter space are often used to gain an understanding of the space beyond simply identifying optima. But sweeps are costly, and so it is highly desirable to adaptively sample the space in such a way as to concentrate precious samples on the more "interesting" areas of the space. In this paper we analyze and expand on a previous work which defined such areas as those in which the slope of some quality function was high. We develop a performance metric in terms of generalizability of the resulting sampled model, examine the existing method in terms of scalability and accuracy, and then propose and examine two population-based approaches which address some of its shortcomings.
GMU-CS-TR-2012-3
Co-clustering is a machine learning task where the goal is to simultaneously develop clusters of the data and of their respective features. We address the use of co-clustering ensembles to establish a consensus co-clustering over the data. As is obvious from its name, co-clustering is naturally multiobjective. Previous work tackled the problem using both rudimentary multiobjective optimization and expectation maximization, then later a gradient ascent approach which outperformed both of them. In this paper we develop a new preference-based multiobjective optimization algorithm to compete with the gradient ascent approach. Unlike this gradient ascent algorithm, our approach once again tackles the co-clustering problem with multiple heuristics, but also applies the gradient ascent algorithm's joint heuristic as a preference selection procedure. As a result, we are able to significantly outperform the gradient ascent algorithm on feature clustering and on problems with smaller datasets.
GMU-CS-TR-2012-2
An important challenge in dynamic adaptation of a software system is to prevent inconsistencies (failures) and disruptions in its operations during and after change. Several prior techniques have solved this problem with various tradeoffs. All of them, however, assume the availability of detailed component dependency models. This paper presents a complementary technique that solves this problem in settings where such models are either not available, difficult to build, or outdated due to the evolution of the software. Our approach first mines the execution history of a software system to infer a "stochastic component dependency model", representing the probabilistic sequence of interactions among the system's components. We then demonstrate how this model could be used at runtime to infer the "best time" for adaptation of the system's components. We have thoroughly evaluated this research on a multi-user real world software system and under varying conditions.
GMU-CS-TR-2012-1
Localization and mapping has been an area of great importance and interest to the robotics and computer vision community. It has traditionally been accomplished with range sensors such as lasers and sonars. Recent improvements in processing power coupled with advancements in image matching and motion estimation has allowed development of vision based localization techniques. Despite much progress, there are disadvantages to both range sensing and vision techniques making localization and mapping that is inexpensive and robust hard to attain. With the advent of RGB-D cameras which provide synchronized range and video data, localization and mapping is now able to exploit both range data as well as RGB features. This thesis exploits the strengths of vision and range sensing localization and mapping strategies and proposes novel algorithms using RGB-D cameras. We show how to combine existing strategies and present through evaluation of the resulting algorithms against a dataset of RGB-D benchmarks. Lastly we demonstrate the proposed algorithm on a challenging indoor dataset and demonstrate improvements where either pure range sensing or vision techniques perform poorly.