Title: Protein Function Prediction using Dependence Maximization
Authors: Guoxian Yu, Carlotta Domeniconi, Huzefa Rangwala and Guoji Zhang
Conference: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2013)
Location: Prague, Czech Republic
Acceptance Rate: 25% (111/443)
Preprint: Coming Soon
Title: Relevant Subsequence Detection with Sparse Dictionary Learning
Authors: Sam Blasiak, Huzefa Rangwala and Kathryn Laskey
Conference: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2013)
Location: Prague, Czech Republic
Acceptance Rate: 25% (111/443)
Preprint: Coming Soon
We have two undergraduates from George Mason University and three high school students join the group to work on developing computational models for metagenome analysis and motif detection. Watch this space for all the fun and cool work they do over the summer!
read more
Huzefa has been selected to represent Mason as the university's Rising Star award nominee.
The Outstanding Faculty Awards are the Commonwealth's highest honor for faculty at Virginia's public and private colleges and universities. These awards recognize superior accomplishments in teaching, research, and public service. Within this program, the Rising Star awards recognize early-career faculty. For more information on the awards program, see this page.
Huzefa was selected for the Volgenau School of Engineering's Outstanding Teacher award. The award was presented at the VSE Annual Awards Gala on May 4th.
Azad Naik successfully defends MS thesis "Using Multi-Task Learning for Large Scale Document Classification."
Authority flow techniques like PageRank and ObjectRank can provide personalized ranking of typed entity-relationship graphs. There are two main ways to personalize authority flow ranking: Node-based personalization, where authority originates from a set of user-specific nodes; Edge-based personalization, where the importance of different edge types is user-specific. We propose the first approach to achieve efficient edge-based personalization using a combination of precomputation and runtime algorithms. In particular, we apply our method to ObjectRank, where a personalized weight assignment vector (WAV) assigns different weights to each edge type or relationship type. Our approach includes a repository of rankings for various WAVs. We consider the following two classes of approximation: (a) SchemaApprox is formulated as a distance minimization problem at the schema level; (b) DataApprox is a distance minimization problem at the data graph level. SchemaApprox is not robust since it does not distinguish between important and trivial edge types based on the edge distribution in the data graph. In contrast, DataApprox has a provable error bound. Both SchemaApprox and DataApprox are expensive so we develop efficient heuristic implementations, ScaleRank and PickOne respectively. Extensive experiments on the DBLP data graph show that ScaleRank provides a fast and accurate personalized authority flow ranking.
Co-clustering is a commonly used technique for tapping the rich meta-information of multimedia web documents, including category, annotation, and description, for associative discovery. However, most co-clustering methods proposed for heterogeneous data do not consider the representation problem of short and noisy text and their performance is limited by the empirical weighting of the multi-modal features. In this paper, we propose a generalized form of Heterogeneous Fusion Adaptive Resonance Theory, called GHF-ART, for co-clustering of large-scale web multimedia documents. By extending the two-channel Heterogeneous Fusion ART (HF-ART) to multiple channels, GHF-ART is designed to handle multimedia data with an arbitrarily rich level of meta-information. For handling short and noisy text, GHF-ART does not learn directly from the textual features. Instead, it identifies key tags by learning the probabilistic distribution of tag occurrences. More importantly, GHF-ART incorporates an adaptive method for effective fusion of multi-modal features, which weights the features of multiple data sources by incrementally measuring the importance of feature modalities through the intra-cluster scatters. Extensive experiments on two web image data sets and one text document set have shown that GHF-ART achieves significantly better clustering performance and is much faster than many existing state-of-the-art algorithms
Pattern classification systems are commonly used in adversarial applications, like biometric authentication, network intrusion detection, and spam filtering, in which data can be purposely manipulated by humans to undermine their operation. As this adversarial scenario is not taken into account by classical design methods, pattern classification systems may exhibit vulnerabilities, whose exploitation may severely affect their performance, and consequently limit their practical utility. Extending pattern classification theory and design methods to adversarial settings is thus a novel and very relevant research direction, which has not yet been pursued in a systematic way. In this paper, we address one of the main open issues: evaluating at design phase the security of pattern classifiers, namely, the performance degradation under potential attacks they may incur during operation. We propose a framework for empirical evaluation of classifier security that formalizes and generalizes the main ideas proposed in the literature, and give examples of its use in three real applications. Reported results show that security evaluation can provide a more complete understanding of the classifier's behavior in adversarial environments, and lead to better design choices.
We propose a logical framework to analyze complex predicates (those involving a subquery) in SQL. We propose a new operator in the relational algebra for handling such predicates, and study its properties and how it combines with traditional relational operator. We focus on predicates of the form att Θ MOD S, where att is an attribute, Θ a comparison operator, MOD is one of SOME or ALL, and S is a (correlated or non-correlated) subquery. We provide a formal characterization of these predicate, as well as an implementation and optimization strategies for it. We show that our approach is extendible, so we can support the expression and optimization of other, similar predicates. Finally, we describe experimental evidence that the proposed approach is more efficient than the traditional approach across a variety of conditions.
Entity Resolution is an inherently quadratic task that typically scales to large data collections through blocking. In this paper, we introduce meta-blocking as a generic procedure that intervenes between the creation and the processing of blocks, transforming an initial set of blocks into a new one with substantially fewer comparisons and equally high effectiveness. In essence, meta-blocking aims at extracting the most similar pairs of entities by leveraging the information that is encapsulated in the block-to-entity relationships. To this end, it first builds an abstract graph representation of the original set of blocks, with the nodes corresponding to entity profiles and the edges connecting the co-occurring ones. During the creation of this structure all redundant comparisons are discarded, while the superfluous ones can be removed by pruning of the edges with the lowest weight. We analytically examine both procedures, proposing a multitude of edge weighting schemes, graph pruning algorithms as well as pruning criteria. Our approaches are schema-agnostic, thus accommodating any type of blocks. We evaluate their performance through a thorough experimental study over three large-scale, real-world datasets, with the outcomes verifying significant efficiency enhancements at a negligible cost in effectiveness.
One of the most popular techniques of generating classifier ensembles is known as stacking which is based on a meta-learning approach. In this paper we introduce an alternative method to stacking which is based on cluster analysis. Similar to stacking, instances from a validation set are initially classified by all base classifiers. The output of each classifier is subsequently considered as a new attribute of the instance. Following this, a validation set is divided into clusters according to the new attributes and a small subset of the original attributes of the instances. For each cluster we find its centroid and calculate its class label. The collection of centroids is considered as a meta-classifier. Experimental results show that the new method outperformed all benchmark methods, namely Majority Voting, Stacking J48, Stacking LR, AdaBoost J48 and Random Forest, in 12 out of 22 data sets. The proposed method has two advantageous properties: it is very robust to relatively small training sets and it can be applied in semi-supervised learning problems. We provide a theoretical investigation regarding the proposed method. This demonstrates that for the method to be successful, the base classifiers applied in the ensemble should have greater than 50% accuracy levels.
Creating an efficient and economic trip plan is the most annoying job for a backpack traveler. The search space of all possible itineraries is too costly to fully explore, to simplify the complexity, most work assume that user's trip is limited to some important POIs and will complete within one day. To address the above limitation, in this paper, we design a more general itinerary planning service, which generates multi-day itineraries for the users. In our service, all POIs are considered and ranked based on the users' preference. The problem of searching the optimal itinerary is a team orienteering problem (TOP), a well-known NP-complete problem. To reduce the processing cost, a two-stage planning scheme is proposed. In its preprocessing stage, single-day itineraries are precomputed via the MapReduce jobs. In its online stage, an approximate search algorithm is used to combine the single day itineraries. In this way, we transfer the TOP problem with no polynomial approximation into another NP-complete problem (Set-Packing problem) with good approximate algorithms. Experiments on real datasets show that our approach can generate high quality itineraries efficiently.
We propose a protocol for secure mining of association rules in horizontally distributed databases. The current leading protocol is that of Kantarcioglu and Clifton [18]. Our protocol, like theirs, is based on the Fast Distributed Mining (FDM) algorithm of Cheung et al. [8], which is an unsecured distributed version of the Apriori algorithm. The main ingredients in our protocol are two novel secure multi-party algorithms --- one that computes the union of private subsets that each of the interacting players hold, and another that tests the inclusion of an element held by one player in a subset held by another. Our protocol offers enhanced privacy with respect to the protocol in [18]. In addition, it is simpler and is significantly more efficient in terms of communication rounds, communication cost and computational cost.
Versioning file systems are widely used in modern computer systems as they provide system recovery and old data access functions by retaining previous file system snapshots. However, existing versioning file systems do not perform well with the emerging PCM (phase change memory) storage, because they are optimized for hard disks. Specifically, a large amount of additional writes incurred by maintaining snapshot degrades the performance of PCM seriously as write operations are the performance bottleneck of PCM. This paper presents a novel versioning file system, designed for PCM, that reduces the writing overhead of a snapshot significantly. Unlike existing versioning file systems that incur cascade writes up to the file system root, our scheme breaks the recursive update chain at the immediate parent level. The proposed file system is implemented on Linux 2.6 as a prototype. Measurement studies with various I/O benchmarks show that the proposed file system improves the I/O throughput by 144% on average, compared to ZFS, a representative versioning file system.
In this paper, we propose a novel cross-space affinity learning algorithm over different spaces with heterogeneous structures. Unlike most of affinity learning algorithms on the homogeneous space, we construct a cross-space tensor model to learn the affinity measures on heterogeneous spaces subject to a set of order constraints from the training pool. We further enhance the model with a factorization form which greatly reduces the number of parameters of the model with a controlled complexity. Moreover from the practical perspective, we show the proposed factorized cross-space tensor model can be efficiently optimized by a series of simple quadratic optimization problems in an iterative manner. The proposed cross-space affinity learning algorithm can be applied to many real-world problems, which involve multiple heterogeneous data objects defined over different spaces. In this paper, we apply it into the recommendation system to measure the affinity between users and the product items, where a higher affinity means a higher rating of the user on the product. For an empirical evaluation, a widely-used benchmark movie recommendation data set - MovieLens is used to compare the proposed algorithm with other state-of-the-art recommendation algorithms and we show that very competitive results can be obtained.
This paper designs an efficient image hashing with a ring partition and a non-negative matrix factorization (NMF), which is with both the rotation robustness and good discriminative capability. The key contribution is a novel construction of rotation-invariant secondary image, which is used for the first time in image hashing and is helpful to make image hash resistant to rotation. In addition, NMF coefficients are approximately linearly changed by content-preserving manipulations, so as to measure hash similarity with correlation coefficient. We conduct experiments for illustrating the efficiency with 346 images. Our experiments show that the proposed hashing is robust against content-preserving operations, such as image rotation, JPEG compression, watermark embedding, Gaussian low-pass filtering, gamma correction, brightness adjustment, contrast adjustment and image scaling. Receiver operating characteristics (ROC) curve comparisons are also conducted with the state-of-the-art algorithms, and demonstrate that the proposed hashing is much better than all these algorithms in classification performances with respect to robustness and discrimination.
Result diversification has recently attracted considerable attention as a means of increasing user satisfaction in recommender systems, as well as in web and database search. In this paper, we focus on the problem of selecting the k-most diverse items from a result set. Whereas previous research has mainly considered the static version of the problem, in this paper, we exploit the dynamic case in which the result set changes over time, as for example, in the case of notification services. We define the Continuous k-Diversity Problem along with appropriate constraints that enforce continuity requirements on the diversified results. Our proposed approach is based on cover trees and supports dynamic item insertion and deletion. The diversification problem is in general NP-hard; we provide theoretical bounds that characterize the quality of our solution based on cover trees with respect to the optimal solution. Since results are often associated with a relevance score, we extend our approach to also account for relevance. Finally, we report experimental results concerning the efficiency and effectiveness of our approach on a variety of real and synthetic datasets.
This paper takes the shortest path discovery to study efficient relational approaches to graph search queries. We first abstract three enhanced relational operators, based on which we introduce an FEM framework to bridge the gap between relational operations and graph operations. We show new features introduced by recent SQL standards, such as window function and merge statement, can improve the performance of the FEM framework. Second, we propose an edge weight aware graph partitioning schema and design a bi-directional restrictive BFS (breadth-first-search) over partitioned tables, which improves the scalability and performance without extra indexing overheads. The final extensive experimental results illustrate our relational approach with optimization strategies can achieve high scalability and performance.
Traditionally, as soon as confidentiality becomes a concern, data is encrypted before outsourcing to a service provider. Any software-based cryptographic constructs then deployed, for server-side query processing on the encrypted data, inherently limit query expressiveness. Here, we introduce TrustedDB, an outsourced database prototype that allows clients to execute SQL queries with privacy and under regulatory compliance constraints by leveraging server-hosted, tamper-proof trusted hardware in critical query processing stages, thereby removing any limitations on the type of supported queries. Despite the cost overhead and performance limitations of trusted hardware, we show that the costs per query are orders of magnitude lower than any (existing or) potential future software-only mechanisms. TrustedDB is built and runs on actual hardware, and its performance and costs are evaluated here.