Bioinformatics & Data Mining
PrePrint: Information-Theoretic Outlier Detection for Large-Scale Categorical Data
Outlier detection can usually be considered as a pre-processing step for locating, in a dataset, those objects that do not conform to well-defined notions of expected behavior. It is very important in data mining for discovering novel or rare events, anomalies, vicious actions, exceptional phenomena, etc. We are investigating outlier detection for categorical datasets. This problem is especially challenging because of the difficulty of defining a meaningful similarity measure for categorical data. In this paper, we propose a formal definition of outliers and an optimization model of outlier detection, via a new concept of holo-entropy that takes both entropy and total correlation into consideration. Based on this model, we define a function for the outlier factor of an object which is solely determined by the object itself and can be updated efficiently. We propose two practical 1-parameter outlier detection methods, named ITB-SS and ITB-SP, which require no user-defined parameters for deciding whether an object is an outlier. Users need only provide the number of outliers they want to detect. Experimental results show that ITB-SS and ITB-SP are more effective and efficient than mainstream methods and can be used to deal with both large and high-dimensional datasets where existing algorithms fail.
Categories: Bioinformatics & Data Mining
PrePrint: Distributed Line Graphs: A Universal Technique for Designing DHTs Based on Arbitrary Regular Graphs
Most proposed DHTs engage certain topology maintenance mechanisms specific to the static graphs on which they are based. The designs of these mechanisms are complicated and repeated with graph-relevant concerns. In this paper we propose the “distributed line graphs” (DLG), a universal technique for designing DHTs based on arbitrary regular graphs. Using DLG, the main features of the initial graphs are preserved, and thus people can design a new DHT by simply choosing the graph with desirable features and applying DLG to it. We demonstrate the power of DLG by illustrating four DLG-enabled DHTs based on different graphs, namely, Kautz, de Bruijn, butterfly and hypertree graphs. The effectiveness of our proposals is demonstrated through analysis, simulation and implementation.
Categories: Bioinformatics & Data Mining
PrePrint: Transfer Across Completely Different Feature Spaces via Spectral Embedding
In many applications, it is difficult to obtain a lot of labeled examples. One practically important problem is: can the labeled data from other related sources help predict the target task, even if they have (a) different feature spaces (e.g., image vs. text data), (b) different data distributions, and (c) different output spaces? This paper proposes a solution and discusses the conditions where this is highly likely to produce better results. It first unifies the feature spaces of the target and source data sets by spectral embedding, even when they are with completely different feature spaces. The principle is to devise an optimization objective that preserves the original structure of the data, while at the same time, maximizes the similarity between the two. A linear projection model, as well as a non-linear approach are derived on the basis of this principle with closed forms. Second, a judicious sample selection strategy is applied to select only those related source examples. At last, a Bayesian-based approach is applied to model the relationship between different output spaces. The three steps can bridge related heterogeneous sources in order to learn the target task. Among the 20 experiment data sets, the proposed models can reduce the error rate by as much as 50%.
Categories: Bioinformatics & Data Mining
PrePrint: A New Algorithm for Inferring User Search Goals with Feedback Sessions
For a broad-topic and ambiguous query, different users may have different search goals when they submit it to a search engine. The inference and analysis of user search goals can be very useful in improving search engine relevance and user experience. In this paper, we propose a novel approach to infer user search goals by analyzing search engine query logs. Firstly, we propose a framework to discover different user search goals for a query by clustering the proposed feedback sessions. Feedback sessions are constructed from user click-through logs and can efficiently reflect the information needs of users. Secondly, we propose a novel approach to generate pseudo-documents to better represent the feedback sessions for clustering. Finally, we propose a new criterion 'Classified Average Precision (CAP)' to evaluate the performance of inferring user search goals. Experimental results are presented using user click-through logs from a commercial search engine to validate the effectiveness of our proposed methods.
Categories: Bioinformatics & Data Mining
PrePrint: Fuzzy Web Data Tables Integration Guided by an Ontological and Terminological Resource
In this paper, we present the design of the ONDINE system which allows the loading and the querying of a data warehouse opened on the Web, guided by a domain Termino-Ontological Resource (TOR). The data warehouse, composed of data tables extracted from Web documents, has been built to supplement existing local data sources. First we present the main steps of our semi-automatic method to annotate Web data tables driven by a domain TOR. The output of this method is an XML/RDF data warehouse composed of XML documents representing Web data tables with their fuzzy RDF annotations. We then present our flexible querying system which allows the local data sources and the XML/RDF data warehouse to be simultaneously and uniformly queried, using the domain TOR. This system relies on SPARQL and permits to retrieve approximate answers extracted from Web data tables by comparing preferences, expressed in selection criteria using fuzzy sets, with fuzzy RDF annotations.
Categories: Bioinformatics & Data Mining
PrePrint: Radio Database Compression for Accurate Energy-Efficient Localization in Fingerprinting Systems
Location fingerprinting is a positioning method that exploits the already existing infrastructures such as cellular networks or WLANs. Regarding the recent demand for energy efficient networks and the emergence of issues like green networking, we propose a clustering technique to compress the radio database in the context of cellular fingerprinting systems. The aim of the proposed technique is to reduce the computation cost and transmission load in the mobile-based implementations. The presented method may be called Block-based Weighted Clustering (BWC) technique, which is applied in a concatenated location-radio signal space, and attributes different weight factors to the location and radio components. Computer simulations and real experiments have been conducted to evaluate the performance of our proposed technique in the context of a GSM network. The obtained results confirm the efficiency of the BWC technique, and show that it improves the performance of standard k-means and hierarchical clustering methods.
Categories: Bioinformatics & Data Mining
PrePrint: Facilitating Effective User Navigation through Web Site Structure Improvement
Designing well-structured Web sites to facilitate effective user navigation has long been a challenge. A primary reason is that the Web developers' understanding of how a Web site should be structured can be considerably different from that of the users. While various methods have been proposed to re-link Web pages to improve navigability using user navigation data, the completely reorganized new structure can be highly unpredictable, and the cost of disorienting users after the changes remains unanalyzed. This paper addresses how to improve a Web site without introducing substantial changes. Specifically, we propose a mathematical programming model to improve the user navigation on a Web site while minimizing alterations to its current structure. Results from extensive tests conducted on a publicly available real data set indicate that our model not only significantly improves the user navigation with very few changes, but also can be effectively solved. We have also tested the model on large synthetic data sets to demonstrate that it scales up very well. In addition, we define two evaluation metrics and use them to assess the performance of the improved Web site using the real data set. Evaluation results confirm that the user navigation on the improved structure is indeed greatly enhanced.
Categories: Bioinformatics & Data Mining
PrePrint: A Bound on Kappa-Error Diagrams for Analysis of Classifier Ensembles
Kappa-error diagrams are used to gain insights about why an ensemble method is better than another on a given data set. A point on the diagram corresponds to a pair of classifiers. The x-axis is the pairwise diversity (kappa), and the y-axis is the averaged individual error. In this study, kappa is calculated from the 2x2 correct/wrong contingency matrix. We derive a lower bound on kappa which determines the feasible part of the kappa-error diagram. Simulations and experiments with real data show that there is unoccupied feasible space on the diagram corresponding to (hypothetical) better ensembles, and that individual accuracy is the leading factor in improving the ensemble accuracy.
Categories: Bioinformatics & Data Mining
PrePrint: Range-Based Skyline Queries in Mobile Environments
Skyline query processing in location-based services, which considers both spatial and non-spatial attributes of the objects being queried, has recently received increasing attention. Existing solutions focus on solving point- or line-based skyline queries, in which the query location is an exact location point or a line segment. However, due to privacy consideration and limited precision of localization devices, the input of a user location is often a two-dimensional range. This paper studies a new problem on how to process such range-based skyline queries. Two novel algorithms are proposed: one is index-based (I-SKY) and the other is not based on any index (N-SKY). To handle frequent movements of the objects being queried, we also propose incremental versions of I-SKY and N-SKY, which avoid recomputing the query index and results from scratch. Additionally, we develop efficient solutions for probabilistic and continuous range-based skyline queries. Experimental results show that our proposed algorithms well outperform the baseline algorithm that simply adopts the existing line-based skyline solution. Moreover, the incremental versions of I-SKY and N-SKY save substantial computation costs, especially when the objects move frequently.
Categories: Bioinformatics & Data Mining
PrePrint: Building a Scalable Database-Driven Reverse Dictionary
In this paper, we describe the design and implementation of a reverse dictionary. Unlike a traditional forward dictionary, which maps from words to their definitions, a reverse dictionary takes a user input phrase describing the desired concept, and returns a set of candidate words that satisfy the input phrase. This work has significant application not only for the general public, particularly those who work closely with words, but also in the general field of conceptual search. We present a set of algorithms and the results of a set of experiments showing the retrieval accuracy of our methods and the runtime response time performance of our implementation. Our experimental results show that our approach can provide significant improvements in performance scale without sacrificing the quality of the result. Our experiments comparing the quality of our approach to that of currently available reverse dictionaries show that of our approach can provide significantly higher quality over either of the other currently-available implementations.
Categories: Bioinformatics & Data Mining
PrePrint: Mining User Queries with Markov Chains: Application to Online Image Retrieval.
We propose a novel method for automatic annotation, indexing and annotation-based retrieval of images. The new method, that we call Markovian Semantic Indexing (MSI), is presented in the context of an online image retrieval system. Assuming such a system, the users' queries are used to construct an \textit{Aggregate Markov Chain} ($AMC$) through which the relevance between the keywords seen by the system is defined. The users' queries are also used to automatically annotate the images. A stochastic distance between images, based on their annotation and the keyword relevance captured in the $AMC$, is then introduced. Geometric interpretations of the proposed distance are provided and its relation to a clustering in the keyword space is investigated. By means of a new measure of Markovian state similarity, the \textit{mean first cross passage time} ($CPT$), optimality properties of the proposed distance are proved. Images are modeled as points in a vector space and their similarity is measured with MSI. The new method is shown to possess certain theoretical advantages and also to achieve better Precision vs Recall results when compared to Latent Semantic Indexing (LSI) and probabilistic Latent Semantic Indexing (pLSI) methods in Annotation Based Image Retrieval (ABIR) tasks.
Categories: Bioinformatics & Data Mining
PrePrint: Scalable and Parallel Boosting with MapReduce
In this era of data abundance, it has become critical to be able to process large volumes of data at much faster rates than ever before. Boosting is a powerful predictive model that has been successfully used in many real-world applications. However, due to it's inherent sequential nature, achieving scalability for boosting is not trivial and demands the development of new parallelized versions which will allow them to efficiently handle large-scale data. In this paper, we propose two parallel boosting algorithms, AdaBoost.PL and LogitBoost.PL, which facilitate simultaneous participation of multiple computing nodes to construct a boosted ensemble classifier. The proposed algorithms are competitive to the corresponding serial versions in terms of the generalization performance. In addition, our algorithms achieve significant speedup since our approach does not require individual computing nodes to communicate with each other for sharing their data. Hence, they are applicable and are robust in preserving privacy of computations as well. We used Map-Reduce framework to implement our algorithms and demonstrated the performance in terms of classification accuracy, speedup and scaleup using a wide variety of synthetic and real-world data sets.
Categories: Bioinformatics & Data Mining
PrePrint: k-Pattern Set Mining under Constraints
We introduce the problem of $k$-pattern set mining, concerned with finding a set of $k$ related patterns under constraints. This contrasts to regular pattern mining, where one searches for many individual patterns. The $k$-pattern set mining problem is a very general problem that can be instantiated to a wide variety of well-known mining tasks including concept-learning, rule-learning, redescription mining, conceptual clustering and tiling. To this end, we formulate a large number of constraints for use in $k$-pattern set mining, both at the local level, that is, on individual patterns, and on the global level, that is, on the overall pattern set. Building general solvers for the pattern set mining problem remains a challenge. Here, we investigate to what extent constraint programming (CP) can be used as a general solution strategy. We present a mapping of pattern set constraints to constraints currently available in CP. This allows us to investigate a large number of settings within a unified framework and to gain insight in the possibilities and limitations of these solvers. This is important as it allows us to create guidelines in how to model new problems successfully and how to model existing problems more efficiently. It also opens up the way for other solver technologies.
Categories: Bioinformatics & Data Mining
PrePrint: Efficient Service Skyline Computation for Composite Service Selection
Service composition is emerging as an effective vehicle for integrating existing Web services to create value-added and personalized composite services. As Web services with similar functionality are expected to be provided by competing providers, a key challenge is to find the "best" Web services to participate in the composition. When multiple quality aspects (e.g., response time, fee, etc) are considered, a weighting mechanism is usually adopted by most existing approaches, which requires users to specify their preferences as numeric values. We propose to exploit the {\em dominance relationship} among service providers to find a set of "best" possible composite services, referred to as a {\em composite service skyline}. We develop efficient algorithms that allow us to find the composite service skyline from a significantly reduced searching space instead of considering all possible service compositions. We propose a novel bottom-up computation framework that enables the skyline algorithm to scale well with the number of services in a composition. We conduct a comprehensive analytical and experimental study to evaluate the effectiveness, efficiency, and scalability of the composite skyline computation approaches.
Categories: Bioinformatics & Data Mining
PrePrint: The Minimum Consistent Subset Cover Problem: A Minimization View of Data Mining
In this paper, we introduce and study the minimum consistent subset cover (MCSC) problem. Given a finite ground set X and a constraint t, find the minimum number of consistent subsets that cover X, where a subset of X is consistent if it satisfies t. The MCSC problem generalizes the traditional set covering problem and has minimum clique partition, a dual problem of graph coloring, as an instance. The problem reflects a minimization view of data mining. Many common data mining tasks in rule learning, clustering, and pattern mining can be formulated as MCSC instances. In particular, we discuss the minimum rule set problem that minimizes model complexity of decision rules, the converse k-clustering problem that minimizes the number of clusters, and the pattern summarization problem that minimizes the number of patterns. For any of these MCSC instances, our proposed generic algorithm CAG can be directly applicable. CAG starts by constructing a maximal optimal partial solution, then performs an example-driven specific-to-general search on a dynamically maintained bipartite assignment graph to simultaneously learn a set of consistent subsets with small cardinality covering the ground set.
Categories: Bioinformatics & Data Mining
PrePrint: Data Cube Materialization and Mining over MapReduce
Computing interesting measures for data cubes and subsequent mining of interesting cube groups over massive datasets are critical for many important analyses done in the real world. Previous studies have focused on algebraic measures such as SUM that are amenable to parallel computation and can easily benefit from the recent advancement of parallel computing infrastructure such as MapReduce. Dealing with holistic measures such as TOP-K, however, is non-trivial. In this paper we detail real-world challenges in cube materialization and mining tasks on Web-scale datasets. Specifically, we identify an important subset of holistic measures and introduce MR-Cube, a MapReduce based framework for efficient cube computation and identification of interesting cube groups on holistic measures. We provide extensive experimental analyses over both real and synthetic data. We demonstrate that, unlike existing techniques which cannot scale to the 100 million tuple mark for our datasets, MR-Cube successfully and efficiently computes cubes with holistic measures over billion-tuple datasets.
Categories: Bioinformatics & Data Mining
PrePrint: λ-Diverse Nearest Neighbors Browsing for Multi-Dimensional Data
Traditional search methods try to obtain the most relevant information and rank it according to the degree of similarity to the queries. Diversity in query results is also preferred by a variety of applications since results very similar to each other cannot capture all aspects of the queried topic. In this work, we focus on the λ-diverse k-nearest neighbor search problem on spatial and multi-dimensional data. Unlike the approach of diversifying query results in a post-processing step, we naturally obtain diverse results with the proposed geometric and index-based methods. We first make an analogy with the concept of natural neighbors and propose a natural neighbor-based method for 2D and 3D data and an incremental browsing algorithm based on Gabriel graphs for higher dimensional spaces. We then introduce a diverse browsing method based on the distance browsing feature of spatial index structures, such as R-trees. The algorithm maintains a priority queue with mindivdist of the objects depending on both relevancy and angular diversity and efficiently prunes non-diverse items and nodes. We experimented with a number of spatial and high-dimensional datasets, including Factual's US points-of-interest dataset with 13M entries. With effective pruning, our diverse browsing method is shown to be more efficient and more effective than KNN and KNDN techniques.
Categories: Bioinformatics & Data Mining
PrePrint: Cutting Plane Training for Linear Support Vector Machines
Suppor t Vector Machines (SVMs) have been shown to achieve high performance on classification tasks across many domains, and a great deal of work has been dedicated to developing training algorithms for linear SVMs which are computationally tractable on large data sets. One approach [1] approximately minimizes risk through use of cutting planes, and is improved by [2], [3]. We build upon this work, presenting a modification to the algorithm developed by [2]. We demonstrate empirically that our changes can reduce cutting-plane training time by up to 40%, and discuss how effectiveness of our method is impacted by changes in data sets and parameter settings.
Categories: Bioinformatics & Data Mining
PrePrint: Clustering Large Probabilistic Graphs
We study the problem of clustering probabilistic graphs. Similar to the problem of clustering standard graphs, probabilistic graph clustering has numerous applications, such as finding complexes in probabilistic protein-protein interaction networks and discovering groups of users in affiliation networks. We extend the edit-distance based definition of graph clustering to probabilistic graphs. We establish a connection between our objective function and correlation clustering to propose practical approximation algorithms for our problem. A benefit of our approach is that our objective function is parameter-free. Therefore, the number of clusters is part of the output. We also develop methods for testing the statistical significance of the output clustering and study the case of noisy clusterings. Using a real protein-protein interaction network and ground-truth data, we show that our methods discover the correct number of clusters and identify established protein relationships. Finally, we show the practicality of our techniques using a large social network of Yahoo! users consisting of one billion edges.
Categories: Bioinformatics & Data Mining
PrePrint: Event Tracking for Real-Time Unaware Sensitivity Analysis (EventTracker)
A novel platform for instantaneous Sensitivity Analysis applicable to large scale real-time data acquisition systems is introduced. The drive for the proposed EventTracker platform is the assumption that modern industrial systems are suffieciently flexible and equiped to capture data. This flexibility to adapt can only be assured if the data collected can be succinctly interpreted and translated into corrective actions in timely manner. An important factor that will help in data interpretation and information modelling is the appreciation of the affect system inputs have on each output at their time of occurrence. Existing sensitivity analysis methods appear to hamper efficient and timely sensitivity analysis due to their heavy reliance on historical data, or their sluggishness in providing a timely solution that is of use in real-time applications. This inefficiency is compounded by computational limitations and the complexity of existing models. Dealing with real-time event driven systems, the underpinning logic of the proposed approach is the assumption that, in the vast majority of cases changes to input variables triggers events. The proposed event tracking sensitivity analysis method describes variables and the system state as a collection of events. Compared with Entropy-based, the proposed event tracking sensitivity analysis demostrates 10% improvement in computational efficiency with and same accuracy.
Categories: Bioinformatics & Data Mining

