MLBio+Laboratory Machine Learning in Biomedical Informatics



Feed aggregator

PrePrint: CoDe Modeling of Graph Composition for Data Warehouse Report Visualization

TKDE - Wed, 05/01/2013 - 09:54
The visualization of information contained in reports is an important aspect of human-computer interaction, for both the accuracy and the complexity of relationships between data must be preserved. A greater attention has been paid to individual report visualization through different types of standard graphs (Histograms, Pies, etc.). However, this kind of representation provides separate information items and gives no support to visualize their relationships which are extremely important for most decision processes. This paper presents a design methodology exploiting the visual language CoDe based on a logic paradigm. CoDe allows to organize the visualization through the CoDe model which graphically represents relationships between information items and can be considered a conceptual map of the view. The proposed design methodology is composed of four phases: the CoDe Modeling and OLAP Operation pattern definition phases define the CoDe model and underlying meta-data information, the OLAP Operation phase physically extracts data from a data warehouse and the Report Visualization phase generates the final visualization. Moreover, a case study on real data is provided.

PrePrint: Linkable Ring Signature with Unconditional Anonymity

TKDE - Wed, 05/01/2013 - 09:54
In this paper, we construct a linkable ring signature scheme with unconditional anonymity. It has been regarded as an open problem in \cite{LiuWeWo04b} since 2004 for the construction of an unconditional anonymous linkable ring signature scheme. We are the first to solve this open problem by giving a concrete instantiation, which is proven secure in the random oracle model. Our construction is even more efficient than other schemes that can only provide computational anonymity. Simultaneously, our scheme can act as an counterexample to show that Theorem 1 stated in \cite{JeongKL08} is not always true, which stated that linkable ring signature scheme cannot provide strong anonymity. Yet we prove that our scheme can achieve strong anonymity (under one of the interpretations).

PrePrint: A Probabilistic Approach to String Transformation

TKDE - Wed, 05/01/2013 - 09:54
Many problems in natural language processing, data mining, information retrieval, and bioinformatics can be formalized as string transformation, which is a task as follows. Given an input string, the system generates the $k$ most likely output strings corresponding to the input string. This paper proposes a novel and probabilistic approach to string transformation, which is both accurate and efficient. The approach includes the use of a log linear model, a method for training the model, and an algorithm for generating the top $k$ candidates, whether there is or is not a predefined dictionary. The log linear model is defined as a conditional probability distribution of an output string and a rule set for the transformation conditioned on an input string. The learning method employs maximum likelihood estimation for parameter estimation. The string generation algorithm based on pruning is guaranteed to generate the optimal top $k$ candidates. The proposed method is applied to correction of spelling errors in queries as well as reformulation of queries in web search. Experimental results on large scale data show that the proposed approach is very accurate and efficient improving upon existing methods in terms of accuracy and efficiency in different settings.

PrePrint: Privacy-Preserving Enhanced Collaborative Tagging

TKDE - Wed, 05/01/2013 - 09:54
Collaborative tagging is one of the most popular services available online, and it allows end user to loosely classify either online or offline resources based on their feedback, expressed in the form of free-text labels (i.e., tags). Although tags are not per se sensitive information, the wide use of collaborative tagging services increases the risk of cross referencing, thereby seriously compromising user privacy. In this paper, we make a first contribution in this direction by showing how a specific privacy-enhancing technology, namely tag suppression, can be used to protect end-user privacy. Moreover, we analyze how our approach can affect the effectiveness of a policy-based collaborative tagging system which supports enhanced Web access functionalities, like content filtering and discovery, based on preferences specified by end users.

PrePrint: Typicality-Based Collaborative Filtering Recommendation

TKDE - Wed, 05/01/2013 - 09:54
Collaborative filtering (CF) is an important and popular technology for recommender systems. However, current CF methods suffer from such problems as sparsity problem, inaccurate recommendation and producing big-error predictions. In this paper, we borrow ideas of object typicality from cognitive psychology and propose a novel typicality-based collaborative filtering recommendation method named TyCo. A distinct feature of typicality-based CF is that it finds ‘neighbors’ of users based on user typicality degrees in user groups (instead of the co-rated items of users or common users of items in traditional CF). To the best of our knowledge, there is no prior work on investigating CF recommendation by combining object typicality. The proposed method outperforms the compared CF recommendation methods on recommendation accuracy (in terms of MAE). Especially, this method works well even with sparse training data and have less time cost than previous CF methods. Further, it can obtain more accurate predictions with less big-error predictions.

PrePrint: Mining Weakly-Labeled Web Facial Images for Search-Based Face Annotation

TKDE - Wed, 05/01/2013 - 09:54
This paper investigates a framework of Search-Based Face Annotation (SBFA) by mining weakly-labeled facial images that are freely available on the World Wide Web (WWW). One challenging problem for search-based face annotation scheme is how to effectively perform annotation by exploiting the list of most similar facial images and their weak labels that are often noisy and incomplete. To tackle this problem, we propose an effective Unsupervised Label Refinement (ULR) approach for refining the labels of web facial images using machine learning techniques. We formulate the learning problem as a convex optimization and develop effective optimization algorithms to solve the large-scale learning task efficiently. To further speed up the proposed scheme, we also propose a clustering-based approximation algorithm which can improve the scalability considerably. We have conducted an extensive set of empirical studies on a large-scale web facial image testbed, in which encouraging results showed that the proposed ULR algorithms can significantly boost the performance of the promising SBFA scheme.

PrePrint: Disputant Relation-based Classification for Contrasting Opposing Views of Contentious News Issues

TKDE - Wed, 05/01/2013 - 09:54
Contentious news issues, such as the health care reform debate, draws much interest from the public; however, it is not simple for an ordinary user to search and contrast the opposing arguments and have a comprehensive understanding of the issues. Providing a classified view of the opposing views of the issues can help readers to easily understand the issue from multiple perspectives. We present disputant relation-based method for classifying news articles on contentious issues. We observe that the disputants of a contention are an important feature to understand the discourse. It performs unsupervised classification on news articles based on disputant relations, and helps readers intuitively view the articles through the opponent-based frame and attain balanced understanding, free from a specific biased viewpoint. The method is performed in three stages: disputant extraction, disputant partitioning, and article classification. We apply a modified version of HITS algorithm and an SVM classifier trained with pseudo-relevant data for article analysis. We conduct an accuracy analysis and an upper bound analysis for evaluation of the method.

PrePrint: A Cocktail Approach for Travel Package Recommendation

TKDE - Wed, 05/01/2013 - 09:54
This paper provides a study of exploiting online travel information for personalized travel package recommendation. A critical challenge along this line is to address the unique characteristics of travel data, which distinguish travel packages from traditional items for recommendation. To that end, we first analyze the characteristics of the travel packages and develop a Tourist-Area-Season Topic (TAST) model. This TAST model can represent travel packages and tourists by topic distributions, where the topic extraction is conditioned on both the tourists and the intrinsic features of the landscapes. Then, based on this representation, we propose a cocktail approach to generate the lists for personalized travel package recommendation. Furthermore, we extend the TAST model to the Tourist-Relation-Area-Season Topic (TRAST) model for capturing the latent relationships among the tourists in each travel group. Finally, we evaluate the TAST model, the TRAST model, and the cocktail recommendation approach on the real-world travel package data. Experimental results show that the TAST model can effectively capture the unique characteristics of the travel data and the cocktail approach is thus much more effective than traditional recommendation techniques for travel package recommendation. Also, the TRAST model can be used as an effective assessment for travel group formation.

PrePrint: Learning Conditional Preference Networks From Inconsistent Examples

TKDE - Wed, 05/01/2013 - 09:54
The problem of learning Conditional Preference Networks(CP-nets) from a set of examples has received great attention recently. However, because of the randomicity of the users' behaviors and the observation errors, there is always some noise making the examples inconsistent, namely, there exists at least one outcome preferred over itself (by transferring) in examples. Existing CP-nets learning methods can not handle inconsistent examples. In this work, we introduce the model of learning consistent CP-nets from inconsistent examples and present a method to solve this model. We do not learn the CP-nets directly. Instead, we first learn a preference graph from the inconsistent examples, because dominance testing and consistency testing in preference graphs are easier than those in CP-nets. The problem of learning preference graphs is translated into a 0-1 programming and is solved by the branch-and-bound search. Then the obtained preference graph is transformed into a CP-net equivalently , which can entail a subset of examples with maximal sum of weight. Examples are given to show that our method can obtain consistent CP-nets over both binary and multivalued variables from inconsistent examples. The proposed method is verified on both simulated data and real data, and it is also compared with existing methods.

PrePrint: Effective Online Group Discovery in Trajectory Databases

TKDE - Wed, 05/01/2013 - 09:54
GPS-enabled devices are pervasive nowadays. Finding movement patterns in trajectory data stream is gaining in importance. We propose a group discovery framework that aims to efficiently support the online discovery of moving objects that travel together. The framework adopts a sampling-independent approach that makes no assumptions about when positions are sampled, gives no special importance to sampling points, and naturally supports the use of approximate trajectories. The framework's algorithms exploit state-of-the-art, density-based clustering (DBScan) to identify groups. The groups are scored based on their cardinality and duration, and the top-$k$ groups are returned. To avoid returning similar subgroups in a result, notions of domination and similarity are introduced that enable the pruning of low-interest groups. Empirical studies on real and synthetic data sets offer insight into the effectiveness and efficiency of the proposed framework.

PrePrint: Efficient Index-Based Approaches for Skyline Queries in Location-Based Applications

TKDE - Wed, 05/01/2013 - 09:54
Enriching many location-based applications, various new skyline queries are proposed and formulated based on the notion of locational dominance, which extends conventional one by taking objects’ nearness to query positions into account additional to objects’ non-spatial attributes. To answer a representative class of skyline queries for location-based applications efficiently, this paper presents two index-based approaches, namely, Augmented R-tree and Dominance Diagram. Augmented R-tree extends R-tree by including aggregated non-spatial attributes in index nodes to enable dominance checks during index traversal. Dominance Diagram is a solution-based approach, by which each object is associated with a precomputed non-dominance scope wherein query points should have the corresponding object not locationally dominated by any other. Dominance Diagram enables skyline queries to be evaluated via parallel and independent comparisons between non-dominance scopes and query points, providing very high search efficiency. The performance of these two approaches is evaluated via empirical studies, in comparison with other possible approaches.

PrePrint: Supervised Multiple Kernel Embedding for Learning Predictive Subspaces

TKDE - Wed, 05/01/2013 - 09:54
For supervised learning problems, dimensionality reduction is generally applied as a preprocessing step. However, coupled training of dimensionality reduction and supervised learning steps may improve the prediction performance. In this paper, we propose a novel dimensionality reduction algorithm coupled with a supervised kernel-based learner, called supervised multiple kernel embedding, that integrates multiple kernel learning to dimensionality reduction and performs prediction on the projected subspace with a joint optimization framework. Combining multiple kernels allows us to combine different feature representations and/or similarity measures towards a unified subspace. We perform experiments on one digit recognition and two bioinformatics data sets. Our proposed method significantly outperforms multiple kernel Fisher discriminant analysis followed by a standard kernel-based learner, especially on low dimensions.

PrePrint: Clustering Based on Enhanced α-Expansion Move

TKDE - Wed, 05/01/2013 - 09:54
The exemplar-based data clustering problem can be formulated as minimizing an energy function defined on a Markov random field (MRF). However, most algorithms for optimizing MRF energy function can’t be directly applied to the task of clustering, as the problem has a high-order energy function. In this paper, we first show that the high order energy function for the clustering problem can be simplified as a pairwise energy function with the metric property, and consequently it can be optimized by the alpha-expansion move algorithm based on graph cut. Then, the original expansion move algorithm is improved in the following two aspects: (I) Instead of solving a minimal s-t graph cut problem, we show that there is an explicit and interpretable solution for minimizing the energy function in the clustering problem. Based on this interpretation, a fast alpha-expansion move algorithm is proposed, which is much more efficient than the graph cut based algorithm. (II) The fast alpha-expansion move algorithm is further improved by extending its move space so that a larger energy value reduction can be achieved in each iteration. Experiments on benchmark datasets show the enhanced expansion move algorithm has a better performance, compared to other state-of-the-art exemplar-based clustering algorithms.

PrePrint: EnBay: A Novel Pattern-Based Bayesian Classifier

TKDE - Wed, 05/01/2013 - 09:54
A promising approach to Bayesian classification is based on exploiting frequent patterns, i.e., patterns that frequently occur in the training dataset, to estimate the Bayesian probability. Pattern-based Bayesian classification focuses on building and evaluating reliable probability approximations by exploiting a subset of frequent patterns tailored to a given test case. This paper proposes a novel and effective approach to estimate the Bayesian probability. Differently from previous approaches, the Entropy-based Bayesian classifier, namely EnBay, focuses on selecting the minimal set of long and not overlapped patterns that best complies with a conditional-independence model, based on an entropy-based evaluator. Furthermore, the probability approximation is separately tailored to each class. An extensive experimental evaluation, performed on both real and synthetic datasets, shows that EnBay is significantly more accurate than most state-of-the-art classifiers, Bayesian and not.

PrePrint: iLike: Bridging the Semantic Gap in Vertical Image Search by Integrating Text and Visual Features

TKDE - Wed, 05/01/2013 - 09:54
With the development of Internet and Web 2.0, large volume of multimedia content have been made online. It is highly desired to provide easy accessibility to such contents. Towards this goal, content-based image retrieval has been intensively studied in the research community, while text-based search is better adopted in the industry. Both approaches have inherent disadvantages and limitations. Therefore, unlike the great success of text search, Web image search engines are still premature. In this paper, we present iLike, a vertical image search engine which integrates textual and visual features to improve retrieval performance. We bridge the semantic gap by capturing the meaning of each text term in the visual feature space, and re-weight visual features according to their significance to the query terms. We also bridge the user intention gap since we are able to infer the "visual meanings" behind the textual queries. Last but not least, we provide a visual thesaurus, which is generated from the statistical similarity between the visual space representation of textual terms. Experimental results show that our approach improves both precision and recall, compared with content-based or text-based image retrieval techniques. More importantly, search results from iLike is more consistent with users' perception of the query terms.

PrePrint: Efficient Cluster Labeling for Support Vector Clustering

TKDE - Wed, 05/01/2013 - 09:54
We propose a new efficient algorithm for solving the cluster labeling problem in Support Vector Clustering (SVC). The proposed algorithm analyzes the topology of the function describing the SVC cluster contours and explores interconnection paths between critical points separating distinct cluster contours. This process allows distinguishing disjoint clusters and associating each point to its respective one. The proposed algorithm implements a new fast method for detecting and classifying critical points while analyzing the interconnection patterns between them. Experiments indicate that the proposed algorithm significantly improve the accuracy of the SVC labeling process in the presence of clusters of complex shape, while reducing the processing time required by existing SVC labeling algorithms by orders of magnitude.

PrePrint: A Family of Joint Sparse PCA Algorithms for Anomaly Localization in Network Data Streams

TKDE - Wed, 05/01/2013 - 09:54
Determining anomalies in data streams that are collected and transformed from various types of networks has recently attracted significant research interest. Principal Component Analysis (PCA) has been extensively applied to detecting anomalies in network data streams. However, none of existing PCA based approaches addresses the problem of identifying the sources that contribute most to the observed anomaly, or anomaly localization. In this paper, we propose novel sparse PCA methods to perform anomaly detection and localization for network data streams. Our key observation is that we can localize anomalies by identifying a sparse low dimensional space that captures the abnormal events in data streams. To better capture the sources of anomalies, we incorporate the structure information of the network stream data in our anomaly localization framework. Furthermore, we extend our joint sparse PCA framework with multi-dimensional Karhunen Lo`eve Expansion (KLE) that considers both spatial and temporal domains of data streams to stabilize localization performance.We have performed comprehensive experimental studies of the proposed methods, and have compared our methods with the state-of-the-art using three real-world data sets from different application domains. Our experimental studies demonstrate the utility of the proposed methods.

PrePrint: Dealing with Uncertainty: A Survey of Theories and Practices

TKDE - Wed, 05/01/2013 - 09:54
Uncertainty accompanies our life processes, and covers almost all fields of scientific studies. Two general categories of uncertainty, namely, aleatory uncertainty and epistemic uncertainty, exist in the world. While aleatory uncertainty refers to the inherent randomness in nature, derived from natural variability of the physical world (e.g., random show of a flipped coin), epistemic uncertainty origins from human's lack of knowledge of the physical world, as well as ability of measuring and modeling the physical world (e.g., computation of the distance between two cities). Different kinds of uncertainty call for different handling methods. Aggarwal, Yu, Sarma, Zhang et al. have made good surveys on uncertain database management based on the probability theory. This paper reviews multi-disciplinary uncertainty processing activities in diverse fields. Beyond the dominant probability theory and fuzzy theory, we also review information-gap theory and recently derived uncertainty theory. Practices of these uncertainty handling theories in the domains of economics, engineering, ecology, and information sciences are also described. It is our hope that this study could provide insights to the database community on how uncertainty is managed in other disciplines; and further challenge and inspire database researchers to develop more advanced data management techniques and tools to cope with a variety of uncertainty issues in the real world.

PrePrint: A Group Incremental Approach to Feature Selection Applying Rough Set Technique

TKDE - Wed, 05/01/2013 - 09:54
Many real data increase dynamically in size. This phenomena occurs in several fields including economics, population studies and medical research. As an effective and efficient mechanism to deal with such data, incremental technique has been proposed in the literature and attracted much attention, which stimulates the result in this paper. When a group of objects are added to a decision table, we first introduce incremental mechanisms for three representative information entropies and then develop a group incremental rough feature selection algorithm based on information entropy. When multiple objects are added to a decision table, the algorithm aims to find the new feature subset in a much shorter time. Experiments have been carried out on nine UCI data sets and the experimental results show that the algorithm is effective and efficient.

PrePrint: Towards the Automatic Extraction of Policy Networks using Web Links and Documents

TKDE - Wed, 05/01/2013 - 09:54
Policy networks are widely used by political scientists and economists to explain various financial and social phenomena, such as the development of partnerships between political entities or institutions from different levels of governance. The analysis of policy networks demands a series of arduous and time consuming manual steps including interviews and questionnaires. In this paper, we estimate the strength of relations between actors in policy networks using features extracted from data harvested from the web. Features include webpage counts, outlinks, and lexical information extracted from web documents or web snippets. The proposed approach is automatic and does not require any external knowledge source, other than the specification of the word forms that correspond to the political actors. The features are evaluated both in isolation and jointly for both positive and negative (antagonistic) actor relations. The proposed algorithms are evaluated on two EU policy networks from the political science literature. Performance is measured in terms of correlation and mean square error between the human rated and the automatically extracted relations. Correlation of up to 0.74 is achieved for positive relations. The extracted networks are validated by political scientists and useful conclusions about the evolution of the networks over time are drawn.


Powered by Drupal, an open source content management system