Jacob Berlin and Amihai Motro
Schema matching, the problem of finding mappings between the attributes of two semantically related database schemas, is an important aspect of many database applications such as schema integration, data warehousing, and electronic commerce. Unfortunately, schema matching remains largely a manual, labor-intensive process. Furthermore, the effort required is typically linear in the number of schemas to be matched; the next pair of schemas to match is not any easier than the previous pair. In this paper we describe a system, called Automatch, that uses machine learning techniques to automate schema matching. Based primarily on Bayesian learning, the system acquires probabilistic knowledge from examples that have been provided by domain experts. This knowledge is stored in a knowledge base called the attribute dictionary. When presented with a pair of new schemas that need to be matched (and their corresponding database instances), Automatch uses the attribute dictionary to find an optimal matching. We also report initial results from the Automatch project.
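The core of Automatch's approach — scoring a column of values against learned attribute statistics — can be sketched with a naive Bayes classifier. The attribute dictionary below, its token features, and the `match_attribute` function are all illustrative assumptions, not the paper's actual implementation:

```python
from collections import Counter
from math import log

# Hypothetical attribute dictionary: for each known attribute, token
# frequencies learned from expert-labeled example instances (assumed data).
DICTIONARY = {
    "phone": Counter({"DIGIT": 9, "DASH": 2}),
    "name":  Counter({"ALPHA": 10, "SPACE": 3}),
}

def tokenize(value):
    # Crude feature extraction: classify each character of a value.
    for ch in value:
        if ch.isdigit():
            yield "DIGIT"
        elif ch.isalpha():
            yield "ALPHA"
        elif ch == "-":
            yield "DASH"
        elif ch == " ":
            yield "SPACE"

def match_attribute(column_values):
    """Return the dictionary attribute with the highest smoothed
    log-likelihood for the observed column values."""
    scores = {}
    for attr, freqs in DICTIONARY.items():
        total = sum(freqs.values())
        vocab = len(freqs) + 1
        score = 0.0
        for value in column_values:
            for tok in tokenize(value):
                # Laplace smoothing avoids zero probabilities for
                # tokens never seen with this attribute.
                score += log((freqs[tok] + 1) / (total + vocab))
        scores[attr] = score
    return max(scores, key=scores.get)
```

A column of phone-like strings would score highest against the `phone` entry; the full system additionally solves an optimal assignment over all attribute pairs, which this sketch omits.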
Changzhou Wang and X. Sean Wang
Each sub-series of a time series can be mapped to a point on a plane whose two coordinates represent the starting time and the ending time, respectively. In certain applications, points with similar properties are likely to be clustered in a special trapezoid shape. To store these points compactly, the following optimization problem arises: find the minimum number of trapezoids that cover a given set of points. This paper formalizes the optimization problem as a decision problem, namely, Bitmap Cover with special Trapezoids (BCT), and proves that BCT is NP-hard.
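The sub-series-to-point mapping can be made concrete with a small sketch. The predicate used here (mean value above a threshold) is an arbitrary example property, not one from the paper:

```python
def subseries_points(series, predicate):
    """Map every sub-series satisfying `predicate` to a point
    (start, end) on the plane.  Since end >= start always holds,
    all points lie on or above the diagonal."""
    points = []
    n = len(series)
    for start in range(n):
        for end in range(start, n):
            if predicate(series[start:end + 1]):
                points.append((start, end))
    return points

# Example property (an assumption for illustration): mean value >= 5.
above_5 = lambda sub: sum(sub) / len(sub) >= 5

pts = subseries_points([1, 8, 9, 2], above_5)
```

Covering such point sets with the minimum number of trapezoids is the BCT problem the paper proves NP-hard.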
Stephen R. Schach, Bo Jin, David R. Wright, Gillian Z. Heller, A. Jefferson Offutt
We have examined 365 versions of Linux. For every version, we counted the number of instances of common (global) coupling between each of the 17 kernel modules and all the other modules in that version of Linux. We found that the number of instances of common coupling grows exponentially with version number. This result is significant at the 99.99% level, and no additional variables are needed to explain this increase. On the other hand, the number of lines of code in each kernel module grows only linearly with version number. We conclude that, unless Linux is restructured with a bare minimum of common coupling, the dependencies induced by common coupling will, at some future date, make Linux exceedingly hard to maintain without inducing regression faults.
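Exponential growth of a count against version number is typically detected by fitting a line in log space: if count = A·r^v, then log(count) is linear in v. The coupling counts below are synthetic, for illustration only; the paper's measurements come from 365 actual Linux versions:

```python
from math import exp, log

# Synthetic coupling counts per version (an assumption for illustration;
# each value is roughly 1.5x the previous, i.e. exponential growth).
versions = [1, 2, 3, 4, 5]
counts = [120, 180, 270, 405, 608]

def fit_exponential(xs, ys):
    """Least-squares fit of count = A * r**x via log-linear regression.
    Returns (A, r); r > 1 indicates exponential growth."""
    lys = [log(y) for y in ys]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(lys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, lys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return exp(intercept), exp(slope)

A, r = fit_exponential(versions, counts)
```

A growth rate r significantly above 1, combined with only linear growth in lines of code, is what drives the paper's maintainability conclusion.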
Jeff Offutt
This report presents results for the Rockwell Collins Inc. sponsored project on generating test data from requirements/specifications, which started January 1, 2000. The purpose of this project is to improve our ability to test software that needs to be highly reliable by developing formal techniques for generating test cases from formal specifications of the software. Formal specifications give software developers the opportunity to describe exactly what services the software should provide, supplying information to software designers in a clear, unambiguous manner. Formal specifications also give test engineers information about the expectations of the software in a form that can be automatically manipulated. This Phase IV, 2000 report presents progress on an empirical evaluation of the efficacy of the transition-pair criterion developed in previous years. Transition-pair tests have been developed for the Flight Guidance System (FGS) and were run on the faulty versions of FGS developed last year. These tests and the results are summarized in this preliminary report. This report also presents progress on our test data generation tool. The tool has been significantly expanded to handle multiple SCR tables, recursively defined tables, event and condition tables, non-boolean variables, and multiple-variable expressions. It is integrated with the Naval Research Laboratory's SCRTool and Rational Software Corporation's Rational Rose tool.
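The transition-pair criterion requires tests to exercise every pair of transitions that can occur consecutively in the specification's state machine. A minimal sketch of enumerating such pairs, using placeholder states and events (not taken from the FGS specification):

```python
# A tiny state machine as (source, event, target) triples.
# States and events are hypothetical placeholders, not from the report.
transitions = [
    ("OFF",    "power_on",  "ARMED"),
    ("ARMED",  "engage",    "ACTIVE"),
    ("ACTIVE", "disengage", "ARMED"),
    ("ARMED",  "power_off", "OFF"),
]

def transition_pairs(trans):
    """Enumerate all pairs (t1, t2) where t2 can immediately follow t1,
    i.e. t1's target state equals t2's source state.  The transition-pair
    criterion requires each such pair to be covered by some test."""
    return [(t1, t2)
            for t1 in trans
            for t2 in trans
            if t1[2] == t2[0]]

pairs = transition_pairs(transitions)
```

For this machine there are six adjacent pairs; a test suite satisfying the criterion must drive the system through each one.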
Alessandro D'Atri and Amihai Motro
An essential part of a modern economy is an information market. In this market, information products are traded in countless ways. Information is bought, modified, integrated, incorporated into other products and then sold again. Usually, the production of an information product requires the collaboration of several participants. A virtual enterprise is a community of business entities that collaborate on the manufacturing of new products. This collaboration is often ad hoc, for a specific product only, after which the virtual enterprise may disband. The virtual enterprise paradigm is particularly appealing for modeling collaborations for manufacturing information products, and in this paper we present a new model, called VirtuE, for modeling such activities.
Alexander Brodsky and Victor E. Segal
This paper addresses the evaluation of a conjunction of N disjunctions of linear constraints in a multidimensional space (constraint join), which is a common constraint database operation. Existing methods are exponential in N. Here combinatorial geometry methods are applied to establish conditions for a polynomial size of the join output. A polynomial method to compute a join for an arbitrary input is then given. As part of the solution, a new algorithm for convex decomposition of an arbitrary non-convex polyhedron is developed. The H-tree is proposed, a new data structure combining constraint storage and indexing. H-tree and R-tree implementations of the algorithm are given, and analytical considerations regarding performance are provided.
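The input form of the constraint join — a conjunction of disjunctions of linear constraints — can be illustrated with a simple membership test in two dimensions. The representation and formula below are assumptions for illustration; the paper treats the general multidimensional evaluation problem, not point testing:

```python
# A linear constraint a1*x + a2*y <= b, represented as (a1, a2, b).
# The join input is a conjunction of N disjunctions of such constraints.

def satisfies(point, conjunction):
    """True iff the point satisfies every disjunction, i.e. at least
    one linear constraint in each disjunct holds."""
    x, y = point
    return all(
        any(a1 * x + a2 * y <= b for (a1, a2, b) in disjunction)
        for disjunction in conjunction
    )

# Hypothetical formula: (x <= 1 OR y <= 1) AND (x + y <= 3).
formula = [
    [(1, 0, 1), (0, 1, 1)],
    [(1, 1, 3)],
]
```

Naively distributing such a formula into disjunctive normal form yields a number of conjuncts exponential in N, which is precisely the blow-up the paper's combinatorial-geometry approach avoids.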