Mark Hartong, Rajini Goel, Duminda Wijesekera
ISE-TR-06-10
Positive Train Control (PTC) is a wireless control system ensuring railroad safety by enforcing train separation, speed enforcement, roadway worker protection and other safety functions. Due to shared track rights over each-other's tracks in North America, company A's trains must be safely operated by company B's crew on company C's tracks, requiring different PTC systems to securely interoperate with each other. For a security framework to ensure that, we propose using a trust management system with certificates and over the air re-keying (OTAR). Back of the envelope calculations show that our solution meets timing needs of PTC. Index Terms: Security, Rail Transportation Control, SCADA Systems, Cryptography
Bojun Yan, Carlotta Domeniconi
ISE-TR-06-09
A critical problem related to kernel-based methods is the selection of an optimal kernel for the problem at hand. The kernel function in use must conform with the learning target in order to obtain meaningful results. While solutions to estimate optimal kernel functions and their parameters have been proposed in a supervised setting, the problem presents open challenges when no labeled data are provided, and all we have available is a set of pairwise must-link and cannot-link constraints. In this paper we address the problem of optimizing the kernel function using pairwise constraints for semi-supervised clustering. To this end we derive a new optimization criterion to automatically estimate the optimal parameters of composite Gaussian kernels, directly from the data and the given constraints. We combine the optimal kernel function computed by our technique with a recently introduced semi-supervised kernel-based algorithm to demonstrate experimentally the effectivess of our approach. The results show that our method enables the practical utilization of powerful kernel-based semi-supervised clustering approaches by providing a mechanism to automatically set the involved critical parameters.
Saket Kaushik, Duminda Wijesekera and Paul Ammann
ISE-TR-06-08
Web Services offer an excellent opportunity to redesign and replace old and insecure applications with more flexible and robust ones. WSEmail is one such application that replaces conventional message delivery systems with a family of Web Services that achieve the same goal. In this paper we analyze the existing WSEmail specification against the standard set of use cases (and misuse cases) supported (resp. prevented) by SMTP implementations - the current default message delivery infrastructure - and augment it with several missing pieces. In addition, we show how the WSEmail family of Web Services, specified in WSDL, can be orchestrated using BPEL. Finally, we provide a synchronization analysis of our WSEmail orchestration and show its correctness.
Sasket Kaushik, Csilla Farkas, Duminda Wijesekera, and Paul Ammann
ISE-TR-06-07
Ontologies are used as a means of expressing agreements to a vocabulary shared by a community in a coherent and consistent community members in a decentralized manner. As it happens on the internet, ontologies are created by community members in a decentralized manner, requiring that they be merged before being used by the community. We develop an algabra to do so in the Resource Discription Framework (RDF). To provide formal semantics of the proposed algebraic property names, while ontology C has been composed by ooperators, we type a fragment of the RDF syntax.
Ning Kang, Carlotta Domeniconi, and Daniel Barbará
ISE-TR-06-06
In text mining we often have to handle large document collections. The labeling of such large corpuses of documents is too expensive and impractical. Thus, there is a need to develop (unsupervised) clustering techniques for text data, where the distributions of words can vary significantly from one category to another. The vector space model of documents easily leads to a 30000 or more dimensions. In such high dimensionality, the effectiveness of any distance function that equally uses all input features is severely compromised. Furthermore, it is expected that different words may have different degrees of relevance for a given category of documents, and a single word may have a different importance across different categories. In this paper we first propose a global unsupervised feature selection approach for text, based on frequent itemset mining. As a result, each document is represented as a set of words that co-occur frequently in the given corpus of documents. We then introduce a locally adaptive clustering algorithm, designed to estimate (local) word relevance and, simultaneously, to group the documents. We present experimental results to demonstrate the feasibility of our approach. Furthermore, the analysis of the weights credited to terms provide evidence that the identified keywords can guide the process of label assignment to clusters. We take into consideration both spam email filtering and general classification datasets. Our analysis of the distribution of weights in the two cases provides insights on how the spam problem distinguishes from the general classification case. Keywords: Feature selection, feature relevance and weighting, subspace clustering, document categorization, spam emails.
Saket Kaushik, William Winsborough, Duminda Wijesekera, and Paul Ammann
ISE-TR-06-05
In this paper we identify an undesirable side-effect of combining different email-control mechanisms for protection from unwanted messages, namely, leakage of recipients' private information to message senders. This is because some email-control mechanisms like bonds, graph-turing tests, etc., inherently leak information, and without discontinuing their use, leakage channels cannot be closed. We formalize the capabilities of an attacker and show how she can launch guessing attacks on recipient's mail acceptance policy that utilizes leaky mechanism in its defence against unwanted mail. As opposed to the classical Dolev-Yao attacker and its extensions, attacker in our model guesses the contents of a recipient's private information. The use of leaky mechanisms allow the sender to verify her guess. We assume a constraint logic programming based policy language for specification and evaluation of mail acceptance criteria and present two different program transformations that can prevent guessing attacks while allowing recipients to utilize any email-control mechanism in their policies.
Carlotta Domeniconi, Dimitrios Gunopulos, Sheng Ma, Dimitris Papadopoulos, and Bojun Yan
ISE-TR-06-04
Clustering suffers from the curse of dimensionality, and similarity functions that use all input features with equal relevance may not be effective. We introduce an algorithm that discovers clusters in subspaces spanned by different combinations of dimensions via local weightings of features. This approach avoids the risk of loss of information encountered in global dimensionality reduction techniques, and does not assume any data distribution model. Our method associates to each cluster a weight vector, whose values capture the relevance of features within the corresponding cluster. We experimentally demonstrate the gain in perfomance our method achieves with respect to competitive methods, using both synthetic and real datasets. In particular, our results show the feasibility of the proposed technique to perform simultaneous clustering of genes and conditions in gene expression data, and clustering of very high dimensional data such as text data.
Mats Grindal, Jeff Offutt, and Jonas Mellin
ISE-TR-06-03
This paper presents data from a study of the current state of practice of software testing. Test managers from twelve different software organizations were interviewed. The interviews focused on the amount of resources spent on testing, how the testing is conducted, and the knowledge of the personnel in the test organizations. The data indicate that the overall test maturity is low. Test managers are aware of this but have trouble improving. One problem is that the organizations are commercially successful, suggesting that products must already be “good enough.” Also, the current lack of structured testing in practice makes it difficult to quantify the current level of maturity and thereby articulate the potential gain from increasing testing maturity to upper management and developers.
Jon Whittle
ISE-TR-06-02
Despite attempts to formalize the semantics of use cases, they remain an informal notation. The informality of use cases is both a blessing and a curse. Whilst it admits an easy learning curve and enables communication between software stakeholders, it is also a barrier to the application of automated methods for test case generation, validation or simulation. This paper presents a precise way of specifying use cases based on a three-level modeling paradigm strongly influenced by UML. The formal syntax and semantics of use case charts are given, along with an example that illustrates how they can be used in practice.
Daniel Barbará
ISE-TR-06-01
Finding points that are outliers with respect to a set of other points is an important task in data mining. Outlier detection can uncover important anomalies in fields like intrusion detection and fraud analysis. In data streaming, the presence of a large number of outliers indicates that the underlying process that is generating the data is undergoing significant changes and the models that attempt to characterize it need to be updated. Although there has been a significant amount of work in outlier detection, most of the algorithms in the literature resort to a particular definition of what an outlier is (e.g., density-based), and use thresholds to detect them. In this paper we present a novel technique to detect outliers that does not impose any particular definition for them. The test we propose aims to diagnose whether a given point is an outlier with respect to an existing clustering model (i.e., a set of points partitioned in groups). However, the test can also be successfully utilize to recognize outliers when the clustering information is not available. This test is based on Transductive Confidence Machines, which have been previously proposed as a mechanism to provide individual confidence measures on classification decisions. The test uses hypothesis testing to prove or disprove whether a point is fit to be in each of the clusters of the model. We demonstrate, experimentally, that the test is highly robust, and produces very few misdiagnosed points, even when no clustering information is available. We also show that the test can be succesfully applied to identify outliers present inside a data set for which no other information is available, thereby provinding the user with a clean data set to identify future outliers. Our experiments also show that even if the data set used to identify further outliers is contaminated with some outliers, the test can perform succesfully.