ISE-TR-06-10
Positive Train Control (PTC) is a wireless control system that ensures railroad safety by enforcing
train separation and speed restrictions, protecting roadway workers, and providing other safety
functions. Because North American railroads hold shared track rights over each other's tracks,
company A's trains must be safely operated by company B's crew on company C's tracks, requiring
different PTC systems to interoperate securely. As a security framework to ensure this, we propose
a trust management system with certificates and over-the-air rekeying (OTAR). Back-of-the-envelope
calculations show that our solution meets the timing requirements of PTC.
Index Terms: Security, Rail Transportation Control, SCADA Systems, Cryptography
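To make the trust-management idea concrete, here is a minimal Python sketch of a certificate check
gating an OTAR-style rekey. It is illustrative only: HMAC stands in for real public-key signatures,
and the names, message format, and key-derivation step are hypothetical rather than taken from the
proposed framework.

import hmac, hashlib, secrets, json

AUTHORITY_KEY = secrets.token_bytes(32)  # toy root of trust shared by the railroads

def issue_certificate(subject: str) -> dict:
    # A hypothetical interoperability authority "signs" the subject name.
    body = json.dumps({"subject": subject}, sort_keys=True).encode()
    tag = hmac.new(AUTHORITY_KEY, body, hashlib.sha256).hexdigest()
    return {"subject": subject, "tag": tag}

def verify_certificate(cert: dict) -> bool:
    body = json.dumps({"subject": cert["subject"]}, sort_keys=True).encode()
    expected = hmac.new(AUTHORITY_KEY, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, cert["tag"])

def otar_rekey(cert: dict, current_key: bytes):
    # Accept an over-the-air rekey request only from a certified peer; the
    # key derivation below is a placeholder, not a real OTAR scheme.
    if not verify_certificate(cert):
        return None
    return hashlib.sha256(current_key + b"rekey").digest()

cert = issue_certificate("RailroadA-Locomotive-42")
new_key = otar_rekey(cert, secrets.token_bytes(32))
print("rekey accepted:", new_key is not None)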
ISE-TR-06-09
A critical problem related to kernel-based methods is the selection of an optimal
kernel for the problem at hand. The kernel function in use must conform with the
learning target in order to obtain meaningful results. While solutions to estimate
optimal kernel functions and their parameters have been proposed in a supervised
setting, the problem presents open challenges when no labeled data are provided,
and all we have available is a set of pairwise must-link and cannot-link constraints.
In this paper we address the problem of optimizing the kernel function using pairwise constraints for semi-supervised clustering. To this end we derive a new optimization criterion to automatically estimate the optimal parameters of composite
Gaussian kernels, directly from the data and the given constraints. We combine
the optimal kernel function computed by our technique with a recently introduced
semi-supervised kernel-based algorithm to demonstrate experimentally the effectiveness of our approach. The results show that our method enables the practical utilization of powerful kernel-based semi-supervised clustering approaches by providing a mechanism to automatically set the critical parameters involved.
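As a rough illustration of the idea, the following Python sketch scores candidate parameters of a
composite Gaussian kernel against must-link and cannot-link constraints. The kernel form (a convex
combination of Gaussians of different widths), the scoring function, and the random search are our
own simplifications, not the optimization criterion derived in the report.

import numpy as np

def composite_gaussian(x, y, weights, sigmas):
    # Convex combination of Gaussian kernels with different widths.
    d2 = np.sum((x - y) ** 2)
    return float(np.dot(weights, np.exp(-d2 / (2.0 * sigmas ** 2))))

def constraint_score(X, must, cannot, weights, sigmas):
    # Higher is better: reward must-link similarity, penalize cannot-link similarity.
    s = sum(composite_gaussian(X[i], X[j], weights, sigmas) for i, j in must)
    s -= sum(composite_gaussian(X[i], X[j], weights, sigmas) for i, j in cannot)
    return s

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(4, 1, (20, 5))])
must = [(0, 1), (21, 22)]    # pairs known to belong together
cannot = [(0, 21), (1, 22)]  # pairs known to be separated
sigmas = np.array([0.5, 1.0, 2.0])

# Toy random search over mixing weights on the simplex.
best = max((rng.dirichlet(np.ones(3)) for _ in range(500)),
           key=lambda w: constraint_score(X, must, cannot, w, sigmas))
print("selected mixing weights:", np.round(best, 3))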
ISE-TR-06-08
Web Services offer an excellent opportunity to redesign and replace
old and insecure applications with more flexible and robust ones.
WSEmail is one such application that replaces conventional message
delivery systems with a family of Web Services that achieve the same
goal. In this paper we analyze the existing WSEmail specification
against the standard set of use cases (and misuse cases) supported
(resp. prevented) by SMTP implementations (the current default
message delivery infrastructure), and augment it with several missing
pieces. In addition, we show how the WSEmail family of Web
Services, specified in WSDL, can be orchestrated using BPEL.
Finally, we provide a synchronization analysis of our WSEmail
orchestration and show its correctness.
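As a toy illustration of what a synchronization analysis checks, the following Python sketch
verifies that every receive in a candidate orchestration trace is preceded by its matching send.
The service and message names are hypothetical, and the check is far simpler than the analysis
carried out in the report.

from collections import Counter

# A hypothetical orchestration trace: (action, service, message).
trace = [
    ("send",    "MailSender",   "SubmitMsg"),
    ("receive", "MailReceiver", "SubmitMsg"),
    ("send",    "MailReceiver", "DeliveryAck"),
    ("receive", "MailSender",   "DeliveryAck"),
]

def well_synchronized(trace):
    pending = Counter()
    for action, _service, msg in trace:
        if action == "send":
            pending[msg] += 1
        elif pending[msg] == 0:  # a receive with no prior matching send
            return False
        else:
            pending[msg] -= 1
    return True

print("trace well-synchronized:", well_synchronized(trace))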
ISE-TR-06-07
Ontologies are used as a means of expressing agreement on a
vocabulary shared by a community in a coherent and consistent
manner. As it happens on the internet, ontologies are created
by community members in a decentralized manner, requiring that
they be merged before being used by the community. We develop
an algebra to do so in the Resource Description Framework (RDF).
To provide formal semantics for the proposed algebraic operators,
we type a fragment of the RDF syntax.
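To give a flavor of such an algebra, here is a minimal Python sketch of one plausible merge
operator: graphs are sets of RDF triples, and merging takes their union after renaming blank nodes
apart so that identifiers from different community ontologies cannot collide. The representation
and the operator are illustrative assumptions, not the algebra developed in the report.

def rename_bnodes(graph, tag):
    # Prefix blank-node identifiers (the "_:..." terms) with a per-graph tag.
    ren = lambda t: f"_:{tag}_{t[2:]}" if t.startswith("_:") else t
    return {(ren(s), p, ren(o)) for s, p, o in graph}

def merge(g1, g2):
    return rename_bnodes(g1, "g1") | rename_bnodes(g2, "g2")

g1 = {("ex:Dog", "rdfs:subClassOf", "ex:Animal"), ("_:b0", "ex:label", "pet")}
g2 = {("ex:Cat", "rdfs:subClassOf", "ex:Animal"), ("_:b0", "ex:label", "feline")}
merged = merge(g1, g2)
print(len(merged), "triples; the two _:b0 nodes stay distinct")  # 4 triples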
ISE-TR-06-06
In text mining we often have to handle large document collections. Labeling such large corpora of documents
is expensive and impractical. Thus, there is a need to
develop (unsupervised) clustering techniques for text data,
where the distributions of words can vary significantly from
one category to another. The vector space model of documents easily leads to
30,000 or more dimensions. In such high dimensionality, the
effectiveness of any distance function that equally uses all
input features is severely compromised. Furthermore, it is
expected that different words may have different degrees of
relevance for a given category of documents, and a single
word may have a different importance across different categories. In this paper we first propose a global unsupervised feature selection approach for text, based on frequent itemset
mining. As a result, each document is represented as a set of
words that co-occur frequently in the given corpus of documents. We then introduce a locally adaptive clustering
algorithm, designed to estimate (local) word relevance and,
simultaneously, to group the documents. We present experimental results to demonstrate the feasibility of our approach. Furthermore, the analysis of the
weights credited to terms provides evidence that the identified keywords can guide the process of assigning labels to
clusters. We consider both spam email filtering and general classification datasets. Our analysis of the
distribution of weights in the two cases provides insights into
how the spam problem differs from the general classification case.
Keywords: Feature selection, feature relevance and weighting, subspace clustering, document categorization, spam emails.
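As a rough sketch of the feature-selection step, the following Python example keeps only words that
participate in frequently co-occurring pairs and re-represents each document by those words. The
tiny corpus, the support threshold, and the restriction to pairs are made up for illustration; the
report's method is based on general frequent itemset mining.

from itertools import combinations
from collections import Counter

docs = [
    {"spam", "free", "offer"},
    {"free", "offer", "click"},
    {"meeting", "agenda", "notes"},
    {"meeting", "notes", "free"},
]
min_support = 2  # a pair must co-occur in at least 2 documents

pair_counts = Counter(p for d in docs for p in combinations(sorted(d), 2))
frequent_pairs = {p for p, c in pair_counts.items() if c >= min_support}
keywords = {w for pair in frequent_pairs for w in pair}

# Each document is re-represented by its frequently co-occurring words only.
reduced = [d & keywords for d in docs]
print("selected features:", sorted(keywords))
print("reduced docs:", reduced)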
ISE-TR-06-05
In this paper we identify an undesirable side effect of combining different email-control mechanisms for protection
from unwanted messages, namely, leakage of recipients' private information to message senders. Some
email-control mechanisms, such as bonds and graphical Turing tests, inherently leak information, and these
leakage channels cannot be closed without discontinuing their use. We formalize the capabilities of an attacker and show
how she can launch guessing attacks on a recipient's mail acceptance policy that utilizes leaky mechanisms in its
defence against unwanted mail. As opposed to the classical Dolev-Yao attacker and its extensions, the attacker in our model guesses the contents of
a recipient's private information. The use of leaky mechanisms allows the sender to verify her guess. We assume a
policy language based on constraint logic programming for the specification and evaluation of mail acceptance criteria,
and present two program transformations that can prevent guessing attacks while allowing recipients to
utilize any email-control mechanism in their policies.
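The following toy Python model illustrates the attack pattern, not the paper's formalism: a
recipient's private policy whitelists mail mentioning a secret keyword, a bond-style mechanism
leaks one bit per message (whether the bond was refunded), and the attacker reads that bit as an
oracle to verify guesses. All names and the policy itself are hypothetical.

SECRET_KEYWORD = "orion"  # the recipient's private information (hypothetical)

def recipient_accepts(message: str) -> bool:
    # Private policy: accept (and refund the bond) if the secret appears.
    return SECRET_KEYWORD in message

def bond_oracle(message: str) -> bool:
    # Leak: the sender observes whether her attached bond was refunded.
    return recipient_accepts(message)

def guessing_attack(candidates):
    for guess in candidates:
        if bond_oracle(f"about project {guess}"):
            return guess  # the guess is verified through the leaky channel
    return None

print("recovered secret:", guessing_attack(["zeus", "atlas", "orion", "hera"]))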
ISE-TR-06-04
Clustering suffers from the curse of dimensionality, and similarity functions that use all input features with equal relevance may not be effective. We
introduce an algorithm that discovers clusters in subspaces spanned by different
combinations of dimensions via local weightings of features. This approach avoids the
risk of loss of information encountered in global dimensionality reduction techniques,
and does not assume any data distribution model. Our method associates with each
cluster a weight vector whose values capture the relevance of features within the
corresponding cluster. We experimentally demonstrate the gain in performance our
method achieves with respect to competing methods, using both synthetic and real
datasets. In particular, our results show that the proposed technique can
perform simultaneous clustering of genes and conditions in gene expression data,
and clustering of very high dimensional data such as text data.
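The sketch below shows one plausible instantiation of the idea in Python (not necessarily the
report's exact update rules): a k-means-style loop in which each cluster maintains its own
feature-weight vector, and low within-cluster dispersion along a dimension yields a high weight
through an exponential map.

import numpy as np

def weighted_lac(X, init_idx, h=1.0, iters=20):
    # Each cluster keeps a center and its own normalized feature weights.
    centers = X[list(init_idx)].astype(float)
    k, d = len(init_idx), X.shape[1]
    weights = np.full((k, d), 1.0 / d)
    for _ in range(iters):
        # Assign points by each cluster's weighted squared distance.
        dist = ((X[:, None, :] - centers[None]) ** 2 * weights[None]).sum(-1)
        labels = dist.argmin(1)
        for c in range(k):
            pts = X[labels == c]
            if len(pts) == 0:
                continue
            centers[c] = pts.mean(0)
            disp = ((pts - centers[c]) ** 2).mean(0)  # per-feature dispersion
            w = np.exp(-disp / h)
            weights[c] = w / w.sum()  # small dispersion -> large weight
    return labels, weights

rng = np.random.default_rng(1)
# Two clusters that differ only in the first two of ten dimensions.
A = np.hstack([rng.normal(0, 0.3, (30, 2)), rng.normal(0, 1, (30, 8))])
B = np.hstack([rng.normal(3, 0.3, (30, 2)), rng.normal(0, 1, (30, 8))])
labels, weights = weighted_lac(np.vstack([A, B]), init_idx=(0, 59))
print(np.round(weights, 2))  # the weights are largest on the first two features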
ISE-TR-06-03
This paper presents data from a study of the current state of practice of software testing.
Test managers from twelve different software organizations were interviewed. The interviews
focused on the amount of resources spent on testing, how the testing is conducted, and the
knowledge of the personnel in the test organizations. The data indicate that the overall test maturity is low. Test managers are aware of this
but have trouble improving. One problem is that the organizations are commercially successful, suggesting that products must already be "good enough." Also, the current lack of
structured testing in practice makes it difficult to quantify the current level of maturity and
thereby articulate the potential gain from increasing testing maturity to upper management
and developers.
ISE-TR-06-02
Despite attempts to formalize the semantics of use cases,
they remain an informal notation. The informality of use cases is both a
blessing and a curse. Whilst it admits an easy learning curve and enables
communication between software stakeholders, it is also a barrier to the
application of automated methods for test case generation, validation
or simulation. This paper presents a precise way of specifying use cases
based on a three-level modeling paradigm strongly influenced by UML.
The formal syntax and semantics of use case charts are given, along with
an example that illustrates how they can be used in practice.
ISE-TR-06-01
Finding points that are outliers with respect to a set of other
points is an important task in data mining. Outlier detection can uncover important anomalies in fields like intrusion
detection and fraud analysis. In data streaming, the presence of a large number of outliers indicates that the underlying process that is generating the data is undergoing significant changes and the models that attempt to characterize it
need to be updated. Although there has been a significant
amount of work in outlier detection, most of the algorithms
in the literature resort to a particular definition of what an
outlier is (e.g., density-based), and use thresholds to detect
them. In this paper we present a novel technique to detect outliers that does not impose any particular definition
for them. The test we propose aims to diagnose whether a
given point is an outlier with respect to an existing clustering
model (i.e., a set of points partitioned into groups). However,
the test can also be successfully utilized to recognize outliers
when no clustering information is available. This test
is based on Transductive Confidence Machines, which have
been previously proposed as a mechanism to provide individual confidence measures on classification decisions. The
test uses hypothesis testing to decide whether a point fits
in each of the clusters of the model. We
demonstrate, experimentally, that the test is highly robust,
and produces very few misdiagnosed points, even when no
clustering information is available. We also show that the
test can be successfully applied to identify outliers present
inside a data set for which no other information is available,
thereby providing the user with a clean data set for identifying future outliers. Our experiments also show that even if
the data set used to identify further outliers is contaminated
with some outliers, the test still performs successfully.
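A simplified Python sketch in the spirit of such a test (our approximation, not the authors' exact
procedure): a point's strangeness against a cluster is its distance to the cluster mean, a p-value
is obtained by ranking that strangeness against the cluster's own members, and the point is
declared an outlier when it is rejected by every cluster.

import numpy as np

def p_value(cluster_pts, x):
    center = cluster_pts.mean(0)
    alphas = np.linalg.norm(cluster_pts - center, axis=1)  # member strangeness
    a_x = np.linalg.norm(x - center)                       # candidate strangeness
    return (np.sum(alphas >= a_x) + 1) / (len(alphas) + 1)

def is_outlier(clusters, x, significance=0.05):
    # Outlier iff the point is rejected by every cluster of the model.
    return all(p_value(c, x) < significance for c in clusters)

rng = np.random.default_rng(2)
clusters = [rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))]
print(is_outlier(clusters, np.array([0.2, -0.4])))   # typical point -> False
print(is_outlier(clusters, np.array([20.0, 20.0])))  # far-away point -> True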