A Framework for Finding Patterns in Mixed and Streaming Data | George Mason Department of Computer Science

When: Wednesday, November 20, 2019 from 02:00 PM to 04:00 PM
Speakers: Rohan Khade
Location: ENGR 4201

The work presented in this thesis was motivated by a desire to reliably detect and explain the factors that cause failures at final test during the semiconductor packaging and test process. Analyzing semiconductor manufacturing data is non-trivial because the numbers of attributes and instances are large. Moreover, the volume of data continues to grow exponentially due to increased part and package complexity (e.g., integration of more functions within the same package), while the time window for data analysis and yield learning continues to shrink in order to support faster time to market. To reconcile these conflicting requirements, we have been developing scalable exploratory data mining solutions that can handle the increasing volumes of data coming out of Intel factories. Our intent is to learn complex patterns quickly and give valuable, actionable feedback to engineers. Previously, this discovery process required an engineer to know what to look for, and even then it needed a considerable amount of analysis. These challenges motivated us to focus on a general framework for pattern mining on discrete and continuous variables with high-order data dimensions.

Pattern mining is an umbrella term for data mining algorithms that aim to find relationships between attributes, such as association rules, frequent sets, contrast patterns, and emerging patterns. While pattern mining is a well-researched topic in data mining, with many applications in diverse disciplines, there remain open problems that existing work has not addressed. We summarize some of the problems encountered and addressed in this dissertation.

In many real-world applications such as manufacturing, data contain both continuous and categorical attributes. In our work, we propose novel methodologies to find patterns in datasets with such mixed attributes. More specifically, our algorithm dynamically discretizes the continuous attributes in an itemset in a supervised fashion. We propose a top-down recursive approach to find intervals for continuous attributes that result in statistically significant patterns. As opposed to a global discretization scheme, where each attribute is discretized exactly once, our approach allows local discretization; that is, any continuous attribute can be discretized in different ways depending on the consequent. This approach makes it possible to capture different inter-variable relationships. We evaluate our algorithm on several synthetic and real datasets, including the Intel manufacturing data that motivated this research. The experimental results and analysis indicate that our algorithm is capable of finding more meaningful rules for multivariate data than existing algorithms.
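To make the idea of top-down, supervised, local discretization concrete, here is a minimal sketch in Python. It is not the algorithm from the dissertation: the abstract does not name a significance test, so this sketch assumes a Pearson chi-square test on a 2x2 table with a fixed 95% critical value and no multiple-testing correction. The names `chi2_2x2`, `best_cut`, `find_cuts`, and the `min_size` parameter are all hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumption: chi-square critical value for 1 degree of freedom, alpha = 0.05.
CHI2_CRITICAL_95 = 3.841


def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # an empty row or column: no evidence of association
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def best_cut(pairs: Sequence[Tuple[float, int]], min_size: int) -> Tuple[float, int]:
    """Return (statistic, index) of the best boundary in value-sorted pairs.

    The index is -1 when no admissible boundary exists.
    """
    n = len(pairs)
    total_pos = sum(label for _, label in pairs)
    best_stat, best_index = 0.0, -1
    left_pos = 0
    for i in range(1, n):
        left_pos += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot cut between equal values
        if i < min_size or n - i < min_size:
            continue  # keep both sides reasonably populated
        left_neg = i - left_pos
        right_pos = total_pos - left_pos
        right_neg = (n - i) - right_pos
        stat = chi2_2x2(left_pos, left_neg, right_pos, right_neg)
        if stat > best_stat:
            best_stat, best_index = stat, i
    return best_stat, best_index


def find_cuts(values: Sequence[float], labels: Sequence[int],
              min_size: int = 5, critical: float = CHI2_CRITICAL_95) -> List[float]:
    """Recursively split one continuous attribute against one binary consequent."""
    pairs = sorted(zip(values, labels))
    cuts: List[float] = []

    def recurse(segment: Sequence[Tuple[float, int]]) -> None:
        stat, i = best_cut(segment, min_size)
        if i < 0 or stat < critical:
            return  # no statistically significant split in this interval
        cuts.append((segment[i - 1][0] + segment[i][0]) / 2)
        recurse(segment[:i])
        recurse(segment[i:])

    recurse(pairs)
    return sorted(cuts)
```

Calling `find_cuts` on the same attribute with two different binary consequents can return different cut points, which is the sense in which the discretization is local rather than global.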

Also, in many real-world scenarios, the data arrive in a streaming manner; the goal is to find and maintain the most current representation/model of the data. General challenges for streaming data include high arrival rates, detecting and handling concept drift, and updating the model in reasonable time. Because the data arrive quickly, we propose a weighted average method using a sliding window to update patterns. Our updating strategy detects concept drift, detects anomalous patterns, and provides a consistent view of the data.
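One way to picture a sliding-window, weighted-average update is sketched below in Python. The abstract does not specify the weighting scheme or the drift test, so this sketch assumes an exponentially weighted average of each pattern's support and a simple absolute-difference threshold for flagging drift. The class `PatternTracker` and its parameters `window_size`, `alpha`, and `drift_threshold` are hypothetical and are not taken from the dissertation.

```python
from collections import deque
from typing import Deque, Dict, FrozenSet, Iterable, List


class PatternTracker:
    """Tracks the smoothed support of a fixed set of itemset patterns."""

    def __init__(self, patterns: Iterable[Iterable[str]], window_size: int = 100,
                 alpha: float = 0.3, drift_threshold: float = 0.2) -> None:
        self.patterns: List[FrozenSet[str]] = [frozenset(p) for p in patterns]
        # The deque drops the oldest transactions once window_size is reached.
        self.window: Deque[FrozenSet[str]] = deque(maxlen=window_size)
        self.alpha = alpha                      # weight given to the newest window
        self.drift_threshold = drift_threshold  # assumed drift criterion
        self.support: Dict[FrozenSet[str], float] = {}

    def update(self, batch: Iterable[Iterable[str]]) -> List[FrozenSet[str]]:
        """Slide the window over a new batch; return patterns flagged as drifting."""
        for transaction in batch:
            self.window.append(frozenset(transaction))
        n = len(self.window)
        if n == 0:
            return []
        drifted: List[FrozenSet[str]] = []
        for pattern in self.patterns:
            # Support of the pattern within the current window.
            current = sum(1 for t in self.window if pattern <= t) / n
            previous = self.support.get(pattern)
            if previous is None:
                self.support[pattern] = current  # first observation
                continue
            if abs(current - previous) > self.drift_threshold:
                drifted.append(pattern)
            # Weighted average of the old estimate and the new window support.
            self.support[pattern] = (1 - self.alpha) * previous + self.alpha * current
        return drifted
```

A larger `alpha` makes the estimate follow the newest window more closely, while a smaller one smooths out short-lived fluctuations; which trade-off suits a given stream is an empirical question.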


A Framework for Finding Patterns in Mixed and Streaming Data Events / Oral Defense of Doctoral Dissertation
