Professor Harry Wechsler
Department of Computer Science
E-mail: wechsler@cs.gmu.edu
Web: http://cs.gmu.edu/~wechsler/
(703) 993-1533 (office)
(703) 993-1530 (sec)
(703) 993-1710 (fax)
Fall 2004
CS 750 Theory and Applications of Data Mining
Class Information
001 72028 T
Prerequisites
CS 450 (“databases”), CS 580 (“AI”) or permission of instructor
Office Hours
T
Textbook
1. Data Mining: Concepts and Techniques, Han and Kamber, Morgan Kaufmann, 2001
Web site for textbook slides: http://www.cs.sfu.ca/~han/bk
References
1. V. Cherkassky and F. Mulier, Learning from Data: Concepts, Theory, and Methods, John Wiley, 1999.
2. D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999.
3. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 1999.
4. U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2002.
5. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2001.
Course Description
Concepts and techniques in data mining and their multidisciplinary applications. Topics include data warehousing and databases, data cleaning and transformation, pattern transformation and data compression, concept description, association and correlation rules, data classification and predictive modeling, clustering, performance analysis and scalability, data mining in advanced database systems including text, audio and images, and emerging themes and future challenges related to biometrics and the semantic web. A team term project and a topical review are required.
Motivation
The explosive growth in generating, collecting and storing data has created an urgent need for new techniques and automated tools that can intelligently assist us in transforming vast amounts of data into useful information and knowledge. Data mining is a multidisciplinary field, drawing from areas including AI, database technology, data visualization, information retrieval, high performance computing, machine learning, mathematical programming, neural networks, pattern recognition, statistical learning theory, and statistics. The course gives graduate students the opportunity to learn about the management and use of large data repositories through a multidisciplinary approach.
Goals
The objective of this course is to introduce graduate students to current research, technological advances and trends in data mining. Data mining, which supports knowledge discovery in databases (KDD), helps with the automated extraction of patterns representing knowledge implicitly stored in large databases, data warehouses, and other massive information repositories.
The course focuses on issues related to the feasibility, usefulness, efficiency, and scalability of automated techniques for the discovery of patterns hidden in large databases.
Students will be exposed to the above topics via lectures and reading assignments, including recent journal and conference papers. Students are expected to complete a term project and to make an in-depth presentation on a topic related to data mining. As data mining has matured, the field is now advancing on three new fronts: (i) mining data in real time; (ii) predictive analysis rather than merely explaining past trends; and (iii) analyzing messy “unstructured” data.
Follow-Up Studies with Professor Wechsler
1. INFT 844: Pattern Recognition (Spring 2005)
2. CS 667: Biometrics
3. Certificate in Biometrics
4. PhD dissertation
Grading
(Team) Term Project → 75%
In-Depth Science and Technology Review → 25%
Term Project
Students are advised to work in teams on the term project.
The scope and range of the project must be agreed upon with the instructor.
The task should involve meaningful functionality and significant amounts of data.
The project includes the following STEPS:
1. Problem definition, requirements analysis and conceptual design.
2. Data selection / sampling.
3. Cleaning and integration / Preprocessing.
4. Data transformation / Data Reduction.
5. Data Mining.
6. Modeling, test & evaluation, and performance assessment.
7. Visualization and knowledge discovery.
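As a rough illustration of how STEPS 2 through 6 fit together, here is a minimal sketch in plain Python. It is not part of the course material: the synthetic two-class data, the function names, and the choice of a nearest-centroid classifier as the "data mining" step are all assumptions made for the example. An actual project would substitute its own data and methods as agreed with the instructor.

```python
import random

def select_sample(records, fraction, seed=0):
    """STEP 2: draw a simple random sample without replacement."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)

def clean(records):
    """STEP 3: drop records that contain a missing (None) feature value."""
    return [(x, y) for x, y in records if None not in x]

def min_max_scale(records):
    """STEP 4: rescale every feature to the range [0, 1]."""
    cols = list(zip(*(x for x, _ in records)))
    lows = [min(c) for c in cols]
    # A constant column has span 0; fall back to 1.0 to avoid dividing by zero.
    spans = [(max(c) - lo) or 1.0 for c, lo in zip(cols, lows)]
    return [
        (tuple((v - lo) / s for v, lo, s in zip(x, lows, spans)), y)
        for x, y in records
    ]

def fit_centroids(train):
    """STEP 5: compute one centroid (mean vector) per class label."""
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (squared Euclidean)."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
    )

def holdout_accuracy(records, test_fraction=0.3, seed=0):
    """STEP 6: train on one part of the data, evaluate on the held-out rest."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    centroids = fit_centroids(train)
    hits = sum(predict(centroids, x) == y for x, y in test)
    return hits / len(test)

def make_toy_data(n=200, seed=1):
    """Synthetic two-class data with a few missing values (illustration only)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 5.0
        x = [rng.gauss(center, 1.0), rng.gauss(center, 1.0)]
        if i % 25 == 0:
            x[0] = None  # simulate a missing measurement
        data.append((tuple(x), label))
    return data

if __name__ == "__main__":
    raw = make_toy_data()
    prepared = min_max_scale(clean(select_sample(raw, fraction=0.8)))
    acc = holdout_accuracy(prepared)
    print(f"records kept: {len(prepared)}, holdout accuracy: {acc:.2f}")
```

For brevity the sketch scales the features before splitting the data. A more careful evaluation would compute the scaling parameters on the training part only and then apply them to the test part.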
Reviews and class presentations are conducted stepwise throughout the course (see tentative schedule). A first draft of each step is expected the week the STEP appears in the tentative schedule below. Based upon feedback received in class, the same step is completed and presented again the following week.
Final (In Class) Project Presentation (SLIDES) (about 30 minutes)
1. Survey / Literature Review of (a) application and (b) task / functionality, data mining (STEP 5) and model selection (“training strategy”).
2. Brief Description of STEPS 1 – 7.
3. Performance Evaluation and Assessment of your project.
Final Project Report (HARD COPY) (at most 15 pages)
Submit a Technical Report (TR) that covers your Final Project Presentation.
Tentative Schedule
| August 31 | Chs. 1: Introduction – Data Warehouses, Databases, Data Mining and Knowledge Discovery, and the Semantic Web. |
| September 7 | STEP 1 |
| September 14 | |
| September 21 | STEP 4 |
| September 28 – October 5 | |
| October 12 | Columbus Day Recess |
| October 19 - 26 | Performance Assessment: Training (and Validation), Testing and Evaluation |
| November 2 | |
| November 9 | |
| November 16 | STEPS 6 - 7 |
| November 23 | Statistical Learning Theory (SLT), Generalization and Prediction Risk, Structural Risk Minimization (SRM), and Support Vector Machines (SVM). |
| November 30 | FINAL PROJECT PRESENTATION |
| December 7 | FINAL PROJECT PRESENTATION |