Professor Harry Wechsler
Department of Computer Science
e-mail : wechsler@cs.gmu.edu
web : http://cs.gmu.edu/~wechsler/
(703) 993-1533 (office)
(703) 993-1530 (sec)
(703) 993-1710 (fax)
SUMMER 2005
CS 750 Theory and Applications of Data Mining
Class Information
B01 6/6 50790 MW
4:30 p.m. – 7:10 p.m. ENT 275
Prerequisites
CS 450 (“databases”), CS 580 (“AI”), or permission of instructor
Office Hours
Before and after the class or by appointment (SITE II - Rm. 461)
Textbook
1. Data Mining: Concepts and Techniques, Han and Kamber, Morgan Kaufmann, 2001
- web site for textbook slides: http://www.cs.sfu.ca/~han/bk
1a. web site for data mining software: http://www.togaware.com/datamining/survivor/Weka.html
References
1. V. Cherkassky and F. Mulier, Learning from Data: Concepts, Theory, and Methods, John Wiley, 1999.
2. D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999.
3. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 1999.
4. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2001.
Course Description
Concepts and techniques in data mining and their multidisciplinary
applications. Topics include data warehousing and databases, data cleaning and
transformation, pattern transformation and data compression, concept
description, association and correlation rules, data classification and
predictive modeling, clustering, performance analysis and scalability, data
mining in advanced database systems including text, audio and images, and
emerging themes and future challenges related to biometrics and the semantic
web. A team term project and a topical review are required.
Motivation
The explosive growth in the generation, collection, and storage of data has
created an urgent need for new techniques and automated tools that can
intelligently assist us in transforming vast amounts of data into useful
information and knowledge. Data mining is a multidisciplinary field, drawing
from areas including AI, database technology, data visualization, information
retrieval, high performance computing, machine learning, mathematical
programming, neural networks, pattern recognition, statistical learning theory,
and statistics. The course gives graduate students the opportunity to learn
about the management and use of large data repositories from a
multidisciplinary perspective.
Goals
The objective of this course is to introduce graduate students to
current research, technological advances and trends in data mining. Data mining, which supports knowledge
discovery in databases (KDD), helps with the automated extraction of patterns
representing knowledge implicitly stored in large databases, data warehouses,
and other massive information repositories.
The course focuses on issues related to the feasibility, usefulness,
efficiency, and scalability of automated techniques for the discovery of
patterns hidden in large databases.
Students will be exposed to the above topics via lectures and reading
assignments, including recent journal and conference papers. Students are
expected to complete a term project and to make an in-depth presentation on a
topic related to data mining. As data mining has matured, the field is now
advancing on three new fronts: (i) mining data in real time; (ii) predictive
analysis rather than merely explaining past trends; and (iii) analyzing messy
“unstructured” data.
Follow-Up Studies with Professor Wechsler
1. CS 667 – Biometrics – Spring 2006
2. CS 775 / IT 844 – Pattern Recognition – Spring 2007
3. Certificate in Biometrics
4. PhD dissertation
Grading
(Team) Term Project → 75%
Science and Technology REVIEW → 25%
No FINAL EXAM
Term Project
Students work in teams on the term project. The scope of the project must be
agreed upon with the instructor. The task must involve meaningful functionality
and significant amounts of data.
Project includes the following STEPS :
1. Problem definition, requirements analysis and conceptual design.
2. Data selection / sampling.
3. Cleaning and integration / Preprocessing.
4. Data transformation / Data Reduction.
5. Data Mining.
6. Modeling, test & evaluation, and performance assessment.
7. Visualization and knowledge discovery.
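As a rough illustration of how STEPS 2 - 6 might fit together, the following is a minimal sketch only, not a required project design. The toy records, the function names (`select_sample`, `clean`, `min_max_scale`, `holdout_split`, `fit_centroids`, `predict`, `accuracy`), and the choice of a nearest-centroid classifier are all assumptions made for this example; an actual project would choose its own data, tools, and mining method in agreement with the instructor.

```python
# Toy records (made up for illustration): ((feature_1, feature_2), label)
raw = [
    ((1.0, 2.0), "low"), ((1.3, 2.1), "low"), ((9.0, 8.5), "high"),
    ((8.7, 9.3), "high"), ((1.5, None), "low"), ((1.4, 1.9), "low"),
    ((1.1, 2.2), "low"), ((8.8, 8.9), "high"), ((8.0, 9.0), "high"),
    ((1.2, 1.7), "low"), ((1.5, 1.8), "low"), ((9.2, 9.1), "high"),
]

def select_sample(records, step):
    """STEP 2: systematic sampling, keeping every step-th record."""
    return records[::step]

def clean(records):
    """STEP 3: drop records with a missing label or feature value."""
    return [(f, y) for f, y in records
            if y is not None and all(v is not None for v in f)]

def min_max_scale(records):
    """STEP 4: rescale every feature column to the range [0, 1]."""
    columns = list(zip(*(f for f, _ in records)))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [(tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                   for v, lo, hi in zip(f, lows, highs)), y)
            for f, y in records]

def holdout_split(records, train_fraction=0.6):
    """Split into a training part and a held-out test part."""
    cut = round(len(records) * train_fraction)
    return records[:cut], records[cut:]

def fit_centroids(train):
    """STEP 5: learn one centroid (mean feature vector) per class."""
    grouped = {}
    for f, y in train:
        grouped.setdefault(y, []).append(f)
    return {y: tuple(sum(col) / len(rows) for col in zip(*rows))
            for y, rows in grouped.items()}

def predict(centroids, features):
    """Return the label whose centroid is closest (squared Euclidean)."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2
                                 for a, b in zip(features, centroids[y])))

def accuracy(centroids, test):
    """STEP 6: fraction of held-out records classified correctly."""
    return sum(predict(centroids, f) == y for f, y in test) / len(test)

sample = select_sample(raw, step=2)
cleaned = clean(sample)
scaled = min_max_scale(cleaned)
train, test = holdout_split(scaled)
centroids = fit_centroids(train)
test_accuracy = accuracy(centroids, test)
print("held-out accuracy:", test_accuracy)
```

Each function corresponds to one STEP, so a draft of a single STEP can be reviewed and revised without touching the others. Accuracy is measured only on records that were not used to fit the centroids.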
Reviews and class presentations are conducted stepwise throughout the course
(see the tentative schedule below). A first draft of each STEP is due the week
that STEP is listed in the schedule. Based upon feedback received in class, the
same STEP is completed and presented again the following week.
Final (In Class) Project Presentation (SLIDES) (about 30 minutes)
1. Survey / Literature Review of (a) application and (b) task / functionality, data mining (STEP 5) and model selection (“training strategy”).
2. Brief Description of STEPS 1 – 7.
3. Performance Evaluation and Assessment of your project.
Final Project Report (HARD COPY) (at most 15 pages)
Submit Technical Report (TR) that covers your Final Project Presentation.
Tentative Schedule
June 6 |
June 8 |
June 13 |
June 15 |
June 20 - 22 | Trees) June 20: STEPS 2 - 3
June 27 - 29 | Performance Assessment: Training (and Validation), Testing and Evaluation; June 27: STEP 4
July 4 | Independence Day
July 6 |
July 11 |
July 13 - 18 | Statistical Learning Theory (SLT), Generalization and Prediction Risk, Structural Risk Minimization (SRM), and Support Vector Machines (SVM); Biometrics; July 18: STEPS 6 - 7
July 20 | FINAL PROJECT PRESENTATIONS
July 25 | FINAL PROJECT PRESENTATIONS