Professor Harry Wechsler
Department of Computer Science
e-mail: wechsler@cs.gmu.edu
web: http://cs.gmu.edu/~wechsler/
(703) 993-1533 (office)
(703) 993-1530 (sec)
(703) 993-1710 (fax)
SUMMER 2007
CS 750 Theory and Applications of Data Mining
Class Information
A01
5/21 40977 MWF 3:45 p.m. – 6:50 p.m. IN 136
Prerequisites
CS 450 (“databases”), CS 580 (“AI”) or equivalent
Office Hours
M-W-F 3:15 – 3:45 PM (SITE II - Rm. 461)
Textbook
Introduction to Data Mining, Tan, Steinbach and Kumar, Pearson Addison Wesley, 2006
web site for textbook slides : http://www-users.cs.umn.edu/~kumar/dmbook/
Reference
Data Mining: Concepts and Techniques (2nd edition), Han and Kamber, Elsevier, 2006
web site for textbook slides : http://www-faculty.cs.uiuc.edu/~hanj/bk2/
WEKA web site for data mining software : http://www.togaware.com/datamining/survivor/Weka.html
UCI Machine Learning Repository Content Summary : http://www.ics.uci.edu/~mlearn/MLSummary.html
References
1. V. Cherkassky and F. Mulier, Learning from Data: Concepts, Theory, and Methods, John Wiley, 1999.
2. D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999.
3. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 1999.
4. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2001.
Course Description
Concepts and techniques in data mining and their multidisciplinary implementation
and applications. Topics include data warehousing and
databases, data cleaning and transformation, concept description, association
and correlation rules, data classification and predictive modeling, clustering,
performance analysis and scalability, mining stream and sequence data, social
network analysis, multimedia data mining, biometrics, and emerging themes and
trends. Term team project and topical review are required.
Motivation
The explosive growth in generating, collecting and storing
data has generated an urgent need for new techniques and automated tools that
can intelligently assist us in transforming the vast amounts of data into
useful information and knowledge. Data mining is a multidisciplinary field,
drawing from areas including AI, database technology, data visualization,
information retrieval, high performance computing, machine learning,
mathematical programming, neural networks, pattern recognition, statistical
learning theory, and statistics. The course gives graduate
students the opportunity to learn about the management and use of large data
repositories through a multidisciplinary approach.
Goals
The objective of this course is to introduce graduate
students to data mining basics, current research, technological advances and
trends in data mining. Data mining, which supports knowledge
discovery in databases (KDD), helps with the automated extraction of patterns
representing knowledge implicitly stored in large databases, data warehouses,
and other massive information repositories. The course focuses on issues
related to the feasibility, usefulness, efficiency, and scalability of
automated techniques for the discovery of patterns hidden in large
databases. Students will be exposed to the above topics via lectures and
reading assignments, including recent journal and conference papers. Students
are expected to complete a term project and to make an in-depth presentation on
a topic related to data mining. As data
mining has matured, the field is now advancing on three new fronts: (i) ability to mine data in real time; (ii) predictive
analysis rather than merely explaining past trends; and (iii) ability to
analyze messy “unstructured” data.
Follow-Up Studies with Professor Wechsler: 1. CS 778 – Biometrics – Spring 2008; 2. CS 668 / IT 844 – Pattern Recognition [or CS 775 Advanced Pattern Recognition] – Spring 2009; 3. Certificate in Biometrics; 4. PhD dissertation.
Grading
(Team) Term Project → 50%
Midterm – June 9 → 50%
Term Project
Students work in teams on the term project. The scope and range of the project must be agreed upon with the instructor. The task should involve meaningful functionality and significant amounts of data.
The project includes the following STEPS:
1. Problem definition, requirements analysis and
conceptual design.
2. Data selection / sampling // visualization //
3. Cleaning and integration / Preprocessing // visualization //
4. Data transformation / Data Reduction // visualization //
5. Data Mining // visualization //
6. Modeling, test & evaluation, and performance assessment // visualization //
7. Knowledge discovery // visualization //
Use domain knowledge and visualization for all the steps.
Iteratively refine the quality and scope of your project.
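As a rough illustration of how STEPS 2 through 6 fit together, here is a minimal Python sketch. Everything in it is an assumption made for illustration and is not part of the course materials: the toy records in RAW, the function names, and the choice of a nearest-centroid classifier as the data mining step. An actual project would use real data, a tool such as WEKA, and visualization at each step.

```python
import random

# STEP 2 (stand-in): a hypothetical hand-picked sample of records,
# (feature_1, feature_2, label). None marks a missing value.
RAW = [
    (1.0, 2.0, "a"), (1.2, 1.8, "a"), (0.9, None, "a"), (1.1, 2.2, "a"),
    (5.0, 8.0, "b"), (5.5, 7.5, "b"), (4.8, 8.2, "b"), (5.2, 7.9, "b"),
]

def clean(records):
    """STEP 3: drop records that have a missing feature value."""
    return [r for r in records if None not in r[:2]]

def scale(records):
    """STEP 4: min-max scale each feature to the range [0, 1]."""
    cols = list(zip(*[r[:2] for r in records]))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        (tuple((v - l) / (h - l) if h > l else 0.0
               for v, l, h in zip(r[:2], lo, hi)), r[2])
        for r in records
    ]

def fit_centroids(train):
    """STEP 5: nearest-centroid model, one mean vector per class label."""
    sums = {}
    for x, y in train:
        s, n = sums.get(y, ((0.0, 0.0), 0))
        sums[y] = ((s[0] + x[0], s[1] + x[1]), n + 1)
    return {y: (s[0] / n, s[1] / n) for y, (s, n) in sums.items()}

def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def split(records, test_fraction, seed):
    """STEP 6 (part 1): shuffle with a fixed seed and hold out a test set."""
    rows = records[:]
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def accuracy(centroids, test):
    """STEP 6 (part 2): fraction of held-out records classified correctly."""
    return sum(predict(centroids, x) == y for x, y in test) / len(test)

data = scale(clean(RAW))
train, test = split(data, test_fraction=0.25, seed=0)
model = fit_centroids(train)
print(f"held-out accuracy: {accuracy(model, test):.2f}")
```

For brevity the sketch scales the data before splitting it. In a real project the scaling parameters would normally be computed from the training portion only, so that the test set stays untouched until evaluation.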
Reviews and class presentations are conducted stepwise throughout the course (see tentative schedule). A first draft of each STEP is expected at the lecture for which that STEP is listed in the tentative schedule below. Based on the feedback received in class, the same STEP is then completed and presented again at the following lecture.
Final (In Class) Project Presentation (SLIDES) (about 45 minutes)
1. Survey / Literature Review of (a) application and (b) task / functionality, data mining (STEP 5) and model selection (“training strategy”).
2. Brief Description of STEPS 1 – 7.
3. Performance Evaluation and Assessment of your project.
Final Project Report (HARD COPY) (at most 15 pages)
Submit a Technical Report (TR) that covers your Final Project Presentation.
Tentative Schedule
May 21 | Appendix C – Probability and Statistics |
May 23 | Appendix A – Linear Algebra |
May 25 | Appendix E – Optimization |
May 28 | Memorial Day – no class |
May 30 – June 1 – June 4 | Data reduction & transformation; STEPS 2 & 3 [5/30]; Appendix B – Dimensionality Reduction |
June 6 | REVIEW for Mid-Term; Appendix D – Regression |
June 8 | Mid-Term; closed books and notes; bring blue book and calculator |
June 11 – 13 – 15 | Chaps. 4/5, 6/7, 8/9; Advanced Topics – Classification – Association – Clustering; Biometrics; STEP 5 – June 11; STEPS 6 – 7 – June 15 |
June 18 | FINAL PROJECT PRESENTATION |
June 20 | FINAL PROJECT PRESENTATION |