Professor Harry Wechsler

Department of Computer Science

George Mason University

Fairfax, VA 22030

e-mail : wechsler@cs.gmu.edu

web : http://cs.gmu.edu/~wechsler/

(703) 993-1533 (office)

(703) 993-1530 (sec)

(703) 993-1710 (fax)

 

GEORGE MASON UNIVERSITY

       SUMMER 2007

       CS 750 Theory and Applications of Data Mining

      

      Class Information

A01   5/21   40977   MWF   3:45 p.m. – 6:50 p.m.   IN 136

Prerequisites

CS 450 (“databases”), CS 580 (“AI”), or equivalent

Office Hours

M-W-F 3:15 – 3:45 PM  (SITE II - Rm. 461)

 

Textbook

Introduction to Data Mining, Tan, Steinbach and Kumar,

Pearson Addison Wesley, 2006

Web site for textbook slides: http://www-users.cs.umn.edu/~kumar/dmbook/

 

            Reference

Data Mining: Concepts and Techniques (2nd edition), Han and Kamber, Elsevier, 2006

Web site for textbook slides: http://www-faculty.cs.uiuc.edu/~hanj/bk2/

 

 WEKA web site for data mining software

 

http://www.togaware.com/datamining/survivor/Weka.html

 

UCI Machine Learning Repository Content Summary

 

http://www.ics.uci.edu/~mlearn/MLSummary.html

 

References

1. V. Cherkassky and F. Mulier, Learning from Data: Concepts, Theory, and Methods, John Wiley, 1999.

2. D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999.

3. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 1999.

4. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2001.

 

 

          Course Description

Concepts and techniques in data mining and their multidisciplinary implementation and applications. Topics include data warehousing and databases, data cleaning and transformation, concept description, association and correlation rules, data classification and predictive modeling, clustering, performance analysis and scalability, mining stream and sequence data, social network analysis, multimedia data mining, biometrics, and emerging themes and trends. A team term project and a topical review are required.

Motivation

The explosive growth in generating, collecting, and storing data has created an urgent need for new techniques and automated tools that can intelligently assist us in transforming vast amounts of data into useful information and knowledge. Data mining is a multidisciplinary field, drawing on areas including AI, database technology, data visualization, information retrieval, high performance computing, machine learning, mathematical programming, neural networks, pattern recognition, statistical learning theory, and statistics. The course gives graduate students the opportunity to learn about the management and use of large data repositories through a multidisciplinary approach.

 

Goals

The objective of this course is to introduce graduate students to the basics of data mining and to its current research, technological advances, and trends. Data mining, which supports knowledge discovery in databases (KDD), helps with the automated extraction of patterns representing knowledge implicitly stored in large databases, data warehouses, and other massive information repositories. The course focuses on issues related to the feasibility, usefulness, efficiency, and scalability of automated techniques for the discovery of patterns hidden in large databases. Students will be exposed to these topics via lectures and reading assignments, including recent journal and conference papers. Students are expected to complete a term project and to make an in-depth presentation on a topic related to data mining. As data mining has matured, the field is now advancing on three new fronts: (i) the ability to mine data in real time; (ii) predictive analysis rather than merely explaining past trends; and (iii) the ability to analyze messy “unstructured” data.

 

 

 

Follow-Up Studies with Professor Wechsler: 1. CS 778 – Biometrics – Spring 2008; 2. CS 668 / IT 844 – Pattern Recognition [or CS 775 Advanced Pattern Recognition] – Spring 2009; 3. Certificate in Biometrics; 4. PhD dissertation.

 

Grading

(Team) Term Project: 50%

Midterm (June 8): 50%

Term Project

Students work in teams on the term project.
The scope and range of the project have to be agreed upon with the instructor.
The task must involve meaningful functionality and significant amounts of data.
The project includes the following STEPS:


1. Problem definition, requirements analysis and conceptual design.
2. Data selection / sampling // visualization //
3. Cleaning and integration / Preprocessing // visualization //
4. Data transformation / Data Reduction // visualization //
5. Data Mining // visualization //
6. Modeling, test & evaluation, and performance assessment // visualization //
7. Knowledge discovery // visualization //

Use domain knowledge and visualization for all the steps.

Iteratively refine the quality and scope of your project
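
A minimal sketch of how STEPS 2 – 6 might be chained together on a toy data set, assuming plain Python and a 1-nearest-neighbor classifier as the data mining step. The records, function names, and parameter values are hypothetical and chosen only for illustration; an actual project would use real data, the agreed task, and a tool such as WEKA.

```python
import random

# Hypothetical toy records (feature_1, feature_2, label); None marks a missing value.
# The data and every function name below are illustrative assumptions.
RAW = [
    (1.0, 200.0, "a"), (1.2, 220.0, "a"), (0.9, 210.0, "a"), (1.1, None, "a"),
    (3.0, 800.0, "b"), (3.2, 790.0, "b"), (2.9, 810.0, "b"), (None, 805.0, "b"),
    (1.05, 205.0, "a"), (3.1, 795.0, "b"),
]

def clean(records):
    """STEP 3 (cleaning): drop records with any missing feature value."""
    return [r for r in records if None not in r[:-1]]

def min_max_scale(records):
    """STEP 4 (transformation): rescale every feature to the range [0, 1]."""
    n_features = len(records[0]) - 1
    lows = [min(r[i] for r in records) for i in range(n_features)]
    highs = [max(r[i] for r in records) for i in range(n_features)]
    scaled = []
    for r in records:
        feats = tuple(
            (r[i] - lows[i]) / (highs[i] - lows[i]) if highs[i] > lows[i] else 0.0
            for i in range(n_features)
        )
        scaled.append(feats + (r[-1],))
    return scaled

def split(records, test_fraction, seed):
    """STEP 2 (sampling): random holdout split into training and test sets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(train, features):
    """STEP 5 (data mining): label of the nearest training record."""
    nearest = min(
        train,
        key=lambda r: sum((a - b) ** 2 for a, b in zip(r[:-1], features)),
    )
    return nearest[-1]

def accuracy(train, test):
    """STEP 6 (evaluation): fraction of test records classified correctly."""
    hits = sum(predict_1nn(train, r[:-1]) == r[-1] for r in test)
    return hits / len(test)

prepared = min_max_scale(clean(RAW))
train, test = split(prepared, test_fraction=0.25, seed=0)
print(f"{len(prepared)} clean records, holdout accuracy = {accuracy(train, test):.2f}")
```

For brevity the sketch scales the features before splitting; in a careful evaluation the scaling parameters would be computed on the training set only and then applied to the test set, so that no information from the test records leaks into the model.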

Reviews and class presentations are conducted stepwise
throughout the course (see tentative schedule). A first draft of each STEP is due at
the lecture for which that STEP is listed in the tentative schedule below.
Based upon the feedback received in class, the same STEP is then completed and
presented again at the following lecture.

Final (In Class) Project Presentation (SLIDES) (about 45 minutes)

1. Survey / Literature Review of (a) the application and (b) the task / functionality, data mining (STEP 5), and model selection (“training strategy”).

2. Brief Description of STEPS 1 – 7.

3. Performance Evaluation and Assessment of your project.

Final Project Report (HARD COPY) (at most 15 pages)

Submit a Technical Report (TR) that covers your Final Project Presentation.

 

Tentative Schedule

May 21

Ch. 1: Introduction – Data Warehouses, Databases, Data Mining and Knowledge Discovery, and the Semantic Web (http://www.w3.org/2001/sw)

- Appendix C – Probability and Statistics -

May 23

Ch. 2: Data – STEP 1

- Appendix A – Linear Algebra  -

May 25

Ch. 3: Exploring Data

- Appendix E  – Optimization -

May 28

Memorial Day – no class

May 30 – June 1 – June 4

Data reduction & transformation – STEPS 2 & 3 [5/30]

Ch. 4: Classification – Basics (Part I)

Ch. 6: Associations – Basics (Part I) – STEP 4 [6/4]

Appendix B – Dimensionality Reduction

 

June 6

Ch. 8: Clustering – Basics (Part I)

REVIEW for Midterm

Appendix D – Regression

June 8

Midterm

Closed book, closed notes

Bring a blue book and a calculator

June 11 – 13 – 15

Chaps. 4/5, 6/7, 8/9 – Advanced Topics: Classification, Association, Clustering

Ch. 10 – Anomaly Detection

Biometrics

STEP 5 – June 11

STEPS 6 – 7 – June 15

June 18

FINAL  PROJECT   PRESENTATION

June 20

FINAL  PROJECT   PRESENTATION