INFS 755 Summer 2015

Data Mining   CRN: 42430 INFS 755 - C01

Instructor:  Prof. Harry Wechsler

Email from GMU accounts with subject: INFS 755

Course Description: Concepts and techniques in data mining and multidisciplinary applications. Topics include databases; data cleaning and transformation; concept description; association and correlation rules; data classification and predictive modeling; performance analysis and scalability; data mining in advanced database systems, including text, audio, and images; and emerging themes and future challenges.

Goals: Critical Thinking (look for Pitfalls); Model Selection and Predictive Analytics Using Cross-Validation and Training; Meaningful (size and scope) Data Mining Application (to find useful patterns); Experimental Design, Metrics and Performance Evaluation; Theory vs. Practice.

Time, Day, and Venue: MWF, 3:45 pm - 6:45 pm

Nguyen Engineering Building 1107

Office Hours: MWF 2:45 - 3:30 pm or by appointment, ENGR 4448.

First day of classes: June 29, 2015

No class on Friday, July 3, 2015

Last day of classes: Wednesday, July 29, 2015

Final Exam: Friday, July 31, 2015

Required Textbook: P. N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison Wesley, 2006.

Complementary Textbook 1: J. Han and M. Kamber, Data Mining (3rd ed.) Morgan Kaufmann, 2011.

Complementary Textbook 2: I. H. Witten, E. Frank, and M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques (3rd ed.), Morgan Kaufmann, 2011.

Complementary Textbook 3: T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning (2nd ed.), Springer, 2009.

Complementary Textbook 4: A. Rajaraman, J. Leskovec, and J. D. Ullman, Mining of Massive Datasets (2nd ed.), Cambridge University Press, 2014.

Software and Data:

UCI Machine Learning Repository is a repository of databases and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.

UCI Knowledge Discovery in Databases Archive is an online repository of large data sets that encompasses a wide variety of data types, analysis tasks, and application areas.

Kaggle is the home of data science and data mining competitions.

Resources: Software and Data.


MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

SVM light and LibSVM are two popular implementations of various support vector machine (SVM) algorithms.

R is a programming language for statistical computing and graphics.

MATLAB and Toolboxes: The Language of Technical Computing.


Grading Composition (100 points)

         Homework: 20% (late homework not accepted)

         Midterm (Monday, July 13, 2015): 20%

         Team Term Project and Final Review (July 27 and July 29, 2015): 20%

         (Cumulative) Final (July 31, 2015): 40%

Grading Scale

Honor Code

You are expected to abide by the GMU honor code. Homework assignments and exams are individual efforts. Information on the university honor code can be found at

Additional departmental CS information: