Fall 2014: Data Mining [INFS755]

Professor:
Carlotta Domeniconi, Rm 4424 ENG, carlotta\AT\cs.gmu.edu

Teaching Assistant: Md. Reza, mreza\AT\gmu.edu
 Office Hours:
(Prof. Domeniconi): TR 4:30pm - 5:30pm, Rm 4424 ENG
(GTA Md. Reza): WR 5-6pm, Rm 4456

Prerequisites:
Some programming experience is expected. Students should be familiar with basic concepts in probability and statistics, and with linear algebra.

Location and Time:
We meet in Innovation Hall 206, T 7:20pm - 10:00pm

Textbook:
Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison-Wesley, 2006.
Book's companion website

Useful material

Overview on Linear Algebra

Andrew Moore's Tutorials: Collection of tutorials on topics of interest for this class

Schedule of Classes
General Description and Preliminary List of Topics:
Data mining is the process of automatically discovering useful information in large data repositories. The course covers key concepts and algorithms at the core of data mining.
Topics include: classification, clustering, association analysis, anomaly detection.
Course Format:
Lectures are given by the instructor. Besides material from the textbook, topics not discussed in the book may also be covered. Research papers and handouts of material not covered in the book will be made available.
Grading will be based on homework assignments, exams, a project, and participation. Homework assignments will require some programming. Exams and homework assignments must be done on an individual basis. Any deviation from this policy will be considered a violation of the GMU Honor Code.
Grading:
Assignments: 15%
Midterm: 25%
Final: 25%
Project: 30%
Participation: 5%
Course Project:
The project gives you an opportunity to explore in depth a particular topic/area of the course that interests you. The topic of the project, of course, should be related to the material covered in class, but otherwise you are free to select the specific topic. Possible types of projects include:
An application research project: The project demonstrates the use of techniques discussed in class in an application domain (e.g., text mining, bioinformatics, computer vision, image processing, artificial intelligence, etc.). The properties, drawbacks, and advantages of the techniques used are analyzed within the context of the chosen application domain.
A theoretical or methodological research project: A study of different classes of models and approaches; a proof, either theoretical or experimental, of properties of known algorithms; or the design of a new approach.
Software and Data:
UCI Machine Learning Repository is a repository of databases, domain theories and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.
UCI Knowledge Discovery in Databases Archive is an online repository of large data sets that encompass a wide variety of data types, analysis tasks, and application areas.
More datasets
Resources: software and data
Weka is an open-source Java package implementing many learning algorithms.
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
SVM light and LibSVM are two popular implementations of various SVM algorithms.
TMG is a Matlab toolbox that can be used for various tasks in text mining.