Instructor: Prof. Harry Wechsler, wechsler@gmu.edu
Send email from a GMU account with the subject line: INFS 755
Course Description: Concepts and techniques in data mining and multidisciplinary applications. Topics include databases; data cleaning and transformation; concept description; association and correlation rules; data classification and predictive modeling; performance analysis and scalability; data mining in advanced database systems, including text, audio, and images; and emerging themes and future challenges.
Goals: Critical Thinking (look for Pitfalls); Model Selection and Predictive Analytics Using Cross-Validation and Training; Meaningful (size and scope) Data Mining Application (to find useful patterns); Experimental Design, Metrics and Performance Evaluation; Theory vs. Practice.
Time, Day, and Venue: MWF, 3:45 pm – 6:45 pm, Nguyen Engineering Building 1107
Office Hours: MWF 2:45 – 3:30 pm or by appointment, ENGR 4448.
http://summer.gmu.edu/dates-2015/
First day of classes: June 29, 2015
No class on Friday, July 3, 2015
Last day of classes: Wednesday, July 29, 2015
Final Exam: Friday, July 31, 2015
Required Textbook: P. N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison Wesley, 2006. http://www-users.cs.umn.edu/~kumar/dmbook/index.php
Complementary Textbook 1: J. Han and M. Kamber, Data Mining (3rd ed.), Morgan Kaufmann, 2011. http://web.engr.illinois.edu/~hanj/bk3/bk3_slidesindex.htm
Complementary Textbook 2: I. H. Witten, E. Frank, and M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques (3rd ed.), Morgan Kaufmann, 2011. http://www.cs.waikato.ac.nz/ml/weka/book.html
Complementary Textbook 3: T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning (2nd ed.), Springer, 2009. http://statweb.stanford.edu/~tibs/ElemStatLearn/
Complementary Textbook 4: A. Rajaraman, J. Leskovec, and J. D. Ullman, Mining of Massive Datasets (2nd ed.), Cambridge University Press, 2014. http://infolab.stanford.edu/~ullman/mmds/book.pdf
Software and Data:
UCI Machine Learning Repository is a repository of databases and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. http://archive.ics.uci.edu/ml/
UCI Knowledge Discovery in Databases Archive is an online repository of large data sets that encompasses a wide variety of data types, analysis tasks, and application areas. http://kdd.ics.uci.edu/
Kaggle is the home of data science and data mining competitions. http://www.kaggle.com/
Resources: Software and Data. http://www-users.cs.umn.edu/~kumar/dmbook/resources.htm
WEKA http://www.cs.waikato.ac.nz/ml/weka/
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. http://mallet.cs.umass.edu/
SVMlight and LIBSVM are two popular implementations of support vector machine (SVM) algorithms. http://svmlight.joachims.org/ and http://www.csie.ntu.edu.tw/~cjlin/libsvm/
R – Programming language for statistical computing and graphics. http://en.wikipedia.org/wiki/R_%28programming_language%29 and http://www.r-project.org/
MATLAB and Toolboxes – The Language of Technical Computing. http://www.mathworks.com/products/matlab/
CLOSED BOOK EXAMINATIONS
· Homework – 20% (late homework not accepted)
· Midterm – Monday, July 13, 2015 – 20%
· Team Term Project and FINAL Review – July 27, 2015 and July 29, 2015 – 20%
· (Cumulative) Final – July 31, 2015 – 40%
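As an illustration of how the percentages above combine, here is a minimal Python sketch. It assumes each component is scored on a 0-100 scale and that the course score is the simple weighted sum; the syllabus states only the weights, so the function name `course_score` and the example scores are hypothetical.

```python
# Weights taken from the grading breakdown above (percent of the course grade).
WEIGHTS = {"homework": 20, "midterm": 20, "project": 20, "final": 40}


def course_score(scores):
    """Return the weighted course score on a 0-100 scale.

    `scores` maps each component name in WEIGHTS to a percentage (0-100).
    Assumption: components are combined as a plain weighted sum.
    """
    assert sum(WEIGHTS.values()) == 100  # the weights must cover the whole grade
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example = {"homework": 90, "midterm": 80, "project": 85, "final": 75}
print(course_score(example))  # prints 81.0
```

Because the final carries 40% of the grade, a 10-point change on the final moves the course score by 4 points, twice the effect of the same change on any other component.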
http://www.fcps.edu/southcountyhs/sservices/gradescale.html
You are expected to abide by the GMU honor code. Homework assignments and exams are individual efforts. Information on the university honor code can be found at
http://oai.gmu.edu/the-mason-honor-code/
Additional departmental CS information: http://cs.gmu.edu/wiki/pmwiki.php/HonorCode/CSHonorCodePolicies