MLBio+Laboratory Machine Learning in Biomedical Informatics

Digital Humanities Data Mining with Weka (Resources Page)

Workshop Information
Instructor: Huzefa Rangwala

Time/Date: 9:30am-11:00am (June 15, 2012)
Room: Nguyen Engineering 1109
Link to the workshop:

Summary of Workshop
WEKA is a powerful platform that allows users to implement data mining algorithms quickly. We will start with a gentle introduction to data mining. We will define data mining tasks and discuss their application to the rich datasets available from the digital humanities. We will then proceed with a hands-on tutorial on how to use WEKA to build interesting predictive or exploratory models.
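As a rough illustration of what building and evaluating a predictive model involves, the sketch below trains a simple nearest-centroid classifier and estimates its accuracy with k-fold cross-validation. It is not taken from the workshop materials and does not use WEKA: the data is synthetic, the classifier is chosen only for brevity, and all function names (`make_synthetic_data`, `train_centroids`, `predict`, `cross_validate`) are hypothetical.

```python
import random


def make_synthetic_data(n_per_class=30, seed=0):
    """Generate two well-separated 2-D classes (synthetic, for illustration only)."""
    rng = random.Random(seed)
    data = []
    for label, center in (("a", (0.0, 0.0)), ("b", (5.0, 5.0))):
        for _ in range(n_per_class):
            point = (rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0))
            data.append((point, label))
    rng.shuffle(data)
    return data


def train_centroids(rows):
    """Compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    def dist(centroid):
        return sum((f - c) ** 2 for f, c in zip(features, centroid))
    return min(centroids, key=lambda label: dist(centroids[label]))


def cross_validate(data, k=10):
    """Return the mean accuracy over k folds."""
    folds = [data[i::k] for i in range(k)]  # every k-th row goes to the same fold
    accuracies = []
    for i, test_fold in enumerate(folds):
        # Train on all folds except the one held out for testing.
        train_rows = [row for j, fold in enumerate(folds) if j != i for row in fold]
        centroids = train_centroids(train_rows)
        correct = sum(predict(centroids, f) == y for f, y in test_fold)
        accuracies.append(correct / len(test_fold))
    return sum(accuracies) / k


if __name__ == "__main__":
    data = make_synthetic_data()
    print(f"10-fold accuracy: {cross_validate(data):.2f}")
```

Each row is held out exactly once, so the reported accuracy comes from rows the model did not see during training. The same train-and-evaluate pattern applies when a different learning algorithm or a real dataset is substituted.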
Software and Dataset Repositories
UCI Machine Learning Repository is a repository of databases, domain theories and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.
WEKA is an open source Java package that implements several data mining algorithms. It includes a GUI which allows for automation of several data mining tasks.
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
CLUTO is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters. CLUTO is well-suited for clustering data sets arising in many diverse application areas including information retrieval, customer purchasing transactions, web, GIS, science, and biology.
YALE (Yet Another Learning Environment) is another open source Java package. It includes a GUI which allows automation of the whole data mining process, from feature normalization to feature selection, learning, and cross-validation.
SVM light and LibSVM are two popular implementations of various SVM algorithms.
TMG is a Matlab Toolbox that can be used for various tasks in text mining.
Rapid-I is a free commercial package that integrates with WEKA and Hadoop.
Books, Classes and Free MOOCs
Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison Wesley, 2006. Book's companion website
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2006. Companion website
Andrew Ng (Stanford) Free Machine Learning Class
Rangwala (GMU, not online) Data Mining Class
Slides/Datasets Used at the Workshop
Slides (PDF)
Iris Data
Email Spam Data
