GMU CS-584 Syllabus

Theory and Applications of Data Mining


Course Summary

Rapid technological progress has produced an unprecedented volume of data. Data mining, a discipline at the intersection of computer science, statistics, and artificial intelligence, seeks to extract meaningful patterns from this data. Its applications range from credit rating and fraud detection to genomics, climatology, quantum mechanics, and complex systems analysis. The field rests on theoretical methods rooted in statistical and computational models, and it explores how these methods adapt across diverse domains.

This course offers an in-depth exploration of the theoretical underpinnings of data mining techniques. It also highlights their broad applications through case studies and hands-on coding sessions. Students will gain both the knowledge and the practical skills needed to carry out data mining work.

Class Time and Location

Time: Friday 10:30 am-1:10 pm

Date Range: Aug 30, 2024 - Dec 6, 2024

Location: Horizon Hall 2016

Instructor

Name: Keren Zhou

Email: kzhou6@gmu.edu

Office: ENGR 5315

Office Hours: Friday 2:00 pm - 3:00 pm

Teaching Assistants

Name: Anuj Pokhrel

Office Hours: Friday 2:00 pm - 4:00 pm

Office: ENGR 4456

Canvas Moderator Hours: Monday - Wednesday 12pm

Prerequisites

Grade of C or better in CS 310 and STAT 344

Learning Outcomes

  1. To provide a comprehensive understanding of data mining techniques, elucidating both their potential and limitations.
  1. To instill proficiency in integrating data mining insights with software packages, with an emphasis on data analysis in Python.
  1. To facilitate hands-on experience, enabling students to conceptualize, design, and execute data mining projects.

Textbooks

  1. Introduction to Data Mining
  1. Data Mining: Concepts and Techniques, 3rd Edition (Optional)
    It’s OK to use the fourth edition
  1. A Programmer's Guide to Data Mining (Optional)

Course Policy

Please follow GMU’s honor code policy:

To promote a stronger sense of mutual responsibility, respect, trust, and fairness among all members of the George Mason University community and with the desire for greater academic and personal achievement, we, the student members of the university community, have set forth this honor code: Student members of the George Mason University community pledge not to cheat, plagiarize, steal, or lie in matters related to academic work.

The CS department also has its own policy.

Please note that there have been revisions to the honor code policy:

Unless permission to do so is granted by the instructor, you (or your group, if a group assignment) may not:

- use assistive technology, artificial intelligence, or other tools to complete assignments which can generate, translate, or otherwise create/correct code or answers (many types of assistive technology may be permitted, but you must ask permission)

Any student use of Generative-AI tools should follow the fundamental principles of GMU’s Academic Standards policies. 

Disability Accommodations

If you have a documented learning disability or other condition that may affect your academic performance, please make sure the documentation is on file with the Office of Disability Services, and then talk with the professor about possible accommodations.

Course Structure

Assignments

Throughout this course, students will complete up to three practical assignments. Each assignment must be done individually, while the project will be chosen and conducted in groups.
Further details about the assignments will be available on the Canvas system.

Project

The semester project is central to this course and is closely tied to the topics covered in the lectures. Students are free to choose their own project topics. Collaboration is encouraged; projects typically involve teams of 2-4 students.

Each group makes a single submission for the project.

Project Inspiration

There are two categories of topics you can choose for the project:

Option 1: Nutrition Data Mining

All students who participate in this project will receive +5 points.

Option 2: Explore interesting ideas on open source datasets or your own dataset

Project Components

  1. Project Proposal (2-5 pages in PDF, 20% of project grade): Your proposal should cover:
    • The problem
    • Anticipated hurdles
    • Pertinent previous studies and their limitations
    • Data sources to be utilized
    • Evaluation methodology

    Make sure all group member names are listed in the submitted document.

  1. Final Report and Code (5-10 pages PDF + CODE, 60% of project grade):

    A. Report (30% of project grade): The format should mirror that of a scholarly article, encompassing at least:

    • Abstract
    • Introduction
    • Data
    • Methods
    • Experiments
    • Related Work
    • Conclusions
    • Division of Work

    We will evaluate your report based on the following criteria:

    • Clarity of ideas (10%)
    • Depth of data analysis and critical thinking (10%)
    • Report format (30%)
    • Novelty of methods (10%)
    • Soundness of experiments (40%)

    Advice:

    B. Code (30% of project grade): Place your code in a "CODE" directory with a README file that includes instructions to run the tests and validate the results.

    Package the PDF and the CODE/ directory into a single zip file. Always list all group member names in the PDF.

  1. Presentation (10 mins, 20% of the project grade): Presentations will be graded on the following aspects:
    • Content depth and accuracy (30%)
    • Clarity of presentation, such as tables and figures on each slide (30%)
    • Ability to engage the audience and answer questions (20%)
    • Time management (20%)

    Advice:
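
For the Final Report and Code component above, the following is a minimal sketch of one way to package a submission in Python. The function name `build_submission` and the file names `report.pdf` and `submission.zip` are hypothetical and used only for illustration; the syllabus does not prescribe them, and Canvas may have its own naming requirements.

```python
import zipfile
from pathlib import Path


def build_submission(report_pdf, code_dir, out_zip):
    """Bundle the report PDF and the code directory into one zip archive.

    All names here are illustrative assumptions, not course requirements.
    """
    report_pdf, code_dir, out_zip = Path(report_pdf), Path(code_dir), Path(out_zip)

    if not report_pdf.is_file():
        raise FileNotFoundError(f"report not found: {report_pdf}")

    # The syllabus asks for a README inside CODE/; accept any README* file name.
    has_readme = any(
        p.is_file() and p.name.upper().startswith("README")
        for p in code_dir.iterdir()
    )
    if not has_readme:
        raise FileNotFoundError(f"no README file found in {code_dir}")

    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        # The PDF goes at the top level of the archive.
        zf.write(report_pdf, arcname=report_pdf.name)
        # Every file under the code directory is stored under CODE/.
        for path in sorted(code_dir.rglob("*")):
            if path.is_file():
                arcname = (Path("CODE") / path.relative_to(code_dir)).as_posix()
                zf.write(path, arcname=arcname)
    return out_zip
```

A call such as `build_submission("report.pdf", "CODE", "submission.zip")` would produce an archive with the PDF at the top level and the code under `CODE/`. Write the archive outside the `CODE` directory so it does not include itself.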

Grading

Schedule (subject to change)

| Week | Date | Topic | Readings | Timeline |
|------|------|-------|----------|----------|
| 1 | August 30, 2024 | Introduction | Ch 1, 2.1 | |
| 2 | September 6, 2024 | Data Preprocessing & Measurement | Ch 2.2, 2.3, 2.4 | |
| 3 | September 13, 2024 | Classification-1 | Ch 3 | HW1 out |
| 4 | September 20, 2024 | Classification-2 | Ch 4 | |
| 5 | September 27, 2024 | Clustering-1 | Ch 7 | HW2 out |
| 6 | October 4, 2024 | Clustering-2 | Ch 8 | Proposal due |
| 7 | October 11, 2024 | Outlier Detection | Ch 9 | HW1 due |
| 8 | October 18, 2024 | Midterm Exam | | HW3 out |
| 9 | October 25, 2024 | Invited Talk & RA Hiring | | HW2 due |
| 10 | November 1, 2024 | Large Language Models | | |
| 11 | November 8, 2024 | Parallel Data Mining | | |
| 12 | November 15, 2024 | Project Presentation-1 | | HW3 due |
| 13 | November 22, 2024 | Project Presentation-2 | | |
| 14 | November 29, 2024 | Thanksgiving | | |
| 15 | December 6, 2024 | Q&A (maybe online) | | Final Report/Code Due |