GMU CS-584 Syllabus
Theory and Applications of Data Mining
Course Summary
Rapid technological advancement has led to the generation of an unprecedented volume of data. Data mining, a discipline at the intersection of computer science, statistics, and artificial intelligence, seeks to extract meaningful patterns from this data. Its applications range from credit rating and fraud detection to genomics, climatology, quantum mechanics, and complex systems analysis. The field rests on statistical and computational models, and its methods adapt to a wide range of domains.
This course is structured to offer an in-depth exploration into the theoretical underpinnings of data mining techniques. Concurrently, it underscores their broad applications by providing case studies and hands-on coding sessions. Students will be equipped with both the knowledge and skills to navigate the realm of data mining.
Class Time and Location
Time: Friday 10:30 am-1:10 pm
Date Range: Aug 25, 2023 - Dec 13, 2023
Location: East 201
Instructor
Name: Keren Zhou
Email: kzhou6@gmu.edu
Office: ENGR 5315
Office Hour: Friday 2:00pm - 3:00pm
Teaching Assistants
Name: Long Cao Thanh Doan
Office Hour: Tuesday 3:00pm - 4:00pm
Office: ENGR 4456
Piazza Moderator Hours: Monday - Wednesday 12pm
Name: Xue Yu
Office Hour: Thursday 11:00am - 12:00pm
Office: BUCH D215
Piazza Moderator Hours: Wednesday 12pm - Friday
Prerequisites
Grade of C or better in CS 310 and STAT 344
Course Objectives
- To provide a comprehensive understanding of data mining techniques, elucidating both their potential and limitations.
- To instill proficiency in integrating data mining insights with software packages, with an emphasis on data analysis in Python.
- To facilitate hands-on experience, enabling students to conceptualize, design, and execute data mining projects.
Textbooks
- Data Mining: Concepts and Techniques, 3rd Edition (Optional). The fourth edition is also acceptable.
- A Programmer's Guide to Data Mining (Optional)
Honor Code
Please follow GMU’s honor code policy:
To promote a stronger sense of mutual responsibility, respect, trust, and fairness among all members of the George Mason University community and with the desire for greater academic and personal achievement, we, the student members of the university community, have set forth this honor code: Student members of the George Mason University community pledge not to cheat, plagiarize, steal, or lie in matters related to academic work
The CS department also has its own policy.
Please note that there have been revisions to the honor code policy:
Unless permission to do so is granted by the instructor, you (or your group, if a group assignment) may not: … - use assistive technology, artificial intelligence, or other tools to complete assignments which can generate, translate, or otherwise create/correct code or answers (many types of assistive technology may be permitted, but you must ask permission)
Disability Accommodations
If you have a documented learning disability or any other condition that could affect your academic performance, please make sure the documentation is registered with the Office of Disability Services. Then talk to the professor about potential accommodations.
Course Structure
Assignments
Throughout this course, students will complete up to three practical assignments. Each assignment must be done individually, while the project will be chosen and conducted in groups. Further details about the assignments will be available on the Blackboard system.
- Each assignment is graded on:
  - Implementation (70%)
    - Correctness (63%)
    - Coding style (7%)
  - Report (10%)
  - Ranking results (20%)
    - Top 10% (20%)
    - Top 50% (18%)
    - Rest (16%)
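To make the arithmetic of the assignment breakdown above concrete, here is a minimal Python sketch. It is an illustration only, not the official grading script: the function names are hypothetical, and it assumes that each component scales linearly with the fraction of credit earned and that the ranking boundaries are inclusive (a submission at exactly the 10th percentile counts as "Top 10%").

```python
def ranking_points(percentile: float) -> float:
    """Points for the ranking component.

    `percentile` is the leaderboard position in [0, 100], where 0 is the best.
    Assumption: boundaries are inclusive (exactly 10 counts as "Top 10%").
    """
    if not 0.0 <= percentile <= 100.0:
        raise ValueError("percentile must be within [0, 100]")
    if percentile <= 10.0:
        return 20.0
    if percentile <= 50.0:
        return 18.0
    return 16.0


def assignment_score(correctness: float, style: float, report: float,
                     percentile: float) -> float:
    """Assignment total out of 100.

    `correctness`, `style`, and `report` are fractions of credit in [0, 1].
    Assumption: each component scales linearly with its fraction.
    """
    for name, value in (("correctness", correctness),
                        ("style", style),
                        ("report", report)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within [0, 1]")
    return (63.0 * correctness      # Implementation: correctness
            + 7.0 * style           # Implementation: coding style
            + 10.0 * report         # Report
            + ranking_points(percentile))  # Ranking results
```

Under these assumptions, a submission with full marks on implementation and report that ranks in the bottom half would receive 63 + 7 + 10 + 16 = 96.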
Project
Central to this course is the semester project, closely tied to subjects broached during the lectures. Students have the autonomy to select their own project themes. Collaboration is encouraged, with projects typically involving teams of 2-4 participants. Piazza is available as a resource for sharing thoughts or for seeking team partners.
For project submissions, each group needs to make a single collective submission.
Project Inspiration
There are two categories of topics you can choose for the project:
Option 1: Graph mining challenges
- Frequent subgraph mining on GPUs
- Frequent subgraph mining on CPUs
You can choose either of the two topics above.
To receive full credit, your proposed solution must be faster than the existing implementation, either by designing new algorithms or by optimizing the code.
Otherwise, partial credit (up to 90%) will be given based on the same standard as for those who choose option 2. Reimplementing the algorithms correctly in another language (e.g., Rust) can receive partial credit.
Option 2: Explore interesting ideas on open source datasets or your own dataset
- Examples
Project Components
- Project Proposal (2 pages in PDF, 10% of project grade): Your proposal should cover:
  - The problem
  - Anticipated challenges
  - Relevant previous studies and their limitations
  - Data sources to be used
  - Evaluation methodology

  List all group member names in the submitted document.
- Mid-term Review (3-5 pages in PDF, 20% of project grade): Your review should cover:
  - Data
    - Describe the data used and any relevant acquisition techniques.
  - Proposed Technique
    - Detail your approach, requisite definitions, and possibly a conceptual proof.
  - Preliminary Trials
    - Share initial findings.
  - Progress and Next Steps
    - Summarize your progress and highlight any changes from the initial proposal.

  List all group member names in the submitted document.
- Final Report and Code (5-10 pages PDF + CODE, 50% of project grade):
  A. Report Layout (25% of project grade): The format should mirror that of a scholarly article, with at least the following sections:
    - Abstract
    - Introduction
    - Data
    - Methods
    - Experiments
    - Related Work
    - Conclusions
    - Division of Work
  B. Code (25% of project grade): Place your code in a "CODE" directory with a README file that includes instructions to run the tests and validate the results.

  Submit a single zip file containing the PDF and the CODE/ directory. List all group member names in the PDF.
- Presentation (10 mins, 20% of the project grade): Presentations will be graded on the following aspects:
  - Content depth and accuracy
  - Clarity of presentation
  - Ability to engage the audience and answer questions
  - Time management
Grading
- Project (50%): no late submissions
- Assignments (30%): a 3-day grace period is allowed for assignments
- Midterm Exam (20%): no late submissions
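As a rough illustration of how these weights combine with the project component weights listed under Project Components, the following Python sketch computes an overall course score. It is not the official grade calculation: the names are hypothetical, all scores are assumed to be on a 0-100 scale, and the assignments are assumed to be weighted equally, which the syllabus does not state.

```python
# Weights taken from the syllabus; the dictionary keys are hypothetical labels.
PROJECT_WEIGHTS = {
    "proposal": 0.10,
    "midterm_review": 0.20,
    "report": 0.25,
    "code": 0.25,
    "presentation": 0.20,
}
COURSE_WEIGHTS = {"project": 0.50, "assignments": 0.30, "midterm_exam": 0.20}


def weighted_total(scores: dict, weights: dict) -> float:
    """Weighted sum of 0-100 scores; both dicts must have the same keys."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must have the same keys")
    return sum(weights[key] * scores[key] for key in weights)


def course_score(project_parts: dict, assignment_scores: list,
                 midterm_exam: float) -> float:
    """Overall course score on a 0-100 scale."""
    if not assignment_scores:
        raise ValueError("at least one assignment score is required")
    project = weighted_total(project_parts, PROJECT_WEIGHTS)
    # Assumption: every assignment carries the same weight.
    assignments = sum(assignment_scores) / len(assignment_scores)
    return weighted_total(
        {"project": project, "assignments": assignments,
         "midterm_exam": midterm_exam},
        COURSE_WEIGHTS,
    )
```

For example, a project scoring 80 on every component, assignment scores of 90, 100, and 80, and a midterm exam score of 70 would combine to 0.5 × 80 + 0.3 × 90 + 0.2 × 70 = 81 under these assumptions.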
Schedule (subject to change)
Week | Date | Topic | Readings | Timeline |
---|---|---|---|---|
1 | August 25, 2023 | Introduction | Ch 1, 2.1 | |
2 | September 1, 2023 | Data Preprocessing & Python | Ch 2.2, 2.3 | |
3 | September 8, 2023 | Data Measurement | Ch 2.4 | HW1 out |
4 | September 15, 2023 | Classification-1 | Ch 3 | Proposal due |
5 | September 22, 2023 | Classification-2 | Ch 4 | |
6 | September 29, 2023 | Clustering-1 | Ch 7 | HW2 out |
7 | October 6, 2023 | Clustering-2 | Ch 8 | HW1 due |
8 | October 13, 2023 | Midterm Exam | | HW3 out |
9 | October 20, 2023 | Outlier Detection | Ch 9 | Mid-term Review due |
10 | October 27, 2023 | Pattern Mining-1 | Ch 5 | HW2 due |
11 | November 3, 2023 | Pattern Mining-2 | Ch 6 | |
12 | November 10, 2023 | Networks | MEJN, Pajek | HW3 due |
13 | November 17, 2023 | Invited Talk | | |
14 | November 24, 2023 | Thanksgiving | | Final Report/Code due |
15 | December 1, 2023 | Project Presentation | | |
16 | December 8, 2023 | No class | | |
Related Materials (Optional)
- [Pajek] Nooy, Wouter de, Andrej Mrvar, and Vladimir Batagelj. Exploratory Social Network Analysis with Pajek. Structural Analysis in the Social Sciences. New York: Cambridge University Press, 2005.
- [MEJN] Newman, M. E. J. "The Structure and Function of Complex Networks." SIAM Review 45 (2003): 167-256.