INFS 795 / IT 803

Special Topics in Data Mining Applications

 

 

Instructor:


Dr. Jessica Lin
Office: Science & Technology II, Room 453
Phone: 703-993-4693
Email:  jessica[AT]ise[DOT]gmu[DOT]edu
Office Hours: TBA

 

Lectures:

 

Thursday 7:20-10:00pm, Innovation Hall 136

Prerequisite:

 

INFS-755 or equivalent knowledge. Some programming skills are required for the final project.

 

Textbook (optional):

Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufmann Publishers, March 2006. ISBN 1-55860-901-6.

Course Description:

 

Time series, traditionally defined as measurements taken over time, are perhaps the most commonly encountered data type, arising in almost every human endeavor, including medicine, finance, aerospace, industry, and science. Although time series data present special challenges to researchers because of their unique characteristics, the past decade has seen an explosion of interest in time series data mining. This seminar provides an overview of state-of-the-art research on mining temporal data. Topics covered include data representation, similarity search, clustering, classification, anomaly detection, and rule discovery. Sequential pattern discovery on discrete temporal data (web logs, customer transactions, etc.) and mining of streaming time series will also be discussed.

 

Course Format:


The course will include lectures by the instructor, presentations from students, and class discussion. You will be asked to read research papers published in major conferences and/or journals (paper list TBA).

 

Grading:

 

Grading will be based on participation, a presentation, quizzes, and a final project. Each week you are required to read two papers, one of which will be presented by a student. You will be quizzed on both papers the following week. The presenting student will write two simple quiz questions on the paper he or she presents.

 

Participation/Attendance: 5%

Quizzes: 20%

Presentation: 25%

Project Proposal: 15%

Project: 35%

 

Honor Code Statement:

 

Please be familiar with the GMU Honor Code. Any deviation from it is considered an Honor Code violation. All assignments (written and programming) for this class are individual unless otherwise specified.

 

Tentative Schedule (TBA):

 

 

Week | Date    | Topic                                   | Slides                         | Readings | Presenter
1    | Aug 31  | Introduction/Time Series Representation | Intro1                         | 1, 2     |
2    | Sept 7  | Similarity Search/Indexing I            | Intro2                         | 3, 4     |
3    | Sept 14 | Similarity Search/Indexing II           | Benchmark, ikmeans             | 5, 6     | David Debarr
4    | Sept 21 | Classification                          | Classification, Self_training  | 7, 8     | David Debarr
5    | Sept 28 | Rule Discovery                          | Rule_discovery, STS_clustering | 9, 10    | Hugo Kang
6    | Oct 5   | Clustering                              |                                | 11, 12   |
7    | Oct 12  | Anomaly Detection                       |                                | 13, 14   | Emmanuel Tchanque
8    | Oct 19  | Motif Discovery                         |                                | 15, 16   | Indar Bhatia
9    | Oct 26  | Burst/Periodicity Detection             |                                | 17, 18   | David Etter*
10   | Nov 2   | Trajectories                            |                                | 19, 20   | David Etter
11   | Nov 9   | Sequential Pattern Mining               |                                | 21, 22   | Vipul Bajpai
12   | Nov 16  | Streaming Time Series/Data Streams      |                                |          | Rafal Ladysz
13   | Nov 23  | Thanksgiving (No Class)                 |                                |          |
14   | Nov 30  | Project Presentation                    |                                |          |
15   | Dec 7   | Project Presentation                    |                                |          |

 

List of Papers (under construction):

1. Rakesh Agrawal, Christos Faloutsos and Arun Swami. Efficient Similarity Search in Sequence Databases. FODO Conference, Evanston, Illinois, Oct. 13-15, 1993.

2. Christos Faloutsos, M. Ranganathan and Yannis Manolopoulos. Fast Subsequence Matching in Time-Series Databases. Proc. ACM SIGMOD, Minneapolis, MN, May 25-27, 1994, pp. 419-429.

 

3. Chan, K. & Fu, A. W. (1999). Efficient time series matching by wavelets. In proceedings of the 15th IEEE Int'l Conference on Data Engineering. Sydney, Australia, Mar 23-26. pp 126-133.

 

4. Keogh, E. and Kasetty, S. (2002). On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration. In the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. July 23 - 26, 2002. Edmonton, Alberta, Canada. pp 102-111.

 

5. Geurts, P. 2001. Pattern Extraction for Time Series Classification. In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery (September 03 - 05, 2001). L. D. Raedt and A. Siebes, Eds. Lecture Notes In Computer Science, vol. 2168. Springer-Verlag, London, 115-127.

6. Li Wei and Eamonn Keogh (2006). Semi-Supervised Time Series Classification. SIGKDD 2006.

7. Gautam Das, King-Ip Lin, Heikki Mannila, Gopal Renganathan, Padhraic Smyth: Rule Discovery from Time Series. KDD 1998: 16-22.

 

8. Keogh, E., Lin, J. & Truppel, W. (2003). Clustering of Time Series Subsequences is Meaningless: Implications for Past and Future Research. In proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003). Melbourne, FL. Nov 19-22. pp 115-122.

 

9. Gavrilov, M., Anguelov, D., Indyk, P. & Motwani, R. (2000). Mining the stock market: which measure is best? In proceedings of the 6th ACM Int'l Conference on Knowledge Discovery and Data Mining. Boston, MA, Aug 20-23. pp 487-496.

 

10. Bagnall, A.J. and Janacek, G.J. (2004). Clustering time series from ARMA models with clipped data. In proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle, WA, USA. pp 49-58.

 

11. D. Dasgupta and S. Forrest, "Novelty Detection in Time Series Data Using Ideas from Immunology", Proceedings of the 5th International Conference on Intelligent Systems, Reno, June, 1996.

 

12. Keogh, E., Lin, J. & Fu, A. (2005). HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence. In the 5th IEEE International Conference on Data Mining. New Orleans, LA. Nov 27-30.

 

13. Chiu, B., Keogh, E., and Lonardi, S. 2003. Probabilistic discovery of time series motifs. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Washington, D.C., August 24 - 27, 2003). KDD '03. ACM Press, New York, NY, 493-498.

 

14. Lin, J., Keogh, E., Lonardi, S., Lankford, J. P., and Nystrom, D. M. 2004. Visually mining and monitoring massive time series. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Seattle, WA, USA, August 22 - 25, 2004). KDD '04. ACM Press, New York, NY, 460-469.

 

15. Vlachos, M., Meek, C., Vagena, Z., and Gunopulos, D. 2004. Identifying similarities, periodicities and bursts for online search queries. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (Paris, France, June 13 - 18, 2004). SIGMOD '04. ACM Press, New York, NY, 131-142.

 

16. Michail Vlachos, Kun-Lung Wu, Shyh-Kwei Chen, Philip S. Yu: Fast Burst Correlation of Financial Data. PKDD 2005: 368-379.

 

17. Vlachos, M., Hadjieleftheriou, M., Gunopulos, D., and Keogh, E. 2003. Indexing multi-dimensional time-series with support for multiple distance measures. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Washington, D.C., August 24 - 27, 2003). KDD '03. ACM Press, New York, NY, 216-225.

 

18. Cai, Y. and Ng, R. 2004. Indexing spatio-temporal trajectories with Chebyshev polynomials. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (Paris, France, June 13 - 18, 2004). SIGMOD '04. ACM Press, New York, NY, 599-610.

 

19. Agrawal, R. and Srikant, R. 1995. Mining Sequential Patterns. In Proceedings of the Eleventh International Conference on Data Engineering (March 06 - 10, 1995). P. S. Yu and A. L. Chen, Eds. ICDE. IEEE Computer Society, Washington, DC, 3-14.

 

20. Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery 1(3): 259-289, November 1997.

 

21. Gao, L., Yao, Z., and Wang, X. S. 2002. Evaluating continuous nearest neighbor queries for streaming time series via pre-fetching. In Proceedings of the Eleventh International Conference on Information and Knowledge Management (McLean, Virginia, USA, November 04 - 09, 2002). CIKM '02. ACM Press, New York, NY, 485-492.

 

22. Zhu, Y. and Shasha, D. 2003. Efficient elastic burst detection in data streams. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Washington, D.C., August 24 - 27, 2003). KDD '03. ACM Press, New York, NY, 336-345.

 

* (optional) Keogh, E. & Pazzani, M. (1999). Relevance feedback retrieval of time series data. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp 183-190.