
|
Joint CS/ISE Seminar
Thursday, October 19

Near Duplicate Document Detection
Dr. Abdur Chowdhury
http://www.ir.iit.edu/~abdur/

Abstract
Detection of near duplicate documents is an important problem in many data mining and information filtering applications. When faced with massive quantities of data, traditional duplicate detection techniques that rely on direct inter-document similarity computation (e.g., using the cosine measure) are often not feasible given time and memory constraints. On the other hand, fingerprint-based methods are very attractive computationally but are brittle with respect to small changes in document content. This talk focuses on the history of duplicate detection algorithms and a general technique for increasing fingerprint robustness via lexicon randomization.

Speaker Bio
Dr. Abdur Chowdhury has over 12 years of research and development experience in computer science. He has worked as both a researcher and a developer, which has given him a great perspective on balancing software design with innovation. During this time he has filed over twenty patent applications and has over 70 publications in magazines, scientific conferences, journals, and book chapters covering many computer science topics, including AI, networking, operating systems, system scaling, and information retrieval. |