
Joint CS/ISE Seminar

Tuesday, September 26

Extracting Topics from Web Archives: Stochastic k-Means Analysis

Dr. Hiromichi Fujisawa
Stanford University and Hitachi Research Labs, Tokyo, Japan

Abstract

A Web Archives project at Stanford University has been conducted as part of a series of Digital Library Initiatives. The project has been collecting and archiving web pages under certain conditions. This talk covers the methods and programs created to analyze an example collection of web news pages on the California Special Election 2005. The programmed functions are news text body extraction, duplicate elimination, meta page elimination, topic extraction using stochastic k-means clustering, and topic sentence extraction. Given a collection of 36,475 pages, we automatically identified 1,791 unique news pages, and a clustering experiment identified all propositions in that election along with some other topics. Stochastic k-means clustering was devised to improve convergence toward a more globally optimal solution.

Speaker Bio

Dr. Hiromichi Fujisawa is on a research sabbatical at Stanford University. He is Corporate Chief Scientist at Hitachi Central Research Lab in Tokyo, Japan. Dr. Fujisawa is an IEEE Fellow and a member of the IEEE Spectrum Advisory Board. He has published extensively in the areas of Document Analysis, Document Retrieval, Feature Extraction for Character Recognition, and Just-in-Time Knowledge Management. He holds 26 patents related to his research and development in these areas.