•   When: Friday, February 23, 2018 from 10:00 AM to 11:00 AM
•   Speakers: Nosayba El-Sayed
•   Location: Research Hall 163

Abstract

How can rigorous data analysis, based on various logs collected at large-scale datacenters, help us improve the resilience and performance of large-scale systems and the applications they run?

Designing datacenters that are reliable, energy-efficient, and capable of delivering high performance and high utilization is a nontrivial problem facing scientists, businesses, and governments alike. In this talk, I will first demonstrate how trace-driven analysis helped me uncover interesting (and often surprising) patterns in the behaviour of systems and applications in large-scale datacenters. For example, based on datasets collected at different organizations, I will discuss how factors like temperature, power quality, and user behaviour impact datacenter reliability. Then, I will discuss how the insights obtained from these studies can help us design better frameworks for allocating and managing resources in current and future datacenters. I will also demonstrate how simple machine learning techniques can predict job failures in datacenters with high precision and recall. Finally, I will present a recent, open-source tool that leverages machine learning clustering to unlock significant performance gains on modern datacenter hardware.

Bio

Nosayba El-Sayed is a Postdoctoral Associate at CSAIL, MIT. Her research focuses on designing and implementing data-driven techniques that exploit the wealth of data generated in modern platforms to improve the reliability and performance of datacenters. She completed her PhD at the University of Toronto, during which time she interned at Amazon's Datacenter Global Services division, where she designed a server-outage analysis and prediction framework. More recently, Nosayba has focused on using data-driven techniques to investigate how new features available in modern hardware can be leveraged to boost datacenter utilization. Nosayba's work has been published in venues such as SIGMETRICS, DSN, ICDCS, SC, and HPCA. Her work on datacenter reliability received a SIGMETRICS best paper award and was featured in the USENIX magazine ;login:, Data Center Knowledge, and Communications of the ACM.
