- When: Thursday, November 07, 2024 from 11:00 AM to 12:00 PM
- Speaker: Shi Feng
- Location: Nguyen Engineering Bldg., Room 4801
Abstract:
As AIs are deployed to solve increasingly complex problems, reliable human oversight becomes a major challenge: AIs are also getting better at producing outputs that look correct to humans but are in fact subtly flawed. To support effective oversight, approaches such as debate, constitutional AI, and reward modeling all involve using AIs to assist human evaluators. Although promising, these approaches can create new risks, as AIs are being used to evaluate themselves. In this talk, I will discuss three failure modes in AI-assisted evaluation and in training with flawed supervision. I will also discuss preliminary work on mitigating these risks.
Bio:
Shi (https://ihsgnef.github.io/) is an assistant professor at the George Washington University. He received his PhD from UMD, advised by Jordan Boyd-Graber, and did postdocs at UChicago with Chenhao Tan and at NYU in the Alignment Research Group with Sam Bowman and He He. Shi works on AI safety, in particular scalable oversight, as an extension of his work on human-AI collaboration, interpretability evaluation, and adversarial robustness. His most recent work focuses on a meta-evaluation of risks in scalable oversight methods and evaluations.
Zoom (https://gmu.zoom.us/j/95380267114?pwd=InbkKjByEnJpDAYYxfen3YkAPLc1VT.1)
Meeting ID: 953 8026 7114
Passcode: 184133