Large-Scale and Language-Oblivious Code Authorship Identification | George Mason Department of Computer Science

When: Wednesday, November 06, 2019 from 02:00 PM to 03:00 PM
Speakers: Dr. Aziz Mohaisen
Location: Engineering Building 4201

Abstract

In this talk we present our work on a Deep Learning-based Code Authorship Identification System (called DL-CAIS) for code authorship attribution that facilitates large-scale, language-oblivious, and obfuscation-resilient code authorship identification. The deep learning architecture adopted in this work includes TF-IDF-based deep representation using multiple Recurrent Neural Network (RNN) layers and fully-connected layers dedicated to authorship attribution learning. The deep representation feeds into a random forest classifier for scalability to de-anonymize the author. Comprehensive experiments are conducted to evaluate DL-CAIS over the entire Google Code Jam (GCJ) dataset across all years (from 2008 to 2016) and over real-world code samples from 1987 public repositories on GitHub. We achieve an accuracy of 96% when experimenting with 1,600 authors for GCJ, and 94.38% for a real-world dataset of 745 C programmers. Our system also allows us to identify 8,903 authors, the largest-scale dataset used by far, with an accuracy of 92.3%. Moreover, our technique is resilient to language-specifics, and thus it can identify authors of four programming languages (e.g., C, C++, Java, and Python), and authors writing in mixed languages (e.g., Java/C++, Python/C++). Finally, our system is resistant to sophisticated obfuscation (e.g., using C Tigress) with an accuracy of 93.42% for a set of 120 authors.
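As a rough illustration of the first stage named in the abstract, the sketch below computes TF-IDF vectors over code tokens without relying on any one language's grammar. It is not the DL-CAIS implementation: the regex tokenizer, the plain `tf * log(N / df)` weighting, the function names, and the three toy samples are all assumptions made for this example, and the RNN, fully-connected, and random forest stages are left out.

```python
import math
import re
from collections import Counter


def tokenize(source):
    # Assumed language-agnostic tokenizer: identifiers, integer runs,
    # or any single non-whitespace symbol. No language grammar is used.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def tfidf_vectors(samples):
    # One token-count table per code sample.
    docs = [Counter(tokenize(s)) for s in samples]
    vocab = sorted(set().union(*docs))
    n = len(docs)
    # Document frequency: in how many samples each token occurs.
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for d in docs:
        total = sum(d.values()) or 1  # guard against an empty sample
        # Assumed weighting: relative term frequency times log(N / df).
        vectors.append([(d[t] / total) * math.log(n / df[t]) for t in vocab])
    return vocab, vectors


# Hypothetical toy samples in different languages.
samples = [
    "for (int i = 0; i < n; i++) total += a[i];",
    "for i in range(n): total += a[i]",
    "while (k--) { puts(s); }",
]
vocab, vectors = tfidf_vectors(samples)
```

Under this weighting, a token that occurs in every sample (here the opening parenthesis) gets weight zero, while a token unique to one sample (here `while`) gets a positive weight only in that sample. In a pipeline like the one described, vectors of this kind would be the input to the representation-learning layers rather than the final features.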

Speaker Bio

Aziz Mohaisen is an Associate Professor of Computer Science at the University of Central Florida. Prior to joining Central Florida, he held various positions in academia and industry, in the US and South Korea. At UCF, he runs the Security and Analytics Lab (SEAL), where his research interests are broadly in the area of computer security and online privacy with an emphasis on DDoS attacks and defenses, malware analysis and detection, blockchain systems, (adversarial) deep learning, and Internet of Things security. Published in various prestigious venues, his research work has been featured in MIT Technology Review, the New Scientist, Scientific American, Financial Times, Science Daily, Minnesota Daily, Slashdot, The Verge, Deep Dot Web, and Slate, among many others. He is a co-Editor-in-Chief of Transactions on Security and Safety, an Associate Editor of IEEE Transactions on Mobile Computing, and an Area Editor of Computer Networks. He is a senior member of ACM and IEEE.

Categories: Events / GRAND Seminar