As announced at the Chinasoft conference, the Ph.D. thesis "Deep Learning for Code with Language Definition Awareness: Techniques and Applications" by Qihao Zhu won the Distinguished Thesis Award from TCSE, CCF. Qihao Zhu was co-advised by Yingfei Xiong and Lu Zhang.
Description of the thesis:
Using deep learning models to process code is a growing trend in software engineering. However, deep learning is not designed to learn strict formal definitions such as the syntax and type system of a programming language, and this has become a key obstacle to progress in this direction.
Qihao's doctoral dissertation proposes deep learning techniques for code that are aware of language definitions. By designing novel neural network architectures and program representations, it systematically guides neural networks to learn language definitions such as syntax and types, and further proposes applications of these techniques to defect repair and code search.
The thesis produced a series of tools and models:
● Code models with the best performance at their respective scales, such as GrammarT5, the best-performing code generation model with fewer than a billion parameters, and Grape, the best-performing code generation model with tens of millions of parameters.
● Recoder, the first neural program repair tool to outperform traditional automated program repair (APR) methods.
● ET, the first-place winner in the Java functional defect track of the International Defect Repair Competition.
The methods proposed in the thesis have also been applied to the DeepSeek-Coder model developed by DeepSeek Corporation and to a defect repair tool developed by ZTE Corporation.
The thesis work led to 16 papers at top conferences, with a total citation count of approximately 1,565. One paper won the ACM SIGSOFT Distinguished Paper Award at ASE, and another was nominated for a Distinguished Paper Award at ESEC/FSE. One paper is among the two most-cited papers of ESEC/FSE'21.
The research presented in the thesis has had significant impact in both academia and industry. Deep learning models have become foundational system software in the era of intelligent software engineering, and their performance is critical. This doctoral thesis systematically shows how to guide statistical neural networks to learn formal language definitions, significantly improving the performance of deep learning models on code-related tasks.
Based on the technology presented in the thesis, Qihao led the development of the DeepSeek-Coder-V1 model during his internship at DeepSeek. This model is one of the best-performing open-source code LLMs in the world. The corresponding technical report has been cited over 300 times in less than a year, and the model is widely used by scholars worldwide in fields such as decompilation, code analysis, and code repair.
The repair tool Recoder, based on this thesis, was the first neural approach to surpass traditional methods in more than four years of neural repair research. It achieved the best performance in multiple subsequent third-party studies and helped pioneer the field's shift toward deep learning. ZTE Corporation independently developed a repair tool based on this thesis, which successfully repaired 21 of 32 real defects (65.6%) from the company's business department. ZTE Corporation anticipates that this tool can significantly improve developers' repair efficiency and reduce software maintenance costs.