A research team from Draper and Boston University has used machine learning to develop a new vulnerability detection system that can operate at large scale. Software professionals are expected to use it to identify vulnerabilities in their software more efficiently. The tool provides faster automated vulnerability detection for C/C++ source code and has already shown encouraging results. Techniques capable of revealing vulnerability patterns in code are being developed today thanks to the widespread use of open-source repositories.
The Common Vulnerabilities and Exposures (CVE) database reports thousands of vulnerabilities every year. If not addressed adequately, these vulnerabilities can have devastating effects, as seen in recent cases such as the WannaCry ransomware cryptoworm and the Heartbleed bug.
Dataset Included Millions of Function-level Examples of C and C++ Code
A large dataset featuring millions of open-source functions was compiled for the Draper and Boston University research. It was labeled using three static (pre-runtime) analysis tools, Flawfinder, Cppcheck, and Clang, which are used to discover potential exploits. The function-level examples of C and C++ code were drawn from public Git repositories on GitHub, the Debian Linux distribution, and the SATE IV Juliet Test Suite. The vulnerability detection tool developed from the research is based on deep feature representation learning that directly interprets lexed source code.
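To give a sense of what "lexed source code" means, the following is a minimal Python sketch of a lexer that turns a C function into a flat token sequence of the kind a learning model could consume. It is an illustration only, not the researchers' lexer: the keyword list, the set of library calls kept by name, and the placeholder names ID, NUM, STR, and CHR are all assumptions made for this example.

```python
import re

# Words kept verbatim. Both sets are illustrative assumptions,
# not the vocabulary used in the research.
C_KEYWORDS = {"char", "int", "void", "return", "if", "else", "for", "while", "sizeof"}
KEPT_CALLS = {"strcpy", "memcpy", "malloc", "free"}

# One named group per token kind; comments and whitespace are matched
# so that they can be skipped.
TOKEN_RE = re.compile(r"""
    (?P<COMMENT>/\*.*?\*/|//[^\n]*)
  | (?P<STR>"(?:\\.|[^"\\])*")
  | (?P<CHR>'(?:\\.|[^'\\])*')
  | (?P<NUM>\d+(?:\.\d+)?)
  | (?P<ID>[A-Za-z_]\w*)
  | (?P<OP>->|\+\+|--|<=|>=|==|!=|&&|\|\||[{}()\[\];,.<>=+\-*/%&|!~^?:])
  | (?P<WS>\s+)
""", re.VERBOSE | re.DOTALL)


def lex(source):
    """Return a list of tokens for a snippet of C source code.

    Identifiers outside the kept sets become the placeholder "ID", and
    literals become "NUM", "STR", or "CHR", so that arbitrary variable
    names and constants do not inflate the vocabulary.
    """
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            # Unrecognized character: keep it as its own token.
            tokens.append(source[pos])
            pos += 1
            continue
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind in ("WS", "COMMENT"):
            continue
        if kind == "ID":
            keep = text in C_KEYWORDS or text in KEPT_CALLS
            tokens.append(text if keep else "ID")
        elif kind in ("NUM", "STR", "CHR"):
            tokens.append(kind)
        else:
            tokens.append(text)
    return tokens


if __name__ == "__main__":
    sample = "void copy(char *src) { char buf[8]; strcpy(buf, src); /* no bounds check */ }"
    print(lex(sample))
```

Replacing names and literals with placeholders is one possible design choice: it keeps the token vocabulary small while preserving the structure of the code, such as an unchecked strcpy into a fixed-size buffer.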
The researchers drew on natural language processing (NLP) techniques to design the approach, since programming languages share some similarities with human languages. NLP was combined with a random forest (RF) classifier to obtain more accurate predictions.