
AI produces mixed results as a code generator

08 July 2024


It depends if it has had coffee

IEEE Spectrum (the IEEE's official publication) has reported on a study of how AI code generators compare with human programmers, and the results were mixed.

A study published in the June issue of IEEE Transactions on Software Engineering evaluated the code produced by OpenAI's ChatGPT in terms of functionality, complexity, and security.

The results show that ChatGPT has a very broad range of success when producing functional code — with a success rate ranging from as low as 0.66 per cent to as high as 89 per cent, depending on the task's difficulty, the programming language, and other factors.

While the AI generator can sometimes produce better code than humans, the analysis reveals some security concerns with AI-generated code.

The study tested GPT-3.5 on 728 coding problems from the LeetCode testing platform in five programming languages: C, C++, Java, JavaScript, and Python. The results? Overall, ChatGPT was pretty good at solving problems in the different coding languages, especially for problems that were on LeetCode before 2021.
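
For a rough picture of what that kind of benchmark involves, here is a minimal Python sketch of an evaluation harness. The Problem class and the prompt_model and run_tests helpers are illustrative stand-ins, not the study's actual tooling.

    # Hypothetical evaluation harness; prompt_model() and run_tests()
    # are illustrative stand-ins, not the tooling used in the study.
    from dataclasses import dataclass, field

    LANGUAGES = ["C", "C++", "Java", "JavaScript", "Python"]

    @dataclass
    class Problem:
        title: str
        difficulty: str            # "easy", "medium" or "hard"
        description: str
        tests: list = field(default_factory=list)   # (input, expected) pairs

    def prompt_model(problem: Problem, language: str) -> str:
        """Ask the model for a solution in the given language."""
        raise NotImplementedError

    def run_tests(code: str, language: str, tests: list) -> bool:
        """Execute the generated code against the problem's test cases."""
        raise NotImplementedError

    def evaluate(problems: list) -> dict:
        """Return the pass rate per difficulty level across all languages."""
        passed = {"easy": 0, "medium": 0, "hard": 0}
        total = {"easy": 0, "medium": 0, "hard": 0}
        for problem in problems:
            for language in LANGUAGES:
                total[problem.difficulty] += 1
                code = prompt_model(problem, language)
                if run_tests(code, language, problem.tests):
                    passed[problem.difficulty] += 1
        return {d: passed[d] / total[d] for d in total if total[d]}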

For instance, it produced functional code for easy, medium, and hard problems with success rates of about 89, 71, and 40 per cent, respectively.

The study’s author, Yutian Tang, a lecturer at the University of Glasgow, said that ChatGPT's ability to generate functionally correct code suffers on algorithm problems introduced after 2021. It sometimes fails to understand the meaning of questions, even for easy-level problems.

For example, ChatGPT's ability to produce functional code for "easy" coding problems dropped from 89 per cent to 52 per cent after 2021.

Its ability to generate functional code for "hard" problems dropped from 40 per cent to 0.66 per cent after this time as well.

The researchers also explored ChatGPT's ability to fix its coding errors after receiving feedback from LeetCode. They randomly selected 50 coding scenarios in which ChatGPT initially generated incorrect code because it didn't understand the content or the problem at hand.
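
In outline, that kind of feedback loop looks something like the sketch below; the Verdict type and the judge and fix callables are assumptions made for illustration, not the researchers' actual setup.

    # Hypothetical feedback-repair loop; the judge/fix callables are
    # illustrative stand-ins, not the researchers' actual setup.
    from typing import Callable, NamedTuple

    class Verdict(NamedTuple):
        accepted: bool
        error_message: str       # compile error, failing test case, etc.

    def repair_loop(problem: str, code: str,
                    judge: Callable[[str, str], Verdict],
                    fix: Callable[[str, str, str], str],
                    max_rounds: int = 3) -> str:
        """Resubmit until the judge accepts or the round budget runs out."""
        for _ in range(max_rounds):
            verdict = judge(problem, code)       # e.g. a LeetCode submission
            if verdict.accepted:
                return code
            # Feed the error message back to the model and ask for a fix.
            code = fix(problem, code, verdict.error_message)
        return code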

While ChatGPT was good at fixing compiling errors, it was generally not good at correcting its own mistakes. The researchers also found that ChatGPT-generated code had many vulnerabilities, such as missing null tests, but many of these were easily fixable.
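
As a made-up illustration of the missing null test class of bug (not code taken from the study), the problem and its usual fix look like this in Python:

    # Illustrative example of a missing null (None) check; not from the study.

    # Risky pattern: crashes with AttributeError when the lookup returns None.
    def get_email(user_id, directory):
        user = directory.get(user_id)      # may return None
        return user.email                  # no None test before the access

    # Fixed version: guard against None before using the result.
    def get_email_safe(user_id, directory):
        user = directory.get(user_id)
        if user is None:
            return None                    # or raise a descriptive error
        return user.email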

"Interestingly, ChatGPT can generate code with smaller runtime and memory overheads than at least 50 per cent of human solutions to the same LeetCode problems," the report said.
