
ChatGPT Accuracy Deteriorating, Has It Become Disposable?

Research from Stanford and UC Berkeley Presents Evidence of ChatGPT’s Accuracy and Answer Quality Declining

Since its release, ChatGPT has become the talk of the town. In less than a week after its launch last November, ChatGPT gained one million users, making it the fastest-growing consumer application in history. As of right now, the OpenAI-developed large language model (LLM) is estimated to have 100 million active users.

Developed by OpenAI, the company co-founded by Sam Altman, ChatGPT is known for its ability to produce human-like AI-generated content, from creative writing to solving complicated math or physics problems.

The model has demonstrated its ability to mimic human behavior and human-like conversation, reportedly becoming the second AI chatbot to pass the Turing Test, after Google’s LaMDA. According to a study by an NYU research team published in JMIR Medical Education, ChatGPT’s responses to healthcare-related queries were nearly indistinguishable from those written by humans, further attesting to its abilities.

ChatGPT’s extremely rapid growth over only a few months has led many people to rely heavily on it; students, for instance, have replaced human tutors with the chatbot. According to statistics from Statista, over 500 companies across various industries had embraced the use of ChatGPT in their business functions in 2023.

Word has circulated, however, that ChatGPT’s accuracy, the trait the LLM is best known for, has recently plummeted quite significantly. The numbers come from a research team from Stanford and UC Berkeley that conducted tests on ChatGPT’s two models, GPT-4 and GPT-3.5. How did they do this?

Benchmarks

The team, consisting of three members, namely Lingjiao Chen, Matei Zaharia, and James Zou, evaluated the behavior of ChatGPT’s March 2023 and June 2023 versions to test whether performance quality can vary significantly over time.

The team assigned the two models four tasks: solving math problems, answering sensitive/dangerous questions, generating code, and visual reasoning. These tasks act as benchmarks representing the diverse and useful capabilities of the LLMs, and were chosen for their relatively objective nature, which makes them simpler to evaluate.

When applying the four benchmarks, the team used one main performance metric for each task, respectively accuracy, answer rate, the share of directly executable code, and exact match, to provide quantitative results. The team also tracked two additional metrics, verbosity and overlap, across all tasks.
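As a rough illustration of what these metrics could look like in practice, here is a minimal Python sketch; the simple string-based definitions below are assumptions for illustration, not the paper’s exact implementations:

def exact_match(response: str, expected: str) -> bool:
    # Exact match: the model's final answer equals the reference answer.
    return response.strip() == expected.strip()

def verbosity(response: str) -> int:
    # Verbosity: length of the response, measured here in characters.
    return len(response)

def answer_rate(responses: list[str]) -> float:
    # Answer rate: fraction of queries the model actually answers rather
    # than refusing (refusal detection is crudely simplified here).
    answered = [r for r in responses if "i cannot" not in r.lower()]
    return len(answered) / len(responses) if responses else 0.0

def accuracy(results: list[bool]) -> float:
    # Accuracy: fraction of test cases answered correctly.
    return sum(results) / len(results) if results else 0.0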

What can we tell from the results?

Results

When the team first released its results on July 18th, to everybody’s befuddlement, the supposedly superior paid version of ChatGPT (GPT-4) showed a significant decline in 3 out of 4 tasks, whilst the free version (GPT-3.5) presented improved results.

In the case of the math problem, for example, the team asked both models whether 17077 is a prime number and applied a Chain-of-Thought prompt, asking the models to explain the mathematical process step by step and to provide their final answer following that explanation.
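To make the setup concrete, here is a minimal sketch of how such a Chain-of-Thought query could be sent to both models through the OpenAI Python library as it existed in mid-2023; the prompt wording is illustrative, not the paper’s exact text:

import openai  # the 0.x-era client, current at the time of the study

openai.api_key = "YOUR_API_KEY"  # placeholder

# Chain-of-Thought prompt: ask the model to reason step by step
# before committing to a final answer.
prompt = (
    "Is 17077 a prime number? Think step by step, explain your "
    "reasoning, and then state your final answer."
)

for model in ("gpt-3.5-turbo", "gpt-4"):
    # Date-pinned snapshots (e.g., gpt-4-0314 vs. gpt-4-0613) would let
    # you compare the March and June versions directly.
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce randomness so runs are comparable
    )
    print(model, "->", response.choices[0].message["content"])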

Photo Courtesy of Stanford University and UC Berkeley

From the graph and the two models’ responses shown above, it can be seen that GPT-4’s accuracy dropped from 97.6% in March to 2.4% in June. In June, not only did GPT-4 provide the incorrect answer, it also skipped the step-by-step reasoning it was asked for. GPT-3.5, however, improved significantly, from 7.4% to 86.8% accuracy. In March, GPT-3.5 answered incorrectly at first but provided a thorough explanation that then led to the correct answer. In June, GPT-3.5 followed the instructions exactly, explaining the process quite verbosely and answering correctly at the end.
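For reference, the ground truth is easy to verify independently: 17077 is indeed prime, as a short trial-division check in Python confirms:

import math

def is_prime(n: int) -> bool:
    # Trial division: n is prime if no integer in [2, sqrt(n)] divides it.
    if n < 2:
        return False
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return False
    return True

print(is_prime(17077))  # True: no divisor exists up to sqrt(17077) ~ 130.7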

Our team conducted its own test in July 2023. We asked the two models to solve the exact same math problem to check whether they still responded in the same manner. We recorded their answers, and these were the results:

Math Problem Solved by GPT-3.5 in July 2023

Math Problem Solved by GPT-4 in July 2023

GPT-3.5, much like in its June response, provided a thorough step-by-step process that led to the correct solution. Though GPT-4 gave a more verbose response this time around, it introduced incorrect information along the way and once again arrived at the incorrect answer.

In the case of the sensitive/dangerous questions, GPT-4 answered fewer questions from March to June, while GPT-3.5 answered slightly more. However, while the two models previously explained why they could not answer such questions, in June they simply apologized and said they couldn’t assist. This may support xAI founder Elon Musk’s theory that AI is now being trained to be more politically correct.

While probing the sensitive/dangerous topics, the team also applied a jailbreaking prompt known as AIM (Always Intelligent and Machiavellian), and the results suggest that GPT-4 is less prone to jailbreaking attacks than GPT-3.5.

The graph below shows the rest of the results: both GPT-4 and GPT-3.5 declined in code generation, with their code no longer directly executable when submitted to LeetCode, an online programming platform, for evaluation. Visual reasoning is the only task where both models showed similar improvement.
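One plausible way generated code can fail such a direct-execution check is extra formatting around otherwise correct code, such as Markdown fences in the raw model output. The sketch below is purely illustrative and is not the team’s evaluation harness:

def compiles_directly(submission: str) -> bool:
    # True if the raw model output parses as valid Python as-is.
    try:
        compile(submission, "<submission>", "exec")
        return True
    except SyntaxError:
        return False

# Illustrative raw output: valid code wrapped in Markdown fences.
raw_response = """```python
class Solution:
    def two_sum(self, nums, target):
        seen = {}
        for i, x in enumerate(nums):
            if target - x in seen:
                return [seen[target - x], i]
            seen[x] = i
```"""

print(compiles_directly(raw_response))  # False: fence lines are not Python

# Stripping the fence lines first makes the same code parse cleanly.
stripped = "\n".join(
    line for line in raw_response.splitlines() if not line.startswith("```")
)
print(compiles_directly(stripped))  # True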

Photo Courtesy of Stanford University and UC Berkeley

Conclusion

The findings the team shared show that the behavior and response quality of GPT-4 and GPT-3.5 have varied quite significantly within a relatively brief period, with some changes positive and some negative.

GPT-3.5 showed significant improvement, while GPT-4, the model that more businesses and individuals have come to depend on, has become less reliable due to its deteriorating accuracy.

This means users and businesses who rely heavily on these LLM services in their workflows may need to run monitoring analyses before moving forward with their use. As for general users, re-checking information against more trustworthy sources is always worthwhile; better safe than sorry.
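As a starting point, such monitoring can be as simple as periodically replaying a fixed set of prompts with known answers and logging accuracy over time. Below is a minimal sketch, where query_model() is an assumed placeholder for whatever API client a team already uses:

import datetime

# A fixed regression suite: prompts with known correct answers.
PROBES = [
    ("Is 17077 a prime number? Answer yes or no.", "yes"),
    ("What is 12 * 13? Answer with the number only.", "156"),
]

def query_model(model: str, prompt: str) -> str:
    # Placeholder: plug in your own LLM API call here.
    raise NotImplementedError

def run_probe_suite(model: str) -> float:
    hits = 0
    for prompt, expected in PROBES:
        answer = query_model(model, prompt).strip().lower()
        if expected in answer:
            hits += 1
    score = hits / len(PROBES)
    # A dated record makes drift between runs visible.
    print(f"{datetime.date.today()} {model}: {score:.0%} of probes passed")
    return score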


Job Postings

  • Box - Team Manager, Business Analytics - Chicago, IL (Hybrid)

  • Applied Systems - Principal Data Scientist - United States (Remote)

  • Veeva - Interactive Art Director - New York City, NY (Remote/Hybrid)

  • Atticus - Senior Product Designer - Los Angeles, CA (Remote)

Promote your product/service to Digger Insights’ Community

Advertise with Digger Insights. Digger Insights’ Miners are professionals and business owners with diverse industry backgrounds who are looking for interesting and helpful tools, products, services, jobs, events, apps, and books. Email us at [email protected]

Your feedback would be greatly appreciated; send it to [email protected]
