The report titled "The Capabilities and Limitations of Current Large Language Models (LLMs)" surveys leading Large Language Models, their current standing, capabilities, and limitations. It covers notable models such as OpenAI's GPT-4, Google's Gemini, Anthropic's Claude 3.5 Sonnet, and Tsinghua University's ChatGLM, providing performance comparisons and key application areas such as medicine, coding, and content generation. The report also discusses methods like prompt engineering, retrieval-augmented generation, and innovations in AI benchmarking that are driving advancements in the field. Finally, it identifies persistent challenges, including weaknesses in true reasoning and problem-solving, overfitting, and flaws in current benchmarking methodologies, and highlights ongoing efforts to address these limitations and enhance LLM capabilities.
A large language model (LLM) is a computational model able to perform general-purpose language generation and other natural language processing tasks, such as classification. LLMs acquire these capabilities by learning statistical relationships from vast amounts of text during a computationally intensive self-supervised and semi-supervised training process. They are typically built as artificial neural networks using the transformer architecture, introduced in 2017. As of June 2024, the largest and most capable LLMs employ a decoder-only transformer architecture, which enables efficient processing and generation of large-scale text data.
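To make "decoder-only" concrete, here is a minimal, illustrative sketch in PyTorch: token and position embeddings feed a stack of self-attention blocks constrained by a causal mask, so each position attends only to earlier tokens, and a final linear head predicts the next token. The hyperparameters are toy values, not those of any production model.

```python
import torch
import torch.nn as nn

class TinyDecoderLM(nn.Module):
    """A toy decoder-only language model: embeddings, self-attention
    blocks under a causal mask, and a next-token prediction head."""
    def __init__(self, vocab_size=32000, d_model=256, n_heads=4,
                 n_layers=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        # "Decoder-only" is an encoder stack plus a causal mask; there is
        # no cross-attention to a separate encoder.
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        n = ids.size(1)
        x = self.tok(ids) + self.pos(torch.arange(n, device=ids.device))
        causal = nn.Transformer.generate_square_subsequent_mask(n).to(ids.device)
        x = self.blocks(x, mask=causal)  # each token sees only its past
        return self.head(x)              # logits over the next token

logits = TinyDecoderLM()(torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 32000])
```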
Some notable large language models include OpenAI's GPT series (e.g., GPT-3.5 and GPT-4), Google's Gemini, Meta's LLaMA family of models, Anthropic's Claude models, and Mistral AI's models. The transformer architecture, introduced in Google's 2017 paper "Attention Is All You Need," underpins all of these models. Following the success of early transformer models such as BERT and the GPT series, development continued rapidly. GPT-4, launched in 2023, brought substantial improvements over GPT-3, including the ability to accept image inputs alongside text prompts. Open-source models such as BLOOM and LLaMA have also gained traction since 2022.
Large language models are used in a wide variety of applications, including text generation, classification, and natural language understanding. They power chatbots such as ChatGPT, which builds on the GPT-4 model, and assist in specialized fields like medicine, where they help diagnose diseases by analyzing medical histories and clinical imaging. LLMs are also used to create AI art, write working program code, provide real-time data retrieval and analysis, and enhance the virtual shopping experience with personalized recommendations. They are further employed in cybersecurity operations, demand forecasting, and predictive maintenance in industries such as manufacturing and logistics.
Anthropic's Claude 3.5 Sonnet has set new benchmarks in AI performance, notably surpassing GPT-4 Omni in areas such as graduate-level reasoning, undergraduate-level knowledge, and coding proficiency. Claude 3.5 Sonnet posted strong scores on benchmarks like GPQA and MMLU, showing its ability to handle nuanced tasks and generate high-quality content with a natural tone. The model operates at twice the speed of Claude 3 Opus, making it well suited to complex tasks like multi-step workflows and context-sensitive customer support. GPT-4 has likewise demonstrated elite-level performance, consistently scoring above 85 on benchmarks including MMLU, HumanEval, and MGSM. In a detailed benchmarking report, GPT-4o excelled on general language understanding and coding benchmarks, while Claude showed an edge on tasks requiring a longer context window and complex reasoning. Notably, GPT-4 Turbo outperformed Google's Gemini 1.0 Pro on metrics such as accuracy and cosine similarity, though with longer response times. Tsinghua University's ChatGLM, meanwhile, has been noted to match or exceed GPT-4 across a variety of benchmarks: on MMLU it scored 83.3%, close to GPT-4's 86.4%, and on GSM8K it scored 93.3% against GPT-4's 92.0%, indicating strong multilingual capabilities and high proficiency in math and reasoning tasks.
The competitive analysis of GPT-4, Claude, and Gemini reveals distinct areas of strength for each model. Claude 3.5 Sonnet has proven superior in coding tasks and visual reasoning, excelling on benchmarks like HumanEval and DROP; it also leads in transcribing text from imperfect images and in complex problem-solving, making it well suited to sectors like retail, logistics, and financial services. GPT-4, meanwhile, leads in general language understanding and multi-task evaluations, frequently topping standardized benchmarks. Researchers found that GPT-4o had a higher average win rate than Claude-3-Opus and the Gemini models in crowdsourced human evaluations, and it was particularly praised for coding tasks, although some users found Claude more reliable for highly detailed and complex coding instructions. Google's Gemini Ultra was reported as the first model to surpass human expert performance on the MMLU test, but the Gemini 1.0 Pro model showed limitations in practical application scenarios compared to GPT-4 Turbo. In financial data analysis tests, GPT-4 Turbo demonstrated superior comprehension and contextual alignment, albeit with longer response times, prioritizing depth of understanding over speed.
Claude 3.5 Sonnet and GPT-4 both show remarkable strengths in specific tasks such as coding and reasoning. Claude's performance on the HumanEval benchmark highlights its ability to independently write, edit, and execute code, making it a preferred choice for updating legacy applications and migrating codebases. Claude also excels in creative problem-solving, as demonstrated by its new Artifacts feature, which allows real-time generation and editing of content such as code snippets and text documents. GPT-4 shines in general language understanding and summarization, achieving state-of-the-art results on benchmarks like MGSM and DROP, which indicate strong performance in multilingual math and reasoning over long contexts. Claude's extended context window (200k tokens versus GPT-4's 128k) gives it an edge in tasks requiring analysis of extensive documents or large codebases. Lastly, ChatGLM matches or exceeds GPT-4 on several benchmarks, including a high GSM8K score and strong multilingual capabilities, making it a formidable player among large language models.
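For context on what HumanEval actually measures: each task supplies a function signature and docstring, the model generates a completion, and the completion counts as correct only if it passes the task's hidden unit tests. Below is a minimal, illustrative sketch of that check, with a hand-written completion standing in for a model sample; real harnesses sandbox the execution and average this pass/fail outcome over many samples to compute pass@k.

```python
# Minimal sketch of a HumanEval-style check: a generated completion
# passes only if it survives the task's unit tests. The "completion"
# here is hard-coded where a model's sample would normally go.
problem_prompt = "def add(a, b):\n"
completion = "    return a + b\n"  # stand-in for a model sample
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

namespace = {}
try:
    exec(problem_prompt + completion + tests, namespace)
    passed = True
except Exception:
    passed = False
print("pass" if passed else "fail")  # pass@1 averages this over samples
```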
Large Language Models (LLMs) face significant challenges in true reasoning and problem-solving. According to the document titled "LLMs Can’t Reason - The Reversal Curse, The Alice In Wonderland Test, And The ARC - AGI Challenge - CustomGPT", LLMs struggle with tasks that require common-sense reasoning. For example, when tested on a simple prompt asking how many sisters Alice's brother has, only Gemini Advanced among several state-of-the-art LLMs produced the correct answer. This points to a dramatic breakdown in the reasoning capabilities of most LLMs, which often give incorrect answers even when encouraged to think carefully and double-check their work. The problem extends to the inability of LLMs to perform true reasoning: they rely primarily on memorization and interpolation rather than synthesizing new ideas or solutions. Francois Chollet of Google has likewise argued that LLMs are essentially databases of patterns and information that generate outputs from memorization rather than true reasoning, as detailed in his Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI). He argues that LLMs lack the ability to adapt and learn from novel situations, further illustrating their limitations in reasoning and problem-solving.
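The Alice question has a trivially computable ground truth, which is what makes the failures striking. A minimal sketch, assuming the standard phrasing of the test: for Alice with N brothers and M sisters, each brother has M + 1 sisters (Alice's sisters plus Alice herself), yet models commonly answer M.

```python
# The "Alice in Wonderland" test: the answer is M + 1, because Alice's
# brother counts Alice herself as a sister in addition to Alice's M
# sisters. LLMs frequently answer M instead.
def aiw_prompt(n_brothers, m_sisters):
    return (f"Alice has {n_brothers} brothers and she also has "
            f"{m_sisters} sisters. How many sisters does Alice's "
            f"brother have?")

def aiw_answer(n_brothers, m_sisters):
    return m_sisters + 1  # the brother's sisters = Alice's sisters + Alice

print(aiw_prompt(3, 6))
print("correct answer:", aiw_answer(3, 6))  # 7
```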
One significant limitation of current LLMs is overfitting and data contamination. Overfitting occurs when a model fits its training data too closely, limiting its ability to generalize to new situations. The referenced document "LLMs Can’t Reason - The Reversal Curse, The Alice In Wonderland Test, And The ARC - AGI Challenge - CustomGPT" highlights that LLMs often perform well on specific benchmarks because benchmark material leaks into their training data. For instance, GPT-4's reported performance on the Bar Exam was later debunked as inflated because it was measured against repeat test-takers who had previously failed. Additionally, the document "The AI Plateau Is Real — How We Jump To The Next Breakthrough" discusses how the scarcity of public textual training data forces AI companies to scavenge other sources, such as YouTube video transcripts, yielding marginal improvements and further data contamination. Francois Chollet emphasizes that AI benchmarks measure skill rather than intelligence, and LLMs have not exceeded 35% accuracy on ARC-AGI, compared to roughly 85% human accuracy, largely due to overfitting.
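Contamination checks typically look for verbatim overlap between benchmark items and the training corpus. The sketch below is an illustrative version of that idea using 8-word shingles; the window size and normalization are arbitrary choices for demonstration, not a standard protocol.

```python
# Illustrative contamination screen: flag a benchmark item if any
# 8-word shingle from it also occurs in the training corpus.
def shingles(text, n=8):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item, training_corpus, n=8):
    return bool(shingles(benchmark_item, n) & shingles(training_corpus, n))

corpus = "... the quick brown fox jumps over the lazy dog near the river ..."
item = "the quick brown fox jumps over the lazy dog"
print(is_contaminated(item, corpus))  # True: the item leaked into training
```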
Current benchmarking methods for LLMs present several significant issues. The document "LLMs Can’t Reason - The Reversal Curse, The Alice In Wonderland Test, And The ARC - AGI Challenge - CustomGPT" criticizes the current state of AI benchmarks, suggesting that they measure memorization rather than genuine reasoning abilities. For example, GPT-4's high scores on benchmarks like MMLU are misleading, as these scores were achieved by measuring against less qualified test groups, as seen in Eric Martinez's reevaluation of GPT-4's Bar Exam results. Additionally, Francois Chollet's ARC-AGI, a formal benchmark for AGI, remains unbeaten by LLMs, with these models struggling to surpass 35% accuracy, far below human capabilities. Over-reliance on such flawed benchmarks exaggerates the models' reasoning capabilities. Moreover, the document "The AI Plateau Is Real — How We Jump To The Next Breakthrough" illustrates how the limitations of public data and the incremental nature of LLM improvements contribute to benchmark issues. These faults suggest a need for new benchmarking methodologies that more accurately reflect LLMs' abilities and limitations.
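To see why ARC-AGI resists memorization, it helps to look at a task's shape: a handful of demonstration input/output grids plus a held-out test grid, with the transformation rule never stated. The toy task below is invented for illustration and follows the JSON layout of the public ARC repository, with grid cells as color indices 0-9.

```python
# Shape of an ARC-AGI task (format of the public ARC repository):
# demonstration pairs plus held-out test pairs, each grid a small
# matrix of color indices 0-9. The solver must infer the rule from
# the demonstrations alone.
arc_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]},  # expected output: [[0, 3], [3, 0]]
    ],
}

# This toy task's rule is "mirror each row"; a correct solver
# generalizes it to the unseen test grid.
def solve(grid):
    return [row[::-1] for row in grid]

for pair in arc_task["train"]:
    assert solve(pair["input"]) == pair["output"]
print(solve(arc_task["test"][0]["input"]))  # [[0, 3], [3, 0]]
```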
Several Large Language Models (LLMs) have emerged, each showcasing unique features that push the boundaries of current AI capabilities. Mistral AI, a French startup founded by former employees of Google DeepMind and Meta, has introduced the Mistral 7B model, which excels in code generation tasks with advanced reasoning capabilities and multilingual support. Mistral models come in two types: open-weight models like Mistral 7B, which are ideal for customization and fine-tuning thanks to their efficiency and portability, and commercial models like Mistral Large, which are optimized for higher performance. Mistral Large, released in February 2024, offers better accuracy, an extended context window of up to 32K tokens, and native support for function calling, making it a formidable competitor to models like GPT-4. Additionally, Anthropic's Claude 3.5 Sonnet, released just months after Claude 3, has set new industry standards by outperforming previous models in intelligence, speed, and cost-efficiency. Claude 3.5 Sonnet excels in benchmarks such as HumanEval for coding, GPQA for graduate-level reasoning, and MMLU for undergraduate-level knowledge. It also offers enhanced visual reasoning and text transcription capabilities.
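Function calling means the model can return a structured request to invoke a developer-defined tool instead of plain text. The sketch below shows the shape of that round trip; the JSON-Schema tool definition follows the convention used by Mistral's and OpenAI's chat APIs, but the get_order_status tool and the model's reply here are hypothetical stand-ins rather than output from any real endpoint.

```python
import json

# Schematic of a function-calling round trip; tool and reply are
# hypothetical examples, not output from a real API.
tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the shipping status of an order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

# 1. Send the user message plus `tools` to the chat endpoint.
# 2. The model answers not with text but with a structured call:
model_reply = {"name": "get_order_status",
               "arguments": '{"order_id": "A-1001"}'}

# 3. The client executes the named function locally...
def get_order_status(order_id):
    return {"order_id": order_id, "status": "shipped"}

args = json.loads(model_reply["arguments"])
result = get_order_status(**args)

# 4. ...and returns `result` so the model can phrase the final answer.
print(result)  # {'order_id': 'A-1001', 'status': 'shipped'}
```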
Open-source models play a crucial role in democratizing access to advanced AI capabilities. Mistral AI has been a strong proponent of open-source LLMs, offering models like Mistral 7B under the Apache 2.0 license, which ensures that these models are easily accessible and can be customized or fine-tuned for specific needs. These models are particularly well suited to tasks that require fast performance, portability, and control. Such open-source models foster innovation and allow smaller organizations to leverage advanced AI without the high costs usually associated with proprietary models.
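Because the weights are permissively licensed, running the model locally takes only a few lines. A minimal sketch, assuming the Hugging Face transformers library (plus accelerate for device placement) and the mistralai/Mistral-7B-Instruct-v0.2 checkpoint; the download is roughly 14 GB and a GPU with sufficient memory is assumed.

```python
# Local-inference sketch for an Apache-2.0-licensed checkpoint.
# Assumes `transformers` and `accelerate` are installed and that a
# GPU with enough memory is available.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Mistral's instruct models expect the [INST] ... [/INST] format.
prompt = "[INST] Summarize the Apache 2.0 license in one sentence. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```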
Innovations in AI benchmarking and evaluation are essential to measure and enhance the capabilities of LLMs. The introduction of MMLU-Pro, reviewed by Fahd Mirza, represents a significant step forward. MMLU-Pro extends the original MMLU benchmark by incorporating more challenging, reasoning-focused questions and by expanding the choice set from four to ten options, which lowers the random-guess baseline from 25% to 10%. The enhanced benchmark aims to address the performance saturation observed in models like GPT-4, which scored 86.4% on the original MMLU in March 2023. By focusing on more complex, college-level tasks that require deliberate reasoning, MMLU-Pro raises the bar for assessing multitask language understanding in LLMs. Another notable development is Anthropic's Claude 3.5 Sonnet, which excels on benchmarks like HumanEval and GPQA, demonstrating significant advances in reasoning, coding, and knowledge proficiency. Such benchmarks are critical for evaluating the real-world applicability of LLMs and for driving continuous improvement in their performance.
Large Language Models (LLMs) such as GPT-4, along with domain-specific proprietary models like Med-PaLM, have shown significant potential in specialized medical domains. One key example is OpenMedLM, which uses prompt engineering techniques to achieve state-of-the-art (SOTA) performance without extensive fine-tuning. OpenMedLM has demonstrated SOTA results on benchmarks like MedQA and the MMLU medical subset, proving the capability of open-source models in critical medical applications. These models are assessed on benchmarks including MedQA (U.S. Medical Licensing Exam questions), MedMCQA (Indian postgraduate medical entrance exam questions), and PubMedQA (questions based on PubMed-indexed abstracts), establishing their practicality in the field.
Recent findings highlight the benefits of prompt engineering over fine-tuning for improving the performance of generalist foundation models in specialized fields like medicine. Studies such as Microsoft's development of Medprompt have shown that robust prompting techniques can enable generalist models to match or exceed specialized fine-tuned models on medical Q&A benchmarks. By combining techniques like zero-shot and few-shot prompting, chain-of-thought reasoning, and self-consistency voting, models like OpenMedLM have surpassed previous best performers while avoiding the heavy computational cost of fine-tuning and sidestepping issues like catastrophic forgetting.
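Of these techniques, self-consistency voting is the easiest to make concrete: sample several chain-of-thought completions at nonzero temperature, extract each final answer, and return the majority vote. A minimal sketch follows, with a stubbed sample_answer standing in for the real model call.

```python
import collections

# Minimal self-consistency sketch: sample k chain-of-thought
# completions, extract each final answer, majority-vote the result.
def sample_answer(question, seed):
    # A real implementation would prompt the model to reason step by
    # step at nonzero temperature and parse the final answer.
    return ["B", "B", "A", "B", "C"][seed % 5]  # stubbed samples

def self_consistency(question, k=5):
    votes = collections.Counter(sample_answer(question, i) for i in range(k))
    answer, count = votes.most_common(1)[0]
    return answer, count / k

print(self_consistency("Which drug interacts with warfarin? ..."))  # ('B', 0.6)
```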
Retrieval-Augmented Generation (RAG) methods have emerged to enhance the capabilities of LLMs in addressing domain-specific challenges by retrieving relevant documents to support the generation process. Self-BioRAG, a notable example, employs biomedical instruction sets and retrieval methods tailored to the biomedical domain. This approach has shown a 7.2% absolute improvement over the state-of-the-art open-foundation models in medical question-answering benchmarks. Utilizing components like domain-specific retrievers and instruction-tuned language models, Self-BioRAG generates proficient answers by retrieving and reflecting on relevant domain-specific information, thereby improving accuracy and reliability in biomedical contexts.
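The core RAG loop is simple even though production systems add learned retrievers and reflection steps: score a document collection against the question, take the top passages, and prepend them to the generation prompt. The sketch below is a bare-bones illustration; word-overlap scoring stands in for the dense biomedical retriever a system like Self-BioRAG uses, and generate is a stub for the LLM call.

```python
# Bare-bones RAG loop: score the corpus against the question, take
# the top passages, prepend them to the prompt, then generate.
corpus = [
    "Metformin is a first-line agent for type 2 diabetes.",
    "Warfarin requires INR monitoring due to bleeding risk.",
    "ACE inhibitors can cause a persistent dry cough.",
]

def score(question, doc):
    # Word-overlap stand-in for a dense domain-specific retriever.
    q, d = set(question.lower().split()), set(doc.lower().split())
    return len(q & d)

def retrieve(question, k=2):
    return sorted(corpus, key=lambda d: score(question, d), reverse=True)[:k]

def generate(prompt):
    return f"[LLM answer grounded in prompt of {len(prompt)} chars]"  # stub

question = "What is the first-line drug for type 2 diabetes?"
context = "\n".join(retrieve(question))
print(generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"))
```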
Current LLMs including GPT-4, Claude 3.5 Sonnet, Gemini, and ChatGLM have achieved significant advancements in natural language processing and artificial intelligence. Despite their impressive capabilities in areas like coding, multilingual tasks, and medical applications, these models still struggle with true reasoning and problem-solving due to their reliance on statistical patterns rather than genuine understanding. Performance issues such as overfitting and data contamination also present challenges. To overcome these, emerging models and new benchmarking methods like MMLU-Pro are being developed to enhance LLM performance and reliability. Innovations in specialized fields through prompt engineering and retrieval-augmented generation methods demonstrate potential, particularly in areas requiring access to domain-specific knowledge. Future developments must focus on balancing efficiency, contextual relevance, and accuracy. Continuous innovation in AI models and evaluation techniques is essential to address the existing limitations and unlock the full potential of LLMs in real-world applications.