
Current Trends and Evaluations in Large Language Models for AI

GOOVER DAILY REPORT July 7, 2024

TABLE OF CONTENTS

  1. Summary
  2. Enabling Tools and Token Usage in AI Assistants
  3. Evaluating Large Language Models
  4. Current Advances and Challenges in AI Models
  5. AI Evaluation Benchmarks
  6. Domain-Specific Improvements in Medical Reasoning
  7. Conclusion
  8. Glossary

1. Summary

  • The report titled 'Current Trends and Evaluations in Large Language Models for AI' delves into the latest developments and evaluation metrics for large language models (LLMs) in the field of artificial intelligence (AI). It highlights the significance of enabling tools such as custom functions, code interpreters, and file search modalities within AI assistants, emphasizing the need for efficient token management. Detailed comparisons between recent models like GPT-4 and Claude 3.5 Sonnet are provided, showcasing their respective strengths and incremental improvements. Furthermore, the report discusses the methods used to evaluate LLMs, such as offline and online evaluations, and the metrics involved, including fluency and coherence. Domain-specific advancements in medical reasoning, represented by the Self-BioRAG framework, are also covered, illustrating notable performance gains in medical question-answering benchmarks.

2. Enabling Tools and Token Usage in AI Assistants

  • 2-1. Enabling custom functions, code interpreter, and file search

  • Enabling tools like custom functions, code interpreter, and file search in AI assistants contributes significantly to overall token usage. When these tools are enabled, a large set of instructions is automatically appended to the system prompt: the user's own system prompt plus additional hidden instructions that detail the capabilities and guidelines for each tool. These hidden instructions increase token consumption considerably, often by approximately 3.5k tokens. Tools should therefore be managed carefully and enabled only when absolutely required, to avoid excessive token usage.
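
  • This overhead can be budgeted for ahead of time. Below is a minimal sketch, assuming the tiktoken tokenizer and a hypothetical HIDDEN_TOOL_OVERHEAD constant based on the ~3.5k figure above; the hidden instructions themselves are not exposed by the API, so this is an estimate rather than an exact count.

```python
# Minimal sketch: estimate total system-prompt tokens when tools are enabled.
# HIDDEN_TOOL_OVERHEAD is a hypothetical constant based on the ~3.5k figure above.
import tiktoken

HIDDEN_TOOL_OVERHEAD = 3500  # rough estimate of the appended hidden instructions

def estimate_prompt_tokens(system_prompt: str, tools_enabled: bool,
                           model: str = "gpt-4") -> int:
    """Count the user's system-prompt tokens and add the estimated tool overhead."""
    enc = tiktoken.encoding_for_model(model)
    base = len(enc.encode(system_prompt))
    return base + (HIDDEN_TOOL_OVERHEAD if tools_enabled else 0)

print(estimate_prompt_tokens("You are a helpful assistant.", tools_enabled=True))
```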

  • 2-2. Token usage implications

  • Enabling various tools within AI assistants can lead to a steep increase in token usage, even if the tools are not actively used. The increase comes from the extensive instructions added to the system prompt. There is no effective mitigation other than enabling these tools selectively, so it is crucial to be strategic about which tools are turned on in order to maintain efficient token management.
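
  • As a concrete illustration of selective enabling, the sketch below creates an assistant with only the code interpreter turned on, assuming the OpenAI Python SDK's Assistants API; tool type names and parameters may differ across SDK versions.

```python
# Minimal sketch: enable only the tools that are actually required,
# since each enabled tool appends hidden instructions to the system prompt.
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="Data helper",
    instructions="You analyze CSV files the user uploads.",
    model="gpt-4o",
    tools=[{"type": "code_interpreter"}],  # omit file search / functions unless needed
)
```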

  • 2-3. Using 'myfiles_browser' tool

  • The 'myfiles_browser' tool allows AI assistants to search through files uploaded by the user within the same conversation. It supports the 'msearch' function, which issues queries to search the uploaded files and displays results. Queries should be constructed using Python syntax (e.g., msearch(['query'])). However, JSON syntax should be avoided as it causes 'Invalid function call in source code' errors. Additionally, users should leverage this tool only when the relevant parts of the documents don't contain sufficient information to fulfill the user's request.
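
  • For illustration, a well-formed query looks like the sketch below. Note that msearch is the assistant-internal search function described above, not an importable Python API, so the snippet is not meant to run outside that context.

```python
# Correct: Python list syntax, with a specific multi-word query.
msearch(["token usage for the file search tool"])

# Incorrect: JSON-style arguments trigger
# "Invalid function call in source code" errors.
# msearch({"queries": ["token usage for the file search tool"]})
```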

  • 2-4. Query format and citation guidelines

  • When using the 'myfiles_browser' tool, it's important to follow specific query formats and citation guidelines. Queries should be specific and avoid overly broad single-word searches to ensure relevance and accuracy of search results. Additionally, proper citations must be included with each response, formatted as `【{message idx}:{search idx}†{link text}】`. The 'message idx' is found at the beginning of each message from the tool, and the 'search idx' is extracted from the search results. All three parts of the citation are mandatory to accurately reference the source material.
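
  • A minimal sketch of assembling that citation string is shown below; the index values and file name are hypothetical placeholders.

```python
message_idx = 4                      # from the start of the tool's message
search_idx = 2                       # from the matching search result
link_text = "quarterly_report.pdf"   # hypothetical document name

citation = f"【{message_idx}:{search_idx}†{link_text}】"
print(citation)  # 【4:2†quarterly_report.pdf】
```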

3. Evaluating Large Language Models

  • 3-1. Importance of LLM Evaluation

  • The evaluation of large language models (LLMs) is crucial for ensuring trust and reliability. Businesses rely on these models for critical tasks such as content generation, customer service, and data analysis. Evaluations help identify potential biases, factual inaccuracies, and inconsistencies that could result in misleading outputs. By conducting evaluations, businesses can boost performance and ROI by fine-tuning models and optimizing workflows. Evaluation frameworks also provide a standardized approach for comparing different LLMs, enabling informed decision-making and assessing real-world applicability. Furthermore, they mitigate risks by identifying and addressing biases, ensuring fairness and responsible AI implementation.

  • 3-2. Evaluation Metrics: Fluency and Coherence

  • Fluency and coherence are key metrics in evaluating LLMs. These metrics ensure that the generated text is not only factually accurate but also grammatically sound, well-structured, and easy to comprehend. Metrics such as Perplexity and Grammaticality assess the overall quality and readability of the LLM’s output. Additionally, human language metrics involve experts reviewing the generated text for factual accuracy or clarity, as well as user studies where participants interact with the LLM and provide feedback on its helpfulness and overall quality.
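
  • Perplexity can be computed directly from per-token log-probabilities. The sketch below assumes natural-log probabilities; the values are hypothetical.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean log-probability) over the generated tokens."""
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-avg_logprob)

print(perplexity([-0.3, -1.2, -0.8, -0.5]))  # lower is generally more fluent
```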

  • 3-3. Evaluation Methods: Offline and Online

  • There are two primary methods for evaluating LLMs: offline and online evaluations. Offline evaluation involves using benchmark datasets with pre-made questions and answer sheets. Automated metrics are used to score the LLM's answers and determine its performance. This is akin to giving the LLM a practice test. Online evaluation, on the other hand, involves real-time interactions where experts review the generated text for factual accuracy and clarity. User studies are also conducted where participants interact with the LLM and provide feedback on its performance. This method assesses not just the core LLM engine but the entire system, including user interfaces and interactions.
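
  • A minimal sketch of the offline ('practice test') setup is shown below, using exact-match accuracy; ask_model is a hypothetical stand-in for whatever call invokes the LLM under test.

```python
def exact_match_accuracy(benchmark: list[dict], ask_model) -> float:
    """benchmark items look like {"question": ..., "answer": ...}."""
    correct = 0
    for item in benchmark:
        prediction = ask_model(item["question"]).strip().lower()
        if prediction == item["answer"].strip().lower():
            correct += 1
    return correct / len(benchmark)
```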

  • 3-4. Benchmarking Steps

  • The following steps are involved in benchmarking LLMs:
    1. Benchmark Selection: Choose a combination of benchmarks to comprehensively evaluate LLM performance. Example benchmarks include GLUE, MMLU, AlpacaEval, and HELM, which test language understanding, reasoning skills, instruction-following, and fairness.
    2. Dataset Preparation: Build high-quality, unbiased datasets to train, test, and validate the LLM on specific tasks.
    3. Model Training: Train the LLM on massive text datasets, such as Wikipedia, and fine-tune it on specific benchmark tasks.
    4. Evaluation: Test the trained models using benchmarks to determine their performance in terms of accuracy, coherence, and more.
    5. Comparative Analysis: Compare the results of different LLMs to identify their strengths, weaknesses, and the best models for specific tasks (a minimal sketch of this step follows the list).
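
  • A minimal sketch of the comparative-analysis step, assuming each model has already been scored per benchmark; all numbers below are hypothetical.

```python
results = {
    "model_a": {"MMLU": 0.71, "HELM": 0.64, "AlpacaEval": 0.58},
    "model_b": {"MMLU": 0.69, "HELM": 0.67, "AlpacaEval": 0.61},
}

# For each benchmark, report which model scored highest.
for benchmark in ["MMLU", "HELM", "AlpacaEval"]:
    best = max(results, key=lambda m: results[m][benchmark])
    print(f"{benchmark}: best model = {best} ({results[best][benchmark]:.2f})")
```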

4. Current Advances and Challenges in AI Models

  • 4-1. Release of Claude 3.5 Sonnet

  • Anthropic recently announced the release of its new model, Claude 3.5 Sonnet. This model is an upgrade to the existing Claude 3 family of AI models. It has shown improvements in solving math, coding, and logic problems as measured by commonly used benchmarks. It also demonstrates faster performance, better understanding of language nuances, and even a better sense of humor. These enhancements make it more useful for building applications and services on top of Anthropic’s AI models.

  • 4-2. Comparison with GPT-4

  • When comparing Claude 3.5 Sonnet to OpenAI's GPT-4, it is important to note that GPT-4 sent shockwaves through the tech world upon its release. GPT-4 is known for its capabilities in chatting, coding, and solving complex problems, including academic homework. While Claude 3.5 Sonnet is considered an improvement over previous models, it represents an incremental advancement rather than a revolutionary leap. It stands out for its logical reasoning skills and outperforms models from OpenAI, Google, and Facebook in several popular AI benchmarks. However, the improvements are marginal, amounting to a few percentage points.

  • 4-3. AI Industry’s Incremental Progress

  • The AI industry has seen steady, incremental progress rather than revolutionary leaps in recent times. For example, Anthropic’s Claude 3.5 Sonnet, despite being more advanced than its predecessors, is not considered a major breakthrough. Michael Gerstenhaber, head of product at Anthropic, highlights that the advancements in Claude 3.5 Sonnet are primarily due to innovations in training rather than significant increases in model size or computational power. This slow but steady improvement trend is evident across the industry, with new models from various developers offering slight enhancements rather than massive leaps.

  • 4-4. Challenges: Data Scarcity and Costs

  • Two significant challenges facing the AI industry are data scarcity and the high costs associated with model training. As models like GPT-4 were trained on extensive datasets comprising text, imagery, and video, finding new data sources for further training has become increasingly difficult. This scarcity poses a challenge for developing next-generation models. Additionally, the financial burden of training these large language models is substantial. For instance, GPT-4 cost more than $100 million to train, and the anticipated GPT-5 is expected to be even more expensive. These high costs add another layer of complexity to the incremental progression of AI model development.

5. AI Evaluation Benchmarks

  • 5-1. Abstraction and Reasoning Challenge (ARC)

  • The Abstraction and Reasoning Challenge (ARC) was proposed by François Chollet, the Google researcher who created the deep learning library Keras. The challenge accompanies Chollet's arXiv paper 'On the Measure of Intelligence' and is designed as a benchmark for visual transformations such as symmetry, counting, colors, and rotation. ARC is characterized by tasks that are simple for humans but difficult for state-of-the-art AI. It reflects ideas from psychometric tests like Raven's Progressive Matrices and traditional Bongard problems in machine vision, but with a colorful Tetris-like appearance.
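
  • To make the task format concrete, the sketch below represents an ARC-style grid as a small 2-D array of color indices and checks a candidate transformation rule (here, a 90-degree rotation) against an example pair; the grids are hypothetical, not actual ARC tasks.

```python
def rotate_90(grid: list[list[int]]) -> list[list[int]]:
    """Rotate a grid of color indices 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

example_input = [[0, 1],
                 [2, 3]]
example_output = [[2, 0],
                  [3, 1]]

print(rotate_90(example_input) == example_output)  # True if the rule explains the pair
```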

  • 5-2. Tabletop World Challenge

  • The Tabletop World Challenge, proposed by Lázaro-Gredilla et al., presents abstract visual reasoning tasks similar to ARC. Unlike ARC, however, it emphasizes systematically generating problems from a theory of the capabilities they are meant to measure. This entails a good understanding of what makes some instances more demanding than others, along with a procedural generator for held-out problems.

  • 5-3. New Benchmarks for Language Learning, Computer Vision, and Multimodal Language Models

  • Several new benchmarks have been introduced, including LiveBench, which updates its questions monthly to remain contamination-free, SEAL leaderboard by Scale with periodically updated questions, and RUPBench focusing on syntactic and semantic perturbations. Additionally, DevBench is a multimodal developmental benchmark for language learning, and MMLU-Pro updates the popular MMLU benchmark with more complex, reasoning-focused questions. IrokoBench supports African languages, while NYU's new CV-Bench is a computer vision benchmark crafted alongside their Cambrian 1 multimodal LLM. LMSYS has also introduced challenges relating to human preference prediction and reward models similar to RewardBench.

  • 5-4. Human Perspective Evaluation Complexities

  • Evaluating LLMs from a human perspective is challenging, as evidenced by a workshop on Human-Centered Evaluation and Auditing of Language Models; human preferences are more complex than they appear. MixEval, for instance, is a 'minibenchmark' that extracts small selections of instances whose performance results correlate with Chatbot Arena. A study of metrics for measuring geographic disparities in generated images found that objects are depicted more realistically than backgrounds and that image generators struggle with modern vehicles in Africa.
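
  • How well a minibenchmark like MixEval tracks Chatbot Arena can be checked with a rank correlation. The sketch below uses SciPy's Spearman correlation; all scores are hypothetical.

```python
from scipy.stats import spearmanr

minibench_scores = [71.2, 68.5, 80.1, 65.0]   # one entry per model
arena_ratings    = [1180, 1150, 1250, 1105]   # same models, same order

rho, p_value = spearmanr(minibench_scores, arena_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```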

6. Domain-Specific Improvements in Medical Reasoning

  • 6-1. Self-BioRAG framework and its functionalities

  • The Self-BioRAG framework is designed specifically for the biomedical domain. It aims to improve medical reasoning by integrating domain-specific components such as biomedical instruction sets, a retriever for relevant documents, and a self-reflection language model. Self-BioRAG is trained using 84,000 filtered biomedical instruction sets, which allow it to generate explanations, retrieve domain-specific documents, and self-reflect on its generated responses. This framework ensures that the model can handle domain-related instructions effectively and provide pertinent responses based on retrieved knowledge and encoded information.
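
  • The retrieve-generate-reflect flow described above can be summarized as in the sketch below; all functions are hypothetical stubs standing in for Self-BioRAG's real components, not the actual implementation.

```python
# Hypothetical stubs standing in for Self-BioRAG's real components.
def needs_retrieval(question: str) -> bool:
    return "treatment" in question.lower()

def retrieve_biomedical_docs(question: str, k: int = 5) -> list[str]:
    return ["(retrieved passage placeholder)"] * k

def generate(question: str, docs: list[str], revise: bool = False) -> str:
    return "(draft answer placeholder)"

def self_assess(draft: str, docs: list[str]) -> bool:
    return bool(docs)

def answer_biomedical_question(question: str) -> str:
    docs = retrieve_biomedical_docs(question) if needs_retrieval(question) else []
    draft = generate(question, docs)              # generate with retrieved context
    if not self_assess(draft, docs):              # self-reflection on the draft
        draft = generate(question, docs, revise=True)
    return draft

print(answer_biomedical_question("What is the first-line treatment for type 2 diabetes?"))
```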

  • 6-2. Performance gains in medical QA benchmarks

  • Experimental results indicate that Self-BioRAG significantly outperforms state-of-the-art models in several medical question-answering benchmarks. Notably, it achieves a 7.2% absolute improvement on average over models with a parameter size of 7B or less across three major medical QA benchmark datasets. Additionally, Self-BioRAG outperforms the RAG framework by 8% in Rouge-1 score, generating more proficient answers on two long-form question-answering benchmarks. These performance gains demonstrate the effectiveness of the domain-specific components in addressing biomedical questions.
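
  • For reference, Rouge-1 between a generated answer and a reference can be computed with the rouge_score package, as in the sketch below; the two strings are hypothetical examples, not benchmark data.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
reference = "Metformin is a first-line treatment for type 2 diabetes."
generated = "Type 2 diabetes is commonly treated first with metformin."

scores = scorer.score(reference, generated)
print(scores["rouge1"].fmeasure)  # unigram-overlap F1 between the two texts
```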

  • 6-3. Importance of domain-specific components

  • The success of Self-BioRAG is largely attributed to its domain-specific components, which include a biomedical retriever, domain-related document corpus, and tailored instruction sets. These components are necessary to ensure that the model can adhere to domain-related instructions and provide accurate, well-informed responses. The framework uses the MedCPT retriever to retrieve relevant documents from sources like PubMed Abstract, PMC Full Text, Clinical Guidelines, and Medical Textbooks. This approach supplements the model's encoded knowledge with factual content, enhancing its reliability and accuracy.
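
  • Below is a minimal sketch of dense retrieval by cosine similarity, standing in for the MedCPT retriever; the embeddings are random placeholders rather than real document or query vectors.

```python
import numpy as np

def top_k_documents(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k documents most similar to the query (cosine similarity)."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(100, 768))   # placeholder corpus embeddings
query_vec = rng.normal(size=768)         # placeholder query embedding
print(top_k_documents(query_vec, doc_vecs, k=3))
```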

  • 6-4. Reflective tokens and factual content retrieval

  • Self-BioRAG leverages reflective tokens to assess and enhance its responses. Reflective tokens guide the model in deciding when retrieval of external information is necessary, evaluating the quality of the retrieved evidence, and determining whether the generated response aligns with the retrieved content. This mechanism ensures that the model can provide well-supported and accurate answers to complex medical questions. Reflective tokens play a crucial role in improving the model's reasoning ability and overall performance.
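
  • The roles of the reflective tokens can be summarized as in the sketch below; the token names are illustrative of the categories described above, not the exact Self-BioRAG vocabulary.

```python
# Illustrative mapping from reflective tokens to the decisions they encode.
REFLECTIVE_TOKENS = {
    "[Retrieve]":   "fetch external biomedical evidence before answering",
    "[Relevant]":   "the retrieved passage is judged useful for the question",
    "[Supported]":  "the draft answer is grounded in the retrieved content",
    "[No-Support]": "the draft answer needs revision or additional evidence",
}

def interpret(token: str) -> str:
    return REFLECTIVE_TOKENS.get(token, "unknown reflective token")

print(interpret("[Retrieve]"))
```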

7. Conclusion

  • The advancements in LLMs, particularly models like GPT-4 and Claude 3.5 Sonnet, emphasize the steady albeit incremental progress in AI technology. The significance of evaluation methods and benchmarks is underscored as they provide a means to gauge performance and reliability. Despite these enhancements, managing the token usage introduced by enabled tools remains critical for effective deployment. Furthermore, the report highlights ongoing challenges such as data scarcity and the high costs associated with training these models. Frameworks like Self-BioRAG indicate the profound impact of domain-specific LLMs, particularly in the realm of medical reasoning, achieving substantial improvements over previous benchmarks. Future efforts should focus on addressing the limitations posed by the scarcity of data and the immense resources required for model training, fostering a more robust and sustainable development trajectory for AI models.

8. Glossary

  • 8-1. GPT-4 [Model]

  • GPT-4 is an advanced language model by OpenAI, known for transforming NLP with significant improvements in fluency, coherence, and understanding complex instructions. It plays a crucial role in various AI applications, including medical reasoning and general AI tasks.

  • 8-2. Claude 3.5 Sonnet [Model]

  • Claude 3.5 Sonnet is an AI model by Anthropic, showing advancements in problem-solving, math, and coding. It outperforms many contemporary models and is indicative of steady progress in AI although not as revolutionary as anticipated next-generation models.

  • 8-3. Self-BioRAG [Framework]

  • Self-BioRAG is a framework designed for medical question-answering using retrieval-augmented LLMs. It incorporates domain-specific components and reflective tokens to enhance performance, demonstrating substantial improvement in biomedical QA benchmarks.
