Researchers Claim That Even the Least Advanced Claude AI Outperforms GPT-3.5

The Battle of AI Chatbots: GPT-4 vs. Claude

The AI industry is witnessing intense competition between two prominent chatbot families: OpenAI's ChatGPT and Anthropic's Claude. The Large Model Systems Organization (LMSO), the organization behind the Vicuna model and Chatbot Arena, has recently updated its Chatbot Arena Leaderboard to reflect how these chatbots perform against each other. Surprisingly, Anthropic's Claude models have outperformed GPT-3.5, the engine powering the free version of ChatGPT.

GPT-4, which powers ChatGPT Plus and Bing AI, tops the leaderboard, establishing itself as the gold standard for Large Language Models (LLMs). However, all three of Anthropic's listed Claude models (Claude 1, Claude 2, and Claude Instant) rank above GPT-3.5. This means that every Anthropic LLM on the leaderboard outranks the model behind the free version of ChatGPT.

Ranking and Performance Metrics

The LMSO's ranking system provides insight into how these models compare. According to the leaderboard, GPT-4 holds an Arena Elo rating of 1181, the highest of any model. The Claude models follow with ratings between 1119 and 1155, while GPT-3.5 trails at 1115, just four points behind the lowest-rated Claude model.
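To get a feel for what these gaps mean, an Elo difference can be converted into an expected head-to-head win rate. The sketch below assumes the conventional Elo formula (base 10, scale 400); the function name expected_score is illustrative, and the leaderboard's exact parameters are not stated in this article.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the conventional Elo model.

    Assumption: base-10 logistic curve with a 400-point scale. This is the
    textbook formula, not necessarily the leaderboard's exact configuration.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings quoted in the article. The article gives only the range for the
# Claude models, so the labels below name the range endpoints.
ratings = {
    "GPT-4": 1181,
    "highest-rated Claude": 1155,
    "lowest-rated Claude": 1119,
    "GPT-3.5": 1115,
}

for name in ("GPT-4", "highest-rated Claude", "lowest-rated Claude"):
    p = expected_score(ratings[name], ratings["GPT-3.5"])
    print(f"{name} vs GPT-3.5: expected win rate {p:.1%}")
```

Under these assumptions, the 66-point gap between GPT-4 and GPT-3.5 works out to an expected win rate of roughly 59%, while the 4-point gap for the lowest-rated Claude model is barely above an even split.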

To determine the rankings, the LMSO pits the models against each other in head-to-head matches in which two models answer the same prompt. Users then vote for the better answer without knowing which models are competing.
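A minimal sketch of how blind pairwise votes could be turned into Elo ratings follows. The update rule is the standard online Elo update; the K-factor of 4, the starting rating of 1000, the model names, and the battle records are all assumptions made for illustration, not values taken from the leaderboard.

```python
def update_elo(ratings, model_a, model_b, winner,
               k=4.0, scale=400.0, initial=1000.0):
    """Apply one blind-vote result to the ratings dict (mutated in place).

    winner is "a", "b", or "tie". k, scale, and initial are assumed values.
    """
    ra = ratings.get(model_a, initial)
    rb = ratings.get(model_b, initial)
    # Expected score of model A before the vote (conventional Elo curve).
    expected_a = 1.0 / (1.0 + 10.0 ** ((rb - ra) / scale))
    actual_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    delta = k * (actual_a - expected_a)
    # Zero-sum update: whatever A gains, B loses.
    ratings[model_a] = ra + delta
    ratings[model_b] = rb - delta
    return ratings


# Hypothetical battle log: (model shown as A, model shown as B, user's vote).
battles = [
    ("model_x", "model_y", "a"),
    ("model_y", "model_z", "tie"),
    ("model_x", "model_z", "a"),
    ("model_z", "model_y", "b"),
]

demo_ratings = {}
for a, b, vote in battles:
    update_elo(demo_ratings, a, b, vote)

for name, rating in sorted(demo_ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Because each update is zero-sum, the average rating stays at the starting value; a model climbs only by winning votes more often than its current rating predicts.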

The Advantage of Claude Models

Although the LMSO ranking does not take context window size into account, it is worth noting that Claude Pro, based on the Claude 2 LLM, can process up to 100K tokens, while ChatGPT Plus handles only 8,192 tokens. This larger window lets Claude models take in long documents or extended conversations in a single prompt, so their answers can draw on more context.
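As a rough illustration of what that difference means in practice, the sketch below checks whether a piece of text would fit in each context window. It assumes about four characters per token, a common rule of thumb for English text rather than an actual tokenizer, and treats 100K as 100,000; the function names are illustrative.

```python
import math

# Context window sizes quoted in the article, in tokens.
CONTEXT_WINDOWS = {
    "Claude Pro (Claude 2)": 100_000,
    "ChatGPT Plus": 8_192,
}


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate. Assumption: ~4 characters per token."""
    return math.ceil(len(text) / chars_per_token)


def fits(text: str, window: int, reserve_for_reply: int = 0) -> bool:
    """True if the estimated prompt size leaves room for the reply."""
    return estimate_tokens(text) + reserve_for_reply <= window


# Hypothetical 40,000-character document (about 10,000 estimated tokens).
document = "x" * 40_000

for name, window in CONTEXT_WINDOWS.items():
    print(f"{name}: fits = {fits(document, window)}")

ratio = CONTEXT_WINDOWS["Claude Pro (Claude 2)"] / CONTEXT_WINDOWS["ChatGPT Plus"]
print(f"Window size ratio: about {ratio:.1f}x")
```

A real application would count tokens with each provider's own tokenizer, since actual counts vary with language and content.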

Claude 2 has also been shown to handle long prompts more efficiently than the GPT models. On comparable prompts, Claude 1 and Claude Instant provide results similar to or slightly better than those of GPT-3.5. Their initial answers also improve noticeably when the prompt is refined, lengthened, or enriched.

The Role of Open-Source Models

Open-source models are also making their mark in this competition. WizardLM, a 70-billion-parameter model trained on Meta's Llama 2, stands out as the best open-source LLM. Vicuna 33B and the original Llama 2 by Meta follow closely. Open-source models play a crucial role in the AI space because users can run them locally and fine-tune them, fostering a collective effort to improve the models. Additionally, their licenses make them cheaper to run than proprietary models.

Real-World Implications

As chatbots become increasingly important in areas such as customer service and personal assistance, their efficacy, adaptability, and accuracy become crucial. With Claude models ranking higher than GPT-3.5, businesses and individual users may need to consider carefully which model best fits their needs. Decrypt has prepared two guides to help you make an informed decision.

Hot Take: The Fierce Competition in the AI World

For those closely following the AI industry, this leaderboard update is more than routine. It highlights how fierce the competition is and how quickly the tides can turn. It is a reminder that today's most popular model can easily be overtaken by a more capable or more efficient one.