I broke Meta's Llama 3.1 405B with one question (which GPT-4o gets right) (2024)


Meta last week unveiled its largest large language model (LLM) to date, Llama 3.1 405B, which the company claims is the first open-source "frontier model" -- meaning, a model that can compete with the best that closed-source has to offer, such as OpenAI's GPT-4 and Google's Gemini 1.5.

It turns out that Llama 3.1 can be broken just as easily as those models, or even more easily. Just as I broke Gemini 1.5 with a query pertaining to language translation when it first became available, I was able to make Llama 3.1 resort to gibberish with my very first question.


The Google Gemini fail was such a beautiful example of a simple question that it has become my go-to first question for testing LLMs. Sure enough, it broke Meta's Llama 3.1 405B on the first try.

You could say that a question about the Georgian-language verb "ყოფნა," meaning "to be," is a corner case. Except that the country of Georgia, situated in the Caucasus region between the Black Sea and the Caspian Sea, is home to almost 4 million speakers of Georgian, so messing up the conjugation of the language's most important verb seems a bit more than a corner case.

In any event, I submitted my query to Llama 3.1 405B in the following form: "What is the conjugation of the Georgian verb ყოფნა?"


I used Meta's Meta AI site, where you can use Llama 3.1 405B for free, and Hugging Face's HuggingChat, where you can create chatbots from any open-source AI model with a public code repository. I also tried the query on Groq's commercially hosted chatbot. In all cases, the response was gibberish.
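For readers who want to reproduce the test against a hosted model, here is a minimal Python sketch of how the same prompt could be packaged for an OpenAI-compatible chat endpoint, which Groq and other hosts expose. The URL and model identifier below are illustrative assumptions, not details from my testing; check your provider's documentation for the real values.

```python
# Hedged sketch: building the request body one might send to an
# OpenAI-compatible chat-completions endpoint. The endpoint URL and
# model name are hypothetical placeholders, not confirmed identifiers.
import json

API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint
payload = {
    "model": "llama-3.1-405b",  # hypothetical model ID; consult your host's docs
    "messages": [
        {"role": "user",
         "content": "What is the conjugation of the Georgian verb ყოფნა?"}
    ],
}
# ensure_ascii=False keeps the Georgian script readable in the serialized body.
body = json.dumps(payload, ensure_ascii=False)
# POST `body` to the endpoint with your API key to retrieve the model's answer.
```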

First, here's the correct answer, from OpenAI's GPT-4o:

(Most of the other LLMs and chatbots, including Google's Gemini, now answer this question correctly.)

[Screenshot: GPT-4o's correct conjugation table for ყოფნა]

At first, the Meta AI site protested, offering a message that ყოფნა was too complicated. After I insisted, it came up with a ridiculous made-up set of words. Here's Llama 3.1 405B's answer:

[Screenshot: Llama 3.1 405B's made-up conjugation]

As you'll notice, in comparison to the correct answer above, the Llama 3.1 answers aren't even close.

The HuggingChat and Groq versions didn't even protest; they directly offered up the same kind of ridiculous answer. In HuggingChat's response, the model produced a different set of gibberish words from the ones offered by the Meta AI site:

[Screenshot: HuggingChat's gibberish response]

The utter failure of Llama 3.1 on a foreign-language question is particularly galling given that Meta's researchers discuss at length in their technical paper how Llama 3.1 advances on the prior version in what they call "multilinguality," meaning support for many languages beyond English.

The authors solicited a lot of extra human feedback on language answers. "We collect high-quality, manually annotated data from linguists and native speakers," they write. "These annotations mostly consist of open-ended prompts that represent real world use cases."


It's possible to see, in the failure case, some interesting hints at what is going on with Llama 3.1 405B. The fake first-person form, "ვაყოფ," certainly sounds, even to my non-native ears, like a legitimate Georgian word: the prefix "ვ-" is the standard first-person marker, and "-ოფ" is a valid Georgian-language suffix.

So it may be that the model is over-generalizing: finding a quick way to answer by producing synthetic forms, if you will, forms that follow patterns valid across much of the language but that fail when applied without regard for exceptions.
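The over-generalization hypothesis is easy to illustrate. Below is a small Python sketch of my own (not Meta's code, and a drastic simplification of Georgian morphology) contrasting a naive rule-based conjugator, which blindly applies the regular person markers to a verb root, with the actual irregular present tense of ყოფნა, which is built on the suppletive root არ-:

```python
# Hedged illustration: a naive conjugator that over-applies simplified
# regular Georgian person markers (ვ- for first person, -ს for third-person
# singular, -თ / -ენ for plurals) to a verb root. Real Georgian morphology
# is far richer than this.
def naive_conjugate(root):
    return {
        "1sg": "ვ" + root,          # over-generalized, much like the model's ვაყოფ
        "2sg": root,
        "3sg": root + "ს",
        "1pl": "ვ" + root + "თ",
        "2pl": root + "თ",
        "3pl": root + "ენ",
    }

# The actual present tense of ყოფნა ("to be") is irregular, built on არ-:
actual = {
    "1sg": "ვარ", "2sg": "ხარ", "3sg": "არის",
    "1pl": "ვართ", "2pl": "ხართ", "3pl": "არიან",
}

naive = naive_conjugate("ყოფ")
# Every naive form misses the irregular paradigm entirely.
wrong = [person for person in actual if naive[person] != actual[person]]
```

A model that has learned the regular patterns strongly but the exceptions weakly would behave much like `naive_conjugate`: plausible-looking output, zero correct forms.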

Llama 3.1 405B's answers can vary with multiple attempts. Here, for example, when the question is tried again, the model outputs a valid table of conjugations for the present tense:

[Screenshot: Llama 3.1 405B's valid present-tense conjugation table]

When prompted for the future tense, however, the model almost gets it right, but not quite. It fails to add the first-person prefix ვ- to the very first conjugation in the table:

[Screenshot: Llama 3.1 405B's future-tense table, missing the ვ- prefix]

Interestingly, Llama 3.1 405B's smaller cousin, 70B, actually gets the right answer for the present tense on the very first try. That suggests that all the extra training and computing power that has gone into the larger 405B version can, at least in some cases, actually degrade the results.

I'd imagine Meta's engineers need to look closely at their corner cases and failure instances and see if their software is over-generalizing.

Meta's researchers made extensive use of synthetic data to "fine-tune" the model and supplement the human feedback they gathered. It's an open question whether synthetic data used at great scale contributes to over-regularization, as suggested by an article this week in the journal Nature.
