
Amazon's Alexa scientists demonstrate bigger AI isn't always better



A simple task, to reduce all the words in an article to a compact sequence of words that explains the article's central point, is among the benchmark tasks in deep learning. This is where Amazon's Alexa AI scientists say they can best the efforts of vastly larger computer programs from DeepMind, Google, Meta, OpenAI, and others. The work has implications for the energy use and carbon footprint of AI.


Two threads of research strongly dominate machine learning these days: making programs more general in their approach (to handle any potential task) and making them bigger.

The biggest neural nets, as measured by their parameters or "weights," are clocking in at over half a trillion weights. Models such as Google's Pathways Language Model, or PaLM, and Nvidia and Microsoft's Megatron-Turing NLG 530B are among the biggest, with 540 billion and 530 billion parameters, respectively. The more parameters a program has, in general, the greater the amount of computing power it consumes to train, and also to run for making predictions, what's called inference.
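A commonly cited rule of thumb (an outside approximation, not a figure from this article) makes that relationship concrete: training a dense transformer costs roughly 6 × N × D floating-point operations for N parameters and D training tokens, and generating each token at inference costs roughly 2 × N. The sketch below plugs in illustrative sizes.

```python
# Rule-of-thumb compute estimates for dense transformers (a common
# approximation, not a figure from the article): training takes roughly
# 6 * N * D FLOPs for N parameters and D training tokens; inference takes
# roughly 2 * N FLOPs per generated token.

def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

def inference_flops_per_token(n_params: float) -> float:
    return 2 * n_params

# Illustrative sizes: a 540-billion-parameter model versus a 20-billion one,
# each assumed (hypothetically) to see 1 trillion training tokens.
for n_params in (540e9, 20e9):
    print(f"{n_params / 1e9:.0f}B params: "
          f"train ~{train_flops(n_params, 1e12):.2e} FLOPs, "
          f"infer ~{inference_flops_per_token(n_params):.2e} FLOPs/token")
```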


The cognoscenti of AI insist the path is definitely up and to the right for parameter count, toward a trillion parameters and way beyond in the not-so-distant future. The figure of 100 trillion is a kind of magical target because it is believed to be the number of synapses in a human brain, so it serves as a benchmark of sorts.

Also: Nvidia clarifies Megatron-Turing scale claim

At the same time, there is a fervor to make deep neural networks that can be as general as possible. For much of the machine learning history of the last 40 years, programs were specialized for tasks such as image recognition or speech recognition. That has changed in recent years, with more and more programs offering to be generalists, such as DeepMind's Perceiver AR, and another DeepMind program, Gato, referred to as "a generalist agent" capable of solving myriad tasks.


Incidentally, the authors also take pains to shape the majority of the input as natural spoken text, dropping capitalization and punctuation, which has importance in an Alexa setting. "We include more spoken than written text to satisfy our internal use cases," they write.

Some of the Alexa AI team's technologies are used in Alexa products, although Amazon told ZDNet in an email that the group "also [does] forward-looking research." The AlexaTM 20B model, said Amazon, "is primarily a research project at this stage."

Added Amazon, "It's possible that this model will be deployed in production in the future, but only the modified version with guardrails will be used to develop Alexa features and products."

Also: Google's massive language translation work identifies where it goofs up

The authors train the AlexaTM 20B model "for 120 days on 128 [Nvidia] A100 GPUs for the total of 500k updates with the accumulated batch size of 2 million tokens (total of 1 trillion token updates)," they write. 
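As a quick sanity check on those figures, using only the numbers quoted above: 500,000 updates at an accumulated batch of 2 million tokens per update works out to the stated 1 trillion tokens.

```python
# Sanity check on the training figures quoted above.
updates = 500_000              # "500k updates"
tokens_per_update = 2_000_000  # "accumulated batch size of 2 million tokens"
total_tokens = updates * tokens_per_update
print(f"{total_tokens:.2e} tokens")  # 1.00e+12, i.e. the "1 trillion token updates"
```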

That might sound like a lot, but it's less than PaLM, which was trained by Google on two of its fourth-generation TPU Pods, consisting of 3,072 TPU chips in each Pod, which are attached to 768 host computers. 

As Google authors Aakanksha Chowdhery and team noted in April, that was "the largest TPU configuration described to date."

The claims are spelled out in specific test results. Soltan and team place a special emphasis on their success in particular tasks as opposed to every task conceivable. For example, Soltan and team observe that "AlexaTM 20B performs better or on par with the largest dense decoder-only model to date (i.e., PaLM 540B) in summarization both in 1-shot and fine-tuning settings." This is specifically true in a task of summarizing paragraphs known as MLSum; in German, Spanish, and French, AlexaTM 20B beat PaLM handily.

The MLSum benchmark test, introduced in 2020 by France's National Centre for Scientific Research, comprises 1.5 million articles from newspapers. The task is for a language model to output a few sentences of text that express the idea laid out in the entire article. This requires a lot of reduction, obviously, of hundreds of words down to perhaps a few dozen.
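To make the task concrete, here is roughly what an MLSum-style summarization call looks like with an off-the-shelf seq2seq model. This is a generic sketch: the model identifier is a placeholder, and AlexaTM 20B itself is not assumed to be downloadable this way.

```python
from transformers import pipeline

# Placeholder checkpoint: substitute any multilingual seq2seq summarization
# model you have access to; AlexaTM 20B is not assumed to be available here.
summarizer = pipeline("summarization", model="your-org/your-seq2seq-summarizer")

# An MLSum-style input: a full newspaper article of several hundred words.
article = "Hunderte Woerter eines Zeitungsartikels ..."  # article text goes here

# The task: compress hundreds of words down to a few sentences.
summary = summarizer(article, max_length=60, min_length=15, do_sample=False)
print(summary[0]["summary_text"])
```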


On a fourth test, XSum, performed in English, the AlexaTM 20B model was a close second, and it beat out a version of PaLM that was bigger than AlexaTM 20B but smaller than the 540-billion-parameter version of PaLM. 

While it excels at summarization, the AlexaTM 20B falls down on some other tasks. For example, tested on "reasoning" data sets such as MultiArith, and on "chain of thought" reasoning tasks (very simple arithmetic problems written in natural language), the program falls far behind what is accomplished by much larger models like GPT-3.
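For readers unfamiliar with the term, a "chain of thought" prompt spells out the intermediate steps inside the few-shot examples, which is what the "special prompts" quoted below refer to. A minimal illustration (the wording is ours, not taken from the paper):

```python
# A minimal "chain of thought" style prompt (illustrative wording, not from
# the paper): the worked example spells out its intermediate arithmetic step,
# and the model is expected to continue the final answer in the same style.
cot_prompt = (
    "Q: Tom has 3 boxes with 4 apples each. He eats 2 apples. How many are left?\n"
    "A: 3 boxes * 4 apples = 12 apples. 12 - 2 = 10. The answer is 10.\n"
    "Q: A class has 5 rows of 6 desks. 3 desks are removed. How many desks remain?\n"
    "A:"
)
```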

Also: The future of AI is a software story, says Graphcore's CEO

Write Soltan and team, "AlexaTM 20B performs slightly better than similar sized models, however, we did not observe the gain that much larger models like GPT3 175B show from such special prompts" -- that is, clues given to the program about the next step in a problem.

"The results indicate that scaling up the model parameters is crucial in performing well in 'reasoning' tasks as was previously demonstrated […] in decoder-only architectures using Instruct-GPT3 models."

Focusing on the successful tasks, such as summarization, the main conclusion that Soltan and team arrive at is that their mixed approach to training the program -- using both objectives of de-noising and causal language modeling -- is a key to making things more efficient.

"This suggests that mixed pre-training, and not necessarily additional multitask training […] is the key to train strong seq2seq-based Large-scale Language Models (LLM)," they write. 

To return to the original question of size, as has been noted in many contexts, the energy usage of increasingly large AI programs is an ethical concern within AI practices. The authors make a strong case for the relevance of their more-efficient approach. 

Also: Ethics of AI: Benefits and risks of artificial intelligence

Because the AlexaTM 20B "is much smaller in size than models like GPT3 175B, yet achieving similar or better performance across different tasks," they write, "the ongoing environmental impact of using AlexaTM 20B for inference is much lower than that of larger models (roughly 8.7 times lower)."

They add, "Hence, over time, AlexaTM 20B has [a] lower carbon footprint as well."
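One way to see where a figure like 8.7 times could come from (a back-of-the-envelope reading on our part, not the authors' calculation): per-token inference cost for dense models scales roughly with parameter count, and 175 billion divided by 20 billion is about 8.75.

```python
# Back-of-the-envelope check (not the authors' calculation): if per-token
# inference cost scales roughly linearly with parameter count, the footprint
# ratio should track the size ratio.
gpt3_params = 175e9
alexatm_params = 20e9
print(gpt3_params / alexatm_params)  # ~8.75, close to the quoted ~8.7x
```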

The authors offer a table of stats showing the relative carbon footprint, and there's a big difference in the numbers. 
