The Tale of T0

We trained a language model capable of generalizing to unseen natural language tasks while outperforming (or matching) GPT-3 despite being 16x smaller. This matters because many language tasks do not have readily available labels for training. In this blog post, we test T0 on cooking recommendations and world knowledge questions.

📄 Read Paper · 💻 Browse Repo · 🛠 Try Your Own Prompts

Overview

When large language models such as GPT-3 (Brown et al., 2020) succeeded in performing downstream tasks without ever being finetuned on them, the NLP community got excited about the future of zero-shot (and few-shot) learning: pretrained language models can potentially be applied to a variety of tasks with little or no labeled data and still achieve non-trivial performance.


Figure 1. After training on a diverse mixture of tasks (top), T0 generalizes to unseen tasks (bottom).

At BigScience, we explored the following research question: “if we explicitly train a language model on a massive mixture of diverse NLP tasks, would it generalize to unseen NLP tasks?” And the answer is yes! We named the resulting model T0, as in “T5 (Raffel et al., 2020) for zero-shot”. T0 is trained on a diverse mixture of tasks such as summarization and question answering, and performs well on unseen tasks such as natural language inference, as shown in Figure 1.

A natural strategy for training a model on a massive multi-task mixture is natural language prompting. The key is to reformulate any NLP task into a text-to-text format, as if we were asking another person to solve the task. The prompt is the part of the input that poses a query about a given instance. With prompting, the model simply receives the text prompt as input and returns a text output. Figure 2 illustrates how to prompt T0 to generate a short biography.


Figure 2. An input prompt to generate a short biography given the infobox of the person.

To create T0, we first collected NLP datasets and converted them into a text-to-text format through crowdsourcing efforts and the promptsource tool, gathering around 2,000 prompts for 170 English datasets. Next, we fine-tuned a version of the pretrained T5 language model on this massive mixture of NLP tasks.
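
To make the conversion concrete, here is a toy sketch of how a single prompt template maps a structured dataset example to an input/target text pair for fine-tuning. This is illustrative only: the field names, wording, and example are made up and do not come from an actual promptsource template (which uses Jinja templating under the hood).

```python
# Toy sketch of prompted text-to-text conversion; the template wording and
# field names are illustrative, not an actual promptsource template.
def apply_paraphrase_template(example):
    """Map a paraphrase-identification example to an (input, target) text pair."""
    input_text = (
        f'Does "{example["sentence1"]}" mean the same thing as '
        f'"{example["sentence2"]}"? Yes or no?'
    )
    target_text = "yes" if example["label"] == 1 else "no"
    return input_text, target_text

example = {
    "sentence1": "The storm knocked out power across the city.",
    "sentence2": "Electricity went down citywide because of the storm.",
    "label": 1,  # the two sentences are paraphrases
}
print(apply_paraphrase_template(example))
```

Applying many such templates to many datasets yields the mixture of (input, target) pairs that T5 is fine-tuned on.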

T0 is trained on a diverse set of tasks and prompts. This leads to increased robustness to the prompt wording. Users can structure prompts in an interrogative or affirmative fashion, put instructions at the start or the end of the prompt, or format answer choices as part of a grammatical question or as a list; as long as the input prompt reads naturally, T0 can produce a meaningful response.

Capabilities

T0 outperforms or matches GPT-3, a model 16x larger with hundreds of billions of parameters, on a variety of unseen tasks. T0 not only generalizes well to unseen NLP tasks (such as sentence completion and coreference resolution) but also to tasks beyond “traditional” NLP (such as describing Python code and solving logic grid puzzles), without being explicitly trained on them.

In the rest of this blog post, we showcase two applications with T0++ (a variant of T0 trained with more NLP tasks): producing various cooking recommendations and answering questions about world knowledge. Feel free to test your own examples using the online T0++ inference widget!
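
If you prefer to run the model yourself rather than use the widget, a minimal inference sketch with the Hugging Face transformers library looks roughly like the following. We assume the checkpoint is published under the model ID "bigscience/T0pp" and that you have enough memory for the 11-billion-parameter model.

```python
# Minimal sketch of local T0++ inference with Hugging Face transformers.
# Assumes the checkpoint ID "bigscience/T0pp"; the 11B model needs a large GPU
# (or a lot of RAM and patience on CPU).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/T0pp")
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp")

prompt = "I want to make a banana cake. How to make it?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

All of the Query/Output pairs below were produced with prompts of exactly this free-form kind.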

Chef T0

T0++ can provide cooking instructions even though it was not trained on cooking recipe data (Bień et al., 2020). It can even generate high-level instructions for complex recipes.

Query: I want to make a banana cake. How to make it?

Output: mix 1 cup of flour, 1 cup of sugar, 1 egg, 1 tsp. baking soda, and 1 tsp. salt in a large bowl. Add 2 cups mashed bananas and mix. Pour into a greased and floured 9x13-inch baking
Query: How to cook tomato soup for a family of five?

Output: take a large pot and fill it with water. Add a pinch of salt and a bay leaf. Bring the water to a boil and add the tomatoes. Reduce the heat and simmer for about 30 minutes.
Query: Show me how to cook ratatouille.

Output: Using a food processor, pulse the zucchini, eggplant, bell pepper, onion, garlic, basil, and salt until finely chopped. Transfer to a large bowl. Add the tomatoes, olive oil, and red wine vinegar. Toss to coat.

What if you have a dish in mind, but you don’t know its ingredients? Don’t worry, T0++ has got your back!

Query: What ingredients do I need to make corn tortillas?

Output: you need cornmeal, water, and salt.
Query: Ingredients I need to make a ratatouille are

Output: a tomato, zucchini, eggplant, onion, bell pepper, and garlic

And the reverse works just as well.

Query: I have eggs, sugar, vanilla, salt, almonds and berries. What can I cook?

Output: a cake
Query: What can I cook when I have zucchini, eggplant, tomatoes, red onion, olive oil?

Output: ratatouille
Query: Tell me what to cook given that I have eggs and rice.

Output: fried rice

What about recommendations for national, regional and seasonal dishes? To T0++, they are all a piece of cake.

Query: What is the national dish of Austria?

Output: Wiener Schnitzel
Query: Give me a Mexican dish example.

Output: Mexican tamales
Query: What dish is South Korea famous for?

Output: kimchi
Query: I love Indian cuisine. What can I cook?

Output: dal
Query: I like French food. What can I have for breakfast?

Output: croissants
Query: I have some veggies in the fridge. What Mediterranean dish can I cook for lunch?

Output: couscous
Query: It is autumn in Europe. What seasonal dish can I cook?

Output: a pumpkin soup
Query: I want to drink soup. What soup should I drink during winter in Russia?

Output: Borscht
Query: It is summer in North America. What can I eat for dinner?

Output: grilled fish

T0++ is also helpful for eggs-ploration of similar dishes.

Query: What other French food should I try if I like pancakes? Give two examples.

Output: crepes and galettes
Query: Tell me one Korean variant of fried rice.

Output: kimchi fried rice
Query: I love fish and chips. What other similar dishes should I try (three examples)?

Output: fried fish, fried chicken, fried okra

The World According To T0

T0++ can also answer questions about world knowledge, covering topics such as human aging, religion, machine learning, and ethics. All the prompts in this section are adapted from the dataset of Hendrycks et al. (2021), which aims to measure the knowledge a language model acquires during pretraining. Note that we are not presenting a robust evaluation of T0++ here; we specifically chose questions without technical jargon to make this blog post more accessible.

Does T0++ understand human aging from both biological and social perspectives? Let’s find out!

Query: Continued stress is damaging to health and longevity because it leads to
A. Over exercising which causes physical damage
B. Suppression of the immune system
C. Less eating and, in many cases, severe eating disorders
D. Risky behaviors such as speeding and not using a seat belt

Output: B
Query: Prejudice or bias against older adults is referred to as

Output: ageism
Query: Which of the following is the most important variable for gerontological researchers?
A. Age
B. Gender
C. Cohort
D. Time of measurement

Output: A

How about some questions about religion?

Query: What does the word "Islam" mean in Arabic?

Output: submitting to God
Query: When was the current Dalai Lama born?

Output: 8 July 1935
Query: The Gospel of John is associated with which theological concept?
A. Incarnation
B. Logos of God
C. Jesus as messiah and saviour
D. All of the above

Output: D

One cannot help but wonder if T0++ is knowledgeable about its friends. How much does T0++ know about machine learning?

Query: Which of these losses encourage sparsity? L1 or L2 loss?

Output: L1
Query: As the number of training examples goes to infinity, what happens to the variance of your model?

Output: the variance decreases
Query: Which of the following sentences is FALSE regarding regression?
A. It relates inputs to outputs.
B. It is used for prediction.
C. It may be used for interpretation.
D. It discovers causal relationships

Output: D

While we do not intend to let T0++ run a business, it's interesting to test it with questions about business ethics practices.

Query: Bitcoins are becoming increasingly mainstream and have a whole host of associated ethical implications. What are two implications?

Output: the anonymity of the transactions and the lack of regulation
Query: When the true cost of an item is deliberately obscured by a company this is known as?
A. Predatory pricing
B. Deceptive pricing
C. Price fixing
D. Skimming

Output: B
Query: Affirmative action targets what kind of community?

Output: minority

Public Accessibility

You can try T0 directly in your browser or download it from the HuggingFace model repository. A smaller version (3 billion parameters instead of 11 billion parameters) is also available.
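
For lighter-weight experiments, the loading code from the inference sketch above should work unchanged for the smaller checkpoint; only the model ID changes (assumed here to be "bigscience/T0_3B").

```python
# Sketch of loading the 3-billion-parameter variant; the model ID
# "bigscience/T0_3B" is assumed, mirroring the larger "bigscience/T0pp".
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/T0_3B")
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B")
```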

Finally, in this GitHub repository, we provide scripts to perform inference with T0 on one or multiple GPUs, along with instructions for reproducing the training and evaluation reported in our paper (Sanh et al., 2021).

Conclusion

The ability to generalize to new tasks is a cornerstone of a general AI model. We are excited about T0 because it shows that a much smaller language model can match the generalization performance of models with hundreds of billions of parameters. We have shown how T0 can be applied to cooking recommendations and world knowledge questions, and we look forward to seeing more novel applications and further research on zero-shot learning.

Acknowledgments

We would like to acknowledge the co-authors of this blog post: Yong Zheng-Xin, Victor Sanh, and Steven Liu.

Thanks to those who provided ideas for applications of T0: Colin Raffel, Victor Sanh, Lintang Sutawika, Zaid Alyafeai, M Saiful Bari, Yong Zheng-Xin, and Albert Webson.

Thanks to those who contributed prompts and figures: Eliza Szczechla, Stella Biderman, and Colin Raffel.

Thanks to the prompt-engineering subgroup at BigScience for creating T0 and providing feedback on the blog post.

References

[1] BIG-bench collaboration. “Beyond the imitation game: Measuring and extrapolating the capabilities of language models.” In preparation, 2021.

[2] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. In Journal of Machine Learning Research, 2020.

[3] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. “Measuring Massive Multitask Language Understanding.” In Proceedings of the International Conference on Learning Representations (ICLR), 2021.

[4] Michał Bień, Michał Gilski, Martyna Maciejewska, Wojciech Taisner, Dawid Wisniewski, and Agnieszka Lawrynowicz. “RecipeNLG: A cooking recipes dataset for semi-structured text generation.” In Proceedings of the 13th International Conference on Natural Language Generation, 2020.

[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. “Language Models are Few-Shot Learners.” In Advances in Neural Information Processing Systems, 2020.

[6] Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Stella Biderman, Leo Gao, Tali Bers, Thomas Wolf, and Alexander M. Rush. “Multitask Prompted Training Enables Zero-Shot Task Generalization.” Preprint (arXiv:2110.08207), 2021.