BigScience Model Training Launched

The BigScience team is excited to announce that the training of the BigScience large language model has officially started. Uniquely, because BigScience is an open science initiative, you can follow along and see what it’s like to train a large language model, every step of the way!

By BigScience

The launch is the culmination of a full year of collaboration, brainstorms, experiments and discussions among more than a thousand researchers from all over the world participating in the BigScience project, leading to the final design of the model: a 176 billion parameter transformer that will be trained on more than 350 billion words in 46 languages.

The training is expected to last three to four months, but much can happen along the way: events, good or bad, from unexpected model behaviors to a node going down, will be shared live via @BigScienceLLM on Twitter, and you can see the training in action on TensorBoard. There will be an official launch party on Thursday, March 24th, as well as an AMA on Reddit’s r/machinelearning the same day, starting at 5pm CET.


Training curve for the BigScience large language model

The BigScience Large Language Model

Large language models are starting to have a huge impact on the world, but very few organizations have the capacity to train them. Given the potential impact of language model technology, it is important that the broader community has a good understanding of how these models are constructed, how they function and how they can be further improved. Up until now, much of this knowledge has been restricted to a handful of elite, resource-rich research groups and corporations that, for financial, legal or ethical reasons, have not been very open about the scientific details, and that, even when they published scientific papers, did not open source the actual research artifacts.

"Given the potential impact of language model technology, it is important that the broader community has a good understanding of how they are constructed, how they function and how they can be further improved."

BigScience is a collaborative open science initiative in which a large number of researchers from all over the world work together to train a large language model. Everything happens completely in the open, anyone can participate, and all research artifacts are shared with the entire research community. The initiative is designed as an interdisciplinary research workshop, gathering academic, industrial and independent researchers with a diversity of research interests: AI, NLP, social science, law, ethics and public policy. With generous support from GENCI and using the France-based Jean Zay computing cluster, BigScience is the first case in the history of AI where more than a thousand researchers have participated in the creation of a single model and dataset. The model whose training has just started is currently the only publicly known endeavor to create an open-source large language model at this scale.

What makes BigScience special?

BigScience is inspired by other open science initiatives where researchers have pooled their time and resources to collectively achieve a higher impact. Famously, megaprojects like the Large Hadron Collider or the Hubble Space Telescope have led to a wide variety of breakthroughs. The BigScience large language model, at a much smaller scale, aims to play a similar role in Artificial Intelligence research. It is by no means the first large language model, but there are some distinct features that set it apart from prior work:

  • Openness: Everything about the endeavor is as open as possible: discussions, working documents and research findings are shared, code is open source, and brainstorms around licensing take an “open first” approach. A final presentation workshop will be held at ACL in May 2022 (a premier conference in natural language processing), where the results will be discussed. Many research papers by hundreds of collaborators have already come out of the project, and we expect the stream to continue.
  • Multilinguality & Diversity: The large language model will be trained on geographically-diverse data from 46 languages. This is a specific design choice since most other models of this (very large) size are monolingual.
  • Accessibility: While the exact license of the model is still being drafted by the ethics and accessibility working group, the focus is on openness, and the intention is that the trained weights of the model should be accessible to researchers for experimentation. Work is also under way to make the model available to anyone via an easy-to-use API, for cases where researchers don’t have access to enough compute to run the model themselves.
  • Data governance: Throughout the BigScience initiative, we have taken special care around data and data governance, constructing the dataset with an inclusive approach and pushing our understanding of the legal and licensing status of ML data and derivative works. As a result, researchers will be able to examine all of the data sources that went into training the model, and to reuse a significant portion of them in future projects.
  • Collaborativeness: The language model will be the culmination of a collective effort of 30 different working groups, with roughly a thousand people signed up to be involved and hundreds of active participants.

What are the specs?

The language model will have 176 billion parameters and will consume more than 350 billion words during training, with responsibly curated data coming from a variety of sources in 46 different languages. The model size was chosen based on the compute available to us: 18 weeks of dedicated time on the cluster.
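
To give a feel for where a number like 176 billion comes from, here is a rough, back-of-the-envelope parameter count for a GPT-style decoder-only transformer. The layer count, hidden size and vocabulary size below are hypothetical values chosen to land near 176B; they are not the official architecture, which is described in the research paper linked further down.

```python
# Back-of-the-envelope parameter count for a GPT-style decoder-only transformer.
# The configuration values below are illustrative, not the official architecture.

def transformer_params(n_layers: int, d_model: int, vocab_size: int) -> int:
    # Each transformer block: ~4*d_model^2 for attention (Q, K, V, output projections)
    # plus ~8*d_model^2 for a feed-forward block with a 4*d_model inner dimension.
    per_layer = 12 * d_model ** 2
    # Token embedding matrix (assumed tied with the output projection).
    embeddings = vocab_size * d_model
    return n_layers * per_layer + embeddings

# One hypothetical configuration that lands near 176 billion parameters:
n = transformer_params(n_layers=70, d_model=14336, vocab_size=250_000)
print(f"~{n / 1e9:.0f}B parameters")  # ~176B
```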

Compute

The model will be trained on Jean Zay, the French government-funded supercomputer managed by GENCI and installed at IDRIS, the national computing center of the French National Center for Scientific Research (CNRS). Training will use 384 NVIDIA A100 GPUs with 80GB of memory each for a duration of several months, roughly 1.2 million GPU hours in total.
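
As a quick sanity check on how these figures fit together, the sketch below uses the common approximation of roughly 6 floating-point operations per parameter per training token. The peak throughput and, in particular, the utilization fraction are our own assumptions, not official project numbers.

```python
# Rough sanity check of the training budget (rates below are assumptions).
params = 176e9        # model parameters
tokens = 350e9        # training tokens (roughly the "words" mentioned above)
n_gpus = 384          # NVIDIA A100 80GB GPUs
peak_flops = 312e12   # A100 peak BF16 throughput in FLOP/s (dense)
utilization = 0.30    # assumed fraction of peak actually sustained

total_flops = 6 * params * tokens                            # ~3.7e23 FLOPs
seconds = total_flops / (n_gpus * peak_flops * utilization)  # wall-clock time
gpu_hours = seconds / 3600 * n_gpus

print(f"wall-clock: ~{seconds / 86400:.0f} days")  # ~120 days, i.e. 3-4 months
print(f"GPU-hours:  ~{gpu_hours / 1e6:.1f}M")      # ~1.1M, in line with ~1.2M above
```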

For more technical details on how the architecture was chosen, please refer to this research paper and follow the BigScience chronicles of Stas Bekman, the project’s lead engineer. We also wrote blog posts that delve deeper into the specifics:

Want to follow along?

What is truly unique about this effort is that, since BigScience is an open science initiative, anyone can follow along with the model training live, as it happens. Something may go terribly wrong (we hope not!), or we may find something unexpected while we’re training. You can stay up to date with the latest in a variety of ways:

For media inquiries: bigscience-contact@googlegroups.com.