What makes BigScience special?
BigScience is inspired by other open science initiatives where researchers have pooled their time and resources to collectively achieve a higher impact. Famously, megaprojects like the Large Hadron Collider or the Hubble Space Telescope have led to a wide variety of breakthroughs. The BigScience large language model, at a much smaller scale, aims to play a similar role in Artificial Intelligence research. It is by no means the first large language model, but there are some distinct features that set it apart from prior work:
- Openness: Every aspect of the endeavor is as open as possible: discussions and research findings are shared in working documents, the code is open source, and brainstorms around licensing take an “open first” approach. A final presentation workshop will be held at ACL in May 2022 (a premier conference in natural language processing), where the results will be discussed. Many research papers by hundreds of collaborators have already come out of the project, and we expect the stream to continue.
- Multilinguality & Diversity: The large language model will be trained on geographically-diverse data from 46 languages. This is a specific design choice since most other models of this (very large) size are monolingual.
- Accessibility: While the exact license of the model is still being drafted by the ethics and accessibility working group, the focus is on openness: the intention is that the trained weights of the model should be accessible to researchers for experimentation. Arrangements are also being made for the model itself to be available to anyone via an easy-to-use API, for cases where researchers don’t have access to enough compute to run the model themselves.
- Data governance: Throughout the BigScience initiative, we have taken special care around data and data governance, constructing the dataset with an inclusive approach and pushing our understanding of the legal and licensing status of ML data and derivative works. As a result, researchers will be able to examine all of the data sources that went into training the model, and to reuse a significant portion of them in future projects.
- Collaborativeness: The language model will be the culmination of a collective effort of 30 different working groups, with roughly a thousand people signed up to be involved and hundreds of active participants.
What are the specs?
The language model will have 176 billion parameters and will consume more than 350 billion words during training, with data curated in a responsible way, coming from a variety of sources and covering 46 different languages. The model size was chosen based on the compute available: we had 18 weeks of compute on the cluster.
The model will be trained on Jean Zay, the French government-funded supercomputer that is managed by GENCI and installed at IDRIS, the national computing center of the French National Center for Scientific Research (CNRS). It will use 384 Nvidia A100 GPUs with 80GB of memory each, for a duration of several months (roughly 1.2 million GPU hours).
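To see how those numbers fit together, here is a minimal back-of-the-envelope sketch in Python. It relies on the widely used approximation that training a dense transformer costs roughly 6 × parameters × tokens FLOPs; the sustained per-GPU throughput is an assumed value for illustration, not a figure published by the project:

```python
# Back-of-the-envelope sizing check (a sketch, not the project's actual sizing procedure).
# Assumption: training a dense transformer costs roughly C ≈ 6 * N * D FLOPs,
# where N is the parameter count and D the number of training tokens.

GPUS = 384                 # Nvidia A100 80GB GPUs on Jean Zay
WEEKS = 18                 # allocated compute window
SUSTAINED_TFLOPS = 100     # assumed sustained throughput per GPU (hypothetical value)

gpu_hours = GPUS * WEEKS * 7 * 24                      # ≈ 1.16 million GPU hours
flops_budget = gpu_hours * 3600 * SUSTAINED_TFLOPS * 1e12

N = 176e9                  # parameters
D = 350e9                  # training tokens
flops_needed = 6 * N * D   # ≈ 3.7e23 FLOPs

print(f"GPU hours available:   {gpu_hours:,.0f}")
print(f"FLOPs budget:          {flops_budget:.2e}")
print(f"FLOPs for 176B / 350B: {flops_needed:.2e}")
```

Under that assumed throughput, the 18-week allocation on 384 GPUs is enough to push roughly 350 billion tokens through a 176-billion-parameter model, which is consistent with the specs above.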
For more technical details on how the architecture was chosen, please refer to this research paper and follow the BigScience chronicles of Stas Bekman, lead engineer of the project. We also wrote in-depth blog posts that delve into the specifics: