Building a TB Scale Multilingual Dataset for Language Modeling

Goals, Philosophy, Challenges, and Approaches

The BigScience workshop is excited to announce that the training of the BigScience language model has officially started. After one year of experiments, discussions, and development to lead up to this, with more than 1000 collaborators worldwide, the model will have 176B parameters trained on data from 46 languages. This blog post provides an overview of the colossal efforts to develop the model’s training corpus, a 350 billion token (1.5 TB of text data) multilingual dataset.

Many of the tools, research, and artifacts mentioned in this document will be the object of future releases; we encourage you to follow @BigScienceW on Twitter for regular updates and @BigScienceLLM on Twitter for specific training updates.

Context: the many Faces of Data Work

From the start of the project, the data work within BigScience has been driven by two main categories of considerations:

Intrinsic data values: beyond its use in ML, language data exists at the intersection of data subjects, creators, owners, different legal regimes, and many other rights holders and stakeholders.
Extrinsic model requirements: the language data we gather is needed to train an extensive multilingual language model whose behavior and capabilities are guided by the BigScience project’s goals and values.

Approaching data curation, collection, and management more responsibly to meet these high-level categories of requirements took a significant amount of work from several working groups across the overall project:

The Data Governance working group helped define the specific values that should guide data work and proposed a new structure for international data governance with some supporting technical and legal tools.
The Data Sourcing working group organized hackathons worldwide to help participants with local expertise build up a catalog of 246 language resources and prepare a list of 605 relevant websites.
The Privacy working group worked on taxonomies and strategies for mitigating privacy risks.
The Legal Scholarship group developed a legal playbook covering nine jurisdictions with different privacy and data protection laws to help ML practitioners understand the legal context of their work.

Additionally to the diversity of aspects of the data work (taken into account for responsible data management), the challenge of combining these two aspects without sacrificing one for the other is also exacerbated by the staggering amount of data required to take advantage of the proposed model size fully.

Given the targeted model size and amount of computation the Jean Zay computer provided, this amount was estimated to be upward of 350 billion tokens. For a sense of scale: if we were to print it on A4 paper and stack it up, we’d get about 141 Eiffel towers, 5 Mount Everests, or 1/1000th of the distance to the Moon (or 1.7 times the circumference of the Large Hadron Collider)!

This unprecedented scale makes it nearly impossible to fully evaluate the impact of any automatic curation choices on the whole corpus. It is also challenging to gain good insights by manually inspecting data samples. To address these challenges and to foster intelligibility and accountability of the curation process, we then prioritized the following approaches in our work:

Tools to support human decisions at scale over full automation. A significant part of our curation effort went into developing tools to help find a middle ground between these two paradigms, allowing participants to better collaborate on curation decisions at the right granularity level to combine different data quality notions.
Fewer languages, more language expertise.Using the tools described above properly still requires an understanding of the contexts we want to apply them to and having both a dedicated effort and fluent speakers for each of the considered languages in the project. As a result, we decided to focus our effort on languages and language groups to which we could commit sufficient resources (as pictured above).

The remainder of this blog post specifically describes the efforts of our data preparation task force between December 2021 and the start of the training on March 11th, 2022 to implement these strategies, leverage all of thise scholarship outlined above, and identify and address additional concerns to build the actual training corpus.

Making the Language Modeling Dataset

Starting from January 2022, a dedicated sprint with +15 researchers was initiated to retrieve the data, including the ones sourced by the working groups, process it, and build the training dataset. The data came from three sources:

The Data Sourcing Catalog included many primary data sources and existing NLP datasets participants wanted to have in our training corpus.
Additional targeted websites identified by members of the Data Sourcing group as representative of a diversity of geographical language varieties, obtained through a pseudo crawl (i.e., by finding their data in an existing web crawl).
We also extracted data in our target languages from the OSCAR v2 web crawl dataset based on several language-specific data quality measures.

For all three components,

We first retrieved the source data,
Then we applied various pre-processing steps based on the data kind and some visualization and indicators of data quality,
Finally, we extracted the pre-processed text that would go to training the model as ready-to-load datasets in the Hugging Face Hub before the training started.

Filters, tools, and indicators of data quality

Our filtering process's goal and main selection criterion were to expose the model to text humans wrote for humans. To guide our curation effort to that end, we found it helpful to compute aggregated statistics for each data source, such as:

Documents with a very high ratio of special characters (mostly punctuation) were found to often correspond to failures of the retrieval or upstream pre-processing system (e.g., when ill-formed HTML code leaks through in a text extraction pipeline)
We expect specific categories of common tokenswords (known as closed class words) to appear in a specific frequency in any natural language document. Documents with a low common tokenclosed class word ratio tended to correspond to tables, lists, or other documents with unexpected formatting. The relevant lists of common tokenslosed class words were initialized from part-of-speech data and validated by fluent language speakers.
We also found the language identification confidence score for high-resource languages to be a useful general indicator of whether the text corresponds to natural language, with low scores usually reaching pre-processing or retrieval failures.
Other similar indicators included word repetition statistics, sequence length, perplexity, and flagged words in various languages.

At a minimum, the indicators above helped surface data sources that warranted additional attention and pre-processing before using them. In some cases, we also used some of these metrics to automatically filter data items in a data source after manually examining the impact of the filtering. The following sections describe these processes in more detail.

Preparing the Catalog and Pseudo-Crawl Sources

We treated data sources from both the Catalog and Pseudo-Crawl similarly as they had all been manually selected by participants, with a combination of manual and automatic curation:

We retrieved the Data Catalog entries during a collaborative event in December, where the participants’ main goal was to obtain the source data to be processed later. The process for retrieving the pseudo-crawl data was a bit more involved. We queried an index over two years’ worth of CommonCrawl dumps to find pages from the target websites, then extracted the corresponding file positions in a CommonCrawl WARC dump and ran a text extraction pipeline. We then applied the following steps:

We performed document-level and page-level deduplication in the catalog and pseudo-crawl, respectively, to remove repeated instances of the same item in a single source. We also used information from the page URLs in the pseudo-crawl to further deduplicate the data.
We examined data sources that triggered the indicators described above to see which ones may need further processing and devised heuristics that made the text more human-readable. One such heuristic applied to all pseudo-crawl websites, and some catalog sources consisted of removing lines that occurred in too many documents across a source. We found that those often corresponded to menu items or boilerplate content.
To favor learning from more extended contexts, we filtered out catalog items that were too short for higher-resource languages. We will concatenate more concise disjoint examples for lower-resource languages when they occur.
Finally, we automatically filtered out pages from the pseudo crawl data based on the values of some of the indicators described above: length, character repetitions, language ID confidence, and closed class word ratios.

OSCAR Filtering and Processing

After processing the first two categories of data, we obtained a little over 850GB of processed text out of the 1.5TB we estimated the model could learn from, given our proposed model size and computational budget. Since other efforts of a similar scale have drawn most or all of their training data from web crawls, using a similar approach to make up the remaining part of the corpus struck us as both a good way to make results easier to compare, and hopefully to expose the model to text and document structures that were still not covered after our extensive cataloging effort. We also wanted to ensure that our web crawl curation process would learn from previous successes and failures (and ensuing criticisms). We describe our efforts to that end next.

The high-level approach was the following:

Obtain a full version of the OSCARv2 dataset.
Extract the data corresponding to 13 languages for which we were reasonably confident in the precision of the language identification system (Arabic, Basque, Catalan, Chinese, English, French, Indonesian, Portuguese, Spanish, Vietnamese, Bengali, Hindi, Urdu).
Establish a clear objective for the outcome of a page-level filtering process: we wanted to keep pages where most of the content was written by humans for humans (removing menu pages, pages with mostly code, “spam” or bot-generated pages, etc.)
Define a set of page-level metrics that could help us identify pages that fell outside of this curation goal. Some of these metrics, such as character n-gram repetition or particular token incidence, were computed the same way across languages. Others, such as closed class word ratios or flagged words, had language-specific parameters provided by fluent speakers.
We created a tool to visualize the distribution of these metric values on samples of OSCAR pages in each of the languages and to check their values on specific inputs. This tool was used to help native language speakers set thresholds for each of these metrics that would allow us to filter the OSCAR dataset down to our target (human-readable text written by humans) while minimizing false positives of the exclusion process (i.e., asking participants to prioritize precision over recall and checking the behavior on high-stake suspected false positives to minimize disparate impacts).
After applying the filtering described above, we ran a deduplication step to remove repeated pages.
Finally, we identified some categories of Personal Information (also known as Privately Identifying Information, PII) which we could automatically find in text with good precision (including user IDs, emails, or IP addresses), and redacted them from the text (by replacing them with special tokens such as PI: EMAIL).

The outcome of this approach consisted of language-specific OSCAR data ranging from 1 to 150GB for most languages and over 1.1TB for English. We added all of the former data to the training corpus and randomly sub-sampled 100GB of the English portion. In the end, OSCAR data made up about 40% of the final training corpus.

Concluding Remarks and Next Steps

In this post, we have described our efforts to build a large multilingual corpus guided by the following goals and values:

Target diversity of content types and geographical diversity of sources
Prioritize human input and expertise in the curation process to foster data quality and accountability of the curation decisions
Build a corpus of a size that will allow us to make full use of our computation budget
Select, process, and document data sources to enable better data governance.

Dataset Release and Documentation Plans

The efforts of the past months were chiefly focused on preparing the aggregated training corpus in time for the training run while ensuring that the process was documented well enough to enable us later to provide complete documentation and partial release of the individual data sources. The next few months will be focused on leveraging this work to incrementally release either an extensive visualization or the full content of the sources according to the recommendations of our Data Governance work:

Most of the data sources are deemed ready to be fully released. We will work on finalizing their processing and documentation throughout the model training and release them under the care of BigScience organization of the Hugging Face hub.
Some of the data sources were entrusted to us for the explicit purpose of training the BigScience model but needed further efforts to be made more publicly available. We will keep working on the governance structure we proposed to make progress towards their release. In particular, we are looking for organizations who want to act as data hosts for these sources; reach out if your organization would be a good candidate!
A few of the sources we were able to use for training cannot be fully released, either for privacy or for licensing reasons. For these, we are working on finalizing documentation and proposing interactive visualizations for researchers who want to interrogate the full training corpus.

Stay tuned for updates on each of these aspects in the coming months!

Annexe

List of 46 languages

Niger-Congo languages:

Chi Tumbuka (0.00002%)
Kikuyu (0.00004%)
Bambara (0.00004%)
Akan (0.00007%)
Xitsonga (0.00007%)
Sesotho (0.00007%)
Chi Chewa (0.0001%)
Twi (0.0001%)
Setswana (0.0002%)
Lingala (0.0002%)
Northern Sotho (0.0002%)
Fon (0.0002%)
Kirundi (0.0003%)
Wolof (0.0004%)
Luganda (0.0004%)
Chi Shona (0.001%)
Isi Zulu (0.001%)
Igbo (0.001%)
Xhosa (0.001%)
Kinyarwanda (0.003%)
Yoruba (0.006%)
Swahili (0.02%)

Indic languages:

Assamese (0.01%)
Odia (0.04%)
Gujarati (0.04%)
Marathi (0.05%)
Punjabi (0.05%)
Kannada (0.06%)
Nepali (0.07%)
Telugu (0.09%)
Malayalam (0.1%)
Urdu (0.1%)
Tamil (0.2%)
Bengali (0.5%)
Hindi (0.7%)

Other languages:

Basque (0.2%)
Indonesian (1.1%)
Catalan (1.1%)
Vietnamese (2.5%)
Arabic (3.3%)
Portuguese (5%)
Spanish (10.7%)
Code (13%)
French (13.1%)
Chinese (17.7%)
English (30.3%)