Scaling Up Models and Data with t5x and seqio

Adam Roberts, Hyung Won Chung, Gaurav Mishra, Anselm Levskaya, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis Bulian, Xavier Garcia, Jianmo Ni, Andrew Chen, Kathleen Kenealy, Kehang Han, Michelle Casbon, Jonathan H. Clark, Stephan Lee, Dan Garrette, James Lee-Thorp, Colin Raffel, Noam Shazeer, Marvin Ritter, Maarten Bosma, Alexandre Passos, Jeremy Maitin-Shepard, Noah Fiedel, Mark Omernick, Brennan Saeta, Ryan Sepassi, Alexander Spiridonov, Joshua Newlan, Andrea Gesmundo; 24(377):1−8, 2023.

Abstract

Scaling up training datasets and model parameters has benefited neural network-based language models, but it also presents challenges such as distributed computation, input data bottlenecks, and reproducibility of results. We introduce two simple and scalable software libraries that address these issues: t5x enables training large language models at scale, while seqio enables reproducible input and evaluation pipelines. These open-source libraries have been used to train models with hundreds of billions of parameters on multi-terabyte datasets. Configurations and instructions for T5-like and GPT-like models are also provided. The libraries can be found at https://github.com/google-research/t5x and https://github.com/google/seqio.
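The abstract itself contains no code; as a rough illustration of the kind of reproducible input pipeline seqio supports, the sketch below registers a tokenized text-to-text task using the publicly documented seqio API. The task name, TFDS dataset, vocabulary path, and field names are hypothetical placeholders, not configurations from the paper.

# Minimal sketch (hypothetical names): register a seqio Task backed by a TFDS
# dataset, tokenize it with a SentencePiece vocabulary, and build a split
# deterministically with fixed sequence lengths and a seed.
import functools
import seqio

# Hypothetical vocabulary path.
vocab = seqio.SentencePieceVocabulary("/path/to/sentencepiece.model")

seqio.TaskRegistry.add(
    "my_text_to_text_task",  # hypothetical task name
    source=seqio.TfdsDataSource(tfds_name="my_dataset/config:1.0.0"),  # hypothetical TFDS name
    preprocessors=[
        # Map raw dataset fields to the "inputs"/"targets" convention
        # (field names here are assumptions about the hypothetical dataset).
        functools.partial(
            seqio.preprocessors.rekey,
            key_map={"inputs": "source_text", "targets": "target_text"},
        ),
        seqio.preprocessors.tokenize,
        seqio.preprocessors.append_eos,
    ],
    output_features={
        "inputs": seqio.Feature(vocabulary=vocab),
        "targets": seqio.Feature(vocabulary=vocab),
    },
)

# Reproducible dataset construction: the same task name, sequence lengths,
# and seed yield the same example stream.
ds = seqio.get_mixture_or_task("my_text_to_text_task").get_dataset(
    sequence_length={"inputs": 512, "targets": 128},
    split="train",
    shuffle=True,
    seed=42,
)

In practice a task registered this way would be referenced by name from a t5x training configuration; see the repositories linked above for the documented end-to-end setup.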

© JMLR 2023.