Donate your corpus

We are going to standardize how LLMs are evaluated in our languages, and we need your help.


The #Somos600M initiative has two ambitious goals:

  1. 🌎 Create a high-quality and diverse instruction corpus that represents a wide variety of countries, registers, and topics.
  2. ✅ Create a public LLM leaderboard that allows us to standardize how to evaluate and compare different models in Spanish and co-official languages.

Whether you have a wonderful corpus or a pile of documents, there is surely a way for you to collaborate!

How can I collaborate?
  • If you don’t know what a “corpus” is but you have large amounts of documents that you would like to publish so that AI systems can express themselves better in your language and work better for your daily tasks, contact us!
  • If you have a set of documents that you would like to use to extract information or automate your daily tasks, sign up for the hackathon!
  • If you have a training corpus that you would like to donate so that the next generation of LLMs in your language works better for your use cases, keep reading!
  • If you have an evaluation corpus created by specialists and you want to participate in creating the first public LLM leaderboard in Spanish, keep reading!

We just need you to share the corpus information with us — we’ll take care of everything else!

For any questions, send us an email at info@somosnlp.org or contact us on Discord — we’re waiting for you!

📚 Donate your corpus!

💡 Motivation and frequently asked questions

We understand that corpora are veeeery precious. Why donate them?

Training corpus

Your contribution is key to creating a public, diverse, and high-quality instruction corpus that will serve as a benchmark in the field.

  1. The open-source LLMs trained by the community will achieve better results for your use cases. The base model on which you build your adaptations will be of higher quality!

  2. When you donate a corpus, its citation will be included in the public table of donated corpora, and your organization will become a sponsor of the #Somos600M Hackathon. Read on to learn about all the benefits this entails!

Evaluation corpus

Your contribution is key to creating a public and unified leaderboard that will serve as a benchmark in the field.

  1. By donating, you have the unique opportunity to shape the future of LLM evaluation in Spanish and co-official languages, establishing new quality and performance standards.

  2. It will let you show the entire community, with greater reliability, how your models compare with others on the market, since the results will be published by an impartial entity.

  3. By choosing to donate only the evaluation portion, you maintain your competitive advantage by keeping the training portion private. Publishing your results on the leaderboard does not mean the community has access to your models.

  4. When you donate a corpus, its corresponding citation will be included in the leaderboard citation, and your organization will become a sponsor of the #Somos600M Hackathon — read on to learn about all the benefits this entails!

Your donation not only contributes to scientific advancement but also strengthens your position as a leader in Natural Language Processing innovation in your language 💪

📸 Visibility for corpus sponsorships

We will create a public table with all donated corpora that will include, in addition to basic corpus information, the institution that created it, how to cite it, and a link to your documentation where you can include all the additional information you want.

Training corpus

We will encourage the teams participating in the hackathon to use your corpus in their projects, which will give it visibility and promote its use in projects with social impact 💛

Evaluation corpus

Just like in the Open LLM Leaderboard, your corpus citation will be included in the leaderboard citation. Additionally, the corpus will be cited in the article we publish describing the leaderboard creation process 📝

Extra visibility for all corpora
  • Logo on the hackathon webpage and registration page: size L
  • Logo on the “Community” page: first category
  • Acknowledgment in the “Community” section for the people who created the corpus
  • Social media acknowledgment: to the entity and particularly to the people who created the corpus
  • Tags in posts: minimum 10
  • Promotional blog article about the corpus creation
  • Promotional talk (max. 45 min) about the corpus creation
  • Promotional video (3 min) about the company or research group
  • Mention in an article describing the hackathon sponsorships
  • Live mention at the hackathon opening and closing ceremonies

All benefits are optional — choose the ones you like best. If you have other proposals, we’d be happy to hear them.

✅ Corpus requirements

Corpora for all tasks are welcome, including both natural language understanding (NLU) and natural language generation (NLG), as well as instruction corpora of all kinds. Corpora of all modalities (text, audio, and images with descriptions) are also accepted.

  • They must be high-quality corpora created by specialists. If it is a health-related corpus, the participation of people with healthcare training in its creation is essential.
  • We give priority to corpora originally created in the corresponding language (vs. translations). Translations are also accepted if a subsequent validation process is ensured.
  • Regarding evaluation corpora, since they will be used to evaluate and compare models, a clear and low-variability evaluation method must be provided with the corresponding corpus.
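
As a rough illustration of what a "clear and low-variability evaluation method" could look like, the sketch below scores predictions with a deterministic exact-match metric and reports how much the score moves across repeated runs. It is only an example under stated assumptions: the function names `exact_match` and `score_spread`, the normalization rule, and the toy data are hypothetical and are not part of the #Somos600M requirements.

```python
from statistics import mean, pstdev


def exact_match(predictions, references):
    """Fraction of predictions equal to their reference after light normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty evaluation set")

    def normalize(text):
        # Assumed normalization: lowercase and collapse whitespace.
        return " ".join(text.lower().split())

    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def score_spread(run_scores):
    """Mean and population standard deviation of scores from repeated runs."""
    return mean(run_scores), pstdev(run_scores)


# Toy data (hypothetical): the same three questions answered in three runs.
references = ["Madrid", "Lima", "Bogotá"]
runs = [
    ["madrid", "Lima", "Quito"],
    ["Madrid", "lima ", "Bogotá"],
    ["Madrid", "Lima", "Quito"],
]

scores = [exact_match(run, references) for run in runs]
avg, spread = score_spread(scores)
print(f"mean={avg:.3f} std={spread:.3f}")
```

A small standard deviation across runs suggests the method gives stable rankings. A large one suggests that the metric or the generation settings should be pinned down more tightly before the corpus is used to compare models.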

🙌 Acknowledgments

Corpus collection campaign organized with the support of:

  • Instituto de Ingeniería del Conocimiento
  • Sociedad Española para el Procesamiento del Lenguaje Natural
  • LenguajeNatural.AI

Collection of donated corpora