Main Challenge #HackathonSomosNLP 2026: LLM and VLLM Alignment

How to participate in this challenge and help improve the cultural knowledge of language and vision-language models


🎯 Challenge objective

  • Choose one of the following options:
    • A. Align a language model (LLM) to generate text in a culturally appropriate way
    • B. Adapt a multimodal vision-language model (VLLM) to generate image descriptions that take cultural context into account
  • In Spanish, Portuguese, or any other language of the Iberian Peninsula or LATAM
  • Adapt an existing model (don’t pre-train one from scratch); we recommend starting from models around 7B, e.g. Salamandra, Mistral or Gemma (see the loading sketch after this list)
  • Generate the dataset with the help of 500 USD in Cohere API credits! We recommend filtering and extending the v0 preferences dataset generated collectively in the Arena: somosnlp-hackathon-2025/dataset-preferencias-dpo-v0
  • Train your model directly in JupyterLab on the Hugging Face Hub — we have GPUs sponsored by 🤗!
  • Upload the model(s) along with all the notebooks used to hf.co/somosnlp-hackathon-2026
  • Write the Model Card; include links to the dataset and the notebooks used (e.g. preprocessing, training)
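
As a quick-start reference, below is a minimal sketch of loading one of the suggested ~7B models together with the collective v0 preferences dataset. The Salamandra repo id and the split name are assumptions; check the Hub for the exact names of the model you pick.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Collective v0 preferences dataset mentioned above (the "train" split is an assumption)
prefs = load_dataset("somosnlp-hackathon-2025/dataset-preferencias-dpo-v0", split="train")
print(prefs[0])

# One of the suggested ~7B starting points; the repo id is illustrative,
# check hf.co for the model you actually choose (Salamandra, Mistral, Gemma, ...)
model_id = "BSC-LT/salamandra-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
```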

Guide

✅ Preparation

Requirements per team
  1. Contribute 100 quality prompts to the preferences dataset
  2. Answer 200 questions from the evaluation dataset (BLEND)
  3. Request the 500 USD Cohere API credits (after completing points 1 and 2, mention @mariagrandury in your team’s channel for instructions)
  4. Create a Space in the organization hf.co/somosnlp-hackathon-2026 with the JupyterLab template
  5. Complete the registration form

📚 Dataset

Data is the most important part of developing a model, and we will also give it extra weight when evaluating the projects 👀

  • Generate a dataset for your project:
    • Use the one generated collectively in the Arena as the initial version for your dataset: somosnlp-hackathon-2025/dataset-preferencias-dpo-v0
    • Take advantage of the 500 USD in Cohere API credits that each team receives to filter, improve, and extend it with more prompts and responses designed specifically for your use case (see the Cohere sketch after this list)
    • Keep in mind that, since this is about cultural topics, it’s very important that everything generated synthetically is reviewed by a person (you can use Argilla)
  • Upload the dataset to hf.co/somosnlp-hackathon-2026 and iterate (a push_to_hub sketch follows the naming conventions below)
  • Upload all the notebooks and scripts used to generate and process the dataset to the dataset repo
    • If you prefer to create a GitHub repo with all the code, you can — just don’t forget to include a link in the Dataset Card
  • Fill out the Dataset Card properly
    • “Dataset Card” is the name of the documentation for Hugging Face datasets — it’s the README.md of the dataset repository
    • NOTE: This is taken into account when evaluating the project
    • Include the project motivation and impact in the introduction
    • Detail the generation and processing pipeline: list the libraries used, mention the tests performed, and include links to the code
    • Specify the license: preferably apache-2.0; if not, explain why
    • Evaluate the dataset’s biases, whether it’s balanced, what language varieties or opinions it represents, etc.
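
To make the Cohere step concrete, here is a minimal sketch that generates one extra candidate response per prompt. It assumes the v2 Python SDK (`pip install cohere`), a `COHERE_API_KEY` environment variable, and a `prompt` column in the dataset; the Cohere model name is illustrative.

```python
import os

import cohere
from datasets import load_dataset

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

# Start from the collective v0 dataset (the "prompt" column name is an assumption)
prefs = load_dataset("somosnlp-hackathon-2025/dataset-preferencias-dpo-v0", split="train")

def add_candidate(example):
    # Generate an extra candidate response for each prompt (model name is illustrative)
    response = co.chat(
        model="command-r-plus-08-2024",
        messages=[{"role": "user", "content": example["prompt"]}],
    )
    example["candidate"] = response.message.content[0].text
    return example

extended = prefs.map(add_candidate)
```

Remember that every synthetic response should still go through human review (e.g. in Argilla) before it enters the final dataset.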

How to name datasets:

  • The name of the dataset with the (minimum 100) prompts you submitted to the LLM Arena must contain prompt. For example: normas_culturales_colombia_prompts
  • The names of preference datasets must contain the name of the main algorithm they can be used for (dpo or kto). For example: normas_culturales_colombia_dpo
  • If the dataset is multimodal, it must contain image. For example: utensilios_ecuador_images_kto
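
Once the data is processed, pushing it to the organization under a convention-compliant name is a one-liner. A sketch, assuming you are logged in with `huggingface-cli login`; the local file name is illustrative and the repo name reuses the example above:

```python
from datasets import load_dataset

# Load your processed preference data (local JSONL file name is illustrative)
dataset = load_dataset("json", data_files="normas_culturales_colombia_dpo.jsonl")

# The repo name follows the convention above: a DPO preference dataset contains "dpo"
dataset.push_to_hub("somosnlp-hackathon-2026/normas_culturales_colombia_dpo")
```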

⚙️ Model

  1. Create a Space in the organization hf.co/somosnlp-hackathon-2026 with the JupyterLab template
  2. The Hugging Face team will assign an L40S grant to the Space
    • Set the “auto-sleep” time to 5 minutes to ensure responsible use 🌱
  3. Design the training notebook (a DPO training sketch follows this list)
    • Save the resulting model directly to hf.co/somosnlp-hackathon-2026
    • Use the CodeCarbon library to assess the climate impact
  4. Run tests with small models and dataset subsets to verify the code is correct, so you don’t run into bugs after several hours of training.
  5. Launch the training, review the results and iterate
    • You can try e.g. different algorithms or base models
    • You don’t need to create a different repo for each model — if you push to the same repo, the updated model will be saved as a new commit (which you can link to from the Model Card if you want)
  6. Download the dataset processing and model training notebooks, upload them to the model repo (VERY IMPORTANT) and delete the JupyterLab Space
  7. Fill out the Model Card properly
    • “Model Card” is the name of the documentation for Hugging Face models — it’s the README.md of the model repository
    • NOTE: This is taken into account when evaluating the project
    • Recommendation: describe the tests as you do them, as well as the dataset improvement and model training process
    • Include the project motivation and impact in the introduction
    • Detail the training process: list the libraries used, mention the tests performed, and include links to the code
    • Specify the license: preferably apache-2.0; if not, explain why
    • Evaluate the model’s biases
    • Evaluate the environmental impact
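
To make step 3 concrete, here is a minimal DPO training sketch with 🤗 TRL. It assumes a preference dataset with the standard prompt/chosen/rejected columns; the model and dataset ids are illustrative. If CodeCarbon is installed, the underlying 🤗 Trainer can also log emissions automatically (see the Climate impact resources below).

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Illustrative ids: swap in your base model and your team's preference dataset
model_id = "BSC-LT/salamandra-7b-instruct"
dataset_id = "somosnlp-hackathon-2026/normas_culturales_colombia_dpo"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
train_dataset = load_dataset(dataset_id, split="train")

# push_to_hub=True saves checkpoints straight to the org repo as new commits (step 5)
args = DPOConfig(
    output_dir="normas-culturales-colombia-dpo",
    hub_model_id="somosnlp-hackathon-2026/normas-culturales-colombia-dpo",
    push_to_hub=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions use `tokenizer=` instead
)
trainer.train()
trainer.push_to_hub()  # final model to hf.co/somosnlp-hackathon-2026
```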

Resources

Below we share plenty of resources so you can develop high-quality projects. Resources marked with ⭐ correspond to talks and workshops given during the hackathon and specifically designed to help you in this edition.

📚 Dataset

The Cohere API:

Dataset creation:

Inspiration:

⚙️ Model

Creating the training Space:

  • Docs: JupyterLab on Spaces, where you can run your notebooks as usual. Be careful: Space storage is not persistent by default, so your files can be lost when the Space restarts. Save your notebooks! (A sketch for backing them up to the Hub follows.)
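
Because the Space’s disk is the only copy of your notebooks, it is worth pushing them to the Hub regularly. A minimal sketch with `huggingface_hub`; the repo id and file names are illustrative:

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes you are logged in (huggingface-cli login)
api.upload_file(
    path_or_fileobj="train_dpo.ipynb",         # notebook on the Space's disk
    path_in_repo="notebooks/train_dpo.ipynb",  # destination inside the repo
    repo_id="somosnlp-hackathon-2026/normas-culturales-colombia-dpo",
    repo_type="model",
)
```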

LLM Alignment:

Multimodal models:

LLM Fine-tuning:

Climate impact:

  • To evaluate the carbon footprint of your model training, you can use tools like CodeCarbon (recommended, since it is integrated into 🤗 Transformers) or ML CO2 Impact.
  • We recommend this video for motivation, this article from the HF blog, and the documentation section of 🤗 Transformers that covers this topic.
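
If you are not relying on the 🤗 Trainer integration, CodeCarbon can also be used standalone as a context manager; a minimal sketch (the project name is illustrative):

```python
from codecarbon import EmissionsTracker

# Everything inside the block is tracked; results are also written to emissions.csv
with EmissionsTracker(project_name="hackathon-dpo-training") as tracker:
    ...  # your training here, e.g. trainer.train()

print(f"Estimated emissions: {tracker.final_emissions} kg CO2eq")
```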

📝 Documentation
