
Paper Insights: FLEXOLMO: Open Language Models for Flexible Data Use

I wanted to read this paper after I saw the GPU Mode video where Professor Sewon Min, one of the main authors, presented it in great detail and in a very easy-to-understand way. This article is mostly my notes from that video, since I relied more on the video than on the paper itself.

In standard LM training:

  • There is centralized access to all data during training.
  • Data access is binary: available or unavailable.

Unfortunately, this is not always practical. Data availability lies on a spectrum:

  • Model developers may not have direct access to the data.
  • Data may be stored in a very specific location and thus cannot be transferred, but can be read. I once talked to a Staff Research Scientist at Google, and he mentioned that when he was building models for Google’s proprietary data, he didn’t have direct access to the data; he could only build a model and use a pre-made dataloader.

Goal and Big Idea:

The goal of this paper was to create a model that supports modular distributed training.

Each independent data owner/source trains a copy of the shared model on their own data asynchronously. The shared model is then combined with each party’s trained model to create one combined model. Thus, the parties don’t need to share raw data or synchronize training. This also makes it easy to add or remove data with no further training, by adding or removing the “expert” in the MoE that was trained on that data.

Option 1: Model Merging

A set of n models is trained independently and then merged. Each data owner trains their own model, and the models are combined with merging techniques such as model soups (a weighted average of the weights) or ensembling, where you take the probability distribution from each model and perform a weighted average of the outputs.
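To make the two techniques concrete, here is a minimal PyTorch-style sketch, assuming all models share the same architecture; the function names are mine, not from the paper:

```python
# Minimal sketch of the two merging strategies. Illustrative only.
import torch

def model_soup(state_dicts, coeffs):
    """Model soup: a weighted average of the weights of independently trained models."""
    return {key: sum(c * sd[key] for c, sd in zip(coeffs, state_dicts))
            for key in state_dicts[0]}

def ensemble(models, input_ids, coeffs):
    """Ensembling: keep all models, run each one, and take a weighted average
    of their output probability distributions."""
    probs = [torch.softmax(m(input_ids), dim=-1) for m in models]
    return sum(c * p for c, p in zip(coeffs, probs))
```

The trade-off is the usual one: a soup gives you a single model to serve, while ensembling keeps every model around at inference time.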

Option 2: MoE Merging

Each data source trains their own model. Then the FFN from every data source is moved into an MoE. Basically, there is one common model that is further pretrained by each organization on their own data. Each pretrained model contributes an “expert,” and all experts share a common router, normalization, and multi-head attention.

Issue: Even after merging, joint training is still needed, because the router has to be exposed to the data. Thus, the data can’t stay local if you want to train the router matrix. If you train the router using only public data, it will simply be biased toward the public-data expert.
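Here is a rough sketch of what such a merged MoE layer looks like, assuming each organization contributes one FFN module; the class and argument names are my own, not the paper’s:

```python
import torch
import torch.nn as nn

class MergedMoELayer(nn.Module):
    """One transformer block after MoE merging: attention and normalization are
    shared, and each organization's FFN becomes one expert behind a single router."""
    def __init__(self, shared_attention, expert_ffns, hidden_dim):
        super().__init__()
        self.attention = shared_attention           # shared across all experts
        self.experts = nn.ModuleList(expert_ffns)   # one FFN per data owner
        # This router matrix is the problematic part: training it normally
        # requires seeing all of the (private) data jointly.
        self.router = nn.Linear(hidden_dim, len(expert_ffns), bias=False)

    def forward(self, x):
        h = self.attention(x)
        weights = torch.softmax(self.router(h), dim=-1)               # (..., n_experts)
        outputs = torch.stack([e(h) for e in self.experts], dim=-1)   # (..., d, n_experts)
        return (outputs * weights.unsqueeze(-2)).sum(dim=-1)          # dense routing, for simplicity
```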

Solution 1: Learning to Coordinate (MoE-aware Training)

For each organization, instead of doing plain continued pretraining, they turn the given public model into an MoE with 2 experts (one for the public data and one for their own proprietary data). The public-data expert and the shared layers are frozen. Then all organizations’ MoEs are merged.
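A minimal sketch of what each organization would do locally, assuming the public FFN is a plain nn.Module (the helper name is mine, not the paper’s):

```python
import copy
import torch.nn as nn

def make_two_expert_moe(public_ffn: nn.Module, hidden_dim: int) -> nn.ModuleDict:
    """Turn the public model's FFN into a 2-expert MoE for local training."""
    # Expert 0: a frozen copy of the public-data expert.
    public_expert = copy.deepcopy(public_ffn)
    for p in public_expert.parameters():
        p.requires_grad = False

    # Expert 1: the organization's own expert, initialized from the public FFN
    # and trained locally on the proprietary data.
    private_expert = copy.deepcopy(public_ffn)

    # A simple learned router over the two experts
    # (see Solution 2 for the nonparametric alternative).
    router = nn.Linear(hidden_dim, 2, bias=False)
    return nn.ModuleDict({"experts": nn.ModuleList([public_expert, private_expert]),
                          "router": router})
```

The point of training alongside a frozen public expert is that each private expert learns to coordinate with a component that every other organization also has, so the experts compose sensibly at merge time.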

Solution 2: Nonparametric Router

Decompose the router matrix into per-expert vectors. Each organization trains its own router embedding locally. The public expert’s embedding is created from domain embeddings.
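Roughly, this means the router matrix is just the per-expert embedding vectors stacked together. Here is a sketch; the helper names are mine, and I am assuming the domain embedding is something like the mean embedding of that domain’s data:

```python
import torch
import torch.nn as nn

def init_router_embedding(domain_hidden_states: torch.Tensor) -> nn.Parameter:
    """One expert's router vector, initialized from domain embeddings
    (here: the mean embedding of that expert's domain data)."""
    return nn.Parameter(domain_hidden_states.mean(dim=0))

def build_router_matrix(expert_embeddings) -> torch.Tensor:
    """The router matrix is the per-expert vectors stacked row by row,
    so each organization can train its own row independently."""
    return torch.stack(list(expert_embeddings), dim=0)    # (n_experts, hidden_dim)

def route(hidden: torch.Tensor, router_matrix: torch.Tensor) -> torch.Tensor:
    """Expert scores per token: a softmax over dot products with each row."""
    return torch.softmax(hidden @ router_matrix.T, dim=-1)
```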

FlexOlmo Summary:

The model is trained on public data only. Each organization takes this model and does modular training on their own data, adding a new FFN expert. Each organization initializes its router embedding from domain embeddings. Half the model is frozen (the public expert and shared layers), while the other half is trained on the proprietary data. Then everything is merged by collecting the experts and concatenating their router embeddings. It is easy to add or remove organizations by adding or removing experts and the corresponding part of the router matrix.
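Putting the pieces together, the merge and the opt-out are roughly the following, under the assumption that each organization only ships back its trained expert FFN and its router embedding (again, my own sketch, not the released code):

```python
import torch
import torch.nn as nn

def merge_flexolmo(public_expert, public_embedding, contributions):
    """contributions: dict mapping org name -> (expert_ffn, router_embedding)."""
    experts = nn.ModuleList([public_expert] + [ffn for ffn, _ in contributions.values()])
    router = torch.stack([public_embedding] + [emb for _, emb in contributions.values()])
    return experts, router                      # router: (n_experts, hidden_dim)

def opt_out(experts, router, expert_index):
    """Removing a data owner = deleting its expert and its router row; no retraining."""
    kept_experts = nn.ModuleList(e for i, e in enumerate(experts) if i != expert_index)
    kept_rows = [i for i in range(router.shape[0]) if i != expert_index]
    return kept_experts, router[kept_rows]
```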


Experiments:

Public data: They use Common Crawl public data.

Private data: Times news, Reddit data, educational text.

Realistic data: data for tasks that can differ substantially, such as math, code, creative writing, and academic papers.

Each expert is 7 billion parameters, so the merged MoE is several times that size. Pretraining on the public shared data is done for 1 trillion tokens, and each organization then does continued pretraining on their own data for 50 billion tokens.

Results:

  • Continued training on private data helps in-domain tasks but hurts out-of-domain tasks.
  • Model merging helps, with BTM (Branch-Train-Merge) being the best merging technique.
  • FlexOlmo retains about 90% of the benefit of fully unrestricted training.
  • Opting out has minimal impact on the other experts, since the router simply has fewer experts to choose from.
