Bolmo architecture enables efficient byte-level LM training without sacrificing quality

Companies that want tokenizer-free multilingual models are increasingly turning to byte-level language models to reduce the fragility of noisy or low-resource text. To serve this niche and make the approach practical at scale, the Allen Institute for AI (Ai2) introduced Bolmo, a new family of models that builds on its Olmo 3 models by “byteifying” them, reusing their structure and capabilities.

The company has launched two versions, Bolmo 7B and Bolmo 1B, which Ai2 describes as “the first fully open byte-level language model[s].” The company said both models were competitive with, and in some cases outperformed, other byte- and character-based models.

Byte-level language models work directly on raw UTF-8 bytes, eliminating the need for a predefined vocabulary or tokenizer. This allows them to more reliably handle misspellings, rare languages, and unconventional text – key requirements for moderation, edge deployments, and multilingual applications.
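
To make that concrete, here is a minimal Python sketch, not Bolmo’s actual code, of what tokenizer-free input looks like: the model’s input IDs are simply the raw UTF-8 bytes of the text, so there is no vocabulary file and no out-of-vocabulary token.

```python
# Minimal sketch of byte-level "tokenization": input IDs are raw UTF-8 bytes.
# Function names are illustrative, not part of any Bolmo API.

def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 byte values (0-255), used directly as input IDs."""
    return list(text.encode("utf-8"))

def byte_ids_to_text(ids: list[int]) -> str:
    """Decode byte IDs back to text, tolerating partial or invalid sequences."""
    return bytes(ids).decode("utf-8", errors="replace")

# Misspellings, rare scripts, and emoji all map to bytes the same way.
for sample in ["hello", "héllo wörld", "こんにちは", "misspeled txt 🚀"]:
    ids = text_to_byte_ids(sample)
    print(f"{sample!r} -> {len(sample)} chars, {len(ids)} byte IDs: {ids[:8]}...")
```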

For companies deploying AI across multiple languages, with noisy user input, or in constrained environments, tokenizer-less models offer a way to reduce operational complexity. Ai2’s Bolmo attempts to make this approach practical at scale, without retraining from scratch.

How Bolmo works and how it was built

Ai2 said it trained the Bolmo models on its Dolma 3 data mix, the same mix that helped train its flagship Olmo models, along with some open-code datasets and character-level data.

The company said its goal is to provide a reproducible and inspectable path for converting strong subword language models into byte-level ones, in a way the community can adopt and extend. To that end, Ai2 will release its checkpoints, code, and a complete paper to help other organizations build byte-level models on top of its Olmo ecosystem.

Since fully training a byte-level model from scratch can be expensive, the Ai2 researchers instead chose to byteify an existing Olmo 3 7B checkpoint in two stages.

In the first stage, Ai2 froze the Olmo 3 transformer and trained only the new byte-level components: the local encoder and decoder, the boundary predictor, and the language modeling head. This stage was designed to be “cheap and fast,” requiring only 9.8 billion tokens.

The second stage unfreezes the full model and trains it on additional tokens. Ai2 said the byte-level approach allows Bolmo to avoid the vocabulary bottlenecks that limit traditional subword models.
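
A hedged PyTorch sketch of this two-stage recipe follows. The module names (local_encoder, boundary_predictor, and so on) follow the article’s description but are illustrative assumptions, not Ai2’s actual implementation.

```python
# Two-stage "byteification" sketch: freeze a pretrained backbone, train new
# byte-level modules, then unfreeze everything. All structure is assumed.
import torch
import torch.nn as nn

class ByteifiedLM(nn.Module):
    """Toy stand-in for a byteified transformer language model."""

    def __init__(self, d_model: int = 256, n_bytes: int = 256):
        super().__init__()
        # New byte-level components, trained from scratch in stage 1.
        self.local_encoder = nn.Embedding(n_bytes, d_model)
        self.boundary_predictor = nn.Linear(d_model, 1)
        self.local_decoder = nn.Linear(d_model, d_model)
        self.lm_head = nn.Linear(d_model, n_bytes)
        # Stand-in for the pretrained Olmo 3 transformer backbone.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # Boundary prediction is omitted from this toy forward pass.
        h = self.local_encoder(byte_ids)             # bytes -> hidden states
        h = self.backbone(h)                         # pretrained transformer
        return self.lm_head(self.local_decoder(h))   # next-byte logits

model = ByteifiedLM()

# Stage 1: freeze the pretrained backbone; train only the new byte modules.
for p in model.backbone.parameters():
    p.requires_grad = False
stage1_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(stage1_params, lr=1e-4)

# Stage 2: unfreeze everything and continue training end to end.
for p in model.parameters():
    p.requires_grad = True
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Input IDs are raw UTF-8 bytes, exactly as in the earlier sketch.
ids = torch.tensor([list("byte-level".encode("utf-8"))])
logits = model(ids)  # shape: (1, sequence_length, 256)
```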

Strong performance among peers

Byte-level language models are not as common as subword-based small language models or LLMs, but they are a growing area of research. Meta released its Byte Latent Transformer (BLT) architecture research last year, proposing a robust model that processes raw bytes rather than relying on a fixed vocabulary.

Other research models in this space include ByT5, Stanford’s MrT5, and CANINE.

Ai2 evaluated Bolmo using its evaluation suite, covering math, STEM reasoning, question answering, general knowledge, and coding.

Bolmo 7B showed solid performance, scoring well on character-focused benchmarks like CUTE and EXECUTE while also improving accuracy over the base Olmo 3 model.

Bolmo 7B outperformed comparably sized models in coding, math, multiple-choice question answering, and character-level understanding.

Why businesses may choose byte-level models

Many businesses already find value in hybrid model stacks, combining different models and model sizes to match their workloads.

Ai2 argues that organizations should also consider byte-level models, not only for their robustness and multilingual understanding, but also because they “naturally connect to an existing ecosystem of models.”

“One of the key benefits of the dynamic hierarchical configuration is that compression becomes a toggleable button,” the company said.
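
The quote is brief, but one plausible reading, sketched below under stated assumptions, is that a boundary predictor’s threshold controls how many bytes get merged into each patch, so the compression ratio can be adjusted at runtime rather than fixed by a tokenizer. The thresholding scheme and all names here are hypothetical.

```python
# Hypothetical sketch: a runtime threshold on boundary scores acts as a
# compression "toggle" in a dynamic hierarchical byte model.
import torch

def choose_boundaries(scores: torch.Tensor, threshold: float) -> torch.Tensor:
    """Mark a patch boundary wherever the predictor's score exceeds threshold.

    Lowering the threshold creates more boundaries (shorter patches, less
    compression); raising it creates fewer (longer patches, more compression).
    """
    return scores > threshold

scores = torch.rand(32)            # mock per-byte boundary scores in [0, 1)
for threshold in (0.3, 0.5, 0.8):  # "toggle" the compression level
    n_patches = int(choose_boundaries(scores, threshold).sum())
    avg_patch = 32 / max(n_patches, 1)
    print(f"threshold={threshold}: {n_patches} patches, "
          f"~{avg_patch:.1f} bytes/patch")
```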

For companies already running heterogeneous model stacks, Bolmo suggests that byte-level models are no longer purely academic. By byteifying a strong subword model rather than training from scratch, Ai2 signals a lower-risk path for organizations that want robustness without abandoning existing infrastructure.
