When it comes to open-source large language models (LLMs), model names often look like cryptic strings of code. Examples such as mixtral-8x7b-moe-exl2.gguf or llama-2-13b-chat.gguf.q4_0 may seem opaque at first, but they are deliberately structured: each element conveys important metadata about the model and, crucially, tells you whether your hardware can run it.
In this article, we will break down how to understand these names.
Decoding the Name Tag
When you browse models, the first thing you’ll notice is a number followed by a ‘B’. The ‘B’ stands for billions of parameters, and it’s the primary indicator of a model’s size and capability. It’s also the first thing to check against your hardware: the parameter count largely determines how much memory the model needs, and therefore whether your GPU can actually run it.
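If you want a quick ballpark for whether a model fits, you can estimate its memory footprint from the parameter count alone. Here is a minimal sketch; the 20% overhead factor is my own rough assumption for the KV cache and runtime buffers, not a published figure:

```python
# Back-of-the-envelope VRAM estimate (a sketch, not a benchmark):
# weights ≈ parameter count × bits per weight / 8, plus ~20% overhead
# for the KV cache and runtime buffers (that 20% is a loose assumption).
def estimate_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    weight_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits ≈ 1 GB
    return weight_gb * 1.2  # hypothetical 20% overhead factor

# llama-2-13b at 16 bits needs roughly 31 GB; quantized to 4 bits, roughly 8 GB.
print(estimate_vram_gb(13, 16))  # ≈ 31.2
print(estimate_vram_gb(13, 4))   # ≈ 7.8
```

The takeaway: the ‘B’ number times the bytes per weight is your first-pass answer to “will this fit on my card?”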
If the model name starts “doing maths”, as in 8x7b, or includes the letters MoE, that means you’ve found a Mixture of Experts model. Instead of one monolithic network, it contains several expert sub-networks and routes each token through only a few of them. In mixtral-8x7b, for example, the name encodes 8 experts of roughly 7B parameters each, as the sketch below works out.
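A quick sketch of that naming arithmetic (the NxM reading is the usual convention, and the Mixtral figures quoted in the comments are approximate published numbers, not something the name itself tells you):

```python
# Sketch of MoE sizing under the "NxM" naming convention:
# N experts of roughly M billion parameters each.
def moe_params(n_experts: int, expert_b: float, active: int) -> tuple[float, float]:
    total = n_experts * expert_b   # naive upper bound on total parameters
    active_b = active * expert_b   # parameters actually used per token
    return total, active_b

# mixtral-8x7b routes each token through 2 of its 8 experts:
total, active = moe_params(8, 7, 2)
print(f"~{total}B total (upper bound), ~{active}B active per token")  # ~56B / ~14B
# Mixtral's published figures are ~46.7B total and ~12.9B active per token,
# lower than the naive math because attention layers are shared across experts.
```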
The Quantization Conspiracy
If the parameter count tells you the size of the original model, the suffix tells you how someone managed to shrink it down. You’ll see cryptic tags like GGUF, AWQ, EXL2, or GPTQ.
These tags are not random; they denote different file formats and quantization methods designed to let you run models locally that would otherwise be too large for your hardware.
The goal of these methods is quantization: reducing the number of bits used to store each of the model’s weights, trading precision for a much smaller memory footprint. A 16-bit weight squeezed into 4 bits takes a quarter of the space. Historically, this process could result in “lobotomizing” the models, but on modern, well-trained models the quality loss at moderate quantization levels is barely noticeable.
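To make that concrete, here is a minimal sketch of the simplest scheme: symmetric round-to-nearest quantization of a weight tensor to 4 bits. Real formats such as GGUF’s q4_0 add per-block scales and other refinements, so treat this as illustrative only:

```python
import numpy as np

def quantize_4bit(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric round-to-nearest quantization to 4-bit integers (-8..7)."""
    scale = np.abs(weights).max() / 7               # map the largest weight to 7
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale                                 # stored in int8 here, but each
                                                    # value fits in 4 bits
def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(8).astype(np.float32)
q, s = quantize_4bit(w)
print(w)
print(dequantize(q, s))  # close to w, with small rounding error per weight
```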
Here is a quick translation of the common tags:
- GGUF: This file format (the successor to GGML) supports a range of quantization schemes and is popular because it lets models run on the CPU, with optional GPU offload, and is neatly contained within a single file (see the loading sketch after this list).
- EXL2: Used by ExLlamaV2, this is often the fastest format for GPU inference. It achieves its compression by mixing quantization levels between 2 and 8 bits per weight within a single model, though it currently runs only on NVIDIA GPUs.
- AWQ: Short for Activation-aware Weight Quantization, this technique uses the model’s activations to identify the small fraction of weights that matter most to its outputs, shields those from rounding error by rescaling them, and quantizes the rest aggressively.
- Safetensors: If you see this in the mix, don’t worry; it’s not a quantization method. It’s just a secure file format for storing weights: unlike pickle-based formats, it can’t execute arbitrary code, so a model downloaded from an unknown source can’t sneak “sus things” onto your PC.
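
To tie the pieces together, here is a minimal sketch of loading a quantized GGUF file with llama-cpp-python. The file name and layer count are placeholders you would adjust for your own download and GPU:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Hypothetical path to a 4-bit quantized download; point this at your own file.
llm = Llama(
    model_path="./llama-2-13b-chat.Q4_0.gguf",
    n_gpu_layers=35,  # layers to offload to the GPU; 0 = pure CPU inference
    n_ctx=4096,       # context window size
)

out = llm("Q: What does the 'B' in 13B stand for?\nA:", max_tokens=32)
print(out["choices"][0]["text"])
```

If the model doesn’t fit in VRAM, lowering n_gpu_layers keeps the remainder on the CPU, which is exactly the flexibility that makes GGUF so popular.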
