The incredible inner world of LLMs (I)
The concept of a 'model'

Have you ever heard or read phrases like "a model student" or "a role model"? A model is a name tied to a set of defined characteristics. If we define a 'TEF' as something that should weigh around a kilo, be made of Bakelite, be shiny, have a dial with ten numbers and a copper system inside... we are assigning a set of attributes to the name 'TEF'. We have just named a model. Hooray!

Now let's take a real-world object. For example, a piano. Pianos have some copper inside their metal framework. Their keys or plastic components might contain Bakelite. They generally don't have a ten-digit dial (though some modern models may have a keypad). They weigh more than a kilo, much more. Shine... yes, our piano has that classic piano-black gloss. So how similar is the piano to our 'TEF' model?

There are many ways to find out. One of them is to place the mentioned attributes in a vector space and use a function to determine how close those attributes are to the model's. Let's see. Now we have two vectors:

TEF model: [1, 1, 1, 1, 1]
The piano: [0, 0.2, 0.8, 0, 1]

There are various ways to calculate similarity: Manhattan distance, cosine similarity and so on (a small code sketch at the end of this section shows one of these calculations). Intuitively, the piano is roughly 40% a TEF model. In other words, it's still quite far from being a proper TEF.

Now we head to the attic and find an old Bakelite telephone, but instead of a rotary dial it has buttons (a Heraldo model with a keypad). Let's calculate the similarity: it comes out at... 80%! Even more so if we were more flexible with the "ten-digit rotary dial" attribute: if we used "has ten numbers" as one category and "has a rotary dial" as another, the similarity to the 'TEF' model would climb even higher.

So, with this simple example, you now have a rough idea of what a model is in machine learning.

The neural network model

A neural network is trained with input data: photos of pianos. Thousands, hundreds of thousands... no, millions of piano photos! These are broken down into vectors, and after seeing many photos, the network discovers a set of attributes that can represent most pianos in images. A neural network is just that: thousands, millions or even billions of vector values adjusted by training on piano photos. These adjusted values are what we call the weights of the neural network, and the trained network that detects pianos in photos is called a model.

Now you feed it ten photos of cats, and among them one of a cat playing a piano. The network will convert the photos into vectors, pass them through its layers (like a ball bouncing around a pinball machine), and return a score indicating whether a piano is present. If the training was successful, only the photo of the cat playing the piano will score high enough to count as a match.

What's more, if we only trained it on black pianos, our model might hesitate if shown a piano of another color. It would recognize the shape and score it well, but dismiss it as a piano because the weights say a piano can only be black.

■ This is what's known in machine learning as overtraining or overfitting. Without variety in the dataset, other attributes may stray too far from the model. And beware: if we provide too much variety and randomness, the opposite occurs, underfitting. The network ends up asking, "Okay, lots of photos, but what exactly do you want me to learn?" In that case, it might mistake a piano for a helicopter.
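Before we get to LLMs, here is the code sketch promised earlier: the attribute-vector comparison from the TEF example, in plain Python. The TEF and piano vectors are taken from the text above; the vector for the Heraldo phone is my own assumption (it matches every attribute except the rotary dial), chosen to be consistent with the 80% figure, and the informal "match score" is a simplification rather than a standard metric (cosine similarity is included for comparison).

```python
# Minimal sketch: comparing real objects against the 'TEF' model as attribute vectors.
# Attribute order: [weighs ~1 kg, Bakelite, shiny, ten-digit rotary dial, copper inside]

def match_score(obj, model):
    """Informal 'percentage of the ideal': how much of the model's attributes the object covers."""
    return sum(min(o, m) for o, m in zip(obj, model)) / sum(model)

def cosine_similarity(a, b):
    """One of the standard similarity measures mentioned in the text."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

tef_model = [1, 1, 1, 1, 1]      # the ideal 'TEF'
piano     = [0, 0.2, 0.8, 0, 1]  # our piano, scored attribute by attribute
heraldo   = [1, 1, 1, 0, 1]      # assumed scores: Bakelite phone with buttons, no rotary dial

print(f"Piano   vs TEF: {match_score(piano, tef_model):.0%}")    # ~40%
print(f"Heraldo vs TEF: {match_score(heraldo, tef_model):.0%}")  # 80%
print(f"Piano   vs TEF (cosine): {cosine_similarity(piano, tef_model):.2f}")
```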
But what exactly is a model when we talk about LLMs?

Great. Now we have a conceptual idea of what a model is. Let's get specific about LLMs (language-specialized models).

■ LLM stands for Large Language Model: a model that is large and designed to handle human language. Take your pick of translation: "large-scale," "extensive," "massive."

So why do we call them large? Remember those attributes? Copper, shiny, Bakelite... How many did we count? Five? That's a tiny micromodel. Now imagine a model with billions of parameters. Massive, right? That's why it's called large: the web of interconnected weights (attributes) is so enormous that it's impossible to calculate by hand even the distance between two words. But today we have computers with processing power and memory that were once unimaginable, plus distributed computing on top of that. It's all massive.

■ That's why AI has advanced so quickly in recent years, emerging from one of its winters straight into a blazing 40-degree summer in the shade.

Can we train models on a modest laptop? Of course you can, but... they won't have billions of parameters. Training a model involves processing (tokenizing and vectorizing) millions upon millions of documents: all the knowledge you can get your hands on. It's a Herculean task. Training a current LLM can take months, the cost runs into millions of euros, and you need huge amounts of power and extremely powerful processors working in sync for long periods. If we scale down our ambitions, though, we can train a small model with just a few million parameters. For example, processing all of Wikipedia might take only a few days on an average GPU. Just don't expect philosophical sunset chats in white robes, of course.

■ The alternative is to take an existing, solid, reliable model and retrain it for a specific purpose. This is called fine-tuning, and we'll explore it in practice in future articles.

Where are models hosted, and how do we evaluate or classify them?

Well, you've likely used several models such as ChatGPT, Claude and the like. These are private models: you can use them, but you can't download or run them locally (and even if you could, their hardware requirements are beyond the average user). The good news is that many other models are available to download and use, and we've already discussed how to do that in previous articles.

Now let's put on our technical glasses and look at the types of models out there and where to find them: the land of HuggingFace. This community is packed with models, datasets for training or fine-tuning, papers, documentation, libraries, forums and more. It's a true bazaar, and fortunately most of it is open source or under permissive licenses. In fact, there's so much available that you can easily get lost in its bustling "streets," where new models appear at a pace rivaling the JavaScript ecosystem. Take a look: currently, and still growing, there are nearly 1.7 million models. Easy to get lost...

Also (see the red box on the left of the image), models are grouped by purpose. There are models specialized in computer vision: image classification (hello again, pianos!), video classification, image and video description, generation, and so on. Besides media-focused models, there are the classics specialized in text and natural language processing: text classification, generation, translation and more. Then there are the multitaskers: multimodal models that combine several capabilities under one interface; more on those in a moment, right after a quick practical detour.
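The detour: pulling one of these Hub models down and running it takes only a few lines. Below is a minimal sketch using HuggingFace's transformers library; the pipeline call is standard, but the model ID microsoft/phi-2 is just an illustrative choice (any text-generation model that fits your hardware will do), and you need transformers plus a backend such as PyTorch installed.

```python
# Minimal sketch: pulling a text-generation model from the Hugging Face Hub.
# Requires the transformers library and a backend such as PyTorch
# (pip install transformers torch). The model ID below is only an example.
from transformers import pipeline

generator = pipeline("text-generation", model="microsoft/phi-2")

result = generator("A language model is", max_new_tokens=30)
print(result[0]["generated_text"])
```

The first run downloads the weights from the Hub and caches them locally; subsequent runs load them from disk.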
Back to the multitaskers. For example, you upload an image and "chat" with the model about it: one model processes the image and generates a description, and another "head" then chats with you about that description. It looks like a single AI, but it's actually several models working together.

1.7 million models, but there's a catch...

Before anything else, that staggering number of 1.7 million needs some filtering. Many models are already "old", surpassed by newer versions, and their training data cutoff is now considered obsolete. Beyond that, we also need to divide the remaining number by the model variations produced by fine-tuning: you take LLaMA 3.2, fine-tune it for a specific task, publish it on HuggingFace, and you now have a "new" model. Facebook's flagship LLaMA, for example, spawns a whole host of derivatives from each version: adaptations, fine-tunings, merges with other models, different quantizations... A whole family of siblings and cousins. Take a look at the numbers.

You may also have heard of "distilled" models. Distillation is essentially a technique for transferring the knowledge or behavior of a large model into a smaller one, saving training time (and cost) and producing a model that is more practical in size and requirements.

Also, a single model can come in multiple versions. Check out this chart for the PHI2-GGUF model: what you see (2-bit, 3-bit, 4-bit...) refers to the model's quantization.

What is quantization?

Remember when we talked about vectors and such? We won't go into the math, but quantization essentially sets the "precision" of the weights in the model's neural network layers. Let's look at an example. A person steps on a scale and it reads 74.365777543423 kg. That's quite precise, right? Now let's say we don't need that much precision, so we quantize to 8-bit: the weight reads 74.366 kg. Still too precise? Let's lower it further, say to 4-bit: now it reads 74.4 kg. We can go even lower, to 2-bit: now it stores the person's weight as 70 kg.

What just happened? As we quantized more aggressively, we lost precision in the measurement but gained storage efficiency. In short: we've compressed the model's memory footprint at the cost of precision.

Parameters

Another important factor to consider is the model's number of parameters. No matter how well-quantized a model is, if it has too many billions of parameters it simply won't fit into an average laptop's RAM. The model I showed earlier has 2.78 billion parameters: quantized to 8-bit, its disk size is 2.96 GB; with 2-bit quantization, it drops to 1.17 GB. But if we try this with a model that has, say, 70 billion parameters, this happens: as you can see, my machine (16 GB of RAM) can't handle even the most aggressive 3-bit quantization, because the model size exceeds the total RAM (let alone the available RAM...).

In short, no matter how much we quantize there's a limit: you can't load a 70-billion-parameter model on modest RAM.
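To make the arithmetic behind those figures tangible, here is a rough back-of-the-envelope sketch in Python. The kilogram example, the 2.78-billion and 70-billion parameter counts and the 16 GB of RAM come from the text above; mapping "bits" to rounding steps is only an analogy, and the size formula (parameters times bits per weight) ignores file-format overhead, so the results only approximate the real GGUF sizes quoted.

```python
# Back-of-the-envelope sketch: quantization trades precision for memory.
# The rounding steps are an analogy for bit widths, and the size estimate
# ignores file-format overhead, so it only roughly matches real GGUF files.

weight_kg = 74.365777543423           # the scale reading from the example

print(round(weight_kg, 3))            # "8-bit-ish" precision -> 74.366
print(round(weight_kg, 1))            # "4-bit-ish" precision -> 74.4
print(round(weight_kg / 10) * 10)     # "2-bit-ish" precision -> 70

def model_size_gb(n_params, bits_per_weight):
    """Approximate in-memory size: parameters * bits per weight, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

for n_params, label in [(2.78e9, "2.78B-parameter model"), (70e9, "70B-parameter model")]:
    for bits in (8, 4, 3, 2):
        print(f"{label} at {bits}-bit: ~{model_size_gb(n_params, bits):.2f} GB")

# Even at 2 or 3 bits per weight, the 70B model needs well over 16 GB.
```

Even with that generous rounding, the conclusion above holds: quantization shrinks a model a lot, but it cannot squeeze a 70-billion-parameter model into 16 GB of RAM.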