Jorge Pereira Delgado

A graduate in Image and Sound Engineering with a Master's in Machine Learning and Statistical Methods for Multimedia Processing from Universidad Carlos III de Madrid. Passionate about teaching and data science since my undergraduate years, I have worked in the Signal Theory and Communications department at Universidad Carlos III de Madrid as a lecturer and researcher in several areas of Machine Learning, such as topic modeling and regression algorithms on distributed computing. I currently work as a trainer and Data Scientist at Telefónica Tech.
AI & Data
Retrieval Augmented Generation (RAG): teaching new tricks to old models
OpenAI launched its ChatGPT tool in November 2022, sparking a revolution (well, another one, actually) in the world of artificial intelligence. ChatGPT belongs to the family of so-called Large Language Models (LLMs). These models are based on the Transformer architecture, which we talked about in this post, and are trained on huge amounts of text: to get an idea, it is estimated that GPT-4, OpenAI's latest version, was trained using 10,000 GPUs running uninterruptedly for 150 days. From all that text, they learn to generate language automatically. Such has been the success of LLMs that, since the release of ChatGPT, all the leading companies have been developing and improving their own: Meta's Llama 3, IBM's Granite, or Anthropic's Claude.

Although these models are very versatile, able to answer a wide range of questions from general knowledge to even mathematical and logical problems (something that was initially their Achilles' heel), sometimes we want our model to have knowledge of a very specific domain, perhaps even information internal to a particular company. This is where several techniques have emerged to use LLMs while also "teaching" them (note the quotation marks) new knowledge.

Retraining our model

If I ask any LLM what the capital of Spain is, for example, it will answer that it is Madrid. The model has been trained with a lot of public information (think of Wikipedia, for example), and the answer to that question is in there. However, if you ask it what your dog's name is, the LLM will not know (unless you are Cristiano Ronaldo or Taylor Swift), because it has not been trained with that information.

Since training an LLM from scratch is not an option for the average Joe (we can't all afford to keep 10,000 GPUs running for 150 days at a time, as expensive as electricity is!), different techniques have been explored to incorporate extra information into already-trained LLMs, thus saving time and money. Especially money.

A first approach, very widespread in the world of Deep Learning in general, is Fine Tuning: taking an already-trained LLM and retraining it with our particular data. This is how we get models such as BloombergGPT, an LLM trained with a large amount of financial information, creating a model specially designed for those working in the world of finance. However, since these models are so large (billions and billions of parameters to adjust), this technique is still very expensive. Moreover, sometimes things are not as pretty as they sound: in many cases BloombergGPT performs only slightly better than ChatGPT (the base model any of us can use, with no extra financial knowledge) and worse than GPT-4, the latest version of the model.
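As an illustration only (this is not the actual recipe behind BloombergGPT or any production model), here is a minimal fine-tuning sketch using the Hugging Face transformers library. The tiny gpt2 model and the finance_corpus.txt file are placeholder assumptions; a real project would use a much larger base model, far more data, and typically parameter-efficient methods such as LoRA to contain costs.

```python
# A minimal sketch of fine-tuning a causal LLM with Hugging Face transformers.
# "gpt2" and "finance_corpus.txt" are illustrative placeholders only.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "gpt2"  # tiny stand-in for a real multi-billion-parameter LLM
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 defines no padding token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Our particular data: one domain-specific document per line of a text file
dataset = load_dataset("text", data_files={"train": "finance_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_set = dataset["train"].map(tokenize, batched=True,
                                 remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-llm",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train_set,
    # mlm=False -> plain next-token prediction, as in the original pretraining
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Even this toy loop updates every weight of the network, which is exactly why doing the same on a model with billions of parameters becomes so expensive.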
Knowing information vs. knowing how to search for information

If the reader were asked the birthday of a far-off relative, he or she would probably not know it by heart; at least that is my case, I admit. However, having that information in memory is irrelevant, since we can have it written down in a calendar (for the more analog-minded) or on our phone.

The same thing happens with LLMs. If we train an LLM with specific knowledge, the LLM will "know" that knowledge, just as we know (I hope) our parents' birthdays. However, an LLM does not need to know something for us to ask about it: we can give it extra information with which to elaborate its answer, in the same way that, if we are asked something we don't know, we can look up the answer in books, notes, the internet, or by calling a friend.

So, if I ask ChatGPT about my birthday, it obviously does not know the answer, as we can see in Figure 1.

Figure 1: Example of ChatGPT's response to unknown information.

If, however, I provide it with additional information about me in the prompt (the text we pass to the LLM), we can see how the answer about my birthday is correct; ChatGPT even indulges in some flourish, as we can see in Figure 2. The LLM does not know the information but, given a context, it does know how to look it up and use it to work out the answer.

Figure 2: Example of ChatGPT's response when provided with additional information.

Finding the right information: RAG systems

At this point, the reader might not be very satisfied with the above example. What is the point of asking ChatGPT about my birthday if I must give it that information in the prompt myself? It seems rather pointless to ask an LLM something if I have to supply the answer. However, we could design a system that automates the search for that context, that additional information the LLM needs, and passes it to the LLM. And this is precisely what RAG (Retrieval Augmented Generation) systems do.

RAG systems consist of an LLM and a database that stores the documents we are going to ask the model about. So, for example, if we work at an industrial machinery company, we could have a database with all the manuals, both for the machinery itself and for safety, and ask questions about any of these topics; the system would answer by taking the information from those handbooks.

The operation of a RAG system, illustrated in Figure 3, is as follows:

1. The user types a prompt formulating their question, which is converted into a numeric vector by an embedding model. If you are not familiar with this term, don't worry: you can think of an embedding as nothing more than the "translation" of a text into the language of computers.
2. This "translation" of the prompt is then compared with as many documents as desired (in this case, handbooks) and the most similar fragments are extracted. It must be emphasized that all documents have been previously "translated" with the same embedding model.
3. Finally, the LLM is given both the prompt and the information extracted from our vector database, automatically providing it with what it needs to answer our question.

Figure 3: Schematic diagram of a RAG system.

So, just as we can search for information that is unknown to us, an LLM can do the same if we integrate it into a RAG system. We can thus exploit the text generation capabilities of LLMs while automating the gathering of the context and information needed to answer the questions we want... as long as we have the necessary documents at hand, of course!
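As a concrete illustration, here is a minimal sketch of the retrieval step using the sentence-transformers library. The three "handbook fragments", the question, and the small all-MiniLM-L6-v2 embedding model are illustrative assumptions; the final prompt would be sent to whichever LLM the system uses.

```python
# Minimal sketch of the retrieval step of a RAG system.
import numpy as np
from sentence_transformers import SentenceTransformer

# 1. "Translate" the document fragments into vectors (done once, in advance)
documents = [
    "The hydraulic press must be depressurised before any maintenance work.",
    "Safety helmets are mandatory in all areas marked in yellow.",
    "The conveyor belt motor requires lubrication every 500 operating hours.",
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small embedding model
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

# 2. "Translate" the user's question with the SAME embedding model
question = "How often should I lubricate the conveyor motor?"
query_vector = embedder.encode([question], normalize_embeddings=True)[0]

# 3. Compare vectors (with normalized vectors, cosine similarity = dot product)
similarities = doc_vectors @ query_vector
top = np.argsort(similarities)[::-1][:2]  # keep the 2 most similar fragments

# 4. Build the augmented prompt: retrieved context + question, for the LLM
context = "\n".join(documents[i] for i in top)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```

In a production system, the documents list would live in a proper vector database, and the final prompt would be passed to the LLM instead of printed.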
June 17, 2024
Connectivity & IoT
IoT anomalies: how a few wrong pieces of information can cost us dearly
When we hear the term Internet of Things (or IoT for short) we often think of internet-enabled fridges or the already famous smartwatches. But what does it really mean? The term IoT refers to physical objects that are equipped with sensors and processors and connected to other similar elements, allowing them to obtain certain information, process it, and collect it for later use. And it's not all fridges and overpriced digital watches: from sensors that count the number of people entering an establishment to screens with an integrated camera that can detect whether someone is paying attention to their contents and record that information. Other interesting examples can be found in the first articles of this series, where we discussed several use cases of these technologies and the advantages of using smart water meters to optimise the water lifecycle.

However, sometimes the captured data deviates from normal values. If the data received from an IoT device tells us that a person has been staring at an advertising screen for hours, or that the temperature inside a building at a given moment is 60 °C, this data is dubious to say the least. These outliers must be taken into account when designing our networks and devices: the received values have to be filtered to see whether they are normal and to act accordingly. This is a very clear example of how Artificial Intelligence (AI) and the Internet of Things (IoT) benefit from each other. The combination of both is what we know as the Artificial Intelligence of Things (AIoThings), which allows us to analyse patterns in the data within an IoT network and detect anomalous values, i.e. values that do not follow these patterns and that are usually associated with malfunctions, various problems, or new behaviours, among other cases.

What is an outlier?

The first thing to ask ourselves is what anomalous values, commonly called outliers in the world of data science and Artificial Intelligence, actually are. If we look up "anomaly" in the RAE dictionary (the dictionary of the Royal Spanish Academy), we find two definitions that fit surprisingly well with the data domain: "deviation or discrepancy from a rule or usage" and "defect in form or operation".

The first definition refers to values that do not behave as we would expect, that is, values far from those we would consider normal. If we asked random people what their annual income is, we could establish a range in which the vast majority of values would lie. However, we might come across a person whose income is hundreds of millions of euros per year. This value would be an outlier, as it is far from what is expected or "normal", but it is a real value.

The second definition refers to the term "defect". This gives us a clue: it refers to values that are simply not correct, understanding that a piece of data is correct when its value accurately reflects reality. Sometimes it is obvious that a value is wrong: for example, a person cannot be 350 years old, and no one can have joined a social network in 1970 (no social network existed then; a 1970 date usually betrays a timestamp left at its Unix epoch default). These values are inconsistent with the reality they represent.

What to do with outliers?

The next question is what to do with these anomalies, and the answer again varies depending on their nature. In the case of inconsistent data, for which we know the values are not correct, we could remove those values, replace them with more consistent ones (for example, the mean of the remaining observations), or even use more advanced AI methods to impute their value.
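A minimal sketch of the two simplest options using pandas; the ages, including the impossible 350, are invented to mirror the example above.

```python
# Handling inconsistent values: drop them, or replace them with a
# plausible value. The data is made up; 350 is an impossible age.
import pandas as pd

df = pd.DataFrame({"age": [34, 29, 41, 350, 38, 27]})
valid = df["age"].between(0, 120)  # domain rule: plausible human ages

# Option 1: remove the inconsistent rows entirely
cleaned = df[valid]

# Option 2: replace them with the mean of the remaining observations
# (the median is often preferred, precisely because outliers distort means)
mean_of_valid = df.loc[valid, "age"].mean()
df["age"] = df["age"].where(valid, mean_of_valid)
print(df["age"].tolist())  # [34.0, 29.0, 41.0, 33.8, 38.0, 27.0]
```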
The second case, real but extreme values, is more complicated. On the one hand, keeping such extreme data could greatly impair the capabilities of our models; on the other hand, ignoring an outlier could have devastating consequences depending on the scenario in which we find ourselves.

To illustrate the case where we keep outliers and they hurt us, imagine we want to predict a target variable (for example, a person's annual income) as a function of one or more predictor variables (years of experience, municipality where they work, level of education, and so on). If we use linear regression (simplistically, a method that fits a line to the data in the best possible way), a single outlier can greatly impair how that line fits the data, as shown in Figure 1.

Figure 1: Comparison of the results of a linear regression with and without an outlier.
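A small sketch, with synthetic data and numpy only, reproducing the effect shown in Figure 1: a single extreme income is enough to tilt the fitted line away from everyone else.

```python
# Effect of a single outlier on ordinary least squares (cf. Figure 1).
# The incomes are synthetic: roughly 25,000 base plus 3,000 per year
# of experience, and then one hundred-millionaire is appended.
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(1, 11, dtype=float)                 # years of experience
income = 25_000 + 3_000 * years + rng.normal(0, 1_500, years.size)

slope, intercept = np.polyfit(years, income, deg=1)
print(f"without outlier: slope {slope:9.0f}, intercept {intercept:12.0f}")

# Append one extreme (but real!) value and refit
years_o = np.append(years, 6.0)
income_o = np.append(income, 100_000_000.0)
slope_o, intercept_o = np.polyfit(years_o, income_o, deg=1)
print(f"with outlier:    slope {slope_o:9.0f}, intercept {intercept_o:12.0f}")
# One point out of eleven completely changes the line: it no longer
# describes the other ten people at all.
```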
In the event that we choose to ignore these extreme values, we may face the problem of being unable to predict the consequences of events which, although improbable, are possible. An example is the Fukushima nuclear disaster: a magnitude 9.0 earthquake was estimated to be so unlikely that the plant was not designed to withstand it (Silver, N., "The Signal and the Noise", 2012). Indeed, the probability of an earthquake of such magnitude in that area was very small, but if its effects and damage had been analysed, it would have been possible to act differently.

Anomalies in sensors and IoT networks

It is the same in the IoT world: is the data real, or is it due to a sensor error? In both cases, appropriate action needs to be taken. If the anomalous data is due to sensors malfunctioning at specific points in time, we will try to locate those errors and predict or estimate the real value based on the rest of the data captured by the sensor. Several families of AI algorithms can be used here: recurrent neural networks such as LSTMs, time series models such as ARIMA, and a long etcetera.

The process would be as follows: suppose we have a sensor inside an office building that records the temperature over time, in order to optimise the building's energy expenditure and improve employee comfort. When we receive data from the sensor, it is compared with the data predicted by an AI model. If the discrepancy at a certain point is very large (say the sensor shows 30 °C at one point in time, while the rest of the day and our model show temperatures around 20-21 °C), the model will flag that reading as an outlier and replace it with the predicted value. A simplified sketch of this residual check closes the article.

When we receive an anomalous reading but do not know whether it is real, and it could be very harmful, we should act differently. Detecting a temperature in a building that is slightly higher than normal for a few moments is not the same as detecting very low blood sugar values in patients with diabetes (another IoT use case).

The impact of outliers on data quality

As Clive Humby, one of the first data scientists in history, said: "data is the new oil". The value that data has taken on in our society explains the rapid development of fields such as IoT in recent years. However, as with oil, if this data is not of the necessary quality and does not adequately reflect the information we need, it is worthless. Wrong data can lead to drastically different decisions that will take time and money to rectify. That's why, when capturing data in IoT environments, detecting and correcting those outliers is a critical task.
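To close, a deliberately simplified sketch of the residual check described in the sensor section. Instead of an LSTM or ARIMA model, the "prediction" here is just the median of the last few accepted readings; the temperatures and the 3 °C threshold are invented for the example.

```python
# Simplified residual check for a temperature sensor: compare each new
# reading with a prediction and replace it if the discrepancy is too large.
# A rolling median stands in for the AI model (LSTM, ARIMA, etc.).
import numpy as np

readings = np.array([20.4, 20.9, 21.0, 20.7, 30.0, 20.8, 21.1])
THRESHOLD = 3.0  # maximum tolerated discrepancy, in °C
WINDOW = 3       # how many past readings feed the "model"

cleaned = readings.copy()
for t in range(WINDOW, len(readings)):
    predicted = np.median(cleaned[t - WINDOW:t])   # stand-in prediction
    if abs(readings[t] - predicted) > THRESHOLD:   # discrepancy too large?
        cleaned[t] = predicted                     # flag and replace
print(cleaned)  # the suspicious 30.0 becomes ~20.9; the rest is untouched
```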
September 18, 2023