Manuel de Luna Amat


AI & Big Data Data Scientist at Telefónica Tech

AI & Data
Mathematical revolution! LLMs break down barriers and tackle mathematical challenges (Part 2)
One of the most surprising things when you start to play around with LLMs is the great difficulty they have solving very simple mathematical problems, compared with the enormous ease with which they perform tasks that are very complex for the average human being, such as writing poetry. For example, given the prompt: Create an alexandrine poem about how difficult it is for an LLM to solve elementary school level math problems. Write it as if you were Federico García Lorca. The (almost instantaneous) answer is:

Numbers surround me, foreign in their flight,
like shadows that haunt the threshold of my sight.
My verses, flowing in rivers of language,
entangle, captive, in the cold of a message.
One and two, the numbers of the day,
laugh mockingly at my melancholy dismay.
I, who master the word, the rhythm in each line,
stumble in labyrinths where the mind declines.
Oh, impossible sums, subtractions that resist!
Multiplication tables, like swords, persist.
I'd like to decipher what within me refrains,
but cold calculation is a song that constrains.
Reason overwhelms me, blinds me with its clarity,
and in algebra's fog, I lose my liberty.
I am a poet, not arithmetic, in my heart
burns the light of the moon, which never spurns.
So, I keep on fighting, though the echo ignores,
in the infinite lament of this mind that implores.
Call me metaphor, figure, or solution,
but I find no redemption in numbers' revolution.

All this in a matter of 5 seconds. The line "I am a poet, not arithmetic, in my heart burns the light of the moon, which never spurns" is very Lorca, don't you think? Now, let's move on to another question. Observe the following:

Sugar! What a mistake GPTLorca has made! It is essential to remember that, in all these cases of errors in basic math problems, no external tools are involved, such as a system calculator. A typical home PC can perform roughly between 10 and 100 billion mathematical operations per second, depending on the processor, graphics card, and so on. And those operations are, in most cases, far less basic than the one above: we are talking about vector operations, matrix operations, etc., always executed correctly, without a single error. So why do LLMs fail at such basic questions?

Let's recall a bit of theory. LLMs are primarily designed to predict the next word in a sequence based on textual patterns learned from large amounts of data. As the diagram below shows, an LLM is a type of artificial intelligence that can therefore understand and generate text, word by word, as if it were having a conversation with you. That text is, in the end, transformed into a numeric vector that goes into a box, and the response is another numeric vector transformed back into natural text. The response is produced after some "magic" inside the box (that magic is a very, very large number of mathematical operations).

Likewise, LLMs do not understand numbers the way you do. You know that 1 represents the unit and is the neutral element of multiplication, that is, any number multiplied by 1 is still the same number. It is neither prime nor composite, and let's not forget that it is also a universal divisor, because any number divided by 1 is still that number. You know that if you add two units to 1 you have 3, and if you subtract one unit from 1 you have 0. And, as the song said, it is also the loneliest number (One, by Three Dog Night). You are also aware of the following: you can add 10 days to a Monday and get a totally different day, a Thursday. If the Monday you start from is the 30th of the month, you would not go to the 40th (or 41st); you would land on the 10th or the 9th (depending on the month!). You know other things too, like that 100/4 is 25, and how to solve systems of equations or second-degree equations by following, step by step, the logical instructions your teacher gave you.
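This kind of calendar arithmetic, modular and irregular at once, is trivial for a person and for ordinary code. A quick sketch in Python (the starting date is chosen for illustration: September 30, 2024 happens to be a Monday):

```python
from datetime import date, timedelta

# Monday the 30th plus 10 days: the weekday wraps around the 7-day cycle
# and the day-of-month wraps around the end of the month.
start = date(2024, 9, 30)           # a Monday that is also the 30th
later = start + timedelta(days=10)

print(start.strftime("%A %d"))      # Monday 30
print(later.strftime("%A %d"))      # Thursday 10 -- the month rolled over
```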
I am also sure that you are able to correctly solve these questions that Daniel Kahneman poses in his book Thinking, Fast and Slow (although your instinct first tends toward the wrong option, you are able to work out the logic that leads you to the right answer):

- If 5 machines take 5 minutes to make 5 pots, how long would it take 100 machines to make 100 pots? 100 minutes or 5 minutes?
- In a lake there is a patch of water lilies. Every day the patch doubles in size. If it takes 48 days for the patch to cover the whole lake, how long would it take to cover half the lake? 24 days or 47 days?

For an LLM, "one" is simply a string of three characters, "o" + "n" + "e", and 1 is simply the character "1", with no numerical logic attached to either. Think of a programming language, for example Python. In Python, "one" + "one" is "oneone" and "1" + "1" is "11".

📎 Before I go on, I'll tell you a funny curiosity: how does an LLM solve a mathematical task like producing a random number? By way of contrast, the NumPy library picks random numbers using the Mersenne Twister algorithm, which generates deterministic sequences that appear random. These sequences can be controlled by setting a random seed, and the library also offers functions to draw random numbers from various distributions (uniform, normal, etc.). An LLM, on the other hand, often generates the number 42, due to its popularity in pop culture, especially in Douglas Adams' science fiction series "The Hitchhiker's Guide to the Galaxy", where 42 is humorously described as "the answer to the ultimate question of life, the universe and everything". In the literature many LLMs have been trained on, the concept of "random" is linked to "42", among other factors because that literature contains a lot of NumPy code in which users set 42 as the random seed. 😊
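Both the text-versus-number distinction and the seed-42 convention are easy to see in a few lines of Python:

```python
import numpy as np

# To a program (and, loosely speaking, to an LLM's tokenizer), digits in a
# string are just characters; only true integers follow numerical logic.
print("one" + "one")    # 'oneone' -- string concatenation, no arithmetic
print("1" + "1")        # '11'     -- still concatenation: these are characters
print(1 + 1)            # 2        -- actual addition

# The seed-42 convention: fixing the seed makes the "random" sequence
# reproducible (Mersenne Twister under the hood).
np.random.seed(42)
print(np.random.randint(1, 101))   # the same number on every run
```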
The following heat map analyzes the role of temperature in ChatGPT's selection of numbers from 1 to 100. Lower temperatures lead to more deterministic, predictable, and biased choices, while higher temperatures produce more varied, creative responses.

The difficulty LLMs have in solving mathematical problems through logical reasoning, the way humans do, has several causes. Let's look at the main ones one by one:

- Probabilistic nature of language. An LLM like GPT works by predicting which word (or token) is most likely to come next, based on the statistical patterns it has learned from its training data. While this capability is incredibly powerful for linguistic tasks, mathematical problems require accuracy and precision, not probabilistic approximations.
- Limited symbolic reasoning. Mathematics requires symbolic reasoning (working with numbers, variables, equations, and precise logical operations), something LLMs do not natively handle. Although they may have "seen" many equations and mathematical operations during training, LLMs are not inherently designed to manipulate mathematical symbols accurately or to follow strict mathematical rules reliably.
- Lack of deep understanding of mathematical concepts. LLMs do not "understand" mathematics in the sense that a human being does. They have been trained to handle text, not to develop a conceptual understanding of mathematical rules or of the logic underlying complex operations.
- Cumulative errors. Solving mathematical problems often involves several interdependent steps, each requiring a high level of accuracy. Since LLMs can make errors at any intermediate step, due to the probabilistic nature of their process, a small error at one stage can lead to a completely incorrect final answer.
- Limited long-term memory manipulation. Although LLMs can handle relatively long contexts, they do not have persistent memory or the ability to retain and reuse previous information the way a human does when reasoning through a multi-step mathematical problem. This limits their ability to carry out the continuous multi-step reasoning that many mathematical problems require.
- Precision is key in mathematics. Unlike language, where answers can be flexible or approximate, mathematics requires absolute precision. LLMs can generate answers that "sound" correct or plausible but do not actually have the precision required for a correct mathematical solution.

Mathematical problem-solving challenges for LLMs

Let us test LLMs by trying to get them to solve mathematical problems. The paper we will look at studies the relationship between the surface form of a mathematical problem (i.e., its wording or presentation) and how easily LLMs solve it. The authors find that small changes in problem formulation can have a significant impact on both the answer distribution and the solving rate, exposing the sensitivity and lack of robustness of these models on complex mathematical problems. They propose the Self-Consistency-over-Paraphrases (SCoP) method, which diversifies reasoning paths by generating multiple paraphrases of the problem's surface form, to improve performance on this type of mathematical reasoning. They evaluate the approach on four mathematical reasoning datasets and three large language models, showing that SCoP improves performance over the standard self-consistency approach, especially on problems that initially seemed unsolvable.

Let us briefly recall that paraphrasing is rewriting or rewording what you have read or learned in your own words without distorting the meaning. When paraphrasing, it is important to make the reader feel that you have understood the topic and are conveying it correctly.

In this image, you can see the comparison of the answer distribution and the solving rate across variations of the surface form of a mathematical problem, when GPT-3.5-turbo is asked to use Self-Consistency-over-Paraphrases. The solving rate can vary dramatically between semantically equivalent surface forms. It is not surprising, then, that the paper is titled "Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models". The table shows examples where the original problems and their paraphrased variations show a substantial difference in solving rate with GPT-3.5-turbo.
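In essence, SCoP combines paraphrasing with the usual majority vote of self-consistency. Here is a minimal sketch of the idea, with a hypothetical ask_llm helper standing in for a real model API; it illustrates the concept and is not the authors' implementation:

```python
from collections import Counter

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; wire this to a real API to try the sketch."""
    raise NotImplementedError

def scop_answer(problem: str, n_paraphrases: int = 4, n_samples: int = 5) -> str:
    # 1. Diversify the surface form: generate paraphrases of the problem.
    variants = [problem] + [
        ask_llm(f"Rewrite this problem in different words, keeping its meaning:\n{problem}")
        for _ in range(n_paraphrases)
    ]
    # 2. Sample several reasoning paths for each surface form.
    answers = [
        ask_llm(f"Solve step by step, then give only the final answer:\n{variant}")
        for variant in variants
        for _ in range(n_samples)
    ]
    # 3. Self-consistency: majority vote over all final answers.
    return Counter(answers).most_common(1)[0][0]
```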
✅ As I know you are very curious, here is a link to the paper: Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models →

And now let's get to the sour part: the students' grades. As you can see, in 2024 the metrics on the MATH benchmark are very weak compared to other tests used to assess the quality of models. However, although the current picture looks bleak, it is in fact very positive. In 2021, the main models on the market barely reached 25 points on this test, which is undoubtedly one of the most difficult for a machine to pass. Today, the LLMs of the leading AI developers on the market (Meta, Google, Anthropic, etc.) pass it. Look at how LLM scores on the MATH benchmark have improved in a short period of time (less than 5 years).

You may have noticed that, by 2024, the metrics differ from those shown in the table above. It is common for each LLM developer to use its own metrics, for the following reasons:

- Different objectives: each model is optimized for different skills (mathematics, creativity, etc.).
- Companies adjust metrics to excel in specific areas and differentiate themselves in the market.
- Metrics are designed to align with the capabilities that developers want to showcase.
- Proprietary data is used for testing, leading to differences in assessments and scores.

So, what is the best thing to do about it? Quite simply, be technology agnostic and dance with everyone at the party. Technology is a means, not an end, and, ultimately, you have to work with the technology that best suits your use case. Bad practice is, without a doubt, to adapt the use case to the technology.

Technology is a means and not an end: you have to work with the technology that best suits your use case.

Adopting such a stance in LLM evaluation is crucial to getting an unbiased and objective view of a model's true performance. The MATH benchmark, like other metrics, can reveal specific strengths, but it should not be viewed as an absolute verdict on the overall capability of the model. By maintaining an open and evidence-based perspective, it is possible to assess technological improvements without favoritism, allowing fair comparisons and encouraging continued development at this unique moment in the young history of AI. ■
October 22, 2024
AI & Data
Are LLMs transforming the future? Will they overtake us? Key methods to measure their power (Part 1)
Large Language Models (LLMs) have significantly transformed the way we interact with artificial intelligence in recent years, revolutionizing industries ranging from customer service to scientific research. The rapid advancement of these models, however, raises a critical question: how do we effectively measure their performance? Since LLMs must not only generate coherent text but also respond accurately, adapt to diverse contexts, and handle increasingly complex tasks, assessing their quality is a multidimensional challenge. In this article we will look at some of the most commonly used methods for measuring the performance of these models. With a clear understanding of these evaluation tools, we will be able to analyze more rigorously the ability of LLMs to solve real problems and their potential for further advances in conversational AI.

Performance evaluation methods

When the launch of an LLM is announced, a table like the one below is usually displayed; in this case, the performance table of Meta's Llama 3 model. Here you can see the model's scores on different tests. In this article, we will explore what the first two, MMLU and GPQA, are all about.

Undergraduate level knowledge (MMLU)

The Massive Multitask Language Understanding (MMLU) test said "Hello world" with the publication of the paper Measuring Massive Multitask Language Understanding, in 2020, by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. The paper proposed a test to measure the multitask accuracy of the text models of the time, which were not yet fashionable but already existed. The test covers 57 tasks, including elementary mathematics, U.S. history, computer science, and law. Models must have broad knowledge of the world and problem-solving skills to score high on this test.

At the time of publication, most models scored close to random chance, while the largest GPT-3 model beat random chance by nearly 20 percentage points on average. Recall that a lot of water has passed under the bridge since then: the model available in the ChatGPT application is GPT-3.5, and there are more advanced versions such as 4, 4o, etc. In 2020, however, on each of the 57 tasks the best models still needed substantial improvement before reaching expert-level accuracy. The models also had uneven performance and often did not know when they were wrong. Even worse, they still had near-random accuracy on some socially important topics, such as morality and law.

✅ By comprehensively assessing the breadth and depth of a model's academic and professional knowledge, the test can be used to analyze models across many tasks and to identify important gaps. In 2024, all of these gaps and machine-human distances have narrowed significantly, with the machine, in some knowledge areas, able to go head-to-head with experts in certain subjects. Look at how the MMLU scores of various LLMs have evolved over the years: from around 30 points with GPT-2 to 86 points with GPT-4, in just 4 years!

And now let's look at the state of the art in 2024. What does the term 5-shot below each LLM's results refer to? Language models can be evaluated in different modes depending on the number of examples provided to them before responding. This is known as "zero-shot", "one-shot" or "few-shot" evaluation:

- Zero-shot: the model responds directly without having seen any previous example.
- One-shot: the model receives one example before being evaluated.
- Few-shot (n-shot): the model receives several examples before answering the new question.

The term 5-shot therefore means that the model receives 5 examples of how a question should be answered before being evaluated on a new, similar question.
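In practice, an n-shot prompt is simply the evaluation question preceded by n solved examples. A minimal sketch of how a 5-shot prompt is assembled (the questions here are invented for illustration):

```python
# Five worked examples precede the question actually being evaluated.
examples = [
    ("2 + 2 = ?", "4"),
    ("7 - 3 = ?", "4"),
    ("5 * 6 = ?", "30"),
    ("9 + 8 = ?", "17"),
    ("12 / 4 = ?", "3"),
]
question = "15 - 6 = ?"

prompt = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
prompt += f"\n\nQ: {question}\nA:"
print(prompt)   # the model sees 5 solved examples, then the new question
```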
And now for the interesting part: let's take a look at some of the questions asked of LLMs! In this case, extracted from the paper, questions from macroeconomics, physics, mathematics, and medicine. Do you know the answer to any of them? 😊

✅ Are you interested in reading the paper where this evaluation is presented? You can read it here: Measuring Massive Multitask Language Understanding →

GPQA

GPQA is a challenging dataset of 448 multiple-choice questions written by experts in biology, physics, and chemistry. The questions are of high quality and extremely difficult: experts who hold or are pursuing PhDs in the corresponding domains achieve an accuracy of 65%, while highly qualified non-expert validators achieve only 34% accuracy, despite spending more than 30 minutes per question on average with unrestricted access to the web (i.e., the questions are "Google-proof"). And, not to be outdone, the questions are also difficult for the most advanced artificial intelligence systems: GPT-4 reaches an accuracy of less than 40%!

This test is not intended to frustrate doctors, future doctors, and LLMs for the fun of it; it aims further. If we want future AI systems to help us answer very difficult questions, for example when developing new scientific knowledge, scalable oversight methods have to be developed that allow humans to supervise their outputs, which can be difficult even when the supervisors are experts and knowledgeable. The difficulty of GPQA for both non-experts and state-of-the-art AI systems should enable realistic scalable-oversight experiments, which will hopefully help devise ways for human experts to obtain truthful information from AI systems that exceed human capabilities.

These are the metrics of some of the strongest LLMs on the market. Let's remember, once more, that zero-shot means the model answers a question directly, without having seen any previous example.

✅ And as I guess you are also very curious, I leave you the paper here: A Graduate-Level Google-Proof Q&A Benchmark →
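Under the hood, the headline number on a multiple-choice benchmark such as MMLU or GPQA is typically plain accuracy: an exact match between the option the model picks and the answer key. A minimal sketch with invented data:

```python
# Toy predictions and answer key for three multiple-choice questions.
predictions = {"q1": "B", "q2": "D", "q3": "B"}
gold        = {"q1": "B", "q2": "A", "q3": "B"}

correct = sum(predictions[q] == gold[q] for q in gold)
accuracy = 100 * correct / len(gold)
print(f"accuracy: {accuracy:.1f}%")   # 66.7% on this toy example
```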
Conclusions

The most advanced LLMs on the market, such as OpenAI's GPT-4, Google's PaLM, Meta's LLaMA and others, are trained on extremely large and diverse volumes of data, spanning a wide variety of sources. Within that vast amount of training data, there already exists information that can resolve the questions they face in the tests above. It is therefore mandatory to ask the following question: would these models improve with more data, more computation, and more billions of parameters, or, in a parallel line of work, should more emphasis be placed on the quality of the data, its treatment, and so on?

We can conclude that measuring the performance of LLMs is a complex task that requires various methods to capture their capabilities and limitations. Through qualitative tests such as MMLU, it is possible to assess more accurately the level of understanding, adaptability, and efficiency of these models across a wide range of tasks. However, as LLMs continue to evolve, it is critical that assessment methods also adapt to reflect their increasing sophistication and real-world impact. Ultimately, a rigorous and diversified approach to measuring their performance will enable these tools to continue to drive advances in artificial intelligence and across disciplines.

The opening image is, of course, generated with artificial intelligence, in this case Dall-E. Prompt: picture of a mad scientist working with an AI to solve complex problems. ■
October 8, 2024
AI & Data
An explanation of how AI is changing the world through football
Artificial intelligence has opened up a completely new horizon of technological possibilities that, day after day, are capable of taking our breath away. All these advances and developments are arriving at breakneck speed, which means that keeping up to date with everything that is happening is not merely very difficult; it is downright impossible. In order to be a little more up to date with what is happening, and where and how it is being applied, throughout this article we will talk about the solutions that have been found, and are being used, in a field that is well known and followed by millions of people: soccer.

Nike's spot (2014)

Let's take a very popular soccer spot as a guide for this post. Many of you will remember a very famous ad that Nike released when the 2014 World Cup was held in Brazil. In it, the stars of the moment were replaced by humanoids with perfect soccer skills. From now on, we will consider those humanoids to be the AIs of the spot. And this is the first fantasy from the advert that, today, we can say for sure has been fulfilled.

Clip from the Nike spot "The Last Game", in which the creator of the artificial intelligence that replaces real players speaks in statistical and quantitative terms about well-known players, in this image Zlatan Ibrahimović.

The second thing that has also come true is more quantitative: the exhaustive analysis of real players, and the extraction of statistics and predictions based on their game. In the following lines we will dig into how we reached this point, which in 2014 (it seems like yesterday!) looked so futuristic and far away. We will analyze how the application of AI has changed the way we enjoy soccer matches and the way the industry works in this sport that is so widely followed throughout the world.

(Real) robots playing soccer (2023)

The first of the surprises with which, in soccer, reality has surpassed fiction came this year, 2023. At the beginning of the year, Google DeepMind published a scientific article, with accompanying videos, documenting a huge milestone: a robot learning a sport as complex as soccer. The company investigated whether Deep Reinforcement Learning can teach a robot the complex, safe skills needed to play soccer. The agents were trained in a simulated self-play environment, and the results were then transferred to a real environment.

Deep Reinforcement Learning training of a DeepMind AI in a simulated environment.

One of the difficulties of moving from a simulated environment to reality is that it is impossible to simulate reality exactly as it is, 100%. In the simulated environment the agent learned dynamic movements and basic game strategies, and Google DeepMind managed to transfer those skills to real robots, which managed to play soccer.

DeepMind robots playing soccer in a real environment.

Before moving on: we mentioned a type of learning, Deep Reinforcement Learning, but what exactly is it? In short, we could say that it consists of machines learning something specific on their own, guided by certain rules. The key elements of Deep Reinforcement Learning, which fit together in the loop sketched after this list, include:

- Agent: the decision-making entity that interacts with the environment. It can be a computer program, a robot, or any other system capable of perceiving its environment and taking actions.
- Environment: the context in which the agent operates. It can be real or simulated.
- Actions: the decisions the agent can take at a given moment.
- State: the representation of the environment at a specific time.
- Reward: the feedback signal the environment sends to the agent after it performs an action. The agent's goal is to maximize the reward accumulated over time.
- Policy: the strategy the agent uses to select actions in a given state.
- Value function: evaluates the expected utility of a state or an action, in terms of the future rewards expected to be received.
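A minimal tabular sketch in Python of how these pieces interact, using a toy one-dimensional "pitch" invented for illustration (real Deep Reinforcement Learning replaces the value table with a neural network; this is in no way DeepMind's setup):

```python
import random

# Toy environment: positions 0..5 on a line; reaching position 5 is the "goal".
goal, n_positions, actions = 5, 6, [-1, +1]
q = {(s, a): 0.0 for s in range(n_positions) for a in actions}  # value estimates

for episode in range(500):
    state = 0                                        # state: where the agent is
    while state != goal:
        # Policy: mostly exploit the best-valued action, sometimes explore.
        if random.random() < 0.1:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: q[(state, a)])
        next_state = min(max(state + action, 0), n_positions - 1)
        reward = 1.0 if next_state == goal else -0.01    # environment feedback
        # Update the value estimate from the reward (tabular Q-learning).
        best_next = max(q[(next_state, a)] for a in actions)
        q[(state, action)] += 0.1 * (reward + 0.9 * best_next - q[(state, action)])
        state = next_state

# After training, the learned policy at the start is to move toward the goal.
print(max(actions, key=lambda a: q[(0, a)]))   # prints 1
```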
In short and very simply explained, Deep Reinforcement Learning works the way a child learns. Starting from an established objective, the child executes actions in an environment that bring them closer to, or further from, that objective. Moving closer to the target is rewarded (dopamine, in the case of humans, in most learning processes), and moving away from it is penalized.

✅ In the example of the child, if we analyze the process of learning to walk, we could say that, in an environment such as the living room of a house, a park, etc., the child (agent) receives penalties, such as the pain of a fall, when failing to approach the target, and receives rewards (applause, kisses, encouragement, satisfaction, etc.) when getting closer to the target or even managing to walk. This type of learning can, of course, be extrapolated to machines using algorithms.

In the following table we can see the rewards associated with the consequences of the actions (controlling, advancing, dribbling, shooting, etc.) that the agents (robots) perform in the environment (the soccer field). It is very clear that scoring a goal earns the maximum reward.

Fragment of the reward policy table from the paper Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning, by Google DeepMind.

While it is true that Google DeepMind's humanoid robots are a far cry from the AIs in the Nike spot... I wouldn't be so quick to score a goal against the human species! It is worth remembering that in games as complex and famous as chess or Go, the probability of a professional beating a machine is, today, very small. The AI that beat the world Go champion, also trained with Deep Reinforcement Learning, also bears the DeepMind brand.

The application of advanced analytics to football

At this point we have covered one of the fulfilled prophecies of the advert, AIs that are able to play soccer. But what about today's numerical deconstruction of players, predictions, statistics, etc.? Breaking a player down into numerical indicators is something that has been done for decades, but the classical methods are far from the way it is done today. Player statistics have historically been compiled by hand and by eye. This process, of great value to the clubs, has nowadays been automated to an extreme degree thanks to artificial intelligence, achieving total and absolute coverage of all players, all leagues, etc. What are the limits of this process today? As far as a video camera can go; in other words, as far as a smartphone can go.

There are numerous companies that offer services to soccer clubs based on the rich and diverse information in their databases, information gathered by recording hundreds and hundreds of matches from a variety of leagues and categories, from kids' football to professional soccer. Their goal?
To find the diamond in the rough: to spot the promising young footballer ahead of the competition. Or to find a replacement for an injured or sold player who is as similar as possible (mathematical similarity here) to the player being replaced. Or simply to find a player who meets an exact set of characteristics laid out by a team's coach. Or to analyze the strengths and weaknesses of teams (one's own or the opposition's), etc.

✅ All this is achieved thanks to the combination of two booming technological disciplines: Computer Vision and deep learning. Deep learning is, in very generic terms, a subdivision of machine learning that uses the power of neural network algorithms to address predictive tasks, mainly related to images or text. Thanks to it, a computer can detect that a set of pixels is a number and, in turn, that this number is a 10, for example (this could be used to identify shirt numbers, right?). Likewise, these algorithms can identify that a round set of pixels is a soccer ball, and so on.

Deep Learning can identify all the elements of a match for later analysis. Deep Learning algorithms enable frame-by-frame analysis of a match recording and can interpret what is happening at any given moment, for example:

- Which player is which (identifying shirt numbers, for example).
- Who has the ball.
- At what speed each player moves.
- In which areas each player usually has the most presence.
- How many passes (effective and ineffective) each player makes.
- ...

The limit of these KPIs lies in creativity.
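As an example of how one of these KPIs falls out of tracking data, here is a minimal sketch that estimates a player's speed from per-frame pitch coordinates of the kind a computer-vision tracker might output (the positions, in metres, and the 25 fps frame rate are invented for illustration):

```python
# Per-frame (x, y) positions of one player on the pitch, in metres.
positions = [(50.0, 30.0), (50.3, 30.1), (50.7, 30.2), (51.2, 30.4)]
fps = 25   # frames per second of the match recording

speeds = []
for (x0, y0), (x1, y1) in zip(positions, positions[1:]):
    dist = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5   # metres per frame
    speeds.append(dist * fps)                          # metres per second

print(f"average speed: {sum(speeds) / len(speeds):.1f} m/s")
```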
✅ Many clubs in the main European leagues already rely on this type of software to make data-driven decisions, with excellent results.

Conclusion

There is no doubt that artificial intelligence's capacity to surprise us has no limits. In less than a decade we have gone from the science fiction of an advertising spot to its real materialization. Such rapid and sudden advances, which in many cases can be life-changing, may at first sight be overwhelming and cause mistrust. So, to reassure those who view AI advances with fear and skepticism, here is a little spoiler from the Nike commercial: the humans, playing together, end up beating more than 100 artificial intelligences in a soccer match.

January 9, 2024