Mathematical revolution! LLMs break down barriers and tackle mathematical challenges (Part 2)
One of the most surprising things when you start playing around with LLMs is how much trouble they have with very simple mathematical problems, compared with the enormous ease with which they handle tasks that are very complex for the average human being, such as poetry. For example, take the prompt: "Create an alexandrine poem about how difficult it is for an LLM to solve elementary school level math problems. Write it as if you were Federico García Lorca." The (almost instantaneous) answer is:

Numbers surround me, foreign in their flight,
like shadows that haunt the threshold of my sight.
My verses, flowing in rivers of language,
entangle, captive, in the cold of a message.
One and two, the numbers of the day,
laugh mockingly at my melancholy dismay.
I, who master the word, the rhythm in each line,
stumble in labyrinths where the mind declines.
Oh, impossible sums, subtractions that resist!
Multiplication tables, like swords, persist.
I'd like to decipher what within me refrains,
but cold calculation is a song that constrains.
Reason overwhelms me, blinds me with its clarity,
and in algebra's fog, I lose my liberty.
I am a poet, not arithmetic; in my heart
burns the light of the moon, which never spurns.
So, I keep on fighting, though the echo ignores,
in the infinite lament of this mind that implores.
Call me metaphor, figure, or solution,
but I find no redemption in numbers' revolution.

All this in a matter of 5 seconds. The line "I am a poet, not arithmetic, in my heart burns the light of the moon, which never spurns" is very Lorca, don't you think?

Now, let's move on to another question. Observe the following:

Sugar! What a mistake GPTLorca has made! It is essential to remember that, in all the cases of errors in basic math problems discussed here, it is assumed that no external agents are involved, such as the system calculator.

A normal family PC can perform roughly between 10 and 100 billion mathematical operations per second, depending on the processor, graphics card, and so on. And those operations are, in most cases, not as basic as the one above: we are talking about vector operations, matrix operations, etc., always executed correctly, without a single error. So why do LLMs fail to answer such basic questions?

Let's recall a bit of theory. LLMs are primarily designed to predict the next word in a sequence based on textual patterns learned from large amounts of data. As shown in the diagram below, an LLM is a type of artificial intelligence that can understand and generate text, word by word, as if it were having a conversation with you. That text is, after all, transformed into a numeric vector that goes into a box, and the response is another numeric vector transformed back into natural text. The response is produced after some "magic" inside the box (and that magic is a very, very large amount of mathematical operations).

Likewise, LLMs do not understand numbers the way you do. You know that 1 represents the unit and is the neutral element of multiplication, that is, any number multiplied by 1 is still the same number. It is neither prime nor composite, and let's not forget that it is also a universal divisor, because any number divided by 1 is still that number. You know that if you add two units to it you get 3, and if you subtract one unit from 1 you get 0. And, as the song said, it is also the loneliest number ("One", Three Dog Night). You are also aware of the following: you can add 10 days to a Monday and you get a completely different day of the week, which would be Thursday. And if the Monday you start from is the 30th of the month, you do not land on the 40th (or 41st) day; you land on the 10th or the 9th (depending on the month!).
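To make that calendar arithmetic concrete, here is a minimal Python sketch; the specific date (September 30, 2024) is just a convenient example of a Monday that falls on the 30th of its month:

```python
from datetime import date, timedelta

# A Monday that happens to be the 30th of its month.
monday = date(2024, 9, 30)
later = monday + timedelta(days=10)

print(monday.strftime("%A %d"))  # Monday 30
print(later.strftime("%A %d"))   # Thursday 10 (the month wraps, there is no "40th")

# The weekday itself is plain modular arithmetic: counting Monday as 0,
print((0 + 10) % 7)              # 3, i.e. Thursday
```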
You know other things, like that 100/4 is 25, as well as how to solve systems of equations or quadratic equations, following, step by step, the logical instructions your teacher gave you for that purpose. I am also sure that you are able to correctly solve these questions that Daniel Kahneman raises in his book Thinking, Fast and Slow (although your instinct first tends to give the wrong option, you are able to come up with a logic that leads you to the right answer):

If 5 machines take 5 minutes to make 5 pots, how long would it take 100 machines to make 100 pots? 100 minutes or 5 minutes?

In a lake there is an area with water lilies. Every day the area doubles in size. If the area takes 48 days to cover the whole lake, how long would it take to cover half the lake? 24 days or 47 days?

(The correct answers are 5 minutes and 47 days: each machine still makes one pot in 5 minutes, and if the area doubles every day, it must have covered half the lake exactly one day before covering all of it.)

For an LLM, "one" is just a sequence of three characters, "o" + "n" + "e", and 1 is simply the character "1", with no numerical logic attached to it. Think of a programming language, for example Python: the strings "one" + "one" give "oneone" and "1" + "1" gives "11", whereas the integers 1 + 1 give 2 (a short sketch below spells this out).

📎 Before I go on, I'll tell you a funny curiosity: how does an LLM handle a "mathematical" task such as naming a random number? For comparison, the NumPy library picks random numbers using the Mersenne Twister algorithm, which generates deterministic sequences that appear random. These sequences can be controlled by setting a random seed, and NumPy also offers functions to draw random numbers from various distributions (uniform, normal, etc.). An LLM, by contrast, often generates the number 42 as a response, due to its popularity in pop culture, especially in Douglas Adams' science fiction series "The Hitchhiker's Guide to the Galaxy", where 42 is humorously described as "the answer to the fundamental question of life, the universe and everything." In the literature many LLMs were trained on, the concept of "random" is linked to 42, among other reasons because that literature includes a lot of NumPy code in which users set 42 as the random seed. 😊
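As a small aside, here is a minimal NumPy sketch of that seeding habit; the value 42 is just the convention the text refers to, nothing mathematically special:

```python
import numpy as np

# Newer, recommended interface: seeding makes the "random" sequence reproducible.
rng = np.random.default_rng(seed=42)
print(rng.integers(1, 101, size=5))    # the same five numbers on every run
print(rng.normal(loc=0.0, scale=1.0))  # one draw from a normal distribution

# Older global-state style, seen in a lot of tutorial code; this is the
# Mersenne Twister generator mentioned above.
np.random.seed(42)
print(np.random.randint(1, 101))
```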
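And, going back to the characters-versus-numbers point a little earlier, a tiny Python comparison makes it explicit (purely illustrative):

```python
# To an LLM, digits are tokens, much as strings are to Python's + operator.
print("one" + "one")  # 'oneone' -> concatenation, no arithmetic involved
print("1" + "1")      # '11'     -> still concatenation
print(1 + 1)          # 2        -> only genuine integers get genuine arithmetic
```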
The following heat map analyzes the role of temperature in ChatGPT's selection of numbers from 1 to 100: lower temperatures lead to more deterministic, predictable, and biased choices, while higher temperatures produce more varied, creative responses.

The difficulty LLMs have in solving mathematical problems through logical reasoning, as humans do, has several causes. Let's look at the main reasons one by one:

Probabilistic nature of language. An LLM like GPT works by predicting which word (or token) is most likely to come next, based on the statistical patterns it has learned from the data. While this capability is incredibly powerful for linguistic tasks, mathematical problems require accuracy and precision, not probabilistic approximations.

Limited symbolic reasoning. Mathematics requires symbolic reasoning: working with numbers, variables, equations, and precise logical operations, something LLMs do not handle natively. Although they may have "seen" many equations and mathematical operations during training, LLMs are not inherently designed to manipulate mathematical symbols accurately or to follow strict mathematical rules reliably.

Lack of deep understanding of mathematical concepts. LLMs do not "understand" mathematics in the sense that a human being does. They have been trained to handle text, not to develop a conceptual understanding of mathematical rules or of the logic underlying complex mathematical operations.

Cumulative errors. Solving mathematical problems often involves performing several interdependent steps with a high level of accuracy. Since LLMs can make errors at any intermediate step, due to the probabilistic nature of their process, a small error at one stage can lead to a completely incorrect final answer.

Limited long-term memory manipulation. Although LLMs can handle relatively long contexts, they do not have persistent memory or the ability to retain and reuse previous information the way a human does when reasoning about a mathematical problem involving multiple steps. This limits their ability to carry out continuous multi-step reasoning, such as that required in many mathematical problems.

Precision is key in mathematics. Unlike language, where answers can be flexible or approximate, mathematics requires absolute precision. LLMs can generate answers that "sound" correct or plausible but do not actually meet the precision required for a correct mathematical solution.

Mathematical Problem Solving Challenges in LLMs

Let us put LLMs to the test by trying to get them to solve mathematical problems. The paper presenting this test studies the relationship between the surface form of a mathematical problem, that is, its wording or presentation, and how easily LLMs solve it. The authors find that small changes in problem formulation can have a significant impact on both the distribution of answers and the solving rate, exposing the sensitivity and lack of robustness of these models on complex mathematical problems. They propose the Self-Consistency-over-Paraphrases (SCoP) method, which diversifies reasoning paths by generating multiple versions, or paraphrases, of the surface formulation of the problem, in order to improve performance on this type of mathematical reasoning (a simplified sketch of the idea appears a little further down). They evaluate this approach on four mathematical reasoning datasets and three large language models, showing that SCoP improves performance compared with the standard self-consistency approach, especially on problems that initially seemed unsolvable.

Let us briefly recall that paraphrasing means rewriting or rewording what you have read or learned in your own words without distorting the meaning. When paraphrasing, it is important to make the reader feel that you have understood the topic and that you are conveying it correctly.

In this image, you can see the comparison of the answer distribution and the solving rate between surface-form variations of a mathematical problem when GPT-3.5-turbo is asked to use Self-Consistency-over-Paraphrases. The solving rate can vary dramatically between surface forms with equivalent semantics. It is not surprising, then, that the paper is titled "Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models". The table shows examples where the original problems and their paraphrased variations show a substantial difference in solving rate with GPT-3.5-turbo.
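Just to fix ideas, here is a very rough Python sketch of the SCoP recipe as described above. It is a simplified illustration, not the authors' implementation: generate_paraphrases and ask_model are hypothetical placeholders standing in for real LLM calls, and the counts of paraphrases and samples are arbitrary.

```python
from collections import Counter

def generate_paraphrases(problem: str, k: int) -> list[str]:
    """Placeholder: in the paper, an LLM rewrites the problem k times
    without changing its meaning. Here we just return identical copies."""
    return [problem] * k

def ask_model(problem: str) -> str:
    """Placeholder: one sampled solution from the LLM, reduced to its
    final answer. A real implementation would call a model API."""
    return "5 minutes"

def scop_answer(problem: str, n_paraphrases: int = 4, n_samples: int = 5) -> str:
    """Self-Consistency-over-Paraphrases, roughly: sample several answers
    for the original problem and for each paraphrase, then majority-vote."""
    votes = Counter()
    for variant in [problem] + generate_paraphrases(problem, n_paraphrases):
        for _ in range(n_samples):
            votes[ask_model(variant)] += 1
    return votes.most_common(1)[0][0]

print(scop_answer("If 5 machines take 5 minutes to make 5 pots, "
                  "how long do 100 machines take to make 100 pots?"))
```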
✅ As I know you are very curious, here is a link to the paper: Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models →

And now let's get down to the sour part, the students' grades. As you can see, in 2024 the metrics on the MATH test are very weak compared with other tests used to assess the quality of models. However, although the current picture looks bleak, it is, in fact, very positive. In 2021, the main models on the market barely reached 25 points on this test, which is undoubtedly one of the most difficult for a machine to pass. The LLMs of the leading AI developers on the market today (Meta, Google, Anthropic, etc.) now pass it. Look at how the LLM metrics on the MATH test have improved in a short period of time (less than 5 years).

You may have noticed that, by 2024, the metrics are very different from those provided in the table above. It is common for each LLM developer to use its own metrics. This happens for several reasons:
- They have different objectives (each model is optimized for different skills: mathematics, creativity, etc.).
- Companies adjust metrics to excel in specific areas and differentiate themselves in the market.
- Metrics are designed to align with the capabilities that manufacturers want to showcase.
- Proprietary data is used for testing, leading to differences in assessments and scores.

So, what is the best thing to do about it? Quite simply, be technology agnostic and dance with everyone at the party. Technology is a means, not an end, and, ultimately, you have to work with the technology that best suits your use case; bad practice is, without a doubt, to adapt the use case to the technology. Adopting such a stance in LLM evaluation is crucial to getting an unbiased and objective view of their true performance. The MATH test, like other metrics, can reveal specific strengths, but it should not be viewed as an absolute verdict on the overall capability of a model. By maintaining an open, evidence-based perspective, it is possible to assess technological improvements without favoritism, allowing fair comparisons and encouraging continued development at this unique moment in the still-young history of AI.

■ MORE OF THIS SERIES
IA & Data: Are LLMs transforming the future? Will they overtake us? Key methods to measure their power (Part 1), October 8, 2024
October 22, 2024