Retrieval-Augmented Generation (RAG): What do I need to know to get started?
Introduction

Generative artificial intelligence has transformed the way we interact with technology. However, traditional models have a limitation: they can only generate responses based on knowledge acquired during their training. How can we ensure they can access up-to-date and specific information? This is where Retrieval-Augmented Generation (RAG) comes into play. RAG is a technique that combines information retrieval with generative models to improve the accuracy and relevance of responses. If this brief explanation leaves you wanting more, we have already discussed it in this blog post.

Implementing RAG systems enhances the accuracy of generative models and enables smart, contextualized access to information, which is essential for adapting AI to business applications. In this article, we will delve deeper into the key elements of this approach: vector databases, embeddings, and content segmentation (chunks). Understanding these concepts is crucial to optimizing a RAG system's performance and maximizing its ability to retrieve relevant information.

Vector databases and their role in RAG

Traditional databases store information in structured tables, allowing queries based on exact keyword matches; you have probably worked with or heard of the SQL language. However, this approach falls short when it comes to searching for information by meaning. To address this, RAG systems use vector databases.

Vector databases represent the meaning of data in a multidimensional space, which makes semantic search possible. Instead of looking for exact matches, these databases identify the similarity between concepts. For instance, if a user searches for car, the system can also find information related to automobile or vehicle, even if those words do not appear verbatim in the document. Some of the most widely used vector databases include ElasticSearch, Milvus, and Azure AI Search, among other hyperscaler offerings, although there are also open options such as OpenSearch.

How do vector databases work?

1. Text-to-vector conversion: Each piece of information is converted into an embedding (a mathematical representation in a high-dimensional space).
2. Storage in the database: These embeddings are stored in a vector database optimized for fast queries.
3. Similarity search: When a user submits a query, it is also transformed into an embedding using the same model that was used to store the information. This is fundamental; otherwise, it would be like having all the information stored in one language (e.g., English) and asking a question in another (e.g., Spanish): there would be no common ground for comparison.

By converting both the stored information and the queries into numerical vectors, it becomes possible to calculate their similarity using metrics such as cosine similarity or Euclidean distance. Essentially, because they are numerical representations, similarities are determined through mathematical operations on those vectors.
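To make this concrete, here is a minimal sketch of those two metrics using NumPy. The three-dimensional vectors are invented purely for illustration; real embeddings are produced by a model and have hundreds of dimensions, as in the examples later in this article.

import numpy as np

# Toy 3-dimensional "embeddings", invented purely for illustration;
# real embeddings produced by a model have hundreds of dimensions
vehicle = np.array([0.9, 0.8, 0.1])
car = np.array([0.8, 0.9, 0.2])
book = np.array([0.1, 0.2, 0.9])

def cosine_similarity(a, b):
    # Cosine of the angle between the two vectors: closer to 1 means more similar
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    # Straight-line distance between the two vectors: smaller means more similar
    return np.linalg.norm(a - b)

print("cosine(vehicle, car):", round(cosine_similarity(vehicle, car), 4))
print("cosine(vehicle, book):", round(cosine_similarity(vehicle, book), 4))
print("euclidean(vehicle, car):", round(euclidean_distance(vehicle, car), 4))
print("euclidean(vehicle, book):", round(euclidean_distance(vehicle, book), 4))

In practice you rarely implement these metrics by hand: vector databases and libraries such as sentence-transformers, used in the examples below, provide them out of the box.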
After understanding the theory behind embeddings and similarity metrics, let's move on to a practical example that illustrates how semantic search compares with traditional text-based search.

Practical example: semantic search vs traditional text-based search

Suppose we create a database with the following terms: truck, car, vehicle, book, snake, bicycle, motorcycle. If we perform a lexical search (for example, using SQL) querying exactly for the word vehicle, we will only get that exact match. This is because this type of search does not analyze word meanings; it only checks whether the exact form matches.

import sqlite3
# Create database and table
conn = sqlite3.connect('search.db')
cursor = conn.cursor()
cursor.execute("DROP TABLE IF EXISTS terms")
cursor.execute("CREATE TABLE terms (word TEXT)")
# Insert values
terms = ["truck", "car", "vehicle", "book", "snake", "bicycle", "motorcycle"]
cursor.executemany("INSERT INTO terms (word) VALUES (?)", [(t,) for t in terms])
conn.commit()
search_query = 'vehicle'
cursor.execute("SELECT word FROM terms WHERE word = ?", (search_query,))
results = cursor.fetchall()
print("Lexical search results:", results)
____________
Lexical search results: [('vehicle',)]
____________

By contrast, when applying a semantic search using embeddings and a metric such as cosine similarity, we can calculate the degree of similarity between the word vehicle and the rest of the terms in the database. As a result, we obtain a score between 0 and 1 for each word, where 1 indicates very high semantic similarity and 0 a total lack of semantic relation.

from sentence_transformers import SentenceTransformer, util
import numpy as np
import warnings
import sqlite3 # Ensure DB connection
warnings.filterwarnings("ignore")
# Load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")
# Retrieve terms from the database
conn = sqlite3.connect("search.db") # Adjust to your actual DB
cursor = conn.cursor()
cursor.execute("SELECT word FROM terms")
terms = [row[0] for row in cursor.fetchall()]
# Generate embeddings
terms_embeddings = model.encode(terms)
# Encode the query
query = "vehicle"
query_embedding = model.encode(query)
# Compute cosine similarity using util.pytorch_cos_sim
cos_sim = util.pytorch_cos_sim(query_embedding, terms_embeddings)
# Display sorted results
print("Semantic search results:")
for word, score in sorted(zip(terms, cos_sim[0].tolist()), key=lambda x: x[1], reverse=True):
print(f"{word}: {score:.4f}")
____________
Semantic search results:
vehicle: 1.0000
bicycle: 0.3503
car: 0.3076
motorcycle: 0.3061
truck: 0.2938
snake: 0.2471
book: 0.2063
____________

As we can see, words like truck, car, bicycle, or motorcycle score relatively high because they belong to the same conceptual category (means of transport), while others like book or snake score lower, as they are not semantically related to the search term. An interesting detail is that car does not score as high as we might expect. This is likely due to the specificities of the embedding model used, which is why choosing the right model is critical depending on the application domain and the types of queries expected.

This is exactly the approach that RAG (Retrieval-Augmented Generation) systems leverage, enabling the retrieval of relevant information even when there is no exact match between the query terms and the stored data.

What are embeddings and why are they essential?

If the term embedding is unfamiliar, you can think of it as the way AI ‘translates’ human language into numbers that computers can understand. More precisely, an embedding is a numerical representation of text in a high-dimensional vector space. Each word, phrase, or document is transformed into a vector, and the distance or similarity between these vectors reflects how similar they are in meaning. Thanks to this approach, AI models can understand, for example, that vehicle and car are closely related even though they are not the same word, as we saw in the previous semantic search example.

Types of embeddings

There are different types of embeddings depending on the level of granularity:

Word Embeddings: Represent individual words (e.g., Word2Vec, GloVe).
Sentence Embeddings: Capture the meaning of full sentences (e.g., MiniLM, SBERT). These are currently the most commonly used and are the kind used in both the previous and upcoming examples.
Document Embeddings: Represent long texts as a single vector.

The importance of choosing the right model

Not all embedding models are created equal, and choosing the right one can significantly impact the performance of a RAG system. Some key factors to consider are:

Model size: Larger models tend to be more accurate but require more computational resources.
Language: Some models are trained only in English, which may limit their effectiveness in other languages such as Spanish or French.
Specific domain: Some models are trained on specialized fields such as medicine or law, which improves their performance in those areas.

In the previous example, we used the 'all-MiniLM-L6-v2' model to compute the similarity between the word vehicle and the other terms. While the results were reasonable, some scores were lower than expected, for instance car, which we might expect to have a much stronger relation to vehicle.

from sentence_transformers import SentenceTransformer, util
import numpy as np
import warnings
import sqlite3 # Ensure DB connection
warnings.filterwarnings("ignore")
# Embedding model
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
# Retrieve terms from the database
conn = sqlite3.connect("search.db") # Adjust to your actual DB
cursor = conn.cursor()
cursor.execute("SELECT word FROM terms")
terms = [row[0] for row in cursor.fetchall()]
# Generate embeddings
terms_embeddings = model.encode(terms)
# Encode the query
query = "vehicle"
query_embedding = model.encode(query)
# Compute cosine similarity using util.pytorch_cos_sim
cos_sim = util.pytorch_cos_sim(query_embedding, terms_embeddings)
# Display sorted results
print("Semantic search results:")
for word, score in sorted(zip(terms, cos_sim[0].tolist()), key=lambda x: x[1], reverse=True):
print(f"{word}: {score:.4f}")
____________
Semantic search results:
vehicle: 0.9830
car: 0.9422
truck: 0.7876
motorcycle: 0.4804
bicycle: 0.4235
book: 0.4017
snake: 0.2485
____________

Using exactly the same code but replacing the model with 'paraphrase-multilingual-MiniLM-L12-v2' shows a significant improvement in the results. For instance, the similarity between vehicle and car is now much higher, better reflecting the semantic relationship between the two. This improvement is not just because the model is more powerful, but primarily because it is multilingual. Keep in mind that most embedding models are trained mainly in English, which often leads to better results when working in that language and can limit performance in other languages such as Spanish. In contrast, paraphrase-multilingual-MiniLM-L12-v2 has been trained on texts in more than 50 languages, allowing it to handle queries in Spanish (and other languages) much more accurately.

Choosing the right embedding model directly impacts the quality of the generated responses, making it a key factor in RAG system implementation.

The importance of chunks and how to segment documents

So far, we've seen how semantic search works at the level of words or short phrases, as in the case of vehicle and car. These examples helped us understand the logic of embeddings and vector similarity. However, in real-world scenarios, user queries usually involve much longer texts: articles, manuals, emails, reports, and so on. This raises a fundamental question: how do we organize and structure that information to make it useful in a RAG system? The answer lies in a key technique: chunking, or text segmentation.

What is chunking and why is it so important?

When working with large documents, it's neither efficient nor recommended to treat them as a single unit. Instead, they are divided into smaller fragments called chunks, which allow for:

Reduced processing cost: Only the relevant fragments are analyzed, rather than the entire document.
Improved accuracy: Generative models can focus more effectively on the parts of the text that really matter.
Noise reduction: Well-defined chunks minimize the inclusion of irrelevant information.

However, it's not just a matter of splitting at random. The way we segment text has a direct impact on the quality of the generated responses.

Segmentation strategies

There are two main methods for splitting text:

Length-based segmentation (by words or characters): Set limits such as 200 words or 1,000 characters per chunk. This is a quick and easy technique to implement, but it can interrupt sentences or ideas mid-way.
Semantic segmentation: This approach uses AI models that detect topic changes within the content and split the text into logical sections, helping to better preserve meaning. While more computationally expensive, it allows for more precise organization.

Additionally, semantic segmentation doesn't always require advanced AI models: it can also be done simply by splitting the text according to headings or sections, as in Word or PDF documents where the divisions are predefined.

How does segmentation affect information retrieval?

Chunks that are too large may include irrelevant information, reducing the accuracy of the generated response. Chunks that are too small may lack context and cause the system to miss relevant information. The best approach is usually to combine both strategies to obtain balanced fragments that optimize semantic retrieval. The sketch below illustrates the length-based approach.
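To tie these ideas together, here is a minimal sketch of the length-based strategy combined with the retrieval step: the text is split into fixed-size, slightly overlapping chunks, each chunk is embedded with one of the sentence-transformers models used earlier, and the chunks most similar to a query are retrieved. The sample document, chunk sizes, and helper function are invented for illustration; a real system would store the embeddings in a vector database and tune the segmentation to its content.

from sentence_transformers import SentenceTransformer, util

# Length-based chunking: fixed-size chunks with a small overlap so that
# text near a boundary still appears (partially) in two chunks
def chunk_text(text, chunk_size=200, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Invented sample document, just for illustration
document = (
    "RAG systems combine information retrieval with generative models. "
    "Documents are split into chunks, each chunk is converted into an embedding, "
    "and the embeddings are stored in a vector database. "
    "At query time, the user's question is embedded with the same model "
    "and the most similar chunks are retrieved and passed to the generative model as context."
)

chunks = chunk_text(document, chunk_size=120, overlap=30)

# Embed the chunks and the query with the same model
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
chunk_embeddings = model.encode(chunks)
query = "How does a RAG system find the relevant chunks?"
query_embedding = model.encode(query)

# Retrieve the top 2 chunks by cosine similarity
scores = util.pytorch_cos_sim(query_embedding, chunk_embeddings)[0]
top = sorted(zip(chunks, scores.tolist()), key=lambda x: x[1], reverse=True)[:2]
for chunk, score in top:
    print(f"{score:.4f} | {chunk!r}")

The retrieved chunks are what a RAG system would then pass to the generative model as context for producing the final answer.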
Benefits of implementing RAG with well-designed vector databases, embeddings, and chunks

Proper implementation of these elements in a RAG system enables:

Access to up-to-date and accurate information without the need to retrain the generative model.
Advanced semantic search that improves the relevance of responses.
Optimized data storage and retrieval, ensuring efficiency and scalability.
An improved user experience thanks to contextualized and coherent answers.

Conclusion

Retrieval-Augmented Generation (RAG) represents a significant advancement in artificial intelligence, allowing generative models to access external information efficiently and accurately. As we've seen, the success of these systems depends on three fundamental pillars: choosing the right embedding model, designing the structure of the vector database well, and implementing effective document segmentation.

With the tools and concepts explained in this article, you now have the basic knowledge needed to start implementing your own RAG systems. If you'd like to explore specific applications that could benefit your business, we recommend our article on Creative AI in business. As AI continues to evolve, these techniques will play an increasingly important role in applications such as virtual assistants, intelligent information retrieval, and cognitive task automation.

Investing in a solid RAG strategy not only improves the accuracy of generative models but also enables smarter, contextualized access to information, paving the way for AI that is more useful, reliable, and tailored to the specific needs of your organization.

✅ Are you ready to take the next step in implementing RAG systems in your company?
May 6, 2025