Building a Retrieval-Augmented Generation (RAG) System with Oracle 23AI, Cohere, and Flask
13 Feb 2025

What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that combines traditional information retrieval with generative models. The idea is to enhance the relevance and accuracy of answers by retrieving context from external sources, such as a database or documents, and then using that information to generate responses.

GTech’s Custom RAG project retrieves relevant documents from the Oracle 23AI database based on the user’s query. It then uses Cohere’s LLMs to generate detailed answers. The workflow involves embedding both documents and queries into vectors, comparing them, and reranking the results to ensure that the most relevant information is used in the final response.

Key Components

Vector Embeddings with Cohere

At the core of this system are Cohere’s embeddings: dense vector representations of text that capture its semantic meaning. Cohere provides embedding models (we use cohere.embed-multilingual-v3.0 in this project) that transform both documents and queries into vectors, and these vectors are what the system compares to measure similarity between a query and the stored documents.
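As a concrete illustration, here is a minimal sketch of generating these embeddings through the OCI Generative AI service with the Python SDK. It assumes a configured ~/.oci/config profile; the compartment OCID is a placeholder, the embed helper name is ours rather than the project’s actual code, and the Chicago endpoint anticipates the region note below:

```python
# A sketch of generating embeddings through the OCI Generative AI service.
# Assumes a configured ~/.oci/config profile; COMPARTMENT_OCID is a placeholder.
import oci

COMPARTMENT_OCID = "ocid1.compartment.oc1..example"  # placeholder
ENDPOINT = "https://inference.generativeai.us-chicago-1.oci.oraclecloud.com"

config = oci.config.from_file()
client = oci.generative_ai_inference.GenerativeAiInferenceClient(
    config, service_endpoint=ENDPOINT
)

def embed(texts, input_type="SEARCH_DOCUMENT"):
    """Return one 1024-dim vector per input text."""
    details = oci.generative_ai_inference.models.EmbedTextDetails(
        inputs=texts,
        serving_mode=oci.generative_ai_inference.models.OnDemandServingMode(
            model_id="cohere.embed-multilingual-v3.0"
        ),
        compartment_id=COMPARTMENT_OCID,
        # SEARCH_DOCUMENT for chunks at ingest time, SEARCH_QUERY for user queries
        input_type=input_type,
    )
    return client.embed_text(details).data.embeddings
```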

When processing a PDF document, the text is split into manageable chunks, cleaned, and converted into embeddings. These embeddings are stored in an Oracle 23AI database using the VECTOR datatype, which is designed specifically for this kind of data and provides high-performance storage, indexing, and querying for similarity searches over large datasets.
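A minimal sketch of that ingestion step, assuming python-oracledb against a 23ai instance and reusing the embed() helper above; the table name, chunking parameters, and connection details are illustrative, not the project’s actual schema:

```python
# A sketch of ingestion: chunk a PDF, embed the chunks, and store text +
# vectors in an Oracle 23ai VECTOR column. Table name, chunk sizes, and
# credentials are illustrative; embed() comes from the sketch above.
import array
import oracledb
from pypdf import PdfReader

def chunk_pdf(path, size=1000, overlap=100):
    """Naive fixed-size character chunking with overlap."""
    text = " ".join(page.extract_text() or "" for page in PdfReader(path).pages)
    text = " ".join(text.split())  # simple whitespace cleanup
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

conn = oracledb.connect(user="rag", password="***", dsn="localhost/FREEPDB1")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS docs (
        id        NUMBER GENERATED ALWAYS AS IDENTITY,
        chunk     CLOB,
        embedding VECTOR(1024, FLOAT32)  -- embed-multilingual-v3.0 is 1024-dim
    )""")

chunks = chunk_pdf("manual.pdf")
vectors = []
for i in range(0, len(chunks), 96):  # the embed endpoint caps inputs per call
    vectors.extend(embed(chunks[i:i + 96]))
cur.executemany(
    "INSERT INTO docs (chunk, embedding) VALUES (:1, :2)",
    [(c, array.array("f", v)) for c, v in zip(chunks, vectors)],
)
conn.commit()
```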

Initially, we tried using the Frankfurt region on Oracle Cloud Infrastructure (OCI) to generate embeddings. However, we encountered a limitation: this region does not support the cohere.embed-multilingual-v3.0 model. To resolve this, we switched to the Chicago region, which fully supports the model, allowing us to proceed smoothly with our implementation.

Knowledge Retrieval from Oracle 23AI Database

The next step is to retrieve relevant information from the Oracle 23AI database. To do this, the query vector is compared against the stored document vectors using a dot product similarity search, which identifies the top 10 most relevant documents by vector distance.

Each document is stored in the Oracle database together with its text and embedding, which lets the system pull contextually relevant content in real time. Oracle’s VECTOR datatype keeps storage compact and retrieval fast, even with large sets of document embeddings.
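For illustration, a retrieval query along these lines could look as follows, reusing the hypothetical docs table and embed() helper from the earlier sketches. With the DOT metric, VECTOR_DISTANCE returns the negated dot product, so ascending order surfaces the most similar chunks first:

```python
# A sketch of top-10 retrieval against the hypothetical docs table.
oracledb.defaults.fetch_lobs = False  # fetch CLOB chunks as plain strings

def retrieve(query, k=10):
    qvec = array.array("f", embed([query], input_type="SEARCH_QUERY")[0])
    cur.execute(
        """SELECT chunk
             FROM docs
            ORDER BY VECTOR_DISTANCE(embedding, :qv, DOT)
            FETCH FIRST :k ROWS ONLY""",
        qv=qvec, k=k,
    )
    return [row[0] for row in cur]
```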

Reranking with Cohere’s Reranking Model

After retrieving a set of relevant documents, we need to rank them by how well each one actually matches the user’s query. This is done with Cohere’s reranking model, rerank-multilingual-v3.0.

The reranking model sorts the documents by relevance, ensuring that only the best matches are passed on to answer generation; this refinement step is crucial for the accuracy of the final response.
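A minimal sketch of that step with the Cohere Python SDK; the API key and the top_n cutoff are illustrative choices, not necessarily the project’s:

```python
# A sketch of reranking the retrieved chunks with Cohere's rerank API.
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")  # placeholder

def rerank(query, chunks, top_n=3):
    result = co.rerank(
        model="rerank-multilingual-v3.0",
        query=query,
        documents=chunks,
        top_n=top_n,
    )
    # Results arrive sorted by relevance; map their indices back to the chunks.
    return [chunks[r.index] for r in result.results]
```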

Answer Generation with Cohere LLM

Once the relevant documents have been retrieved and reranked, the next step is generating the final answer. We use a Cohere LLM, command-r-plus, for this task. The LLM takes the query and the ranked documents as input and generates a response grounded in the provided context.

A key feature of this system is that it combines the generative power of the LLM with specific, context-rich information from the retrieved documents. This ensures that the generated answer is not only relevant to the query but also grounded in the relevant documents.
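A sketch of that grounding step: one way to keep the model anchored to the retrieved context is to pass the reranked chunks through the chat API’s documents parameter, reusing the co client from the previous sketch:

```python
# A sketch of grounded generation with command-r-plus via Cohere's chat API.
def generate_answer(query, top_chunks):
    response = co.chat(
        model="command-r-plus",
        message=query,
        documents=[{"text": c} for c in top_chunks],  # grounding documents
    )
    return response.text
```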

User Interface with Flask

To make the system interactive, we built a simple web interface using Flask. The user submits a question through a web form, and the Flask server processes the query by calling backend functions. These functions retrieve relevant documents, rerank them, and generate an answer.

Flask handles the routing and communication between the frontend (HTML) and the backend, ensuring smooth interaction for the user. When a question is asked, the backend processes the query and returns the answer along with the relevant documents.
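A minimal Flask sketch wiring the earlier helpers together; the route, template, and form field names are illustrative, not the project’s actual UI:

```python
# A sketch of the web layer: a single route that runs the RAG pipeline.
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def ask():
    answer, docs = None, []
    if request.method == "POST":
        query = request.form["question"]
        docs = rerank(query, retrieve(query))  # retrieve, then rerank
        answer = generate_answer(query, docs)  # grounded generation
    return render_template("index.html", answer=answer, documents=docs)

if __name__ == "__main__":
    app.run(debug=True)
```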

How It Works

1. The user submits a question via the web interface.

2. The system converts the query into an embedding using Cohere.

3. The query embedding is matched against stored document embeddings in the Oracle 23AI database (VECTOR datatype).

4. The most relevant documents are retrieved and reranked using Cohere’s reranking model.

5. The system generates a response based on the retrieved documents using Cohere’s LLM.

6. The answer, along with the relevant documents, is returned to the user.

User query → query embedding (Cohere model) → knowledge retrieval (Oracle 23AI database) → top 10 relevant documents (dot product similarity) → reranking (Cohere reranking model) → answer generation (LLM grounded in the retrieved documents) → return answer
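Composed with the hypothetical helpers from the earlier sketches, the whole flow above reduces to a few lines:

```python
# End-to-end composition of the pipeline using the earlier sketch helpers.
def answer_question(query):
    top_docs = rerank(query, retrieve(query))
    return generate_answer(query, top_docs), top_docs
```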

Demo:

The first screenshot shows that the system can also handle casual conversation, like a normal chatbot.

The second screenshot shows a response generated with RAG, grounded in the document we uploaded.

Conclusion

This RAG-based system is a powerful combination of knowledge retrieval and generative AI. By integrating Cohere’s embeddings and reranking models with Oracle 23AI’s vector datatype and Flask, the system can provide contextually accurate answers to user queries, backed by external knowledge sources.

By combining retrieval-based approaches with generative models like Cohere’s LLM, we can handle a wide variety of questions and ensure users receive the most relevant and accurate answers based on the information available in the database.

Toygun Toğay, System and Database Management Consultant