30 Sep 2023

Mixed RAGs and Output Validation

In my previous blogpost about RAG with Llama Index, I went through how to develop RAGs with Llama Index, a library for creating LLM applications. For this post, I have extended my experiment to mix two retrieval approaches: vector-based and bag-of-words (text-based) algorithms.

Bag of Words and Retrievals

In most search or indexing databases like Elasticsearch, OpenSearch, and Solr, the general algorithms used are Best Matching 25 (BM25) and term frequency/inverse document frequency (TF-IDF). These generate a relevance score that is used to rank the entries in the database. Both fall under the bag-of-words family of algorithms, where the existence and occurrence of a word drive the relevance. The simplified steps of a bag-of-words pipeline are the following (a toy sketch follows the list):

  1. Read and parse the data source.
  2. Identify the unique words in each data entry, stem them, and remove stop words. This depends on the word dictionary you use; some common stop words are: and, are, that, then, there, such.
  3. Store the unique words with the ID of the data/document entry in an index, e.g. Hello -> [Doc1, Doc2] and World -> [Doc2].
  4. A prompt is fed in and parsed the same way: remove stop words, tokenize, and/or apply stemming.
  5. Using the output of step 4, return the entries that have a hit.
  6. Compute a score using the term counts, total unique words, total words in the corpus, etc.
  7. Return the top N results and present them to the user.
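To make these steps concrete, here is a toy Python sketch of the pipeline using a simple TF-IDF-style score. The stop-word list, documents, and scoring formula are illustrative stand-ins, not what Elasticsearch or BM25 actually compute.

```python
import math
import re
from collections import Counter, defaultdict

# Illustrative stop-word list; real engines ship much larger dictionaries.
STOP_WORDS = {"and", "are", "that", "then", "there", "such", "the", "a", "is", "from"}

def tokenize(text):
    # Lowercase, split on non-letters, drop stop words (no stemming in this toy).
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

# Step 1: the "data source" is just two hard-coded documents here.
docs = {
    "Doc1": "Hello handbook of instructions",
    "Doc2": "Hello World instructions",
}

# Steps 2-3: build an inverted index mapping each unique term to the documents containing it.
index = defaultdict(set)
term_counts = {}
for doc_id, text in docs.items():
    tokens = tokenize(text)
    term_counts[doc_id] = Counter(tokens)
    for token in tokens:
        index[token].add(doc_id)

def search(query, top_n=2):
    # Steps 4-7: tokenize the prompt, collect hits, score with TF * IDF, return the top N.
    scores = Counter()
    for term in tokenize(query):
        matching = index.get(term, set())
        if not matching:
            continue
        idf = math.log(len(docs) / len(matching))
        for doc_id in matching:
            tf = term_counts[doc_id][term] / sum(term_counts[doc_id].values())
            scores[doc_id] += tf * idf
    return scores.most_common(top_n)

print(search("hello world"))  # ranked (doc_id, score) pairs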

The general issue with these types of algorithms is that they do not take into account synonymous words or the position of a term. As an example, "with resource" and "without resource" may have the same relevance score if the user is looking for "resource"; this is because the algorithm only uses the term "resource" as the input when computing relevance. There are ways to handle this use case, such as setting up keywords or modifying the score by detecting words that provide more context; the issue, though, is that you end up hard-coding all of the use cases. That is why a lot of AI or ML workflows lean on vector retrieval rather than bag of words.

The good thing about these types of algorithms, though, is that scoring by term counts is simpler and more explainable to the user. ML or embedding-based retrievals introduce additional complexity through chunking, overlaps, the data the model was trained on, conversion to vector space, etc. Given these additional complexities, the embedding approach is also more computationally expensive.

Vector and Embedding-based Retrievals

With the current expansion of ML and AI in general, the number of vector stores available, whether proprietary or open source, is on the rise, and traditional database technologies are adding support for vector retrieval use cases.

In my previous blogpost, implementing a sample RAG with Llama Index required converting the entries from the data source into vectors using an embedding model and storing them in a vector store. When storing these vectors, the associated metadata, either an ID or the actual text, is also stored in the vector store. The simplified steps of this pipeline are the following (a sketch follows the list):

  1. Read and parse the data source.
  2. Pass each data entry to the embedding model to generate a vector; we can call this the Data Vector, e.g. Hello World => [0.1021, 0.3341].
  3. Store the output into a vector store.
  4. A prompt or question is fed in and embedded the same way (repeating step 2); we can call this the Prompt Vector.
  5. Search by comparing the distance of the Prompt Vector with the various Data Vectors (e.g. k-nearest neighbors using cosine similarity).
  6. Sort the distances.
  7. Return the top N Data Vectors together with their metadata and present the results.
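Here is a minimal Python sketch of those steps. The embed function below is just a toy stand-in for a real embedding model; in practice you would call the model and let a vector store such as Weaviate handle storage and nearest-neighbour search.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for a real embedding model (e.g. an API call or sentence-transformers);
    # here we just hash character bigrams into a small, normalized fixed-size vector.
    vec = np.zeros(64)
    for a, b in zip(text.lower(), text.lower()[1:]):
        vec[(ord(a) * 31 + ord(b)) % 64] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)

# Steps 1-3: embed the data entries (the "Data Vectors") and keep the text as metadata.
corpus = [
    "Restart the service after editing the configuration file.",
    "Submit the expense report before the end of the month.",
]
data_vectors = np.stack([embed(text) for text in corpus])

def retrieve(prompt: str, top_n: int = 1):
    # Steps 4-7: embed the prompt (the "Prompt Vector"), compare by cosine similarity,
    # sort, and return the top N entries with their scores.
    prompt_vector = embed(prompt)
    similarities = data_vectors @ prompt_vector  # cosine similarity; vectors are normalized
    ranked = np.argsort(similarities)[::-1][:top_n]
    return [(corpus[i], float(similarities[i])) for i in ranked]

print(retrieve("How do I restart the service?"))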

Based on these steps, search and retrieval depend on vector computation and distances instead of term occurrence. As mentioned in the bag-of-words section, even though this solves the hard-coding and positional issues, the vectors depend on the data the embedding model was trained on.

Vector + Bag of Words

I am quite curious how mixing BM25 and vector retrieval algorithms affects the output of the LLM. For this, I have implemented a notebook showing the difference between the two algorithms. I will be using Llama Index and Weaviate for this experiment. Weaviate can be installed using a docker-compose.yml.
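Below is a minimal sketch of how the hybrid setup could look, assuming the llama_index and weaviate-client APIs from around the time of writing (a VectorStoreIndex backed by a WeaviateVectorStore). The ./data directory, the Handbook index name, and the question are placeholders, and the usual LLM/embedding credentials (e.g. OPENAI_API_KEY) are assumed to be configured.

```python
import weaviate
from llama_index import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores import WeaviateVectorStore

# Connect to the local Weaviate started via docker-compose (default port assumed).
client = weaviate.Client("http://localhost:8080")

# Load the documents and index them into Weaviate through Llama Index.
documents = SimpleDirectoryReader("./data").load_data()
vector_store = WeaviateVectorStore(weaviate_client=client, index_name="Handbook")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Pure vector retrieval vs. hybrid (BM25 + vector) retrieval. In Weaviate's hybrid mode,
# alpha=0 leans entirely on BM25, alpha=1 entirely on vectors, values in between blend both.
vector_engine = index.as_query_engine(similarity_top_k=3)
hybrid_engine = index.as_query_engine(
    vector_store_query_mode="hybrid", alpha=0.5, similarity_top_k=3
)

question = "How do I reset my password?"
vector_response = vector_engine.query(question)
hybrid_response = hybrid_engine.query(question)
print(vector_response)
print(hybrid_response)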

Experiment Notes

  • Weaviate was chosen because its out-of-the-box support for both BM25 and vector retrieval is good from a maintainability point of view; you don't want separate databases for vector and textual searches. You can use other combinations of databases for this, like Elasticsearch, Pinecone, and PostgreSQL + PGVector.
  • To evaluate the results, feeding the generated responses to another LLM to decide which one to use is good from an automation point of view (a small sketch follows these notes). Being able to fine-tune the evaluator model for the target audience would be very handy, since it could help re-word some of the terms for that audience, e.g. a non-domain audience.
  • The results of the vector retrieval did not change a lot. I think this is because we are querying something specific, like steps in a handbook or instructions. If we expand to other workloads and use cases, like news articles or forum posts, the outputs may differ.
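As a rough idea of the evaluator mentioned above, here is a hypothetical sketch that asks a separate LLM to pick between the answers produced by the two query engines from the earlier sketch. The pick_best_answer helper and the prompt wording are my own illustration, not part of the notebook.

```python
from llama_index.llms import OpenAI

# A separate "evaluator" LLM that judges the two candidate answers.
evaluator = OpenAI(model="gpt-4", temperature=0)

def pick_best_answer(question: str, answer_a: str, answer_b: str) -> str:
    # Ask the evaluator to choose the clearer answer for a non-domain audience.
    prompt = (
        "You are evaluating answers for a non-domain audience.\n"
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Reply with the letter of the clearer, more accurate answer and a one-line reason."
    )
    return evaluator.complete(prompt).text

# Example usage with the responses from the two retrieval modes in the earlier sketch:
# print(pick_best_answer(question, str(vector_response), str(hybrid_response)))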

Tags: