In my previous blogpost about RAG with Llama Index, I went through how to develop RAGs with Llama Index, a library for creating LLM applications. For this post, I have extended my experiment to mix vector-based and text-based (bag of words) retrieval algorithms.
In most search or indexing databases like Elasticsearch, Opensearch, and Solr, the general algorithms being used are Best Matching 25 (BM25) and term frequency/inverse document frequency (TF/IDF). These basically generate a relevance score that is used to rank the entries in the database. Both of these fall under the bag of words type of algorithms, where the existence and occurrence of a word drive the relevance. The simplified steps of a bag of words pipeline are the following:
1. Clean and tokenize the text, removing stop words such as and, are, that, then, there, such.
2. Map each remaining term to the id of the data/document entry in an index (an inverted index). E.g. Hello -> [Doc1, Doc2] and World -> [Doc2].
3. At query time, compute the relevance score (e.g. BM25 or TF/IDF) for the entries that contain the query terms.
4. Rank the top N results and present the results to the user.

The general issue that comes up with these types of algorithms is that they do not take into account synonymous words or even the position of the term. As an example, "with resource" and "without resource" may have the same relevance score if the user is looking for "resource"; this is because the algorithm only uses the "resource" term as the input to generate the relevance. There are ways to handle this use case by setting up keywords or even modifying the score by detecting the words that provide more context; the issue, though, is that you will need to hard-code all of the use cases. That is why a lot of AI or ML workflows are relying more on vector retrieval as opposed to bag of words.
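To make the bag of words steps above concrete, here is a minimal sketch in plain Python of a toy inverted index with a naive term-count score (the documents, stop word list, and scoring are simplified stand-ins; BM25 and TF/IDF apply more careful weighting on top of the same idea):

```python
from collections import defaultdict

STOP_WORDS = {"and", "are", "that", "then", "there", "such"}

documents = {
    "Doc1": "Hello there",
    "Doc2": "Hello World",
}

# Steps 1-2: tokenize, drop stop words, and map each term to the document ids
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        if term not in STOP_WORDS:
            inverted_index[term].add(doc_id)
# inverted_index -> {"hello": {"Doc1", "Doc2"}, "world": {"Doc2"}}

# Steps 3-4: score candidate documents by term occurrence and rank the top N
def search(query: str, top_n: int = 5):
    scores = defaultdict(int)
    for term in query.lower().split():
        if term in STOP_WORDS:
            continue
        for doc_id in inverted_index.get(term, set()):
            # naive term-count score; BM25/TF-IDF refine this weighting
            scores[doc_id] += documents[doc_id].lower().split().count(term)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(search("Hello World"))  # [('Doc2', 2), ('Doc1', 1)]
```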
The good thing about these types of algorithms, though, is that they are driven by term counts, which makes them less complex and more explainable to the user. ML or embedding based retrievals introduce additional complexity through chunking, overlaps, the data the model was trained on, conversion to vector space, etc. Given these additional complexities, the embedding approach is also more computationally expensive.
With the expansion of ML and AI in general, the number of available vector stores, both proprietary and open source, is on the rise, and traditional database technologies are creating use cases or adding support to execute vector retrievals.
In my previous blogpost, to implement a sample RAG using Llama Index, the entries from the data source are converted into vectors using embedding models and stored in a vector store. When storing these vectors, the associated metadata, which can be the ID or even the actual text, is also stored in the vector store. The simplified steps to implement this pipeline are the following:
1. Convert each data entry into a Data Vector using the embedding model. E.g. Hello World => [0.1021, 0.3341].
2. Convert the user prompt into a Prompt Vector with the same embedding model.
3. Compute the distance of the Prompt Vector with the various Data Vectors.
4. Retrieve the top N closest Data Vectors together with the metadata. Present the results.

Based on these steps, the search and retrieval is dependent on vector computations and distances instead of term occurrence. As mentioned in the bag of words section, even though this solves the hardcoding and positional issues, the vectors are dependent on the data the embedding model was trained on.
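As a rough sketch of the vector pipeline above, assuming a placeholder `embed()` function that stands in for a real embedding model (the texts and vector values below are made up for illustration):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: a real pipeline would call an embedding model here;
    # the vectors below are made-up two-dimensional values.
    fake = {
        "Hello World":     [0.1021, 0.3341],
        "Hi planet":       [0.1100, 0.3200],
        "Tax report":      [0.9000, 0.0500],
        "Greetings Earth": [0.1050, 0.3300],
    }
    return np.array(fake[text])

# Step 1: convert each data entry into a Data Vector and keep its metadata
corpus = ["Hello World", "Hi planet", "Tax report"]
data_vectors = [(text, embed(text)) for text in corpus]

# Steps 2-4: embed the prompt, compare by cosine similarity, return the top N
def retrieve(prompt: str, top_n: int = 2):
    q = embed(prompt)
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(text, cosine(q, vec)) for text, vec in data_vectors]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:top_n]

print(retrieve("Greetings Earth"))  # ranked by vector proximity, not term overlap
```

Notice that the ranking comes purely from vector proximity; the query does not need to share any literal term with the retrieved entries.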
I am quite curious how mixing the BM25 and vector retrieval algorithms will affect the output result of the LLM. For this, I have implemented a notebook showing the difference between the two algorithms. I will be using Llama Index and Weaviate for this experiment. Weaviate can be installed using a docker-compose.yml.
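For reference, a minimal docker-compose.yml along the lines of Weaviate's own examples looks roughly like this (the image tag and settings here are illustrative; check the Weaviate documentation for the version you actually want to run):

```yaml
version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate:1.21.2   # illustrative tag
    ports:
      - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'none'
      CLUSTER_HOSTNAME: 'node1'
    volumes:
      - weaviate_data:/var/lib/weaviate
volumes:
  weaviate_data:
```

Running `docker compose up -d` in the same directory starts a local Weaviate instance on port 8080 that Llama Index can connect to.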