
This New Embedding Model Cuts Vector DB Costs by ~200x!

It also outperforms OpenAI and Cohere models.

RAG is 80% retrieval and 20% generation.

So if RAG isn’t working, it’s most likely a retrieval issue, which in turn usually traces back to chunking and embedding.

Contextualized chunk embedding models solve this.


Contextualized chunk embedding (Image by Author)

In this article, let’s dive into what they are and how they address common issues in RAG setups.

The problem

In RAG:


Chunking in RAG (Image by Author)
  • No chunking drives up token costs
  • Large chunks lose fine-grained context
  • Small chunks lose global/neighbourhood context

In practice, chunking also involves choosing the chunk overlap, generating summaries, and so on, all of which is tedious.
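
To make the tradeoffs concrete, here’s a minimal sketch of the usual fixed-size chunking with overlap. The `chunk_size` and `overlap` values are illustrative, not from any specific setup:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size character windows that overlap,
    so context at chunk boundaries isn't lost entirely."""
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]
```

Even this simple version forces a choice: a larger `chunk_size` blurs fine-grained details, a smaller one discards surrounding context, and `overlap` only softens the boundary problem rather than solving it.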

There’s another problem!

Despite all the tuning and tradeoff balancing, the final chunk embeddings are generated independently, with no interaction between them.

This doesn’t reflect real-world docs, which have long-range dependencies across chunks.
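
To see the issue, here’s a sketch using sentence-transformers as a stand-in for any off-the-shelf embedding model (the model name and example chunks are illustrative):

```python
from sentence_transformers import SentenceTransformer

# Stand-in for any standard embedding model (hypothetical choice).
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "The company reported revenue of $10M in Q3.",
    "This was a 20% increase over the previous quarter.",
]

# Each chunk is encoded in isolation: the vector for chunks[1] is identical
# whether or not chunks[0] is present, so the pronoun "This" loses its referent.
embeddings = model.encode(chunks)
print(embeddings.shape)  # (2, 384) for this model
```

A query like “Q3 revenue growth” should match the second chunk, but its embedding carries no trace of what “This” refers to, because the model never saw the neighbouring chunk.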