Why Vector DBs Are the Wrong Abstraction – And What We Built Instead

Marek Galovic

March 17, 2025

We’ve spent the last three years building the most popular vector database on the market. In that time we realized that a database built around vectors as a primary key is simply the wrong abstraction, creating an unnecessary obstacle for users in production.

In most real-world applications, you want to combine vector search with traditional text search, metadata filtering, and rank results using custom scoring functions to get the best relevance.

Doing all of this is very inefficient if your database is built around a vector index that assumes vector-only queries. The primary reason for this inefficiency is that the distribution of embeddings and the distribution of metadata are not strongly correlated.

Another problem is the cost of legacy databases that couple compute and storage. In this architecture, write traffic can negatively affect queries which usually leads to over-provisioning of resources to maintain SLAs. Secondly, the data needs to be replicated to multiple nodes for durability and high availability. This gets expensive at $0.02 per GB.

Architecture

On a high level, we separate writes, indexing/compaction and query execution into multiple independent services.

First, user writes are handled by the log writer service that appends them to a durable write ahead log (WAL) backed by object storage.

Next, the compactor service consumes WAL entries and produces a read-optimized representation with additional indexes to make search queries fast. Similarly to the log, the indexed files are also durably persisted in object storage.

Finally, the most interesting part, queries. The router service receives application queries, validates them and converts to a logical plan representation used by our distributed query engine reactor. Reactor then distributes the query over a set of executor nodes that read indexed files from object storage, execute the query plan, and return partial results back to the router which computes the final result and returns it back to the client.

Executor nodes cache data locally on NVMe SSDs and in-memory to improve latencies for subsequent queries. In principle, any instance can handle requests for any collection in the cluster which gives us high availability. We don’t want random routing though, since that would result in poor cache utilization, so we consistently assign collections and data files to executors to get high cache hit rates.

Query Engine

Both query routers and query executors use the same query engine, reactor, to run distributed queries. Internally, reactor uses Apache Arrow for in-memory data extended with custom layouts for types like dense/sparse matrix and posting lists.

The choice of Arrow allows us to use off-the-shelf compute kernels for undifferentiated ops like filtering and focus our effort on developing high performance kernels for search. We initially looked at using DataFusion, which is the default choice for most new databases these days, but ultimately decided against it since it’s better suited for analytical workloads and doesn’t natively support external indexes.

Furthermore, the co-design of storage format and execution engine allows us to write compute kernels that operate directly on compressed data which massively improves performance on the search workloads that we target.

Storage Format

Using object storage as the primary storage medium posed a number of challenges for us when it comes to delivering low query latencies. First, time-to-first-byte (TTFB) request latencies are roughly 190ms p95 which is much higher than locally attached disks. Second, per-request pricing model implies optimal I/O request sizes to minimize cost and maximize throughput.

Given these constraints, we found off-the-shelf file formats such as Parquet to perform quite poorly since they have serial dependencies in their I/O and couple statistics/metadata granularity with I/O granularity. The latter is particularly problematic for reading row groups with columns that have drastically different value size because of the imbalance in column chunk sizes.

To address these issues we built a columnar file format (.bob) that has very wide I/O trees to maximize concurrent I/O and decouples the logical file structure from the physical data layout. This enabled us to achieve both optimal I/O request size for object storage and granular statistics for effective block pruning at the same time.

What does this enable?

Having a search database that’s 10-100x cheaper is great but not enough.

Ultimately, our goal is to enable developers to build production-ready search with highly relevant results in the most intuitive way possible.

True hybrid retrieval

Existing search systems that support hybrid retrieval usually do so using reciprocal rank fusion (RRF) where partial results from multiple indexes are merged and resorted based on their ranks. This requires over-fetching candidates from individual indexes and doesn’t consider relevance scores of candidates which hurts performance.

To fix these issues, TopK supports true hybrid retrieval in a single query against the same index which enables users to combine multiple vectors with text filters and metadata filters.

client.collection("books").query(
    select(
      "title",
      score=fn.vector_distance("title_embedding", [...]), # vector scoring
    )
    .filter(match("catcher")) # text filtering
    .filter(field("published_year") > 1920) # metadata filtering
    .top_k(field("score"), 100),
)

Flexible scoring

Custom scoring rules based on document attributes (e.g. boosting) are instrumental for getting highly relevant results in your application. TopK gives you the ability to combine multiple scoring functions with custom expressions to optimize the ranking of results you show to your users.

An example use case for this is a medical search application where we want to combine relevance of abstract and passage embedding with BM25 score and boost documents based on a quality of the journal they were published in.

See the code below to understand what I have in mind:

client.collection("medical_papers").query(
    select(
      "passage",
      abstract_sim=fn.vector_distance("abstract_embedding", [...]),
      passage_sim=fn.vector_distance("passage_embedding", [...]),
      bm25=fn.bm25_score(),
    )
    .filter(match("respiratory disease"))
    .filter(field("published_year") >= 2019)
    .top_k(
	    field("journal_quality")
	    * (field("abstract_sim") + field("passage_sim") + field("bm25")),
	    100
    ),
)

Benchmarks

Vector search with selective filters

Performance of vector databases degrades with highly selective filters even though this is the most common access pattern in production. TopK’s indexing and execution engine which results in highly selective queries being faster, not slower.

The benchmarks below show vector-only queries with filters that select 100%, 10% and 1% of the indexed documents. For a collection with 1M documents and 768-dimensional vectors we achieve ~62ms p99 latency, for collection with 10M documents and 768-dimensional vectors, we achieve ~115ms p99 latency.

Vector query latency benchmark for 1M documents with 768-dimensional vectors and 100 results

Vector query latency benchmark for 10M documents with 768-dimensional vectors and 100 results

Text-only filtering and ranking

Not every application needs vector-based retrieval, especially when traditional keyword filtering with BM25 scoring works fine as a first-stage retriever. Our query engine supports text-only queries with or without BM25 and achieves ~32ms p99 latency for conjunction queries (all query terms must match) and ~60ms p99 latency for disjunction queries (any query term must match).

Text query latency benchmark for 1M documents with 1000 results

If you are building a semantic search, AI-driven application, agents, RAG or similar and would like to try this for yourself, you can now head over to console.topk.io, generate your API key and start upserting and querying your data.