Neural retrieval infra (modelling side)

What was there before embeddings... sparse retrieval!

Inverted indexes for search
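A minimal sketch of what an inverted index is (toy whitespace tokenizer, AND semantics; a real engine like Lucene also stores positions and term frequencies and compresses the postings lists):

```python
# Minimal inverted index sketch: term -> set of doc ids containing it.
from collections import defaultdict

docs = {
    0: "neural retrieval with embeddings",
    1: "sparse retrieval with an inverted index",
    2: "bm25 is a sparse scoring function",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query: str) -> set[int]:
    """AND semantics: return docs containing every query term."""
    postings = [index[t] for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(search("sparse retrieval"))  # {1}
```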

BM25
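For reference, the standard BM25 scoring function, where f(t, d) is the frequency of term t in document d, |d| is the document length, avgdl is the average document length, and k_1 ≈ 1.2, b ≈ 0.75 are the usual defaults:

$$\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}$$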

How to train embeddings?

Training 101: loss functions, common embedding models, pitfalls. Cover loss functions beyond the classic triplet loss and its variants (click signal is cool to add for a B2C product?). A sketch of the usual alternative follows below.
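A hedged sketch of the in-batch-negatives contrastive loss (InfoNCE) that most modern embedding models train with instead of classic triplet loss; shapes and the temperature value are illustrative:

```python
# InfoNCE with in-batch negatives: each query's positive doc is the
# matching row of d_emb; every other doc in the batch is a negative.
import torch
import torch.nn.functional as F

def info_nce_loss(q_emb: torch.Tensor, d_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    # q_emb, d_emb: (batch, dim); row i of d_emb is the positive for query i.
    q = F.normalize(q_emb, dim=-1)
    d = F.normalize(d_emb, dim=-1)
    logits = q @ d.T / temperature                     # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)

# usage: loss = info_nce_loss(query_encoder(q_tokens), doc_encoder(d_tokens))
```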

Maybe more advanced? https://x.com/jxmnop/status/2031051636068782402?s=20 and talk about mining hard negatives, which is a big problem for embedding training (sketch below).
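One common hard-negative-mining recipe, sketched with hypothetical helper names (`embed`, `ann_index.search` stand in for your encoder and ANN library): retrieve top-k with the current model, drop the labeled positives, and keep high-ranked non-positives as hard negatives for the next training round.

```python
# Hard-negative mining sketch: highest-ranked non-positives become the
# hard negatives. `embed` and `ann_index` are hypothetical stand-ins.
def mine_hard_negatives(queries, positives, embed, ann_index,
                        k=100, per_query=5):
    hard_negatives = {}
    for query, pos_ids in zip(queries, positives):
        candidate_ids = ann_index.search(embed(query), top_k=k)
        negatives = [doc_id for doc_id in candidate_ids if doc_id not in pos_ids]
        # Caveat: the very top ranks are often unlabeled positives, not negatives.
        hard_negatives[query] = negatives[:per_query]
    return hard_negatives
```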

How to mix the two?

Do you always go for neural retrieval? Sometimes sparse retrieval is all you need?

Top-K retrieval on neural search, and in parallel top-K retrieval on sparse, keyword search.

final score = α · score_neural + β · score_sparse
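A sketch of that linear fusion. The two retrievers score on different scales (BM25 vs. cosine), so min-max normalizing each list first is a common trick; α and β would be tuned on held-out data.

```python
# Hybrid fusion sketch: normalize each score list, then combine linearly.
def normalize(scores: dict[str, float]) -> dict[str, float]:
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo + 1e-9) for k, v in scores.items()}

def hybrid_scores(neural: dict[str, float], sparse: dict[str, float],
                  alpha: float = 0.7, beta: float = 0.3) -> dict[str, float]:
    neural, sparse = normalize(neural), normalize(sparse)
    doc_ids = neural.keys() | sparse.keys()
    # A doc missing from one list contributes 0 from that retriever.
    return {d: alpha * neural.get(d, 0.0) + beta * sparse.get(d, 0.0)
            for d in doc_ids}
```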

Retrieval vs Ranking

90% of the time you solve things at retrieval time; if retrieval is good then yay! But reranking is also important. How do you even rerank?

O(B) documents --> cheap retrieval --> 1k documents --> run those 1k documents through another, usually more compute-intensive model --> get proper scores and keep the top 20.
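The funnel in skeleton form (all helper names hypothetical: `cheap_retrieve` is BM25 / bi-encoder ANN search, `expensive_score` is the heavier model):

```python
# Retrieve-then-rerank funnel skeleton.
def search_funnel(query, cheap_retrieve, expensive_score, k1=1000, k2=20):
    candidates = cheap_retrieve(query, top_k=k1)         # O(billions) -> 1k, cheap
    scored = [(doc, expensive_score(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # 1k -> 20, expensive
    return scored[:k2]
```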

How to rerank?

Depends? You can take a model similar to the one you retrieved with and just use that, but scaled up.

You can use LLMs with a prompt?
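A hedged sketch of pointwise LLM reranking; the prompt wording is illustrative and `llm` is a hypothetical completion call, not a specific API:

```python
# Pointwise LLM reranking: ask the model for a relevance judgment per
# (query, document) pair, then sort. Prompt and `llm` are illustrative.
PROMPT = """Query: {query}

Document: {document}

On a scale of 0-10, how relevant is the document to the query?
Answer with a single number."""

def llm_rerank(query, documents, llm):
    scores = [float(llm(PROMPT.format(query=query, document=d)))
              for d in documents]
    return [d for _, d in sorted(zip(scores, documents), reverse=True)]
```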

Do you still use the query?...

Bi-encoder way

Embed the query (online); all documents are already embedded (offline) -> find the top documents (top 1k with a smaller model / top 20 with a bigger one).

So it's cheap and fast, but there's no interaction between the actual tokens of the query and the document.
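The bi-encoder flow in a few lines (numpy; the embeddings file path is illustrative, and all embeddings are assumed L2-normalized so a dot product equals cosine similarity):

```python
# Bi-encoder retrieval sketch: doc embeddings precomputed offline, query
# embedded online, one matrix-vector product to score everything.
import numpy as np

doc_embeddings = np.load("doc_embeddings.npy")   # (n_docs, dim), built offline

def retrieve(query_embedding: np.ndarray, top_k: int = 1000) -> np.ndarray:
    scores = doc_embeddings @ query_embedding    # (n_docs,) similarities
    # argpartition = cheap top-k without a full sort; then sort just the top-k
    top = np.argpartition(-scores, top_k)[:top_k]
    return top[np.argsort(-scores[top])]
```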

Cross-encoder way

Take the query and, for each candidate document, concatenate them together, run the pair through the model, and get a score out meaning high similarity.

All the tokens of the query interact with all the tokens of the document, but it's very expensive... sad!! (but more performance) --> well, we could do it only in the later stages of the funnel! Top 1k retrieved cheaply... and then run this only on those to get the best performance?
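For example, with the sentence-transformers CrossEncoder interface (the model name is just one common public checkpoint, swap in your own):

```python
# Cross-encoder reranking over the cheaply-retrieved candidates.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 20) -> list[str]:
    # Each (query, doc) pair goes through the model jointly: full
    # token-token interaction, hence the cost.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]
```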

In the middle: ColBERT, late interaction.

Offline: rather than one embedding per document, you have an embedding for each token (offline). More embeddings... more disk storage... but manageable?

Online: you do an embedding for each token in the query, find the best-matching document token for each, and the score you optimize is then $\mathrm{score}(q, d) = \sum_{i \in q} \max_{j \in d} E_{q_i} \cdot E_{d_j}$ (a sum of MaxSim terms over query tokens).

Nice middle ground: cheaper than a cross-encoder, should be better than a classic bi-encoder.
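The MaxSim score in code (token embeddings assumed L2-normalized so dot product equals cosine):

```python
# ColBERT-style late interaction: for each query token, take its best
# match over the document's token embeddings, then sum those maxima.
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    # query_tokens: (n_q, dim), doc_tokens: (n_d, dim)
    sim = query_tokens @ doc_tokens.T    # (n_q, n_d) token-token similarities
    return float(sim.max(axis=1).sum())  # MaxSim per query token, summed
```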

Optimization

Matryoshka learning

Idea being: in a vector [........], the first 16, 32, 64... dimensions carry the highest signal, and the further you go, the less signal you have.

Idea: early in the funnel, you can use just [:32] out of a 1k embedding size to retrieve cheaply, and then later in the funnel use the full embedding size to get better precision.
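Using a Matryoshka embedding at two widths (the embedding matrix here is a random stand-in for real model output; truncated vectors are re-normalized so cosine/dot comparisons stay valid):

```python
# Matryoshka-style usage: truncate to the first `dims` dimensions for the
# cheap stage, re-normalize, use the full vector for the precise stage.
import numpy as np

def truncate_embedding(emb: np.ndarray, dims: int) -> np.ndarray:
    e = emb[..., :dims]
    return e / (np.linalg.norm(e, axis=-1, keepdims=True) + 1e-9)

full_embeddings = np.random.randn(10_000, 1024)    # stand-in for model output
coarse = truncate_embedding(full_embeddings, 32)   # early funnel: cheap, coarse
fine = truncate_embedding(full_embeddings, 1024)   # late funnel: full precision
```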

Quantization

Classic quantization techniques, ... how to deploy in prod while making sure precision stays the same for the operations you care about.
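A sketch of one classic technique, scalar int8 quantization of embeddings, with the kind of sanity check you'd want before shipping (reconstruction error here; in prod you'd also compare top-k recall of the quantized index against the float32 one):

```python
# Scalar int8 quantization: one symmetric scale per vector.
import numpy as np

def quantize_int8(emb: np.ndarray):
    scale = np.abs(emb).max(axis=-1, keepdims=True) / 127.0
    q = np.round(emb / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

emb = np.random.randn(1000, 768).astype(np.float32)
q, scale = quantize_int8(emb)
err = np.abs(emb - dequantize(q, scale)).max()
print(f"max abs reconstruction error: {err:.4f}")  # sanity check before prod
```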