Semantic Search & Vectors
Vector visualizer + intuition for embeddings, dot product, and cosine similarity.
Why do we need semantic search?
Traditional search (keyword matching) finds documents that contain the exact words the user typed. That fails when the user uses different words or phrasing. Semantic search finds content by meaning — it can match related concepts even when the words differ.
Vectors (embeddings) are the technical tool that lets us compare meanings numerically.
How vectors represent meaning
An embedding is a vector of numbers created by a model to represent the meaning of text. Two pieces of text with similar meaning should have vectors that are close in the vector space.
Simple 2D illustration (for intuition):
- cat → [1, 2]
- dog → [2, 2] (close to cat → similar meaning)
- car → [9, -1] (far away → different meaning)
Dot product & cosine similarity — visual intuition
The math behind semantic search, illustrated with a wide horizontal graph of the cat, dog, and car vectors side by side.
Dot product (a·b) measures raw overlap — both direction and length matter. Cosine similarity divides by lengths to focus only on direction (the angle θ between arrows).
Formula (short): a·b = Σ a_i b_i
Cosine: cosine(a, b) = (a·b) / (‖a‖ · ‖b‖) = cos(θ)
Using our simple 2D examples (intuition + toy numbers):
- cat vs dog → arrows point in a very similar direction → cosine ≈ 0.95 (high similarity)
- cat vs car → arrows point very differently → cosine ≈ 0.35 (low similarity)
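To check the toy numbers by hand, here is a minimal sketch in plain Python. The helper names dot, norm, and cosine are illustrative choices, not part of any library; real embeddings have hundreds of dimensions, but the arithmetic is the same.

```python
import math

def dot(a, b):
    """Dot product: sum of element-wise products (a·b = Σ a_i b_i)."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector (‖a‖)."""
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """Cosine similarity: dot product divided by both lengths."""
    return dot(a, b) / (norm(a) * norm(b))

# Toy 2D vectors from the illustration above
cat, dog, car = [1, 2], [2, 2], [9, -1]

print(round(cosine(cat, dog), 2))  # 0.95
print(round(cosine(cat, car), 2))  # 0.35
```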
Real-world / practical examples
1) Synonyms / paraphrases: "How to fix bike puncture" vs "How to repair a flat tyre" — their embeddings are very close directionally even if the words differ. Toy 4-dim vectors:
q1 = [0.20, 0.10, 0.90, 0.05]
q2 = [0.19, 0.11, 0.88, 0.06]
cosine(q1, q2) ≈ 0.9998 → very similar
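A rough way to see the contrast with keyword matching: the two questions share almost no words, yet their toy vectors are nearly parallel. The sketch below assumes a naive whitespace tokenizer and reuses the toy 4-dim vectors above; it is an illustration, not how a real search engine tokenizes text.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

s1 = "How to fix bike puncture"
s2 = "How to repair a flat tyre"

# Keyword view: only the filler words overlap
shared = set(s1.lower().split()) & set(s2.lower().split())
print(sorted(shared))  # ['how', 'to']

# Vector view: toy embeddings point in almost the same direction
q1 = [0.20, 0.10, 0.90, 0.05]
q2 = [0.19, 0.11, 0.88, 0.06]
print(round(cosine(q1, q2), 4))  # 0.9998
```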
2) Length bias avoided: a long document may have a larger embedding norm. Dot product would then favor the long document even when the meaning is the same. Cosine divides the length out, so short and long documents with the same meaning score equally.
q = [0.5, 0.5, 0.1]
d_short = [0.5, 0.49, 0.11]
d_long = [5.0, 4.9, 1.1] (same direction, scaled by 10)
// dot(q, d_short) = 0.506, dot(q, d_long) = 5.06 (dot prefers long)
// cosine(q, d_short) = cosine(q, d_long) ≈ 0.9998 (cosine treats them equally)
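The same comparison as a small runnable sketch, using only the toy vectors above. The helper names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

q = [0.5, 0.5, 0.1]
d_short = [0.5, 0.49, 0.11]
d_long = [5.0, 4.9, 1.1]  # same direction as d_short, 10x the length

# Dot product grows with vector length, so the long document wins
print(round(dot(q, d_short), 3), round(dot(q, d_long), 2))  # 0.506 5.06

# Cosine ignores length, so both documents get the same score
print(round(cosine(q, d_short), 4), round(cosine(q, d_long), 4))  # 0.9998 0.9998
```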
In practice, vector search tools such as Faiss, Milvus, and Pinecone either offer a cosine metric or let you store unit-normalized embeddings and rank by inner product, which gives the same ordering. Either way, results are ranked by semantic direction rather than by raw magnitude.
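Why normalizing up front works: once every vector has length 1, the denominator in the cosine formula is 1, so a plain dot product already equals the cosine. A minimal sketch with a hypothetical normalize helper (no particular vector database API is assumed):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length, keeping its direction."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

q = [0.5, 0.5, 0.1]
d_long = [5.0, 4.9, 1.1]

# Normalize once (for documents, typically at indexing time)
q_unit = normalize(q)
d_unit = normalize(d_long)

# On unit vectors, the dot product is the cosine similarity
score = dot(q_unit, d_unit)
print(round(score, 4))  # 0.9998
```

Normalizing at indexing time means each query needs only dot products, which is the cheaper operation.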
Imagine each sentence as an arrow. Semantic search finds arrows that point in the same direction — those are the meanings that match.
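Putting the pieces together, a semantic search is essentially "score every stored vector against the query and keep the best". The sketch below is a brute-force illustration over the toy 2D vectors from earlier; the search function, the index dictionary, and the "kitten" query vector are all made up for the example, and real systems use approximate indexes rather than scanning everything.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(query_vec, index, top_k=2):
    """Return the labels of the top_k vectors closest in direction to the query."""
    scored = [(cosine(query_vec, vec), label) for label, vec in index.items()]
    scored.sort(reverse=True)  # highest cosine first
    return [label for _, label in scored[:top_k]]

# Toy index: label -> 2D vector
index = {"cat": [1, 2], "dog": [2, 2], "car": [9, -1]}

# Hypothetical embedding for the query "kitten"
query = [1.2, 2.1]

print(search(query, index))  # ['cat', 'dog']
```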
3D Vector Space (Interactive Demo)
Add words one by one and watch how vectors position themselves in 3D. Try similar words to see them cluster.