From Fruits to Data: A Beginner’s Guide to Vector Databases


Vector databases store data as lists of numbers (vectors) and retrieve items by similarity rather than by exact match, which makes them increasingly important in AI and machine learning applications. The concepts behind them, however, can feel complex and abstract.

In this article, we’ll break down the fundamentals of vector databases using simple analogies to make them easier to understand. We will cover vector representation, vector embedding, vector distance, vector search, and vector indexing.

Vector Representation and Embedding

Let’s start with an analogy. Imagine we have a collection of fruits with different types, colors, and weights:

Fruit A: A red apple weighing 150 grams.
Fruit B: A yellow banana weighing 120 grams.
Fruit C: A green pear weighing 200 grams.
Fruit D: An orange weighing 250 grams.
Fruit E: A red cherry weighing 10 grams.

To convert these fruits into vectors of the form (color, type, weight), we need a set of rules or codes:

Color Codes: Red = 1, Yellow = 2, Green = 3, Orange = 4.
Fruit Type Codes: Apple = 1, Banana = 2, Pear = 3, Orange = 4, Cherry = 5.
Weights: We use the actual weight of the fruit.

Example of Vector Representation

Fruit A: A red apple weighing 150 grams becomes (1, 1, 150).
Fruit B: A yellow banana weighing 120 grams becomes (2, 2, 120).
Fruit C: A green pear weighing 200 grams becomes (3, 3, 200).
Fruit D: An orange weighing 250 grams becomes (4, 4, 250).
Fruit E: A red cherry weighing 10 grams becomes (1, 5, 10).

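This process of converting objects to vectors is known as vector embedding. In real systems, embeddings are usually produced by machine learning models and have many more dimensions than our three, but the goal is the same: similar objects should end up with nearby vectors.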
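The encoding rules above can be sketched in a few lines of Python. This is only an illustration of the analogy: the names COLOR_CODES, TYPE_CODES, and to_vector are assumptions made up for this example and do not come from any library.

```python
# Hypothetical lookup tables mirroring the article's color and fruit-type codes.
COLOR_CODES = {"red": 1, "yellow": 2, "green": 3, "orange": 4}
TYPE_CODES = {"apple": 1, "banana": 2, "pear": 3, "orange": 4, "cherry": 5}


def to_vector(color, fruit_type, weight_grams):
    """Encode a fruit as (color code, type code, weight in grams)."""
    return (COLOR_CODES[color], TYPE_CODES[fruit_type], weight_grams)


fruits = {
    "A": to_vector("red", "apple", 150),
    "B": to_vector("yellow", "banana", 120),
    "C": to_vector("green", "pear", 200),
    "D": to_vector("orange", "orange", 250),
    "E": to_vector("red", "cherry", 10),
}

for name, vector in fruits.items():
    print(f"Fruit {name}: {vector}")
```

Note that the numeric codes are arbitrary labels, so the numeric gap between two color codes carries no real meaning. That is one reason learned embeddings are preferred in practice.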

Vector Distance

Next, let’s explore how we measure the distance between vectors, which helps us find similar objects. We will use Euclidean distance, a common method for measuring the distance between two points in space.

Consider a query object, a red apple weighing 140 grams, which translates to the vector (1, 1, 140). We want to find which fruit in our database is closest to this query object.

Calculating Euclidean Distance

Distance between Fruit A (1, 1, 150) and the query object (1, 1, 140):

d = √((1 − 1)² + (1 − 1)² + (150 − 140)²) = √100 = 10

Similarly, we can calculate the distances between the query object and all other fruits.
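As a minimal sketch, the same calculation could be written in Python as follows. The helper name euclidean_distance is hypothetical and chosen for this example; it takes the square root of the sum of squared differences between matching components.

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared per-component differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = (1, 1, 140)    # red apple, 140 grams
fruit_a = (1, 1, 150)  # red apple, 150 grams

print(euclidean_distance(fruit_a, query))  # 10.0
```

Because weight is measured in grams while the color and type codes are small integers, the weight difference dominates the result in this toy example.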

Vector Search: Nearest Neighbour

After calculating the distances, we sort them in ascending order to find the nearest object:

Fruit A (1, 1, 150): Distance = 10
Fruit B (2, 2, 120): Distance = 20.05
Fruit C (3, 3, 200): Distance = 60.07
Fruit D (4, 4, 250): Distance = 110.08
Fruit E (1, 5, 10): Distance = 130.06

The closest fruit to our query (a red apple weighing 140 grams) is Fruit A, a red apple weighing 150 grams. This method is known as the Nearest Neighbour Search.
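The search described here, computing every distance and sorting, is often called a brute-force search. A minimal sketch, assuming Python 3.8 or later for the standard library function math.dist; the fruits dictionary and variable names are illustrative:

```python
import math

fruits = {
    "A": (1, 1, 150),
    "B": (2, 2, 120),
    "C": (3, 3, 200),
    "D": (4, 4, 250),
    "E": (1, 5, 10),
}
query = (1, 1, 140)  # red apple, 140 grams

# Sort every fruit by its Euclidean distance to the query, nearest first.
ranked = sorted(fruits.items(), key=lambda item: math.dist(item[1], query))

for name, vector in ranked:
    print(f"Fruit {name} {vector}: distance = {math.dist(vector, query):.2f}")

nearest_name, nearest_vector = ranked[0]
print(f"Nearest neighbour: Fruit {nearest_name}")
```

This compares the query against every stored vector, which is fine for five fruits but becomes slow as the collection grows.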

Vector Indexing

Efficient searching of vectors requires indexing. Let’s understand indexing through a different analogy involving a warehouse:

Imagine a massive warehouse storing thousands of products. Without any organization, finding a specific product could take hours. However, retrieving a particular product becomes much faster if the products are categorized and stored based on specific attributes (e.g., electronics in one section, furniture in another, and within those sections, further organized by size or brand).

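This is similar to indexing vectors. Indexing allows us to find relevant vectors quickly in a large database without comparing the query against every stored vector. One common indexing method is the K-d tree, which recursively splits the data along one dimension at a time so that a search can skip regions that cannot contain the nearest match.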
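For readers curious what a textbook K-d tree looks like in code, here is a minimal sketch under simplifying assumptions: points are split at the median of one coordinate, and the coordinate used cycles with tree depth. The names Node, build_kdtree, and nearest are made up for this illustration, and the sketch is not how any particular vector database implements indexing.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    point: tuple                    # the vector stored at this node
    axis: int                       # coordinate this node splits on
    left: Optional["Node"] = None   # points at or below the split value
    right: Optional["Node"] = None  # points at or above the split value


def build_kdtree(points, depth=0):
    """Recursively split points at the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(
        point=ordered[mid],
        axis=axis,
        left=build_kdtree(ordered[:mid], depth + 1),
        right=build_kdtree(ordered[mid + 1:], depth + 1),
    )


def nearest(node, query, best=None):
    """Return the stored point closest to query, pruning branches that cannot win."""
    if node is None:
        return best
    if best is None or math.dist(node.point, query) < math.dist(best, query):
        best = node.point
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best match so far.
    if abs(diff) < math.dist(best, query):
        best = nearest(far, query, best)
    return best


points = [(1, 1, 150), (2, 2, 120), (3, 3, 200), (4, 4, 250), (1, 5, 10)]
root = build_kdtree(points)
print(nearest(root, (1, 1, 140)))  # (1, 1, 150)
```

The pruning step in nearest is what saves work: whole subtrees are skipped when they lie farther away than the best match already found.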

Example of K-d Tree Indexing

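As a simplified illustration of this idea, we can partition our fruit vectors on the first element (a real K-d tree would split at a threshold value, such as the median, rather than create one group per value):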

Red fruits (1): Group vectors with the first element as 1.
Yellow fruits (2): Group vectors with the first element as 2.
Green fruits (3): Group vectors with the first element as 3.
Orange fruits (4): Group vectors with the first element as 4.

For further partitioning within each group, we can use the second element. For instance, within the red fruits:

Red apples (1, 1): Group vectors with the second element as 1.
Red cherries (1, 5): Group vectors with the second element as 5.

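This hierarchical partitioning allows efficient searching. When a new vector arrives, we follow the partitions to place it in the correct group, and a query only needs to be compared against the vectors in its matching group rather than the whole database.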
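The grouping described above could be sketched as a two-level dictionary keyed by the first and then the second vector element. This is a hedged illustration of the article's simplified scheme, not a real K-d tree; the names index and search are assumptions made up for this example, and math.dist requires Python 3.8 or later.

```python
import math
from collections import defaultdict

fruits = {
    "A": (1, 1, 150),
    "B": (2, 2, 120),
    "C": (3, 3, 200),
    "D": (4, 4, 250),
    "E": (1, 5, 10),
}

# Two-level index: color code -> type code -> names of the fruits in that group.
index = defaultdict(lambda: defaultdict(list))
for name, (color, fruit_type, _weight) in fruits.items():
    index[color][fruit_type].append(name)


def search(query):
    """Return the closest fruit within the query's own group, or None if the group is empty."""
    color, fruit_type, _weight = query
    candidates = index.get(color, {}).get(fruit_type, [])
    if not candidates:
        return None
    return min(candidates, key=lambda name: math.dist(fruits[name], query))


print(search((1, 1, 140)))  # A
```

The trade-off in this sketch is that only the query's own group is examined, so a closer vector sitting in a neighbouring group would be missed. Real indexes handle this by also checking nearby partitions when necessary.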

Summary

Vector Representation and Embedding: Converting objects into vectors using predefined rules.
Vector Distance: Measuring similarity between vectors using Euclidean distance.
Vector Search: Finding the nearest neighbour to a query vector.
Vector Indexing: Organizing vectors using methods like K-d trees for efficient searching.
