Vector database: Use Elasticsearch to implement vector data storage and search
1. Introduction
Elasticsearch supports vector search through script-based vector functions. When a vector function is evaluated, all matching documents are scanned linearly, so query time is expected to grow linearly with the number of matching documents. For this reason, it is recommended to limit the number of matching documents with a query (similar to two-stage retrieval: first use a match query to retrieve the relevant documents, then use the vector function to score their relevance).
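As a rough sketch of this two-stage pattern, the Python snippet below builds a script_score request body whose inner query narrows the candidate set before the vector function runs. The helper name build_vector_query is hypothetical, and the field names my_text and my_vector are borrowed from the index created in section 2.

```python
import json

def build_vector_query(text_value, query_vector):
    """Build a script_score body: a match query selects candidates,
    then cosineSimilarity scores only those candidates.
    (Hypothetical helper, not part of any Elasticsearch client.)"""
    return {
        "query": {
            "script_score": {
                # Stage 1: restrict the documents that reach the script.
                "query": {"match": {"my_text": text_value}},
                # Stage 2: score the remaining documents by vector similarity.
                "script": {
                    "source": "cosineSimilarity(params.queryVector, doc['my_vector']) + 1.0",
                    "params": {"queryVector": query_vector},
                },
            }
        }
    }

body = build_vector_query("text1", [-0.5, 10, 6])
print(json.dumps(body, indent=2))
```

The printed JSON is intended to be usable as the body of a POST index3/_search request; only the request body is built here, and nothing is sent to a cluster.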
The recommended way to access dense_vector fields is through the cosineSimilarity, dotProduct, l1norm, or l2norm functions. Note, however, that each script should call these functions only once. For example, do not call them in a loop to calculate the similarity between a document vector and several other vectors. If that behavior is required, reimplement these functions by accessing the vector values directly.
2. Preparation before experiment
2.1 Create an index with a vector field
Create a mapping that supports vector search. The field type is dense_vector, and the maximum supported value of dims is 1024.
PUT index3
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 3
      },
      "my_text": {
        "type": "keyword"
      }
    }
  }
}
2.2 Write data
PUT index3/_doc/1
{
  "my_text": "text1",
  "my_vector": [0.5, 10, 6]
}

PUT index3/_doc/2
{
  "my_text": "text2",
  "my_vector": [-0.5, 10, 10]
}
3. Vector calculation function
3.1 Cosine Similarity: cosineSimilarity
The cosineSimilarity function calculates the cosine similarity between a given query vector and a document vector.
POST index3/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, doc['my_vector']) + 1.0",
        "params": {
          "queryVector": [-0.5, 10, 6]
        }
      }
    }
  }
}
- To limit the number of documents on which script_score is calculated, provide a filter query.
- The script adds 1.0 to the cosineSimilarity result to prevent the score from being negative.
- To take better advantage of script optimizations, provide the query vector as a script parameter.
- Check for missing values: if a document has no value for the vector field on which the vector function is executed, an error is thrown.
- Use doc['my_vector'].size() == 0 to check whether a document has a value for the my_vector field.
Script example:
"source": "doc['my_vector'].size() == 0 ? 0 : cosineSimilarity(params.queryVector, 'my_vector')"
If a document's dense_vector field has a different number of dimensions than the query vector, an exception is thrown.
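To make the scoring concrete, here is a minimal local sketch of what cosineSimilarity(...) + 1.0 computes for the two sample documents. It is plain Python rather than Painless, the function name cosine_score is hypothetical, and Elasticsearch's own floating-point results may differ slightly in the last digits.

```python
import math

def cosine_score(query, doc_vector):
    """Cosine similarity plus 1.0, so the result is never negative."""
    if len(query) != len(doc_vector):
        # Mirrors the dimension-mismatch error described above.
        raise ValueError("query and document vectors differ in dimension")
    dot = sum(q * d for q, d in zip(query, doc_vector))
    norm_q = math.sqrt(sum(q * q for q in query))
    norm_d = math.sqrt(sum(d * d for d in doc_vector))
    return dot / (norm_q * norm_d) + 1.0

query = [-0.5, 10, 6]
docs = {"1": [0.5, 10, 6], "2": [-0.5, 10, 10]}
scores = {doc_id: cosine_score(query, vec) for doc_id, vec in docs.items()}

# Print documents from highest to lowest score.
for doc_id, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(doc_id, round(score, 4))
```

With the sample data, document 1 scores roughly 1.996 and document 2 roughly 1.970, so document 1 ranks first.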
3.2 Calculate dot product: dotProduct
The dotProduct function calculates the dot product metric between a given query vector and a document vector.
POST index3/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": """
          double value = dotProduct(params.queryVector, doc['my_vector']);
          return sigmoid(1, Math.E, -value);
        """,
        "params": {
          "queryVector": [-0.5, 10, 6]
        }
      }
    }
  }
}
The standard sigmoid function prevents scores from being negative.
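The following Python sketch illustrates why the sigmoid keeps scores non-negative. It assumes that the Painless sigmoid(value, k, a) function evaluates value^a / (k^a + value^a), so that sigmoid(1, Math.E, -value) reduces to the logistic function 1 / (1 + e^(-value)). The helper names are hypothetical.

```python
import math

def painless_sigmoid(value, k, a):
    """value^a / (k^a + value^a): the form assumed here for Painless sigmoid()."""
    return value ** a / (k ** a + value ** a)

def dot_product(query, doc_vector):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(q * d for q, d in zip(query, doc_vector))

query = [-0.5, 10, 6]
for doc_vector in ([0.5, 10, 6], [-0.5, 10, 10]):
    value = dot_product(query, doc_vector)
    # Same call shape as the script: sigmoid(1, Math.E, -value).
    score = painless_sigmoid(1, math.e, -value)
    print(value, score)

# A negative dot product still maps to a score between 0 and 1.
print(painless_sigmoid(1, math.e, 2.0))
```

One thing this sketch makes visible: the sample vectors have large dot products (135.75 and 160.25), so the logistic output saturates at 1.0 for both documents. When ranking matters, it may help to normalize the vectors or scale the dot product before applying the sigmoid.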
3.3 Manhattan Distance: l1norm
The l1norm function calculates the L1 distance (Manhattan distance) between a given query vector and a document vector.
POST index3/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "1 / (1 + l1norm(params.queryVector, doc['my_vector']))",
        "params": {
          "queryVector": [-0.5, 10, 6]
        }
      }
    }
  }
}
Unlike cosineSimilarity, which represents similarity, l1norm and l2norm represent distance, or difference. This means that the more similar two vectors are, the lower the values that l1norm and l2norm produce. So when similar vectors need to receive higher scores, the output of l1norm and l2norm is inverted. In addition, 1 is added to the denominator to avoid division by 0 when a document vector exactly matches the query.
3.4 Euclidean distance: l2norm
The l2norm function calculates the L2 distance (Euclidean distance) between a given query vector and a document vector.
POST index3/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "1 / (1 + l2norm(params.queryVector, doc['my_vector']))",
        "params": {
          "queryVector": [-0.5, 10, 6]
        }
      }
    }
  }
}
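As a small worked example of the two distance-based scores, the Python sketch below reproduces the 1 / (1 + distance) inversion locally for the sample documents. The function names mirror the Elasticsearch ones for readability, but these are plain Python stand-ins, not the Painless implementations.

```python
import math

def l1norm(query, doc_vector):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(q - d) for q, d in zip(query, doc_vector))

def l2norm(query, doc_vector):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((q - d) ** 2 for q, d in zip(query, doc_vector)))

def distance_to_score(distance):
    """Invert a distance so closer vectors score higher; +1 avoids division by zero."""
    return 1 / (1 + distance)

query = [-0.5, 10, 6]
docs = {"1": [0.5, 10, 6], "2": [-0.5, 10, 10]}
for doc_id, vec in docs.items():
    print(doc_id,
          distance_to_score(l1norm(query, vec)),
          distance_to_score(l2norm(query, vec)))
```

Each sample document differs from the query in a single coordinate, so both distances coincide here: document 1 is at distance 1 (score 0.5) and document 2 at distance 4 (score 0.2). A document identical to the query would be at distance 0 and score exactly 1.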
3.5 Custom calculation functions
Use the doc[<field>].vectorValue function to access a vector's values directly and implement a custom cosine similarity calculation. doc[<field>].vectorValue is supported starting with Elasticsearch 7.8.0; scripts that use it fail on earlier versions such as 7.5.1.
The vector value can be accessed directly through the following functions:
- doc[<field>].vectorValue – returns the vector's values as an array of floats.
- doc[<field>].magnitude – returns the vector's magnitude as a float. For vectors created before version 7.5, the magnitude is not stored, so this function recalculates it every time it is called.
POST index3/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": """
          float[] v = doc['my_vector'].vectorValue;
          float vm = doc['my_vector'].magnitude;
          float dotProduct = 0;
          for (int i = 0; i < v.length; i++) {
            dotProduct += v[i] * params.queryVector[i];
          }
          return dotProduct / (vm * (float) params.queryVectorMag);
        """,
        "params": {
          "queryVector": [-0.5, 10, 6],
          "queryVectorMag": 11.67262
        }
      }
    }
  }
}
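For the custom script to return a true cosine similarity, queryVectorMag has to equal the Euclidean norm of queryVector. The sketch below shows one way to compute that parameter on the client side and to check the loop-based formula locally; the helper names magnitude and manual_cosine are hypothetical.

```python
import math

def magnitude(vector):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in vector))

def manual_cosine(doc_vector, query_vector, query_mag):
    """Same steps as the script: accumulate the dot product, then normalize."""
    dot = 0.0
    for i in range(len(doc_vector)):
        dot += doc_vector[i] * query_vector[i]
    return dot / (magnitude(doc_vector) * query_mag)

query_vector = [-0.5, 10, 6]
# Script parameters, with the magnitude derived from the vector itself.
params = {
    "queryVector": query_vector,
    "queryVectorMag": round(magnitude(query_vector), 5),
}
print(params)
print(manual_cosine([0.5, 10, 6], query_vector, params["queryVectorMag"]))
```

Deriving the magnitude from the vector, rather than typing it by hand, keeps the two parameters consistent whenever the query vector changes.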