How to use Elasticsearch to implement vector data storage and search


1. Introduction

Elasticsearch supports vector search. When vector functions are evaluated, all matching documents are scanned linearly, so query time is expected to grow linearly with the number of matching documents. For this reason, it is recommended to limit the number of matching documents with a query parameter. This works like a two-stage search: first use a match query to retrieve the relevant documents, then use a vector function to compute their relevance.

The recommended way to access dense_vector fields is through the cosineSimilarity, dotProduct, l1norm, or l2norm functions. Note, however, that each script can call these functions only once. For example, do not call them in a loop to calculate the similarity between a document vector and multiple other vectors. If you need that behavior, reimplement these functions by accessing the vector values directly.
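
The two-stage pattern described above can be sketched in Python as a helper that only builds the request body. The helper name build_two_stage_query and its arguments are hypothetical, the term filter in the usage lines is just a placeholder, and the script source follows the form used in this article. The resulting dict would still have to be sent to the _search endpoint with whatever client you use.

```python
def build_two_stage_query(filter_query, vector_field, query_vector):
    """Build a script_score request body that scores only filtered documents.

    Hypothetical helper for illustration: it returns a plain dict and does
    not talk to Elasticsearch.
    """
    return {
        "query": {
            "script_score": {
                # Stage 1: an ordinary query narrows the candidate set.
                "query": {"bool": {"filter": [filter_query]}},
                # Stage 2: the vector function scores only those candidates.
                "script": {
                    "source": f"cosineSimilarity(params.queryVector, doc['{vector_field}']) + 1.0",
                    "params": {"queryVector": list(query_vector)},
                },
            }
        }
    }


body = build_two_stage_query(
    {"term": {"my_text": "text1"}},  # placeholder filter
    "my_vector",
    [-0.5, 10, 6],
)
```

Passing the query vector through params rather than formatting it into the script source keeps the source string identical across requests.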

2. Preparing for the experiment

2.1 Create an index with a vector field

Create a mapping that supports vector search. The field type is dense_vector.

// The maximum supported dims is 1024.
PUT index3
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 3
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}

2.2 Write data

PUT index3/_doc/1
{
  "my_text" : "text1",
  "my_vector" : [0.5, 10, 6]
}
PUT index3/_doc/2
{
  "my_text" : "text2",
  "my_vector" : [-0.5, 10, 10]
}

3. Vector calculation functions

3.1 Cosine Similarity: cosineSimilarity

The cosineSimilarity function calculates the cosine similarity between a given query vector and a document vector.

POST index3/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, doc['my_vector']) + 1.0",
        "params": {
          "queryVector": [-0.5, 10, 6]
        }
      }
    }
  }
}

To limit the number of documents that script_score is computed on, provide a filter.
The script adds 1.0 to the cosine similarity to prevent the score from being negative.
To take advantage of script optimizations, provide the query vector as a script parameter.
Check for missing values: if a document has no value for the vector field that a vector function is executed on, an error is thrown.
You can use doc['my_vector'].size() == 0 to check whether a document has a value for the my_vector field.

Script example:

"source": """
  doc['my_vector'].size() == 0 ? 0 : cosineSimilarity(params.queryVector, 'my_vector')
"""

If a document's dense_vector field has a different number of dimensions than the query vector, an exception is thrown.
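
To make the scoring concrete, here is a small Python sketch that reproduces the arithmetic of cosine similarity plus 1.0 for the two sample documents written earlier. It is plain client-side math, not Elasticsearch code; the function name cosine_similarity is my own, and the ValueError only imitates the dimension-mismatch behavior described above.

```python
import math


def cosine_similarity(query, doc):
    """Cosine similarity of two equal-length vectors."""
    if len(query) != len(doc):
        # Imitates the dimension-mismatch error described in the text.
        raise ValueError("query and document vectors must have the same dims")
    dot = sum(q * d for q, d in zip(query, doc))
    norm_q = math.sqrt(sum(q * q for q in query))
    norm_d = math.sqrt(sum(d * d for d in doc))
    return dot / (norm_q * norm_d)


query_vector = [-0.5, 10, 6]
docs = {"1": [0.5, 10, 6], "2": [-0.5, 10, 10]}

# Same shape as the script: cosine similarity shifted by 1.0 into [0, 2].
scores = {doc_id: cosine_similarity(query_vector, vec) + 1.0 for doc_id, vec in docs.items()}
for doc_id, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(doc_id, round(score, 4))
```

In this sketch document 1 comes out at roughly 1.996 and document 2 at roughly 1.970, so document 1 ranks first. Scores returned by Elasticsearch may differ in the last digits because of floating-point precision.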

3.2 Calculate dot product: dotProduct

The dotProduct function calculates the dot product metric between a given query vector and a document vector.

POST index3/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": """
        double value = dotProduct(params.queryVector, doc['my_vector']);
        return sigmoid(1, Math.E, -value);
        """,
        "params": {
          "queryVector": [
            -0.5,
            10,
            6
          ]
        }
      }
    }
  }
}

Using the standard sigmoid function prevents negative scores.
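
The following Python sketch works through the same calculation for the two sample documents. It assumes that the script function sigmoid(value, k, a) computes value^a / (k^a + value^a), which is my reading of the documented formula; with value = 1 and k = e this reduces to the standard sigmoid 1 / (1 + e^(-x)). The function names are my own.

```python
import math


def sigmoid(value, k, a):
    """value^a / (k^a + value^a), assumed to match the script function."""
    return value ** a / (k ** a + value ** a)


def dot_product(query, doc):
    """Sum of the element-wise products of two vectors."""
    return sum(q * d for q, d in zip(query, doc))


query_vector = [-0.5, 10, 6]
docs = {"1": [0.5, 10, 6], "2": [-0.5, 10, 10]}

for doc_id, vec in docs.items():
    value = dot_product(query_vector, vec)
    # sigmoid(1, e, -value) reduces to 1 / (1 + e^(-value)).
    print(doc_id, value, sigmoid(1, math.e, -value))
```

With vectors of this magnitude the dot products are large (135.75 and 160.25), so the sigmoid saturates and both documents evaluate to 1.0 in this sketch. If the score has to distinguish documents, normalizing the vectors before indexing is one way to keep the dot product in a range where the sigmoid is still sensitive.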

3.3 Manhattan Distance: l1norm

The l1norm function calculates the L1 distance (Manhattan distance) between a given query vector and a document vector.

POST index3/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "1 / (1 + l1norm(params.queryVector, doc['my_vector']))",
        "params": {
          "queryVector": [-0.5, 10, 6]
        }
      }
    }
  }
}

Unlike cosineSimilarity, which represents similarity, l1norm and l2norm represent distance or difference. This means that the more similar two vectors are, the lower the values that l1norm and l2norm produce. So when similar vectors need to score higher, we invert the output of l1norm and l2norm. In addition, to avoid division by 0 when a document vector exactly matches the query, 1 is added to the denominator.
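
The inversion can be checked by hand with a short Python sketch. It computes both distances for the two sample documents, plus a hypothetical document named "exact" that equals the query vector and is not part of the index created earlier. The function names are my own; this is plain arithmetic, not Elasticsearch code.

```python
import math


def l1_distance(query, doc):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(q - d) for q, d in zip(query, doc))


def l2_distance(query, doc):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((q - d) ** 2 for q, d in zip(query, doc)))


def distance_to_score(distance):
    """Invert a distance so closer vectors score higher; +1 avoids dividing by 0."""
    return 1 / (1 + distance)


query_vector = [-0.5, 10, 6]
# "exact" is a hypothetical document that matches the query exactly.
docs = {"1": [0.5, 10, 6], "2": [-0.5, 10, 10], "exact": [-0.5, 10, 6]}

for doc_id, vec in docs.items():
    l1 = l1_distance(query_vector, vec)
    l2 = l2_distance(query_vector, vec)
    print(doc_id, distance_to_score(l1), distance_to_score(l2))
```

Document 1 differs from the query in a single coordinate by 1, and document 2 in a single coordinate by 4, so for this data the L1 and L2 distances coincide: the scores are 0.5 and 0.2 respectively. The exact match has distance 0 and receives the maximum score of 1.0 instead of causing a division by zero.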

3.4 Euclidean distance: l2norm

The l2norm function calculates the L2 distance (Euclidean distance) between a given query vector and a document vector.

POST index3/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "1 / (1 + l2norm(params.queryVector, doc['my_vector']))",
        "params": {
          "queryVector": [
            -0.5,
            10,
            6
          ]
        }
      }
    }
  }
}

3.5 Custom calculation functions

You can access the vector values directly and implement your own calculation, such as a custom cosine similarity. The doc[<field>].vectorValue function is supported starting with Elasticsearch 7.8.0; it fails on earlier versions, for example 7.5.1.

The vector value can be accessed directly through the following functions:

doc[<field>].vectorValue – returns the value of the vector as an array of floats.
doc[<field>].magnitude – returns the magnitude of the vector as a float. For vectors created before version 7.5, the magnitude is not stored, so it is recalculated every time this function is called.

POST index3/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": """
          float[] v = doc['my_vector'].vectorValue;
          float vm = doc['my_vector'].magnitude;
          float dotProduct = 0;
          for (int i = 0; i < v.length; i++) {
            dotProduct += v[i] * params.queryVector[i];
          }
          return dotProduct / (vm * (float) params.queryVectorMag);
        """,
        "params": {
          "queryVector": [
            -0.5,
            10,
            6
          ],
          "queryVectorMag": 11.67262
        }
      }
    }
  }
}
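
The script expects the magnitude of the query vector as a parameter, so it has to be computed on the client before the request is sent. The Python sketch below shows one way to do that and mirrors the script's loop for sample document 1. The function names vector_magnitude and custom_cosine are my own and are not part of any Elasticsearch API.

```python
import math


def vector_magnitude(vector):
    """Euclidean length of a vector, the value passed as queryVectorMag."""
    return math.sqrt(sum(x * x for x in vector))


def custom_cosine(doc_vector, doc_magnitude, query_vector, query_magnitude):
    """Same arithmetic as the script: a manual dot product over two magnitudes."""
    dot = 0.0
    for i in range(len(doc_vector)):
        dot += doc_vector[i] * query_vector[i]
    return dot / (doc_magnitude * query_magnitude)


query_vector = [-0.5, 10, 6]
query_magnitude = vector_magnitude(query_vector)  # about 11.6726 for this vector

doc_vector = [0.5, 10, 6]
score = custom_cosine(doc_vector, vector_magnitude(doc_vector), query_vector, query_magnitude)
print(round(query_magnitude, 5), round(score, 4))
```

The magnitude passed in must belong to the query vector that is actually sent; otherwise the result is no longer a cosine similarity and can fall outside the range from -1 to 1.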
