A brief look at django indexing

preamble

Since databases are used to store more and more information every day, these are also key components in every Django project. Therefore it is important to understand how they work.

Of course, I can't explain all the full details of all the different databases available for Django. Not only that, because I don't know it all, but also because it would create a conversation. Or possibly an entire conference.

The only thing I want to say about the theoretical background of databases is that there is something called "relational algebra". It can be expressed in every SELECT statement you can possibly think of. Mathematical proof.

How Database Lookups Work

Instead, let's start with how database lookups work. Because that's where we spend most of our time.

Suppose we have this database of names of people and their corresponding ages when they started programming.

Now we're going to pick everyone who started at 19.

We can express this in a SQL query:

SELECT * FROM people WHERE age = 19

Now, how do we find everyone matching that query?

Table Scan Lookup

Well, this is easy. We just look at each row in the table, check if the condition applies, and if so, return the row.

This is called a "full table scan."

So far so good. We have 7 rows here. So we're looking at 7 rows. This is a small number of rows, so the query is fast.

But imagine you have 100,000, 100 million, 10 billion or even more rows. Traversing each row could be very time consuming. It's not something we want or something we can afford. We want to be able to provide guaranteed timing to find a specific row. Independent of the number of rows.

This is where the index joins the party.

What is an index?

For ever-increasing amounts of data, indexes allow fast access to a single (or multiple) item without much reduction in speed. This is also known as "random access". You'll see why it's called that. But first let's look at the most common types of indexes in modern database systems.

B-Trees / B + Trees

By far the most common index is the B-Tree index. Or more specifically, B + Tree indexes.

They are either named after one of their inventors, Rudolf Bayer, or because they self-balance. It's not very clear, but it doesn't really matter. Self-balancing means that the trees are guaranteed a certain amount of time: what's most interesting about B-trees is that the size of the index doesn't matter; for any index of the same size, the time will be the same. You will also see this.

Like all trees (both in computer science and in nature), you start with a root. The difference between computer science and nature is that in computer science we put the root of the tree at the top.

Suit yourself. It's the root note of a level 3 B+ tree.

Rank 3 means that this node can have 3 keys. This is what the 3 boxes at the top of that node are for. The 4 boxes below hold pointers to another node or a row in a database table.

Now, suppose you have keys 11 and 37 in this node and you do not have a third key.

The leftmost pointer then points to a node with a key less than 11.The second pointer points to a node with a key greater than or equal to 11 and less than 37.The third pointer points to a node with a key greater than or equal to 37.

From the age column index of the table I showed at the start, it could look something like this.

The magic now is the pointers in the second row of nodes. Each of them points to a single row in the table with a specific key, which in our example is age.

But it's not just that.

You will see how the last pointer in each underlying node points to the next node? This is used for "index scanning". I'll be back later.

Let's look at the second row of nodes in more detail.

You can now see how each pointer from one of the tree nodes points to a single row in the table.

You can also see how these pointers from the tree nodes randomly point to certain rows in the table. That's why it's called "random access". The database randomly jumps around in the database tables.

Random Access Lookup

Let's refresh our memory with our previous SQL query.

SELECT * FROM people WHERE age = 19

Now how does the index help find the appropriate row faster?

Well, let's look at the tree:

It takes 1 step from the first node to the second node. And one second from the second node to the row in the database table.

Remember, we have to look at all 7 rows to see if they match the database query.

And since there are no more keys in the 19 indexes, we're done.

index scan

Now, going back to the Index Scan, let's say you want to count the number of people who started coding between the ages of 5 and 13.

SELECT COUNT(*) FROM people WHERE age BETWEEN 5 AND 19;
SELECT COUNT(*) FROM people WHERE age >= 5 AND age <= 19;

The database will look for key 5 and then use a pointer to the next node to find more keys.

And because all the query information the database needs is in the indexes, the database doesn't even look at the tables.

The index is great.

Let's get them in Django.

In fact, we already did.

there are

db_index = True , you can set the model fields on the
index_together = (('name', 'age'),) you can set in the model's Meta class
ForeignKey() / OneToOneField() uses an index to quickly find data in a related table.
primary_key = True , Django automatically uses AutoField to represent the id column on each model.

This has been great. But this feature set is a bit limited. There are not only B + tree indexes there. There's also a ton of

2016

Let's take a look at 2016.

Marc Tambling and I have indexing ideas. In fact, Marc already had some ideas in his work. We have ideas about APIs. Things that we want to have in Django. Like, let's make Django support all indexes .

But we don't have time to realize our ideas!

But we got lucky. In fact, the Django project got lucky.

Google Summer of Code 2016

Django has once again been accepted as a Google Summer of Code organization. Thank you Google!

For those who don't know what this is: Google pays students to work on open source projects for 3 months while being mentored by project contributors.

For the most part, Tim Graham , along with Marc and I tutored student Akshesh Doshi to handle the more general indexing support in Django. From writing down proposals for APIs and such until they were finally merged into Django.

The main outcomes of GSoC 2016 are (fields, name) (docs)

It defines the base class for all indexes. You can use them through the indexing options in the Meta class of the model.

For example like this:

from  import models

class Person():
 name = (max_length=200)

 class Meta:
  indexes = [
   (
    fields=['name'],
    name='name_idx',
   ),
  ]

This creates a B + tree index on the name column of the database table.

Of course, this is nothing new. This is what can be performed when using db_index = True in the name field.

You can certainly define indexes on multiple columns:

from  import models

class Person():
 name = (max_length=200)
 age = ()

 class Meta:
  indexes = [
   (
    fields=['name', 'age'],
    name='name_age_idx',
   ),
  ]

Of course, this is nothing new. You can do it with index_together.

But you can do the same now:

from  import JSONField
from  import GinIndex
from  import models

class Doc():
 data = JSONField()

 class Meta:
  indexes = [
   GinIndex(
    fields=['data'],
    name='data_gin',
   ),
  ]

Define a GinIndex . This is PostgreSQL specific. But it's something you couldn't do before. At least not reliably, without too much pain.

GinIndex can be used to index keys and values in a JSON blob. Therefore, you can filter the rows in the table where the keys in the JSONB fields are mapped to specific values. This is like "NoSQL 1-0-1".

Another built-in index type that comes with Django 1.11 is BrinIndex , which simply allows for faster calculation of aggregations. For example, find out when each article was last purchased.

And since indexes are part of the database schema, it's obvious to keep track of indexes through migration. Therefore, indexes are created when you run python migrate:

BEGIN;
--
-- Create model Doc
--
CREATE TABLE "someapp_doc" (
 "id" serial NOT NULL PRIMARY KEY,
 "data" jsonb NOT NULL);
--
-- Create index data_gin on field(s) data of model doc
--
CREATE INDEX "data_gin" ON "someapp_doc" USING gin ("data");
COMMIT;

Featured Creative

Great.

This is Django 1.11, which was released yesterday.

But what's the point of Django 2.0?

What's on the horizon?

What do we ultimately want?

Function Index

They are useful in a variety of situations where you don't want to index the original value, but can make changes to it, such as lower case for strings. I've been working on this for a while now. I'm not there yet. I'd love to get help from someone who understands the expression API.

from  import models

class Author():
 name = (max_length=200)

 class Meta:
  indexes = [
   FuncIndex(
    expression=Lower('name'),
    name='name_lower_idx',
   ),
  ]

db_index = <IndexClass>

As mentioned earlier, using indexes for individual columns can be very cumbersome. Therefore, let's support the Index class as an attribute of db_index.

from  import models

class Author():
 name = (
  max_length=200,
  db_index=HashIndex
 )

It doesn't make sense to have a B + tree for some fields. As shown before, GinIndex is perfect for JSONField. Why not use the default_index_class for each field class when db_index = True ?

from  import JSONField
from  import GinIndex
from  import models

# Somewhere in Django's JSONField implementation:
# JSONField.default_index_class = GinIndex

class Document():
 data = JSONField(db_index=True)

Refactoring index_together and db_index

This one is more compelling than user-oriented:

I can imagine that db_index and index_together might make sense internally using Model._meta.indexes. This is something to investigate.

GiSTIndex

PostgreSQL has a GiSTIndex that can be used for geospatial queries such as "give me all the points that have a maximum distance of 10 from the given point". This is not in Django 1.11. I don't know why, but I'm guessing it's because no one has added it.

Please note that Django 2.0 no longer supports Python 2!

This is the whole content of this article.