- MongoDB Cookbook(Second Edition)
- Cyrus Dasadia Amol Nayak
- 1405字
- 2025-04-04 20:24:21
Creating and understanding sparse indexes
Schemaless design is one of the fundamental features of Mongo. This allows documents in a collection to have disparate fields, with some fields present in some documents and absent in others. In other words, these fields might be sparse, which might have already given you a clue on what sparse indexes are. In this recipe, we will create some random test data and see how sparse indexes behave against a normal index. We shall see the advantages and one major pitfall of using a sparse index.
Getting ready
For this recipe, we need to create a collection called sparseTest
. We will require a server to be up and running. Refer to the Installing single node MongoDB recipe from Chapter 1, Installing and Starting the Server for instructions on how to start the server. Start the shell with the SparseIndexData.js
script loaded. This script is available on the Packt website for download. To know how to start the shell with a script preloaded, refer to the Connecting to a single node in the Mongo shell with JavaScript recipe in Chapter 1, Installing and Starting the Server.
How to do it…
- Load the data in the collection by invoking the following. This should import 100 documents in the
sparseTest
collection.> createSparseIndexData()
- Now, take a look at the data by executing the following query, taking note of the
y
field in the top few results:> db.sparseTest.find({}, {_id:0})
- We can see that the
y
field is absent, or it is unique if it is present. Let's then execute the following query:> db.sparseTest.find({y:{$ne:2}}, {_id:0}).limit(15)
- Take a note of the result; it contains both the documents that match the condition as well as fields that do not contain the given field,
y
. - As the value of
y
seems unique, let's create a new unique index on they
field as follows:> db.sparseTest.createIndex({y:1}, {unique:1})
This throws an error complaining that the value is not unique and the offending value is the null value.
- We will fix this by making this index sparse as follows:
> db.sparseTest.createIndex({y:1}, {unique:1, sparse:1})
- This should fix our problem. To confirm that the index got created, execute the following on the shell:
> db.sparseTest.getIndexes()
This should show two indexes, the default one on
_id
and the one that we just created in the preceding step. - Now, execute the query that we executed earlier in step 3 again and see the result.
- Look at the result and compare it with what we saw before the index was created. Re-execute the query but with the following hint forcing a full collection scan:
>db.sparseTest.find({y:{$ne:2}},{_id:0}).limit(15).hint({$natural:1})
- Observe the result.
How it works…
These were a lot of steps that we just performed. We will now dig deeper and explain the internals and reasoning for the weird behavior that we see while querying the collection that used sparse indexes.
The test data that we created using the JavaScript method just created documents with a key, x
, whose value is a number starting from one, all the way up to 100. The value of y
is set only when x
is a multiple of three—its value is a running number as well, starting from one and should go up to a maximum of 33 when x
is 99
.
We then execute a query and see the following result:
> db.sparseTest.find({y:{$ne:2}}, {_id:0}).limit(15) { "x" : 1 } { "x" : 2 } { "x" : 3, "y" : 1 } { "x" : 4 } { "x" : 5 } { "x" : 7 } { "x" : 8 } { "x" : 9, "y" : 3 } { "x" : 10 } { "x" : 11 } { "x" : 12, "y" : 4 } { "x" : 13 } { "x" : 14 } { "x" : 15, "y" : 5 } { "x" : 16 }
The value where y
is 2
is missing in the result and this is what we intended. Note that the documents where y
isn't present are still seen in the result. We now plan to create an index on the y
field. As the field is either not present or has a value that is unique, it seems natural that a unique index should work.
Internally, indexes add an entry in the index by default, even if the field is absent in the original document in the collection. The value going in the index will, however, be null. This means that there will be the same number of entries in the index as the number of documents in the collection. For a unique index, the value (including the null values) should be unique across the collection, which explains why we got an exception during index creation where the field is sparse (not present in all the documents).
A solution for this problem is making the index sparse, and all we did was add sparse:1
to the options along with unique:1
. This does not put an entry in the index if the field doesn't exist in the document. Thus, the index will now contain fewer entries. It will only contain those entries where the field is present in the document. This not only makes the index smaller, making it easy to fit in the memory, but also solves our problem of adding a unique constraint. The last thing that we want is an index of a collection with millions of documents to have millions of entries, where only a few hundred have some values defined.
Though we can see that creating a sparse index did make the index efficient, it introduced a new problem where some query results were not consistent. The same query that we executed earlier yields different results. See the following output:
> db.sparseTest.find({y:{$ne:2}}, {_id:0}).hint({y:1}).limit(15) { "x" : 3, "y" : 1 } { "x" : 9, "y" : 3 } { "x" : 12, "y" : 4 } { "x" : 15, "y" : 5 } { "x" : 18, "y" : 6 } { "x" : 21, "y" : 7 } { "x" : 24, "y" : 8 } { "x" : 27, "y" : 9 } { "x" : 30, "y" : 10 } { "x" : 33, "y" : 11 } { "x" : 36, "y" : 12 } { "x" : 39, "y" : 13 } { "x" : 42, "y" : 14 } { "x" : 45, "y" : 15 } { "x" : 48, "y" : 16 }
Why did this happen? The answer lies in the query plan for this query. Execute the following to view the plan of this query:
>db.sparseTest.find({y:{$ne:2}}, {_id:0}). hint({y:1}).limit(15).explain()
This plan shows that it used the index to fetch the matching results. As this is a sparse index, all the documents that didn't have the y
field are not present in it and they didn't show up in the result, though they should have. This is a pitfall that we need to be careful of when querying a collection with a sparse index and the query happens to use the index. It will yield unexpected results. One solution is to force a full collection scan, where we provide the query analyzer a hint using the hint
function. Hints are used to force query analyzers to use a user-specified index. Though this is not recommended usually as you really need to know what you are doing, this is one of the scenarios where this is really needed. So, how do we force a full table scan? All we do is provide {$natural:1}
in the hint
function. A natural ordering of a collection is the order that it is stored in on the disk for a particular collection. This hint
forces a full table scan and now we get the results as before. The query performance will, however, degrade for large collections as it is now using a full table scan.
If the field is present in a lot of documents (There is no formal cutoff for what is a lot; it can be 50% for some or 75% for others.) and not really sparse, then making the index sparse doesn't make much sense apart from when we want to make it unique.
Note
If two documents have a null value for the same field, unique index creation will fail, and creating it as a sparse index will not help either.