Django and haystack-part I
Javier Rojas
Senior Software EngineerHaystack/Django
You can use Haystack to index and search for text in any field in any of the models you have defined in your project.
Haystack doesn't perform the search
per se; it works as a bridge between the information in your models/queries in your search forms, and a text search engine/backend. We will show here how to configure Haystack to search in different fields of a single model, using Solr as the search backend.
Preliminaries
Let's assume we have a small CMS application (
cms_base
), with a Page
model, as follows:
class Page(models.Model):
title = models.CharField(
max_length=100,
verbose_name="title",
)
content = models.TextField(
verbose_name="content",
)
attachment = models.FileField()
...
author = models.ForeignKey(
...
)
...
updated_at = models.DateTimeField(
auto_now=True,
)
class Meta:
...
And we want to have a search form that searches in the
title, the content, the author, and the content of the attachment of all the pages available.
Note that we are assuming here that you have already installed Haystack and Solr properly, and that you have configured your Django project to load Haystack.
Indexing your data
The basic procedure to index the data in a model is to:
- Define a class that tells Haystack which model to use, how to select the instances of the model to index, and other miscellaneous tidibits (an Index class).
- Haystack will index one document per model instance. We must provide a template to build such document, given the model instance.
- Create the initial index, indexing all the instances of the models configured.
- Update periodically the index.
Index class definition
Create the
cms_base/search_indexes.py
file, with the following class definition.
from cms_base.models import Page
from haystack import indexes
class PageIndex(indexes.SearchIndex, indexes.Indexable):
text = indexes.CharField(document=True, use_template=True)
def get_model(self):
return Page
The search engines that Haystack uses are built around the idea of
documents: you feed them a document (a bunch of plain text), and some sort of ID for the document, and when searching they will give you back the matched documents IDs.
The declaration of the
text
field in this class is basically telling Haystack that this is the "main" or default field to perform search upon (document=True
), and that we will use a template (yes, a regular Django template) to build that document (use_template=True
).
Take into account that the field declarations of this class have nothing to do with the model (Page) fields. You can declare fields that take their data from a model field, but it isn't required to search the instance contents. Also, the field names in the Index class don't have to match the field names in the model class. Finally, it is a common (Haystack) practice to use the
text
name for the field that defines the main document, so you should not change it if you can help it.
The
get_model
method simply tells Haystack what model instances we are going to index.
NOTE: Haystack will detect and load automatically all the classes in the search_indexes.py
file of all the applications of your project (in this case, cms_base
); if you need to index/search other models, just define the class in the corresponding search_indexes.py
file, and you're good to go.
Document template definition
The Index class will look for a template to build the document, according to the following schema, in any of the template directories of your Django project:
search/indexes/<APP_NAME>/<MODEL_NAME>_<INDEX_FIELD>.txt
In our example, we'll create the following file:
cms_base/templates/search/indexes/cms_base/page_text.txt
(cms_base/templates
being one of the directories where Django searches for templates)
This template should build a simple text file, with all the text that you want to index for your model instance. Right now, we will index the
title
, content
, and author
fields. To do so, write your template as follows:
{{ object.title|safe }}
{{ object.content|safe }}
{{ object.author.get_full_name }}
As you can see, this is a regular Django template; you can use filters, access attributes, methods, and so on. The model instance will be available in the template as the
object
variable.
Some things to take into account when writing the template:
- Django's default behaviour is to HTML-escape the text it puts in the templates; if you want to get your text as-is, you'll have to use the
safe
template filter. - Do your best to get only the words you want indexed into the template. This might imply, for example, removing HTML tags from a field that contains HTML text. Template filters like striptags might prove useful.
Indexing contents of attached files
Indexing the content of attached files (
FileField
s) takes a bit of extra work; basically you have to:
- Extract the text of the file (.DOC, .PDF, .RTF files.)
- Make that text available in the template context, so you can put it in the document.
Extracting the text depends heavily on the kind of file you are analyzing, and is rather cumbersome. And, even though most search engines have functionality to do precisely that processing and extraction, this functionality must also be exposed via Haystack to be useful to us.
As of now, the only backend that has this functionality exposed in Haystack (v. 2.1) is Solr.
To extract the text of an arbitrary file, you have to override the
prepare
method of the Index class, as follows:
class PageIndex(indexes.SearchIndex, indexes.Indexable):
...
def prepare(self, obj):
data = super(PageIndex, self).prepare(obj)
file_data = self._get_backend(None).extract_file_contents(
obj.attachment,
)
template = loader.select_template(
("search/indexes/cms_base/page_text.txt", ),
)
data["text"] = template.render(Context({
"object": obj,
"file_data": file_data,
}))
return data
The method that extracts the text from the file is
extract_file_contents
; it connects to the backend, sends the file, and receives a response. It returns a dictionary, with the following entries:
responseHeader
: data about the connection. Not really of interest to us.metadata
: information about the file: encoding, size, content type, etc. Nothing worth indexing, in this example.contents
: The actual text inside the file.
Note: take into account that the content is delivered as an HTML document by Solr; we'll have to get rid of the tags later.
Note that we are re-rendering the
text
field, and now we are passing a new element to the template: file_data
.
In the template, we can add the contents of the file like this:
{{ object.title|safe }}
{{ object.content|safe }}
{{ object.author.get_full_name }}
{{ file_data.contents|striptags|safe }}
We use the
striptags
filter, because of the HTML tags added by the Solr service to the attached file text.
Create the initial version of the index.
After configuring the model and defining the template for each document, we are (almost) ready to index our data. Before doing that, we must generate Solr's
schema. Issue the following command:
./manage.py build_solr_schema > schema.xml
put the
schema.xml
file in Solr's configuration area (/etc/solr/conf/schema.xml
in Ubuntu's stock installation of Solr.), and restart the application server that is hosting Solr (Jetty/Tomcat).
Now we can index our model. Issue the following command:
./manage.py rebuild_index
It will check
all of the elements of your model and index the text you defined in the template.
Keeping your index up-to-date
Haystack provides us with an
update_index
command, that lets us check only the most recent elements we have added/modified, and index/reindex them. You can use it as follows:
./manage.py PageIndex --age=2
the argument to the
age
parameter tells Haystack to check for elements that have been modified in the last (in this example) 2 hours.
However, to do this, Haystack needs a bit of support in the models it is indexing; there must be a field in the model that keeps track of the last modification date of each instance, and we must tell Haystack which field it is.
In our example, the
Page.updated_at
field fills this role. To let Haystack know about it, we must override the get_updated_field
method of the index class, as follows:
class PageIndex(indexes.SearchIndex, indexes.Indexable):
...
def get_updated_field(self):
return "updated_at"
Note if we don't provide this field, Haystack will always check all the instances of the model it is indexing when performing an update, regardless of the age
parameter.
Searching for your data
Stay tuned for part 2 of this series, where we'll explain how to query the index from Django, filter results and other front-end issues related with searching :)
Written by Javier Rojas
A seasoned developer, Javier brings expertise in designing and building complex software systems. With a focus on quality and performance, they lead projects that drive innovation and deliver reliable solutions to meet user needs.