Presented by Bec White
Prepared for BADCamp, November 2014
Senior Engineer and Team Lead at Palantir.net
@becw on Twitter
mentors - the friend who dragged me to my first DrupalCamp 8 years ago; my first Drupal job, which sent me to lots of DrupalCons and camps; and everybody on the Palantir.net team
Generally, your site sends (or queues) data to the search server when your content is updated (e.g., in hook_node_save())
"analysis" is making generalizations like:
analysis happens when content is indexed. indexing associates the original document (via id, link, other metadata) with its analyzed content.
analysis happens on both content and queries!
Full text search means matching against content from different parts of a document (fields on a node), and matching documents that don't contain the exact query.
In Drupal, "lots of content" is somewhere over 5 or 10 thousand nodes/documents. This is approaching the point where core search doesn't make sense--it doesn't provide the search features useful for narrowing a large result set, or it starts to have performance issues
I've implemented Drupal core search across 20,000 documents, and the performance was... not good. Part of this was the additional filtering that we applied to the result set.
when I say "weird", I mean that the processing/generalizations a search server does can be used to implement lots of features beyond plain site search
listing = listing, sorting, and filtering. when a query seems slow due to scale and complexity, and you have a search server, you can think about shifting it to Elasticsearch instead of optimizing MySQL (or negotiating with Views...)
improve sorting = configure an analyzer to remove leading "the", "a", or punctuation
search w/ different levels of precision = query syntax, things like and/or/not, precedence, slop
sharing data = using the search server as a read-only data store for another application
"You Know, for Search"
Site search, but also working with tons of data, and also as a fast data backend for a JSON API.
Other search servers you may be familiar with:
vagrant is great for development because you don't run into collisions with other developers, and you can poke it as much as you want
talk to your ops about this!
The first story is not weird, but I will try to get to the "weird" very soon.
Views "contains" text searches (case-insensitive because of MySQL collation configuration, but otherwise exact match only), SQL LIKE wildcard queries
Drupal core search (does some analysis and tokenization), so it can actually be sufficient when you don't have too much content and don't need fancy features
So that users can find content using approximate terms in big stacks of content that don't have a clear hierarchy
Search API lets you configure which fields are searchable and how
in drupal, "content" means nodes, not generally pages. so the site search won't return listing pages, landing pages, etc., unless those pages are built as nodes (or some other sort of entity).
weird result order = unimportant words affect results, words match too precisely. can fix this by updating the index configuration.
third thing... the OR behavior, even though the default Views filter config says AND, is because of the way the search query is built by elasticsearch_connector_search_api. this is a lot harder to change.
these issues are why you might want to use solr as your search server if you are implementing site search.
that said, it's really easy to tweak the analyzer in elasticsearch to make your search better--way easier than solr.
in order to tweak the analyzer, you need to know a couple of things.
case and punctuation are somewhat universal; stemming and stop words are language specific! this is why they don't come configured by default.
if you're working with a language that uses accents, you may also want the icu-folding Elasticsearch plugin (installed on the ES server) to do character folding
at the elasticsearch level, you can apply different analyzers to different fields.
sometimes you don't want to analyze a field... why wouldn't you? for sorting, for facets, fields should only contain a single token.
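As a sketch of what that looks like at the mapping level (ES 1.x-era syntax; the "title" field and sub-field name are hypothetical), the main field is analyzed for full-text matching while a "sort" sub-field stays a single un-analyzed token:

```python
# Hypothetical mapping sketch (Elasticsearch 1.x-style syntax):
# "title" is analyzed for full-text search; "title.sort" is kept as a
# single un-analyzed token, which is what sorting and facets need.
title_mapping = {
    "properties": {
        "title": {
            "type": "string",
            "analyzer": "standard",
            "fields": {
                "sort": {"type": "string", "index": "not_analyzed"},
            },
        },
    },
}
```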
Content editors needed to find precise matches among lots of records
used the Elasticsearch JavaScript library with jQuery; this required a jQuery update
talking to ES with javascript means that the elasticsearch server has to be more open--again, talk to your ops about how to secure it
first of all, ES queries are really different from SQL queries, where subqueries are kind of unnatural; the ES API query structure is nested. that means it's hard to build a good query with Views (the defaults are messed up, per the "site search" example)
queries affect search score, important for fulltext search
filters have built in caching, narrow search scope
"filtered" query to do queries on a filtered set
"bool" query to combine "must", "should", "must not" queries
many queries are available as filters, and vice versa
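Roughly, in the 1.x-era query DSL, combining the two looks like this (a sketch; field names like "body" and "status" are hypothetical):

```python
# Sketch of a "filtered" query wrapping a "bool" query (ES 1.x DSL).
# Field names ("body", "status") are hypothetical.
search_body = {
    "query": {
        "filtered": {
            "query": {
                "bool": {
                    # "must" clauses are required and contribute to the score;
                    # "should" clauses boost the score; "must_not" excludes.
                    "must": [{"match": {"body": "elasticsearch"}}],
                    "should": [{"match": {"body": "drupal"}}],
                    "must_not": [{"match": {"body": "deprecated"}}],
                }
            },
            # The filter narrows the search scope without affecting
            # scoring, and the server can cache it.
            "filter": {"term": {"status": 1}},
        }
    }
}
```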
start with a plain "simple_query_string" query that allows search syntax
configure the simple_query_string query to AND terms together by default, so that additional terms NARROW search results
nest within a "filtered" query IFF filter factors are present
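A minimal sketch of that query-building logic (field names and boosts are hypothetical, not from the actual project):

```python
def build_query(terms, filters=None):
    """Build an ES query body: a simple_query_string that ANDs terms
    together by default, wrapped in a "filtered" query only when
    filters are present. Field names and boosts are hypothetical."""
    query = {
        "simple_query_string": {
            "query": terms,
            "fields": ["title^2", "body"],
            # AND terms together so each additional term NARROWS results.
            "default_operator": "and",
        }
    }
    if filters:
        # Only nest inside "filtered" when there are filter factors.
        query = {"filtered": {"query": query, "filter": {"bool": {"must": filters}}}}
    return {"query": query}
```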
Search-based navigation for a dataset that is mainly titles.
Don't do any "analysis" in Drupal/PHP!!!
that decoupled drupal session that Larry Garfield has been giving lately? we used Elasticsearch for part of that. autosuggest is a small part of how we used elasticsearch there, but it was the funnest part.
A Silex app sat on the other side of Elasticsearch and wrapped up JSON data from Elasticsearch in HAL.
I had done a ton of "weird" Solr work (sarnia) and had been reading about elasticsearch, so we decided to use it as the data source for the API layer
our internal engineering team for this project was four people--myself, Robin, Beth, and Larry. let me just note what an incredible opportunity it is in this field to work on a team with so many women, and the gender composition was an accident of resourcing and who worked out of the chicago office. we still have miles to go on creating a professional environment that can empower and enrich folks of varied backgrounds, but I'm lucky in that right now, that work is part of our daily process and environment at palantir.
so I became the architect and expert on the elasticsearch component
because Solr and Elasticsearch are based on the same search engine library, Lucene, the concepts and even many of the search options and behaviors carry over
But what about autosuggest?
standard tokens = split on whitespace and punctuation, lowercase everything
shingles = make more tokens by combining sequential words
ngrams = make even more tokens by breaking tokens into progressively longer chunks
ngrams generate tokens with trailing whitespace, so trim that
then remove tokens that are stop words, like "the". that should not match any titles on its own... but it should affect matching on sequenced words
then remove duplicate tokens, because those shouldn't affect relevance
and do all this with token filters on the analyzer
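The token-filter chain above might be wired up like this (a sketch in ES 1.x-style index settings; filter names and gram sizes are illustrative, not the exact production config):

```python
# Sketch of index settings for the autosuggest analyzer described above:
# standard tokens -> shingles -> edge ngrams -> trim -> stop -> unique.
autocomplete_settings = {
    "analysis": {
        "filter": {
            "title_shingle": {
                # Make more tokens by combining sequential words.
                "type": "shingle",
                "min_shingle_size": 2,
                "max_shingle_size": 4,
            },
            "title_edge_ngram": {
                # Break tokens into progressively longer prefix chunks.
                "type": "edge_ngram",
                "min_gram": 1,
                "max_gram": 20,
            },
        },
        "analyzer": {
            "autocomplete": {
                "type": "custom",
                "tokenizer": "standard",  # split on whitespace/punctuation
                "filter": [
                    "lowercase",
                    "title_shingle",
                    "title_edge_ngram",
                    "trim",    # ngrams of shingles can carry trailing whitespace
                    "stop",    # "the" alone shouldn't match every title
                    "unique",  # duplicate tokens shouldn't skew relevance
                ],
            }
        },
    }
}
```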
To tweak analyzers, sometimes you need to look at what is happening.
curl -XGET "http://elasticsearch.dev:9200/movie_titles/_analyze?pretty=true&analyzer=autocomplete&text=The%20Dark%20Knight"
Elasticsearch head is a plugin that gives you clicky access to Elasticsearch. interface to explore and debug.
go to the query interface, enter /movie_titles as the path and _analyze?analyzer=autocomplete&text=The Dark Knight as the query
remember earlier I said that a search server analyzes content, analyzes a user's queries, then finds matches between the two? sometimes analyzing the content and the queries differently can make for more precise matches.
why should the search string be just one token? because the second word you add should narrow the search by sequence... everything you type should appear in order in the result.
to handle this, you can configure a different analyzer to handle indexing the content than the user's search queries.
applies to diacritics/accent folding, too! matching exactly what the user typed.
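In mapping terms, that separation might look like this (a sketch; analyzer names are hypothetical, and ES 1.x used the index_analyzer/search_analyzer pair):

```python
# Sketch: index content with the ngram-heavy "autocomplete" analyzer,
# but run the user's query through a plainer analyzer so the search
# string stays close to exactly what the user typed.
# Analyzer names are hypothetical.
title_field = {
    "title": {
        "type": "string",
        "index_analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search",
    }
}
```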
so it looks solved, but then there's this. the thing about search behavior is that what we actually expect it to do is very hard to codify.
team lead hat: don't expect to complete search features without iteration. no matter how solid your analysis is, the client has a better sense of how and why their dataset will be searched, and what the results should be.
The API is playful; you don't have to do a lot of configuration up front. Start throwing stuff at it, see how it responds, and then configure it to do what you need.
If you want site search and don't want to do heavy customization, use the Solr integration
Just know these exist, then read the docs.
Search feature expectations are at the mercy of the Google search UX, but these features are a different kind of search tool. Expectations about how search works are hard to nail down because people use lots of different search tools and explore their datasets in lots of different ways.
Let's make something good together
Keep tabs on our work at @Palantir
Want to hear about what we're doing?