Presented by Bec White
Prepared for BADCamp, November 2014
Senior Engineer and Team Lead at Palantir.net
@becw on Twitter
mentors - the friend who dragged me to my first DrupalCamp 8 years ago; my first Drupal job, which sent me to lots of DrupalCons and camps; and everybody on the Palantir.net team
Generally, your site sends (or queues) data to the search server when your content is updated (e.g., in hook_node_save())
"analysis" is making generalizations like:
analysis happens when content is indexed. indexing associates the original document (via id, link, other metadata) with its analyzed content.
analysis happens on both content and queries!
Full text search means matching against content from different parts of a document (fields on a node), and matching documents that don't contain the exact query.
In Drupal, "lots of content" is somewhere over 5 or 10 thousand nodes/documents. This is approaching the point where core search doesn't make sense--it doesn't provide the search features useful for narrowing a large result set, or it starts to have performance issues
I've implemented Drupal core search across 20,000 documents, and the performance was... not good. Part of this was the additional filtering that we applied to the result set.
when I say "weird", I mean that the processing/generalizations a search server does can be used to implement lots of features beyond plain site search
listing = listing, sorting, and filtering. when a query seems slow due to scale and complexity, and you have a search server, you can think about shifting it to Elasticsearch instead of optimizing MySQL (or negotiating with Views...)
improve sorting = configure an analyzer to remove leading "the", "a", or punctuation
search w/ different levels of precision = query syntax, things like and/or/not, precedence, slop
sharing data = using the search server as a read-only data store for another application
"You Know, for Search"
Site search, but also working with tons of data, and also as a fast data backend for a JSON API.
Other search servers you may be familiar with:
vagrant is great for development because you don't run into collisions with other developers, and you can poke it as much as you want
talk to your ops about this!
The first story is not weird, but I will try to get to the "weird" very soon.
Views "contains" text searches (case-insensitive because of MySQL collation configuration, but otherwise exact match only), SQL LIKE wildcard queries
Drupal core search (does some analysis and tokenization), so it can actually be sufficient when you don't have too much content and don't need fancy features
So that users can find content using approximate terms in big stacks of content that don't have a clear hierarchy
Search API lets you configure which fields are searchable and how
in drupal, "content" means nodes, not generally pages. so the site search won't return listing pages, landing pages, etc., unless those pages are built as nodes (or some other sort of entity).
weird result order = unimportant words affect results, words match too precisely. can fix this by updating the index configuration.
third thing... the OR behavior, even though the default Views filter config says AND, is because of the way the search query is built by elasticsearch_connector_search_api. this is a lot harder to change.
these issues are why you might want to use solr as your search server if you are implementing site search.
that said, it's really easy to tweak the analyzer in elasticsearch to make your search better--way easier than solr.
in order to tweak the analyzer, you need to know a couple of things.
case and punctuation are somewhat universal; stemming and stop words are language specific! this is why they don't come configured by default.
if you're working with a language that uses accents, you may also want the icu-folding Elasticsearch plugin (installed on the ES server) to do character folding
at the elasticsearch level, you can apply different analyzers to different fields.
sometimes you don't want to analyze a field... why wouldn't you? for sorting, for facets, fields should only contain a single token.
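As a sketch of what that looks like at the mapping level (ES 1.x-era syntax; the "title" field and sub-field name are hypothetical), the main field is analyzed for full-text matching while a "sort" sub-field stays a single un-analyzed token:

```python
# Hypothetical mapping sketch (Elasticsearch 1.x-style syntax):
# "title" is analyzed for full-text search; "title.sort" is kept as a
# single un-analyzed token, which is what sorting and facets need.
title_mapping = {
    "properties": {
        "title": {
            "type": "string",
            "analyzer": "standard",
            "fields": {
                "sort": {"type": "string", "index": "not_analyzed"},
            },
        },
    },
}
```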
Content editors needed to find precise matches among lots of records
used the Elasticsearch JavaScript library with jQuery; this required a jQuery update
talking to ES with javascript means that the elasticsearch server has to be more open--again, talk to your ops about how to secure it
first of all, ES queries are really different from SQL queries, where subqueries are kind of unnatural; the ES API query structure is nested. that means it's hard to build a good query with Views (the defaults are messed up, per the "site search" example)
queries affect search score, important for fulltext search
filters have built in caching, narrow search scope
"filtered" query to do queries on a filtered set
"bool" query to combine "must", "should", "must not" queries
many queries are available as filters, and vice versa
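Roughly, in the 1.x-era query DSL, combining the two looks like this (a sketch; field names like "body" and "status" are hypothetical):

```python
# Sketch of a "filtered" query wrapping a "bool" query (ES 1.x DSL).
# Field names ("body", "status") are hypothetical.
search_body = {
    "query": {
        "filtered": {
            "query": {
                "bool": {
                    # "must" clauses are required and contribute to the score;
                    # "should" clauses boost the score; "must_not" excludes.
                    "must": [{"match": {"body": "elasticsearch"}}],
                    "should": [{"match": {"body": "drupal"}}],
                    "must_not": [{"match": {"body": "deprecated"}}],
                }
            },
            # The filter narrows the search scope without affecting
            # scoring, and the server can cache it.
            "filter": {"term": {"status": 1}},
        }
    }
}
```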
start with a plain "simple_query_string" query that allows search syntax
configure the simple_query_string query to AND terms together by default, so that additional terms NARROW search results
nest within a "filtered" query IFF filter factors are present
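A minimal sketch of that query-building logic (field names and boosts are hypothetical, not from the actual project):

```python
def build_query(terms, filters=None):
    """Build an ES query body: a simple_query_string that ANDs terms
    together by default, wrapped in a "filtered" query only when
    filters are present. Field names and boosts are hypothetical."""
    query = {
        "simple_query_string": {
            "query": terms,
            "fields": ["title^2", "body"],
            # AND terms together so each additional term NARROWS results.
            "default_operator": "and",
        }
    }
    if filters:
        # Only nest inside "filtered" when there are filter factors.
        query = {"filtered": {"query": query, "filter": {"bool": {"must": filters}}}}
    return {"query": query}
```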
Search-based navigation for a dataset that is mainly titles.
Don't do any "analysis" in Drupal/PHP!!!
that decoupled drupal session that Larry Garfield has been giving lately? we used Elasticsearch for part of that. autosuggest is a small part of how we used elasticsearch there, but it was the funnest part.
A Silex app sat on the other side of Elasticsearch and wrapped up JSON data from Elasticsearch in HAL.
I had done a ton of "weird" Solr work (sarnia) and had been reading about elasticsearch, so we decided to use it as the data source for the API layer
our internal engineering team for this project was four people--myself, Robin, Beth, and Larry. let me just note what an incredible opportunity it is in this field to work on a team with so many women, and the gender composition was an accident of resourcing and who worked out of the chicago office. we still have miles to go on creating a professional environment that can empower and enrich folks of varied backgrounds, but I'm lucky in that right now, that work is part of our daily process and environment at palantir.
so I became the architect and expert on the elasticsearch component
because Solr and Elasticsearch are based on the same search engine library, Lucene, the concepts and even many of the search options and behaviors carry over
But what about autosuggest?
standard tokens = split on whitespace and punctuation, lowercase everything
shingles = make more tokens by combining sequential words
ngrams = make even more tokens by breaking tokens into progressively longer chunks
ngrams generate tokens with trailing whitespace, so trim that
then remove tokens that are stop words, like "the". that should not match any titles on its own... but it should affect matching on sequenced words
then remove duplicate tokens, because those shouldn't affect relevance
and do all this with token filters on the analyzer
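The token-filter chain above might be wired up like this (a sketch in ES 1.x-style index settings; filter names and gram sizes are illustrative, not the exact production config):

```python
# Sketch of index settings for the autosuggest analyzer described above:
# standard tokens -> shingles -> edge ngrams -> trim -> stop -> unique.
autocomplete_settings = {
    "analysis": {
        "filter": {
            "title_shingle": {
                # Make more tokens by combining sequential words.
                "type": "shingle",
                "min_shingle_size": 2,
                "max_shingle_size": 4,
            },
            "title_edge_ngram": {
                # Break tokens into progressively longer prefix chunks.
                "type": "edge_ngram",
                "min_gram": 1,
                "max_gram": 20,
            },
        },
        "analyzer": {
            "autocomplete": {
                "type": "custom",
                "tokenizer": "standard",  # split on whitespace/punctuation
                "filter": [
                    "lowercase",
                    "title_shingle",
                    "title_edge_ngram",
                    "trim",    # ngrams of shingles can carry trailing whitespace
                    "stop",    # "the" alone shouldn't match every title
                    "unique",  # duplicate tokens shouldn't skew relevance
                ],
            }
        },
    }
}
```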
To tweak analyzers, sometimes you need to look at what is happening.
curl -XGET "http://elasticsearch.dev:9200/movie_titles/_analyze?pretty=true&analyzer=autocomplete&text=The%20Dark%20Knight"
Elasticsearch head is a plugin that gives you clicky access to Elasticsearch. interface to explore and debug.
go to the query interface, enter /movie_titles as the path and _analyze?analyzer=autocomplete&text=The Dark Knight as the query
remember earlier I said that a search server analyzes content, analyzes a user's queries, then finds matches between the two? sometimes analyzing the content and the queries differently can make for more precise matches.
why should the search string be just one token? because the second word you add should narrow the search by sequence... everything you type should appear in order in the result.
to handle this, you can configure a different analyzer to handle indexing the content than the user's search queries.
applies to diacritics/accent folding, too! matching exactly what the user typed.
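In mapping terms, that separation might look like this (a sketch; analyzer names are hypothetical, and ES 1.x used the index_analyzer/search_analyzer pair):

```python
# Sketch: index content with the ngram-heavy "autocomplete" analyzer,
# but run the user's query through a plainer analyzer so the search
# string stays close to exactly what the user typed.
# Analyzer names are hypothetical.
title_field = {
    "title": {
        "type": "string",
        "index_analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search",
    }
}
```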
so it looks solved, but then there's this. the thing about search behavior is that what we actually expect it to do is very hard to codify.
team lead hat: don't expect to complete search features without iteration. no matter how solid your analysis is, the client has a better sense of how and why their dataset will be searched, and what the results should be.
The API is playful; you don't have to do a lot of configuration up front. Start throwing stuff at it, see how it responds, and then configure it to do what you need.
If you want site search and don't want to do heavy customization, use the Solr integration
Just know these exist, then read the docs.
Search feature expectations are at the mercy of the Google search UX, but these features are a different kind of search tool. Expectations about how search works are hard to nail down because people use lots of different search tools and explore their datasets in lots of different ways.
Let's make something good together
Keep tabs on our work at @Palantir
Want to hear about what we're doing?