
Elasticsearch

A Short Introduction

Created by Hayden Chudy / @hjc1710

What is Elasticsearch?

  • A distributed, clusterable search server powered by a JSON DSL for searching.
  • Really, just a layer of abstractions on top of Lucene to store more structured data.
  • Can also be used as a simple document-oriented NoSQL database.
    • Sort of like Mongo with better queries and even fewer relations.
    • Normally backed by another database storage engine though.
  • Written in Java.
  • Created by Shay Banon in 2010.
  • Layer of abstractions: Lucene only stores strings, but ES supports storing, indexing, and searching through complicated JSON structures. This is all done through clever use of key names and values within Lucene, which ES turns into a structure of sorts.
  • Some people use ES as their primary storage engine, especially if their core data source is logs. However, ES is frequently joined with another database storage engine, and data is mirrored between the two.

What is Lucene?

  • Open Source project by Apache to write an Information Retrieval (IR) System.
  • IR Systems use metadata and full text search to make finding information very efficient.
  • Provides a robust and accurate scoring algorithm, in addition to fast queries.
  • A number of major search engine solutions are built on top of Lucene.
  • Started in 1999 by Doug Cutting, adopted by Apache in 2001, became its own top-level project in 2005.
  • Also written in Java, providing a robust native Java API.
  • IR Systems are focused on efficiently searching for and retrieving information, as opposed to being focused on efficiently storing said information (which is what normal DBMS's are focused on).
  • For example: a common strategy in IR to improve search robustness is to index the same field of data in multiple, different fashions so it can match more complex queries. Compare this to a relational DB, where you are constantly trying to lower how many duplicate entries of a field there are.
  • The scoring algorithm used is tf-idf. However, individual query objects can affect tf-idf scoring, and this lies at the heart of why Lucene scoring is great.
  • Large projects built on top of Lucene, in addition to ES, include:
    • Solr
    • Compass (precursor to ES)
    • Swiftype (an enterprise search start up that sells search solutions to other sites, built on top of Lucene)
    • KinoSearch (another big search server like Solr)
  • Some people even use Lucene directly and circumvent the likes of Solr and ES entirely. Examples of this include: Apple.com, LinkedIn, Jira, and, formerly, Twitter.
  • ES utilizes Lucene's Java API to provide fast, native access.

Let's Begin!

Installing Elasticsearch

  • Use the Vagrantfile in this repository.
  • Use your Operating System's package manager:
# latest on Ubuntu
$ apt-get install openjdk-7-jdk
$ wget -qO- http://packages.elasticsearch.org/GPG-KEY-elasticsearch | apt-key add -
$ echo "deb http://packages.elasticsearch.org/elasticsearch/1.7/debian stable main" > /etc/apt/sources.list.d/elasticsearch.list
$ apt-get update
$ apt-get install elasticsearch
# OSX with Brew
$ brew install elasticsearch
					
  • Vagrant requires version 1.6+, vagrant-bindfs, and vagrant-salt.
  • Install those and vagrant up, and vagrant will forward port 9200 for you.
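The package doesn't start ES for you on every platform; on Ubuntu, starting it now and on boot looks something like this (a sketch; service names can vary):

# start now, and register for boot
$ sudo service elasticsearch start
$ sudo update-rc.d elasticsearch defaults 95 10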

Seeing if it Works

hayden@beardtop ~> curl -XGET localhost:9200
{
	"status" : 200,
	"name" : "Madam Slay",
	"cluster_name" : "elasticsearch",
	"version" : {
		"number" : "1.7.1",
		"build_hash" : "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
		"build_timestamp" : "2015-07-29T09:54:16Z",
		"build_snapshot" : false,
		"lucene_version" : "4.10.4"
	},
	"tagline" : "You Know, for Search"
}
					
  • Seeing if ES is up and running is simple: just curl port 9200 and see if you get a response!
  • Fun fact: all Elasticsearch node names come from Marvel superhero names.

Indexing Your First Item

hayden@beardtop ~> curl -XPOST "localhost:9200/croscon/employees/1" -d '{
	"name": "Tom Sawyer",
	"id": 1,
	"specialties": ["javascript", "php"]
}'
{"_index":"croscon","_type":"employees","_id":"1","_version":1,"created":true}

hayden@beardtop ~> curl -XGET "localhost:9200/croscon/employees/1"
{"_index":"croscon","_type":"employees","_id":"1","_version":1,"found":true,"_source":{
	"name": "Tom Sawyer",
	"id": 1,
	"specialties": ["javascript", "php"]
}}
					
  • Indexing is as simple as a POST request.
  • We explicitly specified the id in the path, but you can also omit it and let ES create one on its own (see the sketch after this list).
  • Fetching is then as simple as GETing the route by id.
  • The URL scheme goes as follows: $ES_URL/$INDEX_NAME/$TYPE_NAME/$DOCUMENT_ID.
    • An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data, and another index for a product catalog.
    • A type is a logical category/partition of your index whose structure is up to you. In general, a type is defined for documents that have a set of common fields. For example, let’s assume you run a blogging platform and store all your data in a single index. In this index, you may define a type for user data, another type for blog data, and yet another type for comments data.
    • A document is a basic unit of information that can be indexed. For example, you can have a document for a single customer, another document for a single product, and yet another for a single order. This document is expressed in JSON, much like Mongo. Within an index/type, you can store as many documents as you want. Note that although a document physically resides in an index, a document actually must be indexed/assigned to a type inside an index.
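For example, POSTing the same body without an id would create a second Tom document under an auto-generated _id (a sketch; response omitted):

hayden@beardtop ~> curl -XPOST "localhost:9200/croscon/employees" -d '{
	"name": "Tom Sawyer",
	"id": 1,
	"specialties": ["javascript", "php"]
}'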

Updating Data

hayden@beardtop ~> curl -XPUT "localhost:9200/croscon/employees/1" -d '{
	"name": "Tom Sawyer",
	"id": 1,
	"specialties": ["javascript", "php"],
	"date_of_birth": "1990-12-10"
}'
{"_index":"croscon","_type":"employees","_id":"1","_version":2,"created":false}

hayden@beardtop ~> curl -XGET "localhost:9200/croscon/employees/1"
{"_index":"croscon","_type":"employees","_id":"1","_version":2,"found":true,"_source":{
	"name": "Tom Sawyer",
	"id": 1,
	"specialties": ["javascript", "php"],
	"date_of_birth": "1990-12-10"
}}
					
Let's add a new date_of_birth field. To update an item, you merely PUT to the route by ID with the new, complete document. Deleting fields is done by omission. Brand-new fields are automatically mapped into the index, parsed, and saved; known fields are parsed as described by their mappings, and saved.

Some more employee data:

{ "name": "Ben Rogers", "id": 2, "specialties": ["css", "less", "magic"], "date_of_birth": "1987-08-10" }
{ "name": "Huck Finn", "id": 3, "specialties": ["php", "python", "devops"], "date_of_birth": "1990-10-17" }
{ "name": "Pap Finn", "id": 4, "specialties": ["php", "python", "devops", "magic"], "date_of_birth": "1984-07-16" }
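For completeness, each of these is indexed exactly like Tom was (a sketch for Ben; Huck and Pap follow the same pattern):

hayden@beardtop ~> curl -XPOST "localhost:9200/croscon/employees/2" -d '{
	"name": "Ben Rogers",
	"id": 2,
	"specialties": ["css", "less", "magic"],
	"date_of_birth": "1987-08-10"
}'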

Searching

Getting Everyone

hayden@beardtop ~> curl -XGET 'localhost:9200/croscon/employees/_search?pretty'
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "croscon",
      "_type" : "employees",
      "_id" : "4",
      "_score" : 1.0,
      "_source":{ "name": "Pap Finn", "id": 4, "specialties": ["php", "python", "devops", "magic"], "date_of_birth": "1984-07-16" }
    }, {
      "_index" : "croscon",
      "_type" : "employees",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{"name": "Tom Sawyer","id": 1,"specialties": ["javascript", "php"],"date_of_birth": "1990-12-10"}
    }, {
      "_index" : "croscon",
      "_type" : "employees",
      "_id" : "2",
      "_score" : 1.0,
      "_source":{ "name": "Ben Rogers", "id": 2, "specialties": ["css", "less", "magic"], "date_of_birth": "1987-08-10" }
    }, {
      "_index" : "croscon",
      "_type" : "employees",
      "_id" : "3",
      "_score" : 1.0,
      "_source":{ "name": "Huck Finn", "id": 3, "specialties": ["php", "python", "devops"], "date_of_birth": "1990-10-17" }
    } ]
  }
}
						
The presence of that `_search` is integral: running a GET against just an index and a type (without `_search`) will produce an error along the lines of `No endpoint found for $TYPE_NAME`. If you'll notice, these aren't in any real order, and every document returned has a score of 1.0. When searching, `?pretty` is your friend; otherwise, all your results are returned minified.

Searching

Getting just the PHP'rs

hayden@beardtop ~> curl -XGET 'localhost:9200/croscon/employees/_search?pretty' -d '
{
  "query": {
    "term": {
      "specialties": "php"
    }
  }
}'

{
  "hits" : {
    "total" : 3,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "croscon",
      "_type" : "employees",
      "_id" : "1",
      "_score" : 0.19178301,
      "_source":{ "name": "Tom Sawyer","id": 1,"specialties": ["javascript", "php"],"date_of_birth": "1990-12-10" }
    }, {
      "_index" : "croscon",
      "_type" : "employees",
      "_id" : "4",
      "_score" : 0.15342641,
      "_source":{ "name": "Pap Finn", "id": 4, "specialties": ["php", "python", "devops", "magic"], "date_of_birth": "1984-07-16" }
    }, {
      "_index" : "croscon",
      "_type" : "employees",
      "_id" : "3",
      "_score" : 0.15342641,
      "_source":{ "name": "Huck Finn", "id": 3, "specialties": ["php", "python", "devops"], "date_of_birth": "1990-10-17" }
    } ]
  }
}

						
And everyone but Ben is returned, as expected. Notice that we searched for a string inside an array rather easily; ES naturally makes all of its queries just work with arrays. Also notice the score has changed! Tom Sawyer wins because he has the fewest specialties!

Searching

Getting everyone born in 1990 and on:

hayden@beardtop ~> curl -XGET 'localhost:9200/croscon/employees/_search' -d '
{
  "query": {
    "range": {
      "date_of_birth": {
        "gte": "1990-01-01",
        "lte": "now"
      }
    }
  }
}'
{
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "croscon",
      "_type" : "employees",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{ "name": "Tom Sawyer", "id": 1, "specialties": ["javascript", "php"],"date_of_birth": "1990-12-10" }
    }, {
      "_index" : "croscon",
      "_type" : "employees",
      "_id" : "3",
      "_score" : 1.0,
      "_source":{ "name": "Huck Finn", "id": 3, "specialties": ["php", "python", "devops"], "date_of_birth": "1990-10-17" }
    } ]
  }
}

						
The `lte` of now is optional, but it is included to show you that ES has some strings with special meaning; this one turns into the current DateTime. You'll notice the scores are still a flat 1.0 here. That's because scoring a range query is hard, and elasticsearch doesn't do everything. These are all just very light examples, and all of the options given can be further customized to support boosting, multiple date formats, timezones, etc. Segue into mapping with a note about: "But, wait... how did that date range search work? We just gave it a string! There were no explicit dates involved!"

Welcome to Mappings

hayden@beardtop ~> curl -XGET 'localhost:9200/_mapping?pretty'
{
  "croscon" : {
    "mappings" : {
      "employees" : {
        "properties" : {
          "date_of_birth" : {
            "type" : "date",
            "format" : "dateOptionalTime"
          },
          "id" : {
            "type" : "long"
          },
          "name" : {
            "type" : "string"
          },
          "specialties" : {
            "type" : "string"
          }
        }
      }
    }
  }
}
					
  • Behind the scenes, ES creates a mapping for each type it has in an index. It parses each JSON payload and makes an intelligent best guess at what type each field should be. If you'll notice, it guessed correctly that our date_of_birth field was a Date with an optional Time part, even though it was sent as a string. Mapping also works on nested objects and arrays, natively.
  • Mapping is what make the magic ES queries possible, and also efficient.
  • Some supported types: string, number, date, boolean, binary (base64 representation of binary data; not stored or indexed by default), arrays, objects, arrays of objects, ip addresses (IPv4 only), geographic points, and geographic shapes.
  • The type of a field determines which queries can be run on it and how given queries are processed/run. For example, doing a range over numbers is done very differently than a range over IPs.

Explicit Mappings

  • While ES does determine automatic mappings for each type, you can also explicitly define a mapping for each type yourself, in case of ambiguous fields, complicated fields (such as multi-fields), or missing fields.
  • Defined either as a file in /etc/elasticsearch/templates or dynamically PUT via an API request.
  • Example:
hayden@beardtop ~> curl -XPUT 'localhost:9200/croscon/_mapping/projects' -d '
{
  "projects": {
    "_id": {
      "index": "not_analyzed",
      "path": "id",
      "type": "long"
    },
    "properties": {
      "name": {
        "type": "string",
        "store": true,
        "index": "analyzed",
        "fields": {
          "raw": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      },
      "due_date": {
        "type": "date"
      },
      "id": {
        "type": "long"
      },
      "client": {
        "type": "object",
        "properties": {
          "id": {
            "type": "long"
          },
          "name": {
            "type": "string"
          }
        }
      }
    }
  }
}'
						
  • Multi-fields: multi-fields are when you pass a given field through multiple analyzers and store multiple variations of it, for different searching reasons. A very common multi-field is the `raw` field, which is just the string indexed as is, not parsed before being indexed.
  • Stored vs. indexed: for any given field, you can choose to store and/or index it.
    • Storing a field means it is stored directly alongside the document identifier. Think of it this way: Lucene just stores pointers to IDs, and these IDs are then looked up in ES and come with extra information.
    • By default, the indexed JSON document is stored in the _source field, which can be parsed and used to return any field you desire. Alternatively, you can disable _source to save space, or store an additional field explicitly, allowing you to fetch it without parsing _source (see the sketch below).
    • Indexing a field means it is inserted into Lucene and prepared for searching. To search over a field, it must be indexed.
    • You can index a field as analyzed or not_analyzed. Analyzed means it will be transformed before being inserted (more on that later); not_analyzed means it is inserted into Lucene raw.
    • By default, fields are indexed and not stored.
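As a sketch of what storing buys you: since `name` is stored in the mapping above (`"store": true`), you can ask for just that field at search time instead of parsing _source (the `fields` option is standard ES 1.x; output omitted):

hayden@beardtop ~> curl -XGET 'localhost:9200/croscon/projects/_search?pretty' -d '
{
  "query": { "match_all": {} },
  "fields": ["name"]
}'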

Breaking Your Mappings

If you ever try to insert data that doesn't match your mappings, ES will slap you back with this:
hayden@beardtop ~> curl -XPOST 'localhost:9200/croscon/employees' -d '
{
  "name": "Ben Rogers",
  "id": 2,
  "specialties": ["css", "less", "magic"],
  "date_of_birth": "1987-XX-YY"
}'
{"error":"MapperParsingException[failed to parse [date_of_birth]];
nested: MapperParsingException[failed to parse date field
[1987-XX-YY], tried both date format [dateOptionalTime], and
timestamp number with locale []]; nested:
IllegalArgumentException[Invalid format: \"1987-XX-YY\" is
malformed at \"-XX-YY\"]; ","status":400}
						
I will point out, though, that this is a very useful error message.

TO THE MOON WITH SCORING AND ORDERING!

Ordering

hayden@beardtop ~> curl -XGET 'localhost:9200/croscon/projects/_search' -d '
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "due_date": {
        "order": "asc",
        "missing": "_last"
      }
    },
    {
      "name.raw": {
        "order": "desc"
      }
    },
    "_score"
  ]
}'
{
  "hits" : {
    "total" : 3,
    "max_score" : null,
    "hits" : [ {
      "_index" : "croscon",
      "_type" : "projects",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{"name": "MC3 Rearch", "due_date": "2015-09-08", "id": 1, "client": { "name": "HFA", "id": 1 } },
      "sort" : [ 1441670400000, "MC3 Rearch", 1.0 ]
    }, {
      "_index" : "croscon",
      "_type" : "projects",
      "_id" : "2",
      "_score" : 1.0,
      "_source":{"name": "MC4", "due_date": "2016-12-21", "id": 2, "client": { "name": "Croscon", "id": 2 } },
      "sort" : [ 1482278400000, "MC4", 1.0 ]
    }, {
      "_index" : "croscon",
      "_type" : "projects",
      "_id" : "3",
      "_score" : 1.0,
      "_source":{"name": "MC5", "due_date": null, "id": 3, "client": null },
      "sort" : [ 9223372036854775807, "MC5", 1.0 ]
    } ]
  }
}

						
  • Earlier, we lightly discussed that Lucene implemented multiple scoring algorithms. Internally, Elasticsearch uses these derived scores to determine the sort order for all results.
  • However, we can override that sort order rather easily. We can sort on our arbitrary fields in whatever order we want, then let Elasticsearch's native score ordering kick in wherever we choose.
  • The order syntax is rather simple, but extendable:
    • Pick a field name, and then a direction.
    • Alternatively, pick a field name, then open an object for more options! In the second example we're doing two things:
    • We're using our name.raw multi-field. We're using this because names are composed of two parts, a first and a last, and we want to sort on both of those names. If we were to run this over just the `name` field, we would get unpredictable results, due to how analyzers work (something we'll cover later).
    • We're using the special `missing` option. `missing` tells ES what to do when it finds a document without this field. Rather, it tells ES what VALUE to give that document for this field. In this case, we're telling ES to give that document whatever value it needs to be sorted last (likely a name of: ""). If we wanted all missing names to be sorted as if they were equal to "CROSCON", that would be as simple as changing "_last" to "CROSCON" (see the sketch after this list).
  • Our final example is the score sorting fallback. If you just pass `_score`, then ES knows to sort on the score in a descending fashion! AWESOME!
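And here is that missing-as-"CROSCON" variant promised above (a sketch; only the sort clause matters):

curl -XGET 'localhost:9200/croscon/projects/_search?pretty' -d '
{
  "query": { "match_all": {} },
  "sort": [
    {
      "name.raw": {
        "order": "desc",
        "missing": "CROSCON"
      }
    }
  ]
}'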

Ordering

 curl -XGET 'localhost:9200/croscon/projects/_search?pretty' -d '
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "due_date": {
        "order": "asc",
        "missing": "_first"
      }
    },
    {
      "name.raw": {
        "order": "desc"
      }
    },
    "_score"
  ]
}'
{
  "hits" : {
    "total" : 3,
    "max_score" : null,
    "hits" : [ {
      "_index" : "croscon",
      "_type" : "projects",
      "_id" : "3",
      "_score" : 1.0,
      "_source":{"name": "MC5", "due_date": null, "id": 3, "client": null },
      "sort" : [ -9223372036854775808, "MC5", 1.0 ]
    }, {
      "_index" : "croscon",
      "_type" : "projects",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{"name": "MC3 Rearch", "due_date": "2015-09-08", "id": 1, "client": { "name": "HFA", "id": 1 } },
      "sort" : [ 1441670400000, "MC3 Rearch", 1.0 ]
    }, {
      "_index" : "croscon",
      "_type" : "projects",
      "_id" : "2",
      "_score" : 1.0,
      "_source":{"name": "MC4", "due_date": "2016-12-21", "id": 2, "client": { "name": "Croscon", "id": 2 } },
      "sort" : [ 1482278400000, "MC4", 1.0 ]
    } ]
  }
}

						
We can also make missing fields come first.

Ordering

curl -XGET 'localhost:9200/croscon/projects/_search?pretty' -d '
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "due_date": {
        "order": "asc",
        "missing": 1444455518000
      }
    },
    {
      "name.raw": {
        "order": "desc"
      }
    },
    "_score"
  ]
}'
{
  "hits" : {
    "total" : 3,
    "max_score" : null,
    "hits" : [ {
      "_index" : "croscon",
      "_type" : "projects",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{"name": "MC3 Rearch", "due_date": "2015-09-08", "id": 1, "client": { "name": "HFA", "id": 1 } },
      "sort" : [ 1441670400000, "MC3 Rearch", 1.0 ]
    }, {
      "_index" : "croscon",
      "_type" : "projects",
      "_id" : "3",
      "_score" : 1.0,
      "_source":{"name": "MC5", "due_date": null, "id": 3, "client": null },
      "sort" : [ 1444455518000, "MC5", 1.0 ]
    }, {
      "_index" : "croscon",
      "_type" : "projects",
      "_id" : "2",
      "_score" : 1.0,
      "_source":{"name": "MC4", "due_date": "2016-12-21", "id": 2, "client": { "name": "Croscon", "id": 2 } },
      "sort" : [ 1482278400000, "MC4", 1.0 ]
    } ]
  }
}
						
Or we can give it a specific value. In this example, we've given it the UNIX timestamp (in milliseconds) that corresponds to 2015-10-10, putting it right in the middle.

Scoring

hayden@beardtop ~> curl -XGET 'localhost:9200/tweets/tweet/_search?pretty' -d '
{
  "query": {
	"match": {
	  "tweet": "taylorswift13"
	}
  }
}'
{
  "hits" : {
    "total" : 2,
    "max_score" : 0.095891505,
    "hits" : [ {
      "_index" : "tweets",
      "_type" : "tweet",
      "_id" : "2",
      "_score" : **0.095891505**,
      "_source":{"user": "@hjc1710", "tweet": "Not convinced @carlyraejepsen is better than @taylorswift13 though."}
    }, {
      "_index" : "tweets",
      "_type" : "tweet",
      "_id" : "4",
      "_score" : **0.076713204**,
      "_source":{"user": "@hjc1710", "tweet": "You'll always have my heart @taylorswift13, no matter what @carlyraejepsen does."}
    } ]
  }
}
					
  • The other part of ordering results is scoring. Scoring can basically be summed up as "Lucene gives a document a relevance score, indicating how relevant said document was to our initial search".
  • Obviously, this is a wonderful thing to sort on.
  • The starred portions are the scores elasticsearch has derived. Since we haven't boosted much, they're rather small, but you'll notice they're different. The first tweet scores higher because it is a shorter string, and matches in shorter strings are worth more, under the argument that the fewer words there are, the more impact each individual word has on the sentence.
  • But, how the fuck does it work!?

A Smidge of TF-IDF

  • TF-IDF is the algorithm that is the primary driving force behind Lucene's relevance score.
  • TF-IDF = "Term Frequency - Inverse Document Frequency"
  • Basically: the more a word appears in a single document, the more valuable it is; however, the more documents it appears in across a single index, the LESS valuable it is.
  • This sort of naturally handles things like Stop Words (and, the, is), but there are even better solutions to that problem later!
Mathematically, the way this all plays out is roughly:
  • Calculate the term frequency. This can be done multiple ways, but the easiest is to make term frequency the raw frequency of a term in a document: if f(t, d) is the raw number of times term t appears in document d, then tf(t, d) = f(t, d). Other, more complicated schemes apply, such as logarithmically scaled term frequencies [tf(t, d) = 1 + log(f(t, d))], or even augmented frequencies, which prevent bias towards longer documents [tf(t, d) = 0.5 + (0.5 × f(t, d)) / max{f(t', d) : t' in d}].
  • Calculate the inverse document frequency: take the log of the total number of documents in a corpus (or index, or collection of indices) divided by the number of documents that contain this term, so idf(t) = log(N / |{d : t in d}|).
  • Multiply those two.
In a nutshell: maths. Mucho, mucho maths. Lucene does this for every query, and it does it in an efficient manner!
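A quick worked example with the textbook formulas above (a sketch; Lucene's actual variant adds smoothing and normalization): take the term "php" across our 4 employee documents.

    f("php", tom) = 1, so tf("php", tom) = 1
    3 of the 4 documents contain "php"
    idf("php") = log(4 / 3) ≈ 0.29
    tf-idf("php", tom) ≈ 1 × 0.29 ≈ 0.29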

Peeking Into Scoring

Let's see how elasticsearch scores:
hayden@beardtop ~> curl -XGET 'localhost:9200/tweets/tweet/_search?pretty&explain' -d '
{
  "query": {
	"match": {
	  "tweet": "carlyraejepsen"
	}
  }
}'
{
  "hits" : {
    "total" : 4,
    "max_score" : 0.11506981,
    "hits" : [ {
      "_shard" : 2,
      "_node" : "08j3yVwyRaCJ3PQrQBcy3A",
      "_index" : "tweets",
      "_type" : "tweet",
      "_id" : "1",
      "_score" : 0.11506981,
      "_source":{"user": "@hjc1710", "tweet": "The new @carlyraejepsen album is top notch"},
      "_explanation" : {
        "value" : 0.11506981,
        "description" : "weight(tweet:carlyraejepsen in 0) [PerFieldSimilarity], result of:",
        "details" : [ {
          "value" : 0.11506981,
          "description" : "fieldWeight in 0, product of:",
          "details" : [ {
            "value" : 1.0,
            "description" : "tf(freq=1.0), with freq of:",
            "details" : [ {
              "value" : 1.0,
              "description" : "termFreq=1.0"
            } ]
          }, {
            "value" : 0.30685282,
            "description" : "idf(docFreq=1, maxDocs=1)"
          }, {
            "value" : 0.375,
            "description" : "fieldNorm(doc=0)"
          } ]
        } ]
      }
    },
    {
      "_source": {"user": "@hjc1710", "tweet": "Not convinced @carlyraejepsen is better than @taylorswift13 though."},
      "_explanation" : {
        "value" : 0.095891505,
        "description" : "weight(tweet:carlyraejepsen in 0) [PerFieldSimilarity], result of:",
        "details" : [ {
          "value" : 0.095891505,
          "description" : "fieldWeight in 0, product of:",
          "details" : [ {
            "value" : 1.0,
            "description" : "tf(freq=1.0), with freq of:",
            "details" : [ {
              "value" : 1.0,
              "description" : "termFreq=1.0"
            } ]
          }, {
            "value" : 0.30685282,
            "description" : "idf(docFreq=1, maxDocs=1)"
          }, {
            "value" : 0.3125,
            "description" : "fieldNorm(doc=0)"
          } ]
        } ]

...
						
  • If you throw the `explain` GET parameter into the query string, ES will return a HUGE amount of info on how the scoring went, in addition to other things.
  • Included in here is TF-IDF information! Also information on the field length norm, a measure of how long the field is, which is why the longer string was worth less earlier!
  • Notice how the IDF score is the same for both documents (because we're searching for the same term); it's really the field norm that differentiates them here.
  • Other info ES returns about the query includes:
    • The Lucene shard that the data came from (more on that later).
    • The ID of the node that found this result. Why is this returned? Because term and document frequencies are counted per shard, not per index!

Let's boost!

hayden@beardtop ~> curl -XGET 'localhost:9200/tweets/tweet/_search?pretty&explain' -d '
{
  "query": {
	"bool": {
	  "should": [
		{
		  "match": {
			"tweet": {
			  "query": "taylorswift13",
			  "boost": 2
			}
		  }
		},
		{
		  "match": {
			"tweet": "carlyraejepsen"
		  }
		}
	  ]
	}
  }
}'
{
  "hits" : {
    "total" : 4,
    "max_score" : 0.12865195,
    "hits" : [ {
      "_shard" : 3,
      "_node" : "08j3yVwyRaCJ3PQrQBcy3A",
      "_index" : "tweets",
      "_type" : "tweet",
      "_id" : "2",
      "_score" : 0.12865195,
      "_source":{"user": "@hjc1710", "tweet": "Not convinced @carlyraejepsen is better than @taylorswift13 though."},
      "_explanation" : {
        "value" : 0.12865196,
        "description" : "sum of:",
        "details" : [ {
          "value" : 0.08576798,
          "description" : "weight(tweet:taylorswift13^2.0 in 0) [PerFieldSimilarity], result of:",
          "details" : [ {
            "value" : 0.08576798,
            "description" : "score(doc=0,freq=1.0), product of:",
            "details" : [ {
              "value" : 0.89442724,
              "description" : "queryWeight, product of:",
              "details" : [ {
                "value" : 2.0,
                "description" : "boost"
              }, {
                "value" : 0.30685282,
                "description" : "idf(docFreq=1, maxDocs=1)"
              }, {
                "value" : 1.4574206,
                "description" : "queryNorm"
              } ]
            }, {
              "value" : 0.095891505,
              "description" : "fieldWeight in 0, product of:",
              "details" : [ {
                "value" : 1.0,
                "description" : "tf(freq=1.0), with freq of:",
                "details" : [ {
                  "value" : 1.0,
                  "description" : "termFreq=1.0"
                } ]
              }, {
                "value" : 0.30685282,
                "description" : "idf(docFreq=1, maxDocs=1)"
              }, {
                "value" : 0.3125,
                "description" : "fieldNorm(doc=0)"
              } ]
            } ]
          } ]
        }, {
          "value" : 0.04288399,
          "description" : "weight(tweet:carlyraejepsen in 0) [PerFieldSimilarity], result of:",
          "details" : [ {
            "value" : 0.04288399,
            "description" : "score(doc=0,freq=1.0), product of:",
            "details" : [ {
              "value" : 0.44721362,
              "description" : "queryWeight, product of:",
              "details" : [ {
                "value" : 0.30685282,
                "description" : "idf(docFreq=1, maxDocs=1)"
              }, {
                "value" : 1.4574206,
                "description" : "queryNorm"
              } ]
            }, {
              "value" : 0.095891505,
              "description" : "fieldWeight in 0, product of:",
              "details" : [ {
                "value" : 1.0,
                "description" : "tf(freq=1.0), with freq of:",
                "details" : [ {
                  "value" : 1.0,
                  "description" : "termFreq=1.0"
                } ]
              }, {
                "value" : 0.30685282,
                "description" : "idf(docFreq=1, maxDocs=1)"
              }, {
                "value" : 0.3125,
                "description" : "fieldNorm(doc=0)"
              } ]
            } ]
          } ]
        } ]
      }
    }, {
      "_shard" : 0,
      "_node" : "08j3yVwyRaCJ3PQrQBcy3A",
      "_index" : "tweets",
      "_type" : "tweet",
      "_id" : "4",
      "_score" : 0.10292157,
      "_source":{"user": "@hjc1710", "tweet": "You'll always have my heart @taylorswift13, no matter what @carlyraejepsen does."},
      "_explanation" : {
        "value" : 0.10292157,
        "description" : "sum of:",
        "details" : [ {
          "value" : 0.06861438,
          "description" : "weight(tweet:taylorswift13^2.0 in 0) [PerFieldSimilarity], result of:",
          "details" : [ {
            "value" : 0.06861438,
            "description" : "score(doc=0,freq=1.0), product of:",
            "details" : [ {
              "value" : 0.89442724,
              "description" : "queryWeight, product of:",
              "details" : [ {
                "value" : 2.0,
                "description" : "boost"
              }, {
                "value" : 0.30685282,
                "description" : "idf(docFreq=1, maxDocs=1)"
              }, {
                "value" : 1.4574206,
                "description" : "queryNorm"
              } ]
            }, {
              "value" : 0.076713204,
              "description" : "fieldWeight in 0, product of:",
              "details" : [ {
                "value" : 1.0,
                "description" : "tf(freq=1.0), with freq of:",
                "details" : [ {
                  "value" : 1.0,
                  "description" : "termFreq=1.0"
                } ]
              }, {
                "value" : 0.30685282,
                "description" : "idf(docFreq=1, maxDocs=1)"
              }, {
                "value" : 0.25,
                "description" : "fieldNorm(doc=0)"
              } ]
            } ]
          } ]
        }, {
          "value" : 0.03430719,
          "description" : "weight(tweet:carlyraejepsen in 0) [PerFieldSimilarity], result of:",
          "details" : [ {
            "value" : 0.03430719,
            "description" : "score(doc=0,freq=1.0), product of:",
            "details" : [ {
              "value" : 0.44721362,
              "description" : "queryWeight, product of:",
              "details" : [ {
                "value" : 0.30685282,
                "description" : "idf(docFreq=1, maxDocs=1)"
              }, {
                "value" : 1.4574206,
                "description" : "queryNorm"
              } ]
            }, {
              "value" : 0.076713204,
              "description" : "fieldWeight in 0, product of:",
              "details" : [ {
                "value" : 1.0,
                "description" : "tf(freq=1.0), with freq of:",
                "details" : [ {
                  "value" : 1.0,
                  "description" : "termFreq=1.0"
                } ]
              }, {
                "value" : 0.30685282,
                "description" : "idf(docFreq=1, maxDocs=1)"
              }, {
                "value" : 0.25,
                "description" : "fieldNorm(doc=0)"
              } ]
            } ]
          } ]
        } ]
      }
    },
    {
      "_shard" : 4,
      "_node" : "08j3yVwyRaCJ3PQrQBcy3A",
      "_index" : "tweets",
      "_type" : "tweet",
      "_id" : "3",
      "_score" : 0.005816851,
      "_source":{"user": "@hjc1710", "tweet": "Yea, I think I like 1989 more than anything @carlyraejepsen has done."},
      "_explanation" : {
        "value" : 0.005816851,
        "description" : "product of:",
        "details" : [ {
          "value" : 0.011633702,
          "description" : "sum of:",
          "details" : [ {
            "value" : 0.011633702,
            "description" : "weight(tweet:carlyraejepsen in 0) [PerFieldSimilarity], result of:",
            "details" : [ {
              "value" : 0.011633702,
              "description" : "score(doc=0,freq=1.0), product of:",
              "details" : [ {
                "value" : 0.15165187,
                "description" : "queryWeight, product of:",
                "details" : [ {
                  "value" : 0.30685282,
                  "description" : "idf(docFreq=1, maxDocs=1)"
                }, {
                  "value" : 0.49421698,
                  "description" : "queryNorm"
                } ]
              }, {
                "value" : 0.076713204,
                "description" : "fieldWeight in 0, product of:",
                "details" : [ {
                  "value" : 1.0,
                  "description" : "tf(freq=1.0), with freq of:",
                  "details" : [ {
                    "value" : 1.0,
                    "description" : "termFreq=1.0"
                  } ]
                }, {
                  "value" : 0.30685282,
                  "description" : "idf(docFreq=1, maxDocs=1)"
                }, {
                  "value" : 0.25,
                  "description" : "fieldNorm(doc=0)"
                } ]
              } ]
            } ]
          } ]
        }, {
          "value" : 0.5,
          "description" : "coord(1/2)"
        } ]
      }
    } ]
  }
}


						
Legit, just walk through this. This is a truncated result set, too.

Analyzers, Tokenizers, and TokenFilters

  • Analyzers, Tokenizers, and TokenFilters allow you to transform your textual data into a more searchable format.
  • An Analyzer merely consists of a series of Tokenizers and TokenFilters.
  • Customizing Analyzers, TokenFilters, and Tokenizers is quite possible, but is FAR beyond the scope of this talk.
  • Example analyzers include: standard, whitespace, and language.
  • Analyzers are what make elasticsearch powerful and good for searching. They take your very specific sentences and turn them into more generic and searchable pieces.
  • Example Analyzer:
    • Standard Analyzer: breaks words into tokens based on UAX #29, lowercases everything, and removes stop words.
    • Whitespace Analyzer: Splits words into tokens based on whitespace.
    • Language: An implementation of the standard analyzer for other languages besides English. Supported languages include: arabic, persian, portuguese, and norwegian.
  • When we're creating that `name.raw` field, all we're telling ES to do is... DON'T analyze that field at all, so we have the full string intact and can work with it.
  • Analyzers are also why sorting on non-raw fields gives odd results: when you sort on an analyzed field, ES sorts on the first token it finds, which is basically undefined behavior.
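You can watch an analyzer work via the _analyze API (a sketch against a local 1.x node; the response is a JSON list of the resulting tokens, with positions and offsets):

hayden@beardtop ~> curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'The Fox is Brown'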

Tokenizers

  • Tokenizers take a string of text and split it up into individual tokens, each of which is indexed and can be searched for.
  • Some example Tokenizers include: standard, keyword, ngram, and pattern.
  • Example ngram:
    ngram("croscon", max=5, min=4) == [
      "cros",
      "crosc",
      "rosc",
      "rosco",
      "osco",
      "oscon",
      "scon"
    ]
    								
  • Tokenizers are the crux of search. Without tokenizers, you would have to search for an exact string to get any match out of ES. With tokenizers, ES will break each word in a string up into its own item, allowing you to search for just that.
  • Example tokenizers:
    • Standard - a tokenizer good for most European languages, implements the Unicode Text Segmentation algorithm which is described in Unicode Annex #29, which describes guidelines for determining default boundaries between major text elements (characters, words, and sentences). I did not have the time to fully read through the algorithm, but it was badass.
    • Keyword - indexes the entire string as is.
    • NGram - breaks words up into smaller pieces, as specified by you. For example, if you set an ngram tokenizer up with a max ngram of 5 and a min of 4 and pass it 'croscon', you get: cros, crosc, rosc, rosco, osco, oscon, and scon. This is effective for breaking words up for partial matches. You pick the characters eligible for ngramming.
    • Pattern - breaks up strings into tokens that match a regex. Example: regex of: \$\d+\.\d+, with string of: "Cost: $25.00" would create a single token of: $25.00.

TokenFilters

  • TokenFilters further process tokens, stripping unwanted tokens and transforming the rest into more searchable forms.
  • Some examples of making something more searchable: stemming, lowercasing, and trimming.
  • Example TokenFilters include: stop token, reverse token, snowball token, asciifolding.
  • Example stop token:
    ['The', 'fox', 'is', 'brown', 'and', 'warm'] -> ['fox', 'brown', 'warm']
    								
  • There are 37 TokenFilters built into ElasticSearch natively, many provided by Lucene.
TokenFilters get rid of all the noise in a search that we just don't need. The best example is a stop word: in the English language, a stop word is just a word that is used as a tool and provides no real semantic meaning to a sentence, such as and, or, or but. If we were to index these, they would become almost immediately useless due to how TF-IDF works, meaning they just take up space. Well, instead of storing them... how about we strip them!? That's exactly what filters are for! Example TokenFilters:
  • Stop Token - removes stop words from a token stream. Examples include: and, is, and the. There are stop word dictionaries for a multitude of languages built into ES natively.
  • Reverse Token - just reverses the token.
  • Snowball Token - stems tokens using a Snowball generated stemming algorithm. Snowball is a small programming language geared around string properties and generating stemming algorithms.
  • asciifolding - collapses unicode symbols above code point 127 into their ASCII counterparts. Example: ü turns into u.
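The _analyze API will also chain a tokenizer and TokenFilters for you, which makes the stop example above easy to reproduce (a sketch; expect back just fox, brown, and warm):

hayden@beardtop ~> curl -XGET 'localhost:9200/_analyze?tokenizer=standard&filters=lowercase,stop&pretty' -d 'The fox is brown and warm'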

Stemming

  • Stemming is taking a word and reducing it to its "root stem".
  • For example, consider the word: skiing. The root of skiing is really just ski, and almost all searches for ski or skiing should return both. Another example is "dogs", whose root is dog. A search for dogs or dog should almost always be stemmed down because rarely do people care how many adorable puppies are in their returned search results.

Stemming

  • Stemming rules differ per language, but are generally pretty easy.
  • The algorithm is rather simple and basically works as a replacement table.
    SSES -> SS        caresses -> caress
    IES  -> I         ponies   -> poni
    SS   -> SS        caress   -> caress
    S    -> (drop)    cats     -> cat

Basically, you look at the end of words and follow the rules. These are not all of the rules for the English language, and you need to come up with rules for each language for this to work. This is called the Porter Stemming Algorithm; it is what Snowball implements, and it is quite good at it.
In fact, I searched for puppies on Google, and this single puppy is what showed up at the top, due to stemming.
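Stemming ships as the snowball TokenFilter, so you can poke at it with _analyze too (a sketch; per the rules above, expect ski, dog, and poni back):

hayden@beardtop ~> curl -XGET 'localhost:9200/_analyze?tokenizer=standard&filters=lowercase,snowball&pretty' -d 'skiing dogs ponies'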

Making Search Work for You

  • So far, ES has been quite powerful, but it's not perfect. There are some things ES doesn't handle at all, such as array ordering.
  • There are always workarounds though!
  • The solution to your problems in ES are almost always: STORE MORE DATA!
Internally, it would be an ENORMOUS pain in the ass for ES to store information on the ordering of each array entry, so it just doesn't. Why? Because ES stores arrays as {"user_alias": "bob"}, {"user_alias": "tim"}, {"user_alias": "joe"}, with just the array key, because this is easy to search over. Storing information about the ordering of the array in the key would, by definition, make every key for every array entry different. No es bueno.

Making Search Work for You

  • Let's fix our problem of searching over the first item in an array.
  • We'll do this by simply indexing the first item in that array:
    # at index time, denormalize the first array entry onto its own field
    doc = create_doc(id)
    doc.top_specialty = doc.specialties[0]
    index_doc(doc)
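Concretely, re-indexing Huck with the denormalized field is just the same whole-document PUT from earlier (a sketch):

hayden@beardtop ~> curl -XPUT "localhost:9200/croscon/employees/3" -d '{
	"name": "Huck Finn",
	"id": 3,
	"specialties": ["php", "python", "devops"],
	"date_of_birth": "1990-10-17",
	"top_specialty": "php"
}'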

Making Search Work for You

We can now search for everyone whose top specialty is php with:
hayden@beardtop ~> curl -XGET localhost:9200/croscon/employees/_search -d '
{
  "query": {
	"term": {
	  "top_specialty": "php"
	}
  }
}'
{
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "croscon",
      "_type" : "employees",
      "_id" : "3",
      "_score" : 1.0,
      "_source":{ "name": "Huck Finn", "id": 3, "specialties": ["php", "python", "devops"], "date_of_birth": "1990-10-17", "top_specialty": "php" }
    }, {
      "_index" : "croscon",
      "_type" : "employees",
      "_id" : "4",
      "_score" : 0.30685282,
      "_source":{ "name": "Pap Finn", "id": 4, "specialties": ["php", "python", "devops", "magic"], "date_of_birth": "1984-07-16", "top_specialty": "php" }
    } ]
  }
}

							
But wait... why the score difference? While the IDF value was the same earlier, as I said, doc count is calculated PER SHARD and not PER INDEX. In this example, Huck lives on a shard that has two items in it, and Pap lives on a shard that has one item in it (just Pap). Since IDF is (the log of) the total number of documents divided by the number of documents containing the term, the score goes up as there are more documents on the shard. In reality, the distinction that document count is per shard and not per index has little meaning: you generally have so many items, and ES does a fine enough job of spreading them out, that this all evens out.

Making Search Work for You

  • We've now made our search more robust and powerful!
  • Need to see if a document has exact contents of an array? Alpha sort it, join them all with commas, and index that! Rebuild this at search time and... PROFIT!!!!
  • If scoring doesn't work for you, make your data work for you.
No really, that's what an ES core contributor told me to do when I had this problem!

Filters

  • Filters are basically queries that don't score anything.
  • They run after queries, and filter the final result set.
  • They are excellent for removing items based on simple boolean tests, or exact matches.
  • Since they don't score, they can be cached, unlike queries.
  • Anytime you don't need to derive a score from a search condition, a filter should be used.
  • Filters are what make ES fast.
  • In general, filters should always be preferred over queries, due to this caching mechanism.
  • The only time you should use queries is when you want to score something.

Filter Caching

  • Many filters, such as term and prefix, are cached by default, others, such as geo and script, are not cached by default because the cost of caching them introduces additional processing overhead. Grouping filters, such as bool and and, themselves are not cached, but their internal filters frequently are.
  • Either way, you can determine which filters are and are not cached for a given query, and even set the key they are cached under.
  • A good deal of your memory will be dedicated to your filter cache and managing your filter cache is the best way to improve ES performance.
  • You can even cache the results of grouping filters, if you really want to!
  • As stated, use filters over queries, whenever possible, due to this very caching reason.
  • As you'll soon see, you can actually turn ANY query into a filter!
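A sketch of flipping that switch yourself: most filters accept `_cache` (and an optional `_cache_key`) inline, ES 1.x syntax, as a fragment you'd drop into the filtered query coming up next:

{
  "filter": {
    "term": {
      "specialties": "php",
      "_cache": true,
      "_cache_key": "specialties_php"
    }
  }
}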

Example Filters

  • Some example filters include: and, bool, missing, and ids.
  • An example:
hayden@beardtop ~> curl -XGET 'localhost:9200/croscon/employees/_search?pretty' -d '
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "term": {
          "specialties": "php"
        }
      }
    }
  }
}'
{
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "croscon",
      "_type" : "employees",
      "_id" : "4",
      "_score" : 1.0,
      "_source":{ "name": "Pap Finn", "id": 4, "specialties": ["php", "python", "devops", "magic"], "date_of_birth": "1984-07-16", "top_specialty": "php" }
    }, {
      "_index" : "croscon",
      "_type" : "employees",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{ "name": "Tom Sawyer", "id": 1, "specialties": ["javascript", "php"], "date_of_birth": "1990-12-10" }
    }, {
      "_index" : "croscon",
      "_type" : "employees",
      "_id" : "3",
      "_score" : 1.0,
      "_source":{ "name": "Huck Finn", "id": 3, "specialties": ["php", "python", "devops"], "date_of_birth": "1990-10-17", "top_specialty": "php" }
    } ]
  }
}
								
  • Example filters:
    • and - a filter that filters documents by joining multiple other filters together with a boolean `AND`.
    • bool - similar in concept to the bool query, this filter filters out documents based on boolean combinations of other filters. must == and; should == or; must_not == negation.
    • missing - a filter that filters documents based on the absence of a field. If the field is missing, the document matches. Its counterpart is exists.
    • ids - a filter that lets you specify an id and a type, matching any document in that type with that id.
  • To use filters, you must use the filtered query, which will let you specify a query to run, and then a filter to filter those results through.
  • Within the `filtered` query, you specify a single `query` and a single `filter`. Much like with queries (as evinced earlier with the bool query), if you need to use multiple filters, you use one of the grouping filters, such as and or bool, which let you decide which boolean operator joins each filter.
  • Since filters are more performant and should be used over queries whenever possible, the `filtered` query is most likely the query you will use the most.

The Query Filter

  • A special filter that lets you turn any query into a filter, by running said query and merely discarding the scoring results.
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "query": {
          "match": {
            "name": "Finn"
          }
        }
      }
    }
  }
}
						
  • Queries that are turned into filters lose the following features: scoring, and highlighting. Keep that in mind.
  • Query filters are NOT cached by default, but this can be changed.
  • In our example, we are using match as a filter, even though match is not really a filter: wrapping it in the query filter lets a query stand in anywhere a filter is expected, with its score simply discarded.
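To opt a query filter into the cache, wrap the query in the fquery filter and set `_cache` (ES 1.x syntax; a sketch):

{
  "filter": {
    "fquery": {
      "query": {
        "match": {
          "name": "Finn"
        }
      },
      "_cache": true
    }
  }
}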

A Complex Filter

hayden@beardtop ~> curl -XGET 'localhost:9200/croscon/employees/_search?pretty' -d '
{
 "query": {
   "filtered": {
     "filter": {
       "bool": {
         "must": {
           "query": {
             "bool": {
               "should": [
                 {
                   "match": {
                     "specialties": "css"
                   }
                 },
                  {
                      "match": {
                        "specialties": "php"
                      }
                  },
                  {
                     "match": {
                        "specialties": "javascript"
                      }
                  },
                  {
                    "match": {
                      "specialties": "python"
                    }
                  }
                ],
                "minimum_should_match": 2
              }
            }
          },
          "should": [
            {
              "exists": {
                "field": "date_of_birth"
              }
            },
            {
              "missing": {
                "field": "field_that_dne"
              }
            }
          ],
          "must_not": {
            "ids": {
              "type": "employees",
              "values": [1]
            }
          }
        }
      }
    }
  }
}'
{
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "croscon",
      "_type" : "employees",
      "_id" : "4",
      "_score" : 1.0,
      "_source":{ "name": "Pap Finn", "id": 4, "specialties": ["php", "python", "devops", "magic"], "date_of_birth": "1984-07-16", "top_specialty": "php" }
    }, {
      "_index" : "croscon",
      "_type" : "employees",
      "_id" : "3",
      "_score" : 1.0,
      "_source":{ "name": "HuckFinn", "id": 3, "specialties": ["php", "python", "devops"], "date_of_birth": "1990-10-17", "top_specialty": "php" }
    } ]
  }
}
						
  • Let's find everyone who has at least two specialties in: php, python, javascript, and css, who has the date_of_birth field, but not the `field_that_dne` field, and let's omit the employee with the id 1, if they happen to match.
  • If you omit the query, like we did here, it is automatically assumed to be a `match_all`.
  • `should`s are great for scoring when used in queries, but their role changes when used in filters: you lose the ability to do `minimum_should_match`, and they become one big OR condition. So, here we've made a `must` filter that is a `query` filter that then uses the bool query, which CAN use `minimum_should_match`.
  • After running this, we only get Pap and Huck Finn back. Ben is omitted because he only has one matching specialty, and Tom Sawyer is gone because he has the ID of 1.
  • While this is very complicated, its result can actually be cached, and this can be pretty performant.
  • Notice that everything has a score of 1.0.

A Complex Search

hayden@beardtop ~> curl -XGET localhost:9200/inventory/products,services/_search
{
    "sort": [{
        # list services over products
        "product_id": {
            "order": "desc",
            "missing": 9223372036854775806
        }
    },
    # sort by natural score and then name
    "_score",
     {
        "name.raw": {
            "order": "asc"
        }
    }],
    "query": {
        "filtered": {
            "query": {
                "function_score": {
                    "query": {
                        "function_score": {
                            "score_mode": "max",
                            "boost_mode": "replace",
                            "query": {
                                "bool": {
                                    # create an OR on products for a vendor and unfrozen services
                                    "should": [{
                                        "bool": {
                                            "must": [{
                                                "term": {
                                                    # only return products for this vendor
                                                    "vendor_id": {
                                                        "value": 160,
                                                        "boost": 1
                                                    }
                                                }
                                            }, {
                                                "term": {
                                                    "_type": {
                                                        # services don't have this field
                                                        "value": "products",
                                                        "boost": 1
                                                    }
                                                }
                                            }],
                                            "must_not": [{
                                                "term": {
                                                    # must be in stock
                                                    "out_of_stock": {
                                                        "value": true,
                                                        "boost": 1
                                                    }
                                                }
                                            }]
                                        }
                                    }, {
                                        # only return unfrozen services
                                        "bool": {
                                            "must": [{
                                                "term": {
                                                    "_type": {
                                                        "value": "services",
                                                        "boost": 1
                                                    }
                                                }
                                            }],
                                            # should not have frozen as true
                                            "must_not": [{
                                                "term": {
                                                    "frozen": {
                                                        "value": true,
                                                        "boost": 1
                                                    }
                                                }
                                            }]
                                        }
                                    }],
                                    "minimum_number_should_match": 1
                                }
                            },
                            # weight by category
                            "functions": [{
                                "weight": 13,
                                "filter": {
                                    "term": {
                                        "category_id": "18"
                                    }
                                }
                            },
                            {
                                "weight": 12,
                                "filter": {
                                    "term": {
                                        "category_id": "17"
                                    }
                                }
                            }]
                        }
                    },
                    # take the first score from a matching function, and
                    # then multiply that with the score from the query
                    "score_mode": "first",
                    "boost_mode": "multiply",
                    "functions": [{
                        # if your stock is 0, we drop your score to 0
                        "weight": 0,
                        "filter": {
                            "term": {
                                "stock": 0
                            }
                        }
                    }, {