Elasticsearch Workshop – 2016 – How does a search engine work?





elasticsearch-workshop-slides

Slides for Elasticsearch workshop

On GitHub: BouvetNord / elasticsearch-workshop-slides

Elasticsearch Workshop

2016

  • Welcome, today we are holding an Elasticsearch workshop.

Agenda

  • How does a search engine work
  • Elasticsearch
    • What is
    • Use cases
    • Own experience
    • How to get started
    • Mapping
    • Queries and filters
    • Aggregations
    • Highlighting
  • Workshop
  • Today's agenda is as follows...

How does a search engine work?

  • How does a search engine work?
Your document collection is big! Scan through all the documents every time you search for something?
  • This would take forever
Pre-process the documents and create an index!
  • To make your searches fast and efficient, a search engine pre-processes the documents and builds an index

Create an inverted index

  • You then build something called an "inverted index"
  • On the left side we have three documents...
  • Since this is a BigOne conference, much of today's content will be pizza related...
  • What happens is that an inverted index is built from these documents (the documents are indexed, as it is called)
  • An inverted index (which we now see on the right side) contains all the words found in the documents, and for each word it lists which documents contain that word...
  • So the word "pizza" is found in documents 0 and 2
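
The idea above can be sketched in a few lines of Python. This is only an illustration: the three sample documents are hypothetical stand-ins (the ones on the slide are not reproduced here), chosen so that "pizza" ends up in documents 0 and 2, and the word splitting is deliberately naive.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercased word to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical sample documents, not the ones from the slide.
documents = [
    "Turtles loves pizza",
    "I like my turtles",
    "Pizza and pasta are great",
]

index = build_inverted_index(documents)
print(index["pizza"])    # [0, 2]
print(index["turtles"])  # [0, 1]
```

Looking up a word is now a single dictionary access instead of a scan through every document.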

Find unique terms

  • So how do you find unique terms?
    • If you take the document "Turtles loves pizza", for example, it goes through several steps...
      • The document is split into words
      • All letters are lowercased
      • Words are reduced to their stems, e.g. "Loves" becomes "love"
    • This is a simplified example...
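
The three steps could look roughly like this in Python. The `naive_stem` helper is an assumption made for illustration (it only strips a trailing "s"); real analyzers use proper stemming algorithms, and whether stemming happens at all depends on the analyzer that is configured.

```python
import re

def naive_stem(token):
    """Toy stemmer (illustrative assumption): strip a single trailing 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def analyze(text):
    """Split into words, lowercase them, then reduce each word to a rough stem."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)    # step 1: split into words
    lowered = [token.lower() for token in tokens]  # step 2: lowercase
    return [naive_stem(token) for token in lowered]  # step 3: stem

print(analyze("Turtles loves pizza"))  # ['turtle', 'love', 'pizza']
```

The same function has to be applied to the search text, so that a search for "Loves" and the indexed term "love" meet in the middle.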

Search against the inverted index

Sort by relevance

How well each document matches the query

By default, Elasticsearch sorts matching results by their relevance score, that is, by how well each document matches the query.
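
A toy version of searching and ranking, to make the idea concrete. The score used here (the number of query terms a document contains) is a simplifying assumption and not the relevance formula Elasticsearch actually uses; the small index is hypothetical as well.

```python
from collections import defaultdict

def search(index, query):
    """Return (doc_id, score) pairs, best match first.

    The score is the number of query terms found in the document,
    a stand-in for a real relevance formula.
    """
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, []):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical inverted index: term -> ids of the documents containing it.
index = {
    "turtles": [0, 1],
    "loves": [0],
    "pizza": [0, 2],
}

print(search(index, "turtles pizza"))  # [(0, 2), (1, 1), (2, 1)]
```

Document 0 contains both terms and therefore comes first.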

What is Elasticsearch?

  • I will now spend about one minute explaining what Elasticsearch is.
    • Lucene
      • Cumbersome to use directly
      • Provides few features for scaling past a single machine
    • Real time
      • Indexing documents is fast
      • Data is available for search almost immediately after indexing

Use Cases

What can Elasticsearch be used for?

For Big Data

GitHub uses Elasticsearch to search 20 TB of data, including 1.3 billion files and 130 billion lines of code

Relational databases:
  • This works well with smaller data sets, but is not very scalable
  • When the volume goes up, performance goes down (write operations)

Text search

With filtering, aggregations, highlighting, pagination...

Pure Analytics

Count things and summarize your data, lots of data, often timestamped!

Centralized Logging

Logs > Logstash > Elasticsearch > Kibana

Geolocation

Own experience with Elasticsearch

  • The reason I wanted to learn about search engines was...
  • That I have been collecting food recipes for a long time...
  • And over time it has grown into a great many recipes, almost 400 pages...
  • Which makes it almost impossible to find what you are looking for...
  • For example, I want to find all main courses that contain chicken...

Alt Mulig Mat

  • Text based searching
  • Structured searching (get all "Dessert" recipes)

How to use Elasticsearch?

Commonly used in addition to another database...

How to get started with Elasticsearch?

  • So how can you get started with Elasticsearch...

It is that easy

  • Download Elasticsearch from www.elastic.co
  • Elasticsearch only requires Java to run
wget https://download.elasticsearch.org/elasticsearch/release/...
tar -zxvf elasticsearch-2.2.0.tar.gz
cd elasticsearch-2.2.0/bin
./elasticsearch

Zero configuration

  • Elasticsearch just works
    • No configuration is needed
    • It has sensible default settings
It is easy to get started with Elasticsearch!

Is Elasticsearch alive?

You can access it at http://localhost:9200 in your web browser, which returns something like this:

{
   "status":200,
   "name":"Cypher",
   "cluster_name":"elasticsearch",
   "version":{
      "number":"1.5.2",
      "build_hash":"62ff9868b4c8a0c45860bebb259e21980778ab1c",
      "build_timestamp":"2015-04-27T09:21:06Z",
      "build_snapshot":false,
      "lucene_version":"4.10.4"
   },
   "tagline":"You Know, for Search"
}

REST API

  • Elasticsearch hides the complexities of Lucene behind a REST API
    • POST (create)
    • GET (read)
    • PUT (update)
    • DELETE (delete)

CURL works just fine!

  • An index is like a database
  • A type is like a SQL table

What is stored in Elasticsearch?

JSON documents!

{
   "title": "Elasticsearch Workshop",
   "date": "2016-04-08"
}

Let's do an example - A book website

  • We are building a website to find books
  • We have a collection of books
  • We want simple text based searching

How to store the books?

The act of storing data in Elasticsearch is called indexing.

$ curl -X POST localhost:9200/books/computer/1 --data '{
    "name": "The Pragmatic Programmer",
    "category": "Programming",
    "price": 29.90
}'

$ curl -X POST localhost:9200/books/computer/2 --data '{
    "name": "Clean Code",
    "category": "Programming",
    "price": 14.90
}'

$ curl -X POST localhost:9200/books/computer/3 --data '{
    "name": "Working Effectively with Legacy Code",
    "category": "Refactoring",
    "price": 45.50
}'

Indexing is much like the INSERT keyword in SQL, except that if the document already exists, the new document replaces the old one. The path after the host names the index the request is performed on (an index could be compared to an SQL database, though I don't like this comparison), the type of the document (a type could be compared to an SQL table, though I don't like this comparison either), and the document ID.
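
The same indexing request can be built from Python's standard library. This is a minimal sketch: `build_index_request` is a hypothetical helper name, the host is assumed to be a local node on port 9200, and the request is only constructed here, not sent.

```python
import json
import urllib.request

def build_index_request(index, doc_type, doc_id, document,
                        host="http://localhost:9200"):
    """Build (but do not send) the HTTP request that indexes one document."""
    url = f"{host}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )

request = build_index_request(
    "books", "computer", 1,
    {"name": "The Pragmatic Programmer", "category": "Programming", "price": 29.90},
)
print(request.get_method(), request.full_url)

# To actually send it, a node must be running (assumption):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

The URL path mirrors the curl examples: index, then type, then document ID.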

Get

$ curl -X GET localhost:9200/books/computer/1

Result:

{
   "_index": "books",
   "_type": "computer",
   "_id": "1",
   "_version": 2,
   "found": true,
   "_source": {
      "name": "The Pragmatic Programmer",
      "category": "Programming",
      "price": 29.9
   }
}

Update

$ curl -X PUT localhost:9200/books/computer/1 --data '{
   "name":"The Awesome Programmer"
}'

Result:

{
   "_index":"books",
   "_type":"computer",
   "_id":"1",
   "_version":2,
   "created":false
}

Delete

$ curl -X DELETE localhost:9200/books/computer/1

So far

  • All we have is a NoSQL document store which is
    • Fast
    • Scalable
    • Easy to use
  • Now to the really cool part, full-text search...

Full-text search

Find all books that contain the word "code"

$ curl -X GET 'localhost:9200/books/computer/_search?q=code'

Full-text search - Result

Sorted by relevance!

{
   "took":6,
   "timed_out":false,
   "_shards":{
      "total":5,
      "successful":5,
      "failed":0
   },
   "hits":{
      "total":2,
      "max_score":0.15342641,
      "hits":[
         {
            "_index":"books",
            "_type":"computer",
            "_id":"2",
            "_score":0.15342641,
            "_source":{
               "name":"Clean Code",
               "category":"Programming",
               "price":14.9
            }
         },
         {
            "_index":"books",
            "_type":"computer",
            "_id":"3",
            "_score":0.11506981,
            "_source":{
               "name":"Working Effectively with Legacy Code",
               "category":"Refactoring",
               "price":45.5
            }
         }
      ]
   }
}

Mapping

What is mapping?

Mapping is used to define how a document, and the fields it contains, are stored and indexed.

This is similar to a database schema.

Mapping example

Define the data types of the document fields

{
  "mappings": {
    "computer": {
      "properties": {
        "name": {
          "type": "string"
        },
        "category": {
          "type": "string",
          "index": "not_analyzed"
        },
        "price": {
          "type": "float"
        }
      }
    } 
  }
}
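
Why does "category" get "not_analyzed"? A simplified model in Python: an analyzed string is broken into lowercased terms before it goes into the inverted index, while a not_analyzed string is indexed as one exact term. The `index_terms` helper is hypothetical and only approximates what a real analyzer does.

```python
import re

def index_terms(value, analyzed=True):
    """Return the terms that would end up in the inverted index (simplified)."""
    if not analyzed:
        # Stored as a single exact term, case preserved.
        return [value]
    # Split into words and lowercase each of them.
    return [token.lower() for token in re.findall(r"\w+", value)]

print(index_terms("Clean Code"))                   # ['clean', 'code']
print(index_terms("Programming", analyzed=False))  # ['Programming']
```

Full-text fields such as "name" benefit from analysis, while fields used for exact matching and grouping, such as "category", are usually left untouched.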

Queries and Filters

Query DSL

  • Alternative way of building queries
  • Allows us to build queries using JSON

Query

Find the books with a name that contains the word "code"

$ curl -XGET 'localhost:9200/books/computer/_search' -d '{
   "query": {
      "match": {
         "name": "code"
      }
   }
}'

Filtering

Find books belonging to the "Programming" category

$ curl -XGET 'localhost:9200/books/computer/_search' -d '{
    "query": {
        "filtered": {
            "filter": {
                "term": { "category": "Programming" }
            }
        }
    }
}'

Query vs Filter

Query                Filter
Full text search     Exact match
Relevance scoring    Binary yes/no
Relatively slow      Fast
Not cacheable        Cacheable

Choosing between query and filter

  • Use queries for full-text search, or for cases where you want a relevance score
  • Use filters for everything else

Aggregations

  • Used to perform analysis on the data
  • Broken into 3 "families"
    • Metric
    • Bucket
    • Pipeline

Metric    Bucket
Min       Range
Max       Terms
Sum       Histogram
Avg
Stats

Buckets

Range
...
"aggs" : {
  "price_ranges" : {
    "range" : {
      "field" : "price",
      "ranges" : [
        { "to" : 10 },
        { "from" : 10, "to" : 30 },
        { "from" : 30 }
      ]
    }
  }
}
...
...
"buckets": {
  "*-10.0": {
    "to": 10,
    "doc_count": 0
  },
  "10.0-30.0": {
    "from": 10,
    "to": 30,
    "doc_count": 2
  },
  "30.0-*": {
    "from": 30,
    "doc_count": 1
  }
}
...
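
What the range aggregation computes can be mimicked in plain Python. A minimal sketch, using the three book prices from the earlier example; `range_buckets` is a hypothetical helper, and it follows the convention that "from" is inclusive and "to" is exclusive.

```python
def range_buckets(values, ranges):
    """Count how many values fall into each range ('from' inclusive, 'to' exclusive)."""
    buckets = {}
    for spec in ranges:
        lower = spec.get("from")
        upper = spec.get("to")
        lower_label = "*" if lower is None else str(float(lower))
        upper_label = "*" if upper is None else str(float(upper))
        count = sum(
            1 for value in values
            if (lower is None or value >= lower) and (upper is None or value < upper)
        )
        buckets[f"{lower_label}-{upper_label}"] = count
    return buckets

prices = [29.90, 14.90, 45.50]
ranges = [{"to": 10}, {"from": 10, "to": 30}, {"from": 30}]
print(range_buckets(prices, ranges))
# {'*-10.0': 0, '10.0-30.0': 2, '30.0-*': 1}
```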

Buckets

Histogram
...
"aggs" : {
  "prices" : {
    "histogram" : {
      "field" : "price",
      "interval" : 15
    }
  }
}
...
...
"aggregations": {
  "prices" : {
    "buckets": [
      {
        "key": 0,
        "doc_count": 1
      },
      {
        "key": 15,
        "doc_count": 1
      },
      {
        "key": 30,
        "doc_count": 0
      },
      {
        "key": 45,
        "doc_count": 1
      }
    ]
  }
}
...
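
The bucket keys come from rounding each value down to the nearest multiple of the interval. A small Python sketch of that rule, assuming an integer interval and that empty buckets between the first and last key are reported with a count of 0, as in the response shown:

```python
import math

def histogram_buckets(values, interval):
    """Group values into fixed-width buckets keyed by their lower bound.

    Assumes an integer interval; empty buckets between the first and
    last key are included with a count of 0.
    """
    counts = {}
    for value in values:
        key = math.floor(value / interval) * interval
        counts[key] = counts.get(key, 0) + 1
    first, last = min(counts), max(counts)
    return [(key, counts.get(key, 0)) for key in range(first, last + interval, interval)]

prices = [29.90, 14.90, 45.50]
print(histogram_buckets(prices, 15))
# [(0, 1), (15, 1), (30, 0), (45, 1)]
```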

Buckets

Terms
...
"aggs" : {
  "categories" : {
    "terms" : {
      "field" : "category"
    }
  }
}
...
...
"buckets": [
  {
    "key": "programming",
    "doc_count": 2
  },
  {
    "key": "refactoring",
    "doc_count": 1
  }
]
...
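
A terms aggregation is essentially a "count documents per term". A rough Python equivalent is sketched below; the input is assumed to be the indexed terms of the "category" field for each of the three books, lowercased as they appear in the response, and `terms_buckets` is a hypothetical helper.

```python
from collections import Counter

def terms_buckets(terms_per_document):
    """Count documents per term, most frequent term first."""
    counts = Counter(
        term
        for terms in terms_per_document
        for term in set(terms)  # count each document at most once per term
    )
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]

# Assumed indexed terms of the "category" field, one list per book.
indexed_categories = [["programming"], ["programming"], ["refactoring"]]
print(terms_buckets(indexed_categories))
# [{'key': 'programming', 'doc_count': 2}, {'key': 'refactoring', 'doc_count': 1}]
```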

Metrics

Min
...
"aggs" : {
  "min_price" : {
    "min" : {
     "field" : "price"
    }
  }
}
...
...
"aggregations": {
  "min_price": {
    "value": 14.9
  }
}
...

Metrics

Avg
...
"aggs" : {
  "avg_price" : {
    "avg" : {
      "field" : "price"
    }
  }
}
...
...
"aggregations": {
  "avg_price": {
    "value": 30.099999999999998
  }
}
...

Metrics

Stats
...
"aggs" : {
  "price_stats" : {
    "stats" : {
      "field" : "price"
    }
  }
}
...
...
"aggregations": {
  "price_stats": {
    "count": 3,
    "min": 14.9,
    "max": 45.5,
    "avg": 30.099999999999998,
    "sum": 90.3
  }
}
...
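
The metric aggregations boil down to ordinary arithmetic over the field values. A minimal Python sketch over the three book prices; the exact digits of floating-point results such as the average may differ slightly from what Elasticsearch prints.

```python
def stats(values):
    """Compute count, min, max, avg and sum in one pass over the values."""
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }

prices = [29.90, 14.90, 45.50]
price_stats = stats(prices)
print(price_stats)
```

The "min" and "avg" aggregations shown earlier correspond to single entries of this dictionary.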

Highlighting

{
  "query": {
    "match": {
      "name": "legacy code"
    }
  },
  "highlight": {
    "fields": {
      "name": {}
    }
  }
}
...
"highlight": {
  "name": [
    "Working Effectively with <em>Legacy</em> <em>Code</em>"
  ]
}
...
"highlight": {
  "name": [
  "Clean <em>Code</em>"
  ]
}
...
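
Conceptually, highlighting wraps every matched query term in a tag. A simplified Python sketch is shown below; the real highlighter works on analyzed terms, while this hypothetical `highlight` helper just matches whole words case-insensitively.

```python
import re

def highlight(text, query, pre="<em>", post="</em>"):
    """Wrap each whole-word occurrence of a query term in pre/post tags."""
    terms = [re.escape(term) for term in query.lower().split()]
    if not terms:
        return text
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda match: pre + match.group(0) + post, text)

print(highlight("Working Effectively with Legacy Code", "legacy code"))
# Working Effectively with <em>Legacy</em> <em>Code</em>
print(highlight("Clean Code", "legacy code"))
# Clean <em>Code</em>
```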

Workshop

What will you learn?

19 tasks - learning Query DSL

  • Intro task - match all (task 0)
  • Full-text search (tasks 1-4)
  • Filtering (tasks 5-8)
  • Aggregations (tasks 9-13)
  • Combine full-text search and aggregations (task 14)
  • Sorting (task 15)
  • Highlighting (task 16)
  • Pagination (tasks 17-18)

List of pizzas

The data used during the workshop is a list of pizzas with an accompanying mapping

Tasks look like this

Feature: Topic of the task

 // Use https://www.elastic.co/guide/en/...

 Scenario: Description of the task
  Given all pizzas are indexed
  When I make a query
  """
  { todo }
  """
  Then the response should contain
  """
  { subset }
  """

  • Your task is to replace the `{ todo }` with the correct query
  • A task passes when the query returns a response that contains the expected { subset }

Compare against subsets

Total

{
   "workshop": "Elasticsearch",
   "date" : "2016-04-08"
}
							

Subset

{
   "date" : "2016-04-08"
}
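
One way such a subset comparison could work is sketched below in Python. This is an assumption made for illustration, not the workshop's actual implementation: every key in the expected subset must be present in the full response with a matching value, checked recursively.

```python
def is_subset(subset, total):
    """Return True if every key/value in subset also appears in total (recursively)."""
    if isinstance(subset, dict):
        return isinstance(total, dict) and all(
            key in total and is_subset(value, total[key])
            for key, value in subset.items()
        )
    if isinstance(subset, list):
        # Every expected item must match at least one item of the actual list.
        return isinstance(total, list) and all(
            any(is_subset(item, candidate) for candidate in total)
            for item in subset
        )
    return subset == total

total = {"workshop": "Elasticsearch", "date": "2016-04-08"}
subset = {"date": "2016-04-08"}
print(is_subset(subset, total))  # True
```

Extra fields in the response, such as "took" or "_shards", therefore do not make a task fail.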
							

Running

  • Manual installation
    • Windows: `run-tasks.cmd`
    • Linux: `./run-tasks.sh`
  • Docker
    • make run-tasks