Introduction to Lucene – Glossary – Examples







On Github Xaerxess / lumesse-lucene-101

Introduction to Lucene

  • What's Lucene?
  • Lucene vs Solr vs Elasticsearch - history
  • When to use Lucene?
  • Glossary
  • Implementation overview
  • Examples
  • Summary

So what's this "Lucene"?

Apache Lucene is a free open source information retrieval software library, originally written in Java (...).

At the core of Lucene's logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene's API to be independent of the file format.

Lucene vs Solr vs Elasticsearch (?)

Then

  • 1999 - Lucene created
  • 2001 - Lucene joined ASF (Jakarta)
  • 2004 - Solr created; Compass created
  • 2005 - Lucene became a top-level Apache project
  • 2010 - Lucene + Solr merged; first versions of ES released
  • 2012 - Solr 4.0.0 with SolrCloud

Now

Solr vs. Elasticsearch — How to Decide? by Otis Gospodnetić

No mention of Lucene in the article

When to use Lucene?

  • You are a search engineer AND
  • You are a programmer AND
  • You want full control over almost all the internals of Lucene AND
  • Your requirements call for all sorts of geeky customization of Lucene AND
  • You are willing to take care of infrastructure elements of your search like scaling, distribution, etc.

When to use Solr?

  • At least one of the above didn't make sense. OR
  • You want something that is ready to use out-of-the-box (even without knowledge of Java) OR
  • Your infrastructure requirements outweigh search customization requirements.

Glossary

  • Document: the basic object in Lucene, conceptually a set of fields stored in the index
  • Field: stores a piece of information (content)
  • Term: represents a word from the text and is the unit of search. It is composed of two elements: the text of the word, as a string, and the name of the field in which that text occurred.
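The three definitions above can be modelled in a few lines of plain Java. This is only a rough sketch: ToyTerm and ToyDocument are hypothetical names, not Lucene's own Document, Field and Term classes, and the word splitting is a crude stand-in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;

// Toy model only (hypothetical classes): a term is a (field name, word) pair.
final class ToyTerm {
    final String field; // name of the field the word occurred in
    final String text;  // the word itself

    ToyTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ToyTerm)) return false;
        ToyTerm other = (ToyTerm) o;
        return field.equals(other.field) && text.equals(other.text);
    }

    @Override
    public int hashCode() {
        return Objects.hash(field, text);
    }

    @Override
    public String toString() {
        return field + ":" + text;
    }
}

// Toy model only: a document is a set of named fields holding text.
final class ToyDocument {
    final Map<String, String> fields = new LinkedHashMap<>();

    ToyDocument add(String name, String content) {
        fields.put(name, content);
        return this;
    }

    // Splits every field into lower-cased words and pairs each word with its field name.
    List<ToyTerm> terms() {
        List<ToyTerm> result = new ArrayList<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            String lower = e.getValue().toLowerCase(Locale.ROOT);
            for (String word : lower.split("[^\\p{L}\\p{N}]+")) {
                if (!word.isEmpty()) result.add(new ToyTerm(e.getKey(), word));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument()
                .add("book_title", "Pan Tadeusz")
                .add("text", "Litwo, ojczyzno moja");
        System.out.println(doc.terms());
    }
}
```

Note that the same word in two different fields yields two different terms, which is why a search always names the field it targets.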

Glossary - cont.

  • Index: a database that stores documents and fields in a Lucene-specific format (extras: index reader / writer, segments, inverted index, term vectors)
  • Indexed field: a field that can be searched
  • Stored field: a field whose original content is stored as-is in the index

Links:
  • Lucene index internals: http://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal
  • What's in a segment? http://image.slidesharecdn.com/howdoeslucenestoreyourdata-130610085548-phpapp02/95/berlin-buzzwords-2013-how-does-lucene-store-your-data-7-638.jpg?cb=1370855248
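The difference between an indexed and a stored field can be illustrated with a tiny in-memory inverted index. This is a simplified sketch under stated assumptions: ToyIndex and its methods are hypothetical, and Lucene's real segment format is far more elaborate.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy in-memory index (hypothetical class, not Lucene's API).
final class ToyIndex {
    // Inverted index: "field:word" -> ids of the documents that contain it.
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    // Stored fields: the position in the list is the document id.
    private final List<Map<String, String>> storedFields = new ArrayList<>();

    // Registers an empty document and returns its id.
    int newDocument() {
        storedFields.add(new HashMap<String, String>());
        return storedFields.size() - 1;
    }

    void addField(int docId, String field, String content, boolean indexed, boolean stored) {
        if (indexed) {
            // Indexed: every word becomes searchable through the inverted index.
            String lower = content.toLowerCase(Locale.ROOT);
            for (String word : lower.split("[^\\p{L}\\p{N}]+")) {
                if (word.isEmpty()) continue;
                String key = field + ":" + word;
                SortedSet<Integer> ids = postings.get(key);
                if (ids == null) {
                    ids = new TreeSet<>();
                    postings.put(key, ids);
                }
                ids.add(docId);
            }
        }
        if (stored) {
            // Stored: the original content is kept as-is and can be read back.
            storedFields.get(docId).put(field, content);
        }
    }

    // Ids of documents whose indexed field contains the given word.
    SortedSet<Integer> search(String field, String word) {
        SortedSet<Integer> ids = postings.get(field + ":" + word.toLowerCase(Locale.ROOT));
        return ids == null ? new TreeSet<Integer>() : ids;
    }

    // Original content, or null if the field was not stored.
    String storedValue(int docId, String field) {
        return storedFields.get(docId).get(field);
    }
}
```

A field that is indexed but not stored can be found by a search, yet its original text cannot be read back from the index; a field that is stored but not indexed is the opposite.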

Glossary - cont.

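Query / search query: used for retrieving matching documents from the index, e.g. as a query string (two equivalent notations):
'book_title:Tadeusz^2 AND text:ojczyzna'
'+book_title:Tadeusz^2 +text:ojczyzna'
or built through the API (as written, this targets the pre-6.0 API with Query.setBoost and the public BooleanQuery constructor):
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
// The title clause gets a boost of 2, matching '^2' in the query string
TermQuery titleQuery = new TermQuery(new Term("book_title", "Tadeusz"));
titleQuery.setBoost(2);
TermQuery textQuery = new TermQuery(new Term("text", "ojczyzna"));
// Both clauses are required: MUST corresponds to '+' / AND
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(titleQuery, BooleanClause.Occur.MUST);
booleanQuery.add(textQuery, BooleanClause.Occur.MUST);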
                
Extras: IndexSearcher.search, collectors, Weight, QueryParser
  • QueryParser syntax: https://tipsthoughtsnotes.wordpress.com/2014/09/01/what-is-the-query-parser-syntax-in-apache-lucene/
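Conceptually, a query with two MUST clauses keeps only the documents that match both terms. A minimal sketch of that idea, assuming each term already has a sorted list of matching document ids (ToyConjunction is a hypothetical helper; Lucene's actual conjunction logic is more involved):

```java
import java.util.ArrayList;
import java.util.List;

// Toy helper (hypothetical, not part of Lucene).
final class ToyConjunction {
    // Returns the ids present in both lists; both inputs must be sorted
    // in ascending order and free of duplicates.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out.add(a[i]); // document matches both clauses
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;           // advance the list that is behind
            } else {
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] titleMatches = {1, 3, 5, 8}; // made-up ids matching book_title:Tadeusz
        int[] textMatches = {3, 4, 8, 9};  // made-up ids matching text:ojczyzna
        System.out.println(intersect(titleMatches, textMatches));
    }
}
```

Because both lists are walked once from left to right, the cost grows with the combined length of the two lists rather than with the size of the whole index.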

Glossary - cont.

  • Similarity: determines scores for documents matched by a query (extras: scoring formula)
  • Scoring / Score: a set of parameters / formulas that determine how "accurate" a document's match is (per query)
  • Boost: a way of changing the score of a field (it is applied to the computed score)
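To make the relation between score and boost concrete, here is a deliberately simplified TF-IDF-style sketch. ToyScorer and its formula are assumptions made for illustration; a real Similarity implementation in Lucene takes more factors into account.

```java
// Toy scorer (hypothetical, not Lucene's Similarity).
final class ToyScorer {
    // Score contribution of one query term for one document:
    // more occurrences in the document and a rarer term both raise the score,
    // and the boost multiplies the computed value.
    static double termScore(int freqInDoc, int docsWithTerm, int totalDocs, double boost) {
        if (freqInDoc == 0) {
            return 0.0; // the document does not contain the term
        }
        double tf = Math.sqrt(freqInDoc);
        double idf = 1.0 + Math.log((double) totalDocs / (docsWithTerm + 1));
        return boost * tf * idf;
    }

    public static void main(String[] args) {
        // Made-up statistics: the same term scored without and with a boost of 2.
        System.out.println(termScore(4, 9, 10, 1.0));
        System.out.println(termScore(4, 9, 10, 2.0));
    }
}
```

In this model a boost of 2 simply doubles the term's contribution, which mirrors the glossary remark that the boost is applied to the computed score.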

Implementation overview

Examples

  • Demo
  • Comparison with our spikes - Eclipse

Summary

  • When you need a scalable text search engine that you can set up easily - don't use Lucene
  • When you're not that interested in low-level API optimization - don't use Lucene
  • When you expect some features to be included out of the box - don't use Lucene

Summary - inverted

  • When you need an embedded text search engine that you can scale to your needs yourself - use Lucene
  • When you'll be working with the low-level API - use Lucene
  • When you want to tune queries, scores, index format, offsets, highlights, etc. to your specific needs - use Lucene (and have a long adventure with Lucene)

THE END

Questions?