textsearch102



On Github georgms / textsearch102

Text Search 102

Georg M. Sorst, CTO FINDOLOGIC

Agenda

I

  • Motivation
  • Boolean retrieval
  • Term-document matrix
  • Inverted index
  • Querying the index
  • Final projects
  • Solr Workshop

II

  • Tokenization
  • Precision & Recall
  • Ranked Retrieval
  • Vocabulary Structure
  • Wildcard Queries
  • Spelling Correction
  • Phonetic Correction
  • Field Weights
  • Project status
  • Solr Q&A

III

  • Scaling
  • Vector Space Model
  • Missing Topics
  • Feedback
  • Final projects

I

Motivation

Search is everywhere

  • Internet
  • eCommerce
  • CMS, Blogs
  • Intranet
  • Academia
  • ...

Boolean retrieval

Find all documents containing the terms and satisfying the conditions:

fh AND salzburg
mmt OR mma
(information AND retrieval) OR search

Nomenclature

Documents

Books, chapters, pages, web pages, news posts...

Document collection, Corpus

All the documents

Terms

Like words, but maybe "FH Salzburg" and "A1" as well

Will it match?

(information AND retrieval) OR search
  • #1: a book about information retrieval
  • Match
  • #2: a book about the search for information
  • Match
  • #3: a book about information
  • No Match
Audience question

Nomenclature

Information need

What the user is looking for

"learn about information retrieval and search"

Query

How the user talks to the computer

(information AND retrieval) OR search

How to search?

Grep Intersect, Union

How to search?

  • #1: a book about information retrieval
  • #2: a book about the search for information
  • #3: a book about information
(information AND retrieval) OR search

Complexity?

Ο(num terms in corpus)

Audience question

Term-document matrix

  • Aka incidence matrix
  • Rows = Terms, Columns = Docs
  • Cell = 1 if Term in Doc, else 0
  • Needs to be pre-built
  • #1: a book about information retrieval
  • #2: a book about the search for information
  • #3: a book about information
Term         #1  #2  #3
Book          1   1   1
Information   1   1   1
Retrieval     1   0   0
Search        0   1   0

* Audience question
  • Document vector
    • #1: (1, 1, 1, 0)
  • Term vector
    • Retrieval: (1, 0, 0)

How to query

  • (information AND retrieval) OR search
  • = (111 ∧ 100) ∨ 010
  • = 100 ∨ 010
  • = 110
  • = #1 and #2
Audience question
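The Boolean query above can be sketched in a few lines of Python, using sets of matching document IDs in place of raw bit vectors (same result, simpler code):

```python
docs = {
    1: "a book about information retrieval",
    2: "a book about the search for information",
    3: "a book about information",
}

# Build the term-document matrix: one set of doc IDs per term.
matrix = {}
for doc_id, text in docs.items():
    for term in text.split():
        matrix.setdefault(term, set()).add(doc_id)

# (information AND retrieval) OR search
result = (matrix["information"] & matrix["retrieval"]) | matrix["search"]
print(sorted(result))  # [1, 2]
```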

Size

  • num(distinct terms)*num(docs)
  • Very sparse (lots of 0s)

Inverted index

  • Store only 1s = Term occurrences
  • Terms -> Docs
  • #1: a book about information retrieval
  • #2: a book about the search for information
  • #3: a book about information
Term         Doc IDs
Book         #1, #2, #3
Information  #1, #2, #3
Retrieval    #1
Search       #2

Nomenclature

Vocabulary / Dictionary / Lexicon

List of terms

Posting

A document which a term occurs in

Postings list

All documents which a term occurs in

Postings

All postings lists

Index time

  • Extract terms from docs
  • Sort vocabulary alphabetically
  • Sort postings lists by document ID

Query time

  • Look up each query term in vocabulary
  • Retrieve postings
  • Intersect / Union
Term         Doc IDs
Book         #1, #2, #3
Information  #1, #2, #3
Retrieval    #1
Search       #2

Query: Information AND Search

→ #2
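The intersection step can be sketched as a merge of two sorted postings lists, which is why postings are sorted by document ID at index time (linear in the combined list lengths):

```python
def intersect(a, b):
    """Merge-style intersection of two sorted postings lists."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# information AND search
print(intersect([1, 2, 3], [2]))  # [2]
```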

Final projects

Solr Workshop

II

Tokenization

Tokenization

a book about information retrieval

[a, book, about, information, retrieval]

Challenge

  • Inverted index can only find exact tokens
Term         Doc IDs
book         #1, #2, #3
information  #1, #2, #3
retrieval    #1
search       #2
  • books will not find book

Challenge

  • How can books find book?
  • wi-fi ↔ wifi?
  • Jack's ↔ Jack?
  • MMT ↔ Multimediatechnology?
  • U.S.A. ↔ USA?

Text analysis

  • Analyse docs and query
  • Add, remove, change terms
Step                                   Index                             Query
Original                               a wi-fi router                    The Wireless Routers
Tokenization                           a wi-fi router                    The Wireless Routers
Lowercasing                            a wi-fi router                    the wireless routers
Stop words                             wi-fi router                      wireless routers
Word delimiter (Index only)            wi-fi wi fi wifi router           wireless routers
Stemming                               wi-fi wi fi wifi router           wireless router
Synonyms wifi → wireless (Index only)  wi-fi wi fi wifi wireless router  wireless router

Will increase recall, but reduce precision
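A minimal sketch of such an analysis chain: tokenize, lowercase, remove stop words, stem. The tiny stop word list and the "stemmer" (strip a trailing "s", but not "ss") are illustrative placeholders, not a real stemming algorithm:

```python
STOP_WORDS = {"a", "the", "for", "about"}

def analyze(text):
    tokens = text.lower().split()                        # tokenize + lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop words
    tokens = [t[:-1] if t.endswith("s") and not t.endswith("ss") else t
              for t in tokens]                           # toy "stemmer"
    return tokens

print(analyze("The Wireless Routers"))  # ['wireless', 'router']
```

Both documents and queries are run through (mostly) the same chain, so the index terms and the query terms meet in the same form.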

Nomenclature

Token

  • Character sequence, meaningful semantic unit
  • No analysis yet
  • the, routers

Type

  • Distinct tokens, same token counts only once
  • the, routers

Terms

  • Index tokens
  • router

Outcome

  • Find more results: wireless → wi-fi
  • Find more imprecise results: wi-fi → wireless
  • Miss some results: to be or not to be
Will this find more results? Will this find less accurate results?

Precision & Recall

Precision

  • Are all results relevant?

Recall

  • Are all relevant documents in the results?

* Do this on the whiteboard?

Precision & Recall

  • Evaluation requires human effort
    • List of information needs
    • Annotated corpus
  • Will never be 100% both
  • Rank results according to relevance
How to achieve 100% recall? How to achieve 100% precision?

Ranked retrieval

Idea

Multiple matches for query term = more important document

A shop where books are bought and sold is a bookshop or bookstore. Books can also be borrowed from libraries.

Idea

Infrequent terms in corpus are more important

apple iphone → iphone more important than apple

Term frequency

At index time:

  • Count term occurrences per doc
  • Ignore order of terms
  • Bag of words

TF

  • #1: a book providing information about information retrieval
  • #2: a book about the search for books
  • #3: a book about information
Term         Doc IDs
Book         #1:1, #2:2, #3:1
Information  #1:2, #3:1
Retrieval    #1:1
Search       #2:1

* Audience participation

TF pitfalls

  • Not always true
  • The Bible does not contain "the bible"
  • Web spam
  • Normalize long documents

Inverse document frequency

  • Rank uncommon terms higher
  • Only relevant for OR search
  • "apple iphone" → "iphone" more important than "apple"
  • idf(t) = num_docs / df(t)

IDF

  • #1: a book providing information about information retrieval
  • #2: a book about the search for books
  • #3: a book about information
Term         IDF  Doc IDs
Book         1    #1:1, #2:2, #3:1
Information  1.5  #1:2, #3:1
Retrieval    3    #1:1
Search       3    #2:1

* idf(t) = 1 is special case

TF-IDF Ranking

Term         IDF  Doc IDs
Book         1    #1:1, #2:2, #3:1
Information  1.5  #1:2, #3:1
Retrieval    3    #1:1
Search       3    #2:1
information retrieval search

#1

2 × 1.5 + 1 × 3 + 0 × 3 = 6

#2

0 × 1.5 + 0 × 3 + 1 × 3 = 3

#3

1 × 1.5 + 0 × 3 + 0 × 3 = 1.5
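The ranking above can be sketched directly from the term frequencies, using the slide's simple idf(t) = num_docs / df(t) (no logarithm):

```python
tf = {  # term -> {doc_id: term frequency}
    "book": {1: 1, 2: 2, 3: 1},
    "information": {1: 2, 3: 1},
    "retrieval": {1: 1},
    "search": {2: 1},
}
num_docs = 3

def score(query_terms):
    """Accumulate tf * idf per matching document."""
    scores = {}
    for term in query_terms:
        postings = tf.get(term, {})
        if not postings:
            continue
        idf = num_docs / len(postings)  # df(t) = number of docs with term
        for doc_id, freq in postings.items():
            scores[doc_id] = scores.get(doc_id, 0) + freq * idf
    return scores

print(score(["information", "retrieval", "search"]))
```

This reproduces the scores above: #1 = 6, #2 = 3, #3 = 1.5.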

Vocabulary structure

Vocabulary structure

  • Dictionary / Hash table
  • Search tree
* Suggestions?

Hash table

Hash table

  • + Fast lookup: Ο(1)
  • + Fast insert: Ο(1)
  • - Collisions
  • - Rehashing
  • - High worst case complexity: Ο(n)
* What is the complexity of lookup / insert? Ο(1)

Binary tree

http://nlp.stanford.edu/IR-book/

Binary tree

  • + Fast lookup: Ο(log n)
  • + Fast insertion: Ο(log n)
  • - Rebalancing
  • - High worst case complexity: Ο(n)
* What is the lookup / insert complexity? Ο(log n)

B-tree

http://nlp.stanford.edu/IR-book/

B-tree

  • + Fast lookup: Ο(log n)
  • + Fast insertion: Ο(log n)
  • + Guaranteed worst case: Ο(log n)
  • + Less frequent rebalancing

Free wildcard queries!

* Why?

Wildcard queries

Wildcard queries

Expand query:

salz*

salzburg OR salzgitter OR salzach

Wildcard queries

salz*

Comes free with search tree

*burg

Build index with reversed terms

sal*urg

Intersect regular and reverse tree

* How to get prefix wildcard queries?
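The salz* expansion comes from the fact that in a sorted vocabulary all terms with a common prefix form a contiguous range. A sketch using binary search over a sorted term list (the vocabulary here is made up for illustration):

```python
import bisect

vocabulary = sorted(["book", "salzach", "salzburg", "salzgitter", "search"])

def expand(prefix):
    """All vocabulary terms starting with prefix."""
    lo = bisect.bisect_left(vocabulary, prefix)
    hi = bisect.bisect_left(vocabulary, prefix + "\uffff")  # past all matches
    return vocabulary[lo:hi]

print(expand("salz"))  # ['salzach', 'salzburg', 'salzgitter']
```

The expanded terms are then combined into one big OR query, exactly as on the slide.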

N-gram queries

search

↓ (3-gram)

$se, sea, ear, arc, rch, ch$
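Splitting a term into its boundary-marked 3-grams can be sketched as:

```python
def ngrams(term, n=3):
    """n-grams of a term with $ as start/end boundary markers."""
    padded = "$" + term + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(ngrams("search"))  # ['$se', 'sea', 'ear', 'arc', 'rch', 'ch$']
```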

N-gram queries

At index time

Term  Doc IDs
$se   1, 2, 3
arc   1
ch$   1, 8
ear   1, 7
rch   1
sea   1, 5, 9

N-gram queries

At query time

Expand query

sea*

$se AND sea

N-gram queries

$se AND sea

Term  Doc IDs
$se   1, 2, 3
arc   1
ch$   1, 8
ear   1, 7
rch   1
sea   1, 5, 9

Spelling correction

Spelling correction

Spelling correction

  1. Find alternatives
  2. Evaluate alternatives
  3. Return alternative results

1. Find alternatives

bock

book, back, lock

Find alternatives

  • Compare query with vocabulary
  • Use index or query log

Levenshtein

  • Edit distance between two words
  • Count inserts, deletes, replaces, swaps

Levenshtein

iformmetoin

  1. Add n → informmetoin
  2. Delete m → informetoin
  3. Replace e with a → informatoin
  4. Swap o and i → information

Levenshtein distance = 4

* Audience question
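Since the slide counts swaps as well as inserts, deletes and replaces, this is the Damerau-Levenshtein (optimal string alignment) variant. A dynamic-programming sketch:

```python
def edit_distance(a, b):
    """Edit distance counting inserts, deletes, replaces and adjacent swaps."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete everything
    for j in range(len(b) + 1):
        d[0][j] = j  # insert everything
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace / match
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)  # swap
    return d[len(a)][len(b)]

print(edit_distance("iformmetoin", "information"))  # 4
```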

Levenshtein

  • Weighted (keyboard distance)

Levenshtein

  • Expensive
  • Cannot be precomputed
  • num(query terms) × num(vocabulary terms)

2. Evaluate alternatives

  • Isolated
  • Context

Isolated

  • seach endine?
    • → peach engine
    • → search ending
    • → search engine

Context

  • Find alternatives for every misspelled query term → collations
  • q(collation)
  • seach endine?
    • → peach engine? 0 results
    • → search ending? 5 results
    • → search engine? 10 results

Evaluate alternatives

  • Most results, or
  • Most frequent in the query log

3. Return alternative results

  • Feedback
  • Transparency

Phonetic correction

Soundex

  • For proper names, brand names, drugs
  • Encode strings according to their sounds
  • Meier → M600

Index

Term (not in index)         Soundex  Doc IDs
meier, maier, mayer, meyer  M600     1, 3, 5
müller, mueller             M460     2, 4, 6
goethe, göthe               G300     7, 8, 9

Query

Meier → M600
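A simplified Soundex sketch (it skips some edge cases of the full algorithm, such as the h/w separator rule, but is enough to reproduce the codes above):

```python
# Map consonants to their Soundex digit; vowels, y, h, w get no digit.
CODES = {c: d for d, letters in enumerate(
    ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1) for c in letters}

def soundex(word):
    word = word.lower()
    code = word[0].upper()          # keep the first letter
    prev = CODES.get(word[0])
    for c in word[1:]:
        digit = CODES.get(c)
        if digit and digit != prev:  # skip undigited chars and repeats
            code += str(digit)
        prev = digit
    return (code + "000")[:4]        # pad / truncate to 4

print(soundex("meier"))    # M600
print(soundex("mueller"))  # M460
```

Indexing the Soundex code instead of (or alongside) the raw term means "Meier" and "Meyer" land in the same postings list.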

Spelling correction

  • Recall up
  • Precision down
* What does spelling correction do to recall / precision?

Soundex alternatives

  • Daitch–Mokotoff Soundex
  • Metaphone
  • Double Metaphone

Field weights

Field weights

Weighted index

Doc  Author           Title
#1   Arthur McAuthor  A book providing information about information retrieval
#2   Shakesbeer       A book about the search for arthur

Term         Doc IDs
arthur       #1:Author, #2:Title
book         #1:Title, #2:Title
information  #1:Title
mcauthor     #1:Author
shakesbeer   #2:Author
...

Field queries

Term        Doc IDs
arthur      #1:Author, #2:Title
shakesbeer  #2:Author
...
  • title:arthur? #1
  • author:shakesbeer? #2
* Audience question

Field weights

Term    Doc IDs
arthur  #1:Author, #2:Title
book    #1:Title, #2:Title
...
  • weight(author) = 10
  • weight(title) = 1
  • arthur book?
    • #1 → author + title = 10 + 1 = 11
    • #2 → title + title = 1 + 1 = 2
* Audience question
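The weighted scoring above can be sketched by summing the weight of each field a query term occurs in:

```python
index = {  # term -> {doc_id: [fields the term occurs in]}
    "arthur": {1: ["author"], 2: ["title"]},
    "book": {1: ["title"], 2: ["title"]},
}
weights = {"author": 10, "title": 1}

def score(query_terms):
    scores = {}
    for term in query_terms:
        for doc_id, fields in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + sum(
                weights[f] for f in fields)
    return scores

print(score(["arthur", "book"]))  # {1: 11, 2: 2}
```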

Field weights

  • Determining weights is hard
  • Use annotated corpus

III

Scaling

Scaling reasons

  • Query volume
  • Index volume
  • Resiliency

Scaling

What are the two directions we can scale in?

Scaling up / Vertical

Scaling up

  • + Easy
  • + Little overhead (network, hardware)
  • - No fault-tolerance
  • - No parallelization
  • - Scaling limited by hardware

Scaling out / Horizontal

Scaling out

  • + Fault-tolerance, resiliency
  • + Parallelization
  • + "Unlimited" scaling
  • - Complex
  • - Overhead: Network, Monitoring

Distributed search

Distributed search

  • Partitioned index
  • Distributed queries

Partitioned index

  • By term
  • By doc
What can the index be partitioned by? Think of the inverted index.

Term-partitioned index

Term-partitioned index

  • - Uneven distribution
  • - Inter-node communication
  • - Hard to update index

Document-partitioned index

Document-partitioned index

  • Number of nodes is fixed
  • node(doc) = id(doc) % num(nodes)

Distributed query

Sharding

Sharding nomenclature

Shard

  • Slice of document collection

Master / Leader / Primary

  • Distribute requests to Replicas
  • Distribute requests to other Masters

Replica

  • Contains all documents of shard
  • Will actually handle queries
  • Can become Master

Shards

Shards

Replicas

Replicas

Vector Space Model

Vector Space Model

  • Calculate similarity between
    • queries and docs
    • docs and docs
    • queries and queries

Vector Space Model

  • Queries / docs are vectors
  • Term-document matrix → document vector
How can a document be a vector?

Document vector

Term         #1  #2  #3
Book         10   3   0
Information   5   2   1
Retrieval     0   0   2
  • $\vec{V}(\#1)$ = (10, 5, 0)
  • $\vec{V}(\#2)$ = (3, 2, 0)
  • $\vec{V}(\#3)$ = (0, 1, 2)

Vector distance

$$\textrm{d}(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}$$

Vector distance

Vector Difference

\[\begin{aligned} \textrm{d}(\#1, \#2) & = \sqrt{(10 - 3)^2 + (5 - 2)^2} \\ & = \sqrt{7^2 + 3^2} \\ & = \sqrt{49 + 9} \\ & = \sqrt{58} \\ & \approx 7.62 \end{aligned} \]

Unit vector

$$\vec{v} = \frac{\vec{V} }{ |\vec{V}| }$$

Unit vector

\[\begin{aligned} \vec{v}(\#1) & = \frac{ \begin{pmatrix}10\\5\end{pmatrix} }{ \sqrt{10^2 + 5^2} } \\ \\ & = \frac{ \begin{pmatrix}10\\5\end{pmatrix} }{ 11.18 } \\ \\ & = \begin{pmatrix}0.89\\0.45\end{pmatrix} \end{aligned} \]

Unit vector

\[\begin{aligned} |\vec{v}(\#1)| & = \sqrt{0.89^2 + 0.45^2} \\ & = \sqrt{1} \\ & = 1 \end{aligned}\]

Cosine similarity

$$\textrm{sim}(d_1, d_2) = \frac{ \vec{V}(d_1) }{ |\vec{V}(d_1)|} \cdot \frac{\vec{V}(d_2) }{ |\vec{V}(d_2)| } = \frac{ \vec{V}(d_1) \vec{V}(d_2) }{ |\vec{V}(d_1)| |\vec{V}(d_2)| }$$

Cosine similarity

\[\begin{aligned} \textrm{sim}(d_1, d_2) & = \frac{ \begin{pmatrix}10 \\ 5\end{pmatrix} \cdot \begin{pmatrix}3 \\ 2\end{pmatrix} }{ \left|\begin{pmatrix}10 \\ 5\end{pmatrix}\right| \left|\begin{pmatrix}3 \\ 2\end{pmatrix}\right| } \\ \\ & = \frac{30 + 10}{\sqrt{10^2 + 5^2} \sqrt{3^2 + 2^2}} \\ & = \frac{40}{\sqrt{125} \sqrt{13}} \\ & = \frac{40}{40.31} \\ \\ & \approx 0.99 \end{aligned}\]
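The same computation in Python, for the two-component document vectors used above (dot product of the vectors divided by the product of their lengths):

```python
import math

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.hypot(*v1) * math.hypot(*v2))

print(round(cosine((10, 5), (3, 2)), 2))  # 0.99
print(round(cosine((10, 5), (1, 0)), 2))  # 0.89
```

The second call already shows the query case: the query "book" is just the vector (1, 0) in the same term space.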

Query vector

  • "book"
  • $\vec{V}(q) = \begin{pmatrix}1 \\ 0\end{pmatrix}$
What does the query vector look like?


Query vector

Query vector

$$sim(\#1, q) = \frac{ \begin{pmatrix}10 \\ 5\end{pmatrix} \cdot \begin{pmatrix}1 \\ 0\end{pmatrix} }{ \left|\begin{pmatrix}10 \\ 5\end{pmatrix}\right| \left|\begin{pmatrix}1 \\ 0\end{pmatrix}\right| } = 0.89$$

$$sim(\#2, q) = \frac{ \begin{pmatrix}3 \\ 2\end{pmatrix} \cdot \begin{pmatrix}1 \\ 0\end{pmatrix} }{ \left|\begin{pmatrix}3 \\ 2\end{pmatrix}\right| \left|\begin{pmatrix}1 \\ 0\end{pmatrix}\right| } = 0.83$$

Missing topics

  • Clustering
  • Web search
  • Web crawling
  • Pagerank
  • Search Result Evaluation
  • Relevance feedback

Feedback

We're hiring

Final projects