textsearch102



On Github georgms / textsearch102

Text Search 102

Georg M. Sorst, CTO FINDOLOGIC

Agenda

I

  • Motivation
  • Boolean retrieval
  • Term-document matrix
  • Inverted index
  • Querying the index
  • Final projects
  • Solr Workshop

II

  • Tokenization
  • Precision & Recall
  • Ranked Retrieval
  • Vocabulary Structure
  • Wildcard Queries
  • Spelling Correction
  • Phonetic Correction
  • Field Weights
  • Project status
  • Solr Q&A

III

  • Scaling
  • Vector Space Model
  • Missing Topics
  • Feedback
  • Final projects

I

Motivation

Search is everywhere

  • Internet
  • eCommerce
  • CMS, Blogs
  • Intranet
  • Academia
  • ...

Boolean retrieval

Find all documents containing the terms and satisfying the conditions:

fh AND salzburg
mmt OR mma
(information AND retrieval) OR search

Nomenclature

Documents

Books, chapters, pages, web pages, news posts...

Document collection, Corpus

All the documents

Terms

Like words, but maybe "FH Salzburg" and "A1" as well

Will it match?

(information AND retrieval) OR search
  • #1: a book about information retrieval
  • Match
  • #2: a book about the search for information
  • Match
  • #3: a book about information
  • No Match
Audience question

Nomenclature

Information need

What the user is looking for

"learn about information retrieval and search"

Query

How the user talks to the computer

(information AND retrieval) OR search

How to search?

Grep Intersect, Union

How to search?

  • #1: a book about information retrieval
  • #2: a book about the search for information
  • #3: a book about information
(information AND retrieval) OR search

Complexity?

Ο(num terms in corpus)

Audience question

Term-document matrix

  • Aka incidence matrix
  • Rows = Terms, Columns = Docs
  • Cell = 1 if Term in Doc, else 0
  • Needs to be pre-built
  • #1: a book about information retrieval
  • #2: a book about the search for information
  • #3: a book about information
Term         #1  #2  #3
Book          1   1   1
Information   1   1   1
Retrieval     1   0   0
Search        0   1   0

* Audience question
  • Document vector
    • #1: (1, 1, 1, 0)
  • Term vector
    • Retrieval: (1, 0, 0)

How to query

  • (information AND retrieval) OR search
  • = (111 ∧ 100) ∨ 010
  • = 100 ∨ 010
  • = 110
  • = #1 and #2
Audience question
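The Boolean query above can be sketched in a few lines of Python, using sets of matching document IDs in place of raw bit vectors (same result, simpler code):

```python
docs = {
    1: "a book about information retrieval",
    2: "a book about the search for information",
    3: "a book about information",
}

# Build the term-document matrix: one set of doc IDs per term.
matrix = {}
for doc_id, text in docs.items():
    for term in text.split():
        matrix.setdefault(term, set()).add(doc_id)

# (information AND retrieval) OR search
result = (matrix["information"] & matrix["retrieval"]) | matrix["search"]
print(sorted(result))  # [1, 2]
```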

Size

  • num(distinct terms)*num(docs)
  • Very sparse (lots of 0s)

Inverted index

  • Store only 1s = Term occurrences
  • Terms -> Docs
  • #1: a book about information retrieval
  • #2: a book about the search for information
  • #3: a book about information
Term         Doc IDs
Book         #1, #2, #3
Information  #1, #2, #3
Retrieval    #1
Search       #2

Nomenclature

Vocabulary / Dictionary / Lexicon

List of terms

Posting

A document which a term occurs in

Postings list

All documents which a term occurs in

Postings

All postings lists

Index time

  • Extract terms from docs
  • Sort vocabulary alphabetically
  • Sort postings lists by document ID

Query time

  • Look up each query term in vocabulary
  • Retrieve postings
  • Intersect / Union
Term         Doc IDs
Book         #1, #2, #3
Information  #1, #2, #3
Retrieval    #1
Search       #2

Query: Information AND Search

→ #2
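The intersection step can be sketched as a merge of two sorted postings lists, which is why postings are sorted by document ID at index time (linear in the combined list lengths):

```python
def intersect(a, b):
    """Merge-style intersection of two sorted postings lists."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# information AND search
print(intersect([1, 2, 3], [2]))  # [2]
```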

Final projects

Solr Workshop

II

Tokenization

Tokenization

a book about information retrieval

[a, book, about, information, retrieval]

Challenge

  • Inverted index can only find exact tokens
Term         Doc IDs
book         #1, #2, #3
information  #1, #2, #3
retrieval    #1
search       #2
  • books will not find book

Challenge

  • How can books find book?
  • wi-fi ↔ wifi?
  • Jack's ↔ Jack?
  • MMT ↔ Multimediatechnology?
  • U.S.A. ↔ USA?

Text analysis

  • Analyse docs and query
  • Add, remove, change terms
Step                                   Index                             Query
Original                               a wi-fi router                    The Wireless Routers
Tokenization                           a wi-fi router                    The Wireless Routers
Lowercasing                            a wi-fi router                    the wireless routers
Stop words                             wi-fi router                      wireless routers
Word delimiter (Index only)            wi-fi wi fi wifi router           wireless routers
Stemming                               wi-fi wi fi wifi router           wireless router
Synonyms wifi → wireless (Index only)  wi-fi wi fi wifi wireless router  wireless router

Will increase recall, but reduce precision
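A minimal sketch of such an analysis chain: tokenize, lowercase, remove stop words, stem. The tiny stop word list and the "stemmer" (strip a trailing "s", but not "ss") are illustrative placeholders, not a real stemming algorithm:

```python
STOP_WORDS = {"a", "the", "for", "about"}

def analyze(text):
    tokens = text.lower().split()                        # tokenize + lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop words
    tokens = [t[:-1] if t.endswith("s") and not t.endswith("ss") else t
              for t in tokens]                           # toy "stemmer"
    return tokens

print(analyze("The Wireless Routers"))  # ['wireless', 'router']
```

Both documents and queries are run through (mostly) the same chain, so the index terms and the query terms meet in the same form.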

Nomenclature

Token

  • Character sequence, meaningful semantic unit
  • No analysis yet
  • the, routers

Type

  • Distinct tokens, same token counts only once
  • the, routers

Terms

  • Index tokens
  • router

Outcome

  • Find more results: wireless → wi-fi
  • Find more imprecise results: wi-fi → wireless
  • Miss some results: to be or not to be
Will this find more results? Will this find less accurate results?

Precision & Recall

Precision

  • Are all results relevant?

Recall

  • Are all relevant documents in the results?

* Do this on the whiteboard?

Precision & Recall

  • Evaluation requires human effort
    • List of information needs
    • Annotated corpus
  • Will never be 100% both
  • Rank results according to relevance
How to achieve 100% recall? How to achieve 100% precision?

Ranked retrieval

Idea

Multiple matches for query term = more important document

A shop where books are bought and sold is a bookshop or bookstore. Books can also be borrowed from libraries.

Idea

Infrequent terms in corpus are more important

apple iphone → iphone more important than apple

Term frequency

At index time:

  • Count term occurrences per doc
  • Ignore order of terms
  • Bag of words

TF

  • #1: a book providing information about information retrieval
  • #2: a book about the search for books
  • #3: a book about information
Term         Doc IDs
Book         #1:1, #2:2, #3:1
Information  #1:2, #3:1
Retrieval    #1:1
Search       #2:1

* Audience participation

TF pitfalls

  • Not always true
  • The Bible does not contain "the bible"
  • Web spam
  • Normalize long documents

Inverse document frequency

  • Rank uncommon terms higher
  • Only relevant for OR search
  • "apple iphone" → "iphone" more important than "apple"
  • idf(t) = num_docs / df(t)

IDF

  • #1: a book providing information about information retrieval
  • #2: a book about the search for books
  • #3: a book about information
Term         IDF  Doc IDs
Book         1    #1:1, #2:2, #3:1
Information  1.5  #1:2, #3:1
Retrieval    3    #1:1
Search       3    #2:1

* idf(t) = 1 is special case

TF-IDF Ranking

Term         IDF  Doc IDs
Book         1    #1:1, #2:2, #3:1
Information  1.5  #1:2, #3:1
Retrieval    3    #1:1
Search       3    #2:1
information retrieval search

#1

2 × 1.5 + 1 × 3 + 0 × 3 = 6

#2

0 × 1.5 + 0 × 3 + 1 × 3 = 3

#3

1 × 1.5 + 0 × 3 + 0 × 3 = 1.5
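The ranking above can be sketched directly from the term frequencies, using the slide's simple idf(t) = num_docs / df(t) (no logarithm):

```python
tf = {  # term -> {doc_id: term frequency}
    "book": {1: 1, 2: 2, 3: 1},
    "information": {1: 2, 3: 1},
    "retrieval": {1: 1},
    "search": {2: 1},
}
num_docs = 3

def score(query_terms):
    """Accumulate tf * idf per matching document."""
    scores = {}
    for term in query_terms:
        postings = tf.get(term, {})
        if not postings:
            continue
        idf = num_docs / len(postings)  # df(t) = number of docs with term
        for doc_id, freq in postings.items():
            scores[doc_id] = scores.get(doc_id, 0) + freq * idf
    return scores

print(score(["information", "retrieval", "search"]))
```

This reproduces the scores above: #1 = 6, #2 = 3, #3 = 1.5.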

Vocabulary structure

Vocabulary structure

  • Dictionary / Hash table
  • Search tree
* Suggestions?

Hash table

Hash table

  • + Fast lookup: Ο(1)
  • + Fast insert: Ο(1)
  • - Collisions
  • - Rehashing
  • - High worst case complexity: Ο(n)
* What is the complexity of lookup / insert? Ο(1)

Binary tree

http://nlp.stanford.edu/IR-book/

Binary tree

  • + Fast lookup: Ο(log n)
  • + Fast insertion: Ο(log n)
  • - Rebalancing
  • - High worst case complexity: Ο(n)
* What is the lookup / insert complexity? Ο(log n)

B-tree

http://nlp.stanford.edu/IR-book/

B-tree

  • + Fast lookup: Ο(log n)
  • + Fast insertion: Ο(log n)
  • + Guaranteed worst case: Ο(log n)
  • + Less frequent rebalancing

Free wildcard queries!

* Why?

Wildcard queries

Wildcard queries

Expand query:

salz*

salzburg OR salzgitter OR salzach

Wildcard queries

salz*

Comes free with search tree

*burg

Build index with reversed terms

sal*urg

Intersect regular and reverse tree

* How to get prefix wildcard queries?
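The salz* expansion comes from the fact that in a sorted vocabulary all terms with a common prefix form a contiguous range. A sketch using binary search over a sorted term list (the vocabulary here is made up for illustration):

```python
import bisect

vocabulary = sorted(["book", "salzach", "salzburg", "salzgitter", "search"])

def expand(prefix):
    """All vocabulary terms starting with prefix."""
    lo = bisect.bisect_left(vocabulary, prefix)
    hi = bisect.bisect_left(vocabulary, prefix + "\uffff")  # past all matches
    return vocabulary[lo:hi]

print(expand("salz"))  # ['salzach', 'salzburg', 'salzgitter']
```

The expanded terms are then combined into one big OR query, exactly as on the slide.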

N-gram queries

search

↓ (3-gram)

$se, sea, ear, arc, rch, ch$
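Splitting a term into its boundary-marked 3-grams can be sketched as:

```python
def ngrams(term, n=3):
    """n-grams of a term with $ as start/end boundary markers."""
    padded = "$" + term + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(ngrams("search"))  # ['$se', 'sea', 'ear', 'arc', 'rch', 'ch$']
```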

N-gram queries

At index time

Term  Doc IDs
$se   1, 2, 3
arc   1
ch$   1, 8
ear   1, 7
rch   1
sea   1, 5, 9

N-gram queries

At query time

Expand query

sea*

$se AND sea

N-gram queries

$se AND sea

Term  Doc IDs
$se   1, 2, 3
arc   1
ch$   1, 8
ear   1, 7
rch   1
sea   1, 5, 9

Spelling correction

Spelling correction

Spelling correction

  1. Find alternatives
  2. Evaluate alternatives
  3. Return alternative results

1. Find alternatives

bock

book, back, lock

Find alternatives

  • Compare query with vocabulary
  • Use index or query log

Levenshtein

  • Edit distance between two words
  • Count inserts, deletes, replaces, swaps

Levenshtein

iformmetoin

  1. Add n → informmetoin
  2. Delete m → informetoin
  3. Replace e with a → informatoin
  4. Swap o and i → information

Levenshtein distance = 4

* Audience question
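Since the slide counts swaps as well as inserts, deletes and replaces, this is the Damerau-Levenshtein (optimal string alignment) variant. A dynamic-programming sketch:

```python
def edit_distance(a, b):
    """Edit distance counting inserts, deletes, replaces and adjacent swaps."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete everything
    for j in range(len(b) + 1):
        d[0][j] = j  # insert everything
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace / match
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)  # swap
    return d[len(a)][len(b)]

print(edit_distance("iformmetoin", "information"))  # 4
```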

Levenshtein

  • Weighted (keyboard distance)

Levenshtein

  • Expensive
  • Cannot be precomputed
  • num(query terms) × num(vocabulary terms)

2. Evaluate alternatives

  • Isolated
  • Context

Isolated

  • seach endine?
    • → peach engine
    • → search ending
    • → search engine

Context

  • Find alternatives for every misspelled query term → collations
  • q(collation)
  • seach endine?
    • → peach engine? 0 results
    • → search ending? 5 results
    • → search engine? 10 results

Evaluate alternatives

  • Most results, or
  • Most frequent in the query log

3. Return alternative results

  • Feedback
  • Transparency

Phonetic correction

Soundex

  • For proper names, brand names, drugs
  • Encode strings according to their sounds
  • Meier → M600

Index

Term (not in index)         Soundex  Doc IDs
meier, maier, mayer, meyer  M600     1, 3, 5
müller, mueller             M460     2, 4, 6
goethe, göthe               G300     7, 8, 9

Query

Meier → M600
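A simplified Soundex sketch (it skips some edge cases of the full algorithm, such as the h/w separator rule, but is enough to reproduce the codes above):

```python
# Map consonants to their Soundex digit; vowels, y, h, w get no digit.
CODES = {c: d for d, letters in enumerate(
    ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1) for c in letters}

def soundex(word):
    word = word.lower()
    code = word[0].upper()          # keep the first letter
    prev = CODES.get(word[0])
    for c in word[1:]:
        digit = CODES.get(c)
        if digit and digit != prev:  # skip undigited chars and repeats
            code += str(digit)
        prev = digit
    return (code + "000")[:4]        # pad / truncate to 4

print(soundex("meier"))    # M600
print(soundex("mueller"))  # M460
```

Indexing the Soundex code instead of (or alongside) the raw term means "Meier" and "Meyer" land in the same postings list.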

Spelling correction

  • Recall up
  • Precision down
* What does spelling correction do to recall / precision?

Soundex alternatives

  • Daitch–Mokotoff Soundex
  • Metaphone
  • Double Metaphone

Field weights

Field weights

Weighted index

Doc  Author           Title
#1   Arthur McAuthor  A book providing information about information retrieval
#2   Shakesbeer       A book about the search for arthur

Term         Doc IDs
arthur       #1:Author, #2:Title
book         #1:Title, #2:Title
information  #1:Title
mcauthor     #1:Author
shakesbeer   #2:Author
...

Field queries

Term        Doc IDs
arthur      #1:Author, #2:Title
shakesbeer  #2:Author
...
  • title:arthur? #1
  • author:shakesbeer? #2
* Audience question

Field weights

Term    Doc IDs
arthur  #1:Author, #2:Title
book    #1:Title, #2:Title
...
  • weight(author) = 10
  • weight(title) = 1
  • arthur book?
    • #1 → author + title = 10 + 1 = 11
    • #2 → title + title = 1 + 1 = 2
* Audience question
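The weighted scoring above can be sketched by summing the weight of each field a query term occurs in:

```python
index = {  # term -> {doc_id: [fields the term occurs in]}
    "arthur": {1: ["author"], 2: ["title"]},
    "book": {1: ["title"], 2: ["title"]},
}
weights = {"author": 10, "title": 1}

def score(query_terms):
    scores = {}
    for term in query_terms:
        for doc_id, fields in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + sum(
                weights[f] for f in fields)
    return scores

print(score(["arthur", "book"]))  # {1: 11, 2: 2}
```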

Field weights

  • Determining weights is hard
  • Use annotated corpus

III

Scaling

Scaling reasons

  • Query volume
  • Index volume
  • Resiliency

Scaling

What are the two directions we can scale in?

Scaling up / Vertical

Scaling up

  • + Easy
  • + Little overhead (network, hardware)
  • - No fault-tolerance
  • - No parallelization
  • - Scaling limited by hardware

Scaling out / Horizontal

Scaling out

  • + Fault-tolerance, resiliency
  • + Parallelization
  • + "Unlimited" scaling
  • - Complex
  • - Overhead: Network, Monitoring

Distributed search

Distributed search

  • Partitioned index
  • Distributed queries

Partitioned index

  • By term
  • By doc
What can the index be partitioned by? Think of the inverted index.

Term-partitioned index

Term-partitioned index

  • - Uneven distribution
  • - Inter-node communication
  • - Hard to update index

Document-partitioned index

Document-partitioned index

  • Number of nodes is fixed
  • node(doc) = id(doc) % num(nodes)

Distributed query

Sharding

Sharding nomenclature

Shard

  • Slice of document collection

Master / Leader / Primary

  • Distribute requests to Replicas
  • Distribute requests to other Masters

Replica

  • Contains all documents of shard
  • Will actually handle queries
  • Can become Master

Shards

Shards

Replicas

Replicas

Vector Space Model

Vector Space Model

  • Calculate similarity between
    • queries and docs
    • docs and docs
    • queries and queries

Vector Space Model

  • Queries / docs are vectors
  • Term-document matrix → document vector
How can a document be a vector?

Document vector

Term         #1  #2  #3
Book         10   3   0
Information   5   2   1
Retrieval     0   0   2
  • $\vec{V}(\#1)$ = (10, 5, 0)
  • $\vec{V}(\#2)$ = (3, 2, 0)
  • $\vec{V}(\#3)$ = (0, 1, 2)

Vector distance

$$\textrm{d}(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}$$

Vector distance

Vector Difference

\[\begin{aligned} \textrm{d}(\#1, \#2) & = \sqrt{(10 - 3)^2 + (5 - 2)^2} \\ & = \sqrt{7^2 + 3^2} \\ & = \sqrt{49 + 9} \\ & = \sqrt{58} \\ & \approx 7.62 \end{aligned} \]

Unit vector

$$\vec{v} = \frac{\vec{V} }{ |\vec{V}| }$$

Unit vector

\[\begin{aligned} \vec{v}(\#1) & = \frac{ \begin{pmatrix}10\\5\end{pmatrix} }{ \sqrt{10^2 + 5^2} } \\ \\ & = \frac{ \begin{pmatrix}10\\5\end{pmatrix} }{ 11.18 } \\ \\ & = \begin{pmatrix}0.89\\0.45\end{pmatrix} \end{aligned} \]

Unit vector

\[\begin{aligned} |\vec{v}(\#1)| & = \sqrt{0.89^2 + 0.45^2} \\ & = \sqrt{1} \\ & = 1 \end{aligned}\]

Cosine similarity

$$\textrm{sim}(d_1, d_2) = \frac{ \vec{V}(d_1) }{ |\vec{V}(d_1)|} \cdot \frac{\vec{V}(d_2) }{ |\vec{V}(d_2)| } = \frac{ \vec{V}(d_1) \vec{V}(d_2) }{ |\vec{V}(d_1)| |\vec{V}(d_2)| }$$

Cosine similarity

\[\begin{aligned} \textrm{sim}(d_1, d_2) & = \frac{ \begin{pmatrix}10 \\ 5\end{pmatrix} \cdot \begin{pmatrix}3 \\ 2\end{pmatrix} }{ \left|\begin{pmatrix}10 \\ 5\end{pmatrix}\right| \left|\begin{pmatrix}3 \\ 2\end{pmatrix}\right| } \\ \\ & = \frac{30 + 10}{\sqrt{10^2 + 5^2} \sqrt{3^2 + 2^2}} \\ & = \frac{40}{\sqrt{125} \sqrt{13}} \\ & = \frac{40}{40.31} \\ \\ & \approx 0.99 \end{aligned}\]
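The same computation in Python, for the two-component document vectors used above (dot product of the vectors divided by the product of their lengths):

```python
import math

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.hypot(*v1) * math.hypot(*v2))

print(round(cosine((10, 5), (3, 2)), 2))  # 0.99
print(round(cosine((10, 5), (1, 0)), 2))  # 0.89
```

The second call already shows the query case: the query "book" is just the vector (1, 0) in the same term space.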

Query vector

  • "book"
  • $\vec{V}(q) = \begin{pmatrix}1 \\ 0\end{pmatrix}$
What does the query vector look like?


Query vector

Query vector

$$sim(\#1, q) = \frac{ \begin{pmatrix}10 \\ 5\end{pmatrix} \cdot \begin{pmatrix}1 \\ 0\end{pmatrix} }{ \left|\begin{pmatrix}10 \\ 5\end{pmatrix}\right| \left|\begin{pmatrix}1 \\ 0\end{pmatrix}\right| } = 0.89$$

$$sim(\#2, q) = \frac{ \begin{pmatrix}3 \\ 2\end{pmatrix} \cdot \begin{pmatrix}1 \\ 0\end{pmatrix} }{ \left|\begin{pmatrix}3 \\ 2\end{pmatrix}\right| \left|\begin{pmatrix}1 \\ 0\end{pmatrix}\right| } = 0.83$$

Missing topics

  • Clustering
  • Web search
  • Web crawling
  • Pagerank
  • Search Result Evaluation
  • Relevance feedback

Feedback

We're hiring

Final projects