Georg M. Sorst, CTO FINDOLOGIC
Find all documents containing the terms and satisfying the conditions:
fh AND salzburg
mmt OR mma
(information AND retrieval) OR search
Books, chapters, pages, web pages, news posts...
All the documents
Like words, but maybe "FH Salzburg" and "A1" as well
(information AND retrieval) OR search
What the user is looking for
"learn about information retrieval and search"
How the user talks to the computer
(information AND retrieval) OR search
O(number of terms in corpus)
Audience question
List of terms
Document in which the term occurs
All documents in which the term occurs
All postings lists
Query: Information AND Search
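The terms, postings lists and AND query above can be sketched in a few lines of Python. This is a minimal illustration, not a production index: the three sample documents and the helper names `build_inverted_index` and `intersect` are assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted postings list of document IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def intersect(p1, p2):
    """AND: walk two sorted postings lists and keep IDs found in both."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical sample corpus, keyed by document ID.
docs = {
    1: "information retrieval and search",
    2: "search engines",
    3: "information theory",
}
index = build_inverted_index(docs)

# Query: Information AND Search (terms are lowercased like the index).
hits = intersect(index.get("information", []), index.get("search", []))
```

Looking up a term is a dictionary access, so the query cost depends on the length of the two postings lists rather than on the size of the corpus.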
a book about information retrieval
↓
[a, book, about, information, retrieval]
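A minimal tokenizer for the step above, assuming that lowercasing and splitting on non-word characters is good enough. The function name `tokenize` is made up for this sketch, and multi-word terms such as "FH Salzburg" would need extra handling.

```python
import re


def tokenize(text):
    """Lowercase the text and return its runs of word characters."""
    return re.findall(r"\w+", text.lower())


tokens = tokenize("a book about information retrieval")
```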
apple iphone → iphone more important than apple
information retrieval search
2 × 1.5 + 1 × 3 + 0 × 3 = 6
0 × 1.5 + 0 × 3 + 1 × 3 = 3
1 × 1.5 + 0 × 3 + 0 × 3 = 1.5
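The three sums above can be reproduced with a small scoring function: each document's score is the sum of term count × term weight. The weights (1.5, 3, 3) and counts are read off the arithmetic above; the document labels `doc1` to `doc3` and the function name `score` are assumptions for this sketch.

```python
# Term weights as used in the sums above (rarer terms weigh more).
weights = {"information": 1.5, "retrieval": 3.0, "search": 3.0}

# How often each query term occurs per document (hypothetical labels).
term_counts = {
    "doc1": {"information": 2, "retrieval": 1},
    "doc2": {"search": 1},
    "doc3": {"information": 1},
}


def score(counts, weights):
    """Sum of (term count x term weight) over all query terms."""
    return sum(counts.get(term, 0) * w for term, w in weights.items())


scores = {doc: score(counts, weights) for doc, counts in term_counts.items()}

# Best match first.
ranking = sorted(scores, key=scores.get, reverse=True)
```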
Expand query:
salz*
↓
salzburg OR salzgitter OR salzach
Comes free with search tree
Build index with reversed terms
Intersect regular and reverse tree
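The three wildcard strategies above, sketched with a sorted term list and binary search standing in for the search tree. The term list, the helper names and the `"\uffff"` sentinel (assumed not to occur in any term) are illustrative assumptions.

```python
from bisect import bisect_left


def prefix_matches(sorted_terms, prefix):
    """All terms starting with prefix, found by binary search."""
    lo = bisect_left(sorted_terms, prefix)
    # "\uffff" sorts after any character we expect in a term (assumption).
    hi = bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]


def suffix_matches(sorted_reversed_terms, suffix):
    """All terms ending with suffix, via the index of reversed terms."""
    found = prefix_matches(sorted_reversed_terms, suffix[::-1])
    return sorted(t[::-1] for t in found)


def infix_matches(sorted_terms, sorted_reversed_terms, prefix, suffix):
    """prefix*suffix: intersect the regular and the reversed lookup."""
    both = set(prefix_matches(sorted_terms, prefix)) & set(
        suffix_matches(sorted_reversed_terms, suffix)
    )
    # Prefix and suffix must not overlap inside the matched term.
    return sorted(t for t in both if len(t) >= len(prefix) + len(suffix))


terms = sorted(["salzach", "salzburg", "salzgitter", "search", "wien", "hamburg"])
reversed_terms = sorted(t[::-1] for t in terms)

trailing = prefix_matches(terms, "salz")                 # salz*
leading = suffix_matches(reversed_terms, "burg")         # *burg
infix = infix_matches(terms, reversed_terms, "s", "g")   # s*g
```

The matched terms are then combined with OR, as in the expanded query above.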
How to get prefix wildcard queries?
search
↓ (3-gram)
$se, sea, ear, arc, rch, ch$
Expand query
sea*
↓
$se AND sea
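A sketch of the 3-gram approach: split every term into k-grams with `$` as boundary marker, index terms by k-gram, and answer `sea*` by intersecting the entries for `$se` and `sea`. The term list and function names are assumptions for this example, and only trailing wildcards with at least k-1 leading characters are handled.

```python
from collections import defaultdict


def kgrams(term, k=3):
    """k-grams of a term, padded with '$' at both ends."""
    padded = f"${term}$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]


def query_kgrams(pattern, k=3):
    """k-grams of a trailing-wildcard pattern such as 'sea*'."""
    padded = "$" + pattern.rstrip("*")  # anchored at the start only
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]


def build_kgram_index(terms, k=3):
    """Map each k-gram to the set of terms that contain it."""
    index = defaultdict(set)
    for term in terms:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index


def expand(pattern, index, k=3):
    """Terms matching a trailing-wildcard pattern."""
    grams = query_kgrams(pattern, k)
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # k-gram matches can be false positives, so check the real prefix.
    prefix = pattern.rstrip("*")
    return sorted(t for t in candidates if t.startswith(prefix))


# Hypothetical dictionary of terms.
terms = ["search", "seal", "sea", "research", "arch"]
index = build_kgram_index(terms)
matches = expand("sea*", index)
```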
Term  Doc IDs
$se   1, 2, 3
arc   1
ch$   1, 8
ear   1, 7
rch   1
sea   1, 5, 9
bock
↓
book, back, lock
↓
Levenshtein distance = 4
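A minimal sketch of the Levenshtein (edit) distance, assuming unit cost for insertion, deletion and substitution. It works for any pair of strings; the word list is taken from the spelling example above, where each candidate differs from "bock" by a single substitution.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


candidates = ["book", "back", "lock"]
distances = {c: levenshtein("bock", c) for c in candidates}
```

Candidates with a smaller distance to the misspelled term are the more plausible corrections.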
Meier → M600
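The code M600 is what the classic American Soundex scheme produces for "Meier". A compact sketch of that scheme follows, assuming the usual rules: keep the first letter, code the following consonants, collapse repeated codes, treat h and w as transparent, pad to four characters. The names `CODES` and `soundex` are chosen for this example.

```python
# Consonant groups of American Soundex; vowels and y get no code.
GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {letter: digit for digit, letters in GROUPS.items() for letter in letters}


def soundex(name):
    """Four-character phonetic code: first letter plus three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate equal codes
        code = CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


code = soundex("Meier")
```

Names that sound alike, such as "Meier" and "Meyer", end up with the same code and can be matched through an index on the codes.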
$$\textrm{d}(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}$$
$$\textrm{sim}(d_1, d_2) = \frac{ \vec{V}(d_1) }{ |\vec{V}(d_1)|} \cdot \frac{\vec{V}(d_2) }{ |\vec{V}(d_2)| } = \frac{ \vec{V}(d_1) \cdot \vec{V}(d_2) }{ |\vec{V}(d_1)| \, |\vec{V}(d_2)| }$$
$$\textrm{sim}(\#1, q) = \frac{ \begin{pmatrix}10 \\ 5\end{pmatrix} \cdot \begin{pmatrix}1 \\ 0\end{pmatrix} }{ \left|\begin{pmatrix}10 \\ 5\end{pmatrix}\right| \left|\begin{pmatrix}1 \\ 0\end{pmatrix}\right| } = 0.89$$
$$\textrm{sim}(\#2, q) = \frac{ \begin{pmatrix}3 \\ 2\end{pmatrix} \cdot \begin{pmatrix}1 \\ 0\end{pmatrix} }{ \left|\begin{pmatrix}3 \\ 2\end{pmatrix}\right| \left|\begin{pmatrix}1 \\ 0\end{pmatrix}\right| } = 0.83$$
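The two similarity values above can be checked with a few lines of Python. This is a sketch that assumes non-zero vectors given as plain tuples; the function names are chosen for this example.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)  # assumes neither vector is all zeros


query = (1, 0)
sim_doc1 = cosine_similarity((10, 5), query)
sim_doc2 = cosine_similarity((3, 2), query)
```

Because both vectors are normalized, the much longer document #1 scores only slightly higher than #2: what counts is the angle to the query, not the raw term counts.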