
Compressed Suffix Trees for Machine Translation

Matthias Petri The University of Melbourne joint work with Ehsan Shareghi, Gholamreza Haffari and Trevor Cohn

Machine Translation

Use computational resources to translate a sentence from a source language to a target language.

Resources

  • Parallel Text Corpus
    • Wikipedia headlines in multiple languages
    • European Parliament or UN transcripts
  • Large Text Corpus in the target language
  • Lots of data sets available for free http://www.statmt.org/wmt15/translation-task.html

Translation Process

Given a Source sentence, find a translation (the Target) that has the highest probability given the source:

$$P(\text{Target} | \text{Source}) = \frac{P(\text{Target}) \times P(\text{Source} | \text{Target})}{P(\text{Source})}$$
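Since the denominator \(P(\text{Source})\) is fixed for a given input sentence, decoding only needs to maximise the numerator over candidate targets:

$$ \hat{T} = \arg\max_{T} P(T | S) = \arg\max_{T} P(T) \times P(S | T) $$

The two factors are estimated separately, as described next.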

P(Source | Target)

Find probable candidates using the Parallel Corpus

Language A            | Language B           | Word Alignment
lo que [X,1] que los  | the [X,1] that the   | 0-0 1-2 3-2 4-3
lo que [X,1] que los  | this [X,1] that the  | 0-0 3-2 4-3
lo que [X,1] que los  | which [X,1] the      | 0-0 1-0 4-2

P(Target)

For each candidate sentence, use the text statistics of the monolingual corpus (a Language Model) to determine the "best" translation

q-gram Language Modeling

Assign a probability to a sequence of words \(w^n_1\) indicating how likely the sequence is, given a language:

$$ P(w^n_1) = \prod_{i=1}^n P(w_i|w_{i-q+1}^{i-1}) $$ where $$ w^n_1 = S[1..n] \text{ and } w^j_i = S[i..j] $$

Example (q = 4)

$$ P(\text{the old night keeper keeps the keep in the town}) = \\ P(\text{the}) \\ \times P(\text{old} | \text{the}) \\ \times P(\text{night} | \text{the old}) \\ \times P(\text{keeper} | \text{the old night}) \\ \times P(\text{keeps} | \text{old night keeper}) \\ \times P(\text{the} | \text{night keeper keeps}) \\ \times P(\text{keep} | \text{keeper keeps the}) \\ \times P(\text{in} | \text{keeps the keep}) \\ \times P(\text{the} | \text{the keep in}) \\ \times P(\text{town} | \text{keep in the}) $$
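As a toy illustration of the chain rule above, the sketch below estimates the probabilities by maximum likelihood from raw counts, using the example sentence itself as the corpus. The single-sentence corpus, the q = 4 setting and all identifiers are illustrative assumptions; a real system uses a large corpus and the smoothed Kneser-Ney estimates described next.

```cpp
// Toy maximum-likelihood q-gram model (no smoothing), matching the chain
// rule above. The single-sentence corpus and q = 4 are illustrative
// assumptions only; real systems use huge corpora and smoothed estimates.
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using ngram = std::vector<std::string>;

// Count every k-gram with k <= q in a tokenized corpus.
std::map<ngram, int> count_ngrams(const std::vector<std::string>& w, size_t q) {
    std::map<ngram, int> counts;
    for (size_t i = 0; i < w.size(); ++i)
        for (size_t k = 1; k <= q && i + k <= w.size(); ++k)
            ++counts[ngram(w.begin() + i, w.begin() + i + k)];
    return counts;
}

int main() {
    std::string corpus = "the old night keeper keeps the keep in the town";
    std::vector<std::string> w;
    std::istringstream iss(corpus);
    for (std::string t; iss >> t;) w.push_back(t);

    const size_t q = 4;
    auto counts = count_ngrams(w, q);

    // P(w_1^n) = prod_i P(w_i | w_{i-q+1}^{i-1}) with ML estimates
    // P(w_i | ctx) = C(ctx w_i) / C(ctx).
    double prob = 1.0;
    for (size_t i = 0; i < w.size(); ++i) {
        size_t start = (i >= q - 1) ? i - (q - 1) : 0;
        ngram ctx(w.begin() + start, w.begin() + i);
        ngram full(ctx);
        full.push_back(w[i]);
        double num = counts[full];
        double den = ctx.empty() ? double(w.size()) : double(counts[ctx]);
        prob *= num / den;
    }
    std::cout << "P(sentence) = " << prob << "\n";  // 0.1 for this toy corpus
}
```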

Kneser-Ney Language Modeling

Highest Level:

$$ P(w_i|w_{i-q+1}^{i-1})=\frac{\max(C(w_{i-q+1}^i)-D_q,0)}{C(w_{i-q+1}^{i-1})} + \frac{D_q N_{1+}(w_{i-q+1}^{i-1} \bullet )}{C(w_{i-q+1}^{i-1})} P(w_i|w_{i-q+2}^{i-1}) $$

Middle Level (\(1 < k < q\)):

$$ P(w_i|w_{i-k+1}^{i-1})=\frac{\max(N_{1+}(\bullet w_{i-k+1}^i)-D_k,0)}{N_{1+}(\bullet w_{i-k+1}^{i-1}\bullet)} + \frac{D_k N_{1+}(w_{i-k+1}^{i-1} \bullet )}{N_{1+}(\bullet w_{i-k+1}^{i-1}\bullet)} P(w_i|w_{i-k+2}^{i-1}) $$

Lowest Level:

$$ P(w_i) = \frac{N_{1+}(\bullet w_{i})}{N_{1+}(\bullet \bullet)} $$
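As a concrete instance of the lowest level, treat the example sentence above as the entire corpus (purely for illustration): it contains 9 distinct bigrams, and "the" is preceded by the two distinct word types "keeps" and "in", so

$$ P(\text{the}) = \frac{N_{1+}(\bullet \text{ the})}{N_{1+}(\bullet \bullet)} = \frac{2}{9} $$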

Terminology

  • \(C(w_{i}^j)\): standard count of \(S[i..j]\) in the corpus
  • \(N_{1+}(\bullet \alpha) = |\{w: C(w \alpha)>0\}|\) is the number of observed word types preceding the pattern \(\alpha = w_{i}^j\)
  • \(N_{1+}(w_{i}^j \bullet ) = |\{w: C(\alpha w)>0\}|\) is the number of observed word types following the pattern \(\alpha\)
  • \(N_{1+}(\bullet w_{i}^j \bullet)\): the number of unique contexts (left and right) in which \(w_{i}^j\) occurs
  • \(D_k\): discount parameter for recursion level \(k\), computed at index construction time
  • \(N_{1+}(\bullet \bullet)\): number of unique bigrams
  • \(N_{1}(\alpha \bullet) = |\{w: C(\alpha w)=1\}|\) is the number of observed word types following the pattern \(\alpha = w_{i}^j\) that occur exactly once after it
  • \(N_{2}(\alpha \bullet) = |\{w: C(\alpha w)=2\}|\) is the number of observed word types following the pattern \(\alpha\) that occur exactly twice after it
  • \(N_{3+}(\alpha \bullet ) = |\{w: C(\alpha w)\geq 3\}|\) is the number of observed word types following the pattern \(\alpha\) that occur at least three times after it
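For intuition, these quantities can be computed naively by scanning a tokenized corpus, as in the sketch below; the toy corpus and the pattern are illustrative assumptions. The point of the CST-based approach described later is to obtain the same counts without any scan or precomputed table.

```cpp
// Naive computation of the continuation counts for a pattern alpha by
// scanning a tokenized corpus. Purely illustrative: the toy corpus and
// pattern are assumptions, not part of the actual system.
#include <algorithm>
#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

int main() {
    std::string corpus = "the old night keeper keeps the keep in the town";
    std::vector<std::string> w;
    std::istringstream iss(corpus);
    for (std::string t; iss >> t;) w.push_back(t);

    std::vector<std::string> alpha = {"the"};  // the pattern w_i^j

    std::set<std::string> preceding, following;              // N_{1+}(. a), N_{1+}(a .)
    std::set<std::pair<std::string, std::string>> contexts;  // N_{1+}(. a .)
    std::map<std::string, int> follow_count;                 // N_1, N_2, N_{3+}

    for (size_t i = 0; i + alpha.size() <= w.size(); ++i) {
        if (!std::equal(alpha.begin(), alpha.end(), w.begin() + i)) continue;
        size_t j = i + alpha.size();             // position just after alpha
        if (i > 0) preceding.insert(w[i - 1]);
        if (j < w.size()) { following.insert(w[j]); ++follow_count[w[j]]; }
        if (i > 0 && j < w.size()) contexts.insert({w[i - 1], w[j]});
    }

    int n1 = 0, n2 = 0, n3p = 0;
    for (const auto& kv : follow_count) {
        if (kv.second == 1) ++n1;
        else if (kv.second == 2) ++n2;
        else ++n3p;
    }

    std::cout << "N1+(. a)   = " << preceding.size() << "\n"
              << "N1+(a .)   = " << following.size() << "\n"
              << "N1+(. a .) = " << contexts.size()  << "\n"
              << "N1(a .) = " << n1 << "  N2(a .) = " << n2
              << "  N3+(a .) = " << n3p << "\n";
}
```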

ARPA Files

ARPA-based Language Models

Instead of precomputing all probabilities and storing them in an ARPA file, can we compute them on the fly using Compressed Suffix Trees?

Advantages

  • No restriction on the recursion depth, which results in better probability estimates
  • Current approaches precompute and store probabilities, using space exponential in \(q\)
  • Construction of a CST is faster than precomputing all probabilities
  • Can operate on word and character alphabets (character-level models need a larger \(q\) to be useful)

Use two CSTs: one over the text and one over the reversed text
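A minimal construction sketch with sdsl-lite (the library behind the SDSL operations mentioned below); the byte-level toy text is an illustrative assumption, while the actual model is built over word identifiers, i.e. an integer alphabet:

```cpp
// Sketch: one CST over the text and one over the reversed text (sdsl-lite).
// Compile with: g++ -std=c++11 build_csts.cpp -lsdsl -ldivsufsort -ldivsufsort64
#include <sdsl/suffix_trees.hpp>
#include <string>

int main() {
    std::string text = "the old night keeper keeps the keep in the town";
    std::string rev(text.rbegin(), text.rend());

    sdsl::cst_sct3<> cst_fwd, cst_rev;
    sdsl::construct_im(cst_fwd, text, 1);  // answers C(w) and N_{1+}(w .)
    sdsl::construct_im(cst_rev, rev, 1);   // answers N_{1+}(. w): left extensions of w
                                           // are right extensions of reverse(w) here
}
```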

Kneser-Ney Language Modeling

Perform backward search and keep track of dependencies
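A minimal sketch of this step with sdsl-lite: backward search over the CSA underlying the CST yields the suffix-array range \([sp,ep]\) of the pattern, from which \(C\) and \(N_{1+}(w_{i}^j \bullet)\) follow. The toy text, the pattern and the type choices are illustrative assumptions:

```cpp
// Sketch: C(pattern) and N_{1+}(pattern .) computed on the fly from a CST.
// Uses sdsl-lite; compile with
//   g++ -std=c++11 lm_counts.cpp -lsdsl -ldivsufsort -ldivsufsort64
// The byte-level toy text and the pattern are illustrative assumptions;
// the actual model works over word identifiers (an integer alphabet).
#include <sdsl/suffix_trees.hpp>
#include <iostream>
#include <string>

int main() {
    sdsl::cst_sct3<> cst;  // CST over a wavelet-tree based CSA
    sdsl::construct_im(cst, "the old night keeper keeps the keep in the town", 1);

    std::string pattern = "the ";                // note the trailing space
    uint64_t sp = 0, ep = cst.csa.size() - 1;    // full suffix-array range

    // Backward search: extend the pattern symbol by symbol, right to left.
    for (auto it = pattern.rbegin(); it != pattern.rend(); ++it) {
        if (sdsl::backward_search(cst.csa, sp, ep, *it, sp, ep) == 0) {
            std::cout << "pattern does not occur\n";
            return 0;
        }
    }
    std::cout << "C(pattern)     = " << ep - sp + 1 << "\n";

    // N_{1+}(pattern .): if the search ends in the middle of an edge there
    // is exactly one continuation; otherwise count the children of the node.
    auto v = cst.node(sp, ep);
    uint64_t n1p = (cst.depth(v) > pattern.size()) ? 1 : cst.degree(v);
    std::cout << "N1+(pattern .) = " << n1p << "\n";
}
```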

Computing \(N_{1+}(\bullet w_{i}^j \bullet)\)

Naive Approach (expensive!):

  • Find the set \(S\) of all symbols preceding \(w_{i}^j\)
  • For each \(\alpha \in S\), determine \(N_{1+}(\alpha w_{i}^j \bullet)\)
  • Sum over all \(\alpha\)

Instead, only one wavelet-tree-based CST over the text is needed:

  • \(N_{1+}(w_{i}^j \bullet)\) is computed as before using the CST
  • \(N_{1+}( \bullet w_{i}^j)\): for the range \([sp,ep]\) of the pattern, use the wavelet tree to visit all leaves in \(BWT[sp,ep]\) (the interval_symbols operation in SDSL)
  • \(N_{1+}( \bullet w_{i}^j \bullet )\): visiting all leaves in \(BWT[sp,ep]\) implicitly computes all Weiner links of the CST node corresponding to \(w_{i}^j\), since the \([sp,ep]\) ranges of all \(\bullet w_{i}^j\) are produced during the wavelet tree traversal
  • Determine the number of children of the resulting nodes to obtain \(N_{1+}( \bullet w_{i}^j \bullet )\) (see the sketch after this list)
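A sketch of this computation with sdsl-lite, replacing the naive enumeration above: a single interval_symbols call over \(BWT[sp,ep]\) reports every distinct preceding symbol together with the ranks needed to form the Weiner-link ranges. The toy text, the pattern and the exact type choices are illustrative assumptions, not necessarily how the authors implement it:

```cpp
// Sketch: N_{1+}(. pattern) and N_{1+}(. pattern .) from one wavelet-tree
// traversal of BWT[sp..ep] (sdsl-lite). The toy text, the pattern and the
// type choices are illustrative assumptions.
#include <sdsl/suffix_trees.hpp>
#include <iostream>
#include <string>
#include <vector>

using cst_type = sdsl::cst_sct3<>;
using wt_type  = cst_type::csa_type::wavelet_tree_type;

// Counts for a pattern of length m with suffix-array range [sp, ep].
void continuation_counts(const cst_type& cst, uint64_t sp, uint64_t ep, uint64_t m) {
    const auto& csa = cst.csa;
    wt_type::size_type k = 0;                             // distinct BWT symbols
    std::vector<wt_type::value_type> cs(ep - sp + 1);     // the symbols themselves
    std::vector<wt_type::size_type> rank_lo(ep - sp + 1), rank_hi(ep - sp + 1);

    // interval_symbols reports every distinct symbol in BWT[sp..ep] together
    // with its rank at both borders of the interval.
    csa.wavelet_tree.interval_symbols(sp, ep + 1, k, cs, rank_lo, rank_hi);
    std::cout << "N1+(. pattern)   = " << k << "\n";

    // The same traversal implicitly yields all Weiner links: for a preceding
    // symbol c, the range of c.pattern is [C[c]+rank(c,sp), C[c]+rank(c,ep+1)-1].
    uint64_t n1p_both = 0;
    for (wt_type::size_type i = 0; i < k; ++i) {
        uint64_t base = csa.C[csa.char2comp[cs[i]]];
        auto v = cst.node(base + rank_lo[i], base + rank_hi[i] - 1);
        // mid-edge: only one right extension; otherwise count the children
        n1p_both += (cst.depth(v) > m + 1) ? 1 : cst.degree(v);
    }
    std::cout << "N1+(. pattern .) = " << n1p_both << "\n";
}

int main() {
    cst_type cst;
    sdsl::construct_im(cst, "the old night keeper keeps the keep in the town", 1);

    // [sp, ep] of the pattern, obtained by backward search as before
    std::string pattern = "e";
    uint64_t sp = 0, ep = cst.csa.size() - 1;
    for (auto it = pattern.rbegin(); it != pattern.rend(); ++it)
        sdsl::backward_search(cst.csa, sp, ep, *it, sp, ep);

    continuation_counts(cst, sp, ep, pattern.size());
}
```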

Other Considerations

  • Special case when the pattern search ends in the middle of an edge in the CST
  • Special handling of sentence start and end tags, which would otherwise distort the counts
  • Correctness is ensured by comparing against state-of-the-art systems (KenLM and SRILM)

Construction and Query Time

Query Time Breakdown

Future Work

  • Precomputing some of the counts is very cheap and speeds up query processing significantly
  • Alphabet partitioning for larger alphabets (requires interval_symbols support)
  • Already competitive with state-of-the-art implementations
  • Backward search is now the main cost factor
  • Many open problems in the MT field to which succinct structures could be applied
  • Easy to get started, as test collections and software are freely available