Readability assessment – What readability is – The factors of readability



readability

Automatic readability improvements of texts using Simple English Wikipedia

On GitHub: m09/readability

Readability assessment

For Aizawa Lab by Hugo Mougard on March 26th

Overview

  • Definition of readability
  • Factors of readability
  • Readability formulas
  • Issues with formulas
  • Machine learning and readability
  • Focus on an interesting paper
  • Conclusion

What readability is

(Newbold, 2013)

Definition 1

[Readability is the] ease of reading words and sentences.

Hargis et al., 1998
  • most basic definition
  • only addresses the mechanical act

Definition 2

[Readability is] the quality of a written or printed communication that makes it easy for any given class of persons to understand its meaning, or that induces them to continue reading.

English and English, 1958
  • incorporates understanding
  • doesn't discriminate classes of persons

Definition 3

[Readability is] the degree to which a given class of people find certain reading matter compelling and comprehensible.

McLaughlin, 1969
  • discriminates classes of persons

Definition 4

[Readability is] the sum total (including all the interactions) of all those elements within a given piece of printed material that affect the success a group of readers have with it. The success is the extent to which they understand it, read it at optimal speed, and find it interesting.

Dale and Chall, 1949
  • most complete

Key ideas

Readability is:

  • people-dependent
  • about the text itself, not the layout (legibility)
  • about understanding, not only recognizing words

The factors of readability

Readability factors

Reproduced from Oakland and Lane, 2004 through Newbold, 2013

Readability formulas

Historical approach to readability

  • easy to compute
  • some of them do not require a computer
  • easy to use once computed

Still used today in schools or as bootstrap for other methods.

Idea behind the formulas

Find a way to measure readability by combining different easy-to-compute features.

Most frequent features:

  • average sentence length
  • average word length
  • belonging to a word list
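These features are cheap to compute from raw text; a minimal Python sketch (the regex tokenization and the tiny `easy_words` set are illustrative assumptions):

```python
import re

def features(text, easy_words):
    """Extract the three most frequent readability features from raw text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "easy_word_ratio": sum(w.lower() in easy_words for w in words) / len(words),
    }

feats = features("The cat sat. The cat slept.", easy_words={"the", "cat"})
print(feats["avg_sentence_length"])  # 3.0
```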

Quantity of formulas

There are lots of formulas. We will only go through the major ones.

Thorndike's 10k simple words list

(Thorndike, 1921) compiled the 10,000 most common English words by counting their occurrences in everyday texts.

First reading formula

(Bertha A. Lively and Sidney L. Pressey, 1923) used Thorndike's list to rate the readability of a book:

  • by counting the number of its words that are absent from the list
  • by calculating the median index in the list of the words that appear in it

It took 3h to apply to a book.

First popular formula: Flesch Reading Ease

(Flesch, 1948): $206.835 - 1.015 \left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6 \left(\frac{\text{total syllables}}{\text{total words}}\right)$
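The formula transcribes directly into code; the syllable counter below is a rough vowel-group heuristic, an assumption standing in for the dictionary lookup the original expects:

```python
import re

def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Higher scores mean easier texts (90+ very easy, below 30 very hard)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))

print(round(flesch_reading_ease("The cat sat on the mat."), 3))  # 116.145
```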

Dale–Chall readability formula

(Dale and Chall, 1949): $0.1579 \left(\frac{\text{difficult words}}{\text{words}} \times100 \right) + 0.0496 \left(\frac{\text{words}}{\text{sentences}}\right)$ Where difficult words are words not present in a list of 763 words (updated to 3k words in 1995).
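A sketch of the formula; the `EASY_WORDS` set here is a tiny illustrative stand-in for the real list of 3,000 easy words:

```python
import re

# Tiny illustrative stand-in: the real list has 3,000 words deemed easy.
EASY_WORDS = {"the", "cat", "ate", "a", "mouse"}

def dale_chall(text, easy_words=EASY_WORDS):
    """Higher scores mean harder texts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    difficult = sum(w not in easy_words for w in words)
    return (0.1579 * (difficult / len(words) * 100)
            + 0.0496 * (len(words) / len(sentences)))
```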

Gunning fog index

(Gunning, 1952): $0.4 \left[\left(\frac{\text{words}}{\text{sentences}}\right) + 100 \left(\frac{\text{complex words}}{\text{words}}\right)\right]$ Where complex words are words of 3+ syllables.

SMOG formula

(McLaughlin, 1969) aims at computing the grade required to read a text: $1.0430\sqrt{\text{complex words} \times \frac{30}{\text{number of sentences}}} + 3.1291$

Flesch–Kincaid Grade Level

(Kincaid et al., 1975) also aims at computing the grade required to read a text: $0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59$
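The three grade-oriented formulas above (Gunning fog, SMOG, Flesch–Kincaid) share the same raw counts and can be sketched together; the vowel-group syllable counter is an assumption, not what the original procedures prescribe:

```python
import re

def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def grade_levels(text):
    """Apply the three grade-oriented formulas to the same raw counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_w, n_s = len(words), len(sentences)
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(count_syllables(w) >= 3 for w in words)
    return {
        "gunning_fog": 0.4 * (n_w / n_s + 100 * complex_words / n_w),
        "smog": 1.0430 * (complex_words * 30 / n_s) ** 0.5 + 3.1291,
        "flesch_kincaid": 0.39 * n_w / n_s + 11.8 * syllables / n_w - 15.59,
    }
```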

Issues with formulas

(Bailin and Grafstein, 2001)

Word lists: assumption

A word list can discriminate between easy and difficult vocabulary.

Word lists: issues

Word lists are subject to cultural and personal differences:

  • age
  • wealth
  • ethnic origin
  • religion
  • political views

Word complexity: assumption

The longer the harder.

Word complexity: issues

Affixes may actually help readers to understand a word, while increasing its length. For example: de-construct-ing, anti-conform-ist

Syntactic complexity: assumption

As for word complexity: the longer the harder.

Syntactic complexity: issues (1/2)

Consider the following sentences:

  • The mouse ate the cheese, and then the rat ate the mouse, and after that, the cat ate the rat and died.
  • The cat that ate the rat that ate the mouse that ate the cheese died.

The second is much shorter, yet much harder to parse because of its embedded clauses.

Syntactic complexity: issues (2/2)

Consider the following passages:

  • It was late at night, but it was clear. The stars were out and the moon was bright.
  • It was late at night. It was clear. The stars were out. The moon was bright.

Shorter sentences are not necessarily easier or more natural to read.

Machine Learning

The tasks

Classical task → corresponding machine learning task:

  • score a text → regression
  • sort texts on readability → regression, or classification on pairs of documents
  • assign a required grade to a text → classification with grades as labels
  • regroup texts of similar readability → clustering

A first approach: unigram models

Collins-Thompson and Callan, 2004:

  • train a unigram model for each of the 12 target grades
  • assign the grade whose model is the most likely to have generated the input text

Outperforms all the formulas.
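A minimal sketch of this idea with add-one smoothing; the two toy "grades" stand in for the paper's 12, and the smoothing choice is an assumption (the paper's smoothing differs):

```python
import math
from collections import Counter

def assign_grade(text, grade_corpora):
    """Pick the grade whose add-one-smoothed unigram model gives the
    input text the highest log-likelihood."""
    vocab = {w for docs in grade_corpora.values() for d in docs for w in d.split()}
    best_grade, best_ll = None, -math.inf
    for grade, docs in grade_corpora.items():
        counts = Counter(w for d in docs for w in d.split())
        total = sum(counts.values())
        ll = sum(math.log((counts[w] + 1) / (total + len(vocab)))
                 for w in text.split())
        if ll > best_ll:
            best_grade, best_ll = grade, ll
    return best_grade

# Two toy "grades" standing in for the paper's 12
corpora = {
    1: ["the cat sat", "a dog ran"],
    12: ["epistemology concerns justified belief", "ontology concerns being"],
}
print(assign_grade("the cat ran", corpora))  # 1
```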

Language model approach refined

Schwarm and Ostendorf, 2005:

  • bigram and trigram models, which improve on unigrams
  • combination of LM and text features, using perplexity as a feature alongside others (FKRE, OOV rate, average parse tree height, …)

The combination improves on the trigram model.

Complex features

Pitler and Nenkova, 2008:

  • unigram model
  • lexical cohesion (cosine similarity averaged over all sentences)
  • syntactic features (as Schwarm and Ostendorf)
  • entity coherence (analyse the subjects / objects of consecutive sentences)
  • language model over discourse relations

→ shows that discourse relations outperform average sentence and word lengths as predictors. But discourse relations are not yet easy to compute automatically.
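The lexical cohesion feature can be sketched as the average cosine similarity between bag-of-words vectors of sentences; pairing consecutive sentences is a simplifying assumption here, not necessarily the paper's exact pairing:

```python
import math
import re
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def lexical_cohesion(text):
    """Average cosine similarity between bag-of-words vectors of
    consecutive sentences."""
    sents = [Counter(re.findall(r"[a-z']+", s.lower()))
             for s in re.split(r"[.!?]+", text) if s.strip()]
    if len(sents) < 2:
        return 0.0
    sims = [cosine(sents[i], sents[i + 1]) for i in range(len(sents) - 1)]
    return sum(sims) / len(sims)
```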

Sorting texts by readability

Tanaka-Ishii, Tezuka and Terada, 2010:

  • word counts in the document and in the corpus as features
  • corpus is easy to create compared to other ML approaches
  • results are very good when evaluated against regression on the sorting task

Focus on an interesting paper

Domain-Specific Iterative Readability Computation

by Jin Zhao and Min-Yen Kan (JCDL'10)

Motivations

  1. computing the difficulty of domain-specific terms is expensive
  2. terms of similar difficulty are often found together in documents

→ propose a method that builds difficulty scores with as few resources as possible, leveraging (2).

Idea

  • obtain a list of domain specific terms (the only resource required)
  • construct a bi-partite graph to model the relations between documents and domain-specific terms
  • initialize difficulties with a readability formula
  • iteratively update the difficulties of documents and concepts
  • stop when the delta of the difficulties between two steps is small
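The update loop can be sketched as alternating averaging over the bipartite graph; the exact update rule, the toy initialization, and the stopping test below are simplifying assumptions, not the paper's formulation:

```python
def propagate_difficulties(doc_terms, init_doc_difficulty, tol=1e-6):
    """Alternately average difficulties across the document-term bipartite
    graph until the largest per-document update falls below tol."""
    term_docs = {}
    for d, ts in doc_terms.items():
        for t in ts:
            term_docs.setdefault(t, []).append(d)
    doc_diff = dict(init_doc_difficulty)
    while True:
        # a term is as difficult as the average document containing it
        term_diff = {t: sum(doc_diff[d] for d in ds) / len(ds)
                     for t, ds in term_docs.items()}
        # a document is as difficult as the average of its terms
        new_doc = {d: sum(term_diff[t] for t in ts) / len(ts)
                   for d, ts in doc_terms.items()}
        delta = max(abs(new_doc[d] - doc_diff[d]) for d in doc_terms)
        doc_diff = new_doc
        if delta < tol:
            return doc_diff, term_diff

docs, terms = propagate_difficulties(
    {"d1": {"a", "b"}, "d2": {"b", "c"}},  # toy graph
    {"d1": 1.0, "d2": 3.0},                # formula-based initialization
)
```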

Graph construction

Conclusion

We have looked into:

  • how to define readability (people-dependent, understanding, focus on text)
  • the factors of readability (people-based and text-based)
  • classic formulas and their issues
  • new ML approaches
  • an interesting way to infer domain-specific difficulty

Thank you very much for your attention! 😊

Do you have any questions?
