On GitHub: m09/readability
Newbold, 2013: [Readability is the] ease of reading words and sentences.
Hargis et al., 1998: [Readability is] the quality of a written or printed communication that makes it easy for any given class of persons to understand its meaning, or that induces them to continue reading.
English and English, 1958: [Readability is] the degree to which a given class of people find certain reading matter compelling and comprehensible.
McLaughlin, 1969 (discriminates classes of persons): [Readability is] the sum total (including all the interactions) of all those elements within a given piece of printed material that affect the success a group of readers have with it. The success is the extent to which they understand it, read it at optimal speed, and find it interesting.
Dale and Chall, 1949 (the most complete): Readability is:
Still used today in schools or as a bootstrap for other methods.
Find a way to measure readability by combining different easy-to-compute features.
Most frequent features:
There are lots of formulas. We will only go through the major ones.
(Thorndike, 1921) compiled a list of the most common English words by counting their occurrences in everyday texts.
(Bertha A. Lively and Sidney L. Pressey, 1923) used Thorndike's list to rate the readability of a book:
Applying it to a single book took about 3 hours.
(Flesch, 1948): $206.835 - 1.015 \left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6 \left(\frac{\text{total syllables}}{\text{total words}}\right)$
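The Flesch formula can be sketched in a few lines of Python. The regex-based sentence splitting and vowel-group syllable counting below are rough heuristics of my own (the original procedure counted syllables by hand), so treat the helper as an assumption, not part of Flesch's method:

```python
import re

def count_syllables(word):
    # Rough heuristic: one syllable per run of consecutive vowels.
    # Not Flesch's manual count -- just a cheap approximation.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # Flesch (1948): 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))
```

Higher scores mean easier text: roughly 90–100 for very easy prose, down toward 0–30 for very difficult academic writing.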
(Dale and Chall, 1949): $0.1579 \left(\frac{\text{difficult words}}{\text{words}} \times100 \right) + 0.0496 \left(\frac{\text{words}}{\text{sentences}}\right)$ Where difficult words are words not present in a list of 763 easy words (updated to about 3,000 words in 1995).
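A minimal sketch of the Dale–Chall formula. The tiny `EASY_WORDS` set is a stand-in assumption; in practice you need the real easy-word list (763 words originally, ~3,000 in the 1995 revision):

```python
import re

# Stand-in for the real Dale-Chall easy-word list (an assumption for
# illustration only -- the actual list is required in practice).
EASY_WORDS = {"the", "a", "and", "cat", "sat", "on", "mat"}

def dale_chall(text, easy_words=EASY_WORDS):
    # Dale and Chall (1949):
    # 0.1579 * (% difficult words) + 0.0496 * (words/sentences)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    difficult = sum(1 for w in words if w not in easy_words)
    return (0.1579 * (difficult / len(words) * 100)
            + 0.0496 * len(words) / len(sentences))
```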
(Gunning, 1952): $0.4 \left[\left(\frac{\text{words}}{\text{sentences}}\right) + 100 \left(\frac{\text{complex words}}{\text{words}}\right)\right]$ Where complex words are words of 3+ syllables.
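The fog index is equally mechanical; as before, the vowel-group syllable counter is a rough heuristic of my own, not part of Gunning's procedure:

```python
import re

def count_syllables(word):
    # Rough heuristic: one syllable per run of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    # Gunning (1952): 0.4 * [(words/sentences) + 100*(complex words/words)]
    # where complex words have 3+ syllables.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    return 0.4 * (len(words) / len(sentences)
                  + 100 * complex_words / len(words))
```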
(McLaughlin, 1969) aims at computing the grade required to read a text: $1.0430\sqrt{\text{complex words} \times \frac{30}{\text{number of sentences}}} + 3.1291$
(Kincaid et al., 1975) also aims at computing the grade required to read a text: $0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59$
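The two grade-level formulas above (SMOG and Flesch–Kincaid) can be sketched together; again, the syllable counter is a crude approximation I am assuming, not the one used in the original studies:

```python
import math
import re

def count_syllables(word):
    # Rough heuristic: one syllable per run of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def smog_grade(text):
    # McLaughlin (1969): 1.0430*sqrt(complex words * 30/sentences) + 3.1291
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(complex_words * 30 / len(sentences)) + 3.1291

def flesch_kincaid_grade(text):
    # Kincaid et al. (1975):
    # 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Both output a US school grade rather than an abstract ease score, which is why they are still the defaults in many word processors.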
A word list can discriminate between easy and difficult vocabulary.
Word lists are subject to cultural and personal differences:
The usual assumption: the longer the word, the harder.
Affixes may actually help readers to understand a word, while increasing its length. For example: de-construct-ing, anti-conform-ist
As with word complexity, the usual assumption for sentences is: the longer, the harder.
Consider the following sentences:
The mouse ate the cheese, and then the rat ate the mouse, and after that, the cat ate the rat and died.
The cat that ate the rat that ate the mouse that ate the cheese died.
Consider the following sentences:
It was late at night, but it was clear. The stars were out and the moon was bright.
It was late at night. It was clear. The stars were out. The moon was bright.
Collins-Thompson and Callan, 2004:
train a unigram language model for each of the 12 target grades
assign the grade whose model is most likely to have generated the input text
Outperforms all the formulas.
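The two steps above can be sketched as follows. This is a simplified sketch, not the paper's exact setup: I assume add-one smoothing and whitespace tokenization, whereas Collins-Thompson and Callan use a more careful smoothing scheme:

```python
import math
from collections import Counter

def train_unigram(texts, alpha=1.0):
    # Build an add-alpha smoothed unigram log-probability function
    # from the training texts of one grade level.
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen words
    return lambda w: math.log((counts[w] + alpha) / (total + alpha * vocab))

def predict_grade(text, models):
    # Assign the grade whose model gives the input the highest
    # log-likelihood (sum of per-word log-probabilities).
    words = text.lower().split()
    return max(models, key=lambda g: sum(models[g](w) for w in words))
```

Usage: build one model per grade, e.g. `models = {1: train_unigram(grade1_texts), ..., 12: train_unigram(grade12_texts)}`, then call `predict_grade(new_text, models)`.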
Schwarm and Ostendorf, 2005:
The combination improves on the trigram model.
Pitler and Nenkova, 2008:
→ shows that discourse relations outperform average sentence and word lengths as predictors of readability. But discourse relations are not yet easy to compute automatically.
Tanaka-Ishii, Tezuka and Terada, 2010:
→ propose a method to build difficulty scores with as few resources as possible, leveraging (2).
We have looked into: