Author: adnarel · Class: dsscapstone-002 (Johns Hopkins Data Science Specialization)
The picture shows an example of a text prediction app. The suggested next words are "degree", "specialization", and "skills". The top predicted word is "specialization". How can we use our Data Science skills to create a text prediction app?
We are provided with large files of text from blog, Twitter, and news sources.
We draw a sample from each of these three sources totalling about 1,000,000 words, with approximately one third coming from each source.
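A minimal sketch of the sampling step: shuffle each file's lines and keep enough of them to reach roughly one third of the word budget. The file paths and the `sample_to_words` helper are assumptions for illustration, not the app's actual script.

```r
set.seed(123)  # make the random sample reproducible

# Assumed helper: shuffle a file's lines, keep lines until the running
# word count reaches `target` (about one third of 1,000,000 words).
sample_to_words <- function(path, target = 333000) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines <- sample(lines)                          # shuffle the lines
  wc <- cumsum(lengths(strsplit(lines, "\\s+")))  # running word count
  lines[seq_len(which(wc >= target)[1])]          # keep ~target words
}

# File names below are placeholders for the three provided source files:
# texts <- unlist(lapply(c("blogs.txt", "twitter.txt", "news.txt"),
#                        sample_to_words))
```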
We use specialized text-mining libraries in R (tm and RWeka) to clean and summarize the text data and to build a Corpus: a collection of documents in a format ready for analysis.
From the Corpus we build a "Document Term Matrix": a table recording the phrases (n-grams) found in the Corpus and how often each occurs. This matrix is the cornerstone of our predictive text model.
We are concerned with fast execution, so we use this algorithm for text prediction:
Read the user input, then clean and validate it (it must contain two or more words).
Extract the last two words of the user input.
Search the Document Term Matrix for the most frequent 3-word phrase that begins with these two words.
The third word of the most frequent 3-word phrase is the predicted word.
If there is no matching phrase, select a random word from among a list of very common words.
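The four steps above can be sketched as a single lookup function. Here `freqs` is a named vector of trigram counts as built from the Document Term Matrix; the specific fallback word list is an assumption.

```r
# Sketch of the prediction step; `fallback` is an assumed list of very
# common words used when no matching trigram is found.
predict_next <- function(input, freqs,
                         fallback = c("the", "and", "of", "to")) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  if (length(words) < 2) return(NA_character_)   # validate: need 2+ words
  bigram <- paste(tail(words, 2), collapse = " ")  # last two words
  # Trigrams whose first two words match the user's last two words
  hits <- freqs[startsWith(names(freqs), paste0(bigram, " "))]
  if (length(hits) == 0) return(sample(fallback, 1))
  best <- names(hits)[which.max(hits)]           # most frequent match
  tail(strsplit(best, " ")[[1]], 1)              # its third word
}

# Toy frequency table (counts are made up for illustration)
freqs <- c("data science specialization" = 5,
           "data science skills" = 3,
           "science skills pay" = 2)
predict_next("I love Data Science", freqs)  # -> "specialization"
```

Because the lookup is a single vectorized prefix match over a precomputed table, prediction stays fast even for a large trigram vocabulary.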
Note: a GitHub repository contains the R scripts and data used for the app.