Data Science Specialization Capstone Project






Slidify Presentation for Johns Hopkins University Data Science Capstone Project

On GitHub: outrigger/DSSCapstoneDeck


The goal of this project is to develop a predictive keyboard application that takes a phrase as input and predicts the next word.

Data Source

  • The training dataset comes from a corpus called HC Corpora. It consists of three files containing unstructured text from blogs, news articles, and tweets.

  • This dataset was first sampled and processed to generate the required n-grams.

  • The n-grams were then used to build a prediction engine for the app.

  • Summary of workflow: sample the corpus, clean and tokenize the text, build n-gram frequency tables, and feed those tables to the prediction engine (sketched below).
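As a rough illustration of this workflow, the R sketch below samples the corpus and counts bigrams using base R only. The file names follow the HC Corpora layout, and the 10% sample rate and bigram order are illustrative assumptions, not the app's actual settings.

    set.seed(42)

    # Pool the three HC Corpora files into one character vector.
    lines <- c(readLines("en_US.blogs.txt",   skipNul = TRUE),
               readLines("en_US.news.txt",    skipNul = TRUE),
               readLines("en_US.twitter.txt", skipNul = TRUE))

    # Sample a manageable subset (10% is an arbitrary illustrative rate).
    sampled <- sample(lines, length(lines) %/% 10)

    # Basic cleaning: lower-case, keep only letters, apostrophes, and spaces.
    clean <- gsub("[^a-z' ]+", " ", tolower(sampled))

    # Tokenize and count bigrams; higher-order n-grams work the same way.
    tokens  <- lapply(strsplit(clean, "\\s+"), function(w) w[w != ""])
    bigrams <- unlist(lapply(tokens, function(w) {
      if (length(w) < 2) return(character(0))
      paste(head(w, -1), tail(w, -1))
    }))
    bigram_counts <- sort(table(bigrams), decreasing = TRUE)
    head(bigram_counts)

The resulting frequency tables, one per n-gram order, are what the prediction engine queries at run time.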

App

  • Hosted at this link
  • Wait for the app to load. Once it has, "[1] NA" (R's console output for a prediction made before any input) will appear as the next predicted word.
  • Enter a phrase in the text box provided and the next predicted word will be generated.

Algorithm

  • Katz back-off model
    • Katz back-off is a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram.
    • It accomplishes this estimation by "backing off" to models with smaller histories under certain conditions.
    • By backing off, the highest-order model with reliable counts for the given history is the one actually used, which yields better estimates (see the formula below).
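For reference, the standard Katz back-off estimate can be written as follows (this is the textbook formulation; the notation is not taken from the app's own documentation):

$$
P_{katz}(w_i \mid w_{i-n+1} \dots w_{i-1}) =
\begin{cases}
d_{w_{i-n+1} \dots w_i} \, \dfrac{C(w_{i-n+1} \dots w_i)}{C(w_{i-n+1} \dots w_{i-1})} & \text{if } C(w_{i-n+1} \dots w_i) > k \\[6pt]
\alpha_{w_{i-n+1} \dots w_{i-1}} \, P_{katz}(w_i \mid w_{i-n+2} \dots w_{i-1}) & \text{otherwise}
\end{cases}
$$

Here C(·) is a count from the training corpus, k is a count threshold (often 0), d is a discount factor (typically from Good-Turing estimation), and α is the back-off weight that redistributes the discounted probability mass to the lower-order model.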

Implementation

  • Example:
    • Suppose a 4-gram model is used to predict the next word, so each word is conditioned on the preceding three words.
    • We have the phrase "a case of beer for a" followed by "dollar", so the relevant 4-gram context is "beer for a".
    • Say "dollar" never occurred after "beer for a" in the training data; the 4-gram model then assigns "dollar" a probability of 0.
    • When we encounter a word with probability 0, we back off to the (n - 1)-gram level and recalculate the probability there.
    • In this case, we use a 3-gram model to estimate the probability of "dollar" given the two-word context "for a" (see the sketch below).
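A minimal R sketch of this back-off lookup is below. It uses a simplified back-off (return the first count found at the highest available order, as in "stupid back-off") rather than full Katz discounting, and the toy count tables are invented for the example.

    # Toy n-gram tables keyed by "context|word"; in the real app these come
    # from the corpus-processing step. All counts here are invented.
    fourgram <- c("beer for a|few" = 4)                 # no "beer for a|dollar"
    trigram  <- c("for a|dollar" = 3, "for a|few" = 7)
    bigram   <- c("a|dollar" = 5, "a|lot" = 20)
    tables   <- list(fourgram, trigram, bigram)

    # Try the longest context first (3 words for a 4-gram model); on a miss,
    # drop the oldest context word and fall back to the next-lower order.
    backoff_count <- function(context_words, word) {
      for (i in seq_along(tables)) {
        ctx <- paste(tail(context_words, 4 - i), collapse = " ")
        hit <- tables[[i]][paste(ctx, word, sep = "|")]
        if (!is.na(hit)) return(list(order = 5 - i, count = unname(hit)))
      }
      NULL  # unseen at every order
    }

    context <- strsplit("a case of beer for a", " ")[[1]]
    backoff_count(context, "dollar")
    # Misses in the 4-gram table ("beer for a|dollar" is unseen), then
    # matches the 3-gram entry "for a|dollar", exactly as described above.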