Content of this talk
- Web Scraping using Selenium
- Guided tour of some pandas/matplotlib features through a data analysis of the Filmfare Best Movies on IMDb (Internet Movie Database)
GitHub:
https://github.com/min2bro/WebScraping/
IPython Notebook
- Write, edit, and replay Python scripts
- Interactive data visualization and report presentation
- Notebooks can be saved and shared
- Run Selenium Python scripts
Pandas
- Python Data Analysis Library
Matplotlib
- Plotting library for Python
Best Movies
Filmfare Awards 1955-2015
Open IMDB Movie List Page
Data extraction from Web using Selenium
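A minimal sketch of the extraction step, assuming Selenium 4 with Firefox and its driver installed. The function names, the URL, and the CSS selector are hypothetical placeholders; the real selectors depend on the markup of the IMDb list page.

```python
def element_texts(elements):
    """Return the stripped visible text of each element-like object."""
    return [el.text.strip() for el in elements]


def scrape_column(url, css_selector):
    """Open `url` in a browser and return the text of all matching elements."""
    # Imported here so element_texts() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()  # assumption: Firefox and geckodriver are available
    try:
        driver.get(url)
        return element_texts(driver.find_elements(By.CSS_SELECTOR, css_selector))
    finally:
        driver.quit()  # always close the browser, even if the lookup fails


# Example call (hypothetical URL and selector, not taken from the talk):
# titles = scrape_column("https://example.com/movie-list", "h3 a")
```

Calling scrape_column once per field (titles, votes, directors) would give one list per field, as in the data format that follows.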
Data format
['Bajirao Mastani', 'Queen', 'Bhaag Milkha Bhaag', 'Barfi!', 'Zindagi Na Milegi Dobara']
['17,362', '39,518', '39,731', '52,308', '41,731']
['Director: Sanjay Leela Bhansali', 'Director: Vikas Bahl', 'Director: Anurag Basu']
Store Data in a Python Dictionary
Data in Dictionary
{
"Director": "Director: Sanjay Leela Bhansali",
"Votes": "17,362",
"RunTime": "A historical ... (158 mins.)",
"Year": 2015,
"Genre": "Drama",
"Movie Name": "Bajirao Mastani",
"Rating": "7.2"
}
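One possible way to turn the parallel lists into per-movie dictionaries is to zip them together. This is a sketch, not necessarily the approach used in the notebook; build_records is a hypothetical helper and only the first two movies are used.

```python
names = ['Bajirao Mastani', 'Queen']
votes = ['17,362', '39,518']
directors = ['Director: Sanjay Leela Bhansali', 'Director: Vikas Bahl']


def build_records(names, votes, directors):
    """Combine parallel lists into one dictionary per movie."""
    keys = ("Movie Name", "Votes", "Director")
    # zip() stops at the shortest list, so a missing value in one list
    # silently drops or shifts rows; check the list lengths first.
    return [dict(zip(keys, row)) for row in zip(names, votes, directors)]


records = build_records(names, votes, directors)
print(records[0])
```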
- Remove the commas (,) from the Votes value and change its data type to int
- Change the data type of Rating to float and of RunTime to int
- Remove the description from RunTime, keeping only the minutes
- Use null for missing values
{
"Director": "Sanjay Leela Bhansali",
"Votes": 17362,
"RunTime": 158,
"Year": 2015,
"Genre": "Drama",
"Movie Name": "Bajirao Mastani",
"Rating": 7.2
}
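The cleaning steps above can be sketched as a single function. clean_record is a hypothetical name, and the regular expression assumes the run time always appears as "(N mins.)" as in the sample record.

```python
import re


def clean_record(raw):
    """Return a cleaned copy of one scraped movie record."""
    cleaned = dict(raw)

    # "17,362" -> 17362; None when the value is missing
    votes = raw.get("Votes")
    cleaned["Votes"] = int(votes.replace(",", "")) if votes else None

    # "7.2" -> 7.2
    rating = raw.get("Rating")
    cleaned["Rating"] = float(rating) if rating else None

    # Keep only the minutes: "A historical ... (158 mins.)" -> 158
    match = re.search(r"\((\d+)\s*mins", raw.get("RunTime") or "")
    cleaned["RunTime"] = int(match.group(1)) if match else None

    # Drop the "Director: " label
    director = raw.get("Director")
    cleaned["Director"] = director.replace("Director: ", "", 1) if director else None
    return cleaned


raw = {
    "Director": "Director: Sanjay Leela Bhansali",
    "Votes": "17,362",
    "RunTime": "A historical ... (158 mins.)",
    "Year": 2015,
    "Genre": "Drama",
    "Movie Name": "Bajirao Mastani",
    "Rating": "7.2",
}
print(clean_record(raw))
```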
Records with missing values
Replace Null Values with Mean
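A small pandas sketch of both steps: listing the records with missing values, then filling the numeric gaps with the column mean. The rows are made-up placeholders, not real IMDb figures.

```python
import pandas as pd

# Placeholder data for illustration only
df = pd.DataFrame({
    "Movie Name": ["Movie A", "Movie B", "Movie C"],
    "Rating": [7.0, None, 8.0],
    "RunTime": [150.0, 130.0, None],
})

# Records with at least one missing value
print(df[df.isnull().any(axis=1)])

# Replace nulls in the numeric columns with each column's mean
numeric_cols = ["Rating", "RunTime"]
filled = df.copy()
filled[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
print(filled)
```

Filling with the mean keeps every record in the analysis, at the cost of pulling the filled values toward the average.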
Movies with Highest Ratings
Top five movies since 1955
Best Movies from the last 60 years (1955-2015)
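Selecting the highest-rated movies can be sketched with DataFrame.nlargest. The names and ratings below are placeholders, not the actual Filmfare data.

```python
import pandas as pd

# Placeholder data for illustration only
movies = pd.DataFrame({
    "Movie Name": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E", "Movie F"],
    "Rating": [7.2, 8.4, 6.9, 8.1, 7.8, 8.6],
})

# Five rows with the largest Rating, highest first
top_five = movies.nlargest(5, "Rating")
print(top_five)
```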
Movies with Lowest Ratings
Movies with Maximum Run Time
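The lowest ratings and longest run times follow the same pattern; this sketch uses nsmallest and a descending sort on placeholder rows.

```python
import pandas as pd

# Placeholder data for illustration only
movies = pd.DataFrame({
    "Movie Name": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "Rating": [7.2, 8.4, 6.9, 8.1, 5.8],
    "RunTime": [158, 120, 175, 95, 140],
})

# Three lowest-rated movies, lowest first
lowest = movies.nsmallest(3, "Rating")

# Three longest movies, longest first
longest = movies.sort_values("RunTime", ascending=False).head(3)

print(lowest)
print(longest)
```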
Movies with rating Greater than 7
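A rating threshold is a boolean filter on the Rating column. The data below is a placeholder.

```python
import pandas as pd

# Placeholder data for illustration only
movies = pd.DataFrame({
    "Movie Name": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Rating": [7.2, 7.0, 6.9, 8.1],
})

# Strictly greater than 7, so a rating of exactly 7.0 is excluded
high_rated = movies[movies["Rating"] > 7]
print(high_rated)
```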
Ratings Visualization using Bar Graph
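A minimal matplotlib bar chart of ratings, assuming the cleaned data is in a DataFrame. The rows are placeholders, and the Agg backend line is only needed when running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder data for illustration only
movies = pd.DataFrame({
    "Movie Name": ["Movie A", "Movie B", "Movie C"],
    "Rating": [7.2, 8.4, 6.9],
})

fig, ax = plt.subplots(figsize=(6, 3))
bars = ax.bar(movies["Movie Name"], movies["Rating"])  # one bar per movie
ax.set_xlabel("Movie")
ax.set_ylabel("Rating")
ax.set_title("Ratings by movie")
fig.tight_layout()
# In a notebook, plt.show() displays the chart.
```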
Movies most likely to be selected for Best Picture
Rating greater than 7
Run time more than 2 hours
Category Drama and Musical
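The three criteria can be combined into one filter. This sketch assumes RunTime is stored in minutes and that a movie qualifies if its genre is either Drama or Musical; the rows are placeholders.

```python
import pandas as pd

# Placeholder data for illustration only
movies = pd.DataFrame({
    "Movie Name": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "Rating": [7.2, 8.0, 7.5, 6.5, 7.9],
    "RunTime": [158, 110, 150, 160, 135],  # minutes
    "Genre": ["Drama", "Drama", "Action", "Musical", "Musical"],
})

candidates = movies[
    (movies["Rating"] > 7)                          # rating greater than 7
    & (movies["RunTime"] > 120)                     # longer than 2 hours
    & (movies["Genre"].isin(["Drama", "Musical"]))  # Drama or Musical
]
print(candidates)
```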
Web Scraping & Data Analysis with Selenium and Python
By Vinay Babu / @min2bro