Web Scraping & Data Analysis with Selenium and Python – Data Science Toolbox – IPython Notebook



Web Scraping and Data Analysis with Selenium and Python

On Github min2bro / WebScraping

By Vinay Babu / @min2bro

Content of this talk

  • Web Scraping using Selenium
  • A guided tour of some pandas/matplotlib features, through a data analysis of IMDB (Internet Movie Database) entries for Filmfare Best Movies

Github: https://github.com/min2bro/WebScraping/

Data Science Toolbox

Web Scraping

IPython Notebook

  • Write, edit, and replay Python scripts
  • Interactive data visualization and report presentation
  • Notebooks can be saved and shared
  • Run Selenium Python scripts

Pandas

  • Python Data Analysis Library

Matplotlib

  • Plotting library for Python

Steps to Follow

Best Movies

Filmfare Awards 1955-2015

Some Imports

Open IMDB Movie List Page

Getting Data

Data extraction from the Web using Selenium
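
The extraction step might look roughly like the sketch below. The URL and the CSS selectors are not given on the slides, so they are left as parameters; the helper names texts_by_css and scrape_movie_list are hypothetical, and the use of Firefox is an assumption.

```python
def texts_by_css(driver, selector):
    """Return the stripped text of every element matching a CSS selector."""
    # "css selector" is the locator string behind selenium's By.CSS_SELECTOR.
    elements = driver.find_elements("css selector", selector)
    return [element.text.strip() for element in elements]


def scrape_movie_list(url, title_selector, votes_selector):
    """Open the page and collect titles and vote counts as lists of strings."""
    # Imported here so texts_by_css stays usable without a browser installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: Firefox and geckodriver are available
    try:
        driver.get(url)
        titles = texts_by_css(driver, title_selector)
        votes = texts_by_css(driver, votes_selector)
        return titles, votes
    finally:
        driver.quit()  # always close the browser, even if scraping fails
```

Each call to texts_by_css yields one flat list per field, which matches the list-per-column shape shown under "Data format".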

Selenium vs Others

Data format

['Bajirao Mastani', 'Queen', 'Bhaag Milkha Bhaag', 'Barfi!', 'Zindagi Na Milegi Dobara']

['17,362', '39,518', '39,731', '52,308', '41,731']

['Director: Sanjay Leela Bhansali', 'Director: Vikas Bahl', 'Director: Anurag Basu']


Store Data in a Python Dictionary
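
One way to combine the per-field lists into one dictionary per movie is to zip them together. This is a minimal sketch, not necessarily the notebook's own code; it uses only the first two entries of each list from the "Data format" slide and only three of the fields.

```python
# First two entries of each scraped list, as shown on the "Data format" slide.
names = ["Bajirao Mastani", "Queen"]
votes = ["17,362", "39,518"]
directors = ["Director: Sanjay Leela Bhansali", "Director: Vikas Bahl"]

# zip pairs up the i-th item of every list, giving one record per movie.
movies = [
    {"Movie Name": name, "Votes": vote, "Director": director}
    for name, vote, director in zip(names, votes, directors)
]

for movie in movies:
    print(movie)
```

The values are still raw strings at this point; type conversion is left to the cleansing step.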

Data in Dictionary

{
	"Director": "Director: Sanjay Leela Bhansali",
	"Votes": "17,362",
	"RunTime": "A historical ... (158 mins.)",
	"Year": 2015,
	"Genre": "Drama",
	"Movie Name": "Bajirao Mastani",
	"Rating": "7.2"
}

Data Cleansing

  • Remove the comma (,) from the Votes value and convert it to int
  • Convert Rating to float and RunTime to int
  • Strip the description text from RunTime, keeping only the minutes
  • Strip the "Director: " prefix from Director
  • Use null for missing values
{
	"Director": "Sanjay Leela Bhansali",
	"Votes": 17362,
	"RunTime": 158,
	"Year": 2015,
	"Genre": "Drama",
	"Movie Name": "Bajirao Mastani",
	"Rating": 7.2
}
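
The cleansing rules above can be sketched as a single function that turns the raw record into the cleaned one. The function name clean_record and the regular expression for the run time are assumptions based on the one sample record shown; other records may need a looser pattern.

```python
import re


def clean_record(raw):
    """Return a cleaned copy of one scraped movie record."""
    record = dict(raw)

    # "17,362" -> 17362; empty or missing -> None
    votes = raw.get("Votes")
    record["Votes"] = int(votes.replace(",", "")) if votes else None

    # "7.2" -> 7.2
    rating = raw.get("Rating")
    record["Rating"] = float(rating) if rating else None

    # "A historical ... (158 mins.)" -> 158, dropping the description
    match = re.search(r"\((\d+)\s*mins?\.?\)", raw.get("RunTime") or "")
    record["RunTime"] = int(match.group(1)) if match else None

    # "Director: Sanjay Leela Bhansali" -> "Sanjay Leela Bhansali"
    director = raw.get("Director")
    record["Director"] = (
        director.replace("Director: ", "", 1).strip() if director else None
    )
    return record


raw = {
    "Director": "Director: Sanjay Leela Bhansali",
    "Votes": "17,362",
    "RunTime": "A historical ... (158 mins.)",
    "Year": 2015,
    "Genre": "Drama",
    "Movie Name": "Bajirao Mastani",
    "Rating": "7.2",
}
cleaned = clean_record(raw)
print(cleaned)
```

Fields that cannot be parsed come back as None, which pandas later treats as a missing value.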

Data in Pandas Dataframe

Missing Values

Records with missing values

Replace Null Values with Mean
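
A minimal pandas sketch of loading the records, listing those with missing values, and filling the gaps with the column mean. Only the first row comes from the slides; the "Placeholder" rows are made up purely to have something missing.

```python
import pandas as pd

records = [
    # From the slides:
    {"Movie Name": "Bajirao Mastani", "Year": 2015, "Rating": 7.2, "RunTime": 158},
    # Made-up placeholder rows, for illustration only:
    {"Movie Name": "Placeholder A", "Year": 2000, "Rating": 8.0, "RunTime": None},
    {"Movie Name": "Placeholder B", "Year": 1990, "Rating": None, "RunTime": 142},
]
df = pd.DataFrame(records)

# Records with at least one missing value.
missing = df[df.isnull().any(axis=1)]

# Replace missing values with the mean of their own column.
filled = df.fillna(
    {"Rating": df["Rating"].mean(), "RunTime": df["RunTime"].mean()}
)
print(missing)
print(filled)
```

Filling with the mean keeps every record in the analysis, at the cost of pulling those records toward the average.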

Movies with Highest Ratings

Top five movies since 1955
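
Picking the highest-rated movies is a one-liner with nlargest. In this sketch the "Placeholder" rows and their ratings are made up for illustration, and the top two are taken instead of the top five because the sample is tiny.

```python
import pandas as pd

df = pd.DataFrame(
    {
        # Only "Bajirao Mastani" and its rating come from the slides.
        "Movie Name": ["Bajirao Mastani", "Placeholder A", "Placeholder B"],
        "Rating": [7.2, 8.1, 6.5],
    }
)

# Rows with the largest Rating values, highest first.
top = df.nlargest(2, "Rating")
print(top)
```

The same pattern with nsmallest, or with a different column such as RunTime, covers the lowest-rated and longest-running lists.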

Best Movies from last 65 years

Movies with Lowest Ratings

Movies with Maximum Run Time

Top 10 movies

Trends

Average

Movies IMDB Ratings

Movies with rating Greater than 7

Ratings Visualization using Bar Graph

Percentage distribution

Best Movies By Genre
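
A percentage breakdown by genre can be computed with value_counts. The genre values below are a made-up sample using the two genre names that appear on the slides; they are not the real distribution.

```python
import pandas as pd

# Made-up sample of genres, for illustration only.
genres = pd.Series(["Drama", "Drama", "Drama", "Musical"], name="Genre")

# normalize=True returns fractions; multiply by 100 for percentages.
genre_pct = genres.value_counts(normalize=True) * 100
print(genre_pct)
```

The resulting Series can be passed to its plot method (for example with kind="bar" or kind="pie") when matplotlib is installed.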

Directors of Best Movies

Movies most likely to be selected for Best Picture

Rating greater than 7

Run time more than 2 hours

Category Drama and Musical
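
The three criteria above can be combined into one boolean mask. This is a sketch under assumptions: the column names follow the cleaned record, "2 hours" is read as RunTime > 120 minutes, and every row except "Bajirao Mastani" is a made-up placeholder.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Movie Name": ["Bajirao Mastani", "Placeholder A", "Placeholder B", "Placeholder C"],
        "Rating": [7.2, 6.8, 8.0, 7.5],
        "RunTime": [158, 150, 110, 130],  # minutes
        "Genre": ["Drama", "Drama", "Musical", "Action"],
    }
)

# Each condition yields a boolean Series; & keeps rows where all three hold.
mask = (
    (df["Rating"] > 7)
    & (df["RunTime"] > 120)
    & df["Genre"].isin(["Drama", "Musical"])
)
candidates = df[mask]
print(candidates)
```

The parentheses around each comparison are required because & binds more tightly than > in Python.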

Thanks for watching
