Intro to Scraping – FiveThirtyEight – Tools







On GitHub: AlJohri / scraping-presentation

Intro to Scraping

FiveThirtyEight

Created by Al Johri / @aljohri

Tools

Without Code

Find Hidden APIs

Always use the network tab!

http://www.q1043.com/music/playlist/index.html?last10=1

http://www.q1043.com/services/now_playing.html?streamId=1465&limit=10
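Once the network tab reveals the request behind a page, that request can be called directly. A minimal sketch, assuming the second URL above is the endpoint the playlist page loads; the helper names `now_playing_url` and `fetch_now_playing` are made up for illustration, and nothing is assumed about the response format.

```python
from urllib.parse import urlencode

# Endpoint copied from the URL above, as spotted in the network tab.
BASE_URL = "http://www.q1043.com/services/now_playing.html"


def now_playing_url(stream_id=1465, limit=10):
    """Build the hidden-API URL with its query-string parameters."""
    return BASE_URL + "?" + urlencode({"streamId": stream_id, "limit": limit})


def fetch_now_playing(stream_id=1465, limit=10):
    """Fetch the raw response body; inspect it before choosing a parser."""
    import requests  # imported lazily so the URL helper has no dependencies

    response = requests.get(now_playing_url(stream_id, limit), timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text
```

Changing `limit` or `stream_id` is usually the first thing worth trying: hidden endpoints often accept larger values than the page itself requests.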

High Level

Two Main Types

  • Simulate Browser
  • Simulate GET, POST requests

Simulate Browser

  • slower
  • things can move and change
  • intuitive to write

Simulate GET, POST requests

  • often requires a bit more digging to reproduce the requests the page's JavaScript makes
  • longer lasting (APIs don't change much)
  • much, much faster
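Simulating a form submission comes down to sending the same method, body, and headers the browser would. A rough sketch using only the standard library; the URL, the form fields, and the helper name `build_form_post` are placeholders for illustration, not taken from any real site.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_post(url, form, user_agent="Mozilla/5.0"):
    """Build (but do not send) a POST request that mimics an HTML form."""
    body = urlencode(form).encode("ascii")  # e.g. b"q=scraping&page=2"
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": user_agent,  # some sites reject the default agent
    }
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical endpoint and fields, copied from what the network tab shows.
request = build_form_post("https://example.com/search", {"q": "scraping", "page": 2})
```

Sending it is a single call to `urllib.request.urlopen(request)`; with the requests library the equivalent is `requests.post(url, data=form)`.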

Python

Libraries

pip install requests lxml cssselect beautifulsoup4

Example

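import requests
import lxml.html

# Fetch the page and parse the raw bytes into an element tree.
response = requests.get('https://www.google.com/')
doc = lxml.html.fromstring(response.content)
# First element with id "hplogo"; raises IndexError if the page has none.
element = doc.cssselect("#hplogo")[0]
print(lxml.html.tostring(element, encoding='unicode'))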

Ruby

Libraries

gem install mechanize nokogiri

Honorable Mention

Caveats

  • Legal Ramifications (Craigslist)
  • Rate limits / Accidental DDoS
  • Website can see your IP
  • Scraping is a scary word
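One way to stay under rate limits and avoid an accidental DDoS is to enforce a minimum delay between requests. A small sketch of that idea; the `Throttle` class is a hypothetical helper, and a suitable delay depends entirely on the site being scraped.

```python
import time


class Throttle:
    """Ensure at least `delay` seconds pass between calls to wait()."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock  # injectable so the class can be tested
        self._sleep = sleep
        self._last = None    # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)  # pause only for the time still owed
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` right before each `requests.get(...)` in a loop keeps the scraper to roughly one request per `delay` seconds.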

THE END

BY Al Johri