The first web server at CERN

Web Scraping

Wednesday, 9-OCT-2019

HTTP, Requests, and screen scraping

Learning Objectives

Internal Resources

External Resources

Screen casts

Screen cast of Spring 2020 online session Part 1

Screen cast of Spring 2020 online session Part 2

Lesson Sequence

  1. Structure of the WWW: Requests and responses in browsers
  2. Using the request and response tools in urllib
  3. Structuring documents in trees: HTML basics
  4. Using Beautiful Soup to parse a simple HTML file
  5. Run through a screen scraping example on GoodReads.com. Learn how to use urllib and BeautifulSoup
  6. Work time on screen scraping project
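
As a small illustration of steps 3 and 4, here is a minimal sketch of parsing a simple HTML document into a tree with BeautifulSoup. The HTML string and the variable names are made up for this example, and it assumes the `bs4` package is installed.

```python
from bs4 import BeautifulSoup

# A made-up HTML document; in the lesson this would come from a file or a URL.
SIMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Reading list</h1>
    <ul>
      <li>First item</li>
      <li>Second item</li>
    </ul>
  </body>
</html>
"""

# Parse the text into a tree of tags using Python's built-in HTML parser.
soup = BeautifulSoup(SIMPLE_HTML, "html.parser")

# Reach a tag by name, collect every matching tag, and step up to a parent.
page_title = soup.title.string
list_items = [li.get_text(strip=True) for li in soup.find_all("li")]
parent_of_first_item = soup.find("li").parent.name

print(page_title)
print(list_items)
print(parent_of_first_item)
```

Each `li` tag sits inside the `ul` tag, which sits inside `body`; that nesting is the tree that the parser lets you walk.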

HTML and HTTP notes

Internet ~ Early 1970s

World Wide Web ~ 1990s

Products to Produce

program objective

Create a program that uses urllib and BeautifulSoup to grab HTML code from a public source, parses that source into meaningful bits of data, and spits those meaningful bits of data into some form that could be transferred into another tool, such as a CSV file for slurping up into a database, a JSON file for use on the web, etc.
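
The last step of that objective, handing the data to another tool, can be sketched with the standard library alone. The records and function names below are hypothetical placeholders standing in for whatever your own scraper extracts.

```python
import csv
import io
import json

# Hypothetical records, standing in for data already parsed out of a page.
records = [
    {"title": "Example Book One", "author": "A. Writer", "rating": 4.1},
    {"title": "Example Book Two", "author": "B. Author", "rating": 3.7},
]


def records_to_csv(rows):
    """Return the rows as CSV text, with a header taken from the first record's keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def records_to_json(rows):
    """Return the rows as an indented JSON array of objects."""
    return json.dumps(rows, indent=2)


csv_text = records_to_csv(records)
json_text = records_to_json(records)
print(csv_text)
print(json_text)
```

To write a real file instead of a string, open it with `open("books.csv", "w", newline="")` and pass that file object to `csv.DictWriter` in place of the buffer.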

suggestions for good pages to parse

Choose a website whose page content is retrieved with some sort of query, such as a URL-encoded search query. This lets your program change the query programmatically to control the results it gets back, so it can scale to process far more data than a single, static page.
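
A minimal sketch of building such a query URL with `urllib.parse` follows. The base URL and the parameter names `q` and `page` are assumptions for illustration; a real site defines its own.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://example.com/search"


def build_search_url(term, page=1):
    """Return a search URL with the term and page number URL-encoded."""
    query = urlencode({"q": term, "page": page})
    return f"{BASE_URL}?{query}"


url = build_search_url("science fiction", page=2)
print(url)  # https://example.com/search?q=science+fiction&page=2

# Changing the arguments in a loop yields one URL per page of results.
urls = [build_search_url("science fiction", page=n) for n in range(1, 4)]
```

Because the URL is assembled from arguments, the same function can request many searches or many result pages without any hand editing.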

Pages with tables of data are great, since they let us loop over trees of page elements (rows, then the cells inside each row) and process their data one at a time.
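
Looping over a table might look like the sketch below. The table is made up, the function name `parse_table` is hypothetical, and the code assumes the first row holds `th` header cells, which is not true of every page.

```python
from bs4 import BeautifulSoup

# A made-up table standing in for HTML fetched from a real page.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Example Book One</td><td>2001</td></tr>
  <tr><td>Example Book Two</td><td>2015</td></tr>
</table>
"""


def parse_table(html):
    """Return one dict per data row, keyed by the header cells of the first row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table").find_all("tr")
    # Assumption: the first row contains the column names in <th> cells.
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        values = [cell.get_text(strip=True) for cell in row.find_all("td")]
        records.append(dict(zip(headers, values)))
    return records


table_rows = parse_table(SAMPLE_HTML)
for record in table_rows:
    print(record)
```

All values come back as strings, so convert numbers such as the year with `int()` before doing arithmetic on them.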

use methods!

When possible, please structure your code in discrete methods that accomplish a single task, returning useful values to the caller. This helps reduce code repetition, allows for modular re-use, and makes the code generally more readable than blobs of lines in a heap.
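
One possible way to split a scraper into single-task methods is sketched here. The function names are hypothetical, and decoding every page as UTF-8 is a simplifying assumption.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its body as text."""
    with urlopen(url) as response:
        # Assumption: the page is UTF-8; undecodable bytes are replaced.
        return response.read().decode("utf-8", errors="replace")


def extract_link_texts(html):
    """Return the visible text of every <a> tag in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.find_all("a")]


def scrape_link_texts(url):
    """Combine the two steps: fetch a page, then pull out its link texts."""
    return extract_link_texts(fetch_html(url))


# The parsing step can be tried on its own, without touching the network.
sample = '<p><a href="/a">First</a> and <a href="/b"> Second </a></p>'
print(extract_link_texts(sample))  # ['First', 'Second']
```

Keeping the download separate from the parsing means each piece can be tested and reused independently, and the parser can be exercised on saved HTML while you are still working out the right tags to look for.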