26-March-2020: Created for SP20
HTTP, Requests, and screen scraping
- Explain the essential structure of HTML pages and how they store structured information. Write a basic HTML page about a life interest.
- Employ the urllib and BeautifulSoup libraries to pull down an HTML page, find a particular element, and extract the data from that element.
- Systematically store the result of a screen scraping endeavor into a text file that can be shared and used as input by other programs or tools. (A minimal end-to-end sketch follows this list.)
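The sketch below ties the three objectives together: fetch a page with urllib, locate one element with BeautifulSoup, and save its text to a file. The target URL and the choice of an h1 element are placeholders, not part of the assignment.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://example.com").read()   # fetch the raw HTML
soup = BeautifulSoup(html, "html.parser")      # parse it into an element tree
heading = soup.find("h1")                      # locate one element in the tree

with open("scraped.txt", "w") as out:          # store the result for sharing
    out.write(heading.get_text() + "\n")
```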
- Official Python documentation on the built-in string methods
- Beautiful Soup is an HTML parsing library for breaking down HTML pages into their constituent elements. This is an invaluable library for all screen scrapers out there.
- The urllib package in the Python 3 standard library provides a suite of tools for fetching URL data. Note that it is independent of the popular third-party Requests library, which offers a friendlier interface for the same job.
- Goodreads.com is a book repository that returns simple, parseable HTML from URL-encoded queries. This is our sample site for screen scraping.
- RFC 2616: the HTTP/1.1 standard specification; heavy reading, but nice to know about
- Selenium: a browser-based testing tool that can help overcome JavaScript barriers to data access on a target page (a minimal sketch follows this list)
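For reference, a minimal Selenium sketch looks roughly like this; it assumes Selenium and a matching browser driver (e.g. geckodriver for Firefox) are installed, and the URL is a placeholder:

```python
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()                   # opens a real browser window
driver.get("https://example.com")              # placeholder URL
soup = BeautifulSoup(driver.page_source,       # HTML after JavaScript has run
                     "html.parser")
driver.quit()
```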
Screencast of Spring 2020 online session, Part 1
Screencast of Spring 2020 online session, Part 2
- Structure of the WWW: Requests and responses in browsers
- Using the request and response modules in urllib (see the sketch after this list)
- Structuring documents in trees! HTML basics
- Using Beautiful Soup to parse a simple HTML file
- Run through a screen scraping example on GoodReads.com; learn how to use urllib and BeautifulSoup
- Work time on screen scraping project
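As a preview of the requests-and-responses piece, here is a rough sketch of one round trip through urllib: the response object carries a status code, headers, and the body bytes. The URL is a placeholder.

```python
from urllib.request import urlopen

response = urlopen("https://example.com")      # send an HTTP GET request
print(response.status)                         # e.g. 200
print(response.getheader("Content-Type"))      # e.g. text/html; charset=UTF-8
body = response.read()                         # the response body: raw HTML bytes
```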
HTML and HTTP notes
Internet ~ Early 1970s
- Connect computers running compatible operating systems
- Remote logins via a dedicated data network
- Strongly coupled sub-networks that were largely incompatible with one another
World Wide Web ~ 1990s
- Runs on HTTP; the Web is a SUBSET of the Internet, one application among many that use it
- HTTP - Hypertext Transfer Protocol v1.1: rules for transmitting data on this network
- HTML - Hypertext Markup Language: format for encoding documents exchanged on the WWW using HTTP
- CSS - Cascading style sheets: Providing browsers with formatting information beyond the built-in stylesheet included with each browser
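The "documents as trees" idea is easiest to see on a toy page: nested tags become nested nodes that Beautiful Soup can walk. The markup below is invented purely for illustration:

```python
from bs4 import BeautifulSoup

page = """
<html>
  <head><title>My Page</title></head>
  <body>
    <h1>Hello</h1>
    <p class="note">HTML documents are <em>trees</em>.</p>
  </body>
</html>
"""
soup = BeautifulSoup(page, "html.parser")
print(soup.title.get_text())                        # My Page
print(soup.find("p", class_="note").em.get_text())  # trees
```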
Products to Produce
- Code to the specification below. Then upload your Python files and any related documents to your GitHub account. I suggest creating a subdirectory in your git repository called "scraping" or something like that.
Create a program that uses the urllib and BeautifulSoup libraries to grab HTML code from a public source, parses that source into meaningful bits of data, and spits those meaningful bits out in some form that could be transferred to another tool, such as a CSV file for slurping up into a database, a JSON file for use on the web, etc.
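As a hedged sketch of the output step, suppose the parsing stage produced a list of dictionaries (the field names below are invented); the standard csv and json modules can then emit either target format:

```python
import csv
import json

rows = [{"title": "Some Book", "author": "Some Author"}]  # placeholder data

with open("books.csv", "w", newline="") as f:             # CSV for a database
    writer = csv.DictWriter(f, fieldnames=["title", "author"])
    writer.writeheader()
    writer.writerows(rows)

with open("books.json", "w") as f:                        # JSON for the web
    json.dump(rows, f, indent=2)
```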
Suggestions for good pages to parse
Choose a website whose page content is retrieved with some sort of query, such as a URL-encoded search query. This will let your program tinker programmatically with the results it gets back, and it can be scaled to process much more data than a single, static page.
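For example, a query-driven scrape might build its URLs with urllib.parse.urlencode; the search path and the q parameter below are assumptions about GoodReads' URL scheme, so verify them against your own target site:

```python
from urllib.parse import urlencode

base = "https://www.goodreads.com/search"      # assumed search endpoint
for term in ["python", "statistics", "databases"]:
    url = base + "?" + urlencode({"q": term})  # e.g. .../search?q=python
    print(url)                                 # fetch each with urlopen() as usual
```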
Pages with tables of data are great, since they let us loop over trees of page elements and process their data one element at a time.
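A table scrape usually boils down to nested find_all calls, one per level of the tree. The toy table below stands in for a real page's markup:

```python
from bs4 import BeautifulSoup

page = """
<table>
  <tr><th>Title</th><th>Author</th></tr>
  <tr><td>Book One</td><td>Author A</td></tr>
  <tr><td>Book Two</td><td>Author B</td></tr>
</table>
"""
soup = BeautifulSoup(page, "html.parser")
for row in soup.find("table").find_all("tr"):  # one <tr> per record
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:                                  # header rows have <th>, not <td>
        print(cells)
```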
When possible, please structure your code as discrete methods that each accomplish a single task and return useful values to the caller. This reduces code repetition, allows for modular re-use, and makes the code generally more readable than blobs of lines in a heap.
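One possible shape for that structure, sketched with placeholder names and a placeholder URL:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

def fetch(url):
    """Return the raw HTML of one page."""
    return urlopen(url).read()

def parse(html):
    """Return a list of heading strings found in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [h.get_text(strip=True) for h in soup.find_all("h1")]

def save(lines, path):
    """Write one scraped value per line to a shareable text file."""
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

def main():
    save(parse(fetch("https://example.com")), "headings.txt")

if __name__ == "__main__":
    main()
```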