Lecture 13
Dr. Mine Çetinkaya-Rundel
Duke University
STA 199 - Fall 2022
October 12, 2022
Clone the ae-12 project from GitHub, render your document, update your name, and commit and push.
An increasing amount of data is available on the web
These data are provided in an unstructured format: you can always copy and paste, but it’s time-consuming and prone to errors
Web scraping is the process of extracting this information automatically and transforming it into a structured dataset
Two different scenarios:
Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.
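To make the two scenarios concrete, here is a minimal sketch that gets the same information both ways. Everything in it is an assumption for illustration: the HTML snippet, the JSON string, and the `li.course` selector are made up, and the jsonlite package is used as one common way to parse JSON in R. Literal strings stand in for a real page and a real API response so that nothing is fetched from the web.

```r
library(rvest)     # HTML parsing (screen scraping)
library(jsonlite)  # JSON parsing (web API responses)

# Scenario 1: screen scraping. The data is buried in HTML markup,
# so we parse the source and select the elements we want.
# (Made-up page source; read_html() also accepts a URL.)
page_source <- '<html><body>
  <ul>
    <li class="course">STA 199</li>
    <li class="course">STA 210</li>
  </ul>
</body></html>'

courses_scraped <- read_html(page_source) |>
  html_elements("li.course") |>
  html_text2()

# Scenario 2: web API. The response is already structured (here, JSON),
# so it converts to a data frame without any selectors.
# (Made-up response body.)
api_response <- '[{"course": "STA 199"}, {"course": "STA 210"}]'

courses_api <- fromJSON(api_response)$course
```

Both `courses_scraped` and `courses_api` end up as the same character vector; the difference is how much work it takes to locate the data inside the response.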
Core rvest functions, typically chained with |>:
read_html() - Read HTML data from a URL or character string (actually from the xml2 package, but most often used along with other rvest functions)
html_element() / html_elements() - Select a specified element(s) from an HTML document
html_table() - Parse an HTML table into a data frame
html_text() - Extract text from an element
html_text2() - Extract text from an element and lightly format it to match how text looks in the browser
html_name() - Extract elements’ names
html_attr() / html_attrs() - Extract a single attribute or all attributes
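The functions listed above can be tried out on a small document. The sketch below is only an illustration: the HTML string, the example.com link, and the variable names are assumptions, and a literal string is passed to read_html() instead of a URL so that the example does not depend on a live website.

```r
library(rvest)

# Made-up HTML document used in place of a real web page
html_source <- '<html><body>
  <h1 id="title">Course   schedule</h1>
  <a class="link" href="https://example.com/syllabus">Syllabus</a>
  <table>
    <tr><th>week</th><th>topic</th></tr>
    <tr><td>1</td><td>Intro</td></tr>
    <tr><td>2</td><td>Web scraping</td></tr>
  </table>
</body></html>'

# read_html(): parse the source into a document object
page <- read_html(html_source)

# html_element(): select the first element matching a CSS selector
heading <- page |> html_element("h1")

# html_text() keeps the raw whitespace; html_text2() tidies it up
# the way a browser would display it
heading_raw  <- heading |> html_text()
heading_text <- heading |> html_text2()

# html_name(): the tag name of the element
heading_tag <- heading |> html_name()

# html_attr(): one attribute; html_attrs(): all attributes of the element
link_url   <- page |> html_elements("a") |> html_attr("href")
link_attrs <- page |> html_element("a") |> html_attrs()

# html_table(): parse an HTML table into a data frame (a tibble)
schedule <- page |> html_element("table") |> html_table()
```

Note the difference between html_element(), which returns a single match, and html_elements(), which returns all matches and is therefore the natural choice when collecting every link or every row on a page.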
ae-12 (repo name will be suffixed with your GitHub name).
When working in a Quarto document, your analysis is re-run each time you knit
If web scraping in a Quarto document, you’d be re-scraping the data each time you knit, which is undesirable (and not nice)!
An alternative workflow:
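One possible version of such a workflow, sketched here as an assumption and not necessarily the exact steps intended: do the scraping in a separate R script that saves its output to a file, and have the Quarto document read only the saved file. The function names `scrape_and_save()` and `load_cached()`, the made-up HTML source, and the temporary file path are all hypothetical and chosen for illustration.

```r
library(rvest)

# Step 1 (would live in a separate script, e.g. a hypothetical scrape.R):
# scrape once and save the result to disk.
scrape_and_save <- function(source, path) {
  scraped <- read_html(source) |>
    html_element("table") |>
    html_table()
  write.csv(scraped, path, row.names = FALSE)
  invisible(path)
}

# Step 2 (would live in the Quarto document): read the saved file.
# Scraping happens only if the file does not exist yet.
load_cached <- function(source, path) {
  if (!file.exists(path)) {
    scrape_and_save(source, path)
  }
  read.csv(path)
}

# Made-up HTML standing in for a real URL
html_source <- '<table>
  <tr><th>week</th><th>topic</th></tr>
  <tr><td>1</td><td>Intro</td></tr>
  <tr><td>2</td><td>Web scraping</td></tr>
</table>'

cache_path <- tempfile(fileext = ".csv")        # hypothetical cache location
first_load  <- load_cached(html_source, cache_path)  # scrapes and saves
second_load <- load_cached(html_source, cache_path)  # reads the file only
```

With this split, re-running the document no longer sends requests to the website; the data is refreshed only when the scraping step is run on purpose.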
Two different scenarios for web scraping:
Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy)
Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files