Scrape it off: Using or making web scraping tools to gather structured data from webpages

Speaker(s)

Session Description

Link to Slides: 

https://docs.google.com/presentation/d/10oH1XCCF8pDXwNDXgLLJNJUTaRqVxFheCsPOlb76GNA/edit?usp=sharing

Often, in the course of research, librarians and faculty need to gather data that is presented on websites in tables or lists.  Although copy and paste can do the trick when the amount of data is small, at some point we need tools that can automate the job for us.  Automation is especially useful when gathering metadata for catalogs or online repositories, where much of the required metadata is already available from other sources but still needs to be collected and edited.

First, I will discuss the primary ways in which we can collect data via the internet.

Then, I will focus on the use of pre-made web scraping utilities, covering what they are and how they work, and comparing some of the available free or low-cost options.

Finally, I will talk generally about how you could create your own web scraper, customized for what you need.  As an example, I will go through the development of a Python program I created to pull a monthly report of case opinion metadata from a court website, and I will discuss the skills and tools you would need to go about developing a similar program.
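
To give a rough sense of what a custom scraper involves, here is a minimal sketch in Python.  It is not the program described in this session; it only illustrates the general idea of reading an HTML table and turning it into rows of data.  The sample table, its column names, and the names TableScraper, scrape_table, and rows_to_csv are all made up for illustration, and the sketch uses only Python's standard library so it can run without downloading anything.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row begins; discard any unfinished cell.
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while we are inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Format the rows as CSV text, ready to open in a spreadsheet."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample page content (invented for this sketch).
SAMPLE_HTML = """
<table>
  <tr><th>Case Name</th><th>Docket No.</th><th>Date Filed</th></tr>
  <tr><td>Example v. Sample</td><td>A000001</td><td>2016-03-01</td></tr>
  <tr><td>In re Placeholder</td><td>A000002</td><td>2016-03-15</td></tr>
</table>
"""

if __name__ == "__main__":
    print(rows_to_csv(scrape_table(SAMPLE_HTML)))
```

In practice the HTML would come from a downloaded page rather than a string in the script, and real pages usually need extra handling for nested tables, links inside cells, and results split across several pages.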

Although I will be talking about programming a bit in this session, I will not be focusing on the specifics of coding.  This will be a beginner-friendly introduction to how web scrapers work, and what you may need to know if you find yourself needing to use one.

Session Track

Technology

Experience level

Beginner

Session Time Slot(s)