Pyscalpel: your friendly tool for web scraping

Photo by Divyadarshi Acharya on Unsplash
pyscalpel is a web scraping library that can run on top of three concurrency backends:
  • Gevent: A coroutine-based Python networking library that uses greenlet to provide a high-level synchronous API on top of the libev or libuv event loop.
  • Asyncio: The standard library for writing concurrent code using async/await syntax in Python.
  • Trio: A shiny asynchronous library leveraging the async/await syntax, with a clean API and, above all, a “revolutionary” approach called structured concurrency.

Installation

Like any other Python library, you can install pyscalpel with pip. Note that pyscalpel requires Python 3.6 or later.

$ pip install pyscalpel # if you only want the asyncio backend
$ pip install pyscalpel[gevent] # if you want the gevent backend
$ pip install pyscalpel[trio] # if you want the trio backend
$ pip install pyscalpel[full] # if you want all the backends
$ poetry add pyscalpel # for the asyncio backend
$ poetry add pyscalpel[gevent] # for the gevent backend
$ poetry add pyscalpel[trio] # for the trio backend
$ poetry add pyscalpel[full] # for all the backends
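
Note that some shells, zsh for example, interpret the square brackets, so you may need to quote the package name with extras:

$ pip install "pyscalpel[gevent]"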

Usage

Gevent backend

So here is an example where we scrape quote information from https://quotes.toscrape.com. It is a website made by the creators of Scrapy to help us test our spiders (programs that scrape information) 🙃

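Here is a sketch of such a spider running with the gevent framework, adapted from the pyscalpel README; the datetime_processor name and the backup.mp file are illustrative, so double-check the details against the documentation:

from datetime import datetime
from pathlib import Path

from scalpel import Configuration, datetime_decoder
from scalpel.green import StaticSpider, StaticResponse, read_mp


def parse(spider: StaticSpider, response: StaticResponse) -> None:
    # extract the message, author and tags of every quote on the page
    for quote in response.xpath('//div[@class="quote"]'):
        data = {
            'message': quote.xpath('./span[@class="text"]/text()').get(),
            'author': quote.xpath('./span/small/text()').get(),
            'tags': quote.xpath('./div/a/text()').getall(),
        }
        spider.save_item(data)

    # move on to the next page if there is one
    link = response.xpath('//nav/ul/li[@class="next"]/a/@href').get()
    if link is not None:
        response.follow(link)


def datetime_processor(item: dict) -> dict:
    # item processor adding the date at which the item was scraped
    item['date'] = datetime.now()
    return item


backup = Path(__file__).parent / 'backup.mp'
config = Configuration(backup_filename=f'{backup}', item_processors=[datetime_processor])
spider = StaticSpider(urls=['https://quotes.toscrape.com'], parse=parse, config=config)
spider.run()
print(spider.statistics())

for item in read_mp(filename=f'{backup}', decoder=datetime_decoder):
    print(item)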
  • We create the configuration for the spider with the Configuration class, specifying a file where to store the scraped items and a list of item processors, which are functions taking a scraped item and modifying or discarding it. In the example above, the datetime_processor function adds the date at which the item was scraped.
  • We then create a spider using the StaticSpider class, specifying the urls to parse, the function that will do the actual parsing, and the configuration we just created.
  • The run method runs the spider.
  • Once the execution of the spider is finished, we print some statistics about the operation, like the number of requests per second, the urls not reached, etc. More information about this can be found in the documentation.
  • Finally, we use the pyscalpel utility function read_mp to read the file where the scraped items were saved.
  • In the parse function, we search for the message, author and tags of each quote using the xpath method of the response object. We could also have done it with the css method, as shown in the sketch right after this list. More information about these methods can be found in the documentation.
  • We save each scraped item with the save_item method of the spider object.
  • At the end of the parse function, we try to get the url of the next page and, if there is one, we use the follow method of the response object to switch to it.
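
For illustration, here is what the extraction could look like with the css method instead of xpath; the selectors are my assumption, based on the markup of quotes.toscrape.com:

def parse(spider: StaticSpider, response: StaticResponse) -> None:
    # same extraction as before, using css selectors instead of xpath
    for quote in response.css('div.quote'):
        data = {
            'message': quote.css('span.text::text').get(),
            'author': quote.css('small.author::text').get(),
            'tags': quote.css('div.tags a.tag::text').getall(),
        }
        spider.save_item(data)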

Asyncio / trio backend

This is the equivalent code of the spider with an asyncio/trio flavor. It leverages the anyio framework, and you will notice that it looks very similar to the previous code, apart from the async/await syntax 🙂

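Here is a sketch of the anyio version (asyncio or trio); the scalpel.any_io import path follows the project README, so verify it against the documentation:

import anyio

from scalpel import Configuration
from scalpel.any_io import StaticSpider, StaticResponse, read_mp


async def parse(spider: StaticSpider, response: StaticResponse) -> None:
    for quote in response.xpath('//div[@class="quote"]'):
        data = {
            'message': quote.xpath('./span[@class="text"]/text()').get(),
            'author': quote.xpath('./span/small/text()').get(),
            'tags': quote.xpath('./div/a/text()').getall(),
        }
        await spider.save_item(data)

    link = response.xpath('//nav/ul/li[@class="next"]/a/@href').get()
    if link is not None:
        await response.follow(link)


async def main() -> None:
    config = Configuration(backup_filename='backup.mp')
    spider = StaticSpider(urls=['https://quotes.toscrape.com'], parse=parse, config=config)
    await spider.run()
    print(spider.statistics())

    # read_mp is an asynchronous generator in this flavor
    async for item in read_mp(filename='backup.mp'):
        print(item)


anyio.run(main)  # runs on asyncio by default
# anyio.run(main, backend='trio')  # runs on trio instead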

Single page application scraping

It’s worth mentioning that pyscalpel supports scraping pages that rely heavily on javascript to render their content, as seen in so-called Single Page Applications. For that, the selenium library is used. Here is an example using the gevent backend to scrape some data on httpbin.org, a website useful for testing an http client.
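The sketch below shows what such a spider could look like; the h1 lookup and the backup file name are illustrative assumptions on my part:

from scalpel import Configuration
from scalpel.green import SeleniumSpider, SeleniumResponse
from selenium.webdriver.common.by import By


def parse(spider: SeleniumSpider, response: SeleniumResponse) -> None:
    # response.driver is a selenium remote driver pointing to the current page
    title = response.driver.find_element(By.TAG_NAME, 'h1').text
    spider.save_item({'title': title})


config = Configuration(backup_filename='httpbin.mp')
spider = SeleniumSpider(urls=['https://httpbin.org/'], parse=parse, config=config)
spider.run()

The differences with the static spider are the following: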

  • StaticSpider and StaticResponse are replaced by their counterparts SeleniumSpider and SeleniumResponse.
  • The response object no longer has the css and xpath methods, but rather a driver attribute that represents a selenium remote driver object and lets you manipulate the current page however you want, as long as you know how to use it.
