Pyscalpel: your friendly tool for web scraping

Photo of a spider weaving its web, by Divyadarshi Acharya on Unsplash

Pyscalpel is a web scraping library that lets you choose between three concurrency backends:

  • Gevent: A coroutine-based Python networking library that uses greenlet to provide a high-level synchronous API on top of the libev or libuv event loop.
  • Asyncio: The standard library for writing concurrent code using the async/await syntax in Python.
  • Trio: A shiny asynchronous library leveraging the async/await syntax, with a clean API and, above all, a “revolutionary” approach known as structured concurrency.

Installation

$ pip install pyscalpel # if you only want the asyncio backend
$ pip install pyscalpel[gevent] # if you want the gevent backend
$ pip install pyscalpel[trio] # if you want the trio backend
$ pip install pyscalpel[full] # if you want all the backends
$ poetry add pyscalpel # for the asyncio backend
$ poetry add pyscalpel[gevent] # for the gevent backend
$ poetry add pyscalpel[trio] # for the trio backend
$ poetry add pyscalpel[full] # for all the backends

Usage

Gevent backend

Here is what a code snippet running with the gevent framework looks like.
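
The sketch below reconstructs such a spider for the classic quotes.toscrape.com site; the scalpel.green import path, the Configuration parameters (backup_filename, item_processors), the statistics method and the read_mp helper are assumptions taken from the project's documentation, so double-check them against the current docs:

# Sketch based on the pyscalpel documentation; import paths and parameter
# names may differ slightly, verify them before running this.
from datetime import datetime
from pathlib import Path

from scalpel import Configuration
from scalpel.green import StaticSpider, StaticResponse, read_mp


def parse(spider: StaticSpider, response: StaticResponse) -> None:
    # collect the message, author and tags of every quote on the page
    for quote in response.xpath('//div[@class="quote"]'):
        data = {
            'message': quote.xpath('./span[@class="text"]/text()').get(),
            'author': quote.xpath('./span/small/text()').get(),
            'tags': quote.xpath('./div/a/text()').getall(),
        }
        spider.save_item(data)

    # move on to the next page if there is one
    next_link = response.xpath('//nav/ul/li[@class="next"]/a/@href').get()
    if next_link is not None:
        response.follow(next_link)


def date_processor(item: dict) -> dict:
    # item processor: add the date at which the item was scraped
    item['date'] = datetime.now()
    return item


if __name__ == '__main__':
    backup = Path(__file__).parent / 'quotes.mp'
    config = Configuration(backup_filename=str(backup), item_processors=[date_processor])
    spider = StaticSpider(urls=['http://quotes.toscrape.com'], parse=parse, config=config)
    spider.run()
    print(spider.statistics())

    # read back the items saved during the crawl (msgpack_decoder is assumed from the docs)
    for item in read_mp(filename=str(backup), decoder=config.msgpack_decoder):
        print(item)
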
  • We create the configuration for the spider by specifying a file where to store the scraped items and a list of item processors, which are functions that take a scraped item and modify or discard it. In the example above, the processor adds the date at which the item was scraped.
  • We then create a spider using the StaticSpider class, specifying the urls to parse, the function that will do the actual parsing and the configuration we just created.
  • We run the spider with its run method.
  • Once the execution of the spider is finished, we print some statistics about the operation, like the number of requests per second, the urls not reached, etc. More information about this can be found here.
  • We then use a pyscalpel utility function to read the file where the scraped items were saved.
  • Looking now at the parse function, we search for the message, author and tags of each quote using the xpath method of the response object. We could also have done it with the css method. More information about these methods can be found in the documentation.
  • We save the scraped item obtained from the previous statements using the save_item method of the spider object.
  • Finally, we try to get the url of the next page of the website and, if there is one, we use the follow method of the response object to switch to it.

Asyncio / trio backend

Here is what the same code snippet looks like when running with anyio (asyncio or trio).
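
A sketch of the anyio version might look like this; the scalpel.any_io import path and the asynchronous read_mp iterator are assumptions based on the project's documentation, and with the trio backend trio.run(main) replaces asyncio.run(main()):

# Sketch assuming the scalpel.any_io module mirrors scalpel.green with async/await;
# verify the import path and the read_mp signature in the documentation.
import asyncio
from datetime import datetime
from pathlib import Path

from scalpel import Configuration
from scalpel.any_io import StaticSpider, StaticResponse, read_mp


def date_processor(item: dict) -> dict:
    # item processor: add the date at which the item was scraped
    item['date'] = datetime.now()
    return item


async def parse(spider: StaticSpider, response: StaticResponse) -> None:
    for quote in response.xpath('//div[@class="quote"]'):
        data = {
            'message': quote.xpath('./span[@class="text"]/text()').get(),
            'author': quote.xpath('./span/small/text()').get(),
            'tags': quote.xpath('./div/a/text()').getall(),
        }
        await spider.save_item(data)

    next_link = response.xpath('//nav/ul/li[@class="next"]/a/@href').get()
    if next_link is not None:
        await response.follow(next_link)


async def main() -> None:
    backup = Path(__file__).parent / 'quotes.mp'
    config = Configuration(backup_filename=str(backup), item_processors=[date_processor])
    spider = StaticSpider(urls=['http://quotes.toscrape.com'], parse=parse, config=config)
    await spider.run()
    print(spider.statistics())

    # read_mp is assumed to be an asynchronous iterator in the any_io flavour
    async for item in read_mp(filename=str(backup), decoder=config.msgpack_decoder):
        print(item)


if __name__ == '__main__':
    asyncio.run(main())  # with the trio backend: trio.run(main)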

Single page application scraping

Scraping a single page application works almost the same way; there are only two differences:

  • StaticSpider and StaticResponse are replaced by their counterparts SeleniumSpider and SeleniumResponse.
  • The response object no longer has the css and xpath methods; instead it has a driver attribute representing a selenium remote driver object, which lets you manipulate the current page however you want, as long as you know how to use it (see the sketch after this list).
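
A rough sketch of a Selenium-based spider is shown below; it assumes SeleniumSpider can be imported from scalpel.green like its static counterpart and that the rest of the workflow (configuration, save_item, statistics) is unchanged, which should be verified in the documentation:

# Rough sketch; the scalpel.green.SeleniumSpider import path and the Configuration
# usage are assumptions, only the driver attribute is described in this article.
from scalpel import Configuration
from scalpel.green import SeleniumSpider, SeleniumResponse
from selenium.webdriver.common.by import By


def parse(spider: SeleniumSpider, response: SeleniumResponse) -> None:
    # response.driver is a regular selenium remote driver, so the usual
    # selenium API is available to inspect the JavaScript-rendered page
    for quote in response.driver.find_elements(By.CSS_SELECTOR, 'div.quote'):
        spider.save_item({
            'message': quote.find_element(By.CSS_SELECTOR, 'span.text').text,
            'author': quote.find_element(By.CSS_SELECTOR, 'small.author').text,
        })


if __name__ == '__main__':
    config = Configuration(backup_filename='quotes.mp')
    spider = SeleniumSpider(urls=['http://quotes.toscrape.com/js/'], parse=parse, config=config)
    spider.run()
    print(spider.statistics())

Apart from these two differences, the configuration, the item processors and the reading of saved items work the same way as in the static examples above.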
