Pyscalpel: your friendly tool for web scraping
Create simple and powerful spiders or crawlers with pyscalpel!
Nowadays, there are many use cases where we need to scrape information from web pages: news feeds, rich link previews with image and text like social media sites show, research for data scientists, etc.
So it helps to have a framework that lets us do this with little effort. The best known in the Python ecosystem is probably scrapy. It is a veteran built on the Twisted asynchronous backend (note that it has recently added partial support for asyncio). But I feel that its usage is heavy: we have to think of it as a whole project, not as a library we can use in another project. Also, even if Twisted is good enough, I believe there are more powerful frameworks that help us achieve better concurrency with a clean syntax. So I decided to create my own web scraping library, and pyscalpel was born! It supports three asynchronous backends:
- Gevent: A coroutine-based Python networking library that uses greenlet to provide a high-level synchronous API on top of the libev or libuv event loop.
- Asyncio: The standard library for writing concurrent code using async/await syntax in Python.
- Trio: A shiny asynchronous library leveraging the async/await syntax with a clean API and, above all, a “revolutionary” approach called structured concurrency.
This will not be a tutorial like my previous articles. I just want to show some examples of how to use it.
Like any other Python library, you install pyscalpel using the pip command. Note that pyscalpel requires Python 3.6 or later.
$ pip install pyscalpel # if you only want the asyncio backend
$ pip install pyscalpel[gevent] # if you want the gevent backend
$ pip install pyscalpel[trio] # if you want the trio backend
$ pip install pyscalpel[full] # if you want all the backends
You can also use poetry to install the package if you prefer. For those who are saying “what the hell is that?”, I have a nice introduction to it.
$ poetry add pyscalpel # for the asyncio backend
$ poetry add pyscalpel[gevent] # for the gevent backend
$ poetry add pyscalpel[trio] # for the trio backend
$ poetry add pyscalpel[full] # for all the backends
So here is an example where we scrape quote information from https://quotes.toscrape.com. It is a website made by the creators of scrapy to help us test our spiders (programs scraping information) 🙃
So let’s dissect the example. We will start from the end.
- On lines 28 and 29, we create the configuration for the spider by specifying a file in which to store the scraped items and a list of item processors, which are functions that take a scraped item and modify or discard it. In this example, the processor defined on lines 22 to 24 adds the date at which the item was scraped.
- On line 30, we create a spider using the StaticSpider class, specifying the urls to parse, the function that will do the actual parsing, and the configuration we just created.
- On line 31, we run the spider.
- On line 32, once the spider has finished running, we print some statistics about the operation, like the number of requests per second, the urls that could not be reached, etc. More information about this can be found here.
- On lines 34 and 35, we use a pyscalpel utility function to read the file where scraped items were saved.
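To make the idea of item processors more concrete, here is a minimal sketch in plain Python. It does not use pyscalpel at all: `apply_processors` and `drop_untagged` are hypothetical helpers written for illustration, and the convention that returning `None` means "discard the item" is an assumption of this sketch, so check the pyscalpel documentation for the actual behaviour.

```python
from datetime import datetime, timezone
from typing import Callable, List, Optional

Item = dict
Processor = Callable[[Item], Optional[Item]]


def date_processor(item: Item) -> Item:
    """Modify the item: add the time at which it was scraped."""
    item["date"] = datetime.now(timezone.utc)
    return item


def drop_untagged(item: Item) -> Optional[Item]:
    """Discard items without tags (None means 'discard' in this sketch)."""
    return item if item.get("tags") else None


def apply_processors(item: Item, processors: List[Processor]) -> Optional[Item]:
    """Hypothetical helper: run processors in order, stop at the first discard."""
    for processor in processors:
        item = processor(item)
        if item is None:
            return None
    return item


processors = [drop_untagged, date_processor]
kept = apply_processors({"message": "hello", "tags": ["greeting"]}, processors)
dropped = apply_processors({"message": "bye", "tags": []}, processors)
print(kept)     # the item, now carrying a "date" key
print(dropped)  # None, because the item had no tags
```

The point is only that a processor is an ordinary function from item to item, so a list of them composes naturally into a small pipeline.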
We still have to look at the parse function defined between lines 8 and 19. First of all, you can see that it takes two arguments: the spider currently running it and a StaticResponse object that contains the downloaded page and methods to scrape specific items. A few observations:
- From lines 9 to 14, we search for the message, author and tags of each quote using the xpath method of the response object. We could also have done it with the css method. More information about these methods can be found in the documentation.
- On line 15, we save the scraped item obtained from the previous statements using the save_item method of the spider object.
- From lines 17 to 19, we try to get the url for the next page of the website and if we have one, we use the follow method of the response object to switch to the next page.
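If you want to experiment with the extraction logic described above without installing anything, the following sketch reproduces the same two steps (extract each quote, then find the next page) using only the standard library. It is a stand-in, not pyscalpel code: `xml.etree.ElementTree` supports only a subset of XPath, the markup in `PAGE` is a hand-written sample that imitates the structure of quotes.toscrape.com, and `extract_quotes` and `next_page_url` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

PAGE_URL = "https://quotes.toscrape.com/"
# Hand-written, well-formed sample imitating the site's markup (an assumption).
PAGE = """
<html><body>
  <div class="quote">
    <span class="text">A day without sunshine is like, you know, night.</span>
    <span>by <small class="author">Steve Martin</small></span>
    <div class="tags"><a class="tag">humor</a><a class="tag">obvious</a></div>
  </div>
  <nav><ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul></nav>
</body></html>
"""


def extract_quotes(root):
    """Collect message, author and tags for every quote block."""
    items = []
    for quote in root.findall(".//div[@class='quote']"):
        items.append({
            "message": quote.find("./span[@class='text']").text,
            "author": quote.find("./span/small").text,
            "tags": [a.text for a in quote.findall("./div/a")],
        })
    return items


def next_page_url(root, base_url):
    """Return the absolute url of the next page, or None on the last page."""
    link = root.find(".//nav/ul/li[@class='next']/a")
    if link is None:
        return None
    return urljoin(base_url, link.get("href"))


root = ET.fromstring(PAGE)
quotes = extract_quotes(root)
next_url = next_page_url(root, PAGE_URL)
print(quotes)
print(next_url)
```

Note that the next link on the page is relative, which is why the sketch resolves it against the url of the current page before following it.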
Asyncio / trio backend
This is the equivalent code of the spider with the asyncio/trio flavor. It leverages the anyio framework, and you will note that it looks very similar to the previous code except for the async/await syntax 🙂
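For readers who are new to async/await, here is a tiny sketch of what this change of shape means, using only the standard asyncio library. `FakeSpider` is a hypothetical stand-in written for this illustration and is not pyscalpel's spider class; the only thing it shares with the description above is the idea that saving an item becomes something you await.

```python
import asyncio


class FakeSpider:
    """Hypothetical stand-in, not pyscalpel's spider class."""

    def __init__(self):
        self.items = []

    async def save_item(self, item):
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        self.items.append(item)


async def parse(spider, page):
    """Same logic as a synchronous parse function, but saving is awaited."""
    for message, author in page:
        await spider.save_item({"message": message, "author": author})


async def main():
    spider = FakeSpider()
    pages = [
        [("quote 1", "author A")],
        [("quote 2", "author B"), ("quote 3", "author C")],
    ]
    # Parse all pages concurrently on a single thread.
    await asyncio.gather(*(parse(spider, page) for page in pages))
    return spider.items


items = asyncio.run(main())
print(len(items))
```

The body of `parse` is almost identical to what a synchronous version would look like. The difference is that each `await` is a point where the event loop can switch to another page being parsed.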
Single page application scraping
I won’t go into the details of what this code does, since it is explained in the documentation. I will just mention that:
- StaticSpider and StaticResponse are replaced by their counterparts SeleniumSpider and SeleniumResponse.
- The response object no longer has the css and xpath methods. Instead, it has a driver attribute that represents a selenium remote driver object and lets you manipulate the current page as you want, as long as you know how to use it.
So that is all for this introduction to pyscalpel. I hope I have piqued your curiosity and, if so, that you will enjoy using it. 😁