Web Crawling with PHP

Why in the World?

When I was employed by a company to work on a large legacy system, I faced a problem. The QA people were spending an inordinate amount of time testing a poorly coded, data-driven application. It was impossible for our team to proactively identify and resolve errors in a coding environment so prone to failure. We needed a solution. That's when it dawned on me: what if I built a bot to crawl our products, helping our testers find errors and produce higher quality work with greater efficiency? While several frameworks existed in languages like Python and JavaScript, nothing was available at the time to help my team support products using PHP. That need inspired me to build my own PHP crawling framework, CrawlZone.

So in this post, I'd like to go over some architectural decisions and mistakes I made while building the library, along with some usage guidelines and legal considerations. It is oriented toward software engineers but might also interest anyone who wants a general idea of how data scraping works.

1. Testing

After watching my QA coworkers break my stuff with sarcastic smiles, I decided to make some changes and automate my testing as much as possible. Automated tests give you a huge advantage and a lot of confidence in every phase of crafting and massaging your code.

It took me a while to figure out what I needed for a testing environment: you have to be able to crawl several domains, filter URIs, and test delays and redirections.

Anyway, to cut a long story short, Docker is all you need. It lets you set up a multi-domain infrastructure in minutes without diving into configuration hell. Additionally, TravisCI, my favorite continuous integration platform, can build and run Docker images. If you are curious how I set up TravisCI with Docker, here are the links to my .travis.yml and docker-compose.yml files. I also use a Makefile to create a few useful shortcuts.

2. Architecture

Git only knows how many times I completely rewrote this architecture from scratch. It evolved from a simple recursive function into what you see in Figure 1.

At its heart is a Request Scheduler (Engine), which makes asynchronous requests using the Guzzle HTTP client, coordinates the events, and runs the middleware stack.
[Image: Web Crawler Architecture diagram]
Figure 1. The Engine

Here is what's happening for a single request when you run the client:

  1. The client queues the initial request (start_uri).
  2. The engine looks at the queue and checks if there are any requests.
  3. The engine gets the request from the queue and emits the BeforeRequestSent event. If the depth option is set in the config, the RequestDepth extension validates the depth of the request. If the obey robots.txt option is set in the config, the RobotTxt extension checks that the request complies with the rules. If the request doesn't comply, the engine emits the RequestFailed event and gets the next request from the queue.
  4. The engine uses the request middleware stack to pass the request through it.
  5. The engine sends an asynchronous request using the Guzzle HTTP client.
  6. The engine emits the AfterRequestSent event and stores the request in the history to avoid crawling the same request again.
  7. When response headers are received, but the body has not yet begun to download, the engine emits the ResponseHeadersReceived event.
  8. The engine emits the TransferStatisticReceived event. If the autothrottle option is set in the config, then the AutoThrottle extension is executed.
  9. The engine uses the response middleware stack to pass the response through it.
  10. The engine emits the ResponseReceived event. Additionally, if the response status code is greater than or equal to 400, the engine emits the RequestFailed event.
  11. The ResponseReceived event triggers the ExtractAndQueueLinks extension, which extracts and queues the links. The process starts over until the queue is empty.
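The steps above can be condensed into a simplified, synchronous sketch. The function and variable names here are illustrative only, not CrawlZone's actual API, and the real engine is asynchronous and event-driven:

```php
<?php
// Simplified, synchronous sketch of the crawl loop described above.
// Names are illustrative; the real engine sends asynchronous requests
// and coordinates extensions and middleware through events.

function crawl(array $startUris, callable $fetch, callable $extractLinks): array
{
    $queue = $startUris;   // 1. queue the initial requests
    $history = [];         // requests already sent (see step 6)
    $responses = [];

    while ($queue) {                        // 2. check the queue
        $uri = array_shift($queue);         // 3. get the next request
        if (isset($history[$uri])) {
            continue;                       // avoid crawling the same request again
        }
        $history[$uri] = true;              // 6. store in history
        $response = $fetch($uri);           // 5. send the request
        $responses[$uri] = $response;       // 10. response received
        foreach ($extractLinks($response) as $link) {
            $queue[] = $link;               // 11. extract and queue links
        }
    }

    return $responses;                      // loop ends when the queue is empty
}
```

With a fake `$fetch` and a small link graph, the loop visits each page exactly once, which is the essential invariant the history table protects.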

3. Extensions

As you can see from the overview, the architecture heavily relies on events to decouple the logic and to ensure code extensibility. The extensions are nothing more than event listeners based on the Symfony Event Dispatcher component.

To create an extension, all you need to do is extend the Extension class and add it to the client:

use Crawlzone\Client;
use Crawlzone\Extension\Extension;

$config = [
    'start_uri' => ['https://httpbin.org/status/200'],
];

$client = new Client($config);

$client->addExtension(new class() extends Extension {
    // Subscribe to the events you need and implement the
    // corresponding handler methods here.
});

All extensions have access to the Queue, so you can schedule additional requests, and to the HTTP client, so you can make new requests directly (for authentication purposes, for example).

There are quite a few events to hook into the execution process for a wide variety of purposes:

  1. BeforeEngineStarted - Useful when you want to initialize something before you start crawling. I use it for creating the SQLite database to store the queue and the history.

  2. BeforeRequestSent - You can use it to validate the request before it gets sent. I use it to verify the request depth and to comply with robots.txt.

  3. AfterRequestSent - Useful when you want to perform certain actions after the request was sent, but no response was received yet.

  4. TransferStatisticReceived - This is an interesting one. It is dispatched when the handler finishes sending a request, giving you access to the lower-level transfer details. It is used to automatically throttle the delay between requests in the AutoThrottle extension.

  5. ResponseHeadersReceived - Emitted when the HTTP headers of the response have been received but the body has not yet begun to download. Useful if you want to reject responses that are greater than a certain size for example.

  6. RequestFailed - Dispatched when the request fails (status code 5xx) or when an extension or middleware throws InvalidRequestException. You can use it to log 5xx server errors, for example.

  7. ResponseReceived - Useful for logging the response and getting the data. I use it to extract and queue the links. Also, instead of following redirects immediately, I schedule them for later, which gives more consistent behavior when processing requests.

  8. AfterEngineStopped - Occurs when the queue is empty, and there are no requests to send anymore. Use it to perform cleanup or send notifications.
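Under the hood, this event-driven design boils down to a listener registry. Here is a minimal standalone sketch of the pattern; the class below is illustrative only, since CrawlZone delegates this job to the Symfony Event Dispatcher component:

```php
<?php
// Minimal event-dispatcher sketch showing how extensions plug in:
// listeners register per event name and are invoked when the engine
// emits that event. Illustrative only, not CrawlZone's actual classes.

final class Dispatcher
{
    /** @var array<string, callable[]> */
    private array $listeners = [];

    public function addListener(string $event, callable $listener): void
    {
        $this->listeners[$event][] = $listener;
    }

    public function dispatch(string $event, $payload = null): void
    {
        // Invoke every listener registered for this event, in order.
        foreach ($this->listeners[$event] ?? [] as $listener) {
            $listener($payload);
        }
    }
}
```

An extension is then just a bundle of such listeners; the engine emits the event names from the list above, and any number of extensions can react without knowing about each other.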

4. Middlewares

While extensions allow you to hook into the crawling process, they won't let you modify the request or the response (to add extra headers, for example). This is where middlewares come in handy. I built a simple middleware stack for this purpose. To create a request or response middleware, just implement the RequestMiddleware or ResponseMiddleware interface and add it to the client. Here is an example:

use Psr\Http\Message\RequestInterface;
use Crawlzone\Client;
use Crawlzone\Middleware\RequestMiddleware;

$config = [
    'start_uri' => ['https://httpbin.org/ip']
];

$client = new Client($config);

$client->addRequestMiddleware(
    new class implements RequestMiddleware {
        public function processRequest(RequestInterface $request): RequestInterface
        {
            $request = $request->withHeader("User-Agent", "Mybot/1.1");
            return $request;
        }
    }
);

5. Storage

I first attempted to store the history and the queue in memory. It worked well for small websites but struggled to scale: once you need to crawl thousands of pages, this approach fails because you lose all progress when the process is interrupted.

I then started writing my own handler to store the history and the queue on disk, but I soon realized that retrieving the data efficiently would mean implementing a B-tree myself.

I needed a storage engine with a simple API that could store and index data in memory for testing or on disk for real crawls. This is where SQLite shines; according to its website, it is the most used database engine in the world. However, coupling a specific database engine to your architecture is a huge mistake. Abstracting the storage away is the way to go, as it gives you the flexibility to swap the database engine later if you need to.
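A minimal sketch of what that abstraction can look like. The interface and class names are hypothetical, not CrawlZone's actual API; the point is only that callers depend on the interface, not on SQLite:

```php
<?php
// Hypothetical storage abstraction: the engine talks to an interface,
// so a SQLite-backed queue and an in-memory test queue are
// interchangeable. Names are illustrative, not CrawlZone's API.

interface QueueStorage
{
    public function enqueue(string $request): void;
    public function dequeue(): ?string;
    public function isEmpty(): bool;
}

final class InMemoryQueue implements QueueStorage
{
    private array $requests = [];

    public function enqueue(string $request): void
    {
        $this->requests[] = $request;
    }

    public function dequeue(): ?string
    {
        // Returns null when the queue is empty.
        return array_shift($this->requests);
    }

    public function isEmpty(): bool
    {
        return $this->requests === [];
    }
}

// A production implementation would wrap PDO with the sqlite driver,
// e.g. new PDO('sqlite:/path/to/crawl.db'), behind the same interface.
```

Swapping engines later then becomes a one-line change where the queue is constructed, rather than a rewrite of the engine.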

Another problem I faced was ensuring that I didn't queue the same request twice. Initially, I thought normalizing the absolute URI and storing it as a primary key would be enough. However, this approach doesn't work if you want to queue a POST request, for example. After researching the subject and looking at similar frameworks, in particular Python's Scrapy, I found that using a request fingerprint was the best and most flexible option.

The request fingerprint is a hash that uniquely identifies the resource that the request points to. For example, take the following three requests:

 1. GET http://example.com/query?foo=1&bar=2
 2. GET http://example.com/query?bar=2&foo=1
 3. POST http://example.com/query?foo=2&bar=1

Even though the first two requests have different URIs, they point to the same resource and are equivalent: they should return the same response. The third request is entirely different, since it uses another request method.

Authorization headers are another excellent example: many sites use cookies to store the session ID, which adds a random component to otherwise identical requests.

Calculating the fingerprint is straightforward: it is simply a hash of the method, the normalized URI, and the request body. By default, the request headers are not included in the calculation, because otherwise you might end up queuing the same request multiple times. I left this option configurable in case I need it later.
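A minimal sketch of such a fingerprint, assuming sorted query parameters and a SHA-256 hash. The function name and normalization details are illustrative, not CrawlZone's exact implementation:

```php
<?php
// Sketch of a request fingerprint: a hash over the normalized method,
// URI (with query parameters sorted), and body. Headers are excluded.
// Illustrative only; details differ in the real library.

function requestFingerprint(string $method, string $uri, string $body = ''): string
{
    $parts = parse_url($uri);

    // Sort the query parameters so that ?foo=1&bar=2 and ?bar=2&foo=1
    // normalize to the same string.
    $query = '';
    if (isset($parts['query'])) {
        parse_str($parts['query'], $params);
        ksort($params);
        $query = http_build_query($params);
    }

    $normalized = strtoupper($method) . '|'
        . strtolower($parts['scheme'] ?? 'http') . '://'
        . strtolower($parts['host'] ?? '')
        . ($parts['path'] ?? '/')
        . ($query !== '' ? '?' . $query : '')
        . '|' . $body;

    return hash('sha256', $normalized);
}
```

With this, the first two example requests above produce the same fingerprint, while the POST request produces a different one.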

6. Niceness: The Importance of Whitehat Crawling

Making the crawler polite wasn't one of my original requirements, since it can be convenient to be rude sometimes. I only realized the importance of ethical crawling during the testing phase, when I flooded my testing environment with so many requests that it effectively became unavailable.

My first attempt at a fix was a static delay between requests. It worked fine for the most part, except that the crawler became extremely slow. For production that is actually ideal, because the point is to avoid affecting the site's performance in any way; for local testing, however, you want feedback as soon as possible.

Ideally, therefore, the crawler should dynamically adjust the delay between requests based on the current server load. This is where the auto-throttle extension comes into play: it automatically adjusts the delay based on the average response time, the response status, and the concurrency.
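A sketch of the idea, in the spirit of Scrapy's AutoThrottle algorithm: the next delay moves toward latency divided by a target concurrency, and it is never decreased after an error response. The parameter names and constants are illustrative, not the extension's actual code:

```php
<?php
// Illustrative auto-throttle step: smooth the delay toward
// latency / target_concurrency, never speeding up on errors,
// and clamp it to a configured range.

function nextDelay(
    float $currentDelay,
    float $latency,          // last response time in seconds
    int $statusCode,
    float $targetConcurrency = 2.0,
    float $minDelay = 0.1,
    float $maxDelay = 60.0
): float {
    $target = $latency / $targetConcurrency;
    $delay = ($currentDelay + $target) / 2;  // average toward the target

    if ($statusCode >= 400 && $delay < $currentDelay) {
        $delay = $currentDelay;              // back off: never speed up on errors
    }

    return max($minDelay, min($maxDelay, $delay));
}
```

Fast, healthy responses gradually shorten the delay; slow responses or errors lengthen or hold it, so the crawler adapts to the server's current load.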

Obeying the rules of the robots.txt file was another excellent feature to have, and it was easy to implement using extensions.
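For illustration, here is a heavily simplified robots.txt check. Real parsers also handle Allow rules, specific user agents, and wildcard patterns; this sketch is not CrawlZone's implementation:

```php
<?php
// Minimal robots.txt check: collect Disallow rules under the wildcard
// user agent and test a path against them by prefix. Illustrative only.

function isAllowedByRobotsTxt(string $robotsTxt, string $path): bool
{
    $disallowed = [];
    $appliesToUs = false;

    foreach (preg_split('/\r?\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line));  // strip comments
        if (preg_match('/^User-agent:\s*(.+)$/i', $line, $m)) {
            $appliesToUs = ($m[1] === '*');
        } elseif ($appliesToUs && preg_match('/^Disallow:\s*(\S+)/i', $line, $m)) {
            $disallowed[] = $m[1];
        }
    }

    // A path is disallowed if any rule is a prefix of it.
    foreach ($disallowed as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }

    return true;
}
```

Hooked into the BeforeRequestSent event, a check like this lets the crawler drop non-compliant requests before they ever leave the queue.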

7. Legal

When I started working on the crawler, I was a bit naive in thinking that I could use CrawlZone to crawl whatever I wanted. I assumed that since the data is accessible through a browser, it is public and I can collect it however I wish. It turns out that scraping data can be a risky endeavor.

Take a look at this part of the user agreement from www.linkedin.com:

8.2. Don’ts
m. Use bots or other automated methods to access the Services, add or download contacts, send or redirect messages;

Facebook has similar terms:

What you can share and do on Facebook:
You may not access or collect data from our Products using automated means (without our prior permission) or attempt to access data you do not have permission to access.

So before you start using the crawler in the wild, study and comply with each site's terms of use as a precaution against potential legal action. And I'm not giving you any advice on that; please consult your lawyer.


Working on CrawlZone has been a fun and challenging project. I learned a lot about web crawling and data scraping, and it has made life much easier for the QA teams I work with.

Feel free to check out the library crawlzone/crawlzone.
Let me know what you think about the article. I welcome your feedback. Thanks!
