Learn how to use os.fork() to quickly crawl websites using Python 3.
So in my initial script, crawling the Steam Web API took a while. Even though I could call the API 1000 times every 15 minutes (which is much faster than doing it by hand), Valve has a default rate limit of 100,000 calls per day. At 1000 calls every 15 minutes, or 4000 calls every hour, I’m only able to call the API about 96,000 times a day. While I could leave a server running 24/7, it’s a waste to use only one core at a time when most modern computers have 4 or more cores.
Additionally, Steam has over 1 billion accounts. Imagine crawling only 100,000 accounts per day – do you really want to use a search engine that updates each page only once every 10,000 days?
Once again, remember that I am using a PostgreSQL database as a ‘queue’ of URLs I need to crawl – hence the mention of psycopg2 in the following examples.
Python, like some other languages, has a function called fork() in the ‘os’ module. Basically, fork() creates a clone of the calling process – that is, a ‘parent’ process can call fork() to create identical ‘child’ processes. You can read more about fork() in the os module documentation.
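To make the parent/child distinction concrete, here’s a minimal sketch: fork() returns 0 in the child and the child’s PID in the parent, which is how each copy of the process knows which one it is.

```python
import os

pid = os.fork()
if pid == 0:
    # Child process: fork() returned 0
    print("child running as PID " + str(os.getpid()))
    os._exit(0)  # exit the child without running any parent-only code
else:
    # Parent process: fork() returned the child's PID
    print("parent " + str(os.getpid()) + " spawned child " + str(pid))
    os.waitpid(pid, 0)  # reap the child so it doesn't linger as a zombie
```

Note that everything after the fork() call runs twice – once in each process – so the if/else on the return value is what keeps the two from doing the same work.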
Before we continue with actual code, it is always a good idea to see whether the modules we are using support forking.
A quick Google search shows that for psycopg2, “The Psycopg module and the connection objects are thread-safe: many threads can access the same database either using separate sessions and creating a connection per thread or using the same connection and creating separate cursors. The difference between the above two approaches is that, using different connections, the commands will be executed in different sessions and will be served by different server processes. On the other hand, using many cursors on the same connection, all the commands will be executed in the same session (and in the same transaction if the connection is not in autocommit mode), but they will be serialized.”
Serialisation is bad here, as it means processes will end up waiting on each other for their PostgreSQL queries to go through – so in this case, creating a new connection and cursor in each child process is a good idea.
Idea 1 – Fork bomb
Operating systems add a layer of abstraction so that programs don’t need to really worry about CPU time allocation. When a core of a CPU is allocated more than one process, it switches between the processes rapidly to give the illusion of multitasking.
My initial idea was to keep spawning child processes from my parent process until the server was overloaded – eventually the parent process would be starved of CPU time and would spawn new children no faster than existing children complete.
Not only is this a bad idea (fork bombs are never nice – unresponsive systems are horrible), but thankfully Python 3 forces you to handle exiting child processes using wait() or waitpid(). If child exits are not handled, the Python script forcefully exits with “ChildProcessError: [Errno 10] No child processes”.
TIP OF THE DAY: FORK BOMBS BAD – ALWAYS HANDLE SIGCHLD USING WAITPID()
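As a quick demonstration of why reaping matters, here’s a minimal sketch: the parent waits on every child it spawned, and once no children remain, the wait family raises exactly the ChildProcessError mentioned above.

```python
import os

# Spawn three children that exit immediately
children = []
for _ in range(3):
    pid = os.fork()
    if pid == 0:
        os._exit(0)          # child: do its work here, then exit
    children.append(pid)     # parent: remember the child's PID

# Parent: reap every child with waitpid() so none become zombies
for pid in children:
    os.waitpid(pid, 0)

# Calling wait() when no children remain raises ChildProcessError
try:
    os.wait()
except ChildProcessError:
    print("all children reaped")
```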
Idea 2 – Batches of Children
The next idea involved spawning multiple child processes using fork(), waiting on all of them to complete, and then repeating the process. Stack Overflow was great for finding example code; my adapted version can be seen below:
```python
import os
import random

LOW_RAND = 1
HIGH_RAND = 100

def crawl(pid):
    bad_offset = random.randint(LOW_RAND, HIGH_RAND)
    print("[PID: " + str(pid) + "] in crawl() with not so random number: " + str(bad_offset))
    # Initialise seed to unique child PID
    # Otherwise inherits seed from parent meaning all random numbers are the same
    random.seed(pid)
    # Generate an offset to read from the start of our PostgreSQL query
    offset = random.randint(LOW_RAND, HIGH_RAND)
    print("[PID: " + str(pid) + "] in crawl() with random number: " + str(offset))
    # Connect to PostgreSQL
    # Grab data from query at row 'offset'
    # Send request to Steam Web API Endpoints
    # Read/Write any necessary files
    # Disconnect from PostgreSQL

j = 0
while True:
    print("[PID: " + str(os.getpid()) + "] PARENT: Spawning batch number " + str(j + 1) + " of children")
    j = j + 1
    children = []
    for i in range(0, 10):
        pid = os.fork()
        if pid == 0:
            crawl(os.getpid())
            os._exit(0)
        else:
            children.append(pid)
    for child in children:
        os.waitpid(child, 0)
```
While this is no doubt faster than not using fork() at all, do you see a problem? You’re spawning children, waiting for all of them to finish, then repeating the process – so the CPU spikes while a batch runs, then sits idle while the parent waits on the slowest child:
That must mean there’s room for optimisation.
Idea 3 – Undying Children
Ideally you would keep a certain number of child processes active at all times. You can do this in two ways: a) upon detecting a child process exiting, spawn a new one to replace it, or b) after initially spawning the child processes, have each child repeat its work in a loop.
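Option a) would look something like this – a minimal sketch where TARGET, crawl_once(), and the bounded replacement loop are my own illustrative placeholders (a real crawler would loop forever):

```python
import os

TARGET = 4  # keep this many children alive at all times (illustrative)

def crawl_once():
    pass  # placeholder for the real crawling work

def spawn():
    pid = os.fork()
    if pid == 0:
        crawl_once()
        os._exit(0)  # child never returns from spawn()
    return pid       # parent gets the child's PID

alive = {spawn() for _ in range(TARGET)}
# Replace each child as it exits, keeping the pool at TARGET
for _ in range(8):               # bounded here for demonstration only
    pid, _status = os.wait()     # blocks until any child exits
    alive.discard(pid)
    alive.add(spawn())
# Drain the remaining children before exiting
while alive:
    try:
        pid, _status = os.wait()
    except ChildProcessError:
        break  # no children left
    alive.discard(pid)
```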
The easiest is probably to just make the child process loop, so let’s rework our previous code to loop infinitely in each child:
```python
import os
import random

LOW_RAND = 1
HIGH_RAND = 100

def crawl(pid):
    bad_offset = random.randint(LOW_RAND, HIGH_RAND)
    print("[PID: " + str(pid) + "] in crawl() with not so random number: " + str(bad_offset))
    # Initialise seed to unique child PID
    # Otherwise inherits seed from parent meaning all random numbers are the same
    random.seed(pid)
    # Generate an offset to read from the start of our PostgreSQL query
    offset = random.randint(LOW_RAND, HIGH_RAND)
    print("[PID: " + str(pid) + "] in crawl() with random number: " + str(offset))
    # Connect to PostgreSQL
    # Grab data from query at row 'offset'
    # Send request to Steam Web API Endpoints
    # Read/Write any necessary files
    # Disconnect from PostgreSQL

children = []
for i in range(0, 10):
    pid = os.fork()
    if pid == 0:
        while True:
            crawl(os.getpid())
        os._exit(0)
    else:
        children.append(pid)
for child in children:
    os.waitpid(child, 0)
```
And our CPU usage:
Optimising use of CPU time (PostgreSQL)
After offloading this python script onto my server, I began to wonder whether I could crawl even faster. Right now the rate of crawling has increased to about 20,000 requests per hour (480,000 requests per day) using the following server configuration:
- Python script running with 20 child processes
- CPU: i5 4570 (4c/4t)
- RAM: 16GB DDR3 1333MT/s (4x4GB)
- SSD: 120GB WD GREEN
However, surely some optimisations could be made; after all, Larry Page and Sergey Brin (the founders of Google) mentioned in their 1998 paper that each crawler they used kept around 300 concurrent connections open, and that on average they indexed 48.5 pages per second. That’s roughly 9 times faster than my current script, on hardware that is ancient – the i5 4570 came out in 2013, 15 years after the paper was published. Additionally, they were crawling real web pages, not just calling an API like I am. So yeah, definitely room for optimisation.
Upon inspection of what’s been hogging my CPU:
Additionally let’s check our query performance using EXPLAIN ANALYZE:
Apparently it takes over 3.6 seconds to execute a ‘SELECT * FROM table ORDER BY field’ query on my database. That probably plays a major role in why my crawling is so slow.
See the line stating ‘Disk 93984kB’? That’s a low-hanging fruit for optimisation – it means PostgreSQL is spilling the sort to disk because it isn’t allocated enough RAM. RAM is a lot faster than disk, so if you have RAM to spare, it is a good idea to check your postgresql.conf file to see whether you can allocate more memory to your queries. If you do decide to change anything in your postgresql.conf file, you may need to restart the service for the changes to take effect – do this by running the command ‘sudo service postgresql restart’.
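For reference, the relevant knob in postgresql.conf is work_mem, which caps how much memory each sort or hash operation may use before it spills to disk. The values below are purely illustrative – tune them to your own machine:

```ini
# postgresql.conf – illustrative values only
work_mem = 64MB          # memory each sort/hash may use before spilling to disk
shared_buffers = 4GB     # PostgreSQL's shared page cache
```

Keep in mind work_mem applies per sort operation, not per server, so multiply it by your expected number of concurrent queries before setting it high.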
However, that’s not the main problem here. Not only is allocating over 100MB of RAM per worker unwise (100MB RAM * 2 workers * 20 concurrent queries = 4GB of RAM spent on queries alone), there’s an easier optimisation. Note how the plan says ‘Seq Scan’? That means the column we are ordering by is not indexed, so PostgreSQL has to read the whole table. Make sure to index any column you plan on searching through regularly, using the expression: ‘CREATE INDEX idx_column ON table(column)’
Since I don’t really need to index this column, I just restricted the number of rows returned, which shortens the time spent sequentially scanning. Voila – a fraction of the memory (less than 1/1000th) and a fraction of the time (less than 1/4). Unfortunately, I had already performed this optimisation to achieve the crawling speeds mentioned above, so I have to look at other options to increase my crawling speed.
But as top shows, PostgreSQL is undoubtedly our CPU hog, and making a query to the database is probably the most costly part of each crawl. Right now I am querying every time I enter my crawl() function; that is, the ratio of URLs crawled to PostgreSQL queries is 1:1.
Is it important for me to crawl every URL in a strict order? Probably not – as long as I am crawling ‘least recently crawled’ URLs more often than ‘recently crawled’ ones. That means I can query once and pick multiple somewhat ‘long time since last crawled’ URLs from a single result set.
Some more example code:
```python
import os
import random

import psycopg2

# DB_NAME, DB_USER and DB_PASS are defined elsewhere in the script

def crawl(cur, conn, pid, row):
    # Send request to Steam Web API Endpoints
    # Read/Write any necessary files
    # Update database timestamps and any unseen URLs
    pass

children = []
# Create 20 processes
for i in range(0, 20):
    pid = os.fork()
    if pid == 0:
        # Create connection to database and reinitialise seed
        conn = psycopg2.connect(host="localhost", database=DB_NAME,
                                user=DB_USER, password=DB_PASS)
        cur = conn.cursor()
        random.seed(os.getpid())
        while True:
            # Query once
            cur.execute("SELECT steamid64, last_crawled FROM to_crawl "
                        "ORDER BY last_crawled LIMIT 1000;")
            crawl_rows = []
            # Add 10 random rows from the query to our crawl list
            for j in range(0, 10):
                offset = random.randint(1, 100)
                for i in range(0, offset):
                    row = cur.fetchone()
                crawl_rows.append(row)
            for row in crawl_rows:
                crawl(cur, conn, os.getpid(), row)
        os._exit(0)
    else:
        children.append(pid)
for child in children:
    os.waitpid(child, 0)
```
Let’s see how long it takes to crawl 2000 files using a ratio of 1 database query to 10 URLs crawled:
So it took about 1 minute 40 seconds for 2000 calls to the Steam Web API, or about 1200 calls per minute. That’s 72,000 calls per hour, and just over 1.7 million crawls per day. Let’s up the ratio to 1 database query per 25 URLs crawled.
Damn – 4 minutes and 50 seconds for exactly the same number of URLs. Note that I had increased the query to return the first 2000 results (I have to minimise collisions between child processes selecting the same rows), which pushed the sort out of memory and back onto disk 🙁 Let’s try fine-tuning our number of child processes instead.
With 40 child processes, at a ratio of 1 database query per 10 URLs crawled, we get:
1 minute and 15 seconds for 2000 queries. Not bad. Just over 2.3 million queries per day.
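These throughput figures are easy to sanity-check with a little arithmetic (integer division keeps the numbers exact):

```python
calls = 2000
seconds = 75                         # 1 minute 15 seconds
per_minute = calls * 60 // seconds   # calls per minute
per_day = calls * 86400 // seconds   # calls per day
print(per_minute)  # 1600
print(per_day)     # 2304000 – just over 2.3 million per day
```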
So from a modest start of 96,000 crawls per day, by introducing multiprocessing and tuning the number of child processes and the number of database queries, we have ended this article at 2.3 million crawls per day. That is an increase of over 2200%! I haven’t implemented it yet, but there’s definitely a better way to reduce the number of DB queries; furthermore, I need a minimum viable product for my Steam search engine before I email Valve for increased API limits. Additionally, this could be optimised further by batching PostgreSQL statements and introducing threading. If this article helped you, or you can think of further ways to optimise crawling speed, please let me know in the comments!