Sooner or later, every data science project faces an inevitable challenge: speed. Larger data sets take longer to process, so you'll eventually have to think about optimizing your algorithm's run time. As most of you already know, parallelization is a necessary step of this optimization. Python offers two built-in libraries for parallelization: multiprocessing and threading. In this article, we'll explore how data scientists can choose between the two and which factors to keep in mind while doing so.

## Parallel Computing and Data Science

As you all know, data science is the science of dealing with large amounts of data and extracting useful insights from them. More often than not, the operations we perform on the data are easily parallelizable, meaning that different processing agents can each run the operation on one piece of the data, then combine the results at the end to get the complete result.

To better visualize parallelizability, let's consider a real-world analogy. Suppose you need to clean three rooms in your home. You can either do it all by yourself, cleaning the rooms one after the other, or you can ask your two siblings to help you out, with each of you cleaning a single room. In the latter approach, each of you works in parallel on a part of the whole task, thus reducing the total time required to complete it. This is parallelizability in action.

Parallel processing can be achieved in Python in two different ways: multiprocessing and threading.
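Before the theory, here is a tiny taste of the split-then-combine pattern (a hypothetical sum-of-chunks sketch using the standard library's `concurrent.futures`; the names `partial_sum` and `chunks` are mine):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each processing agent computes the result for its own piece.
    return sum(chunk)

data = list(range(1_000_000))
# Split the data into four pieces, one per agent.
chunks = [data[i::4] for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, chunks))

# Combine the partial results to get the complete result.
total = sum(partials)
print(total == sum(data))  # True
```

The same shape — split, process in parallel, combine — appears in both of the mechanisms discussed below.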

## Multiprocessing and Threading: Theory

Fundamentally, multiprocessing and threading are two ways to achieve parallel computing, using processes and threads, respectively, as the processing agents. To understand how these work, we have to clarify what processes and threads are.

### Process

A process is an instance of a computer program being executed. Each process has its own memory space it uses to store the instructions being run, as well as any data it needs to store and access to execute.

Threads are components of a process that can run in parallel. There can be multiple threads in a process, and they all share the same memory space, i.e. the memory space of the parent process. This means the code to be executed, as well as all the variables declared in the program, is shared by all the threads.

*A process with multiple threads of execution. Image by Cburnett, CC BY-SA 3.0.*
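This sharing is easy to see in code. In the minimal sketch below (my own example), two threads update the same module-level variable, which both can see because they live in the same process; the lock keeps their updates from interleaving:

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        # The lock prevents two threads from updating the shared
        # variable at the same time (a race condition).
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 200000: both threads saw and updated the same variable
```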

### Technical details

• All the threads of a process live in the same memory space, whereas processes have their separate memory space.

• Threads are more lightweight and have lower overhead compared to processes. Spawning processes is a bit slower than spawning threads.

• Sharing objects between threads is easier, as they share the same memory space. To achieve the same between processes, we have to use some kind of IPC (inter-process communication) model, typically provided by the OS.

### Pitfalls of Parallel Computing

Introducing parallelism to a program does not always improve performance; there are some pitfalls to be aware of. The most important ones are as follows.

• Starvation: Starvation occurs when a thread is denied access to a particular resource for long periods of time, slowing the overall program down. It can happen as an unintended side effect of a poorly designed thread-scheduling algorithm.
• Deadlock: Overusing mutex locks has a downside: it can introduce deadlocks. A deadlock occurs when one thread is waiting for another thread to release a lock, while that other thread is itself waiting for a resource the first thread is holding. Both threads come to a standstill and the program halts. Deadlock can be thought of as an extreme case of starvation. To avoid it, be careful not to introduce interdependent locks.
• Livelock: A livelock occurs when threads keep running in a loop without making any real progress. It also arises from poor design and improper use of mutex locks.
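One common defense against deadlock is to always acquire locks in a single, globally agreed order. A minimal sketch of that discipline (my own example; the names are illustrative):

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
results = []

def task(name):
    # Both threads acquire the locks in the SAME order (a, then b).
    # If one thread took a->b and the other b->a, each could end up
    # holding one lock while waiting for the other: a deadlock.
    with lock_a:
        with lock_b:
            results.append(name)

t1 = threading.Thread(target=task, args=("t1",))
t2 = threading.Thread(target=task, args=("t2",))
t1.start(); t2.start()
t1.join(); t2.join()
print(sorted(results))  # ['t1', 't2']
```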

## Multiprocessing and Threading in Python

### The Global Interpreter Lock

When it comes to Python, there are some oddities to keep in mind. We know that threads share the same memory space, so special precautions must be taken so that two threads don't write to the same memory location. The CPython interpreter handles this using a mechanism called the `GIL`, or the Global Interpreter Lock.

From the Python wiki:

> In CPython, the global interpreter lock, or GIL, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once. This lock is necessary mainly because CPython's memory management is not thread-safe.

Check the slides here for a more detailed look at the Python `GIL`.

The `GIL` gets its job done, but at a cost: it effectively serializes instructions at the interpreter level. Before a thread can execute any Python bytecode, it must acquire the global lock. Only one thread can hold that lock at a time, so the interpreter ultimately runs bytecode serially. This design keeps memory management thread-safe, but as a consequence, a Python program can't utilize multiple CPU cores through threads at all. On the single-core CPUs the designers had in mind while developing CPython, this isn't much of a problem, but the global lock becomes a bottleneck if you're using multi-core CPUs.
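You can observe this serialization yourself with a classic countdown experiment (a sketch of mine; the count is arbitrary and absolute timings will vary by machine). Because of the GIL, the threaded version is typically no faster than the sequential one on CPU-bound work, and often a little slower due to lock contention:

```python
import threading
import time

def count_down(n):
    # Pure CPU-bound bytecode: no IO, no sleeping.
    while n > 0:
        n -= 1

N = 5_000_000

# Sequential: two calls, one after the other.
start = time.perf_counter()
count_down(N)
count_down(N)
sequential = time.perf_counter() - start

# Threaded: two threads running "in parallel".
start = time.perf_counter()
t1 = threading.Thread(target=count_down, args=(N,))
t2 = threading.Thread(target=count_down, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
threaded = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```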

This bottleneck, however, becomes irrelevant if your program has a more severe bottleneck elsewhere, for example in network, IO, or user interaction. In those cases, threading is an entirely effective method of parallelization. But for programs that are CPU bound, threading ends up making the program slower. Let's explore this with some example use cases.

### Use Cases for Threading

GUI programs use threading all the time to make applications responsive. For example, in a text editing program, one thread can take care of recording the user inputs, another can be responsible for displaying the text, a third can do spell-checking, and so on. Here, the program has to wait for user interaction, which is the biggest bottleneck. Using multiprocessing won't make the program any faster.

Another use case for threading is programs that are IO bound or network bound, such as web scrapers. Here, multiple threads can take care of scraping multiple webpages in parallel. The threads have to download the webpages from the Internet, and that will be the biggest bottleneck, so threading is a perfect solution here. Web servers, being network bound, work similarly; with them, multiprocessing doesn't have any edge over threading. Another relevant example is TensorFlow, which uses a thread pool to transform data in parallel.
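A scraper-shaped sketch of this idea, with `time.sleep` standing in for network latency so the example stays self-contained (the URLs are made up and never actually fetched). While a thread waits on IO (or here, on sleep), it releases the GIL, so the waits overlap:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical page list; nothing is downloaded in this sketch.
PAGES = [f"https://example.com/page/{i}" for i in range(8)]

def fetch(url):
    time.sleep(0.2)  # simulated network wait; the GIL is released here
    return f"<html>{url}</html>"

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(fetch, PAGES))
elapsed = time.perf_counter() - start

# All eight "downloads" overlap, so this takes ~0.2s rather than ~1.6s.
print(f"fetched {len(pages)} pages in {elapsed:.2f}s")
```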

### Use Cases for Multiprocessing

Multiprocessing outshines threading in cases where the program is CPU intensive and doesn't have to do any IO or user interaction. For example, any program that just crunches numbers will see a massive speedup from multiprocessing; in fact, threading will probably slow it down. An interesting real-world example is the PyTorch DataLoader, which uses multiple subprocesses to load data onto the GPU.

### Parallelization in Python, in Action

Python offers two libraries, `multiprocessing` and `threading`, for the eponymous parallelization methods. Despite the fundamental difference between them, the two libraries offer a very similar API (as of Python 3.7). Let's see them in action.

```
import threading
import random
from functools import reduce

def func(number):
    random_list = random.sample(range(1000000), number)
    return reduce(lambda x, y: x*y, random_list)

number = 50000
thread1 = threading.Thread(target=func, args=(number,))
thread2 = threading.Thread(target=func, args=(number,))

thread1.start()
thread2.start()

thread1.join()
thread2.join()
```

You can see that I've created a function `func` that creates a list of random numbers and then multiplies all the elements of it sequentially. This can be a fairly heavy process if the number of items is large enough, say 50k or 100k.

Then, I've created two threads that will execute the same function. The thread objects have a `start` method that starts the thread asynchronously. If we want to wait for them to terminate and return, we have to call the `join` method, and that's what we have done above.

As you can see, the API for spinning up a new thread for a task in the background is pretty straightforward. What's great is that the API for multiprocessing is almost exactly the same; let's check it out.

```
import multiprocessing
import random
from functools import reduce

def func(number):
    random_list = random.sample(range(1000000), number)
    return reduce(lambda x, y: x*y, random_list)

number = 50000
process1 = multiprocessing.Process(target=func, args=(number,))
process2 = multiprocessing.Process(target=func, args=(number,))

process1.start()
process2.start()

process1.join()
process2.join()
```

There it is: just swap `threading.Thread` with `multiprocessing.Process` and you have the exact same program implemented using multiprocessing.

There's obviously a lot more you can do with this, but that's not within the scope of this article, so we won't go into it here. Check out the docs here and here if you're interested in learning more.

### Benchmarks

Now that we have an idea of what code implementing parallelization looks like, let's get back to the performance issues. As we've noted before, threading is not suitable for CPU bound tasks; in those cases the GIL ends up being a bottleneck. We can validate this using some simple benchmarks.

First, let's see how threading compares against multiprocessing for the code sample I showed you above. Keep in mind that this task does not involve any kind of IO, so it's a purely CPU bound task.

Next, let's see a similar benchmark for an IO bound task, using the following function:

```
import requests

def func(number):
    url = 'http://example.com/'
    for i in range(number):
        response = requests.get(url)
        with open('example.com.txt', 'w') as output:
            output.write(response.text)
```

The function simply fetches a webpage and saves it to a local file, multiple times in a loop. It's useless but straightforward, and thus a good fit for demonstration. Let's look at the benchmark.

Now there are a few things to note from these two charts:

• In both cases, a single process took more execution time than a single thread. Evidently, processes have more overhead than threads.

• For the CPU bound task, multiple processes perform way better than multiple threads. However, the difference becomes slightly less prominent at 8x parallelization. As the processor in my laptop is quad-core, only up to four processes can use the multiple cores effectively, so adding more processes doesn't scale as well. Still, multiprocessing outperforms threading by a lot, because threading can't utilize the multiple cores at all.

• For the IO-bound task, the bottleneck is not the CPU, so the usual GIL limitations don't apply and multiprocessing has no advantage. Better yet, the lighter overhead of threads makes them faster than processes, and threading ends up consistently outperforming multiprocessing.
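For reference, a benchmark of this shape can be sketched as follows (my own simplified reconstruction, not the exact script behind the measurements; the task size is arbitrary and absolute numbers will differ per machine):

```python
import multiprocessing
import threading
import time

def cpu_task(n=2_000_000):
    # A purely CPU bound unit of work.
    while n > 0:
        n -= 1

def benchmark(worker_cls, n_workers):
    # Time n_workers threads or processes running cpu_task concurrently.
    workers = [worker_cls(target=cpu_task) for _ in range(n_workers)]
    start = time.perf_counter()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    for n in (1, 2, 4):
        t = benchmark(threading.Thread, n)
        p = benchmark(multiprocessing.Process, n)
        print(f"{n} workers -> threads: {t:.2f}s, processes: {p:.2f}s")
```

Swapping `cpu_task` for an IO bound function would reproduce the second comparison.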

# Differences, Merits and Drawbacks

• Threads run in the same memory space; processes have separate memory.

• Following from the previous point: sharing objects between threads is easier, but the flip side is that you have to take extra measures for object synchronization, to make sure that two threads don't write to the same object at the same time and no race condition occurs.

• Because of the added programming overhead of object synchronization, multi-threaded programming is more bug-prone. Multiprocess programming, on the other hand, is easier to get right.

• Threads have lower overhead compared to processes; spawning processes takes more time than spawning threads.

• Due to limitations put in place by the GIL in Python, threads can't achieve true parallelism utilizing multiple CPU cores. Multiprocessing does not have any such restrictions.

• Process scheduling is handled by the OS. Threads are scheduled by the OS as well, but in CPython the GIL decides which thread actually gets to execute Python bytecode at any given moment.

• Child processes are interruptible and killable, whereas child threads are not; you can only wait for threads to terminate or `join` them.

From all this discussion, we can conclude the following:

• Threading should be used for programs involving IO or user interaction.

• Multiprocessing should be used for CPU bound, computation-intensive programs.

# From the Perspective of a Data Scientist

A typical data processing pipeline can be divided into the following steps:

1. Reading the raw data and storing it in main memory or on the GPU
2. Doing the computation, using either the CPU or the GPU
3. Storing the mined information in a database or on disk

Let's explore how we could introduce parallelism into these tasks so that they can be sped up.

Step 1 involves reading data from disk, so clearly disk IO is going to be the bottleneck for this step. As we've discussed, threads are the best option for parallelizing this kind of operation. Similarly, step 3 is also an ideal candidate for the introduction of threading.

However, step 2 consists of computations that involve the CPU or a GPU. If it's a CPU based task, using threading will be of no use; instead, we have to go for multiprocessing. Only then will we be able to exploit the multiple cores of the CPU and achieve parallelism. If it's a GPU based task, the GPU already implements a massively parallel architecture at the hardware level, so using the correct interface (libraries and drivers) to interact with the GPU should take care of the rest.

Now you may be thinking, "My data pipeline looks a bit different from this; I have some tasks that don't really fit into this general framework." Still, you should be able to apply the methodology used here to decide between threading and multiprocessing. The factors you should consider are:

• Whether your task has any form of IO
• Whether IO is the bottleneck of your program
• Whether your task depends upon a large amount of computation by the CPU

With these factors in mind, together with the takeaways above, you should be able to make the decision. Also, keep in mind that you don't have to use a single form of parallelism throughout your program; use whichever is suitable for each particular part of it.

Now we'll look at two example scenarios a data scientist might face and how you can use parallel computing to speed them up.

## Scenario: Downloading Emails

Let's say you want to analyze all the emails in the inbox of your own home-grown startup and understand the trends: who are the most frequent senders, what are the most common keywords appearing in the emails, which day of the week or which hour of the day do you receive the most emails, and so on. The first step of this project would be, of course, downloading the emails to your computer.

At first, let's do it sequentially, without any parallelization. The code is below and should be pretty self-explanatory: the function `download_emails` takes a list of email ids as input and downloads them sequentially, and the script then calls this function with a list of 100 email ids at once.

```
import imaplib
import time

IMAP_SERVER = 'imap.gmail.com'

def download_emails(ids):
    # Login omitted for brevity; in practice you would authenticate
    # with client.login() before selecting a mailbox.
    client = imaplib.IMAP4_SSL(IMAP_SERVER)
    client.select()
    for i in ids:
        _, data = client.fetch(i, '(RFC822)')
        with open(f'emails/{i.decode()}.eml', 'wb') as f:
            f.write(data[0][1])
    client.close()

start = time.time()

client = imaplib.IMAP4_SSL(IMAP_SERVER)
client.select()
_, ids = client.search(None, 'ALL')
ids = ids[0].split()
ids = ids[:100]
client.close()

download_emails(ids)
print('Time:', time.time() - start)
```

Time taken: `35.65300488471985` seconds.

Now let's introduce some parallelization into this task to speed things up. Before we dive into writing the code, we have to decide between threading and multiprocessing. As you've learned so far, threads are the best option for tasks where IO is the bottleneck. The task at hand obviously belongs to this category, as it accesses an IMAP server over the internet. So we'll be going with `threading`.

Much of the code is the same as in the sequential case. The only difference is that we split the list of 100 email ids into 10 smaller chunks of 10 ids each, then create 10 threads and call the `download_emails` function with a different chunk from each of them. I'm using the `concurrent.futures.ThreadPoolExecutor` class from the Python standard library for threading.

```
import imaplib
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

IMAP_SERVER = 'imap.gmail.com'

def download_emails(ids):
    # Each thread opens its own IMAP connection (login omitted for brevity).
    client = imaplib.IMAP4_SSL(IMAP_SERVER)
    client.select()
    for i in ids:
        _, data = client.fetch(i, '(RFC822)')
        with open(f'emails/{i.decode()}.eml', 'wb') as f:
            f.write(data[0][1])
    client.close()

start = time.time()

client = imaplib.IMAP4_SSL(IMAP_SERVER)
client.select()
_, ids = client.search(None, 'ALL')
ids = ids[0].split()
ids = ids[:100]
client.close()

number_of_chunks = 10
chunk_size = 10
executor = ThreadPoolExecutor(max_workers=number_of_chunks)
futures = []
for i in range(number_of_chunks):
    chunk = ids[i*chunk_size:(i+1)*chunk_size]
    futures.append(executor.submit(download_emails, chunk))

for future in as_completed(futures):
    pass
executor.shutdown()
print('Time:', time.time() - start)
```

Time taken: `9.841094255447388` seconds.

As you can see, threading sped it up considerably.

## Scenario: Classification Using Scikit-Learn

Let's say you have a classification problem, and you want to use a random forest classifier for this. As it's a standard and well-known machine learning algorithm, let's not reinvent the wheel and just use `sklearn.ensemble.RandomForestClassifier`.

The code below is for demonstration purposes. I have created a classification dataset using the helper function `sklearn.datasets.make_classification`, then trained a `RandomForestClassifier` on it. I'm also timing the part of the code that does the core work of fitting the model.

```
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
import time

X, y = datasets.make_classification(n_samples=10000, n_features=50, n_informative=20, n_classes=10)

start = time.time()
model = RandomForestClassifier(n_estimators=500)
model.fit(X, y)
print('Time:', time.time()-start)
```

Time taken: `34.17733192443848` seconds.

Now we'll look into how we can reduce the running time of this algorithm. We know that this algorithm can be parallelized to some extent, but what kind of parallelization would be suitable? It does not have any IO bottleneck; on the contrary, it's a very CPU intensive task. So multiprocessing would be the logical choice.

Fortunately, `sklearn` has already implemented multiprocessing for this algorithm, so we won't have to write it from scratch. As you can see in the code below, we just have to provide the parameter `n_jobs`, the number of processes it should use, to enable multiprocessing.

```
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
import time

X, y = datasets.make_classification(n_samples=10000, n_features=50, n_informative=20, n_classes=10)

start = time.time()
model = RandomForestClassifier(n_estimators=500, n_jobs=4)
model.fit(X, y)
print('Time:', time.time()-start)
```

Time taken: `14.576200723648071` seconds.

As expected, multiprocessing made it quite a bit faster.

# Conclusion

Most if not all data science projects will see a massive speedup from parallel computing. In fact, many popular data science libraries already have parallelism built in; you just have to enable it. So before trying to implement it on your own, look through the documentation of the library you're using and check whether it supports parallelism (by the way, I definitely recommend checking out dask). If it doesn't, this article should assist you in implementing it on your own.

# About the Author

Sumit is a computer enthusiast who started programming at an early age; he's currently finishing his master's degree in computer science at IIT Delhi. Whenever he isn't programming, you can probably find him either reading philosophy, playing the guitar, taking photos, or blogging. You can connect with Sumit on Twitter, LinkedIn, Github, and his website.