How to use Boto3 to download multiple files from S3 in parallel?
Introduction
AWS Boto3 is the Python SDK for AWS. Boto3 can be used to directly interact with AWS resources from Python scripts.
Boto3’s S3 API doesn’t offer a single call to download multiple files from an S3 bucket in parallel. Out of the box you can only download one file at a time, which is slow when you have many files to fetch.
In this tutorial, we will look at how we can download multiple files in parallel to speed up the process of downloading multiple files from S3.
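For reference, here is a minimal sketch of the naive one-file-at-a-time approach that the rest of this tutorial speeds up. The bucket name and keys below are placeholders, not real values:

```python
import boto3

s3_client = boto3.client("s3")

# Hypothetical bucket and keys, for illustration only
for key in ["feed.xml", "index.html", "robots.txt"]:
    # Each download blocks until it completes before the next one starts
    s3_client.download_file("my-test-bucket", key, key)
```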
Table of contents
- Introduction
- Python Multiprocessing & Multithreading
- Boto3 w/ Multithreading
- Downloading multiple files using Multiprocessing
- Downloading multiple files using Multithreading
- Performance
Python Multiprocessing & Multithreading
There are two ways to download multiple files in parallel:
- Multiprocessing
- Multithreading
Because of Python's global interpreter lock (GIL), multiprocessing is the only way to achieve true parallelism. It takes advantage of multiple CPUs and cores, and it is ideal when tasks are CPU-bound.
Multithreading, on the other hand, is useful when tasks are I/O-bound. Threads have a lower memory footprint, and their ability to share memory makes them a great fit for I/O-bound applications such as downloading files.
To understand the differences between the two in more detail, check out this topic on Stack Overflow.
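Conveniently, both strategies expose the same Executor interface in the standard library, so switching between them is a one-line change. A minimal sketch:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# Swap in ProcessPoolExecutor for CPU-bound work;
# the rest of the code stays exactly the same.
Executor = ThreadPoolExecutor

if __name__ == "__main__":
    with Executor(max_workers=4) as executor:
        future = executor.submit(pow, 2, 10)
        print(future.result())  # 1024
```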
Boto3 w/ Multithreading
According to the Boto3 docs, clients are thread-safe; Resources and Sessions, however, are not. Therefore, we will be using the low-level client to ensure our code is thread-safe.
This article dives deeper into the differences between Clients & Resources.
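As a quick illustration of the two interfaces (both calls below are real Boto3 APIs; the bucket name is a placeholder):

```python
import boto3

# Low-level client: thread-safe, maps one-to-one onto the S3 API
client = boto3.client("s3")
response = client.list_objects_v2(Bucket="my-test-bucket")

# Higher-level resource: a more Pythonic object model, but not thread-safe
s3 = boto3.resource("s3")
keys = [obj.key for obj in s3.Bucket("my-test-bucket").objects.all()]
```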
Downloading multiple files using Multiprocessing
We will be leveraging the ProcessPoolExecutor from the concurrent.futures module to create a process pool and download multiple files from S3 in parallel.
```python
import boto3
from concurrent import futures
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

KEYS_TO_DOWNLOAD = [...]  # all the files that you want to download
BUCKET_NAME = "my-test-bucket"
LOCAL_DOWNLOAD_PATH = "files/"


def download_object(file_name):
    """Downloads an object from S3 to local."""
    # Each worker creates its own client; clients cannot be shared
    # across processes.
    s3_client = boto3.client("s3")
    download_path = Path(LOCAL_DOWNLOAD_PATH) / file_name
    print(f"Downloading {file_name} to {download_path}")
    s3_client.download_file(BUCKET_NAME, file_name, str(download_path))
    return "Success"


def download_parallel_multiprocessing():
    with ProcessPoolExecutor() as executor:
        future_to_key = {executor.submit(download_object, key): key for key in KEYS_TO_DOWNLOAD}

        for future in futures.as_completed(future_to_key):
            key = future_to_key[future]
            exception = future.exception()

            if not exception:
                yield key, future.result()
            else:
                yield key, exception


if __name__ == "__main__":
    for key, result in download_parallel_multiprocessing():
        print(f"{key} result: {result}")
```
Output of the program:

```
Downloading sitemap.xml to files/sitemap.xml
Downloading feed.xml to files/feed.xml
Downloading robots.txt to files/robots.txt
Downloading index.html to files/index.html
index.html result: Success
robots.txt result: Success
sitemap.xml result: Success
feed.xml result: Success
```
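Both examples assume you already know which keys you want. If you instead want every object under a prefix, one way to build KEYS_TO_DOWNLOAD is with a paginator; the bucket name and prefix here are hypothetical:

```python
import boto3


def list_keys(bucket, prefix=""):
    """Collects every object key under a prefix, page by page."""
    s3_client = boto3.client("s3")
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent on empty pages, so default to an empty list
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


KEYS_TO_DOWNLOAD = list_keys("my-test-bucket", prefix="reports/")
```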
Downloading multiple files using Multithreading
We will use the ThreadPoolExecutor from the concurrent.futures module to create a thread pool and download multiple files from S3 in parallel. Unlike the multiprocessing example, we share a single S3 client between the threads, since clients are thread-safe.
```python
import boto3.session
from concurrent import futures
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

KEYS_TO_DOWNLOAD = [...]  # all the files that you want to download
BUCKET_NAME = "my-test-bucket"
LOCAL_DOWNLOAD_PATH = "files/"


def download_object(s3_client, file_name):
    """Downloads an object from S3 to local using the shared client."""
    download_path = Path(LOCAL_DOWNLOAD_PATH) / file_name
    print(f"Downloading {file_name} to {download_path}")
    s3_client.download_file(BUCKET_NAME, file_name, str(download_path))
    return "Success"


def download_parallel_multithreading():
    # Create a session and use it to make our client
    session = boto3.session.Session()
    s3_client = session.client("s3")

    # Dispatch work tasks with our s3_client
    with ThreadPoolExecutor(max_workers=8) as executor:
        future_to_key = {executor.submit(download_object, s3_client, key): key for key in KEYS_TO_DOWNLOAD}

        for future in futures.as_completed(future_to_key):
            key = future_to_key[future]
            exception = future.exception()

            if not exception:
                yield key, future.result()
            else:
                yield key, exception


if __name__ == "__main__":
    for key, result in download_parallel_multithreading():
        print(f"{key} result: {result}")
```
Output of running this program:

```
Downloading feed.xml to files/feed.xml
Downloading index.html to files/index.html
Downloading robots.txt to files/robots.txt
Downloading sitemap.xml to files/sitemap.xml
sitemap.xml result: Success
index.html result: Success
robots.txt result: Success
feed.xml result: Success
```
Performance
The performance of multiprocessing and multithreading was similar in the tests that I ran; both were significantly faster than downloading one file at a time. This makes sense for an I/O-bound task like downloading from S3, where workers spend most of their time waiting on the network rather than the CPU.
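If you want to reproduce the comparison yourself, a minimal timing harness could look like the sketch below. It assumes the two generator functions from the examples above are defined in, or importable into, the same script:

```python
import time


def timed(label, download_generator):
    """Drains a download generator and reports the elapsed wall-clock time."""
    start = time.perf_counter()
    results = list(download_generator())
    elapsed = time.perf_counter() - start
    print(f"{label}: {len(results)} files in {elapsed:.2f}s")


timed("multiprocessing", download_parallel_multiprocessing)
timed("multithreading", download_parallel_multithreading)
```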