How to use Boto3 to download multiple files from S3 in parallel?
Introduction
AWS Boto3 is the Python SDK for AWS. Boto3 can be used to directly interact with AWS resources from Python scripts.
Boto3’s S3 API doesn’t offer a single call to download multiple files from an S3 bucket in parallel. Out of the box you can only download one file at a time, which is slow when you have many files to fetch.
In this tutorial, we will look at how we can download multiple files in parallel to speed up the process of downloading multiple files from S3.
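For reference, here is a minimal sketch of the naive one-file-at-a-time approach that the rest of this tutorial speeds up. The bucket name and keys below are placeholders, not real values:

```python
import boto3

s3_client = boto3.client("s3")

# Hypothetical bucket and keys, for illustration only
for key in ["feed.xml", "index.html", "robots.txt"]:
    # Each download blocks until it completes before the next one starts
    s3_client.download_file("my-test-bucket", key, key)
```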
Table of contents
- Introduction
- Python Multiprocessing & Multithreading
- Boto3 w/ Multithreading
- Downloading multiple files using Multiprocessing
- Downloading multiple files using Multithreading
- Performance
Python Multiprocessing & Multithreading
There are two ways to download multiple files in parallel:
- Multiprocessing
- Multithreading
Because of Python's global interpreter lock (GIL), multiprocessing is the only way to achieve true parallelism. It takes advantage of multiple CPUs and cores, and it is ideal when tasks are CPU-bound.
Multithreading, on the other hand, is useful when tasks are I/O-bound. Threads have a lower memory footprint, and their ability to share memory makes them a great fit for I/O-bound applications such as downloading files.
To understand the differences between the two in more detail, check out this topic on Stack Overflow.
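Conveniently, both strategies expose the same Executor interface in the standard library, so switching between them is a one-line change. A minimal sketch:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# Swap in ProcessPoolExecutor for CPU-bound work;
# the rest of the code stays exactly the same.
Executor = ThreadPoolExecutor

if __name__ == "__main__":
    with Executor(max_workers=4) as executor:
        future = executor.submit(pow, 2, 10)
        print(future.result())  # 1024
```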
Boto3 w/ Multithreading
According to the Boto3 docs, clients are thread-safe; Resources and Sessions, however, are not. Therefore, we will be using the low-level client to ensure our code is thread-safe.
This article dives deeper into the differences between Clients & Resources.
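As a quick illustration of the two interfaces (both calls below are real Boto3 APIs; the bucket name is a placeholder):

```python
import boto3

# Low-level client: thread-safe, maps one-to-one onto the S3 API
client = boto3.client("s3")
response = client.list_objects_v2(Bucket="my-test-bucket")

# Higher-level resource: a more Pythonic object model, but not thread-safe
s3 = boto3.resource("s3")
keys = [obj.key for obj in s3.Bucket("my-test-bucket").objects.all()]
```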
Downloading multiple files using Multiprocessing
We will be leveraging the ProcessPoolExecutor from the concurrent.futures module to create a process pool and download multiple files from S3 in parallel.
```python
import boto3
from concurrent import futures
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

KEYS_TO_DOWNLOAD = [...]  # all the files that you want to download
BUCKET_NAME = "my-test-bucket"
LOCAL_DOWNLOAD_PATH = "files/"


def download_object(file_name):
    """Downloads an object from S3 to local."""
    # Each worker creates its own client; clients cannot be shared
    # across processes.
    s3_client = boto3.client("s3")
    download_path = Path(LOCAL_DOWNLOAD_PATH) / file_name
    print(f"Downloading {file_name} to {download_path}")
    s3_client.download_file(BUCKET_NAME, file_name, str(download_path))
    return "Success"


def download_parallel_multiprocessing():
    with ProcessPoolExecutor() as executor:
        future_to_key = {executor.submit(download_object, key): key for key in KEYS_TO_DOWNLOAD}

        for future in futures.as_completed(future_to_key):
            key = future_to_key[future]
            exception = future.exception()

            if not exception:
                yield key, future.result()
            else:
                yield key, exception


if __name__ == "__main__":
    for key, result in download_parallel_multiprocessing():
        print(f"{key} result: {result}")
```
Output of the program:

```
Downloading sitemap.xml to files/sitemap.xml
Downloading feed.xml to files/feed.xml
Downloading robots.txt to files/robots.txt
Downloading index.html to files/index.html
index.html result: Success
robots.txt result: Success
sitemap.xml result: Success
feed.xml result: Success
```
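Both examples assume you already know which keys you want. If you instead want every object under a prefix, one way to build KEYS_TO_DOWNLOAD is with a paginator; the bucket name and prefix here are hypothetical:

```python
import boto3


def list_keys(bucket, prefix=""):
    """Collects every object key under a prefix, page by page."""
    s3_client = boto3.client("s3")
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent on empty pages, so default to an empty list
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


KEYS_TO_DOWNLOAD = list_keys("my-test-bucket", prefix="reports/")
```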
Downloading multiple files using Multithreading
We will use the ThreadPoolExecutor from the concurrent.futures module to create a thread pool and download multiple files from S3 in parallel. Unlike the multiprocessing example, we share a single S3 client between the threads, since clients are thread-safe.
```python
import boto3.session
from concurrent import futures
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

KEYS_TO_DOWNLOAD = [...]  # all the files that you want to download
BUCKET_NAME = "my-test-bucket"
LOCAL_DOWNLOAD_PATH = "files/"


def download_object(s3_client, file_name):
    """Downloads an object from S3 to local using the shared client."""
    download_path = Path(LOCAL_DOWNLOAD_PATH) / file_name
    print(f"Downloading {file_name} to {download_path}")
    s3_client.download_file(BUCKET_NAME, file_name, str(download_path))
    return "Success"


def download_parallel_multithreading():
    # Create a session and use it to make our client
    session = boto3.session.Session()
    s3_client = session.client("s3")

    # Dispatch work tasks with our s3_client
    with ThreadPoolExecutor(max_workers=8) as executor:
        future_to_key = {executor.submit(download_object, s3_client, key): key for key in KEYS_TO_DOWNLOAD}

        for future in futures.as_completed(future_to_key):
            key = future_to_key[future]
            exception = future.exception()

            if not exception:
                yield key, future.result()
            else:
                yield key, exception


if __name__ == "__main__":
    for key, result in download_parallel_multithreading():
        print(f"{key} result: {result}")
```
Output of running this program:

```
Downloading feed.xml to files/feed.xml
Downloading index.html to files/index.html
Downloading robots.txt to files/robots.txt
Downloading sitemap.xml to files/sitemap.xml
sitemap.xml result: Success
index.html result: Success
robots.txt result: Success
feed.xml result: Success
```
Performance
The performance of multiprocessing and multithreading was similar in the tests that I ran; both were significantly faster than downloading one file at a time. This makes sense for an I/O-bound task like downloading from S3, where workers spend most of their time waiting on the network rather than the CPU.
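If you want to reproduce the comparison yourself, a minimal timing harness could look like the sketch below. It assumes the two generator functions from the examples above are defined in, or importable into, the same script:

```python
import time


def timed(label, download_generator):
    """Drains a download generator and reports the elapsed wall-clock time."""
    start = time.perf_counter()
    results = list(download_generator())
    elapsed = time.perf_counter() - start
    print(f"{label}: {len(results)} files in {elapsed:.2f}s")


timed("multiprocessing", download_parallel_multiprocessing)
timed("multithreading", download_parallel_multithreading)
```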