Introduction

Pandas is an open-source library that provides easy-to-use data structures and data analysis tools for Python. AWS S3 is an object store ideal for storing large files.

This tutorial will look at two ways to read from and write to files in AWS S3 using Pandas.

Prerequisites

Authentication

If the S3 bucket you are accessing is private, you must authenticate with credentials that have permission to read it. You can authenticate using one of the following methods:

  • Create a credentials file at ~/.aws/credentials containing your AWS keys. The file looks like this:
[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

  • Set the following environment variables:

AWS_ACCESS_KEY_ID
The access key for your AWS account.

AWS_SECRET_ACCESS_KEY
The secret key for your AWS account.

AWS_SESSION_TOKEN
The session token for your AWS account. This is only needed when you are using temporary credentials.
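As a sketch, the environment-variable approach can also be driven from Python itself, since botocore reads these variables when a client is created. The key values below are the same documentation placeholders used above, not real credentials:

```python
import os

# Placeholder credentials -- never hardcode real keys in source code.
os.environ["AWS_ACCESS_KEY_ID"] = "AKIAIOSFODNN7EXAMPLE"
os.environ["AWS_SECRET_ACCESS_KEY"] = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
# Only needed when using temporary credentials:
# os.environ["AWS_SESSION_TOKEN"] = "..."

print(os.environ["AWS_ACCESS_KEY_ID"])  # AKIAIOSFODNN7EXAMPLE
```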

Pandas

Pandas (starting with version 1.2.0) supports reading and writing files stored in S3 through the s3fs Python package. S3Fs is a Pythonic file interface to S3 built on top of botocore.

To get started, we first need to install s3fs:

pip install s3fs

Reading a file

We can read a file stored in S3 using the following command:

import pandas as pd

df = pd.read_csv("s3://my-test-bucket/sample.csv")
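If you prefer not to rely on a credentials file, pandas also accepts a storage_options dictionary that it forwards to s3fs, so credentials can be supplied per call. A minimal sketch, using the documentation placeholder keys and the same hypothetical bucket (the S3 call is commented out because it needs real credentials):

```python
import pandas as pd

# s3fs understands "key", "secret", and "token" entries.
# These are placeholder values, not real keys.
storage_options = {
    "key": "AKIAIOSFODNN7EXAMPLE",
    "secret": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
}

# With valid credentials, this reads the object directly:
# df = pd.read_csv("s3://my-test-bucket/sample.csv", storage_options=storage_options)
print(sorted(storage_options))  # ['key', 'secret']
```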

Writing a file

We can write a DataFrame to a file in S3 using the following command:

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["alice", "bob"]})
df.to_csv("s3://my-test-bucket/sample.csv")
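One detail worth knowing: to_csv writes the DataFrame index as an extra unnamed column by default. The round trip below uses an in-memory buffer to show the effect; the same index=False keyword applies unchanged when the target is an s3:// path:

```python
import io
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["alice", "bob"]})

buf = io.StringIO()
df.to_csv(buf, index=False)  # drop the index column from the output

# Without index=False, the first column would be the unnamed index.
print(buf.getvalue().splitlines())  # ['id,name', '1,alice', '2,bob']
```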

AWS Wrangler

AWS Wrangler is the AWS SDK for Pandas. The library provides easy integration with various AWS services like S3, Glue, and Redshift.

Install AWS Wrangler

We can install the library using the following command:

pip install awswrangler

Reading a file

We can read a file stored in S3 using the following commands:

import awswrangler as wr

df = wr.s3.read_csv("s3://my-test-bucket/sample.csv")

Writing a file

We can write a Pandas DataFrame to a file in S3 using the following commands:

import awswrangler as wr

wr.s3.to_csv(df, "s3://my-test-bucket/sample.csv")