How to read and write files stored in AWS S3 using Pandas
Introduction
Pandas is an open-source library that provides easy-to-use data structures and data analysis tools for Python. AWS S3 is an object store ideal for storing large files.
This tutorial will look at two ways to read from and write to files in AWS S3 using Pandas: the s3fs package and the AWS Wrangler library.
Prerequisites
Authentication
If the S3 bucket you are accessing is private, you must authenticate yourself before reading from or writing to it. You can do so using either of the following methods:
- Create a credential file stored at ~/.aws/credentials, which contains your AWS credentials. The credential file looks like this:
[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
- Set the following environment variables:
  - AWS_ACCESS_KEY_ID: the access key for your AWS account.
  - AWS_SECRET_ACCESS_KEY: the secret key for your AWS account.
  - AWS_SESSION_TOKEN: the session token for your AWS account. This is only needed when you are using temporary credentials.

Credentials can also be passed explicitly at call time, as the sketch below shows.
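Both methods configure ambient credentials that Pandas discovers automatically. As a minimal sketch of the explicit alternative, Pandas (version 1.2.0 and later) accepts a storage_options dictionary that is forwarded to s3fs; the key values below are the placeholders from above, and the bucket name is hypothetical:

import pandas as pd

# Placeholder credentials; storage_options is forwarded to s3fs
df = pd.read_csv(
    "s3://my-test-bucket/sample.csv",
    storage_options={
        "key": "AKIAIOSFODNN7EXAMPLE",
        "secret": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    },
)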
Pandas
Pandas (starting with version 1.2.0) can read and write files stored in S3 via the s3fs Python package. S3Fs is a Pythonic file interface to S3 that builds on top of botocore.
To get started, we first need to install s3fs:
pip install s3fs
Reading a file
We can read a file stored in S3 using the following command:
import pandas as pd

# Pandas recognizes the s3:// URL scheme and delegates the transfer to s3fs
df = pd.read_csv("s3://my-test-bucket/sample.csv")
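The same URL scheme works with Pandas' other readers. For example, a hypothetical Parquet object can be read the same way, assuming a Parquet engine such as pyarrow is installed:

import pandas as pd

# Hypothetical object; requires a Parquet engine such as pyarrow
df = pd.read_parquet("s3://my-test-bucket/sample.parquet")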
Writing a file
We can store a file in S3 using the following command:
import pandas as pd

# df can be any Pandas dataframe; here we build a small example
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
df.to_csv("s3://my-test-bucket/sample.csv")
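Note that to_csv writes the dataframe index as an extra column by default. Pass index=False to omit it:

df.to_csv("s3://my-test-bucket/sample.csv", index=False)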
AWS Wrangler
AWS Wrangler is an AWS SDK for Pandas. It provides easy integration with various AWS services such as S3, Glue, and Redshift.
Install AWS Wrangler
We can install the library using the following command:
pip install awswrangler
Reading a file
We can read a file stored in S3 using the following commands:
import awswrangler as wr
df = wr.s3.read_csv("s3://my-test-bucket/sample.csv")
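wr.s3.read_csv also accepts an S3 prefix or a list of object paths, which is useful when a dataset is split across several files. A minimal sketch, assuming a hypothetical data/ prefix containing CSV objects:

import awswrangler as wr

# Reads and concatenates every CSV object under the prefix
df = wr.s3.read_csv("s3://my-test-bucket/data/")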
Writing a file
We can write a Pandas dataframe to a file in S3 using the following commands:
import awswrangler as wr

# df is the Pandas dataframe to upload
wr.s3.to_csv(df=df, path="s3://my-test-bucket/sample.csv")
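For analytics workloads, Parquet is often a better fit than CSV. As a sketch with a hypothetical prefix, AWS Wrangler can write a dataframe as a Parquet dataset and replace whatever is already stored there:

import awswrangler as wr

# dataset=True writes under the prefix; mode="overwrite" replaces existing objects
wr.s3.to_parquet(
    df=df,
    path="s3://my-test-bucket/dataset/",
    dataset=True,
    mode="overwrite",
)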