Export Amazon Aurora MySQL or Amazon RDS snapshots to AWS S3 buckets

January 19, 2021 by Rajendra Gupta

In this article, we will explore the following topics:

  • Export an Amazon Aurora or Amazon RDS database snapshot to an S3 bucket
  • Use AWS Glue to crawl the snapshot and generate a data catalog
  • Use Amazon Athena to query the data catalog
  • Export Amazon Athena query results into an S3 bucket

Introduction and Requirements

Amazon RDS is a managed relational database service for Microsoft SQL Server, MySQL, PostgreSQL, Oracle and Aurora (MySQL & PostgreSQL compatibility). AWS automatically takes regular database snapshots and stores them in an S3 bucket. It uses these backups for point-in-time recovery. The maximum retention period for these automated snapshots is 35 days.

AWS also allows users to take manual snapshots. In January 2020, AWS added the flexibility to export these snapshots to an S3 bucket in the Apache Parquet format. Parquet is an open-source format that consumes far less storage and works efficiently with serverless solutions such as Amazon Athena. The export to S3 functionality is available for MySQL, PostgreSQL, MariaDB and Aurora (MySQL & PostgreSQL compatibility).

In the article, Query Amazon Athena external tables using SQL Server, we used Amazon Athena to query a CSV file stored in an S3 bucket without importing it into a database. You can go through it to get a basic understanding of the data flow.

For this article, we have the following requirements.

Amazon Aurora cluster

An Aurora cluster with MySQL compatibility. You can refer to the article, Explore Aurora Global Database for MySQL compatibility, for its implementation.

  • DB identifier: database-1
  • Region: us-east-1
  • Size: db.t3.medium
  • Status: Available

Amazon Aurora DB cluster

You require a MySQL database and sample data in the Amazon Aurora instance. You can refer to the article, Rewind or Forward the Amazon Aurora Cluster using the backtrack feature, to prepare sample data in the [SQLShackDemo] database.

  • Sample database: [SQLShackDemo]
  • Number of records: 2

Sample data

Amazon Linux EC2 instance

An Amazon Linux EC2 instance to connect to the Aurora MySQL instance. A minimal connection sketch follows the screenshot below.

  • Public IPv4 address: 3.225.20.119
  • Instance state: Running
  • Instance type: t2.medium

Amazon Linux EC2 instance
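For reference, connectivity from the EC2 instance to the Aurora cluster can be checked with a short script. The following is a minimal sketch, assuming the PyMySQL package is installed and using placeholder values for the cluster endpoint and credentials.

# Minimal sketch: verify connectivity from the EC2 instance to the Aurora MySQL cluster.
# Assumptions: PyMySQL is installed (pip install pymysql); endpoint, user and password are placeholders.
import pymysql

connection = pymysql.connect(
    host="database-1.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com",  # replace with your cluster endpoint
    user="admin",                                                    # replace with your master user
    password="your_password",                                        # replace with your password
    database="SQLShackDemo",
)

with connection.cursor() as cursor:
    cursor.execute("SHOW TABLES")  # list the sample tables in the [SQLShackDemo] database
    print(cursor.fetchall())

connection.close()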

AWS S3 bucket

We require an S3 bucket in the same region where our RDS or Amazon Aurora instance exists. You can go to the S3 service and create a bucket.

  • S3 bucket: sqlshackdemo3
  • Region: us-east-1

AWS S3 bucket

Open the S3 bucket and create a folder aurora/ to hold the exported objects. We require this S3 bucket and folder in a later section of the article.

Create a folder for the exported objects
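If you prefer scripting to the console, the bucket and folder can also be created with boto3. The following is a minimal sketch, assuming the bucket name used in this article; the zero-byte object with a trailing slash is a common way to make the aurora/ prefix appear as a folder in the S3 console.

# Minimal sketch: create the S3 bucket and the aurora/ folder prefix with boto3.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# In us-east-1, create_bucket does not need a CreateBucketConfiguration.
s3.create_bucket(Bucket="sqlshackdemo3")

# An empty object whose key ends with "/" shows up as a folder in the S3 console.
s3.put_object(Bucket="sqlshackdemo3", Key="aurora/")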

AWS symmetric key

Create an AWS symmetric key using the AWS key management service. Click on Create a key.

AWS symmetric key

Step 1: Configure key

Select the Symmetric key type to use a single encryption key for both encryption and decryption.

Configure key

Step 2: Add Labels

Specify an alias name and its description for the KMS key.

Add labels

Step 3: Define key administrative permissions

In this step, we select the AWS IAM user that is the administrator for the AWS key.

Define key administrative permissions

Step 4: Define key usage permissions

We can skip step 4 here because we do not want to give any user permissions to access the AWS key at the moment.

Step 5: Review and edit key policy

It shows the JSON policy statement for the AWS key. Click on Finish to create the key with your configuration.

Define key usage permissions

It creates the customer-managed AWS key, as shown below.

customer-managed AWS key

Click on the key name and note down the key ARN. We require this ARN while exporting the Snapshot to the S3 bucket.

exporting the Snapshot
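As a scripted alternative to the console wizard above, the symmetric key can also be created with boto3. The following is a minimal sketch; the alias name is only an example, and the printed ARN is the value you note down for the export.

# Minimal sketch: create a symmetric KMS key, attach an alias and capture the key ARN.
import boto3

kms = boto3.client("kms", region_name="us-east-1")

response = kms.create_key(
    Description="Key for exporting Aurora snapshots to S3",
    KeyUsage="ENCRYPT_DECRYPT",  # symmetric key used for both encryption and decryption
)
key_id = response["KeyMetadata"]["KeyId"]
key_arn = response["KeyMetadata"]["Arn"]

# Example alias name; choose one that suits your environment.
kms.create_alias(AliasName="alias/aurora-snapshot-export", TargetKeyId=key_id)

print("Note down this key ARN for the snapshot export:", key_arn)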

At a high-level, we implement the following architecture in this article:

  • Take a manual snapshot of the Aurora MySQL instance
  • Export the snapshot into an AWS S3 bucket
  • Use a crawler to prepare a data catalog from the snapshot data exported to S3 in Parquet format
  • Use Amazon Athena to query the data catalog
  • Store the query results in the S3 bucket

RDS Snapshot

Let’s start the deployment as per the defined high-level architecture.

Take a manual snapshot of the Amazon Aurora database

We can use either a manual or an automated snapshot for the export feature. In my lab environment, I do not have any automated snapshots, so I initiate a manual snapshot with the following steps.

Click on the Aurora instance in the Amazon RDS console, go to Actions and choose Take Snapshot.

Take a Manual snapshot

In the snapshot window, specify a meaningful name for your manual database snapshot and click on Take Snapshot as shown below.

manual database snapshot

It takes some time to create the manual snapshot depending upon your database size. It shows as completed in the following screenshot.

Snapshot progress
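The same manual snapshot can be initiated through the API. The following is a minimal boto3 sketch, assuming the cluster identifier database-1 and an example snapshot name.

# Minimal sketch: take a manual snapshot of the Aurora cluster with boto3.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Aurora uses cluster-level snapshots, hence create_db_cluster_snapshot.
rds.create_db_cluster_snapshot(
    DBClusterIdentifier="database-1",
    DBClusterSnapshotIdentifier="sqlshackdemo-manual-snapshot",  # example snapshot name
)

# Optionally wait until the snapshot becomes available.
waiter = rds.get_waiter("db_cluster_snapshot_available")
waiter.wait(DBClusterSnapshotIdentifier="sqlshackdemo-manual-snapshot")
print("Snapshot is available")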

Export the Aurora snapshot to Amazon S3

In the RDS console, select the Snapshot that you want to export into the AWS S3 bucket. Now, go to the Actions menu and choose Export to Amazon S3.

Export to Amazon S3.

In the RDS snapshot export page, select the following parameters.

  • Export identifier: The name for the snapshot export task
  • Export data format: It displays the export format as Parquet. You cannot modify this data format
  • Amount of data to be exported: You can choose how much data to export from the snapshot
    • All: It exports all data in the database to the S3 bucket
    • Partial: We can define identifiers such as schema or table names that we want to export

AWS charges you per GB of the snapshot size. You can review the AWS documentation for more details on pricing.

In my case, the [sqlshackdemo] database is relatively small (<1 GB); therefore, I choose the option All.

Export Identifier

In the S3 destination, select the bucket name from the drop-down and enter the folder name in the S3 prefix section as shown below.

S3 destination

Next, create a new IAM role or choose an existing IAM role for exporting objects into an S3 bucket. Here, I specify the IAM role as [aurorasnapshot3].

create a new IAM role

In the encryption section, enter the AWS key ARN we created earlier.

AWS key ARN

Click on Export to Amazon S3. The export status progresses from Starting to In progress to Complete, as shown below.

Export to Amazon S3

View progress

Completed snapshot to AWS S3 bucket
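The export can also be started through the API instead of the console. The following is a minimal boto3 sketch, assuming the snapshot, bucket, IAM role and KMS key described in this article; the account ID and key ID in the ARNs are placeholders.

# Minimal sketch: export the manual snapshot to the S3 bucket in Parquet format.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.start_export_task(
    ExportTaskIdentifier="sqlshackdemo-export",  # export identifier
    SourceArn="arn:aws:rds:us-east-1:123456789012:cluster-snapshot:sqlshackdemo-manual-snapshot",  # placeholder account ID
    S3BucketName="sqlshackdemo3",
    S3Prefix="aurora",                           # folder created earlier
    IamRoleArn="arn:aws:iam::123456789012:role/aurorasnapshot3",    # placeholder account ID
    KmsKeyId="arn:aws:kms:us-east-1:123456789012:key/your-key-id",  # key ARN noted earlier
    # ExportOnly=["SQLShackDemo"],               # uncomment for a partial export of one database
)

# Check the export status (Starting -> In progress -> Complete).
status = rds.describe_export_tasks(ExportTaskIdentifier="sqlshackdemo-export")
print(status["ExportTasks"][0]["Status"])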

Once the export to the S3 bucket is completed, you can navigate to the AWS S3 bucket and view the folder contents.

It creates a subfolder as per the defined export identifier.

creates a subfolder into S3 bucket

In the folder, we have JSON files that hold the export configuration and a folder for the database.

JSON files

In our case, [SQLShackDemo] is the database name in the Amazon Aurora cluster. Therefore, it has a folder [SQLShackDemo/] in the S3 bucket.

Inside the [SQLShackDemo/] folder, we get the exported results in the Parquet format.

exported result in the parquet format
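You can also verify the exported objects with a script instead of the console. The following is a short boto3 listing sketch, assuming the bucket, prefix and export identifier used above.

# Minimal sketch: list the exported Parquet objects under the aurora/ prefix.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

response = s3.list_objects_v2(Bucket="sqlshackdemo3", Prefix="aurora/sqlshackdemo-export/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])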

Configure a crawler for Amazon Aurora using AWS Glue

In the AWS web console, launch AWS Glue to add tables using the crawler.

Configure a crawler using the AWS Glue

Click on Add tables using a crawler. It has the following steps in the crawler wizard:

  • Crawler info: Enter a crawler name. In my case, I specify [auroradbexport]

     Add tables using a crawler for Amazon Aurora

  • Crawler source type: The crawler source type is data stores. If you have multiple folders (multiple databases), you can configure the crawl for all folders or new folders only

    Crawler source type

  • Data Store: In the data store, choose S3 and specify the S3 bucket and folder name in the Include path column

    data store

    Choose No when asked to add another data store

    add another data store.

  • IAM Role: Create an IAM role for the crawler to run and access the AWS data stores. Here, I specify the IAM role name as [auroraglue]. AWS creates the IAM role and allows read access to the S3 bucket

    IAM Role

  • Schedule: Select the frequency as Run on Demand

    frequency as Run on Demand

  • Output: In the output step, click on Add database and specify a database name for the data catalog

    specify a database name

Review all steps and click on Finish to configure the AWS crawler.

configure the AWS crawler

It creates the crawler. Its status should be Ready for you to run and process data.

creates the crawler

Before we execute the crawler, go back to your AWS key and add the IAM Role [auroraglue] for encrypting and decrypting data.

add the IAM Role

Now, select the crawler and click on Run Crawler.

Run Crawler
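For completeness, the same crawler can be created and started with boto3 instead of the console wizard. The following is a minimal sketch, assuming the IAM role and S3 path from this walkthrough and an example data catalog database name.

# Minimal sketch: create and run the AWS Glue crawler over the exported snapshot data.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="auroradbexport",
    Role="auroraglue",        # IAM role with read access to the S3 bucket
    DatabaseName="auroradb",  # example data catalog database name
    Targets={"S3Targets": [{"Path": "s3://sqlshackdemo3/aurora/sqlshackdemo-export/"}]},
)

glue.start_crawler(Name="auroradbexport")

# Check the crawler state; it returns to READY once the run finishes.
state = glue.get_crawler(Name="auroradbexport")["Crawler"]["State"]
print("Crawler state:", state)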

It takes much less time compared with the export of the snapshot to the S3 bucket.

Once the status changes to Ready, you can view the logs or use Amazon Athena to query data from the data catalog.

Click on Logs to get detailed messages for each step of the crawl.

Use Amazon Athena to query data stored in Parquet format

To query crawled data, launch Amazon Athena from the AWS services. In the console, select the data source and database name as per your earlier configurations.

It shows the JSON tables and the partitioned table for your Amazon Aurora database.

Amazon Athena to query data

Click on the sqlshackdemo (partitioned) table and choose Preview table.

It shows you the query and its output. The preview limits the output to 10 records; however, we have only two rows in the sample table. You can compare it with the earlier screenshot of the Amazon Aurora sample table.

You can note that the query window name is [New Query 4].

Preview table

Click on Settings in Amazon Athena and specify the query result location as an AWS S3 bucket.

Query result location

In the AWS S3 bucket, you can view the query metadata and output in CSV format. Amazon Athena stores the results in the S3 bucket for each query execution. Note that the folder name is 04 because of the query window name [New Query 4].

query metadata
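The same query, including the S3 result location, can also be issued through the Athena API. The following is a minimal boto3 sketch, assuming the example catalog database name from the crawler step, the crawled table name and an example results prefix in the S3 bucket.

# Minimal sketch: run an Athena query against the crawled table and store the results in S3.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

execution = athena.start_query_execution(
    QueryString='SELECT * FROM "sqlshackdemo" LIMIT 10;',  # example table name created by the crawler
    QueryExecutionContext={"Database": "auroradb"},        # example data catalog database
    ResultConfiguration={"OutputLocation": "s3://sqlshackdemo3/athena-results/"},  # example result prefix
)
query_id = execution["QueryExecutionId"]

# Wait for the query to finish.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print the returned rows (the first row contains the column headers).
if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])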

Conclusion

In this article, we explored exporting snapshots to an S3 bucket for Amazon Aurora and supported Amazon RDS databases. It gives you the flexibility to access data without any performance impact on the production database. You can use Amazon Athena to query the exported data like regular tables, work with the snapshot's data and recover data if required.

Table of contents

Deploy your Amazon Aurora database clusters for MySQL
Explore Amazon Aurora global databases with MySQL compatibility
Rewind or Forward the Amazon Aurora Cluster with MySQL compatibility using backtrack feature
Implement an Amazon Aurora Serverless cluster
Database Cloning in Amazon Aurora
Export Amazon Aurora MySQL or Amazon RDS snapshots to AWS S3 buckets