The Amazon Relational Database Service (RDS) makes it easy to set up, operate, and scale a relational database in the cloud.
It provides cost-efficient and resizable capacity while automating time-consuming administration tasks such as hardware provisioning, database setup, patching, and backups.
When we think about backups, we often think about disaster recovery or data migration scenarios, but these aren’t the only use cases.
Let’s consider another common one.
Suppose we have an example database called “bank-demo” running on a large DB instance and taking automated nightly snapshots.
Our internal analytics team wants to be able to run a high volume of intense queries on the data whenever they want.
Can we make this possible for them in a safe, reliable, and cost-effective way? The answer is “yes!” In this demo, we’re going to export an Amazon RDS snapshot from this example RDS database to S3, and then, with the help of AWS Glue, use Amazon Athena to perform queries on the exported snapshot.
This will enable us to generate reports, perform computationally expensive analytical queries, combine our data from RDS with other sources to build data lakes, comply with long-term data retention requirements, and much more – all without impacting the performance or availability of our database.
To get started, let’s head over to the RDS console, and open the list of available snapshots for our sample database.
We’ll choose the most recent automated system snapshot, but we could also select any snapshots we’ve taken manually, or even create a brand new snapshot just for this exercise by using the “Take snapshot” button.
Next, we’ll select “Export to Amazon S3” from the “Actions” dropdown menu button and start configuring our export.
Let’s call it “bank-demo-export-1”.
Now we need to indicate how much data we’d like to export.
We can either export the entire database, or we can specify only a few tables by listing the schema and name for each table that we want to include.
Let’s go ahead and export everything.
We need to choose a destination on S3 for the export, but since we don’t have one yet, let’s go create it.
We’ll call it “bank-demo-exports”, leave the default options alone, and as a general best practice, leave the “Block all public access” setting checked.
Once we’ve done that, we can go back to the snapshot export form, refresh the “S3 Destination” section, and select our new bucket.
Now we need to choose an IAM Role that the snapshot export process can use to access our S3 bucket.
We don’t have one yet, so let’s create a new role called “bank-demo-exports”.
The role will automatically have this IAM Policy shown here, and that will allow it to read, write, and manage the snapshot exports for us.
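For anyone who would rather script the bucket step than click through the console, here is a minimal sketch. It assumes a boto3-style S3 client is passed in by the caller; the helper names are made up for this example, and the four settings together correspond to the console's “Block all public access” checkbox.

```python
def public_access_block_config() -> dict:
    """All four settings on, matching "Block all public access" in the console."""
    return {
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    }


def create_private_bucket(s3_client, bucket_name: str, region: str = "us-east-1") -> None:
    """Create the bucket, then block public access (hypothetical helper).

    `s3_client` is assumed to behave like boto3.client("s3").
    """
    kwargs = {"Bucket": bucket_name}
    # Outside us-east-1, S3 expects an explicit location constraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    s3_client.create_bucket(**kwargs)
    s3_client.put_public_access_block(
        Bucket=bucket_name,
        PublicAccessBlockConfiguration=public_access_block_config(),
    )
```

With boto3 installed, this could be called as `create_private_bucket(boto3.client("s3"), "bank-demo-exports")`.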
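The transcript doesn't capture the policy that appears on screen, so the following is only a sketch of what a role like this typically needs: permission to list, read, write, and delete objects in the export bucket, plus a trust policy that lets the RDS export service assume the role. The action list and the `export.rds.amazonaws.com` principal are how I understand the snapshot export documentation, and the function names are hypothetical.

```python
import json


def export_role_policy(bucket_name: str) -> dict:
    """Permissions policy scoped to the export bucket (assumed shape)."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject*",
                    "s3:ListBucket",
                    "s3:GetObject*",
                    "s3:DeleteObject*",
                    "s3:GetBucketLocation",
                ],
                # Bucket-level and object-level ARNs are both needed.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


def export_role_trust_policy() -> dict:
    """Trust policy letting the RDS export service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "export.rds.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(export_role_policy("bank-demo-exports"), indent=2))
```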
To protect our exported data, we must also provide a customer-managed encryption key from the AWS Key Management Service (KMS).
We can do this by going to the KMS console and pressing the Create Key button.
We’ll give the key a logical name and description, but since there are no other users in this demo account, we don’t need any separate key administrators or key users at this time, so we can leave those options alone.
We’ll click “Finish” to create the key, but before leaving KMS, let’s grab a copy of the key ARN, and then paste it into the Encryption section back in RDS.
Now we can start the Export.
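The same export can be started from code rather than the console. This sketch collects the form fields we just filled in into the keyword arguments that boto3's `start_export_task` accepts. The helper name is hypothetical, and the ARNs in the usage note are placeholders, not values from the demo.

```python
from typing import Iterable, Optional


def build_export_request(
    task_id: str,
    snapshot_arn: str,
    bucket: str,
    role_arn: str,
    kms_key_arn: str,
    tables: Optional[Iterable[str]] = None,
) -> dict:
    """Build keyword arguments for rds.start_export_task (hypothetical helper)."""
    request = {
        "ExportTaskIdentifier": task_id,
        "SourceArn": snapshot_arn,
        "S3BucketName": bucket,
        "IamRoleArn": role_arn,
        "KmsKeyId": kms_key_arn,
    }
    # Leaving ExportOnly out exports the entire snapshot, as in the demo.
    if tables:
        request["ExportOnly"] = list(tables)
    return request
```

With boto3 installed, the request could be submitted as `boto3.client("rds").start_export_task(**build_export_request("bank-demo-export-1", snapshot_arn, "bank-demo-exports", role_arn, kms_key_arn))`, where the three ARN variables come from your own account.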
The amount of time it’ll take to finish will depend on the size of the snapshot itself, but we can look in the RDS Console to view the current status.
For more granular progress, the `aws rds describe-export-tasks` command in the AWS CLI will tell us how much data has been extracted and what percentage of the overall process has been completed thus far.
Now that the export is complete, let’s take a look at what ended up in our S3 bucket.
We have two export info JSON files, and a folder with the same name as our database.
The first info file is the final report of the export task, and the second one breaks the status down for us by individual table, including overall size and the datatype mappings.
Within the “bank_demo” folder we have separate folders for each individual table that was exported, and under each of those we’ll see one or more folders corresponding to how many partitions were created during the export process.
“transactions”, being the largest of the three tables, created several hundred partitions, while the “accounts” and “customers” tables only required one each.
Finally, at the deepest level, we’ll find the exported data in Apache Parquet format.
The Parquet format is up to 2x faster to export and consumes up to 6x less storage in Amazon S3, compared to text formats, and we can analyze the exported data with other AWS services like Amazon Athena, Amazon EMR, and Amazon SageMaker.
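To show what that progress information looks like when read from code, here is a small sketch that summarizes a `describe-export-tasks` response. The field names (`Status`, `PercentProgress`, `TotalExtractedDataInGB`) are the ones the API returns as far as I know; the sample values in the usage note are invented for illustration.

```python
def summarize_export_tasks(response: dict) -> list:
    """Turn a describe-export-tasks response into one status line per task."""
    lines = []
    for task in response.get("ExportTasks", []):
        lines.append(
            "{id}: {status}, {pct}% complete, {gb} GB extracted".format(
                id=task["ExportTaskIdentifier"],
                status=task["Status"],
                # Progress fields may be absent very early in the task.
                pct=task.get("PercentProgress", 0),
                gb=task.get("TotalExtractedDataInGB", 0),
            )
        )
    return lines
```

Given a response containing one task named `bank-demo-export-1` with status `IN_PROGRESS`, 42 percent progress, and 3 GB extracted, this returns `["bank-demo-export-1: IN_PROGRESS, 42% complete, 3 GB extracted"]`.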
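The folder layout just described (database, then table, then partition, then Parquet files) can be checked with a few lines of code. This sketch counts partitions per table from a list of S3 object keys. The example keys in the usage note are hypothetical and only mimic the shape described here; the real table folder names may also carry a schema or database prefix.

```python
from collections import defaultdict


def partitions_per_table(keys) -> dict:
    """Count distinct partition folders under each table folder.

    Assumes keys shaped like .../<table folder>/<partition folder>/<file>.parquet.
    """
    partitions = defaultdict(set)
    for key in keys:
        parts = key.split("/")
        # Skip the export info JSON files and anything too shallow to be data.
        if not key.endswith(".parquet") or len(parts) < 3:
            continue
        table_folder, partition_folder = parts[-3], parts[-2]
        partitions[table_folder].add(partition_folder)
    return {table: len(found) for table, found in partitions.items()}
```

For example, keys such as `bank-demo-export-1/bank_demo/transactions/1/part-0.parquet` and `bank-demo-export-1/bank_demo/transactions/2/part-0.parquet` would be reported as two partitions for the `transactions` folder.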
Let’s set up Athena for that right now.
First, we need to tell Athena where to find the data and what it looks like.
Thanks to AWS Glue, we don’t need to give Athena the details on every table and its properties ourselves though – we can set up a crawler to go discover that for us.
Let’s call the crawler “bank_demo” and tell it that it’s going to crawl through an S3 bucket in our account under this path.
Next we’ll ask it to create the appropriate IAM Role for us, and, for the time being, tell it to only crawl our data when we ask it to.
And finally, we’ll ask it to organize the schema information it discovers under the name “bank_demo”. With that, the crawler is ready to go, but there’s one more important task that we need to complete before running it.
Remember earlier when we configured the KMS key policy, but didn’t need to give anybody else permission to use the key? That’s no longer the case.
This crawler is going to need to crawl through our exported snapshot data in S3, which is encrypted, so the crawler, or more specifically the crawler’s IAM Role, is going to need access to use that key.
To grant it access, we just go back to KMS and add the new IAM Role as a key user.
Now let’s run the crawler! This will take a few minutes to complete depending on the size of your data, but you can expect it to finish considerably faster than the original export to S3 did.
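The crawler settings chosen above map onto a handful of parameters in the Glue API. This sketch builds the keyword arguments for boto3's `create_crawler`. The helper name is hypothetical, and unlike the console, the API will not create the IAM role for us, so an existing role is assumed. Leaving out `Schedule` gives an on-demand crawler.

```python
def build_crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Build keyword arguments for glue.create_crawler (hypothetical helper)."""
    return {
        "Name": name,
        # Must already exist and be able to read the bucket.
        "Role": role_arn,
        # Glue data catalog database that will hold the discovered tables.
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # No "Schedule" key: the crawler only runs when started on demand.
    }
```

With boto3 installed, `boto3.client("glue").create_crawler(**request)` followed by `start_crawler(Name="bank_demo")` would create and run it.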
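Adding a key user in the console amounts to adding a statement to the key policy. As a rough sketch, the function below appends a statement that lets a role decrypt with the key. The `Sid` and function name are made up for this example, and the console's own “key users” statement grants a broader set of actions than the two shown here.

```python
import copy


def add_decrypt_access(key_policy: dict, role_arn: str) -> dict:
    """Return a copy of the key policy with a decrypt statement for the role."""
    updated = copy.deepcopy(key_policy)  # leave the caller's policy untouched
    updated.setdefault("Statement", []).append(
        {
            "Sid": "AllowCrawlerToDecrypt",  # hypothetical statement id
            "Effect": "Allow",
            "Principal": {"AWS": role_arn},
            "Action": ["kms:Decrypt", "kms:DescribeKey"],
            # In a key policy, "*" refers to the key the policy is attached to.
            "Resource": "*",
        }
    )
    return updated
```

The result could then be serialized with `json.dumps` and applied through boto3's `kms.put_key_policy(KeyId=..., PolicyName="default", Policy=...)`.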
Now let’s go back to Athena and run some queries! Amazon Athena supports a subset of SQL, and you can refer to the Athena documentation for a full reference.
If you’ve ever worked with SQL before though, you’ll find the query editor fairly straightforward.
As you can see, we’ve just counted up all of the accounts, found a random customer, and looked up the ten most recent transactions in all of their accounts using the same general query structure that we’re accustomed to.
The best part is that all of these queries we’re running are being executed against the exported data that we have in S3.
Whether we need to run a high volume of analytical queries, or even just a handful of very slow queries involving columns that aren’t optimized for search, we know that there’s going to be absolutely no impact to the database itself.
That’s all great, but we don’t want to have to be exporting RDS snapshots to S3 manually every time.
Let’s automate this! Amazon RDS uses the Amazon Simple Notification Service (SNS) to publish notifications when certain RDS events occur.
These notifications can trigger an AWS Lambda function to start a snapshot export, and we can then use an AWS Glue crawler to make the snapshot data available for querying with Amazon Athena.
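The queries themselves aren't captured in this transcript, so the sketch below only shows how a query like the first one (counting accounts) might be submitted from code instead of the query editor. The table name `accounts` and the results location are assumptions; the crawler may name tables differently in your catalog.

```python
def build_query_request(sql: str, database: str, output_location: str) -> dict:
    """Build keyword arguments for athena.start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical example: count rows in a table assumed to be named "accounts".
COUNT_ACCOUNTS = build_query_request(
    "SELECT COUNT(*) AS account_count FROM accounts",
    database="bank_demo",
    output_location="s3://example-athena-results/",  # placeholder bucket
)
```

With boto3 installed, `boto3.client("athena").start_query_execution(**COUNT_ACCOUNTS)` would submit it.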
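To make the notification step concrete, here is a rough sketch of how a Lambda function might decide whether an incoming SNS message announces a new automated snapshot for our database. This is not the sample repository's code. The message field names (`Event ID`, `Source ID`), the `RDS-EVENT-0091` identifier for "automated snapshot created", and the `rds:<db-name>-` snapshot prefix are my assumptions about the RDS event format and should be checked against a real notification.

```python
import json
from typing import Optional

# Assumed event id for "an automated DB snapshot has been created".
AUTOMATED_SNAPSHOT_CREATED = "RDS-EVENT-0091"


def snapshot_to_export(event: dict, db_name: str) -> Optional[str]:
    """Return the snapshot identifier to export, or None to ignore the event."""
    # SNS delivers the RDS notification as a JSON string in the Message field.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    event_id = message.get("Event ID", "")
    source_id = message.get("Source ID", "")
    if not event_id.endswith(AUTOMATED_SNAPSHOT_CREATED):
        return None
    # Automated snapshots are assumed to be named "rds:<db-name>-<timestamp>".
    if not source_id.startswith(f"rds:{db_name}-"):
        return None
    return source_id
```

A handler would call this first and only start an export task when it returns a snapshot identifier, so that unrelated RDS events are ignored.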
We’ll use the code provided in the `aws-samples/rds-snapshot-export-to-s3-pipeline` repository to deploy an example of this setup.
After giving it the name of our database and the name of an S3 bucket to create for the exports, the AWS Cloud Development Kit (CDK) will handle deploying everything.
Step-by-step instructions on how to do this are provided in the repository, so we won’t walk through all of that here, but we can see that after a few minutes, CDK has created everything we need to automatically export the “bank-demo” database’s automated snapshots to S3.
Now let’s wait a few days and see what happens.
OK, great! Our Lambda function has been exporting all of the snapshots that took place after the pipeline was set up, and we can now run our Glue crawler to make this data available to us in Athena.
That’s going to take a little bit longer this time, since the crawler has more data to go through, but to avoid this in the future, we could schedule the Glue crawler to run automatically each morning shortly after we expect our snapshot exports to complete.
Our “On Demand” crawler just finished, though, so let’s see what’s available now in Athena.
Unlike in the previous example, where we had one table in the Glue data catalog corresponding to each table that we had in the “bank-demo” database, we now have several instances of each of those tables, which correspond to the different days’ snapshots.
This means that we can run queries on not just the latest snapshot, but also on the earlier versions of our database in the older snapshots.
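Scheduling the crawler comes down to giving Glue a cron expression. As a small sketch, the helper below builds a daily schedule in the six-field `cron(...)` format that Glue uses, with times in UTC. The function name and the chosen hour in the usage note are illustrative only.

```python
def daily_crawler_schedule(hour_utc: int, minute: int = 0) -> str:
    """Build a Glue cron expression that fires once a day at the given UTC time."""
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    # Fields: minutes, hours, day-of-month, month, day-of-week, year.
    return f"cron({minute} {hour_utc} * * ? *)"
```

For instance, `daily_crawler_schedule(6, 30)` yields `cron(30 6 * * ? *)`, which could be applied with boto3 via `glue.update_crawler(Name="bank_demo", Schedule=...)`.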
These snapshot exports in S3 and the tables in the AWS Glue data catalog will build up over time, but you can manage the exports automatically by using Amazon S3 object lifecycle policies, which will transition your snapshot exports through the different tiers of storage available, such as Infrequent Access, Glacier, or Deep Archive, as they age.
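As an illustration of such a lifecycle policy, this sketch builds a configuration that moves exports through the three storage tiers as they age. The day thresholds and the rule id are arbitrary choices for the example, not recommendations from the demo.

```python
def export_lifecycle_config(
    prefix: str = "",
    ia_days: int = 30,
    glacier_days: int = 90,
    deep_archive_days: int = 180,
) -> dict:
    """Build a lifecycle configuration for s3.put_bucket_lifecycle_configuration."""
    if not (ia_days < glacier_days < deep_archive_days):
        raise ValueError("transitions must move to colder tiers at increasing ages")
    return {
        "Rules": [
            {
                "ID": "age-out-snapshot-exports",  # hypothetical rule name
                "Filter": {"Prefix": prefix},  # empty prefix: whole bucket
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_days, "StorageClass": "GLACIER"},
                    {"Days": deep_archive_days, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }
```

With boto3 installed, it could be applied as `boto3.client("s3").put_bucket_lifecycle_configuration(Bucket="bank-demo-exports", LifecycleConfiguration=export_lifecycle_config())`. Note that this only affects the objects in S3; the corresponding Glue tables would still need to be cleaned up separately.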
For more information on the S3 lifecycle policies, the RDS Snapshot Export to S3 feature, or any of the other services mentioned in this demo, please visit the links in the video description below.
Thanks for watching!