An introduction to Amazon Athena

Transcript

All right. Thank you so much for inviting me. It's the first time I have the pleasure to talk to you all. Thanks to the organizers and thanks to Jan for having me here in his garden. I'm Julien. I'm an evangelist based in France. So you got Italian English, and now you have French English, but hopefully we have AI services that can help you understand us. Tonight, I'm going to talk about a new service that was launched at re:Invent called Amazon Athena, which is yet another data service. We're going to look at it and then have a death match between Athena and Redshift to see what we can learn. This is a brand new service, and I was looking for a nice logo to put in the corner, as I do all the time. When you Google "Amazon Athena," you find this lovely lady, a professional wrestler. Not quite what I expected, but on second thought, it's interesting because our Athena is a data wrestler. This service was just announced at re:Invent a few days ago. We're still very jet-lagged, at least I am. In a nutshell, this service allows you to run SQL queries directly on files stored in S3. Just dump your files in S3, and run SQL directly from there. The cool thing about Athena is there's no loading or pre-processing of data. You just create a table, map that data to your files, and off you go. The best part about Athena is that there is absolutely no infrastructure. Zero. None. You don't see it, you don't create it, you don't manage it, you don't see it. Our website calls it serverless. I'm not super comfortable with that term because it might create confusion with Lambda, but it is serverless in the sense that there are no servers at all, and you just fire away. For now, it's only available in the US. I'll show you how to use it in US East, but I'm quite confident that most of Europe will be yelling at AWS to bring this to the European regions. How much does it cost? As always, it's pay-per-use, and in this case, you pay per data scanned. It's five bucks per terabyte scanned, which means the less you scan, the less you pay. We'll see how to do that with a couple of techniques. It's based on Presto. Anyone using Presto already? No one? Okay, so it's a cool service. You could already use it on an EMR cluster, but you had to go through the process of creating an EMR cluster and dealing with the Hadoop ecosystem. Today, it's much simpler. How do you run queries? Anything that deals with table creation is not going to be standard SQL; it's going to be Apache Hive DDL, which is close to SQL but slightly different. The actual querying, the DML, such as selects, is standard SQL. All the operators and functions you expect are there. The doc says, if you want to find out what Athena supports, go to the Presto doc, and that's what we support. We have a few restrictions: no user-defined functions, no transactions, and no stored procedures. But standard SQL operators are all here. How about data? Your data lives in S3 and can have many different forms, such as unstructured log files, semi-structured data like CSV, TSV, and JSON, and columnar formats like Parquet and ORC. Columnar formats are extremely efficient, and we'll look at that. Compression is important too, and we support several algorithms. Unfortunately, we don't have LZO. Compression is important because, like in Redshift, if data is compressed, you have less I/O and scan less data, leading to better performance and lower costs. Partitioning is another technique that has been around forever in databases and is available in Athena. You can either have pre-partitioned data in S3 or partition your data at creation time. Partitioning reduces the number of I/Os, giving you better performance and savings. Combining compression, partitioning, and columnar formats can yield extremely good performance at a very good price. How do you run queries on Athena? You can do this in the AWS console, which is actually quite cool. There's a nice wizard to import your data, a SQL editor with auto-completion, query history, and the ability to run multiple queries in parallel. Most of the time, you won't be using the console but the JDBC driver with tools like Workbench or directly from your own code. We provide a specific JDBC driver to connect to Athena. Setting it up took me a while, so I suspect I'm not alone. Just install the driver, and the only catch is that you need to add a property in Workbench with an S3 bucket to output the data. The bucket needs to be in the same region as Athena. The data you load can come from other regions, but the output bucket must be in the same region. This saves you 10 minutes of frustration. So, how does Athena compare to Redshift? I created some clusters: a small Redshift cluster, a larger one, and a table for my dataset. I loaded the data in both clusters and ran some queries. I did the same for Athena, created the table, ran the same queries, and compared the results. Creating a table in Athena is literally pointing to an S3 bucket, listing the columns in your files, giving them a name and type, and that's it. It took less than five seconds. My dataset is 12 gigabytes compressed, so every full scan costs fractions of a dollar. I did the same on a small Redshift cluster with four DC1 large nodes. It took six minutes to create the cluster and 38 minutes to load the data. The cluster costs a dollar an hour, about a third of that for reserved instances. The larger cluster was faster but much more expensive. The dataset is one table with one billion lines of fake e-commerce transactions, 10 columns in CSV format, and 1,000 compressed files in S3. I ran queries to count the number of cells, find the average basket size, determine when women spend the most, find the top 10 states where they spend the most, and the top 10,000 women who spend the most in those states. I ran these queries on all three systems. The larger Redshift cluster completed the query in 5.7 seconds, while the small Redshift cluster and Athena were still running. Athena took about 40 seconds to return the first row. The small Redshift cluster was much slower, and I had to cancel it. Athena consistently took about 20 seconds to scan 12 gigabytes. The small Redshift cluster had similar performance to Athena for some queries, while the 8XL cluster was much faster but more expensive. For this dataset, there's no reason to use a small Redshift configuration because Athena is just as good. Athena is about equivalent to four DC1 large nodes, but it's up and running in seconds, whereas a Redshift cluster takes six minutes to create and 38 minutes to load data. Using Athena can save a lot of money. The serverless comparison makes sense because the savings are similar to those of Lambda versus EC2. Athena is simple, fast, and extremely cheap. If you need ultimate performance, Redshift still rules, but for most use cases, Athena is a great choice. Would you go for EMR, Redshift, or Athena? EMR is like a bulldozer for moving large amounts of unstructured data. Redshift scales very high and can handle complex processing, ETL, and huge joins. It's a Formula One for data. Athena is about answering questions quickly and simply. No fuss, no plumbing, no drama. Create your table, ask your questions, and get your answers. If you need a simple, cost-efficient alternative to existing data services and want to run queries in minutes or seconds, I believe you should use Athena. Before I go, Danilo's book is out, so congratulations, and please buy it on Amazon.com. Thank you so much for having me. I'm very impressed by the group, and I hope you like Athena. I hope to be back to talk about more stuff in the future. Thank you. One of the things Presto is good at is pulling data from heterogeneous data sources. I can see a need where you have a lot of data in S3 but also want to pull in extra data from SQL. Any plans for that? As you know, it's always a risky proposition to make bets about the AWS roadmap because it changes a lot. I can't comment on the future. However, today, if you have data in Hive, you can export it back to S3, possibly keeping the Parquet or ORC format. If you have MySQL or RDS, you can export to CSV or JSON. S3 is the central repository for data in the AWS universe, and that's likely to remain the case for a while. Connecting Athena to other sources could introduce complexity, but exporting data to S3 while keeping the native format makes sense. If you yell loud enough, you might get a connector for RDS. Who knows? Thank you very much. Round of applause for Julien. Thank you.

An introduction to Amazon Athena

Transcript

Tags

About the Author