All right, I think it's time to start. Good afternoon, everyone. My name is Julien, and I work for AWS. We're going to talk about backends. Although the title of this session refers to Java, what I'm going to tell you could apply to any other language, but the code examples I'll use are based on Java.
Just a few words about me. I've been using Java for as long as I can remember. Sometimes I wish I didn't. I started working with the first versions of Java, doing embedded systems at the time, so I could tell you some horror stories about embedded Java in 1999, but we'll skip that. Generally, one day I love Java, and the next I swear I'll never write a single line of that language again. But hey, it's still very popular, and it gets the job done, so why not? I travel all over the world to meet developers, engineers, sysadmins, etc. It's pretty cool; it's my first time in Riga, so I'm definitely enjoying it. The rest of the time, I spend my life building, scaling, and fixing stuff. And sometimes fixing people.
So, what is this about? First, let's take a few seconds to talk about how you actually build Java apps on AWS. Then we'll quickly jump into the core of the discussion, which is what are my database options? We'll talk about RDS and DynamoDB. We'll also discuss analytics, as the lines between databases and analytics are blurring. We'll talk about EMR and Hive specifically, as well as Athena and Redshift. We'll try to sum things up and give you some solid, real-life advice on what to choose. All the code I'm using is available on GitHub. If anyone finds a bug, send me a pull request, and you might win a sticker or something.
There will be time at the end of the session for questions, but if you have a question that's really bugging you, please raise your hand. Try to keep it short, and we can talk more over beers after the session.
When you build stuff on AWS, you always have four deployment options. The classic one is deploying to EC2, to virtual machines. Anyone using EC2? Most likely, you're spending too much money. I'll give you some better options. Option number two is using Docker containers within ECS. Anyone using ECS? Do you like it? Kind of, okay. The third option is Lambda, building serverless apps. Anyone doing that? Everywhere I go, people use Lambda, so it's a worldwide epidemic now. Good. For Lambda, Java 8 is the only Java version you can use. I'm guessing you don't deploy it with the raw AWS CLI; probably you're using frameworks like Serverless, Gordon, or Apex. Is that what you're using? No? Okay, we can talk about that later. The last option is to use a platform as a service, which means Elastic Beanstalk. Anyone using that? Okay, we need to talk. What do you mean it's not working? Come on. It is. And it's probably the simplest way to deploy anything on AWS because you don't have to manage the infrastructure. You just zip your code and push it to a pre-built environment, and it just works. You can use all versions of Java and quite a few containers. We have a Java SDK that still works with Java 1.6 if you're stuck with that, and it's available on GitHub. If you're using Eclipse, we have an AWS plugin. I'm guessing 99% of you use IntelliJ when you do Java stuff, right? But hey, I'm an older guy, so I use Eclipse and keep the legend alive. We have a pretty nice plugin maintained by AWS that allows you to build projects, deploy stuff in one click, and browse quite a few AWS resources. I'll use that today and can quickly show it to you.
In the spirit of being a modern guy, I looked for AWS plugins on IntelliJ. I found a Beanstalk plugin, but none of these are maintained by AWS; they're all third-party. There's a CloudFormation plugin to help build your infrastructure templates and an AWS Manager plugin similar to the one for Eclipse, which allows you to see your resources. However, it's very old, not maintained anymore, and the sources are not available. Hopefully, the initial developer will post the sources, and people will resurrect it. But today, I wouldn't recommend it.
A quick reminder on credentials, as security is an obsession for AWS and should be for everyone. Permissions and roles are managed in a service called IAM, Identity and Access Management. When you create users (they could be human users or applications), you need to create an access key and a secret key. Using those keys, you authenticate and maybe get permission to use AWS resources. The default is that everything is forbidden, which is the best default setting for security. When you create a user or keys for an application, by default, nothing is allowed, and you have to explicitly allow each API to be invoked and each resource to be accessed. It's a bit of work, but it's the most secure way to do things. Permissions can be attached to groups or users for people; for applications, you attach them to roles and then attach the role to an EC2 instance, Lambda function, or Docker container, giving it permission to access the APIs and resources it needs.
If you've worked with Lambda before, you're familiar with this. Please do not hard-code secrets in your apps. I still see this, and it's really bad practice on AWS and everywhere. Passwords in code, API keys in code, it's a nightmare. Do not store them on EC2 instances because if one of your instances gets compromised, much more than your instance could be compromised. Trust me, there is no reason to do it, never, even in development. It will always end in tears. You will end up committing your secret keys to GitHub, and we monitor that. You will get a friendly email from AWS support saying we suspended your account because we found your keys on GitHub. Please send us an email or call us, and we'll resume your account. It's always unpleasant, but it's better than letting your account be used by millions of developers on GitHub who are keen on enjoying free AWS resources on your behalf.
For applications, you must use roles. When you create an instance, Lambda function, or container, you assign a role to it. That role carries the permissions, giving the service access to the resources it needs. For backend credentials like user logins and passwords for MySQL or Redshift, you can use a feature of EC2 Systems Manager called the Parameter Store. It's a secure store where you can put all your credentials and have them encrypted automatically by KMS, our key management service. You just put them there and fetch them securely when needed. This is what I'm using.
This is the reference architecture for a backend. I try to keep it simple. I won't talk about all of these today. The first layer is really about data stores like RDS and DynamoDB. We have some other options, like CloudSearch or S3, but you can't really call them databases. They're cool, but they're not databases. Then we have analytics, and good choices there include Amazon EMR with Hive, Athena, and Redshift. Together with RDS and DynamoDB, those are the five we're going to look at today.
Let's start with databases. The first one is the obvious one, usually the first one people look at, called Amazon RDS. The name is pretty obvious; it's a managed service for relational databases. What does it mean to be a managed service? It means you don't have to create, scale, or fix infrastructure. We do that. You basically launch an instance using one of the engines I'll mention in a second. You pick an instance size, decide whether you want high availability or not, and you can always add it later if you want. You wait a few minutes, and your server pre-installed with your database engine is ready. You can scale up and down and shut them down when you don't need them anymore. It's pretty cost-effective, and there's a 99.95% SLA on it, which we tend to exceed.
The engines available include MySQL, MariaDB, Postgres, and this slide is never up-to-date. It's probably outdated today. I checked it yesterday, and I wouldn't be surprised if something had changed. Until quite recently, we still supported MySQL 5.1, if you can believe that. It changes every day. For commercial databases, we support Oracle and SQL Server in way too many versions to fit on this slide. The last one we support is called Aurora. Anyone heard of Aurora? A few people. Aurora is a high-performance implementation of MySQL and now Postgres. Our teams took MySQL and Postgres and injected the AWS secret sauce to make them five to ten times faster and to improve high availability and scalability. Aurora can scale storage automatically up to 64 terabytes. You could have a 64-terabyte relational database. Whether it's a good idea is debatable, but hey, whatever floats your boat. A very cool research paper on the design principles of Aurora came out just a few days ago. It's a good read.
So, who uses RDS? Well, show me an application that doesn't need a relational database. It's one of our most popular products. I could pick one here, like Airbnb, which everyone knows. Airbnb uses RDS for pretty much all their backends, which is spectacular given their scale. They picked RDS early on and stuck with it because it scaled and allowed them to focus on building the service, application, and product, not doing DBA stuff. Rovio is another company that scaled well, and if you have a Lamborghini, they're using RDS too. Everyone's using databases, and RDS is a very popular choice because it simplifies your life. You focus on building the product, not tweaking your database.
Let's look at an example. On each slide, there's a link to the SDK for the service, how to create a cluster, scale it, and install the JDBC driver. In this case, I'm using the MySQL driver. So, how do you connect to these? You've seen this a million times before, right? It's no different. All these backends, except one, use JDBC, so I'm using vanilla JDBC code to connect to them. The only difference is some need a user and password, and some just need AWS credentials, like a role. The query I'm running is the same data set on all backends except DynamoDB. It's one billion lines of fake e-commerce transactions, not AWS or Amazon transactions. I didn't steal any logs. It's a billion lines, and I'm looking for all people named Jones in Florida. A simple query.
Here's how you would connect to Aurora. I'm using the MySQL driver, and that's my endpoint, so the cluster name and database name. You've done this a million times. Here, I'm calling the Parameter Store to get credentials: the login is stored in a parameter called back-end user and the password in one called back-end password, so I don't have to hard-code anything. You build a client for the Systems Manager service and send a get-parameter request with the parameter name, and it gives you the value back and decrypts it for you. Couldn't be simpler. Please, do not hard-code stuff. There's no reason to do it now.
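In Java, that lookup is just a couple of calls to the Systems Manager client. Here is a minimal sketch, assuming SDK for Java v1 and two hypothetical parameter names (backendUser and backendPassword):

```java
import com.amazonaws.regions.Regions;
import com.amazonaws.services.simplesystemsmanagement.AWSSimpleSystemsManagement;
import com.amazonaws.services.simplesystemsmanagement.AWSSimpleSystemsManagementClientBuilder;
import com.amazonaws.services.simplesystemsmanagement.model.GetParameterRequest;

public class ParameterStoreExample {
    public static void main(String[] args) {
        // Build a Systems Manager client in the region where the parameters live.
        AWSSimpleSystemsManagement ssm = AWSSimpleSystemsManagementClientBuilder.standard()
                .withRegion(Regions.US_EAST_1)
                .build();

        // Fetch each SecureString parameter and let SSM/KMS decrypt it for us.
        String user = ssm.getParameter(new GetParameterRequest()
                        .withName("backendUser")
                        .withWithDecryption(true))
                .getParameter().getValue();

        String password = ssm.getParameter(new GetParameterRequest()
                        .withName("backendPassword")
                        .withWithDecryption(true))
                .getParameter().getValue();

        System.out.println("Got credentials for user " + user);
        // The password is then passed straight to the JDBC connection, never hard-coded.
    }
}
```

The role attached to the instance or Lambda function only needs permission to read those two parameters, which keeps the blast radius small if anything leaks.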
So, I'm getting my credentials and can connect. I create an RDS client in the US East 1 region, where my cluster lives. I'll call one of the Java APIs for RDS to describe the instance I'm connecting to and then send that query looking for Joneses in Florida. I've got my credentials. That's my DescribeDBInstances output, showing the name and size of the instance. I have a master and a read replica. How many people did I find? 226. So, whoever thinks relational databases are not fast enough, think again. It took one second. Let's not jump to conclusions right away.
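Put together, the Aurora part of the demo looks roughly like this. A sketch, assuming SDK v1, a hypothetical cluster endpoint, database, and transactions table; the user and password come from the Parameter Store lookup shown earlier:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import com.amazonaws.regions.Regions;
import com.amazonaws.services.rds.AmazonRDS;
import com.amazonaws.services.rds.AmazonRDSClientBuilder;
import com.amazonaws.services.rds.model.DescribeDBInstancesRequest;

public class AuroraExample {
    public static void main(String[] args) throws Exception {
        // Describe the instance through the RDS API, just to show what's running.
        AmazonRDS rds = AmazonRDSClientBuilder.standard().withRegion(Regions.US_EAST_1).build();
        rds.describeDBInstances(new DescribeDBInstancesRequest()
                        .withDBInstanceIdentifier("aurora-demo-instance"))
                .getDBInstances()
                .forEach(i -> System.out.println(i.getDBInstanceIdentifier() + " " + i.getDBInstanceClass()));

        // In the demo, these come from the Parameter Store (see the earlier sketch).
        String user = "...";
        String password = "...";

        // Plain JDBC against the cluster endpoint -- Aurora speaks the MySQL protocol.
        String url = "jdbc:mysql://aurora-demo.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com:3306/sales";
        try (Connection conn = DriverManager.getConnection(url, user, password);
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT COUNT(*) FROM transactions WHERE lastname = ? AND state = ?")) {
            ps.setString(1, "Jones");
            ps.setString(2, "FL");
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                System.out.println("Joneses in Florida: " + rs.getLong(1));
            }
        }
    }
}
```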
Aurora is like MySQL but way faster. If you can do MySQL, you can do Aurora. Just enjoy the speed. Let's look at DynamoDB now. DynamoDB is a totally different beast. It's a NoSQL database, fully managed, even more so than RDS. We simply create a table and put and get items in the database. We never worry about scaling or speed. It's NoSQL as a service. It's been around for a while: it grew out of the Dynamo system Amazon built early on and described in a paper back in 2007, and it launched as an AWS product in 2012. If you know the blog allthingsdistributed.com, it's the blog of Werner Vogels, the CTO of Amazon. I highly recommend reading everything he posts, especially the technical articles.
DynamoDB is a very popular service, and you probably know about Expedia, the travel website. They have crazy traffic and use DynamoDB for real-time analytics. They push over 200 million events a day into DynamoDB and do real-time analytics on top of that. The only thing you have to do is create a table, and that's pretty much it. No infrastructure to manage. You can scale it way up and not worry about it. Focus on building your business and product, leave the plumbing to us. We're the plumbing guys.
Let's look at DynamoDB. There's no connection in the traditional sense. I'll connect to the service, not an instance. I'll create a table called My Favorite Movies Table and write some items into it. I should mention there's a local version of DynamoDB you can run on your machine. It's in Java, and you can start it locally and connect to the local endpoint to work on your machine with DynamoDB, which is nice because you can work locally, do all your testing, and then switch to the production service.
How do we connect? This time, I'm using the Ireland region, EU West 1. I'm creating a table, which looks scary but isn't. Java builders are the hate part of my love-hate relationship. Table name, key schema. This table will have a key called name, a hash key, allowing me to shard items across DynamoDB nodes. The attribute is called name and is a string, which makes sense. It's a complicated way to say I need a sharding key that's a string and that it's called name. This is the provisioned throughput, which says how many reads and writes we're going to do per second. These are capacity units, so one unit doesn't map exactly to one read or write per second, but we don't care today. This is how you scale DynamoDB. If you need a million writes per second for a day or two, you would go to the API, tweak those parameters, and DynamoDB would provision enough hardware for that in maybe an hour or so, and then you could turn it down when done.
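Scaling that throughput up or down is just a call to UpdateTable. A minimal sketch, assuming SDK v1 and a hypothetical table name:

```java
import com.amazonaws.regions.Regions;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
import com.amazonaws.services.dynamodbv2.model.UpdateTableRequest;

public class ScaleTableExample {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard()
                .withRegion(Regions.EU_WEST_1).build();

        // Bump the provisioned write capacity for a traffic spike, then dial it back down later.
        client.updateTable(new UpdateTableRequest()
                .withTableName("MyFavoriteMoviesTable")
                .withProvisionedThroughput(new ProvisionedThroughput(100L, 10000L)));
    }
}
```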
I'm creating the table if it doesn't exist, waiting a few seconds for the table to show up, getting the table name, describing the table, and adding some items. It's a movie table, so let's add some movies: Star Wars, Star Trek, Phantom Menace. Does anyone here think Phantom Menace deserves more than one star? Hopefully not. I would put zero if I could. Lord of the Rings, which should have 10 stars. Then I'm going to get one of those items and build a request to find items with five stars. This looks a little complicated but isn't. The value I'm looking for is five stars. I build my scan request to find everything with a rating equal to that value, which is five stars, and just run it.
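With the document API of the Java SDK (v1), the whole flow is only a few lines. A sketch, assuming a hypothetical table name and the attribute names used in the demo:

```java
import java.util.Arrays;

import com.amazonaws.regions.Regions;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.document.spec.ScanSpec;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;
import com.amazonaws.services.dynamodbv2.model.AttributeDefinition;
import com.amazonaws.services.dynamodbv2.model.KeySchemaElement;
import com.amazonaws.services.dynamodbv2.model.KeyType;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
import com.amazonaws.services.dynamodbv2.model.ScalarAttributeType;

public class MoviesTableExample {
    public static void main(String[] args) throws Exception {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard()
                .withRegion(Regions.EU_WEST_1).build();
        DynamoDB dynamoDB = new DynamoDB(client);

        // Create a table with a single string hash key called "name".
        Table table = dynamoDB.createTable("MyFavoriteMoviesTable",
                Arrays.asList(new KeySchemaElement("name", KeyType.HASH)),
                Arrays.asList(new AttributeDefinition("name", ScalarAttributeType.S)),
                new ProvisionedThroughput(10L, 10L));
        table.waitForActive();  // wait until DynamoDB has provisioned the table

        // Put a few items -- no schema beyond the key, just attributes per item.
        table.putItem(new Item().withPrimaryKey("name", "Star Wars").withNumber("rating", 5));
        table.putItem(new Item().withPrimaryKey("name", "Star Trek").withNumber("rating", 4));
        table.putItem(new Item().withPrimaryKey("name", "The Phantom Menace").withNumber("rating", 1));
        table.putItem(new Item().withPrimaryKey("name", "The Lord of the Rings").withNumber("rating", 5));

        // Read one item back by key.
        System.out.println(table.getItem("name", "Star Wars").toJSONPretty());

        // Scan for every five-star movie.
        ScanSpec spec = new ScanSpec()
                .withFilterExpression("rating = :vrating")
                .withValueMap(new ValueMap().withNumber(":vrating", 5));
        table.scan(spec).forEach(item -> System.out.println(item.toJSONPretty()));
    }
}
```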
Let me check that I deleted the table. This is the browser, the manager I showed you. I can look at my services. Let's run this. Oh, it's still running; let's do that again. Creating the table takes maybe five to ten seconds because it needs to provision resources. Let's wait five to ten seconds. Come on, DynamoDB. Don't let me down now. I'm describing my table, so I can see my attributes, key, creation date, and other details. Then I'm putting my four movies. The result is empty because I didn't ask for a result. I could ask for the old value to be sent back or the actual new value written, but I don't need it here. I'm searching for the movie Star Wars. Here I'm finding my two five-star movies, Lord of the Rings and Star Wars. That's it. NoSQL as a service, zero infrastructure to provision. For Aurora, I started the Aurora cluster, which took 10 minutes. For DynamoDB, I did nothing. I just did what I did now. I should be able to see my... I need to refresh it. This is way too small, but you can see four items.
If you insist, you could still do Cassandra or MongoDB, but my answer would be why? I've done that in the past but learned from my mistakes. Here's a recap, and I included more on the other backends, so don't worry; we're not going to read all that stuff. These are the two we talked about. As you can see, they're pretty similar, except for whether you have a schema or not, which is probably the main thing to select between relational and NoSQL databases. You could have a lot of data stored in both and query them at fairly high rates, although DynamoDB will scale up to the sky. We had customers who ran TV ads during the Super Bowl in the US and went up to one million reads per second on DynamoDB with no worries. Their website did not go down, unlike some others. DynamoDB just shrugs off huge request rates. For the rest, it's more about whether you have a schema or not. There's some additional information on the other backends, but you can read that stuff at home.
Let's move on to analytics. I'll start with EMR, which is the oldest one. EMR is a managed service for the Hadoop ecosystem. We started with Hadoop, Hive, and Pig and added Spark, Flink, Presto, and more. It's a managed service. You just create a cluster, specify how many nodes you want and the Hadoop applications, wait a bit, and the cluster comes pre-installed. You don't have to install ZooKeeper manually. Has anyone installed ZooKeeper manually? Was it enjoyable? It wasn't for me, so that makes it two of us. A Hadoop cluster ready to go in about 10 minutes. You can resize and stop clusters, and it's nicely integrated with S3. You can keep all your data in S3, load it onto the cluster, process it, and save the results back to S3 without worrying about HDFS.
We have tons of customers using it, but I selected one of the really spectacular ones. It's not a sexy company or a startup but the largest regulatory agency for financial markets in the US, called FINRA. They look at all market trades in the US during the day, store everything, and run crazy Hadoop and machine learning jobs on that data at night to find anything suspicious, illegal, or not quite right. Imagine billions of trades every day, and they have to look for the few things that are probably illegal. They report it to the SEC, and hopefully, the bad guys go to jail. To do this, they have to wait for the market to close. When the market closes in the afternoon, they fire up the Hadoop clusters and use up to 10,000 nodes at any given time. If you needed to build it yourself in your data center, 10,000 nodes would be impossible. They run all the jobs, get the results, and shut down the clusters, saving tens of millions of dollars every year compared to a 24-7 cluster that's only useful a few hours a day. Just to show you how large you can go with EMR.
Let's look at Hive. My tunnel is still up, which is good. Hive, where are you? I'm using the JDBC driver for Hive but connecting locally because I have an SSH tunnel between my machine and the EMR cluster running in the US. Authentication is done through the tunnel with the SSH key I attached when creating the cluster. Let me start it right away, because it's Hive and it takes a while. The first thing I'm doing is describing the cluster using the Java SDK for EMR to get some information. That part was pretty fast. Cluster ready, creation time, subnets, and everything you would see in the AWS console about this cluster. After 30 seconds or so, I got my result. It's the same number, which is reassuring. It's the exact same data set and SQL query. Aurora took one second with one node. Hive, with 15 nodes, took probably 20 seconds. If you think you have to rush to Hadoop and Hive for speed, please reconsider. Hive was great for a while, but there are better options. Still, using it is straightforward. You use JDBC and run queries. The interesting part is how fast it is and whether it's what you really need.
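The Java side is the same pattern as before: describe the cluster with the EMR SDK, then run plain JDBC through the Hive driver. A sketch, assuming SDK v1, a hypothetical cluster ID, the same hypothetical transactions table, and an SSH tunnel forwarding the HiveServer2 port to localhost:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import com.amazonaws.regions.Regions;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.DescribeClusterRequest;

public class HiveOnEmrExample {
    public static void main(String[] args) throws Exception {
        // Describe the EMR cluster: state, creation time, and so on.
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.standard()
                .withRegion(Regions.US_EAST_1).build();
        System.out.println(emr.describeCluster(
                        new DescribeClusterRequest().withClusterId("j-XXXXXXXXXXXXX"))
                .getCluster().getStatus());

        // HiveServer2 over JDBC; the SSH tunnel makes it look like a local endpoint.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hadoop", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT COUNT(*) FROM transactions WHERE lastname = 'Jones' AND state = 'FL'")) {
            rs.next();
            System.out.println("Joneses in Florida: " + rs.getLong(1));
        }
    }
}
```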
Let's move to Athena. Anyone heard of Athena? A few people? Athena is a relatively new service, announced at re:Invent last December. What can you do with Athena? You can run read-only queries on data stored in S3. You could have petabytes of data, log data, or application data in S3 and query it directly without loading or indexing anything. You don't create any infrastructure or start anything. With EMR, you have to start something; with Aurora, you have to start something; with DynamoDB, you have to create a table. Here, you don't do anything. You create an external table to map a schema onto the data in S3, which takes 200 milliseconds, and then you can go. The service is based on an open-source project called Presto, built by Facebook. We took it and made it an AWS service. It uses Hive for table creation and data definition, and ANSI SQL for everything else. I can use the same SQL queries on Athena. There are a few exceptions, mostly user-defined functions, but that's about it.
The cool thing about Athena is that it can adapt to many different formats, because your data is in S3 and the last thing you want to do is transform it. You want to query it as it is. We can handle unstructured, semi-structured, and structured data, including columnar formats like Parquet or ORC, which give you excellent performance. I'm using ORC today. If you have data in one of these formats in S3, just create a table that lists the columns, and you're ready to go. You can also use compressed data or partitioned data, which is even better because you scan less data, making your queries faster and cheaper. The price of Athena depends on how much data your queries scan. If you use columnar formats plus compression plus partitioning, you end up paying fractions of pennies, which is nice.
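Mapping such a data set to a table is a single DDL statement; nothing gets loaded or indexed. A sketch, assuming a hypothetical bucket, table, and column layout, and an Athena JDBC connection like the one shown a bit further down:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class AthenaTableSetup {
    // Maps an ORC-backed, partitioned data set in S3 to an Athena external table.
    static void createExternalTable(Connection athenaConnection) throws SQLException {
        String ddl =
                "CREATE EXTERNAL TABLE IF NOT EXISTS transactions ("
              + "  firstname STRING, lastname STRING, state STRING, amount DOUBLE) "
              + "PARTITIONED BY (year INT, month INT) "
              + "STORED AS ORC "
              + "LOCATION 's3://my-bucket/transactions/'";
        try (Statement stmt = athenaConnection.createStatement()) {
            stmt.execute(ddl);
            // Register the partitions that already exist under the LOCATION prefix.
            stmt.execute("MSCK REPAIR TABLE transactions");
        }
    }
}
```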
We have customers already using it. DataXu is an ad tech company in the US that runs real-time bidding and has a retargeting platform. They get display data from millions of websites worldwide and decide in real time if they want to buy advertising space and show a banner. They get tons of traffic, funnel it into Kinesis, a scalable and high-performance streaming message queue, and then into their actual crunching machine. They dump all the results to S3 and use Athena for ad hoc analysis, investigation, data science, and interactive queries, building reports. They have a crazy amount of data coming in every day. They just drop it in S3 and query it with Athena. No worries.
Let's look at Athena. It's completely, fully, absolutely managed, and there is no Java SDK for Athena. I cannot call describe table or describe anything. It's probably the only AWS service without a Java SDK, but you don't really need one. All we need is a JDBC driver. The connection to Athena is a bit different. I'm using JDBC but showing a different example. I could use different types of credentials. Here, I decided to show you how to do it using local credentials stored on my laptop. These are just properties you pass to the JDBC driver, telling it where my access key and secret key are. I'm pointing to a credentials provider class to say, "Hey, please read my credentials locally, don't look anywhere else." Another difference, and this one is a bit sad, is that as of today, you cannot use prepared statements with Athena. I was disappointed when I found out, but I can blame Presto for it. It's an issue in Presto where prepared statements are not implemented. I'm trying to push our team to do it and contribute it back to Presto, but I'm not sure they're listening today. It's the same request, but it's a regular statement, not a prepared statement, which is a bit of a letdown. The rest is completely identical. Let's run it. I created nothing; the table just points to stuff in S3. Let's see how it goes. One billion lines, no infrastructure at all. Come on, Athena, wake up. Yes, thank you. 226 again, okay, and that took five to six seconds. Imagine you have billions and billions of lines somewhere, load balancer logs, application logs, and you have that question. Normally, you would go to your Hadoop cluster or BI cluster, ingest all those logs, and it would take hours to set up. Only then could you ask the actual question and hopefully get the answer. Here, just create a table that maps to the actual data in S3, ask your question, and you're done. It will take five minutes, not hours. I'm madly in love with Athena, I have to say.
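Here's roughly what that connection looks like. A sketch, assuming the first-generation Athena JDBC driver; the exact driver class and property names may differ by driver version, and the staging bucket for query results is hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class AthenaQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("com.amazonaws.athena.jdbc.AthenaDriver");

        Properties props = new Properties();
        // Where Athena writes query results, and which credentials provider to use locally.
        props.setProperty("s3_staging_dir", "s3://my-athena-results/");
        props.setProperty("aws_credentials_provider_class",
                "com.amazonaws.auth.profile.ProfileCredentialsProvider");

        // Note: a plain Statement, not a PreparedStatement -- Presto doesn't support them here.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:awsathena://athena.us-east-1.amazonaws.com:443/", props);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT COUNT(*) FROM transactions WHERE lastname = 'Jones' AND state = 'FL'")) {
            rs.next();
            System.out.println("Joneses in Florida: " + rs.getLong(1));
        }
    }
}
```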
But there's even better now. The last one I want to talk about today is called Redshift, which has been around for a while. Anyone using Redshift? All right, well, you use all of them, come on. You definitely deserve a sticker. Redshift is a relational data warehouse based on SQL and compatible with PostgreSQL, which is good because I hate learning new stuff. It's fully managed, so you just fire up the cluster, wait a bit, and get your connection string to the cluster. It's massively parallel, and you could have a very large number of nodes, up to more than 100 if you wanted. It scales to multiple petabytes of data. We have customers who go that big, like NTT in Japan, the telco operator, running multiple Redshift clusters at petabyte scale. It works well for them. You can get very cost-efficient price points. When you talk about data warehouses, you usually compare the cost per terabyte per year, and with Redshift, you can get down to around a thousand dollars. If you have a data warehouse in the office, it's probably more expensive than that.
We extended the service with Redshift Spectrum, which gives Redshift Athena-like capabilities. Now, with Redshift, you can also create external tables and query data hosted in S3. But it's a different engine: Spectrum was built by AWS and is not based on Presto. This is the architecture of Redshift. Client applications connect through standard drivers to the leader node of the cluster, sending SQL queries. The leader node builds the execution plan, optimizes the request, breaks it down into small pieces, sends each piece to a compute node, and each compute node runs its part of the request on its local data. When done, they send all the results back to the leader node, which sends you the result. If the table in your query is an external table, the query goes to Spectrum. Spectrum is a managed fleet of instances separate from the Redshift nodes; it queries your data in S3 and sends the result back to the leader node, which sends you the result.
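From the client side, wiring Spectrum up is just more SQL over the same JDBC connection: you declare an external schema backed by the data catalog, then external tables pointing at S3. A sketch, assuming a hypothetical IAM role, database, bucket, and column layout; check the Spectrum documentation for the exact DDL and the file formats supported:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class SpectrumSetup {
    static void createExternalTable(Connection redshiftConnection) throws SQLException {
        try (Statement stmt = redshiftConnection.createStatement()) {
            // External schema backed by the data catalog; the IAM role lets Redshift read S3.
            stmt.execute(
                "CREATE EXTERNAL SCHEMA spectrum "
              + "FROM DATA CATALOG DATABASE 'sales' "
              + "IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole' "
              + "CREATE EXTERNAL DATABASE IF NOT EXISTS");

            // External table pointing at files in S3 -- queries on it are handled by Spectrum.
            stmt.execute(
                "CREATE EXTERNAL TABLE spectrum.transactions ("
              + "  firstname VARCHAR(64), lastname VARCHAR(64), state CHAR(2), amount DOUBLE PRECISION) "
              + "STORED AS ORC "
              + "LOCATION 's3://my-bucket/transactions/'");
        }
    }
}
```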
When Redshift came out, it was a Hive killer. We had a lot of customers with huge data sets suffering with Hive who moved to Redshift. Most of them said Redshift is 20 to 50 times faster than Hive or competing solutions. Redshift is very fast, and with Spectrum, it's even faster. Let's do this. I need to get the credentials, describe the cluster, and send my query. You've seen this five times already. Let's do it. Where's my cluster? Come on. Ah! Alright. Let's do it again. It's a little slow, why? I don't know. Yeah, 226. That should be faster. I don't know, maybe it's that describe operation that slows us down. But Redshift should be, in this case, quite a bit faster than Athena. This isn't a proper benchmark, but Redshift can scale extremely high, and if you already have a data warehouse on Redshift, Spectrum lets you extend it out to the data in S3. Okay, here we go. That's more like it. That was faster than Athena. That's what I wanted to show you. Thank you.
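And the Redshift client code follows the same recipe one more time: describe the cluster via the SDK, then plain JDBC. A sketch, assuming SDK v1, the Redshift JDBC driver, a hypothetical cluster endpoint and database, and credentials pulled from the Parameter Store as before:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import com.amazonaws.regions.Regions;
import com.amazonaws.services.redshift.AmazonRedshift;
import com.amazonaws.services.redshift.AmazonRedshiftClientBuilder;
import com.amazonaws.services.redshift.model.DescribeClustersRequest;

public class RedshiftQueryExample {
    public static void main(String[] args) throws Exception {
        // Describe the cluster: node type, node count, endpoint, and so on.
        AmazonRedshift redshift = AmazonRedshiftClientBuilder.standard()
                .withRegion(Regions.US_EAST_1).build();
        redshift.describeClusters(new DescribeClustersRequest()
                        .withClusterIdentifier("redshift-demo"))
                .getClusters()
                .forEach(c -> System.out.println(c.getNodeType() + " x " + c.getNumberOfNodes()));

        // In the demo, these come from the Parameter Store (see the earlier sketch).
        String user = "...";
        String password = "...";

        Class.forName("com.amazon.redshift.jdbc42.Driver");
        String url = "jdbc:redshift://redshift-demo.xxxxxxxx.us-east-1.redshift.amazonaws.com:5439/sales";
        try (Connection conn = DriverManager.getConnection(url, user, password);
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT COUNT(*) FROM transactions WHERE lastname = ? AND state = ?")) {
            ps.setString(1, "Jones");
            ps.setString(2, "FL");
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                System.out.println("Joneses in Florida: " + rs.getLong(1));
            }
        }
    }
}
```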
Time to wrap up. A quick comparison of these services. I'll let you read that part. We looked at two databases and three services for analytics. As you can see, they have strong points and weaker points. You need to find the one that's really good for your application. Only you can decide that, which is why we provide multiple choices. We build it, and you select the one you like best. The last thing I want to point out is that we have a couple of tools to help you migrate your databases to AWS. There's one called the Schema Conversion Tool, which you install locally on your machine. You point it to your database, say, "Here's my Oracle, MySQL, or SQL Server database, and I want to move it to something else." It looks at the schema, does what it can automatically, and tells you what you need to do manually. It can get a lot of the schema conversion done automatically, which is nice. We also have a service to do the actual migration called the Database Migration Service. You select the source and destination databases, and it works in all combinations. You could go into AWS, out of AWS, or stay inside AWS, and have pretty much anything as the source and anything as the destination, including now MongoDB and SAP. It's pretty extensive and helps you migrate your database from source to destination, even replicating changes while the source keeps running.
As you can see, you have a lot of different choices. I talked about Java today, but the same SDKs are available for other languages. You can select from many different options, and some of the tools, like MySQL, Postgres, or Hive, you already know; here we just take away the infrastructure pain. We help you focus on building your app, and we bring high availability, scalability, security, and compliance, so you don't have to build them yourself. These are super important to get right. You can focus on your creativity and productivity, work on your business, and leave the plumbing and low-level tasks to us. Thank you very much. If you have any questions, I'll be happy to answer. Thank you.