Julien Simon Picking the right AWS backend for your Java app
June 14, 2017
When it comes to data processing, software applications have many different requirements, which cannot be satisfied by a single backend. This is why AWS has built a number of data services, such as DynamoDB, RDS, EMR, Redshift and Athena. In this technical session, we will explore them all, compare their respective strengths, and demonstrate how you can best use them in your Java apps. Code and live demos will illustrate the discussion.
Transcript
All right, it's time to start. Good afternoon, everyone. My name is Julien, and I work for AWS. We're going to talk about backends. Although the title of this session refers to Java, what I'm going to tell you could apply to any other language, but the code examples will be based on Java.
Just a few words about me. I've been using Java for as long as I can remember. Sometimes I wish I didn't. I started working with the first versions of Java, doing embedded systems at the time, so I could tell you some horror stories about embedded Java in 1999, but we'll skip that. Generally, one day I love Java, and the next I swear I'll never write a single line of that language again. But hey, it's still very popular and gets the job done, so why not? I travel all over the world to meet developers, engineers, sysadmins, etc. It's pretty cool; it's my first time in Riga, so I'm definitely enjoying it. Before that, I spent my life building, scaling, and fixing stuff. And sometimes fixing people.
So, what is this about? First, let's take a few seconds to talk about how you actually build Java apps on AWS. Then we'll quickly jump into the core of the discussion, which is your database options. We'll talk about RDS and DynamoDB, and we'll also discuss analytics, as those lines are blurring. Where does the database stop, and where does analytics start? We'll talk about EMR and Hive specifically, as well as Athena and Redshift. We'll try to sum things up and give you some solid, real-life advice on what to choose. All the code I'm using is available on GitHub. If anyone finds a bug, send me a pull request, and you might win a sticker or something.
When you build stuff on AWS, you always have four deployment options. The classic one is deploying to EC2, to virtual machines. Anyone using EC2? Most likely, you're spending too much money on it. I'll give you some better options. Option number two is using Docker containers within ECS. Anyone using ECS? Do you like it? Kind of, okay. The third option is Lambda, for building serverless apps. Anyone doing that? Everywhere I go, people use Lambda; it's a worldwide epidemic now. Good. You can only use Java 8 for Lambda. I'm guessing you don't use the AWS CLI. Probably, you're using some frameworks like Serverless, Gordon, or Apex. Is that what you're using? No? We can talk about that later. The last option is using the platform-as-a-service way, which means Elastic Beanstalk. Anyone using that? Okay, we need to talk. What do you mean it's not working? Come on. It is, and it's probably the simplest way to deploy anything on AWS because you don't have to manage the infrastructure. You just zip your code and push it to a pre-built environment, and it just works. You can use all versions of Java and quite a few containers. We have a Java SDK that still works with 1.6 if you're stuck with that, and it's available on GitHub. If you're using Eclipse, we have an AWS plugin. I'm guessing 99% of you use IntelliJ when you do Java stuff, right? But hey, I'm an older guy, so I use Eclipse and keep the legend alive. We have a pretty nice plugin maintained by AWS that allows you to build projects, deploy stuff in one click, and browse quite a few AWS resources. I'll use that today and can quickly show it to you.
In the spirit of being a modern guy, I looked for AWS plugins on IntelliJ. I found an Elastic Beanstalk plugin, but none of these are maintained by AWS; they're all third-party. There's a CloudFormation plugin to help build your infrastructure templates and an AWS Manager plugin similar to the one for Eclipse, which allows you to see your resources. However, it's very old, not maintained anymore, and the sources are not available. Hopefully, the initial developer will post the sources, and people will resurrect it. But today, I would not recommend it.
A quick reminder on credentials, because security is an obsession for AWS and should be for everyone. Permissions and roles are managed in a service called IAM, Identity and Access Management. When you create users, whether humans or applications, you need to create an access key and a secret key. Using those keys, you will authenticate and may be granted permissions to use AWS resources. The default setting is that everything is forbidden, which is the best default setting for security. When you create a user or keys for an application, by default, nothing is allowed, and you have to explicitly allow this API to be invoked and this resource to be accessed. It's a bit of work but the most secure way to do things.
Permissions can be attached to groups or users for people, and for applications, you attach them to roles and then to an EC2 instance, Lambda function, or Docker container, giving it permission to access the API and resources. If you've worked with Lambda before, you're familiar with this. Please do not hard-code secrets in your apps. I still see this, and it's really bad practice. Passwords and API keys in code are a nightmare. Do not store them on EC2 instances because if one of your instances gets compromised, much more than your instance could be compromised. Trust me, there is no reason to do it, never, even in development. It will always end in tears. You will end up committing your secret keys to GitHub, which we monitor, and you will get a friendly email from AWS support saying your account has been suspended because we found your keys on GitHub. Please send us an email or call us to resume your account. It's always unpleasant, but better than letting your account be used by millions of developers on GitHub for free AWS resources.
When using applications, you must use roles. When you create an instance, Lambda function, or container, you must assign a role to it. That role will have permissions, giving the service permission to access the resources it needs. For backend credentials like user logins and passwords for MySQL or Redshift, you can use the Parameter Store, which is part of EC2 Systems Manager. It's a secure store where you can put all your credentials and have them encrypted automatically by KMS, our key management service. You can just put them there and get them securely when needed.
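To make this concrete, here's a minimal sketch of reading such credentials with the AWS SDK for Java, assuming two encrypted parameters have already been created in the Parameter Store; the parameter names are hypothetical, not the exact ones used in the demo.

```java
import java.util.List;

import com.amazonaws.services.simplesystemsmanagement.AWSSimpleSystemsManagement;
import com.amazonaws.services.simplesystemsmanagement.AWSSimpleSystemsManagementClientBuilder;
import com.amazonaws.services.simplesystemsmanagement.model.GetParametersRequest;
import com.amazonaws.services.simplesystemsmanagement.model.Parameter;

public class ParameterStoreExample {
    public static void main(String[] args) {
        // Credentials for this API call come from the instance/function role or a local profile.
        AWSSimpleSystemsManagement ssm = AWSSimpleSystemsManagementClientBuilder.defaultClient();

        // Fetch two SecureString parameters and let KMS decrypt them for us.
        // "backendUser" and "backendPassword" are hypothetical parameter names.
        List<Parameter> parameters = ssm.getParameters(
                new GetParametersRequest()
                        .withNames("backendUser", "backendPassword")
                        .withWithDecryption(true))
                .getParameters();

        for (Parameter parameter : parameters) {
            System.out.println(parameter.getName() + " retrieved, length "
                    + parameter.getValue().length());
        }
    }
}
```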
This is the reference architecture for a backend. I'll keep it simple and won't cover everything today. The first layer is about data stores like RDS and DynamoDB. We have other options, but you can't really call them databases, like Cloud Search or S3. They're cool but not databases. Then we have analytics, and good choices for analytics include Amazon EMR, Hive, Redshift, or Athena. We'll look at these five today.
Let's start with databases. The first one is the obvious choice, usually the first one people look at. It's called Amazon RDS, a managed service for relational databases. What does "managed service" mean? It means you don't have to create, scale, or fix infrastructure; we do that. You launch an instance using one of the engines I'll mention in a second, pick an instance size, decide if you want high availability, and wait a few minutes. Your server, pre-installed with your database engine, is ready. You can scale up and down and shut them down when you don't need them anymore, making it cost-effective. There's a 99.95% SLA, which we tend to exceed.
The engines available on RDS include MySQL, MariaDB, Postgres, and others. This slide is never up to date. It's probably outdated today. I checked it yesterday, and I wouldn't be surprised if something had changed. Until quite recently, we still supported MySQL 5.1, if you can believe that. It changes every day. For commercial options, we support Oracle and SQL Server in many versions. The last one we support is called Aurora. Anyone heard of Aurora? A few people. Aurora is a high-performance implementation of MySQL and now Postgres. Our teams took MySQL and Postgres, injected the AWS secret sauce, made them five to ten times faster, and improved high availability and scalability. Aurora can scale storage automatically up to 64 terabytes. I don't believe it's a good idea to have a 64 terabyte relational database, but hey, whatever floats your boat. A very cool research paper on the design principles of Aurora just came out a few days ago. It's a good read.
Who uses RDS? Show me an application that doesn't need a relational database. It's one of our most popular products. For example, Airbnb uses RDS for pretty much all their backends, which is spectacular given their scale. They picked RDS early on and stuck with it because it scaled and allowed them to focus on building the service, application, and product, not doing DBA stuff. Rovio is another company that scaled well and uses RDS too.
RDS makes your life simpler. You focus on building the product, not tweaking your database. Let's look at an example. Each slide has a link to the SDK for the service, showing how to create a cluster, scale it, and install the JDBC driver. In this case, I'm using the MySQL driver. How do you connect to these? It's no different from what you've seen a million times. All these backends use JDBC, so I'm using vanilla JDBC code to connect to them. The only difference is that some need a user and password, while others just need AWS credentials, like a role.
For example, here's how you connect to Aurora. I'm using the MySQL driver, and that's my endpoint, the cluster name, and the database name. I'm calling the Parameter Store to get credentials: the login is stored under a parameter called back-end user and the password under back-end password, so I don't have to hard-code anything. You build a client for the Systems Manager service and send a get-parameter request with the parameter name. It gives you the value back and decrypts it for you. Please do not hard-code stuff. There's no reason to do it now.
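The connection itself is plain JDBC. Here's a minimal sketch; the cluster endpoint and database name are placeholders, and the user and password would be the values fetched from the Parameter Store as above.

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class AuroraConnect {
    public static Connection connect(String user, String password) throws Exception {
        // Aurora is MySQL-compatible, so the stock MySQL Connector/J driver works.
        Class.forName("com.mysql.jdbc.Driver");

        // Placeholder cluster endpoint and database name.
        String url = "jdbc:mysql://my-aurora-cluster.cluster-xxxxxxxxxxxx.us-east-1.rds.amazonaws.com:3306/demo";
        return DriverManager.getConnection(url, user, password);
    }
}
```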
I create an RDS client in the US East 1 region, where my cluster lives. I call one of the Java APIs for RDS to describe the instance I'm connecting to and then send a query looking for people named Jones in Florida. One billion lines, hopefully, that's fine and fast. I got my credentials, and here's the describe db instance output, showing the name and size of the instance. I have a master and a read replica. How many people did I find? 226. If you think relational databases are not fast enough, think again. It took one second.
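Here's roughly what that looks like with the RDS API and a prepared statement; the instance identifier, table, and column names are assumptions, not the exact schema used in the demo.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import com.amazonaws.regions.Regions;
import com.amazonaws.services.rds.AmazonRDS;
import com.amazonaws.services.rds.AmazonRDSClientBuilder;
import com.amazonaws.services.rds.model.DBInstance;
import com.amazonaws.services.rds.model.DescribeDBInstancesRequest;

public class RdsQueryExample {
    public static void run(Connection connection) throws Exception {
        // Describe the instance we are about to query ("aurora-demo" is a placeholder identifier).
        AmazonRDS rds = AmazonRDSClientBuilder.standard().withRegion(Regions.US_EAST_1).build();
        for (DBInstance instance : rds.describeDBInstances(
                new DescribeDBInstancesRequest().withDBInstanceIdentifier("aurora-demo")).getDBInstances()) {
            System.out.println(instance.getDBInstanceIdentifier() + " : " + instance.getDBInstanceClass());
        }

        // Same kind of query as in the demo: people named Jones living in Florida.
        PreparedStatement statement = connection.prepareStatement(
                "SELECT COUNT(*) FROM people WHERE last_name = ? AND state = ?");
        statement.setString(1, "Jones");
        statement.setString(2, "FL");
        ResultSet results = statement.executeQuery();
        if (results.next()) {
            System.out.println("Found " + results.getInt(1) + " people");
        }
    }
}
```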
Now, let's look at DynamoDB. DynamoDB is a NoSQL database, fully managed, even more so than RDS. Here, we're simply creating a table and putting and getting items in the database. We never worry about scaling or speed. It's NoSQL as a service. It's been around for a while: it builds on Dynamo, a system Amazon built internally and described in a 2007 paper, and it became an AWS product in 2012. Werner Vogels, the CTO of Amazon, writes about it on his blog, allthingsdistributed.com. I highly recommend reading everything he posts, especially the technical articles.
DynamoDB is a very popular service. Expedia, the travel website, uses it for real-time analytics, pushing over 200 million events a day into DynamoDB and doing real-time analytics on top. The only thing you have to do is create a table. No infrastructure to manage. You can scale it way up and not worry about it. Focus on building your business and product, leave the plumbing to us.
Let's look at DynamoDB. There's no connection to an instance. I'll create a table called "my favorite movies" and write some items into it. There's a local version of DynamoDB you can run on your machine, which is nice for local testing. How do we connect? I'm using the Ireland region. I create a table, which looks scary but isn't. Java builders are the hate part of my love-hate relationship. The table name, key schema, and attribute are straightforward. The table will have a key called "name," a hash key, allowing me to shard items across DynamoDB nodes. The attribute is called "name" and is a string.
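Here's a sketch of that table creation with the SDK's builder style; the exact table name spelling is a guess, and the provisioned throughput values (covered next) are arbitrary.

```java
import com.amazonaws.regions.Regions;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeDefinition;
import com.amazonaws.services.dynamodbv2.model.CreateTableRequest;
import com.amazonaws.services.dynamodbv2.model.KeySchemaElement;
import com.amazonaws.services.dynamodbv2.model.KeyType;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
import com.amazonaws.services.dynamodbv2.model.ScalarAttributeType;

public class CreateMoviesTable {
    public static void main(String[] args) {
        // Ireland region, as in the demo.
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard()
                .withRegion(Regions.EU_WEST_1)
                .build();

        // A single string attribute called "name" serves as the hash (partition) key.
        CreateTableRequest request = new CreateTableRequest()
                .withTableName("myFavoriteMovies")
                .withKeySchema(new KeySchemaElement("name", KeyType.HASH))
                .withAttributeDefinitions(new AttributeDefinition("name", ScalarAttributeType.S))
                // Provisioned throughput: 5 reads and 5 writes per second, plenty for a test.
                .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L));

        client.createTable(request);
    }
}
```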
Provisioned throughput specifies how many reads and writes per second you want. For testing, I create the table, wait a few seconds for it to show up, describe the table, and add some items. It's a movie table, so let's add some movies: Star Wars, Star Trek, Phantom Menace, and Lord of the Rings. I look up the movie "Star Wars," and when I search for my five-star movies, I find two: Lord of the Rings and Star Wars. No infrastructure to provision. For Aurora, I started the Aurora cluster, which took 10 minutes. For DynamoDB, I did nothing.
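And here's a sketch of writing and reading items with the DynamoDB document API, assuming the table above; the ratings are made up for the example.

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;

public class MovieItems {
    public static void main(String[] args) {
        DynamoDB dynamoDB = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient());
        Table table = dynamoDB.getTable("myFavoriteMovies");

        // Put a few items; attributes other than the key are schemaless.
        table.putItem(new Item().withPrimaryKey("name", "Star Wars").withNumber("rating", 5));
        table.putItem(new Item().withPrimaryKey("name", "Star Trek").withNumber("rating", 4));
        table.putItem(new Item().withPrimaryKey("name", "The Phantom Menace").withNumber("rating", 1));
        table.putItem(new Item().withPrimaryKey("name", "The Lord of the Rings").withNumber("rating", 5));

        // Read one item back by its key.
        Item movie = table.getItem("name", "Star Wars");
        System.out.println(movie.toJSONPretty());
    }
}
```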
Let's move on to analytics. I'll start with EMR, the oldest one. EMR is a managed service for the Hadoop ecosystem, starting with Hadoop, Hive, and Pig, and adding Spark, Flink, and Presto over time. It's a managed service. You create a cluster, specify how many nodes and Hadoop applications you want, wait a bit, and the cluster is pre-installed. No manual Zookeeper installation. A Hadoop cluster is ready in about 10 minutes. You can resize and stop them, and it's nicely integrated with S3. You can keep all your data in S3, load it on the cluster, run it, and save it back to S3 without worrying about HDFS.
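For reference, here's a hedged sketch of creating such a cluster with the Java SDK; the release label, instance types, IAM roles, and key pair name are placeholder values, not the demo's actual configuration.

```java
import com.amazonaws.regions.Regions;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.Application;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

public class CreateEmrCluster {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.standard()
                .withRegion(Regions.EU_WEST_1)
                .build();

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("hive-demo")
                .withReleaseLabel("emr-5.5.0")                    // placeholder EMR release
                .withApplications(new Application().withName("Hive"))
                .withServiceRole("EMR_DefaultRole")               // default roles, adjust as needed
                .withJobFlowRole("EMR_EC2_DefaultRole")
                .withInstances(new JobFlowInstancesConfig()
                        .withEc2KeyName("my-ssh-key")             // placeholder key pair for the SSH tunnel
                        .withInstanceCount(15)
                        .withMasterInstanceType("m4.xlarge")
                        .withSlaveInstanceType("m4.xlarge")
                        .withKeepJobFlowAliveWhenNoSteps(true));

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Cluster id: " + result.getJobFlowId());
    }
}
```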
A notable customer is FINRA, the largest financial market regulation agency in the US. They look at all market trades during the day, store everything, and run big Hadoop and machine learning jobs at night to find suspicious activity. They fire up Hadoop clusters with up to 10,000 nodes when the market closes, run the jobs, get the results, and shut down the clusters, saving tens of millions of dollars annually compared to a 24/7 cluster.
Let's look at Hive. I'm using the JDBC driver for Hive, connecting locally through an SSH tunnel to the EMR cluster. Authentication is done through the tunnel with the SSH key attached when creating the cluster. I describe the cluster using the Java SDK for EMR, getting information like creation time and subnets. The same data set and SQL query as before: Aurora took one second, Hive with 15 nodes took about 20 seconds. If you think you need Hadoop and Hive for speed, please reconsider. Hive was great for a while, but there are better options.
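A minimal sketch of that connection, assuming an SSH tunnel forwards local port 10000 to HiveServer2 on the EMR master node; the table and columns are the same hypothetical schema as before.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumes a tunnel such as:
        //   ssh -i my-ssh-key.pem -L 10000:localhost:10000 hadoop@<master-public-dns>
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection connection = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hadoop", "");

        Statement statement = connection.createStatement();
        ResultSet results = statement.executeQuery(
                "SELECT COUNT(*) FROM people WHERE last_name = 'Jones' AND state = 'FL'");
        if (results.next()) {
            System.out.println("Found " + results.getInt(1) + " people");
        }
        connection.close();
    }
}
```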
Next, let's look at Athena. It's a relatively new service announced at re:Invent last December. With Athena, you can run read-only queries on data stored in S3. You can have petabytes of data, log data, or application data in S3 and query it directly without loading or indexing. You don't create any infrastructure; you just create an external table to map the schema on the data in S3. The service is based on the open-source project Presto, built by Facebook. It uses Hive for table creation and data definition, and ANSI SQL for everything else. I can use the same SQL queries on Athena.
Athena supports various data formats, including unstructured, semi-structured, and structured data, specifically columnar formats like Parquet or ORC, which provide excellent performance. You can use compressed and partitioned data for faster and cheaper queries. The price of Athena depends on how much data you scan. If you use columnar formats, compression, and partitioning, you end up paying fractions of a penny.
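To show what "mapping a schema on data in S3" looks like, here's a sketch of the DDL you could run over an Athena JDBC connection (see the connection sketch a bit further down); the bucket, columns, and table layout are hypothetical, not the exact demo dataset.

```java
import java.sql.Connection;
import java.sql.Statement;

public class CreateAthenaTable {
    // Maps a Parquet dataset in S3 onto an external table. No data is loaded or moved;
    // the table is just metadata pointing at the S3 location.
    public static void createTable(Connection athenaConnection) throws Exception {
        String ddl =
                "CREATE EXTERNAL TABLE IF NOT EXISTS people (" +
                "  first_name STRING, " +
                "  last_name  STRING, " +
                "  state      STRING " +
                ") STORED AS PARQUET " +
                "LOCATION 's3://my-demo-bucket/people-parquet/'";

        try (Statement statement = athenaConnection.createStatement()) {
            statement.execute(ddl);
        }
    }
}
```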
DataXu, an ad tech company in the US, uses Athena for real-time bidding and retargeting. They funnel data into Kinesis, process it, and dump the results to S3, using Athena for ad hoc analysis, data science, and interactive queries. They generate a crazy amount of data daily, which they store in S3 and query with Athena.
Let's look at Athena. It's completely managed and has no Java SDK, but you don't need it. All you need is a JDBC driver. The connection to Athena is different, using local AWS credentials stored on my laptop. I pass these as properties to the JDBC driver. Another difference is that you cannot use prepared statements for Athena, which is a bit sad. The actual issue is in Presto, and I'm trying to push for our team to implement it.
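Here's a sketch of that connection. The driver class and property names follow the original Athena JDBC driver and may differ with newer driver versions; the staging bucket is a placeholder, and the credentials are simply read from environment variables here.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class AthenaConnect {
    public static Connection connect() throws Exception {
        // Local AWS credentials are passed as driver properties; Athena also needs an
        // S3 location where it can write query results (placeholder bucket below).
        Properties properties = new Properties();
        properties.put("user", System.getenv("AWS_ACCESS_KEY_ID"));
        properties.put("password", System.getenv("AWS_SECRET_ACCESS_KEY"));
        properties.put("s3_staging_dir", "s3://my-demo-bucket/athena-results/");

        Class.forName("com.amazonaws.athena.jdbc.AthenaDriver");
        return DriverManager.getConnection(
                "jdbc:awsathena://athena.us-east-1.amazonaws.com:443", properties);
    }
}
```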
I run the same query on my Parquet table, but it's a regular statement, not a prepared statement. The rest is identical. I created nothing; the table just points to data in S3. Let's see how it goes. One billion lines, no infrastructure at all. Athena, wake up. 226, okay, so five to six seconds. Imagine you have billions of lines of data, like load balancer logs or application logs, and you have a question. Normally, you'd go to your Hadoop cluster, ingest the logs, and wait hours to set up and get an answer. With Athena, you create a table that maps to the data in S3, ask your question, and you're done in minutes, not hours. I'm madly in love with Athena.
The last one I want to talk about is Redshift, which has been around for a while. Redshift is a relational data warehouse based on SQL and compatible with PostgreSQL. It's fully managed, so you just fire up the cluster, wait a bit, and get your connection string to the cluster. We recently extended the service with Redshift Spectrum, which gives Redshift Athena-like capabilities. Now, with Redshift, you can create external tables and query data hosted in S3, but Spectrum is a different service, not based on Presto.
The architecture of Redshift involves client applications connecting through standard drivers to the leader node of the cluster, sending SQL queries. The leader node builds the execution plan, optimizes the request, breaks it down into pieces, and sends each piece to a compute node. Each compute node runs its part of the request on its local data. When done, they send the results back to the leader node, which then sends you the result. If the table in your query is an external table, the query goes to Spectrum. Spectrum is a managed fleet of instances that query your data and send the results back to the leader node.
When Redshift came out, it was a Hive killer. Many customers with large datasets moved from Hive to Redshift, finding it 20 to 50 times faster. With Spectrum, it's even faster. Let's demonstrate. I need to get the credentials, describe the cluster, and send my query. Let's do it. Where's my cluster? Ah, yeah. Alright, let's do it again. It's a bit slow, but there it is: 226 again. It should be faster than that; maybe the describe operation is slowing us down. This isn't a proper benchmark, but Redshift can scale extremely high, and if you already use Redshift, you can simply extend it with Spectrum. Okay, that's more like it. That run was faster than Athena, which is what I wanted to show you.
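For completeness, here's a sketch of the Redshift side, combining the DescribeClusters call with a standard JDBC connection to the leader node; the cluster identifier, endpoint, and database name are placeholders, and the driver class depends on which Redshift JDBC driver version you install.

```java
import java.sql.Connection;
import java.sql.DriverManager;

import com.amazonaws.regions.Regions;
import com.amazonaws.services.redshift.AmazonRedshift;
import com.amazonaws.services.redshift.AmazonRedshiftClientBuilder;
import com.amazonaws.services.redshift.model.Cluster;
import com.amazonaws.services.redshift.model.DescribeClustersRequest;

public class RedshiftConnect {
    public static Connection connect(String user, String password) throws Exception {
        // Describe the cluster we are about to query ("redshift-demo" is a placeholder identifier).
        AmazonRedshift redshift = AmazonRedshiftClientBuilder.standard()
                .withRegion(Regions.US_EAST_1)
                .build();
        for (Cluster cluster : redshift.describeClusters(
                new DescribeClustersRequest().withClusterIdentifier("redshift-demo")).getClusters()) {
            System.out.println(cluster.getClusterIdentifier() + " : " + cluster.getNodeType()
                    + " x " + cluster.getNumberOfNodes());
        }

        // Standard JDBC connection to the leader node (Redshift speaks the PostgreSQL protocol,
        // so the stock PostgreSQL driver also works).
        Class.forName("com.amazon.redshift.jdbc42.Driver");
        return DriverManager.getConnection(
                "jdbc:redshift://redshift-demo.xxxxxxxxxxxx.us-east-1.redshift.amazonaws.com:5439/demo",
                user, password);
    }
}
```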
To wrap up, here's a quick comparison of the services we looked at. We examined two databases and three services for analytics. Each has its strengths and weaknesses, and you need to choose the one that's best for your application. We provide multiple choices so you can select the one you like best. We also have tools to help you migrate your databases to AWS: the Schema Conversion Tool, which automates schema conversion, and the Database Migration Service, which supports a wide range of source and destination databases, including MongoDB and SAP, and moves your data from source to destination for you. You have many choices, and equivalent SDKs are available for other languages. We take away the infrastructure pain, help you focus on building your app, and bring high availability, scalability, security, and compliance. You can focus on creativity and productivity, working on your business while we handle the plumbing and low-level tasks. Thank you very much. If you have any questions, I'll be happy to answer. Thank you.
Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.