Picking the right AWS backend for your Java app Julien Simon Amazon
September 04, 2017
When it comes to data processing, software applications have many different requirements, which cannot be satisfied by a single backend. This is why AWS has built a number of data services, such as DynamoDB, RDS, EMR, Redshift and Athena. In this technical session, you will learn how their respective strengths and how you can best use them in your Java apps. will explore them all, compare them and demonstrate how you can use them in your Java apps. Code and live demos will illustrate the discussion.
Transcript
Good afternoon. It's been a long day, so do you have any energy left? My name is Julien. I'm a tech evangelist with AWS. Sometimes I work in the Paris office, but most of the time I'm traveling, and it's great. It's my first time in Ukraine, and I'm very impressed, except for the weather today. But it's okay because we're going to talk about clouds, so I don't mind it. I'm very happy to be here. I hope this will be useful for you. If you have questions after the session, I'll be around, so feel free to come up and ask all your questions. I love to answer questions.
Today, we're going to talk about AWS backends. This seems to be a Java conference, right? We're going to use Java to look at these backends, run some code, and do some demos. They might not work, and that's okay. Who am I? I've been a software engineer for a long time and ended up being a CTO and VP of engineering for several startups in Paris. Then I decided it was time to see more of the world, so for the last 18 months, I've been working with AWS and traveling all over Europe, especially this month, which is crazy.
What to expect today? I'll talk for a few minutes about writing Java apps on AWS. Who is using AWS today? They told me to remove all the beginner stuff because you don't care. Okay, but still, a few of you might not be comfortable with it, so I'll go through it quickly. We'll start with databases, talk about RDS and DynamoDB, then move on to analytics with Hive, Athena, and Redshift, and some of the newer stuff we've released. Hopefully, we'll have a conclusion for you. All the code I'm going to use today is on GitHub. Feel free to get it and make fun of me because it's really shitty Java code. Hopefully, it's not too bad, and if you find a bug, you can send me a pull request. I really love those.
Let's start with Java apps. Most of you know about AWS, right? We have four deployment options when it comes to deploying code on AWS. The first and oldest is deploying with virtual machines on Amazon EC2. Just fire up a server, SSH to it, and it's a server. You can do whatever you want with it. The second is using Docker containers. We have a service called Amazon ECS, which allows you to manage Docker clusters and schedule containers on those clusters. Anyone using ECS? Lambda, which is the so-called serverless architecture, where you don't care about infrastructure anymore. You just deploy code functions and trigger them using APIs or any kind of infrastructure event in your AWS account. You can use Java 8, and there are quite a few open-source frameworks that make it easy to write, deploy, and manage Lambda functions. Anyone using one of those? Serverless or Gordon? A few people, okay. I like them very much. The last one is our platform as a service option, called Elastic Beanstalk. We have Beanstalk users as well. It's a really good option for developers, probably the simplest way to deploy code on AWS. You can use very old Java or pretty new Java on different application servers. We have a Java SDK that allows you to invoke all the Java APIs, going back to 1.6, so you should be able to use it with all your Java versions.
When it comes to tools, we develop a management plugin for Eclipse, which I'll show you today. It allows you to have a quick start for your projects, run your Lambda functions, and manage a lot of AWS resources directly from Eclipse. But they told me, none of these guys are using Eclipse. Who's using Eclipse? You're my brothers. Thank you. Everyone else is using IntelliJ IDEA, right? Oh, you make me feel really old. But at least I have the excuse that this is only available for Eclipse. They asked me to look at IntelliJ stuff, so I did, and we have a couple of third-party plugins that seem up to date. I quickly tried them, and they were fine. There's an AWS manager plugin similar to what we maintain for Eclipse, but it's very old, doesn't work with the latest IntelliJ version, and it looks like the maintainer fell off a cliff. If anyone can get those sources and fix them, that's better.
The last thing I want to say about writing apps is how to manage credentials. Please never hard-code credentials in your application. The API key and secret key should never end up on GitHub. Don't put them on any EC2 instance either because they will end up in the credentials config file on the instance, and this will eventually end up on GitHub. You'll also have the problem of distributing credentials to auto-scaling EC2 instances, which can open major security holes. If your keys end up on GitHub, we're monitoring for that, and support will contact you with a friendly email saying your AWS account has been suspended because we found your security keys on GitHub. Please don't do that again, and we'll be happy to resume your account. It's okay because at least if the account is suspended, there's no risk. But please be really careful with this. If you need to distribute AWS credentials, like the API key and secret key, use IAM roles. If you're not familiar with roles, please look them up. If you need to distribute backend credentials, like MySQL login and password, we have a new service called the EC2 Systems Manager with a module called the Parameter Store. Anyone using that? This is the best way to do it. It's a simple way to store credentials, encrypted with KMS, and you can retrieve them in one API call, so you never have to have any secrets in your code. I'm using this in my code, so I'll show you.
Here's our reference architecture for backends. Data comes in many shapes and sizes, throughputs, and granularities, so we need different backends to handle different tasks. We have a first line for storing data, like Elasticsearch, Cloud Search, DynamoDB, RDS, and S3, to adapt to different shapes. Then we have a second line with analytics services like Lambda, MapReduce, Redshift, etc. Today, I'm going to talk about the relational database RDS and the NoSQL database DynamoDB. I'll also cover Hive, Redshift, and Athena for analytics. The other services are interesting too, but there's only so much you can fit in 45 minutes.
Let's start with databases. The first one, and I'm pretty sure everyone has played with this before, is called RDS. Who's using RDS? A lot of people, right? Pretty much every application needs a relational database. The cool thing about RDS is that it's a managed service. In just a few clicks or one API call, we provide you with a database instance or cluster pre-installed with your favorite engine. You can scale it up and down, create replicas, etc. It's very easy to do, and you don't have to be a DBA. Developers can manage it. Since you can scale up and down, you can adapt to traffic and reduce costs. We support six engines: three open-source engines—MySQL, MariaDB, and Postgres—and commercial engines like Oracle and SQL Server. There's also Aurora. Anyone using Aurora already? Aurora is a high-performance re-implementation of MySQL and Postgres. The AWS team took MySQL and Postgres and increased performance by five to ten times, made high availability better, and scaled storage much higher. You can scale up to 64 terabytes of storage, which should be enough for everyone. I'm going to use Aurora today. Airbnb has been using RDS from day one and has always managed their databases with RDS. Lamborghini is using RDS as well, probably on a smaller scale.
Let's take a look at Aurora. Aurora is based on MySQL and Postgres, but today I'm using the MySQL version. How do you connect to these databases? The good news is you've seen this a million times before. This is vanilla JDBC code, nothing specific to AWS. The only difference is that in some cases, I'm using AWS credentials, so I don't need a login and password. For some other services, I do need a login and password, like MySQL, and I need to provide it. How will I grab those? I'm using the system manager parameter store, where I entered my credentials and can just grab them by building a request and asking for the parameter to be decrypted. This is how I'll get my login and password for MySQL and Redshift.
Let's look at Aurora. You can use the normal MySQL driver to connect to Aurora. I created a cluster before the talk. This is the endpoint. I'm getting my credentials from the parameter store, connecting to the Aurora database, and calling a Java API to describe the database and run a query. I have a table with 1 billion lines of fake sales data. I'm looking for users called Jones who live in Florida. I'm using JDBC prepared statements, which is vanilla Java code. Let's run it. I got my credentials from the parameter store, described my cluster, and found 226 people called Jones in Florida. I did a simple select on one table with one billion lines, and it took about a second because I have the right indexes. My message is don't rush to NoSQL. If you know the queries and can optimize with the right indexes, you can make databases faster than anything else, especially with Aurora.
Now, let's move on to DynamoDB. DynamoDB is a NoSQL database built by Amazon in 2004 and became an AWS service in 2007. It's a fully managed service, even more managed than RDS. You don't have to create clusters or instances; you just create a table, call an API, and get your items. It's a high-level service that developers love because the infrastructure is completely hidden. You can scale it up and down, and it's pretty cool. For example, Expedia uses DynamoDB for real-time processing of events from their websites and mobile apps, handling about 200 million messages a day. Since the service is managed, you can be up and running in a couple of days, even if you're learning DynamoDB from scratch. There's no fuss—create tables, put and get, and you're on your way. That's why developers love this system.
One thing to mention is that I'm going to use the real DynamoDB in the cloud, but there is a local version you can download to work locally on your machine. This is great for local testing and development without cloud latency or cost. I'm connecting to DynamoDB in the US, creating a table, and adding attributes. It's a movie database with a title, rating, and release date. I can create a secondary index on ratings and create the table with the sharding key as the title. I'm adding an indication of how much data I'll read and write, which drives the cost. I need to wait a few seconds for the table to be created, and then I'll add some movies, print one, add a character to "Return of the Jedi," print all movies with a one-star rating, print all movies with a five-star rating between 75 and 82, print all Star Wars movies, print all movies with Yoda, and delete bad movies. These are examples of the API: put item, get item, update item, query with an index, query with the hash key, and scan the full table if you don't have indexes or a hash key.
Let's look at a simple one: create movie. If you need to get an item from the table, you call `getItem` and provide the hash key to get your item back. If I want to find all movies with a given character, I need to scan the table because I don't have an index for that. I build a scan request, check if the characters contain the parameter, and print the items. You can do put, get, update, delete, query on indexes or the hash key, and scan if everything else fails. Let's do this. Here's the AWS plugin for Eclipse, and I'm connected to the US. Let's start by creating the table. This takes maybe 10 seconds because it needs to provision a little bit of infrastructure. Then we'll roll out the API calls. I should see my table showing up here. My items are here, and I did what I had to do. I created the table, described it, added movies, got information for "Return of the Jedi," added Java, found all one-star movies, found five-star movies for a time range, found all Star Wars movies, found movies with Yoda, and deleted bad movies.
DynamoDB is not SQL. You have to understand how tables work, the API, and the queries you can do. You can only query on indexes and hash keys, but not on other columns. If you need to do something on other columns, you need to scan. Here's a summary of some backends. The major difference between Dynamo and RDS is not so much about scale and performance but about whether you have structured data or a schema. Let's move on to analytics because it's nice to have all that data, but now you want to crunch it. You could crunch it with Aurora, but Aurora is not a data warehouse and won't scale for long SQL queries. The first one we'll look at is Hive on EMR. EMR is the managed service for the Hadoop ecosystem. After my talk, you might move on to something else. EMR includes Hadoop, Spark, Hive, Presto, Flink, and HBase. Like RDS, you create your cluster with one API call or a few clicks, and everything is ready. You don't have to install Hadoop, and you can resize and terminate the cluster. It's very flexible, so you're never afraid to start or kill a cluster to save money. A large customer, a market regulator in the US called FINRA, uses EMR to process billions of stock and option trades daily. They run these transactions through machine learning systems at night using up to 10,000 Hadoop nodes. They don't need this infrastructure during the day, so they scale super high at night and kill everything during the day to save money. Your cluster should be short-lived: start it, load the data, crunch it, get the results, and kill the cluster to save money.
Let's look at Hive. Connecting to Hive is similar to what you know. You use the JDBC driver, but you don't want to expose your EMR cluster on the public internet. It runs in a private subnet in your account, and you authenticate by opening an SSH tunnel to it. I'm not connecting with a login or password; I'm SSHing into the master node and sending my code. I'll start it because Hive is a bit slow. Here, I'm describing my cluster and running the same SQL query. Hive is not strictly 100% standard SQL, but it's pretty close. Can I see this running? Aurora was one second. Oh, I see a tiny thing showing up here. Very nice graphs, by the way. It's a decent-sized cluster, using 10 C4 2XL instances, 160 cores total, and 300 gigs of memory. It's not a ridiculous cluster. Oh, it's over now. It took about a minute, and I found those 226 Joneses in Florida. Hadoop promises your job will complete but doesn't promise how long it will take. It worked, and we can see crazy network traffic and nodes working. Hive has been around for a long time, and there are better options. One of those is Athena.
Athena is a new service that allows you to run read-only queries, selects, on data stored in S3. You can have petabytes of data in multiple formats and query it directly without loading or indexing. There's no infrastructure to manage or scale. It's based on Presto, an open-source project available on EMR, but we made it even simpler. You create a table that maps to the layout of the data in S3, and then you can run standard SQL on that data. If you use columnar formats like Parquet or ORC, you'll get very good query times. It's highly recommended to transform your data from CSV or TSV to columnar formats, and you can do compression and partitioning to increase performance. We'll look at this in a minute. The service is quite new, but we already have a lot of people using it, like DataXoo, an ad tech company in the US. They process 180 terabytes of data daily, log everything, and use Athena for BI and analysis.
Let's look at Athena. You connect with JDBC using Athena. I'm using local credentials on my laptop to authenticate to Athena. I'm building a client, calling an Athena API to get the last three queries, and printing them. Then I connect and run the same statement. The only thing is, why the extra quotes? Prepared statements are not supported in Presto, on which Athena is based, so I have to use a normal statement. Listing past queries is just one API call away. Here are my last three queries, showing execution time and data scanned. I'm running my select on Athena, and it takes about six or seven seconds. Seven seconds to go through a billion lines of data stored in S3, with zero loading, management, or scaling time. I'm madly in love with this service. If you use Hive today and just for read-only stuff, please look at Athena. You'll save hours. It's a very good service for exploration, debugging, and ad-hoc analysis. You'll be up and running in minutes with no infrastructure to manage.
The last one I want to talk about today is Redshift. Anyone using Redshift? Redshift is our data warehouse based on SQL, evolved from Postgres. If you know SQL, you know how to use Redshift. It's completely managed like RDS, and you get your cluster after a few minutes with one click or API call. It's massively parallel, scaling up to over 100 nodes working in parallel on the same query, making it blazingly fast. You can scale it to multi-petabyte scale if needed. It's simple to use but very scalable and inexpensive for a data warehouse. You can get about $1,000 per terabyte per month, which is a very competitive price. There's an extension called Redshift Spectrum, which brings Athena-like capabilities to Redshift. Now Redshift can query data stored locally on the nodes and also hosted in S3. You get the best of both worlds. You send a query to the leader node, which builds the execution plan and splits the query across compute nodes. Each node runs part of the query on local data and sends results back to the leader node. You can load data from S3 and Dynamo, and if the data is in S3, you can go through Spectrum with a fleet of managed nodes to work on that data. You get fast, local, super-optimized queries for reporting and analytics, and the capability to query petabytes of data hosted in S3 without hosting it on the cluster itself, saving tons of money.
Customers say Redshift is super fast. Many people left Hive for Redshift, reporting 20 to 50 times faster performance. Now with Spectrum, it's even better. If you need data warehousing for BI and reporting and the capability to query tons of data at a low price, Redshift and Spectrum is a great combination. If you don't need a data warehouse, you can use Athena for querying data. You have lots of options.
Let's look at Redshift. Redshift is JDBC, and you need a login and password. I'm grabbing those from the parameter store, connecting to my cluster, describing the cluster, and looking for the Joneses in Florida. It's almost the exact same code as Aurora. Let's do that. Credentials, description of the cluster, query. Boom. The data here is in S3, the same dataset I used for Athena. It took Athena seven seconds. How much did it take for Redshift Spectrum? One second, 1.5 seconds. Those billion lines are not on the Redshift cluster; they're in S3. Spectrum is the Formula 1 of AWS backends. I'm madly in love with Spectrum too. Can you love two backends at the same time? I don't know. I have to choose.
Let's sum things up. Hive was great for a while, but now I think Athena is a better option for most use cases. Redshift was already very good for BI and data warehousing, and now with Spectrum, it's even better. I'll post the slides on Twitter tonight, so you'll get everything. If you love Spark, run Spark on EMR. Here's the reference architecture again. The schema conversion tool is an app you download, connect to your database, and select the tables and objects you want to migrate. It will attempt to convert the schema and objects automatically, highlighting anything it cannot convert and providing pointers for manual adjustments. Once ready, you can use the database migration service to replicate and handle schema changes from one engine to another. This service is popular and easy to use, with tens of thousands of databases migrated this way, including relational databases and data warehouses.
In conclusion, you have a wide choice of backends. You can start with familiar options like MySQL or Postgres, saving yourself the trouble of managing infrastructure by using RDS. You can use NoSQL and DynamoDB for highly scalable databases. You can also handle big data and analytics for almost all use cases. These tools provide everything you need to build your backend stack, minimizing infrastructure management. We focus on providing high availability, scalability, security, and compliance out of the box, so you can focus on designing your tables and product, not the plumbing. Our job is to handle the infrastructure, so you can focus on creativity and building great apps.
Thank you very much. Thanks to the organizers for having me today. It's a great event, and I'm enjoying my stay in Ukraine. I might be back this year, but it's too early to tell. If you have questions, I'll be happy to answer them. Enjoy the rest of the day.
Julien Simon is the Chief Evangelist at Arcee AI
, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.
With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.
Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.
Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.