A real life account of moving 100% to a public cloud

December 03, 2015

Velocity Conf, Amsterdam, 30/10/2015 Julien Simon, CTO, Viadeo Antoine Guy, Principal Infrastructure Architect, Viadeo http://conferences.oreilly.com/velocity/devops-web-performance-eu-2015/public/schedule/detail/44042

Transcript

Can you hear me okay? Yeah? All right. Okay, let's get started. So first of all, I want to thank you for staying late and attending this last session. I know you guys are all waiting to go out to Amsterdam to party. You can do that in about an hour. Until then, we're going to talk to you about our work. So before we start, just a quick word about us. I'm Julien. I've been a VP and CTO in web startups for the last 10 years. I have to be fully honest. I left Viadeo two weeks ago, and I am now working for AWS. But fear not, this is definitely not a sales pitch. This is about the work we've done at Viadeo and nothing else. Antoine? So my name is Antoine Guy, and I've worked in web infrastructure and operations for more than 14 years, and I'm still working at Viadeo. So I have many things to say on AWS that won't be a sales pitch. Alright. That's good. At least one of us is going to be honest. All right. You can beat me up at the end if you want. Okay, so what is this about? We're going to start way back and discuss the rationale. Why did we even end up picking AWS for our migration? We'll look at some of the design options and trade-offs that we had to consider. We'll also discuss the parts of the technology stack that were altered to go to AWS or not. Antoine will give you a lot of great technical details on the complete automation of our infrastructure build using a technology called CloudFormation and continuous integration. What's the starting point? The starting point is our technology stack. We were a large website, and everything relies on a service platform called Casper. It's our internal code name for it. It's written with Java and Spring, and it's pretty modern. It uses some nice architecture patterns, such as CQRS, etc. So, it's a pretty nice service platform. On top of that, we're running a web application called Limbo, written in Node.js with all those fancy JavaScript packages. Mobile is super important to us. We're running mobile apps on iOS and Android for all the different devices. And, of course, we love backends. We have many of them for legacy reasons and some different use cases as well. So MySQL, Elasticsearch for member indexes, HBase, Spark, Hadoop, and probably a few more I didn't even find out about. Infrastructure is the funny part. We have about 250 physical servers, that's 15 racks or so, and they're hosted in San Francisco, California. We're a French company, and we do most of our business in France, with 10 million users there. You don't want to know why. But we have a nice data center downtown San Francisco, and as you will see, that was one of the reasons why we decided to do something else. As soon as you power up servers, you start to have problems, and so did we. Let's look at some of the challenges we had. The first one is obvious, and it's a bit sad, but probably some of you have this problem too. Our servers are getting old. Most of them are not supported anymore. They start to fail, get slow, and you get frustrated about performance and service quality. Definitely something we want to address. The second one has been talked to death in these last three days. It's about agility. The Viadeo tech team has adopted pretty modern agile practices over the last two to three years, such as feature teams, etc. All the latest stuff has been in place for a while and is running fine from design to coding, etc. Unfortunately, we can't say the same thing about infrastructure. We have very strong agility for dev and a lot of rigidity for infrastructure. This is a typical situation in many companies, and something we want to fix. Operating a physical network is not key to our business, and we definitely don't want to do it. Sure, we can run Cisco stuff and firewalls, etc., but we don't want to. Operating a remote data center is pretty bad. We're thousands of kilometers away from our servers, which creates a lot of issues. We want to get out of that. My favorite, and everyone's favorite, is disaster recovery, especially when your data center is likely to fall into the ocean at any given minute. We do have a disaster recovery plan, but it's not super strong. If the big one hits San Francisco, it will probably take us one or two days to be back in a different location. That's way too long for a web company, and my boss agreed. Faced with this, we sat down and tried to come up with options. We had four options. On the left, we could upgrade the servers. We could buy more servers and replace the old ones. Everyone knows how to do that, but it would only solve the hardware issue, not the other issues. It could cost a lot of money because you have to pay upfront for that physical infrastructure. Our gut feeling was, we've been doing this for 10 to 15 years. It's time to do something else. So, no, we don't want to do this. It's not solving much. Another option was full virtualization and consolidation of those 15 racks into maybe two or three racks. Yes, it solves the hardware issue and improves agility, but it doesn't solve anything else, still costs a lot of money, and feels outdated. Number three was building a new data center with modern technology in a very safe place. It solves the hardware issue and improves disaster recovery, but costs even more money, takes a lot of time, and is a huge waste of time and money because we don't need to do that. The last option was to go to a public cloud. Our feeling was that it would solve or improve a lot of our issues, but it looked like a massive project. We didn't have a lot of cloud experts in the team, so we knew it would hurt. And it did hurt quite a lot. But let's be honest, we're all engineers, and this was the solution that put a smile on everyone's face, even though we knew it would hurt. So this is the one we picked. Why, besides being crazy engineers? The number one thing is that in a company like ours, you want to focus on building a great service. This is your core business. Dealing with hardware issues, licensing issues, and supplier issues drives you nuts and is a huge waste of time. We didn't want to do this ever. As a web company, we want to experiment, A/B test, and try out a lot of ideas without any constraints and without waiting for weeks for servers to show up. A few clicks away, you can build your clusters, start your instances, and try things out. If it doesn't work, just terminate everything, and you've wasted $5 or $17. As I mentioned, our tech culture of agility and DevOps is very strong, and there was no resistance to change. Everyone was like, "Let's do it. We can't wait to start." Unlike physical infrastructure, cloud services allow you to measure very precisely what part of the infrastructure is costing what. You can see what instances, storage, etc., are assigned to a project and know exactly how much it costs. This helps you determine if the ROI is okay or if it's a waste. We had already been using public clouds for a while for services like backups, archives, and cold storage. So, while it didn't make us experts, we knew it was agile and easy, and we had some confidence it would work. The next logical step was, which cloud are we going to use? We picked AWS. We were already using some services, so it made sense to continue. But the most important reason was accessing cool PaaS technology. We tried a few services and saw how fast we could deliver clusters in different parts of our infrastructure. Being multi-region was nice. We have infrastructure in the US, and we could decide to leave some stuff there or move it to Europe or other parts of the world. AWS is moving super fast with new features and improvements, making it a very lively platform that looked appealing to us. We started the project in January, and about five months later, Gartner released a report on cloud infrastructure. I usually don't show this stuff because Gartner is about marketing, but in the top right corner, that's AWS. We were like, "Okay, at least other people think this is the way to go." Take a look at that report; it's interesting to see the strengths and weaknesses of each platform. So, we were all excited and thought it would be important to have some rules and guidelines to keep us on track. The key objectives were automation, scalability, and safety. Automation is obvious; it's not a big team, and we need to be as efficient as possible. Scalability because we want to use exactly the resources we need. Safety because it's automated and could go wrong quickly. We wanted continuous integration and delivery on everything, including our infrastructure, which is a new concept. You can actually test and deploy your infrastructure. We would only replace parts of the stack if the benefits were too good to pass. We're not going crazy with replacing everything. We're going all in, so at the end, we should be able to close the data center. But we're not doing this over a weekend, and no big bang kind of thing. It's a gradual move, and we're ready to roll back if things go really wrong. This means we have to plan for a temporary hybrid run with both our data center and our growing AWS deployment. These were the rules that we had in mind to keep us on the road. Now I will pass the mic to Antoine, who will give you more technical details about how we did it. Thank you. Okay, thank you, Julien. Now that Julien gave us some context about the situation, I'm going to explain the way we followed to conduct this transformation. First, we did a sorted state of our infrastructure. We made sure we listed every server, every service, every network equipment, every VPN connection, and everything. The goal was to estimate what it would cost to move everything as is to the cloud and to know each service's cost and feasibility. Knowing the cost of each service helps evaluate if it makes sense to replace it with a SaaS solution. Sometimes, you find pain points, like technical debt or legacy apps that nobody wants to touch. We discovered things we had to deal with. Then, you have to define a plan, an island plan, because you want to know where to start. Most companies that move to the cloud follow a similar path. They usually start with the staging environment because there's no risk and it's easy to gain experience. Then analytics, because there's no risk and there are nice cloud features for analytics. Next, legacy applications, critical applications, and if everything goes well, you close your data center. We started like this, but there were unexpected diversions. For example, emergencies where you need capacity and can't expand your data center, so you go directly to the cloud. Sometimes, new projects come up, and you don't want to start them in the legacy infrastructure, so you go directly to the cloud. Or sometimes, early adopters want to move, and you help them. So, that's what we started to do. Keeping in mind the key objectives Julien mentioned, one logical thing to do and a great advantage of moving to the cloud is to use infrastructure as code. With infrastructure as code, your infrastructure is just code, so you can track changes, review changes, test it, and deploy it automatically. Developers have been using and experimenting with these features for many years, so it makes sense to move in the same direction. One big advantage is that if you do it correctly, your infrastructure is just code, and if you can deploy it in one region, you can deploy it in another region in a matter of hours or even minutes. If your RTO is not strict, you can even think of the cloud as its own disaster recovery. We went with AWS, and in AWS, the infrastructure as code product is called CloudFormation. CloudFormation is a JSON-based descriptive language that allows you to create every kind of resource you can find in AWS, from instances and databases to storage and security rules. It also takes care of relationships, dependencies, and the order of creation. The nice thing is that with CloudFormation, you can test the validity of your work before deploying it. We considered other solutions when we started, like Terraform from HashiCorp or Ansible. But the advantage of CloudFormation is that you stay close to the provider, with no overhead, missing APIs, or issues you have to deal with from a third party. After a few weeks of playing, writing JSON, and experimenting, we ended up with something that looks like this. It's not very fancy; it's just a VPC, which is a virtual private cloud, basically your own network in AWS with different subnets, private and public subnets. It's scalable, highly available, and can be deployed in a few minutes. I'll explain how we automated all of this. First, you write some templates and code files, and you need to test them before deploying them. AWS provides a CLI for CloudFormation, which allows you to validate your templates in detail. You can write test cases with the language of your choice, and AWS provides many SDKs, including the Go SDK. At Arcee, we used Ruby and Rake to test the files and push them to production once they were tested. You put everything in a Git repository and link it with any CI tool. We use SQL CI and Jenkins. Every change in Git triggers the CI tool, which validates the file, runs the tests, and, if you want, deploys the change to your staging or test environment. So, you have continuously deployed infrastructure. It's not really difficult. For instances in the cloud, which run on machine images, you can automate the baking of machine images using configuration management tools like Puppet, Chef, or others. Tools like Packer from HashiCorp and Aminator from Netflix can help with this. You can put this in a CI tool, so for every change in your configuration management, you create a new image. If you use the role pattern in your configuration management, you can create an image for each role and deploy it in each environment, which is very convenient. You might ask yourself whether to completely bake your images with every change, every new version, or every release, or to just do a base image and let the configuration management do the rest. There's no one-size-fits-all answer. For autoscaling services, a completely baked image is the way to go because you want instances to start and be ready as quickly as possible. For less critical services that don't change often, you can just make a half-baked image or a base image and let the configuration management do the rest. We took a mixed approach and built a system with two Git repositories: one for configuration management and one for infrastructure. From the configuration management repository, we build the machine image, and from the infrastructure repository, we build CloudFormation templates that we push to S3. The only thing left to do is trigger the deployment of the infrastructure code using the machine image, and your service is up and running. If you do this for all your services, your data center is up and running and ready. CloudFormation can be intimidating if you start from scratch because you don't know the language or AWS. AWS provides tools called CloudFormation Designer, which can decode your existing resources in AWS and output CloudFormation files with all the JSON of the resources. The files it creates are not really usable, but they are a good starting point if you want to learn or write the JSON for your infrastructure. When creating CloudFormation templates, you want to make sure they are easy to reuse. There is a concept in CloudFormation called nested stacks, which is calling a stack from within a stack. You can do this with any depth, but it can be dangerous if a stack gets stuck in an undefined state. We recommend not doing nested stacks of more than one level. Another nice thing with CloudFormation is that you can use it for green-blue deployment. You can deploy the exact same thing with a little change, move traffic from the green stack to the blue stack, and if it doesn't work, just roll back. If it works, stay like that, destroy the old stack, and you're done. You can also tag everything consistently within CloudFormation, which is very important. After a few months, we learned many things we wanted to share. Network in AWS is very convenient and virtual, but it's still network, and every resource depends on it, so you have to plan it early and carefully. If you're going to run a hybrid environment, you will need good connectivity with your data center. Load balancers in AWS, called ELBs, are level three and have fewer options than hardware load balancers, so you will have to adapt. CloudFront, AWS's CDN, is a good choice, but performance can vary, so test it yourself. It's also a good idea to have a second CDN for redundancy. EMR, Elastic MapReduce, is Hadoop on demand, but the data is not always available, so you will have to adapt your workflow. MySQL users might find RDS appealing, but it's not suitable for all use cases, especially if you have large tables. AWS has limits on the number of resources you can create, so you need to request limit increases in advance, especially for disaster recovery. If you're using autoscaling, you need to use a service discovery tool like Consul or etcd because instances can be created and destroyed dynamically. Finally, make sure everything is multi-AZ for high availability. So, tech is half the work. It's a big project that impacts the whole company. You need to talk to the CEO, CFO, and other stakeholders to get buy-in and understand their objectives. Maybe the CFO wants to cut infrastructure costs by 50%, which could be a problem for your project. You need to involve legal and finance early on because their time scale is different from yours. For example, budgeting and canceling legacy contracts can be challenging. Hosting companies are not thrilled by this kind of project, so you need a good lawyer. When you start your project, build an A-team with dev, ops, architects, and security to address all potential issues. As you go, raise awareness in the tech team and transfer knowledge. You don't want to surprise your developers with a completely new environment. Current status: Staging is fully done and running on AWS, and we're already live, serving production traffic from AWS. We're running in three regions with interconnected VPCs and over 100 instances for production and testing. We're big fans of Redshift, the data warehouse solution, and have two clusters for that. Next steps: Move the backends, find an efficient way to replace our live Hadoop cluster with EMR, clean up old MySQL stuff to get it ready for RDS, and optimize Elasticsearch, which will remain on EC2. We'll also optimize technically and financially, using reserved instances, spot instances, and auto-scaling. Conclusion: Five years ago, if you had asked us if we were keen on using cloud computing, we would have said no. Now, if you ask us the same question, and when we ask others who say no, we ask why. Some reasons are good, but most are wrong or outdated. There's a lot of catching up to do on what cloud computing can do for you. We still get accused of following trends, but to us, the cloud is infrastructure in digital form, just like photos and movies. If cloud computing is a fashion, maybe MP3 is a fashion as well. We're making the most of it. We jumped into it, and it creates a lot of great engineering challenges. We're not faced with physical infrastructure anymore, but there's still a lot of cool and interesting work to do for us engineers. This concludes our talk. Thank you all for staying up late. You can start drinking now. If you have any questions, we'd be happy to answer them. Thank you very much. Have a great weekend.

Tags

CloudMigrationAWSInfrastructureAsCodeDevOpsCloudFormation

← Back to 2015 Videos ← Back to YouTube Overview