SageMaker + VS Code + GitHub Copilot: the dream IDE for ML

November 01, 2022
We have all our definition of the perfect development environment for ML. For me, it's the combination of cloud-based managed infrastructure (all the way up to large multi-GPU instances), a great software IDE (not Jupyter!), and amazing productivity tools. In this video, I show you how to make it happen by installing VS Code and Github Copilot on Amazon SageMaker. Yes, it's possible :) Enjoy! ⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️ - AWS blog: https://aws.amazon.com/blogs/machine-learning/host-code-server-on-amazon-sagemaker/ - Github copilot instructions: https://docs.github.com/en/copilot/getting-started-with-github-copilot/getting-started-with-github-copilot-in-visual-studio-code - Setup code for lifecycle configuration: https://gist.github.com/juliensimon/4eccabf58fa2d97a294d181a525b0127

Transcript

Hi everybody, this is Julien from Arcee. I guess we all have our own idea of what a perfect development environment would be for Jupyter notebooks, and there are plenty of options. For me, the dream scenario is a combination of managed infrastructure from small instances to really big multi-GPU instances, the ability to use a modern IDE with code completion, and, as of a few days ago, I can't live without GitHub Copilot any longer. So, I'm going to show you how to use all these things, combining a SageMaker notebook instance with VS Code instead of the built-in Jupyter environment, and we'll throw in GitHub Copilot for good measure. It's not a complicated setup; we can automate a lot of it, so let's take a look. As you probably know, SageMaker comes with two types of Jupyter environments: notebook instances and SageMaker Studio. Notebook instances are pretty much what the name suggests: a fully managed EC2 instance, pre-installed with a Jupyter environment and conda kernels. When you open it, you jump straight into a vanilla Jupyter environment, which should look very familiar. SageMaker Studio is a more ambitious, full-fledged IDE still based on JupyterLab, with lots of integrations for additional SageMaker features like SageMaker Pipelines. If you're deep into the SageMaker SDK and use SageMaker Pipelines, Data Wrangler, Debugger, and all those things, then SageMaker Studio makes more sense. If you're looking for a simpler, non-SageMaker-specific dev environment, just a GPU instance with Jupyter running on it, then a notebook instance is what you should use. I'll go with notebook instances; they make more sense to me, and they tend to be more reliable. The first step is to create a notebook instance, which is very easy. Let me quickly walk you through the steps. From the SageMaker console, go to Notebook instances, click Create, give it a name, and pick an instance type—you can go from really small to really big.
For now, it doesn't matter because we're not going to create it immediately. We need to take care of an automation step first. You need to create some permissions. If you have an IAM role you want to use, feel free to use that. If you don't have one, you can create one right away, which is very simple. Then, that's about it. We could go and click "Create," wait for a few minutes, and then we would have this instance ready to open and work with. But we need to install some additional stuff. For example, I want to install Git LFS because for HuggingFace repositories, we need Git LFS support. Models are big files, and that's how the HuggingFace hub manages them. I don't want to do this manually every time I open the instance, so I want to automate it. We also want to install a code server to get VS Code support, and we don't want to do this manually either. We want a simple way to automate this every time we create a new instance. To do this, we can use a feature called lifecycle configurations. Lifecycle configurations let you create two scripts: one that runs once you create the instance and another that runs every time the instance is started. This is useful because you can stop and start instances to avoid unnecessary charges. We have a creation script and a start script, which are good places to inject our own scripts and commands to automate everything we want. So, we're going to create a lifecycle configuration and then create the instance. Let's create the lifecycle configuration. We start from here, select "Notebook instance," click on "Create," and let's call it "Code Server Demo." Now, we can paste our two scripts in the start and create sections. Let me show you the script before we do that. The create script is this one. First, we install Git LFS, then clone my repository for demos, set up the credential helper so I don't have to re-authenticate to Git every time I push, and install the VS Code server stuff and run the installation scripts. 
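The on-create steps described above can be sketched as a lifecycle script like the one below. This is a minimal sketch, not the exact gist linked above: the repository URL and the code-server setup script location are placeholders you would replace with your own.

```shell
#!/bin/bash
# on-create lifecycle script (sketch). Runs once, as root, when the
# notebook instance is first created.
set -eux

# Install Git LFS so large model files in Hugging Face repos clone properly.
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash
sudo yum install -y git-lfs

# Do the user-level setup as ec2-user, on the persistent SageMaker volume.
sudo -u ec2-user -i <<'EOF'
cd /home/ec2-user/SageMaker

# Clone a working repository (placeholder URL).
git clone https://github.com/your-user/your-demos.git

# Cache Git credentials so pushes don't re-prompt for authentication.
git config --global credential.helper 'cache --timeout=86400'

# Fetch and run the code-server installation script from the AWS blog post
# (placeholder path; see the lifecycle-configuration gist linked above).
curl -LO https://example.com/install-codeserver.sh
chmod +x install-codeserver.sh
./install-codeserver.sh
EOF
```

Everything under `/home/ec2-user/SageMaker` survives stop/start cycles, which is why the clone and the code-server install land there.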
This comes from a really nice post on the AWS Machine Learning Blog by Sofian, Eric, and Giuseppe. They walk you through the setup for Code Server on notebook instances and Studio, so it's a good post to read. They use the AWS CLI to create the lifecycle configuration, but I'll go through the UI. First, you should read that blog post and then set up the configuration like I did. Let's go back and copy-paste the scripts. Here's the create script; let's paste it into the create section. Then we'll do the start section, which is simpler. I'm still reinstalling Git LFS because it's not stored on the persistent volume. Notebook instances have different file systems: your SageMaker directory, where you put your notebooks, is persisted, but the root volume is not, so you need to redo this. Set the credential helper, and this time I don't need to install VS Code because that's already done in my directory; I just need to enable it. Let's paste this into the start section. Now we can create the configuration and then create the notebook instance. Give it a name and pick an instance type. I'm not going to run any actual notebooks here, so a t3.xlarge should be fine. I need to select my lifecycle configuration. Five gigs is not a lot of storage, so I'll bump that up. For the IAM role, as mentioned before, if you already have one, fine; if not, you can create it. Then we're good to go. Click Create. My instance is still starting up, and now I see logs. We have a separate log for the lifecycle config on create and one for the lifecycle config on start. This is super useful for debugging when your scripts don't go as planned. We see the installation: Git LFS, my repo has been cloned, and it fetched the Code Server extension. There's a bit of conda stuff involved, so it takes a minute or two. Let me pause again, and when I come back, the instance will be ready and we'll open it. The instance is up. Let's open it. We'll zoom in a bit. We can see our downloads, the Code Server stuff, and my repo.
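The on-start script described above is simpler, since code-server already lives on the persistent volume. Again a sketch, not the exact gist: the enable-script name is a placeholder.

```shell
#!/bin/bash
# on-start lifecycle script (sketch). Runs every time the instance starts.
set -eux

# The root volume is reset on stop/start, so Git LFS must be reinstalled.
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash
sudo yum install -y git-lfs

sudo -u ec2-user -i <<'EOF'
# Re-enable credential caching after the restart.
git config --global credential.helper 'cache --timeout=86400'

# code-server was installed on the persistent volume during on-create;
# it only needs to be enabled (placeholder script name).
/home/ec2-user/SageMaker/enable-codeserver.sh
EOF
```

The split matters: anything installed on the root volume disappears on stop, anything under `/home/ec2-user/SageMaker` persists, so the on-start script only redoes the root-volume parts.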
We could open a notebook, but what we really want to do is open Code Server. This does look like VS Code. Choose the dark theme. It's Halloween, by the way—Happy Halloween! There we are. This is in the browser, not local VS Code. Now I can do the usual stuff. The /home/ec2-user/SageMaker directory is where you want to keep your files. Let's quickly create a notebook to see that everything works: new file, Jupyter, untitled. Let's pip install transformers. It created an environment automatically for me, but it also auto-detects existing environments on the instance. Let's try and run this. It needs ipykernel in that environment, so maybe that's something to add to the Code Server script. We can fix this by opening a terminal and installing ipykernel. It takes a minute. ipykernel has been installed, and now we should be able to run this. That's the only gotcha: the extra conda package you need to install. Now it's installing transformers, and we're good to go. The next step is to install Copilot. You might think it's super easy to go to the marketplace and install it, but it's not there, so there's some kind of restriction. We can still install Copilot by going to the GitHub Copilot extension page, downloading the extension file, and installing it manually. Let's open that page. If we look at the installation instructions for GitHub Copilot on VS Code, we need to go to the extension page. If we were running local VS Code, we'd just click Install, but here we download the extension file instead. It's a very tiny file. We'll go back to our notebook instance and upload this file. Installing the extension is simple. Just open a terminal; in this directory, we have our Code Server binary. If it's not in the path, that's where it lives. We can run `code-server --install-extension` followed by the file name, and it should all work. There you go. It pops up immediately. I need to sign into GitHub, and after clicking a few times, Copilot is enabled.
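The two terminal fixes mentioned above boil down to a couple of commands. The `.vsix` file name is a placeholder for whatever Copilot version you downloaded from the extension page.

```shell
# Install ipykernel into the conda environment code-server created,
# so the notebook kernel can actually start (the one gotcha above).
conda install -y ipykernel

# Install the manually downloaded Copilot extension into code-server
# (placeholder file name; use the .vsix you uploaded).
code-server --install-extension GitHub.copilot-1.x.vsix
```

If `code-server` is not on your PATH, invoke it with its full path from the directory where the lifecycle script installed it.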
Let's get rid of this terminal and ask Copilot for `install datasets`. Amazing. Once we're done working, we can save the file, close the window, and stop the notebook instance to avoid unnecessary charges. I started the instance again, so you can see it's in service, and the lifecycle config on start log is available if you need to debug. Let's open this and go back to Code Server. Copilot should still be enabled. There you go. This is pretty close to my dream IDE: a wide range of managed infrastructure from small CPU to large multi-GPU instances, a proper IDE with code completion and git management, all the good stuff, and GitHub Copilot, all hosted in the cloud and ready to start and stop as needed. I can't think of a better way to write code and work with notebooks right now. Until next time, Happy Halloween, be safe, and keep rocking!
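If you prefer the AWS CLI to the console for the stop/start cycle, the equivalent calls look like this (the instance name is whatever you chose at creation time):

```shell
# Stop the notebook instance when you're done, to avoid unnecessary charges.
aws sagemaker stop-notebook-instance --notebook-instance-name code-server-demo

# Start it again later; the on-start lifecycle script runs automatically.
aws sagemaker start-notebook-instance --notebook-instance-name code-server-demo
```

Only the root volume is reset between these two calls; everything in your SageMaker directory, including code-server and your repositories, is still there when the instance comes back.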

Tags

SageMaker, VS Code, GitHub Copilot

About the Author

Julien Simon is the Chief Evangelist at Arcee AI, specializing in Small Language Models and enterprise AI solutions. Recognized as the #1 AI Evangelist globally by AI Magazine in 2021, he brings over 30 years of technology leadership experience to his role.

With 650+ speaking engagements worldwide and 350+ technical blog posts, Julien is a leading voice in practical AI implementation, cost-effective AI solutions, and the democratization of artificial intelligence. His expertise spans open-source AI, Small Language Models, enterprise AI strategy, and edge computing optimization.

Previously serving as Principal Evangelist at Amazon Web Services and Chief Evangelist at Hugging Face, Julien has helped thousands of organizations implement AI solutions that deliver real business value. He is the author of "Learn Amazon SageMaker," the first book ever published on AWS's flagship machine learning service.

Julien's mission is to make AI accessible, understandable, and controllable for enterprises through transparent, open-weights models that organizations can deploy, customize, and trust.