How to Use AWS SageMaker and S3 for ML Development (cheaply)

Jude Capachietti
Feb 26, 2023

When I first started learning about Data Science and Machine Learning, I remember I used conda and jupyter notebooks on my local computer. Initially, this worked great for learning about the subject. However, after a while, my hard drive started to get full with all the data I was downloading, not to mention how hot my laptop would get when training a deep neural net model on a large dataset.

I remember my Dad’s friend, who worked in tech, mentioned using SageMaker from AWS, saying it’s great for ML development. At the time, it sounded like yet another thing to learn and I worried it could open the door for a surprise bill of $3,000. And because of this, I never bothered to even try it out, until recently. And now I wish I had been using it all this time!

SageMaker and S3 can be cheap!

It turns out that doing ML development in the cloud is much better for multiple reasons:

  • No filling up my hard drive with data and models
  • Training models on lots of data doesn’t cause my laptop to overheat
  • Much easier to share data and models with others
  • Much easier to deploy models using integrated AWS services
  • Very Cheap (assuming you follow what I do in this article!)

Since I wish I had come to this realization years ago, and I don't want you to go through the same thing, I decided to write this guide so you can learn from my mistakes!

First, I will explain how cheap this approach is and the nuances around cost. Then, I will give you a walkthrough of my ML development workflow, showing you everything you need to start ML development using AWS SageMaker quickly and effectively. And, you can always use this article as a reference in the future as well.

So, to show you how to confidently and cheaply use jupyter notebooks in SageMaker, here is an extract from the SageMaker pricing page:

With SageMaker, you pay only for what you use.

This sounds great, and it really is great. When you hear stories about people getting large bills from AWS, I think it usually comes from confusion around what constitutes “using” SageMaker (or another AWS service). A quick question-and-answer format should make this easy to understand!

Is “using” SageMaker only when you are typing and executing code in a jupyter notebook in SageMaker?

No. I will dive into more details about terminology later, but the gist of the pricing is this: you are charged for the number of hours that the computer running your jupyter notebooks is turned on, at that computer's hourly rate.

So how can we keep the cost of SageMaker low?

Use the computer with the cheapest rate to accomplish our ML task, and turn it off when we are done using it. In order to do this, it is also good to store our data and models in S3 so they can be accessed later (S3 is also very cheap for an individual developer or small team at a smaller scale).

How much will the default option for running jupyter notebooks in SageMaker end up costing with this approach?

The default “computer” when setting up a jupyter notebook environment is ml.t3.medium, and it costs $0.05 per hour. If we assume that we will use it for 40 hours a week, our monthly cost will be 40*4*$0.05 or $8, and it can be even lower if we only turn it on when we really need to use a notebook.
If we forget to turn it off for an entire month (24 hours a day for 30 days), it will only cost us $36, which is not the end of the world, though it could add up over time.
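If you like to see the arithmetic spelled out, here is the same back-of-the-envelope math as a tiny Python snippet (the $0.05/hour rate is approximate and varies by region):

# rough monthly cost of an ml.t3.medium Notebook Instance
# (rate is approximate; check the SageMaker pricing page for your region)
hourly_rate = 0.05

cost_40_hours_per_week = hourly_rate * 40 * 4   # ~$8 per month
cost_left_on_all_month = hourly_rate * 24 * 30  # ~$36 per month

print(cost_40_hours_per_week, cost_left_on_all_month)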

What if I need a more powerful computer for training or cleaning larger datasets?

In this case, I would recommend setting up a new “computer” to run your more intensive ML tasks on, and shutting it off right after they finish. However, the default option can handle a lot, and in many cases it is more than enough for learning and prototyping.

I have not been using precise terminology so far, because I wanted to build intuition for using SageMaker cheaply first. However, it will be helpful to know the actual AWS terminology before the walkthrough. What I have been calling a “computer” is called a Notebook Instance in SageMaker, and we will be running a jupyter notebook environment on a Notebook Instance. The Notebook Instance runs on a particular “computer type” (the official term is a Compute Instance type, and I believe it is essentially an EC2 instance or something very similar). Compute Instance types have names like ml.t3.medium. While I do not believe it is possible to change the Compute Instance type of an existing Notebook Instance, you can always create a new Notebook Instance with a different Compute Instance type, download your files from the old Notebook Instance, and upload them to the new one.
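If it helps to see this terminology in code, here is a small sketch using boto3 (the AWS SDK for Python, which we will also use later in this article) that lists your Notebook Instances along with the Compute Instance type each one runs on. This is just an optional illustration, not something the walkthrough depends on:

import boto3

sagemaker = boto3.client('sagemaker')

# each entry is a Notebook Instance; InstanceType is its Compute Instance type
for nb in sagemaker.list_notebook_instances()['NotebookInstances']:
    print(
        nb['NotebookInstanceName'],
        nb['InstanceType'],           # e.g. ml.t3.medium
        nb['NotebookInstanceStatus']  # e.g. InService or Stopped
    )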

I mentioned before that I recommend saving data and models to S3, and you may be wondering how S3 is priced and if that could end up costing you a lot of money unexpectedly. Well, S3 will only end up being expensive if you are reading and writing many, many Gigabytes across thousands of requests. This is very unlikely to happen with ML development for one person or a small team.

S3 currently costs about $0.02 per Gigabyte per month for storage and fractions of a penny for read/write requests (full details here), so we should not have to worry about storing a few Gigabytes for development. Still, to keep ML development as cheap as possible, I recommend not being excessive with the amount of data you store and deleting data you no longer need; this is a good practice even on your own computer. Even so, if you store 10 GB of data in total and read/write 1,000 times per month, S3 will still cost less than $1 per month. I very much recommend this approach because it is very cheap and it lets us stop and resume our work in jupyter notebooks easily across sessions and across different Notebook Instances.
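To put rough numbers on that claim, here is a quick back-of-the-envelope calculation using approximate S3 Standard prices (actual prices vary by region and request type, so treat this as an estimate):

# approximate S3 Standard pricing: ~$0.023 per GB-month of storage,
# ~$0.005 per 1,000 PUT requests, ~$0.0004 per 1,000 GET requests
gb_stored = 10
requests_per_month = 1_000

storage_cost = gb_stored * 0.023
request_cost = (requests_per_month / 1_000) * (0.005 + 0.0004)

print(round(storage_cost + request_cost, 2))  # roughly $0.24 per month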

ML Workflow Set Up and Walkthrough

I am going to assume that you already have an AWS account, that you can navigate around the console, and that the account you are using has permissions to create an S3 bucket and a SageMaker notebook (the root account has these privileges, though it is best practice not to use the root account; this is something you can learn about later if you choose). If you are not familiar with these points, there are plenty of AWS resources to help with them!

As an overview, here are the steps we will take in this walkthrough:

  1. Create AWS resources (an S3 bucket and a SageMaker Notebook Instance)
  2. Create a jupyter notebook in the Notebook Instance
  3. Create some data in the jupyter notebook and write/read it to S3 (a similar process to downloading/cleaning data)
  4. Train a model and do some visualizations
  5. Save the model to S3
  6. Turn off the Notebook Instance to avoid being charged when we are not using it

I already said this above, but it is the crux of this approach so I will mention it again: the reasoning behind this workflow is to enable us to save our progress at any given point in time to S3, so we can turn off the Notebook Instance until we resume our work. And when we resume our work, we can read our data from S3 and pick up where we left off.

So first, let’s make an S3 bucket for this example. Make sure you are signed into your AWS account and navigate to the S3 service. When you are there click on the Create Bucket button. The screenshot below should help confirm you are at the same point as me:

Create Bucket Screenshot
The button should be in the upper right-hand corner of the screen

After doing that, we just have to name the bucket and create it, leaving all the default options (the bucket is secure from external access by default). As a note, S3 bucket names need to be unique across all AWS accounts, which is why mine is named my-ml-space54gsdf543; the random letters and numbers at the end ensure no one already has this name. I suggest you take a similar naming approach and write the name down (or copy and paste it into your notes somewhere), because we will need it later.

Visually, doing these steps will look like this:

Name the S3 bucket with a name that is unique across all AWS accounts

Then scroll down to the bottom of the page and click Create Bucket:

Just click Create Bucket and we’re done!

Easy enough! Now, we have a place to save our Datasets and ML models, very much like a folder on your computer, just in the cloud!

Also, remember to save your bucket name somewhere so we can use it in our jupyter notebook later!
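As a side note, if you ever prefer to create a bucket from code instead of clicking through the console, a minimal boto3 sketch looks like the one below. The bucket name is mine; you would swap in your own globally unique name:

import boto3

s3 = boto3.client('s3')

# bucket names must be globally unique, so use your own here
# (outside us-east-1, also pass
#  CreateBucketConfiguration={'LocationConstraint': '<your-region>'})
s3.create_bucket(Bucket='my-ml-space54gsdf543')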

Next, we are going to create a jupyter notebook on a Notebook Instance in SageMaker. So, navigate to SageMaker and click on Notebook Instances under the Notebook dropdown in the left-hand side menu. This screenshot should help you see where to click:

You will likely need to expand the “Notebook” dropdown before clicking on “Notebook Instances”

Great job! Now on the next screen, click on the Create notebook instance button:

Now, we will name our notebook instance my-ml-space-notebook at the top of the page, then click Create notebook instance in the lower left of the page, leaving all other options default. Remember, the ml.t3.medium costs about $0.05 per hour, and is good enough for many ML tasks, so it is a great Notebook instance type to start with. Here is a screenshot to confirm you are following along:

Fairly straightforward

As a note, this will also create an IAM role for us that allows the Notebook Instance to access S3, so we do not have to worry about setting that up ourselves. This is in line with AWS security best practices and will not grant any permissions that could give outsiders access to S3 or SageMaker; it simply allows your SageMaker notebook to read and write to your S3 buckets. If you would like to read more on this, there are lots of resources available here.
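As another aside, this console step can also be scripted with boto3 if you ever want to automate it. A minimal sketch is below; the role ARN is just a placeholder for the ARN of an IAM role with the SageMaker and S3 permissions described above:

import boto3

sagemaker = boto3.client('sagemaker')

# placeholder ARN: use the ARN of your own SageMaker execution role
sagemaker.create_notebook_instance(
    NotebookInstanceName='my-ml-space-notebook',
    InstanceType='ml.t3.medium',  # the cheap default discussed above
    RoleArn='arn:aws:iam::123456789012:role/my-sagemaker-execution-role'
)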

Now, we have to wait for the Notebook Instance to get up and running, which could take a bit of time — up to maybe 5 minutes. This could be a good time to maybe do some stretching, read some more about AWS security best practices, or whatever you feel like for a few minutes.

Here is the Notebook Instance with status pending:

Waiting for the Instance to be created and ready to use

Also, make a mental note that this is what the Notebook Instance looks like in the web console, and this is where we will turn it off when we are done using jupyter notebooks to make sure we are not being charged.

Ok, once the Notebook Instance status is InService, we want to click on Open Jupyter:

This should take you to the familiar page of the jupyter notebooks file system browser, except running on AWS SageMaker. From here we want to click on New and under Notebook, choose conda_pytorch_p39 (or whatever conda_pytorch version of python they have when you are doing this).

Select this conda environment for the walkthrough

This will create a new jupyter notebook running Python 3.9, with PyTorch and a bunch of other ML packages pre-installed. Although we are not going to use PyTorch in this walkthrough, this kernel saves us from having to install packages like scikit-learn, and it comes with a version of pandas that can read from and write to S3 easily as well!
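If you want to sanity-check what the kernel actually gives you, a quick cell like this prints the Python and package versions (the exact versions will depend on whichever kernel image AWS is shipping when you read this):

import sys
import sklearn
import pandas as pd
import torch

print(sys.version)
print('scikit-learn:', sklearn.__version__)
print('pandas:', pd.__version__)
print('torch:', torch.__version__)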

Now, we should finally be able to see a good-old jupyter notebook. I am going to rename mine to example_notebook by clicking on Untitled at the top of the page and entering in example_notebook:

You can name yours whatever you wish, then we are ready to start developing ML!

Now, we are finally ready to start our ML workflow in our jupyter notebook. However, before we get into the actual coding, we are going to start with some helpful import statements and put our S3 bucket name in a variable to help later:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# my ml bucket name that I created (remember to update this with yours)
BUCKET_NAME = 'my-ml-space54gsdf543'

And here is where you need the name of the S3 bucket we created earlier. Don’t worry if you don’t have it stored somewhere, just navigate back to S3, copy the name of the bucket and paste it in place of my bucket name!

Now that this is out of the way, let’s get into the fun stuff. First, we will be “downloading” data, cleaning it, and writing/reading it to S3. This should give us a good idea of how to go about procuring an ML dataset using SageMaker and S3. I put “downloading” in quotes because the code I am using just creates data with numpy (and some old math you may remember from here), but it gives us an idea of what downloading data from an API could look like. Here is the code I am using:

# helper function to emulate downloading data into a pandas DataFrame
def download_data_to_pandas():
    '''
    Pretend this function is downloading data from the internet somewhere.
    '''

    def random_1D_array(num, upper_bound=.2):
        '''
        This function outputs 1D-array with the shape (num,)
        full of random elements in [0,upper_bound)
        '''
        return np.random.random_sample((num,))*upper_bound

    # number of data points
    n = 1000

    r = np.linspace(0.0, 10.0, num=n)

    x0_0 = r*(np.cos(r) + random_1D_array(num=n))
    x1_0 = r*(np.sin(r) + random_1D_array(num=n))

    x0_1 = r*(np.cos(r + np.pi) + random_1D_array(num=n))
    x1_1 = r*(np.sin(r + np.pi) + random_1D_array(num=n))

    class0 = np.vstack([x0_0, x1_0, np.ones_like(x0_0)]).T
    class1 = np.vstack([x0_1, x1_1, np.zeros_like(x0_1)]).T

    return pd.DataFrame(
        np.vstack([class0, class1]), columns=['x0', 'x1', 'y']
    )

Now, let’s “download” this data and save it to S3 using pandas, read it back in again, and take a look at the data itself with matplotlib:

# download data points into a pandas DataFrame
df = download_data_to_pandas()

# you can just save a dataframe right to an S3 path using pandas
df.to_csv(f's3://{BUCKET_NAME}/datafile.csv', index=False)

# read data that you uploaded to s3 into memory using pandas
df = pd.read_csv(f's3://{BUCKET_NAME}/datafile.csv')
# plot data to see what it looks like
title = 'Comparing y = 0 and y = 1 for Features x1 and x2'
fig, ax = plt.subplots(1, 1, figsize=(12, 6))

df[df['y'] == 0].plot.scatter('x0','x1', ax=ax, color='b', label='y = 0')
df[df['y'] == 1].plot.scatter('x0','x1', ax=ax, color='r', label='y = 1')
ax.set_title(title)
plt.show()

Wow, look at how easy this is! Especially using this newer version of pandas where we can just read from and write to S3 using a plain S3 path. It’s pretty much as easy as reading and writing to your computer's disk!
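Under the hood, pandas resolves s3:// paths with the s3fs package, which this kernel appears to include (that is why the read and write above just work). If you ever find yourself in an environment without it, an equivalent read that goes through boto3 explicitly looks roughly like this:

import io
import boto3
import pandas as pd

# same file as above, fetched with boto3 instead of an s3:// path
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=BUCKET_NAME, Key='datafile.csv')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))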

Now, this is meant to simulate the steps you would take to download, clean, and study a dataset to get it ready for use in a model. As I mentioned before, we are not actually downloading any data from the internet, but you can imagine this is similar to gathering data by calling an API in a jupyter notebook and saving it to S3. However, data can be saved to S3 by other means as well, which is part of what makes this method great.

Here are some other ways you can get your data into your S3 bucket:

  • Downloading a dataset to your computer, then uploading it into your S3 bucket in the AWS web console
  • Writing data to the S3 bucket programmatically, either with boto3 (as sketched below) or from other AWS services such as Glue or Lambda
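For example, a programmatic upload of a local file with boto3 can be as short as the sketch below (the file name and key here are made up for illustration):

import boto3

s3 = boto3.client('s3')

# upload a hypothetical local CSV into the bucket we created earlier
s3.upload_file(
    Filename='local_dataset.csv',
    Bucket=BUCKET_NAME,
    Key='raw/local_dataset.csv'
)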

Alright, let’s move on to the actual ML model training, since somehow our data ended up super clean, showing a very obvious pattern, and nearly perfectly formatted for plugging right into an ML model!

So we will start by splitting our data into training and test sets, training a model on the training set, and seeing how accurate the model is:

from sklearn.model_selection import train_test_split

# put data in format model prefers
X, y = df[['x0', 'x1']].values, df[['y']].values.ravel()

# split training and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

from sklearn import svm

clf = svm.SVC()
clf.fit(X_train, y_train)
> SVC()

accuracy = np.sum((y_test - clf.predict(X_test) == 0))/y_test.shape[0]
print(f'Our accuracy is: {accuracy}')
> Our accuracy is: 0.9125

Whoa, it looks like our model is pretty accurate as well. Choosing a Support Vector Machine did an alright job of handling this data!
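As a side note, if you prefer not to hand-roll the accuracy calculation, scikit-learn has a built-in helper that should give the same number:

from sklearn.metrics import accuracy_score

# equivalent to the manual accuracy calculation above
print(f'Our accuracy is: {accuracy_score(y_test, clf.predict(X_test))}')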

Let’s get more insight into the model performance by looking at the decision boundary (code adapted from this stackoverflow post):

def make_meshgrid(x, y, h=.02):
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(
        np.arange(x_min, x_max, h),
        np.arange(y_min, y_max, h)
    )
    return xx, yy

def plot_contours(ax, clf, xx, yy, **params):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out

fig, ax = plt.subplots(figsize=(12, 6))
# title for the plot (the default SVC uses an RBF kernel, not a linear one)
title = 'Decision surface of SVC'
# Set up grid for plotting.
X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)

plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
# x0 is on the horizontal axis and x1 on the vertical axis, matching the earlier scatter plot
ax.set_xlabel('x0')
ax.set_ylabel('x1')
ax.set_title(title)
plt.show()

Well, it looks like the classifier did a great job getting the spiral pattern, although there are a few places (such as in the middle and at the outside edge of the red spiral) that aren’t perfect. However, perfection is not possible in reality, and I think this model does a great job of capturing the essence of the pattern!

So now let’s save our model to S3 and read it back into memory again (code adapted from this stackoverflow post):

import tempfile
import boto3
import joblib

# initialize s3 client to save model
s3_client = boto3.client('s3')

# name to save model as in s3
model_name = "model.pkl"

# save to s3
with tempfile.TemporaryFile() as fp:
    joblib.dump(clf, fp)
    fp.seek(0)
    s3_client.put_object(
        Body=fp.read(),
        Bucket=BUCKET_NAME,
        Key=model_name
    )

print(f'model saved to s3 as: {model_name}')
> model saved to s3 as: model.pkl

# read model into memory from s3
with tempfile.TemporaryFile() as fp:
    s3_client.download_fileobj(
        Fileobj=fp,
        Bucket=BUCKET_NAME,
        Key=model_name
    )
    fp.seek(0)
    model = joblib.load(fp)

accuracy = np.sum((y_test - model.predict(X_test) == 0))/y_test.shape[0]
print(f'Our accuracy of model loaded from s3 is: {accuracy}!')
> Our accuracy of model loaded from s3 is: 0.9125!

Whoa, we can see that the model read in from S3 has the same accuracy as the model we trained earlier! So we can use this approach to save our model and open it again later to continue working on it, or we can even use this same model stored in S3 with other AWS services to deploy it or share it with other people who have access to the bucket!

I also want to point out that the above code snippet uses the boto3 package, which is the official AWS SDK for Python. Boto3 is invaluable for using AWS services from python. In the example above we used it to store and read files from S3, but it can do many other tasks as well; for instance, we can use it to list all the objects in a bucket, as sketched below. The docs for boto3 are great, and there are many resources and examples of how to use it on the internet as well!
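Here is a minimal sketch of that listing, reusing the s3_client from above (for buckets with more than 1,000 objects you would wrap this in a paginator):

# list the objects we have written to the bucket so far
response = s3_client.list_objects_v2(Bucket=BUCKET_NAME)

for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])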

Ok, now let’s talk about how to exit out of our ML development session in a way that ensures that we are not being charged for our Notebook Instance when we are not actively using it!

First, we want to make sure we are saving our jupyter notebook so we can open it from this point in the future. We do this by pressing the Save button or using ctrl + s:

Note: in the future, if you have more than one jupyter notebook with unsaved changes in your Notebook Instance, you will want to save all of those notebooks in this step.

Now that we have saved our jupyter notebook, it is time to turn off the Notebook Instance itself to ensure we will not be charged when we are not using it. To do this, navigate back to the Notebook instances page in the SageMaker service that we were on earlier in the AWS web console. Then select the Notebook Instance that we created, my-ml-space-notebook, and click on Stop under the Actions dropdown at the top of the page:

Stop the Notebook Instance

Shutting down the Notebook Instance completely when we are done with it lets us be confident that we are telling AWS it is no longer in “use”, and that the underlying EC2 instance (or similar compute instance) powering our jupyter notebooks can serve other AWS customers while we are away!

This is a great habit to get into for your ML development. It allows you to do things like use an expensive compute instance when needed for some ML task and save the output to S3 to be used again on a cheaper instance. Then you can be confident that you are not accidentally leaving the expensive instance on and finding out by receiving a bill at the end of the month for $3,000!
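If you want a programmatic safety net on top of the console button, boto3 can stop (and later restart) a Notebook Instance as well. A minimal sketch, using the instance name from this walkthrough:

import boto3

sagemaker = boto3.client('sagemaker')

# same effect as clicking Stop in the console
sagemaker.stop_notebook_instance(NotebookInstanceName='my-ml-space-notebook')

# when you come back, start it up again before opening Jupyter
# sagemaker.start_notebook_instance(NotebookInstanceName='my-ml-space-notebook')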

Ok, one last thing to really drive home what we did and learned here. Let’s take a look at what is in the S3 bucket that we created earlier. Navigate over to it in the AWS web console, and take a look at what is in there:

We should be able to see the files that we created from the jupyter notebook, just as if we were navigating to some directory on our own hard drive!

Great job following along here!

Now you know how to do ML development entirely in the cloud on AWS, and you are empowered to do it cheaply! It is entirely possible to do 40 hours a week of ML development using this workflow and pay less than $10 a month (just be sure not to start working with hundreds of Gigabytes of data on expensive compute instances if you are trying to keep costs down).

I tried my best to make the resource that would have helped me greatly a few years back. Now that I made this, I hope you can jump right in and use SageMaker and S3 (cheaply) to build something great!

Also, here is a link to a GitHub repo containing the example_notebook that I made for this article: https://github.com/jude253/use-sagemaker-cheaply

