AWS EC2 vs EMR vs SageMaker
An overview of AWS's most popular products for deploying machine learning projects, with their respective advantages and disadvantages and instructions on how to launch them.
Nowadays, individuals, companies, and governments often prefer cloud machines over traditional computers for multiple reasons. Among them are the ability to switch machines, upgrade them quickly, and obtain more RAM, more cores, and more GPUs. Perhaps the best reason of all is that you pay per second of use instead of paying an up-front cost.
Amazon Web Services (AWS) is one of the most popular on-demand cloud computing platforms at the moment. It offers a variety of technical infrastructure products and services. In this article, we will look at three of its most popular tools for deploying machine learning models: EC2 instances, EMR clusters, and SageMaker Notebooks.
| Tool | AWS info | Pros | Cons |
|:---------:|:--------:|:-------------------:|:----------:|
| EC2 | Link | Baseline | Baseline |
| EMR | Link | Cheap, Auto-Scaling | Cumbersome |
| SageMaker | Link | Easy, Powerful | Expensive |
EC2
Amazon Elastic Compute Cloud (Amazon EC2) offers virtual computing machines called instances. Along with Amazon S3 buckets, these are the most popular services in the AWS ecosystem.
- Users can increase or decrease capacity within minutes.
- Learn more about the different instance types here.
- On-Demand Pricing information here.
EMR
Amazon Elastic MapReduce (EMR) is a managed cluster platform that uses EC2 instances, simplifying the use of Apache Hadoop and Apache Spark to process vast amounts of data.
The central component of EMR is the cluster. A cluster is a collection of EC2 instances, and each instance in it is called a node. There are three types of nodes:
- Master node: Manages the cluster. It distributes the data and tasks among the other nodes.
- Core node: A node that runs tasks and stores data.
- Task node: A node that only runs tasks.
All clusters have at least one master node. More info about nodes here.
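To make the three node roles concrete, here is a minimal sketch of how they could be described in code. It builds a list in the `InstanceGroups` shape that boto3's EMR `run_job_flow` call accepts. The helper name `build_instance_groups`, the group names, and the `m5.xlarge` instance type are our own placeholders, not something prescribed by AWS.

```python
from typing import Any, Dict, List


def build_instance_groups(
    core_count: int,
    task_count: int,
    instance_type: str = "m5.xlarge",  # placeholder instance type
) -> List[Dict[str, Any]]:
    """Describe a cluster as one master group plus optional core and task groups."""
    if core_count < 0 or task_count < 0:
        raise ValueError("node counts must not be negative")

    # Every cluster has a master node, so this group is always present.
    groups: List[Dict[str, Any]] = [
        {
            "Name": "Master",
            "InstanceRole": "MASTER",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        }
    ]
    # Core nodes run tasks and store data.
    if core_count > 0:
        groups.append(
            {
                "Name": "Core",
                "InstanceRole": "CORE",
                "InstanceType": instance_type,
                "InstanceCount": core_count,
            }
        )
    # Task nodes only run tasks.
    if task_count > 0:
        groups.append(
            {
                "Name": "Task",
                "InstanceRole": "TASK",
                "InstanceType": instance_type,
                "InstanceCount": task_count,
            }
        )
    return groups


if __name__ == "__main__":
    for group in build_instance_groups(core_count=2, task_count=4):
        print(group["InstanceRole"], group["InstanceCount"])
```

The function only builds a plain Python list, so it can be inspected or adjusted before anything is sent to AWS.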
About EMR:
- Dynamically scalable to meet demand.
- Best to use when you heavily rely on Spark, Hadoop, and MapReduce.
- On-Demand Pricing (in addition to the EC2 pricing) here.
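Since the EMR charge comes on top of the EC2 charge, a rough estimate of what a cluster costs can be sketched as below. The hourly rates in the example are made-up numbers for illustration only, and the 60-second billing minimum is our assumption; please check the pricing pages for the real figures.

```python
def billed_seconds(seconds: float, minimum: float = 60.0) -> float:
    """Apply an assumed minimum billing period to a per-second charge."""
    return max(seconds, minimum)


def emr_cluster_cost(
    ec2_hourly_rate: float,
    emr_hourly_rate: float,
    node_count: int,
    seconds: float,
) -> float:
    """Estimate the cost of a cluster: every node pays the EC2 rate plus the EMR rate."""
    hours = billed_seconds(seconds) / 3600
    return (ec2_hourly_rate + emr_hourly_rate) * node_count * hours


if __name__ == "__main__":
    # Hypothetical rates in USD per hour, NOT real AWS prices.
    estimate = emr_cluster_cost(
        ec2_hourly_rate=0.20,
        emr_hourly_rate=0.05,
        node_count=3,
        seconds=1800,  # half an hour
    )
    print(f"Estimated cost: ${estimate:.3f}")
```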
SageMaker
Amazon SageMaker is a fully managed platform that enables data scientists to quickly deploy machine learning models into production with only a few clicks.
- Pricing information here.
Creating and Launching AWS products
EC2
1. Select EC2 from the AWS console
2. Launch instance
3. Select the desired instance
4. Launch the instance and connect to it using the SSH instructions
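The same steps can also be scripted. Below is a minimal sketch using boto3's EC2 client and its `run_instances` call. The helper names, the AMI ID, the key pair name, the region, and the `ec2-user` login (the usual default on Amazon Linux images) are all placeholders or assumptions of ours that you would replace with your own values.

```python
from typing import Any, Dict


def build_run_instances_request(
    ami_id: str, instance_type: str, key_name: str
) -> Dict[str, Any]:
    """Build the keyword arguments for a single-instance run_instances call."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "KeyName": key_name,  # key pair used later for SSH
        "MinCount": 1,
        "MaxCount": 1,
    }


def launch_instance(request: Dict[str, Any], region: str = "us-east-1") -> str:
    """Launch the instance and return its ID. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the helpers above work without boto3 installed

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**request)
    return response["Instances"][0]["InstanceId"]


def ssh_command(key_path: str, public_dns: str, user: str = "ec2-user") -> str:
    """Format the SSH command used to connect to a running instance."""
    return f"ssh -i {key_path} {user}@{public_dns}"


if __name__ == "__main__":
    request = build_run_instances_request(
        ami_id="ami-0123456789abcdef0",  # placeholder, not a real AMI
        instance_type="t3.micro",
        key_name="my-key-pair",  # placeholder key pair name
    )
    print(request)
    # instance_id = launch_instance(request)  # uncomment to actually launch (billable)
```

Keeping the request in a plain dictionary lets you review exactly what will be launched before calling `launch_instance`, which starts a billable resource.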
EMR
1. Select EMR from the AWS console
2. Create cluster
3. Go to advanced options and select needed software
4. Click next and proceed to hardware selection. Here you can pick your master, core, and task nodes (EC2 instances).
5. Set up the final configurations
6. Launch the cluster: users can connect either through SSH or through a Notebook.
SSH
NOTEBOOK
(Through the terminal, the user is able to install any additional software that they might need.)
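For readers who prefer code over the console, the cluster creation steps above could be sketched with boto3's EMR client and its `run_job_flow` call. This is only a sketch under our own assumptions: the release label, instance type, key pair name, and region are placeholders, and the two role names are the default EMR roles, which must already exist in your account.

```python
from typing import Any, Dict, List


def build_cluster_request(
    name: str,
    release_label: str,
    applications: List[str],
    key_name: str,
    node_count: int,
    instance_type: str = "m5.xlarge",  # placeholder instance type
) -> Dict[str, Any]:
    """Build the keyword arguments for run_job_flow (software plus hardware)."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Software selection, e.g. Spark or Hadoop.
        "Applications": [{"Name": app} for app in applications],
        # Hardware selection: one master node, the remaining nodes are core nodes.
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": node_count,
            "Ec2KeyName": key_name,  # key pair used for the SSH connection
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster running
        },
        # Default EMR roles, assumed to exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request: Dict[str, Any], region: str = "us-east-1") -> str:
    """Create the cluster and return its ID. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without boto3 installed

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    request = build_cluster_request(
        name="example-cluster",
        release_label="emr-6.15.0",  # placeholder release label
        applications=["Spark", "Hadoop"],
        key_name="my-key-pair",  # placeholder key pair name
        node_count=3,
    )
    print(request["Applications"])
    # cluster_id = launch_cluster(request)  # uncomment to actually launch (billable)
```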
SageMaker
1. Select SageMaker from the console
2. Create a Notebook instance
3. Launch the instance
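The Notebook instance can also be created programmatically. Here is a minimal sketch using boto3's SageMaker client and its `create_notebook_instance` call. The notebook name, the `ml.t3.medium` instance type, the region, and the IAM role ARN are placeholders of ours; the role must be one that SageMaker is allowed to assume in your account.

```python
from typing import Any, Dict


def build_notebook_request(
    name: str, role_arn: str, instance_type: str = "ml.t3.medium"
) -> Dict[str, Any]:
    """Build the keyword arguments for create_notebook_instance."""
    return {
        "NotebookInstanceName": name,
        "InstanceType": instance_type,
        "RoleArn": role_arn,  # IAM role that SageMaker assumes
    }


def create_notebook(request: Dict[str, Any], region: str = "us-east-1") -> str:
    """Create the Notebook instance and return its ARN. Needs boto3 and credentials."""
    import boto3  # imported here so the builder above works without boto3 installed

    sagemaker = boto3.client("sagemaker", region_name=region)
    response = sagemaker.create_notebook_instance(**request)
    return response["NotebookInstanceArn"]


if __name__ == "__main__":
    request = build_notebook_request(
        name="example-notebook",
        # Placeholder ARN, not a real role.
        role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    )
    print(request)
    # arn = create_notebook(request)  # uncomment to actually create it (billable)
```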
In conclusion, an EC2 instance is a good starting point for deploying any machine learning project. Within a few minutes, the user can scale up or down, and they pay only for what they use.
If the user relies heavily on a distributed system to deal with vast amounts of data, then an EMR cluster will be a much better, auto-scalable, cheaper option. However, EMR should be used only for this purpose, since setting it up correctly can be somewhat cumbersome.
On the other hand, if price is not an issue, SageMaker offers a wide variety of powerful options out of the box to quickly deploy machine learning models into production.
This is by no means a comprehensive and exhaustive list of AWS products or their uses, but we hope that after reading this article you are able to make a more informed decision for your project ;) Thank you for reading!