Harry's Engineering

01.09.17

An on-demand high-powered Jupyter notebook server

By: Andrea Heyman

TL;DR: Harry's Analytics used Docker and AWS CloudFormation to provision a remote Jupyter notebook server for training machine learning models. The server can be spun up/torn down on demand and offers the same convenience as working locally.

Introduction

Here on the Analytics team at Harry’s, we frequently found ourselves training machine learning models on data coming from the entirety of Harry’s existence (4+ years and counting!) in Jupyter notebooks on our laptops. While perhaps this was a feasible workflow in the earlier days of Harry’s when there wasn’t as much data, as our data grew it became painfully slow to build and iterate on models on our local machines.

We wanted to train our models on machines with more cores, more memory, and even GPUs, depending on the task at hand. We also wanted working on a remote machine with our desired hardware specs to feel as convenient as working locally on our laptops. Finally, we didn’t want to pay to rent such resources when we weren’t actively using them.

We decided to build a remote Jupyter notebook server to satisfy these needs. Because most of our team’s applications already run on AWS infrastructure, and because we use Amazon Redshift for our data warehouse, we wanted an AWS solution to this problem. We settled on using AWS CloudFormation to configure a stack of resources so that our remote machine could be easily brought up and torn down on demand. We also wanted to use Docker with AWS Elastic Container Service (ECS) to make sure that working on a remote machine came with everything we had in our Python virtual environments locally.

Requirements

Here were the requirements we had for this notebook server:

  1. Main Docker image not maintained by us
  2. When brought up, the server comes pre-loaded with code from a team git repository of Jupyter notebooks
  3. We can push edited code back to the git repository at the end of a working session
  4. We can get/put serialized trained models from a persistent location (i.e. S3 bucket)
  5. For security reasons, the server is only accessible from in-office IP addresses and is password protected
  6. When working on the remote server, we have the credentials and permissions to execute queries against our data warehouse (which lives in its own AWS account)
  7. Easy to spin up and take down the expensive parts of the system as needed with one click/command

Each of these requirements presented its own set of challenges (some more than others). In the remainder of this post I’ll explain the architecture we landed on, how it satisfies the above requirements, and some of the challenges we faced on the way to the eventual solution.

Architecture

[Figure: Notebook Server Architecture]

To summarize: data scientists working from within the office or over VPN navigate to the host name associated with the Application Load Balancer (ALB) on port 8888. The ALB passes requests to the single EC2 instance running our Docker image. On startup, the container is responsible for installing any necessary requirements, bootstrapping the machine with the notebook code from our Github repo, configuring credentials for our Data Warehouse, and starting the Jupyter notebook server (more below). To install requirements or reach the Data Warehouse, the EC2 instance communicates with the outside world via the NAT Gateway, whose Elastic IP address is whitelisted by our data warehouse Redshift cluster.

This architecture might at first seem unnecessarily complicated, so here are answers to some reasonable questions you might have:

Q: Why is there a load balancer when there is only one worker?

A: One of our requirements was that it be easy to bring up our notebook server with one click/command. In order to achieve this, the EC2 instance that we bring up must communicate with our Data Warehouse via the same IP address every time, so that we can whitelist that IP address once instead of manually needing to whitelist a new one each time we bring up the server. The easiest way to do this was to put the ECS Cluster in a private subnet, which communicates through the NAT Gateway, which has an Elastic IP address and stays up all the time (see the following section). In order for our developers to talk to the server from the public internet, we need to put a load balancer in a public subnet that’s permitted to talk to the EC2 instance.

Q: Ok, but why is it an Application Load Balancer and not a Classic Load Balancer?

A: The Jupyter webserver uses websockets to maintain a persistent connection to a kernel. Application Load Balancers support websockets natively, whereas the HTTP/HTTPS listeners of a Classic Load Balancer do not.

Provisioning with CloudFormation

The VPC in which this entire service sits, along with its associated NAT Gateway and public/private subnets (shown with bold lines), is provisioned by what we call the “Base Stack” and is shared with other intelligence applications owned by Harry’s Analytics and Data Engineering teams. We refer to the outer VPC as the “Data Apps” VPC, since it is shared by multiple applications. The notebook server-specific resources are provisioned by two separate CloudFormation stacks (shown with dashed lines): one that is “persistent” and left up all the time, and one that is “per instance” and spun up/torn down on demand.

A pseudo-y version of our two stacks:

Persistent:

Parameters:

    BaseStackName:
        Description: The name of the stack containing the VPC/NAT Gateway/Subnets that the cluster created here should live in
        Type: String

    WhitelistCIDR1:
        Description: Please enter the first IP address range that can be used to communicate with the ALB
        Type: String

    WhitelistCIDR2:
        Description: Please enter the second IP address range that can be used to communicate with the ALB
        Type: String

Resources:

    SecurityGroups: ...

    ECSCluster: ...

A subtle point here is that as part of our persistent stack we bring up an ECS Cluster that has no nodes in it – we add an EC2 instance to the cluster as part of the instance stack. The reason for this is that an ECS Cluster is slow to provision but inexpensive to keep up with no nodes. We want to limit the amount of time our data scientists have to wait for their server to provision.
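To make the two WhitelistCIDR parameters in the persistent stack more concrete, here is a minimal Python sketch of the check they express: a client may talk to the ALB only if its IP address falls inside one of the whitelisted ranges. In AWS this filtering is presumably enforced by the security groups rather than by application code, so the sketch only models the rule. The ranges below are hypothetical documentation addresses, not our real office ranges, and `is_whitelisted` is a name invented for illustration.

```python
import ipaddress

# Hypothetical ranges (RFC 5737 documentation addresses) standing in for
# the real WhitelistCIDR1 / WhitelistCIDR2 parameter values.
WHITELIST_CIDRS = ["203.0.113.0/24", "198.51.100.0/28"]


def is_whitelisted(client_ip, cidrs=WHITELIST_CIDRS):
    """Return True if client_ip falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidrs)
```

A /28 range covers only 16 addresses, so a check like this is also a quick way to sanity-test a CIDR before passing it to the stack.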

Instance:

Parameters:
    Prefix:
        Description: The prefix to attach to this set of instance resources (e.g. developer name). Must be 28 characters or less.
        Type: String

    PersistentStackName:
        Description: The name of the application plus environment, e.g. notebook-server. Must match that used in the persistent stack.
        Type: String

    BaseStackName:
        Description: The name of the stack containing the VPC/NAT Gateway/Subnets that the cluster created here should live in
        Type: String

    Image:
        Description: Location of the Docker image in Amazon ECR to use for the application service
        Type: String

    GithubToken:
        Description: Authentication token for access to Github notebook repo
        Type: String

    DBPassword:
        Description: Password for notebook Data Warehouse user
        Type: String

    NotebookPassword:
        Description: Password for Jupyter notebook server
        Type: String

    InstanceType:
        Description: Which instance type should we use?
        Type: String

Resources:

    ALB: ...
    
    EC2Instance: ...

    ECSServiceNotebookServer: ...

Note that we pass our secrets, namely Github token, data warehouse password, and notebook server password, as parameters to the top-level instance stack. Those parameters then get passed to the child stack for the ECSServiceNotebookServer, which sets them as environment variables in the ECS Container that can then be referenced when the Docker image bootstraps.

Our “one-command” requirement is satisfied in that we can bring up our per-instance resources by invoking the create-stack command with the right parameters:

aws cloudformation create-stack --role-arn [ROLE ARN] --region us-east-1 \
--profile [AWS PROFILE] --disable-rollback --capabilities CAPABILITY_NAMED_IAM \
--stack-name notebook-server-dev-[PREFIX] \
--template-body file://`pwd`/cloudformation/master-instance-resources.yaml \
--parameters ParameterKey=Prefix,ParameterValue=[PREFIX] \
ParameterKey=PersistentStackName,ParameterValue=[PERSISTENT STACK NAME] \
ParameterKey=Image,ParameterValue=[ECR IMAGE LOCATION] \
ParameterKey=BaseStackName,ParameterValue=[BASE STACK NAME] \
ParameterKey=GithubToken,ParameterValue=[GITHUB TOKEN] \
ParameterKey=DBPassword,ParameterValue=[DB PASSWORD] \
ParameterKey=NotebookPassword,ParameterValue=[SHA1 HASH OF NOTEBOOK PASSWORD] \
ParameterKey=InstanceType,ParameterValue=[EC2 INSTANCE TYPE]
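For anyone who would rather script this step, the following Python sketch assembles the same kind of command as an argument list and prepares the hashed notebook password. It is an illustration under assumptions, not our actual tooling: the function names are invented, the `--role-arn`, `--region`, and `--profile` options are omitted for brevity, and the `sha1:<salt>:<digest>` layout mirrors what the classic notebook’s `notebook.auth.passwd` helper produces as far as we understand it. When that helper is available, prefer it over the hand-rolled version here.

```python
import hashlib
import secrets


def hash_notebook_password(password, salt=None):
    """Return 'sha1:<salt>:<hexdigest>' for a plain-text password.

    Assumption: mirrors the salted SHA1 format of the classic notebook's
    notebook.auth.passwd helper (digest of password followed by salt).
    """
    if salt is None:
        salt = secrets.token_hex(6)  # 12 random hex characters
    digest = hashlib.sha1((password + salt).encode("utf-8")).hexdigest()
    return f"sha1:{salt}:{digest}"


def build_create_stack_command(prefix, template_path, parameters):
    """Assemble the argument list for `aws cloudformation create-stack`."""
    command = [
        "aws", "cloudformation", "create-stack",
        "--stack-name", f"notebook-server-dev-{prefix}",
        "--template-body", f"file://{template_path}",
        "--capabilities", "CAPABILITY_NAMED_IAM",
        "--disable-rollback",
        "--parameters",
    ]
    # Each stack parameter becomes one ParameterKey=...,ParameterValue=... token.
    for key, value in parameters.items():
        command.append(f"ParameterKey={key},ParameterValue={value}")
    return command
```

The resulting list can be handed to `subprocess.run(command, check=True)`. Building a list instead of a single shell string avoids quoting problems when a parameter value contains spaces or shell metacharacters.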

Dockerfile and bootstrapping

Recall that we expect our Dockerfile to be responsible for installing requirements and bootstrapping the machine. Bootstrapping entails cloning our Github repo of notebooks, configuring credentials for our Data Warehouse, and starting the Jupyter notebook server.

Also recall that a key requirement was that our main Docker image not be maintained by us. Fortunately, Jupyter maintains a nice collection of Docker images equipped with most packages necessary for core data science work. All we need to do is install the AWS CLI (which we use to push/fetch serialized models to and from our S3 bucket), use conda to install any additional requirements from our requirements.txt file (in our case just psycopg2), copy our bootstrap script into the image, and set the bootstrap script as our container entrypoint.

Dockerfile:

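FROM jupyter/scipy-notebook

# Install the AWS CLI as root (used to push/fetch serialized models from S3)
USER root
RUN apt-get update && \
    apt-get install -y awscli && \
    rm -rf /var/lib/apt/lists/*

# Switch back to the image's default notebook user
USER jovyan

# Install extra Python requirements (in our case just psycopg2)
COPY requirements.txt .
RUN conda install --yes --file requirements.txt

COPY bootstrap.sh /usr/local/bin/

# Use the full path so the script is found regardless of the working directory
ENTRYPOINT ["bash", "/usr/local/bin/bootstrap.sh"]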

Bootstrap script:

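#!/bin/bash
# Exit immediately if any command fails
set -e

# Clone the notebook repo, authenticating with the token from the environment
git config --global user.email "[EMAIL ADDRESS]"
git config --global user.name "[USER NAME]"
git clone https://${GH_TOKEN}:x-oauth-basic@[NOTEBOOK GITHUB REPO.git] ./[FOLDER NAME]
cd ./[FOLDER NAME]

# Store the data warehouse password where Postgres clients look for it
echo "*:*:*:[NOTEBOOK SERVER DB USER]:${DB_PASSWORD}" >> ~/.pgpass
chmod 0600 ~/.pgpass

# Start Jupyter with the hashed password passed in through the environment
bash start-notebook.sh "--NotebookApp.password=${NOTEBOOK_PASSWORD}"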

Note that we put our data warehouse credentials into a pgpass file so that we don’t need to reference them in any of the actual notebooks.
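The pgpass step can also be expressed in Python, which makes one subtlety of the file format explicit: each line has the form `hostname:port:database:username:password`, and any colon or backslash inside a field must be escaped with a backslash. The sketch below is illustrative only; the helper names are invented, and the bootstrap script above does this in shell.

```python
import os


def pgpass_entry(user, password, host="*", port="*", database="*"):
    """Format one pgpass line: hostname:port:database:username:password."""
    def escape(field):
        # Backslashes and colons inside a field must be backslash-escaped.
        return str(field).replace("\\", "\\\\").replace(":", "\\:")
    return ":".join(escape(f) for f in (host, port, database, user, password))


def append_pgpass(path, entry):
    """Append an entry and restrict the file to its owner (mode 0600)."""
    with open(path, "a") as handle:
        handle.write(entry + "\n")
    os.chmod(path, 0o600)
```

With the file in place, a notebook can call `psycopg2.connect(host=..., port=..., dbname=..., user=...)` without a `password` argument, and the underlying libpq library looks the password up in `~/.pgpass`. The 0600 permissions matter: libpq ignores the file if it is readable by other users.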

We build and push our Docker image once to an Amazon ECR repository and only need to update it if our requirements change. When we bring up the instance stack with this Docker image and navigate to port 8888 on the load balancer’s host name, we log in with our NOTEBOOK_PASSWORD and find ourselves at the Jupyter notebook tree page, with all of the code from the repository! For any shell commands, like changing git branches, making and pushing git commits, or interacting with our S3 bucket, we can open a terminal right in the browser from the /tree page using Jupyter’s Terminal feature:

[Screenshot: Terminal]
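The S3 round trip for trained models can also be driven from a notebook cell instead of the terminal. The sketch below is a rough illustration under assumptions: it serializes with `pickle` (this post does not say which serialization format we use), the helper names are invented, and the bucket path is a placeholder.

```python
import pickle


def save_model(model, local_path):
    """Serialize a trained model object to a local file."""
    with open(local_path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(local_path):
    """Load a previously serialized model from a local file."""
    with open(local_path, "rb") as handle:
        return pickle.load(handle)


def s3_copy_command(source, destination):
    """Argument list for `aws s3 cp`, usable in either direction."""
    return ["aws", "s3", "cp", source, destination]
```

After `save_model(model, "model.pkl")`, running `subprocess.run(s3_copy_command("model.pkl", "s3://[MODEL BUCKET]/model.pkl"), check=True)` would push the file with the AWS CLI installed in the image. Swapping the two arguments fetches it back at the start of the next session.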

Multiple developers working concurrently

An important thing to note about our architecture is that for each load balancer, there is only one webserver worker serving content out of the file system. If multiple users were to use the same server, they would share a file system (and the same .git directory) and could overwrite each other’s changes.

Should multiple developers want to work on a bigger machine concurrently, each developer would bring up the instance stack with their own prefix as part of the stack name and as the value of the Prefix parameter. This creates a separate EC2 instance and a separate Load Balancer for each developer, who then simply navigates to the host name of their own load balancer on port 8888.
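As a small illustration of the naming scheme, the sketch below derives a per-developer stack name from a prefix and enforces the 28-character limit stated in the Prefix parameter description. The function name is invented for this example, and the `notebook-server-dev` base is taken from the create-stack command shown earlier.

```python
MAX_PREFIX_LENGTH = 28  # limit stated in the Prefix parameter description


def instance_stack_name(prefix, base="notebook-server-dev"):
    """Return the per-developer instance stack name for a given prefix."""
    if not prefix:
        raise ValueError("prefix must not be empty")
    if len(prefix) > MAX_PREFIX_LENGTH:
        raise ValueError(
            f"prefix must be {MAX_PREFIX_LENGTH} characters or less, "
            f"got {len(prefix)}"
        )
    return f"{base}-{prefix}"
```

Because each developer passes a different prefix, the resulting stack names never collide, and tearing down one developer’s stack leaves the others untouched.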

Limitations

One limitation of our current setup is that the Load Balancer stops directing traffic to our EC2 instance if it deems the instance “unhealthy.” An instance is unhealthy when the load balancer cannot complete a successful health check within a certain period of time, where a health check involves pinging a configurable endpoint on the notebook server. Jupyter is the only webserver we run, so while it is busy with a computationally expensive task it may be unable to respond to such pings for an extended period. Even with the health check settings made as lenient as AWS allows, we’re only able to run tasks that take about an hour or less; any longer and we lose the ability to reach our EC2 instance through the load balancer. Future work entails running a second, simple webserver on a different port on the same machine that’s solely responsible for responding to health check pings.
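The second webserver described as future work could be as small as the following Python sketch, which uses only the standard library. It is not something we have deployed: the port (8889), the `/health` path, and the names are all assumptions, and the load balancer’s health check would need to be pointed at that port and path.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers health check pings and nothing else."""

    def do_GET(self):
        if self.path == "/health":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep frequent health check pings out of the logs


def start_health_server(port=8889, host="0.0.0.0"):
    """Start the health server in a background thread and return it."""
    server = ThreadingHTTPServer((host, port), HealthHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

To stay responsive while a notebook kernel is saturating the machine, this would be best run as its own process (for example, launched from the bootstrap script before Jupyter starts) rather than from inside a notebook. Calling `server.shutdown()` stops it.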

Thanks for following along, and see our Github repo for detailed CloudFormation templates!