Introduction to JupyterHub¶
This is part of a blog series on AI/Machine Learning. In the previous blog, we discussed Jupyter Notebooks, what makes them different, and the challenges organizations run into with them at scale. In this blog, we will look at how organizations can use JupyterHub to provide access to Jupyter notebooks as a centralized service for their data scientists.
Why Not Standalone Jupyter Notebooks?¶
In the last blog, we summarized the issues that data scientists have to struggle with if they use standalone Jupyter notebooks. Let's review them again.
Installation & Configuration¶
Although data scientists can download, configure and use Jupyter notebooks on their laptops, this approach is neither effective nor scalable for organizations, because it requires every data scientist to become an expert in, and spend time on, the following:
- Downloading and installing the correct versions of Python
- Creating virtual environments and troubleshooting installs
- Dealing with system vs. non-system versions of Python
- Installing packages and dealing with folder organization
- Understanding the difference between conda and pip
- Learning various command-line commands
- Understanding the differences between Python on Windows vs. macOS
Lack of Standardization¶
On top of this, there are other limitations that impact the productivity of data scientists as well:
- Being limited by the compute capabilities of the laptop (e.g. no GPU)
- Pursuing Shadow IT to obtain compute resources from public clouds
- Constantly downloading and uploading large sets of data
- Being unable to access data because of corporate security policies
- Being unable to collaborate effectively (e.g. share notebooks) with other data scientists
Important
JupyterHub's goal is to eliminate all these issues and help IT teams deliver Jupyter Notebooks as a managed service for their data scientists.
In a nutshell, with JupyterHub, IT/Ops can provide an experience where data scientists just need to log in, explore data and write Python code. They do not have to worry about installing software on their local machines, and they get access to a consistent, standardized and powerful environment to do their job.
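To make this concrete, below is a minimal sketch of what such a centrally managed environment might look like in JupyterHub's `jupyterhub_config.py`, assuming the open-source kubespawner package; the image name and resource sizes are illustrative, not a recommendation.

```python
# jupyterhub_config.py -- a minimal sketch of a centrally managed environment.
# Assumes the kubespawner package; image name and sizes are illustrative.
c = get_config()  # provided by JupyterHub at startup

# Spawn each user's notebook server as a Kubernetes pod
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

# Every data scientist gets the same IT-curated image -- no local installs
c.KubeSpawner.image = "registry.example.com/datascience-notebook:2024.1"  # hypothetical image

# Right-size the pods so the cluster autoscaler can pack them efficiently
c.KubeSpawner.cpu_guarantee = 1
c.KubeSpawner.cpu_limit = 2
c.KubeSpawner.mem_guarantee = "2G"
c.KubeSpawner.mem_limit = "4G"

# Persistent home directory, so work survives pod restarts
c.KubeSpawner.storage_pvc_ensure = True
c.KubeSpawner.storage_capacity = "10Gi"
```

With a configuration along these lines, the environment is defined once by IT/Ops and every user who logs in gets the same Python version, packages and resource profile.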
JupyterHub Requirements for IT/Ops¶
JupyterHub is a Kubernetes-native application and needs to be deployed and operated by IT/Ops in any environment (public or private cloud) where Kubernetes can be made available. Although a Helm chart for JupyterHub is available, IT/Ops needs to do a whole lot more to operationalize it in their organization. Let's explore these requirements in greater detail.
| # | Requirement | Description |
|---|-------------|-------------|
| 1 | Prerequisites | As IT/Ops, how can I quickly deploy and operate the necessary prerequisites for JupyterHub, i.e. a Kubernetes cluster with all required add-ons for autoscaling, monitoring, GPU drivers, notifications, an Ingress controller, cert-manager, external-dns, etc.? I may need to do this in different environments (public cloud: AWS, Azure, GCP, OCI; or private cloud: data center). |
| 2 | Rapid Turnaround | Upon request by data scientists, IT/Ops needs to provide an instance of JupyterHub optimized for their requirements without delay. |
| 3 | Low Infra Costs | The infrastructure costs of supporting the data scientists should be extremely low. The underlying infrastructure should be configured to use low-cost Spot instances and tools such as Karpenter to ensure it is right-sized. |
| 4 | Low Operational Costs | The personnel costs to deploy, operate and support this should be extremely low; it should not require a massive team to develop, maintain and support the infrastructure. |
| 5 | Secure Access | All access to the JupyterHub instance should be secured via Single Sign-On (SSO), with the organization's Identity Provider (IdP) enforcing access policy (see the authenticator sketch after this table). |
| 6 | Self-Service Experience | IT/Ops should provide data scientists a self-service experience where they can select from a menu of available options, click to deploy in a few minutes and start using it (see the profile menu sketch after this table). |
| 7 | Connection Security | All connections between the various components need to be secure, ideally using mTLS via a service mesh such as Istio. |
| 8 | Data Protection | All data created and used by data scientists needs to be automatically backed up. If/when required, IT/Ops should be able to perform data recovery in minutes. |
| 9 | Data Access | Data scientists should have automatic access to data that they can use in their notebooks. |
| 10 | Data Analytics | Data scientists should have the ability to seamlessly perform data analytics using engines such as Spark directly from their Jupyter notebooks (see the PySpark sketch after this table). |
| 11 | Environment Promotion | Data scientists should have the ability to seamlessly promote new code from Dev to QA and eventually to Prod, with appropriate workflows and approvals. |
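As an illustration of requirement 5, below is a sketch of an SSO configuration using the open-source oauthenticator package's GenericOAuthenticator; all URLs, client IDs and secrets are placeholders for your organization's IdP (Okta, Keycloak, Azure AD, etc.).

```python
# jupyterhub_config.py (continued) -- SSO sketch via oauthenticator.
# All endpoint URLs and credentials below are placeholders for your IdP.
from oauthenticator.generic import GenericOAuthenticator

c.JupyterHub.authenticator_class = GenericOAuthenticator
c.GenericOAuthenticator.client_id = "jupyterhub"          # issued by the IdP
c.GenericOAuthenticator.client_secret = "REPLACE_ME"      # issued by the IdP
c.GenericOAuthenticator.oauth_callback_url = "https://hub.example.com/hub/oauth_callback"
c.GenericOAuthenticator.authorize_url = "https://idp.example.com/oauth2/authorize"
c.GenericOAuthenticator.token_url = "https://idp.example.com/oauth2/token"
c.GenericOAuthenticator.userdata_url = "https://idp.example.com/oauth2/userinfo"
c.GenericOAuthenticator.username_claim = "email"
```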
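For requirement 6, KubeSpawner's profile_list is one way to present a menu of pre-approved options at login; the profiles below (names, sizes, GPU counts) are illustrative.

```python
# jupyterhub_config.py (continued) -- a self-service "menu" sketch using
# KubeSpawner's profile_list. Profile names and sizes are illustrative.
c.KubeSpawner.profile_list = [
    {
        "display_name": "Small (CPU only)",
        "description": "2 CPU / 4 GiB -- everyday exploration",
        "default": True,
        "kubespawner_override": {"cpu_limit": 2, "mem_limit": "4G"},
    },
    {
        "display_name": "Large (1 GPU)",
        "description": "8 CPU / 32 GiB / 1 NVIDIA GPU -- model training",
        "kubespawner_override": {
            "cpu_limit": 8,
            "mem_limit": "32G",
            "extra_resource_limits": {"nvidia.com/gpu": "1"},
        },
    },
]
```

Users see these profiles as radio buttons on the spawn page, pick one, and get a pod sized accordingly, without ever filing a ticket.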
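And for requirement 10, the snippet below sketches what seamless analytics can look like from inside a notebook cell, assuming pyspark is installed in the notebook image and IT/Ops has configured connectivity to a Spark cluster; the application name and dataset path are hypothetical.

```python
# Inside a Jupyter notebook cell -- assumes pyspark is available in the
# notebook image and Spark connectivity has been configured by IT/Ops.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("notebook-analytics")  # hypothetical application name
    .getOrCreate()
)

# Hypothetical dataset path; real access would flow from requirement 9
df = spark.read.parquet("s3a://example-bucket/events/")
df.groupBy("event_type").count().show()
```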
The diagram below shows what a typical JupyterHub environment managed by IT/Ops would look like at steady state.
Rafay's Templates for AI/GenAI¶
The Rafay team has been helping many of our customers operationalize JupyterHub for their data scientists. We have packaged the best practices and security controls, and streamlined the entire deployment process, so that IT/Ops can save a significant amount of time and money. The entire process is now a 1-click deployment experience for IT/Ops, and they are now starting to offer this to their data scientist teams as a self-service experience. Learn more about Rafay's template for JupyterHub.
Important
This is available for all Rafay customers that have licensed the AI/GenAI suite.
Blog Ideas¶
Sincere thanks to the readers of our blog who spend time reading our product blogs. This blog was authored because we are working with several customers that are expanding their use of Jupyter notebooks on Kubernetes and AWS SageMaker using the Rafay Platform. Please contact us if you would like us to write about other topics.