On-prem deployment

Introduction

This guide walks you through deploying ConvectHub on a cluster of on-prem machines.

There are three types of machines involved in the provisioning process:

  1. Provisioner -- This is not a computing node but a machine used by an admin to trigger and control the provisioning process.
  2. Master node -- This is a node that hosts mission-critical services such as ingress, the cluster database, device plugins, etc. In an HA setting, there can be multiple master nodes.
  3. Worker node -- This is a node that hosts computing workloads such as user notebook instances and ML training jobs.

Prerequisites

Hardware

Minimum requirements:

  • Master and Worker: 2 cores, 4GB memory, 10GB storage
  • Provisioner: 1 core, 4GB memory, 25GB storage

OS

We currently support Ubuntu (>=16.04) as the host OS for Master and Worker. CentOS support is in progress.

On each node, make sure you can install system libs by executing

apt update
apt install <LIB_NAME>

If your nodes have public internet access, this works against the official apt repositories out of the box. If you are in a private network, configure a mirror apt repository and point your system at it by modifying /etc/apt/sources.list.
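For example, a minimal sketch of switching to a mirror (the mirror address apt-mirror.internal and the focal codename are placeholders; substitute your own):

# back up the original list, then point apt at the internal mirror
sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak
echo "deb http://apt-mirror.internal/ubuntu focal main restricted universe multiverse" | sudo tee /etc/apt/sources.list
sudo apt update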

We recommend using a Linux OS on the Provisioner. If you are using Windows, we recommend checking out WSL.

Network

We require the following ports to be accessible between any two Master and Worker nodes.

PROTOCOL  PORT       SOURCE                      DESCRIPTION
TCP       6443       K3s agent nodes             Kubernetes API Server
UDP       8472       K3s server and agent nodes  Required only for Flannel VXLAN
TCP       10250      K3s server and agent nodes  Kubelet metrics
TCP       2379-2380  K3s server nodes            Required only for HA with embedded etcd
TCP       80         ConvectHub                  Required to visit ConvectHub via HTTP; port is configurable
TCP       443        ConvectHub                  Required to visit ConvectHub via HTTPS; port is configurable
TCP       22         Provisioner                 Required by the Provisioner to reach Master and Worker via SSH

Internet access on Master and Worker is optional, but it is required on the Provisioner. The Provisioner must be able to reach Master and Worker through their intranet addresses, i.e., the Provisioner is connected to the same private network.
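To spot-check that a required port is reachable from another node, you can use a tool such as nc (a sketch; <NODE_IP> is a placeholder for the target node's intranet address):

# e.g., verify the Kubernetes API server port on the Master is reachable
nc -zv <NODE_IP> 6443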

Dependency libs and software

We require the following libs and software to be pre-installed on the Provisioner:

MACHINE      SOFTWARE/LIB                        INSTALLATION GUIDE
Provisioner  Docker                              Official guide
Provisioner  python3, python3-pip, python3-venv  Official guide
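If your Provisioner runs Ubuntu, one way to install the Python dependencies is via apt (a sketch; Docker itself is best installed by following its official guide):

sudo apt update
sudo apt install -y python3 python3-pip python3-venv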

Provision process

We assume all the following steps are performed on the Provisioner machine unless explicitly specified otherwise.

Check the network connectivity

On the Provisioner, first make sure you can reach the internet

ping -c 3 convect.ai
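It is also worth confirming that the Provisioner can reach the Master and Worker nodes over the private network (<MASTER_IP> and <WORKER_IP> are placeholders for your nodes' intranet addresses):

ping -c 3 <MASTER_IP>
ping -c 3 <WORKER_IP>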

Clone provision repo

TODO: need a better way to distribute it

$ git clone https://github.com/convect-ai/mlp.git

$ cd mlp/baremetal/airgap-k3s

Download artifacts

TODO: need a better way to distribute it

$ aws s3 sync s3://convect-mlp-assets/bin/ shared/bin/

It may take a while depending on your internet connection.
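The command above assumes the aws CLI is available on the Provisioner. If it is not, one option (an assumption, not part of the official steps) is to install it from the Ubuntu package repository:

# install the aws CLI if it is not already present
$ sudo apt install -y awscli
$ aws --version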

Set up ssh users and keys

Make sure you can SSH into Master and Worker from the Provisioner. We recommend using a passwordless login method and a single, unified user name on each machine (refer to this guide if you are not sure how to set it up).
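A minimal sketch of setting this up from the Provisioner (the user name ubuntu and the address <NODE_IP> are placeholders; repeat the last two commands for every Master and Worker node):

# generate a key pair on the Provisioner (skip if ~/.ssh/id_rsa already exists)
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
# copy the public key to the node, then verify passwordless login works
ssh-copy-id ubuntu@<NODE_IP>
ssh ubuntu@<NODE_IP> hostname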

For this user, we require sudo permission and recommend disabling the sudo password prompt by creating a sudoers file

echo "$USER ALL=(ALL:ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/$USER

Install ansible

We use Ansible to automate the provisioning process. To install it

# create a virtualenv 
$ python3 -m venv .venv 
$ source .venv/bin/activate

# upgrade pip if needed 
$ pip install --upgrade pip 

# install ansible
$ pip install ansible

# make sure the installation is successful
$ ansible --version

Prepare the inventory file

An inventory file declares the fleet of machines (their addresses and how to access them) and their roles in the deployment.

A sample file

worker2 ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2201 ansible_ssh_private_key_file='~/.ssh/id_rsa' my_iface=eth1 has_gpu=true
worker1 ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2200 ansible_ssh_private_key_file='~/.ssh/id_rsa' my_iface=eth1
master ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2222 ansible_ssh_private_key_file='~/.ssh/id_rsa' my_iface=eth1

[driver]
master

[worker]
worker1
worker2

[registry]
master

[all:vars]
convect_workdir=/tmp/convecthub
convect_token=k3s_hub_token
has_gpu=false
jhub_config_path=thu-vars/jupyterhub.yaml

The first line

worker2 ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2201 ansible_ssh_private_key_file='~/.ssh/id_rsa' my_iface=eth1 has_gpu=true

declares a machine called worker2. Some notable fields:

  • ansible_ssh_user - The user name to use when SSHing into the machine
  • ansible_ssh_host - The IP address or hostname to use when SSHing into the machine
  • ansible_ssh_port - The SSH port to use. Default = 22
  • ansible_ssh_private_key_file - The private key file to use when SSHing into the machine
  • my_iface - The name of the network interface on the machine that connects to the private network. Run ip a show to list all available network interfaces and their bound IP addresses if you need to figure out the name
  • has_gpu - Specifies whether the machine is equipped with a GPU. Default = false

[worker]
worker1
worker2

groups the machines worker1 and worker2 into the worker group. If you have more nodes to join the cluster as workers, put them here.

For the driver group, you usually need only one node unless HA is required.

[all:vars]
convect_workdir=/tmp/convecthub
convect_token=k3s_hub_token
has_gpu=false
jhub_config_path=thu-vars/jupyterhub.yaml

defines some common variables that rarely need to be changed.

To prepare an inventory file for your own environment, the following steps are recommended:

  1. Write down the IP addresses / hostnames of all nodes that will be part of the cluster
  2. Create a common user that can SSH into each node from the Provisioner using a private key
  3. For each node with multiple network interfaces, figure out the name of the interface that connects to the main private network
  4. For each node, write a line like worker2 ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2201 ansible_ssh_private_key_file='~/.ssh/id_rsa' my_iface=eth1 has_gpu=true, substituting ansible_ssh_user, ansible_ssh_host, ansible_ssh_port, ansible_ssh_private_key_file, my_iface, and has_gpu to reflect your environment.

Smoke test

Once you have your inventory file defined, say inv.toml, check that the configuration is correct by running

$ ansible all -i inv.toml -m ping 

You should be able to see all nodes returning an OK status.
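A successful check looks roughly like the following for each host (output abbreviated):

master | SUCCESS => {
    "changed": false,
    "ping": "pong"
}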

Start the provision process

If you have set up passwordless sudo for the user, simply run

ansible-playbook -i inv.toml playbook.yml

otherwise

ansible-playbook -i inv.toml playbook.yml --extra-vars ansible_sudo_pass=YOUR_PASSWORD

where YOUR_PASSWORD is the password required to execute sudo.

The provision process usually takes around 30 minutes depending on the number of nodes and your network condition.

Validate ConvectHub is working

Once the provisioning finishes, open a browser on the Provisioner and visit <MASTER_IP>, where <MASTER_IP> is the private IP address of the Master node, and verify that you can see the login page.
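You can also check from the command line on the Provisioner; an HTTP 200 or 30x response indicates the hub is serving (a sketch, using the same <MASTER_IP> placeholder):

curl -I http://<MASTER_IP>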

Configure the provisioning process

Nvidia driver

If you have already installed the NVIDIA driver on your nodes, set install_nvidia_driver=false on a machine declaration in the inventory file to skip installing the driver automatically.

Docker

If you have already configured docker on your nodes, set install_docker=false on a machine declaration in the inventory file to skip installing docker automatically.

If you have a private network setting and cannot reach the official Docker repository, set docker_repo to point to your mirror, e.g., https://download.docker.com/linux/ubuntu
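For example, these overrides might look like the following in the inventory file (a sketch reusing the sample inventory's placeholder values; where exactly docker_repo is set, per machine or under [all:vars], depends on your setup -- here it is assumed to be a global variable):

worker2 ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2201 ansible_ssh_private_key_file='~/.ssh/id_rsa' my_iface=eth1 has_gpu=true install_nvidia_driver=false install_docker=false

[all:vars]
docker_repo=https://download.docker.com/linux/ubuntu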

Change ports to serve ConvectHub

If you would like to serve ConvectHub on ports other than the defaults (80 and 443), set http_port=<YOUR_PORT_NUMBER> and https_port=<YOUR_PORT_NUMBER> under the [all:vars] section of the inventory file. E.g.,

[all:vars]
convect_workdir=/tmp/convecthub
convect_token=k3s_hub_token
has_gpu=false
jhub_config_path=thu-vars/jupyterhub.yaml
http_port=9000
https_port=9443

will serve the http and https traffic via <MASTER_IP>:9000 and <MASTER_IP>:9443 respectively.

Uninstall the deployments

If you would like to uninstall the deployments from your servers, just run

ansible-playbook -i inv.toml reset.yml