On-prem deployment
Introduction
This guide walks you through deploying ConvectHub on a cluster of on-prem machines.
There are three types of machines involved in the provisioning process:
- Provisioner -- Not a computing node, but the machine an admin uses to trigger and control the provisioning process.
- Master node -- A node that hosts mission-critical services such as ingress, the cluster database, device plugins, etc. In an HA setting, there can be multiple master nodes.
- Worker node -- A node that hosts computing workloads such as user notebook instances and ML training jobs.
Prerequisites
Hardware
Minimum requirements:
- 2 cores, 4GB memory, 10GB storage space for Master and Worker.
- 1 core, 4GB memory, 25GB storage for Provisioner.
OS
We currently support Ubuntu (>=16.04) as the host OS for Master and Worker. CentOS support is in progress.
On each node, make sure you can install system packages by executing
apt update
apt install <LIB_NAME>
If your nodes have public internet access, these commands fetch packages from the official apt repo.
If you are in a private network setting, configure a mirror apt repo and point your system to it by modifying /etc/apt/sources.list.
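As an illustration, a minimal /etc/apt/sources.list pointing to an internal mirror might look like the lines below; the host apt-mirror.internal and the release name focal are placeholders to replace with your own mirror address and Ubuntu codename.
deb http://apt-mirror.internal/ubuntu focal main restricted universe multiverse
deb http://apt-mirror.internal/ubuntu focal-updates main restricted universe multiverse
deb http://apt-mirror.internal/ubuntu focal-security main restricted universe multiverse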
We recommend using a Linux OS on Provisioner. If you are using Windows, we recommend checking out WSL.
Network
We require the following ports to be reachable between any two Master and Worker nodes.
PROTOCOL | PORT | SOURCE | DESCRIPTION |
---|---|---|---|
TCP | 6443 | K3s agent nodes | Kubernetes API Server |
UDP | 8472 | K3s server and agent nodes | Required only for Flannel VXLAN |
TCP | 10250 | K3s server and agent nodes | Kubelet metrics |
TCP | 2379-2380 | K3s server nodes | Required only for HA with embedded etcd |
TCP | 80 | ConvectHub | Required to visit ConvectHub via HTTP. Port is configurable |
TCP | 443 | ConvectHub | Required to visit ConvectHub via HTTPS. Port is configurable |
TCP | 22 | Provisioner | Required by Provisioner to visit Master and Worker via SSH |
Internet access is optional on Master and Worker, but required on Provisioner.
We require Provisioner to be able to reach Master and Worker through their intranet addresses, i.e., Provisioner must be connected to the private network.
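To quickly verify port reachability between nodes, you can probe a port with netcat; the address 10.0.0.10 below is a placeholder for one of your node IPs.
# from a Worker node, check the Kubernetes API server port on the Master
nc -zv 10.0.0.10 6443
# from the Provisioner, check that SSH is reachable on a node
nc -zv 10.0.0.10 22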
Dependency libs and software
We require the following libs and software to be pre-configured on Provisioner:
MACHINE | Software/lib name | Installation guide |
---|---|---|
Provisioner | Docker | Official guide |
Provisioner | python3, python3-pip, python3-venv | Official guide |
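A quick way to confirm these are available on Provisioner (the reported versions will depend on your installation):
docker --version
python3 --version
python3 -m pip --version
python3 -m venv --help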
Provision process
We assume all of the following steps are performed on the Provisioner machine unless explicitly specified otherwise.
Check the network connectivity
On Provisioner, first make sure you can reach the internet:
ping -c 3 convect.ai
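If your environment blocks ICMP, an HTTP-level check works as well; the second URL is only relevant if you plan to pull Docker packages from the official repository.
curl -I https://convect.ai
curl -I https://download.docker.com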
Clone provision repo
TODO: need a better way to distribute it
$ git clone https://github.com/convect-ai/mlp.git
$ cd mlp/baremetal/airgap-k3s
Download artifacts
TODO: need a better way to distribute it
$ aws s3 sync s3://convect-mlp-assets/bin/ shared/bin/
It may take a while depending on your internet connection.
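You can confirm the artifacts landed by listing the target directory; the exact contents depend on the release you synced.
ls -lh shared/bin/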
Set up ssh users and keys
Make sure you can ssh into Master and Worker from Provisioner.
We recommend using a passwordless (key-based) login method and the same user name on every machine (refer to this guide if you are not sure how to do it).
For that user, we require sudo permission and recommend disabling the password prompt by creating a sudoers file:
echo "$USER ALL=(ALL:ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/$USER
Install ansible
We use ansible to automate the provisioning process. To install it:
# create a virtualenv
$ python3 -m venv .venv
$ source .venv/bin/activate
# upgrade pip if needed
$ pip install --upgrade pip
# install ansible
$ pip install ansible
# make sure the installation is successful
$ ansible --version
Prepare the inventory file
An inventory file declares the fleet of machines (their addresses and how to access them) and their roles in the deployment.
A sample file
worker2 ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2201 ansible_ssh_private_key_file='~/.ssh/id_rsa' my_iface=eth1 has_gpu=true
worker1 ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2200 ansible_ssh_private_key_file='~/.ssh/id_rsa' my_iface=eth1
master ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2222 ansible_ssh_private_key_file='~/.ssh/id_rsa' my_iface=eth1
[driver]
master
[worker]
worker1
worker2
[registry]
master
[all:vars]
convect_workdir=/tmp/convecthub
convect_token=k3s_hub_token
has_gpu=false
jhub_config_path=thu-vars/jupyterhub.yaml
The first line
worker2 ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2201 ansible_ssh_private_key_file='~/.ssh/id_rsa' my_iface=eth1 has_gpu=true
declares a machine called worker2. Some notable fields:
- ansible_ssh_user - The user name to use when ssh-ing into the machine
- ansible_ssh_host - The IP address or host name to use when ssh-ing into the machine
- ansible_ssh_port - The port to use when ssh-ing into the machine. Default = 22
- ansible_ssh_private_key_file - The private key to use when ssh-ing into the machine
- my_iface - The network interface name on the machine that is connected to the private network. You can run ip a show to list all available network interfaces and their bound IP addresses on a machine to figure out the name
- has_gpu - Specifies whether the machine is equipped with a GPU. Default = false
[worker]
worker1
worker2
groups the machines worker1 and worker2 into the worker group. If you have more nodes to join the cluster as workers, put them here.
For the driver group, you usually need only one node if HA is not a concern.
[all:vars]
convect_workdir=/tmp/convecthub
convect_token=k3s_hub_token
has_gpu=false
jhub_config_path=thu-vars/jupyterhub.yaml
defines some common variables that rarely need to change.
To prepare an inventory file for your own environment, the following steps are recommended (a filled-in example follows this list):
- Write down the IP addresses / hostnames of all nodes that will be part of the cluster
- Create a common user that can ssh into each node using a private key from Provisioner
- Figure out the network interface name on each node that connects to the main private network, if it has multiple interfaces
- For each node, define a line like
worker2 ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2201 ansible_ssh_private_key_file='~/.ssh/id_rsa' my_iface=eth1 has_gpu=true
by substituting ansible_ssh_user, ansible_ssh_host, ansible_ssh_port, ansible_ssh_private_key_file, my_iface, has_gpu to reflect your environment.
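For illustration, a hypothetical environment with one master at 10.0.0.10, two workers at 10.0.0.11-12, a shared user ubuntu, and interface ens160 might produce an inventory like the one below; all addresses, user names, key paths, and interface names are placeholders.
master ansible_ssh_user=ubuntu ansible_ssh_host=10.0.0.10 ansible_ssh_private_key_file='~/.ssh/id_ed25519' my_iface=ens160
worker1 ansible_ssh_user=ubuntu ansible_ssh_host=10.0.0.11 ansible_ssh_private_key_file='~/.ssh/id_ed25519' my_iface=ens160
worker2 ansible_ssh_user=ubuntu ansible_ssh_host=10.0.0.12 ansible_ssh_private_key_file='~/.ssh/id_ed25519' my_iface=ens160 has_gpu=true
[driver]
master
[worker]
worker1
worker2
[registry]
master
[all:vars]
convect_workdir=/tmp/convecthub
convect_token=k3s_hub_token
has_gpu=false
jhub_config_path=thu-vars/jupyterhub.yaml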
Smoke test
Once you have your inventory file defined, say inv.toml, you can check whether the configuration is good by running
$ ansible all -i inv.toml -m ping
You should see every node return a successful response.
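To also confirm that privilege escalation works before running the full playbook, you can issue an ad-hoc command with --become; assuming the passwordless sudo setup described earlier, each node should report root.
$ ansible all -i inv.toml -m command -a "whoami" --become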
Start the provision process
If you have set up passwordless sudo permission for the user, then simply run
ansible-playbook -i inv.toml playbook.yml
otherwise
ansible-playbook -i inv.toml playbook.yml --extra-vars ansible_sudo_pass=YOUR_PASSWORD
where YOUR_PASSWORD is the password required to execute sudo.
The provisioning process usually takes around 30 minutes, depending on the number of nodes and your network conditions.
Validate ConvectHub is working
Once finished, on Provisioner, open a browser and visit <MASTER_IP>, where <MASTER_IP> is the private IP address of the Master node, and verify that you can see the login page.
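If Provisioner has no graphical browser, a command-line check works too; replace <MASTER_IP> with your master's private IP. A 200 response or a redirect to the login page indicates the hub is serving (-k skips certificate verification in case a self-signed certificate is used).
curl -kI http://<MASTER_IP>/
curl -kI https://<MASTER_IP>/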
Configure the provisioning process
Nvidia driver
If you have already installed the NVIDIA driver on your nodes, set install_nvidia_driver=false on a machine declaration in the inventory file to skip installing the driver automatically.
Docker
If you have already configured Docker on your nodes, set install_docker=false on a machine declaration in the inventory file to skip installing Docker automatically.
If you are in a private network setting and cannot reach the official Docker repo, set docker_repo to point to your mirror, e.g., https://download.docker.com/linux/ubuntu
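For example, a worker declaration that skips both installations and points Docker at a hypothetical internal mirror might look like the line below; the host, user, interface, and mirror URL are placeholders, and whether docker_repo lives on a host line or under [all:vars] depends on your setup.
worker1 ansible_ssh_user=ubuntu ansible_ssh_host=10.0.0.11 my_iface=ens160 install_nvidia_driver=false install_docker=false docker_repo=https://mirror.internal/docker/linux/ubuntu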
Change ports to serve ConvectHub
If you would like to serve ConvectHub on non-default ports (i.e., something other than 80 and 443), set http_port=<YOUR_PORT_NUMBER> and https_port=<YOUR_PORT_NUMBER> under the [all:vars] section in the inventory file. E.g.,
[all:vars]
convect_workdir=/tmp/convecthub
convect_token=k3s_hub_token
has_gpu=false
jhub_config_path=thu-vars/jupyterhub.yaml
http_port=9000
https_port=9443
will serve HTTP and HTTPS traffic via <MASTER_IP>:9000 and <MASTER_IP>:9443 respectively.
Uninstall the deployments
If you would like to uninstall the deployments from your servers, just run
ansible-playbook -i inv.toml reset.yml