Revamped for multi-demo & minimal downtime #14

Merged: 24 commits merged into main from the multi-demo branch on Sep 9, 2024
Conversation


@rgaudin (Member) commented Aug 15, 2024

Ended up squashing it all, as there was too much back-and-forth in the non-trivial commits and none would work independently.

Fixes #10 (New image rollout generates down time)

Significant changes

  • multi-proxy approach with a single TLS handler; reverses using HTTP; needs all subdomains
  • custom homepage
  • can be configured from the repo
  • config watcher
  • update watcher improved: undeploys existing-but-gone demos, starts missing ones, redeploys updated ones
  • authenticates to imager-service API to use private images as well
  • uses python3.11 (upgraded machine to bookworm)
  • exposes captive portal UIs on ports 1080, 2080, etc
  • defaults to all files under /data/demo/{compose,images,data}/
  • per-deployment subfolders
  • /var/logs/demo maintained as a tmpfs
  • new Deployment type
  • download to a temp file to reduce downtime. Actual replacement (with downtime) thus lasts seconds
  • download without RPC/daemon mode to simplify management
  • download without capturing output so one can follow progress in logs
  • new --reuse-image param for deploy script
  • new --force-prepare param for deploy script (and --force for prepare one)
  • post-prepare calls multi-proxy for regen
  • no use for IS_ONLINE_DEMO trick on dashboard (will be removed)
  • multi-proxy is a single docker-run systemd unit (not compose) since it's a single service
  • demo-watcher calls two scripts to ease dependencies and use the handy systemd timer
  • one maint-compose per demo
  • new undeploy script (with --keep param) for debug

Approach

reverse-proxy

The previous version was quite simple: we had a single compose running at a time, which
already contained a reverse-proxy to the services. We just had to expose this
reverse-proxy's ports (it actually came exposed) and tweak its config
to use ACME certificates instead of the Caddy-internal ones for HTTPS.

The multi-demo approach is quite different. First, we need a webserver (:80, :443)
to serve our homepage (list of demos).

This means we cannot expose our hotspot-reverse-proxy anymore. We thus have to use that
homepage server to reverse-proxy to the hotspot reverse-proxy.

That's easy and serves the demo homepage (xxx.demo.hotspot.kiwix.org) well, but it's not
enough, as wildcard certificates only work one level deep: a cert for *.demo.hotspot.kiwix.org works for xxx.demo.hotspot.kiwix.org but not for yyy.xxx.demo.hotspot.kiwix.org.

The solution is thus to have the main webserver (the multi-proxy) know every demo domain and every subdomain of each, and manage all certificates.
Caddy's auto_https is a great feature, but its magic makes it difficult to configure such scenarios.
Edit those Caddyfiles with caution 😅

We thus have two scripts in the multi-proxy container: gen-server and caddy-reload.
Those are called externally to rewrite the Caddyfile (and the static homepage) and reload Caddy.
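
For illustration, a minimal Python sketch of what a gen-server-style script could look like. The demo list, the sub-service subdomains, the upstream address, and the file paths are all assumptions, not the actual implementation:

```python
#!/usr/bin/env python3
"""Hypothetical gen-server sketch: write one explicit site block per demo
domain and per sub-service subdomain (so Caddy gets an individual
certificate for each), then gracefully reload Caddy."""
import pathlib
import subprocess

BASE = "demo.hotspot.kiwix.org"
CADDYFILE = pathlib.Path("/etc/caddy/Caddyfile")  # path is an assumption
DEMOS = [{"alias": "xxx", "http_port": 3080}]     # would come from config
SUB_SERVICES = ["kiwix", "edupi"]                 # hypothetical subdomains

def render() -> str:
    blocks = []
    for demo in DEMOS:
        hosts = [f"{demo['alias']}.{BASE}"] + [
            f"{sub}.{demo['alias']}.{BASE}" for sub in SUB_SERVICES
        ]
        # reverse to the hotspot-proxy HTTP port opened on the host;
        # 172.17.0.1 (default docker bridge gateway) is an assumption
        blocks.append(
            f"{', '.join(hosts)} {{\n"
            f"    reverse_proxy 172.17.0.1:{demo['http_port']}\n"
            "}\n"
        )
    return "\n".join(blocks)

if __name__ == "__main__":
    CADDYFILE.write_text(render())
    # caddy-reload equivalent: zero-downtime config reload
    subprocess.run(["caddy", "reload", "--config", str(CADDYFILE)], check=True)
```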

External human-friendly config

Borrowing demo.library's approach of a human-friendly YAML config file, one can update the list of demos by editing demos.yaml.

  • ident is the auto-image identifier.
  • alias is an optional custom subdomain (ident is used otherwise)
  • name is an optional name to use on the homepage (ident otherwise)

All tools continue to work off the /etc/demo/environment file.

A new script config-watcher is used to query the YAML file and directly update /etc/demo/environment.
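
As a rough sketch (assuming PyYAML, and guessing both the top-level YAML layout and the serialization format of OFFSPOT_DEMOS_LIST), config-watcher boils down to:

```python
"""Hypothetical config-watcher core: demos.yaml -> /etc/demo/environment."""
import pathlib

import yaml

yaml_path = pathlib.Path("demos.yaml")
env_path = pathlib.Path("/etc/demo/environment")

doc = yaml.safe_load(yaml_path.read_text())
demos = doc["demos"] if isinstance(doc, dict) else doc  # layout is a guess

entries = []
for demo in demos:
    ident = demo["ident"]                # auto-image identifier (required)
    alias = demo.get("alias") or ident   # custom subdomain, defaults to ident
    name = demo.get("name") or ident     # homepage label, defaults to ident
    entries.append(f"{ident}|{alias}|{name}")

# single environ holding all demos (OFFSPOT_DEMOS_LIST); format is a guess
env_path.write_text(f'OFFSPOT_DEMOS_LIST="{",".join(entries)}"\n')
```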

Deployment Type

A dataclass holding all information about a demo/deployment. It replaces all the deployment-specific environ variables.

The list of demos with key infos is still stored in a single environ (OFFSPOT_DEMOS_LIST). That's a bit ugly, especially because it looks half-baked, but I wanted to keep as much of the old code as possible and it works well for now.

Because the multi-proxy reverses for the hotspot-proxy, it needs to access the hotspot-proxy via HTTP.
As each demo lives inside its own compose, a docker network cannot be used to communicate from the multi-proxy container to the hotspot-proxy container.
We are thus opening the HTTP port of the hotspot-proxy on the host directly and using it for the reverse.

It's not a security issue: we are not reversing to prevent direct access but to share a single TLS handler. Users can't use those ports directly anyway, as the hotspot-proxy serves its FQDN only.

Anyway, this means using different ports for each demo. We thus keep an index on the deployment that we use to set the ports, so we get :3080, :4080, etc.
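
The exact formula isn't spelled out here, but a derivation consistent with the :3080/:4080 examples (and the captive-portal offset described below) would be:

```python
def ports_for(index: int) -> tuple[int, int]:
    """Guess at the index-to-ports mapping, consistent with the examples
    above: index 0 -> 3080, index 1 -> 4080, and the captive UI at +1."""
    http_port = (index + 3) * 1000 + 80
    captive_port = http_port + 1
    return http_port, captive_port

assert ports_for(0) == (3080, 3081)
assert ports_for(1) == (4080, 4081)
```

Note that this index-based scheme is what the Known issues section below flags as order-sensitive; follow-up commits switch to ident-based ports.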

update-watcher

New name of the watcher script. It changed a lot because we now have to track multiple deployments; a sketch of the reconciliation loop follows the list.

  • removes (undeploys) demos that are not in config anymore
  • deploys those that are new to the config
  • updates (as before) the ones with updated image
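
In essence (names are hypothetical stubs for the real deploy/undeploy scripts):

```python
def undeploy(ident: str) -> None:
    print(f"undeploying {ident}")          # stub for the real undeploy script

def deploy(ident: str, image: str) -> None:
    print(f"deploying {ident} ({image})")  # stub for the real deploy script

def redeploy(ident: str, image: str) -> None:
    undeploy(ident)
    deploy(ident, image)

def reconcile(configured: dict[str, str], deployed: dict[str, str]) -> None:
    """Both maps go from demo ident to image version/checksum."""
    for ident in set(deployed) - set(configured):
        undeploy(ident)                          # not in config anymore
    for ident in set(configured) - set(deployed):
        deploy(ident, configured[ident])         # new to the config
    for ident in set(configured) & set(deployed):
        if configured[ident] != deployed[ident]:
            redeploy(ident, configured[ident])   # image was updated
```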

An important new feature is that we make authenticated calls to the API so we can use private images as well.
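
Something along these lines (the base URL, endpoint paths, and token scheme are my assumptions about the imager-service API, not confirmed by this PR):

```python
"""Hypothetical sketch of an authenticated imager-service call."""
import os

import requests

API_URL = "https://api.imager.kiwix.org"  # assumed base URL

def get_token(username: str, password: str) -> str:
    resp = requests.post(
        f"{API_URL}/auth/authorize",  # assumed endpoint
        headers={"username": username, "password": password},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

token = get_token(os.environ["API_USERNAME"], os.environ["API_PASSWORD"])
# fetch metadata (incl. download URL) for a private auto-image
resp = requests.get(
    f"{API_URL}/auto-images/demo/json",  # assumed endpoint
    headers={"Authorization": f"Token {token}"},
    timeout=30,
)
resp.raise_for_status()
```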

All of this is sequential. Nothing prevents updating images in parallel, but that would be hell to debug from the logs in case of issue. At the moment, sequential seems like the best choice.

Note that we reconfigure the multi-proxy as soon as we enter the update script so the homepage is updated quickly. This means it can show links to stuff that is expected but not ready yet. That's what we want.

captive-portals

The captive portal UI is set on reverse_port + 1, so if a demo is at :3080, its captive portal can be seen at [*].demo.hotspot.kiwix.org:3081

If there's any interest in that, we could reverse it (_captive.xxx.demo… ?)

Lower downtimes

The previous code assumed a shortage of disk space and thus switched to maintenance mode as soon as an update was detected: the deployment was removed first, then the new image was downloaded and deployed.

This one assumes there's enough disk space, so on image update it downloads to a tmp file and, only once the download is complete, switches to maintenance mode to undeploy, move the file, and deploy.
This usually takes a few seconds.
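
The flow, sketched with hypothetical helper names:

```python
"""Sketch of the low-downtime image swap; helpers are placeholders."""
from pathlib import Path

def download_to(dest: Path) -> None:
    dest.write_bytes(b"")  # placeholder: real code runs aria2 (next sketch)

def enter_maintenance_mode() -> None: ...
def undeploy() -> None: ...
def deploy() -> None: ...

def update_image(image: Path) -> None:
    tmp = image.with_suffix(image.suffix + ".tmp")
    download_to(tmp)          # long operation; the old demo keeps serving
    enter_maintenance_mode()  # downtime starts only here
    undeploy()
    tmp.replace(image)        # atomic rename on the same filesystem
    deploy()                  # downtime over after a few seconds
```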

To simplify the download part, I removed the use of RPC/daemon mode in the aria2 call. As that was only there to watch download progress, I've enabled aria2 output so progress can be followed in the logs.
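
That is, a plain subprocess call with stdout/stderr left attached (the flags shown are standard aria2c options; the exact invocation in the PR may differ):

```python
import subprocess

def download_with_aria2(url: str, folder: str, filename: str) -> None:
    """Plain aria2c run: no --enable-rpc/daemon mode, and output is not
    captured, so progress lines land in the service logs."""
    subprocess.run(
        ["aria2c", "--dir", folder, "--out", filename, url],
        check=True,  # raise on download failure
    )
```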

systemd

The previous version relied on systemd to start/stop the demo (the hotspot compose). With multi-demo and its changing number of deployments/composes, that felt like a burden with limited value.

Actually, there was a big flaw in the previous version: the demo would not recover from a restart, as the watcher would be happy to see the latest image on disk, but the compose would not be there since it resides on the image itself… which would not have been mounted.

This version keeps systemd for the multi-proxy (a single container started with docker run) and the watchers, but there is no start script anymore for the demo itself. The updated update-watcher now takes care of deploying (and thus mounting) whatever is not deployed.

The main difference is that we now start the compose in daemon mode, which makes it more difficult to assess whether it's working or not. A new function checks that there is at least one container and no dead containers, but it doesn't account for pulling/building images.
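
A sketch of such a check, assuming `docker compose ps --format json` (Compose v2; recent versions emit one JSON object per line):

```python
import json
import subprocess

def is_healthy(compose_file: str) -> bool:
    """At least one container and none dead, as described above (a later
    commit tightens this to running containers only)."""
    proc = subprocess.run(
        ["docker", "compose", "-f", compose_file, "ps", "--all",
         "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    lines = [line for line in proc.stdout.splitlines() if line.strip()]
    containers = [json.loads(line) for line in lines]
    if not containers:
        return False
    return all(c.get("State") != "dead" for c in containers)
```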

Known issues

  • Changing the order of deployments in demos.yaml is risky: already-running demos are configured with certain ports, while the multi-proxy reconfiguration on change expects ports based on order. Should be improved. (Resolved: ports are now deterministic.)
  • When accessing kiwix-serve inside a demo and clicking the home link, the user ends up at the right place but on http://. That's because of the reverse to HTTP. It could be resolved by trusting the upstream server in the hotspot-proxy's Caddy or by enabling http-to-https redirection on the reverse proxy.
  • START_DURATION is set long to accommodate image pulling/building, resulting in longer downtime (2+ min) even though it's not necessary most of the time. Improving the checker, or simply pulling/building before starting the compose, would reduce this to a few seconds. (Resolved: we now pull/build before start, with a 15s start duration.)
  • [new] Added a redirection from http://captive.xxx.demo.hotspot.kiwix.org to the captive UI for each.

@rgaudin self-assigned this on Aug 15, 2024
HTTP ports are now ident-based, hence allowing order changes in config.
The HTTPS port has been removed as it was not used.
A captive_http_port is set based on the HTTP one, and a _captive. endpoint is created.
- demo_start is run after pull and build so it starts faster (container creation + startup)
- is_healthy checks only for running containers, as `created` would show up on a previously deployed but not-running maint
- that's enough for what it's used for
- lighttpd is throwing 400 on reverse for some reason
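
For illustration, an order-independent derivation could hash the ident into a port slot (this is a guess at the approach, not the PR's actual mapping; collision handling is omitted):

```python
import hashlib

def http_port_for(ident: str, base: int = 3080, slots: int = 50) -> int:
    """Stable port for a demo ident, unaffected by its position in
    demos.yaml. Collisions are ignored in this sketch."""
    slot = hashlib.sha256(ident.encode()).digest()[0] % slots
    return base + slot * 1000

def captive_http_port_for(ident: str) -> int:
    return http_port_for(ident) + 1  # served via the _captive. endpoint
```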
@rgaudin requested a review from @benoit74 on August 17, 2024

@benoit74 (Contributor) left a comment


This is HUGE! Well done! I've done my best to review it, but I probably missed a few things.

I think the README misses a section regarding how to add or remove a demo (update demos.yaml and add/remove the asset image?)

I'm not sure hosting the demos.yaml and the homepage assets in this repo is the proper place. It is obviously way simpler, but it should rather be hosted in kiwix/operations from my PoV (or offspot/operations). Maybe not something to do now, but worth mentioning in the #Next readme section.

install.sh is probably not needed anymore; I would remove it from the repo (or do we still need it to install aria2?)

why did you remove setup.py and the demo-setup script from pyproject.toml? The README still mentions it needs to be run for installation; maybe you would prefer to install the systemd services manually?

I think that `enable_portal` in `Deployment` and `index` in `Deployment` are not used anymore; they should be dropped.

src/offspot_demo/deploy.py (review thread: outdated, resolved)

@rgaudin (Member, Author) commented Sep 9, 2024

> I think the README misses a section regarding how to add or remove a demo (update demos.yaml and add/remove the asset image?)

Added

> I'm not sure hosting the demos.yaml and the homepage assets in this repo is the proper place. It is obviously way simpler, but it should rather be hosted in kiwix/operations from my PoV (or offspot/operations). Maybe not something to do now, but worth mentioning in the #Next readme section.

Switched to https://github.com/kiwix/operations/blob/main/demos/demo.offspot.yaml

> install.sh is probably not needed anymore; I would remove it from the repo (or do we still need it to install aria2?)

Removed. Yes, aria2 is still required; I put the couple of install instructions in the README.

> why did you remove setup.py and the demo-setup script from pyproject.toml? The README still mentions it needs to be run for installation; maybe you would prefer to install the systemd services manually?

Yes, it's in the README. I forgot to remove the reference to the script. Done.

> I think that `enable_portal` in `Deployment` and `index` in `Deployment` are not used anymore; they should be dropped.

Good catch; removed.

@benoit74 (Contributor) left a comment


LGTM

@rgaudin merged commit 6505df8 into main on Sep 9, 2024
3 checks passed
@rgaudin deleted the multi-demo branch on September 9, 2024 at 11:59