Revamped for multi-demo & minimal downtime #14

Merged: 24 commits merged into main from the multi-demo branch on Sep 9, 2024
Conversation


@rgaudin (Member) commented Aug 15, 2024

Ended up squashing it all, as there was too much back-and-forth in the non-trivial commits and none would work independently.

Fixes #10 (New image rollout generates down time)

Significant changes

  • multi-proxy approach with a single TLS handler; reverses using HTTP; needs all subdomains
  • custom homepage
  • can be configured from the repo
  • config watcher
  • update watcher improved: undeploys existing-but-gone demos, starts missing ones, redeploys updated ones
  • authenticates to imager-service API to use private images as well
  • uses python3.11 (upgraded machine to bookworm)
  • exposes captive portal UIs on ports 1080, 2080, etc
  • defaults to all files under /data/demo/{compose,images,data}/
  • per-deployment subfolders
  • /var/logs/demo maintained as a tmpfs
  • new Deployment type
  • download to a temp file to reduce downtime. Actual replacement (with downtime) thus lasts seconds
  • download without RPC/daemon mode to simplify management
  • download without capturing output so one can follow progress in logs
  • new --reuse-image param for deploy script
  • new --force-prepare param for deploy script (and --force for prepare one)
  • post-prepare calls multi-proxy for regen
  • no use for IS_ONLINE_DEMO trick on dashboard (will be removed)
  • multi-proxy is a single docker-run systemd unit (not compose) since it's a single service
  • demo-watcher calls two scripts to ease dependencies and use the handy systemd timer
  • one maint-compose per demo
  • new undeploy script (with --keep param) for debug

Approach

reverse-proxy

The previous version was quite simple: we had a single compose running at a time, which
already contained a reverse-proxy to the services. We just had to expose this
reverse-proxy's ports (it actually came exposed) and tweak its config
to use ACME certificates instead of the Caddy-internal ones for HTTPS.

The multi-demo approach is quite different. First, we need a webserver (:80, :443)
to serve our homepage (list of demos).

This means we cannot expose our hotspot-reverse-proxy anymore. We thus have to use that
homepage server to reverse-proxy to the hotspot reverse-proxy.

That's easy and serves the demo homepage (xxx.demo.hotspot.kiwix.org) well, but it's not
enough, as wildcard certificates only work one level deep: a cert for *.demo.hotspot.kiwix.org works for xxx.demo.hotspot.kiwix.org but not for yyy.xxx.demo.hotspot.kiwix.org.

The solution is thus to have the main webserver (the multi-proxy) know every demo domain and every subdomain of each, and manage all certificates.
Caddy's auto_https is a great feature, but its magic makes it difficult to configure such scenarios.
Edit those Caddyfiles with caution 😅

We thus have two scripts in the multi-proxy container: gen-server and caddy-reload.
Those are called externally to rewrite the Caddyfile (and the static homepage) and reload Caddy.
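
For illustration, a minimal Python sketch of what a gen-server-style script could look like. The demo list, the sub-service subdomains, the upstream address, and the file paths are all assumptions, not the actual implementation:

```python
#!/usr/bin/env python3
"""Hypothetical gen-server sketch: write one explicit site block per demo
domain and per sub-service subdomain (so Caddy gets an individual
certificate for each), then gracefully reload Caddy."""
import pathlib
import subprocess

BASE = "demo.hotspot.kiwix.org"
CADDYFILE = pathlib.Path("/etc/caddy/Caddyfile")  # path is an assumption
DEMOS = [{"alias": "xxx", "http_port": 3080}]     # would come from config
SUB_SERVICES = ["kiwix", "edupi"]                 # hypothetical subdomains

def render() -> str:
    blocks = []
    for demo in DEMOS:
        hosts = [f"{demo['alias']}.{BASE}"] + [
            f"{sub}.{demo['alias']}.{BASE}" for sub in SUB_SERVICES
        ]
        # reverse to the hotspot-proxy HTTP port opened on the host;
        # 172.17.0.1 (default docker bridge gateway) is an assumption
        blocks.append(
            f"{', '.join(hosts)} {{\n"
            f"    reverse_proxy 172.17.0.1:{demo['http_port']}\n"
            "}\n"
        )
    return "\n".join(blocks)

if __name__ == "__main__":
    CADDYFILE.write_text(render())
    # caddy-reload equivalent: zero-downtime config reload
    subprocess.run(["caddy", "reload", "--config", str(CADDYFILE)], check=True)
```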

External human-friendly config

Borrowing demo.library's approach of a human-friendly YAML config file, one can update the list of demos by editing demos.yaml.

  • ident is the auto-image identifier.
  • alias is an optional custom subdomain (ident is used otherwise)
  • name is an optional name to use on the homepage (ident otherwise)

All tools continue to work off the /etc/demo/environment file.

A new script config-watcher is used to query the YAML file and directly update /etc/demo/environment.
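
As a rough sketch (assuming PyYAML, and guessing both the top-level YAML layout and the serialization format of OFFSPOT_DEMOS_LIST), config-watcher boils down to:

```python
"""Hypothetical config-watcher core: demos.yaml -> /etc/demo/environment."""
import pathlib

import yaml

yaml_path = pathlib.Path("demos.yaml")
env_path = pathlib.Path("/etc/demo/environment")

doc = yaml.safe_load(yaml_path.read_text())
demos = doc["demos"] if isinstance(doc, dict) else doc  # layout is a guess

entries = []
for demo in demos:
    ident = demo["ident"]                # auto-image identifier (required)
    alias = demo.get("alias") or ident   # custom subdomain, defaults to ident
    name = demo.get("name") or ident     # homepage label, defaults to ident
    entries.append(f"{ident}|{alias}|{name}")

# single environ holding all demos (OFFSPOT_DEMOS_LIST); format is a guess
env_path.write_text(f'OFFSPOT_DEMOS_LIST="{",".join(entries)}"\n')
```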

Deployment Type

A dataclass holding all information about a demo/deployment. It replaces all the deployment-specific environ variables.

The list of demos with key infos is still stored in a single environ (OFFSPOT_DEMOS_LIST). That's a bit ugly, especially because it looks half-baked, but I wanted to keep as much of the old code as possible and it works well for now.

Because the multi-proxy reverses for the hotspot-proxy, it needs to access the hotspot-proxy via HTTP.
As each demo lives inside its own compose, a docker network cannot be used to communicate from the multi-proxy container to the hotspot-proxy container.
We are thus opening the HTTP port of the hotspot-proxy on the host directly and using it for the reverse.

It's not a security issue: we are not reversing to prevent direct access but to share a single TLS handler. Users can't use those ports directly anyway, as the hotspot-proxy serves its FQDN only.

Anyway, this means using different ports for each demo. We thus keep an index on the deployment that we use to set the ports, so we get :3080, :4080, etc.
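
The exact formula isn't spelled out here, but a derivation consistent with the :3080/:4080 examples (and the captive-portal offset described below) would be:

```python
def ports_for(index: int) -> tuple[int, int]:
    """Guess at the index-to-ports mapping, consistent with the examples
    above: index 0 -> 3080, index 1 -> 4080, and the captive UI at +1."""
    http_port = (index + 3) * 1000 + 80
    captive_port = http_port + 1
    return http_port, captive_port

assert ports_for(0) == (3080, 3081)
assert ports_for(1) == (4080, 4081)
```

Note that this index-based scheme is what the Known issues section below flags as order-sensitive; follow-up commits switch to ident-based ports.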

update-watcher

New name of the watcher script. It changed a lot because we now have to track multiple deployments; a sketch of the reconciliation loop follows the list.

  • removes (undeploys) demos that are not in config anymore
  • deploys those that are new to the config
  • updates (as before) the ones with updated image
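
In essence (names are hypothetical stubs for the real deploy/undeploy scripts):

```python
def undeploy(ident: str) -> None:
    print(f"undeploying {ident}")          # stub for the real undeploy script

def deploy(ident: str, image: str) -> None:
    print(f"deploying {ident} ({image})")  # stub for the real deploy script

def redeploy(ident: str, image: str) -> None:
    undeploy(ident)
    deploy(ident, image)

def reconcile(configured: dict[str, str], deployed: dict[str, str]) -> None:
    """Both maps go from demo ident to image version/checksum."""
    for ident in set(deployed) - set(configured):
        undeploy(ident)                          # not in config anymore
    for ident in set(configured) - set(deployed):
        deploy(ident, configured[ident])         # new to the config
    for ident in set(configured) & set(deployed):
        if configured[ident] != deployed[ident]:
            redeploy(ident, configured[ident])   # image was updated
```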

An important new feature is that we make authenticated calls to the API so we can use private images as well.
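
Something along these lines (the base URL, endpoint paths, and token scheme are my assumptions about the imager-service API, not confirmed by this PR):

```python
"""Hypothetical sketch of an authenticated imager-service call."""
import os

import requests

API_URL = "https://api.imager.kiwix.org"  # assumed base URL

def get_token(username: str, password: str) -> str:
    resp = requests.post(
        f"{API_URL}/auth/authorize",  # assumed endpoint
        headers={"username": username, "password": password},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

token = get_token(os.environ["API_USERNAME"], os.environ["API_PASSWORD"])
# fetch metadata (incl. download URL) for a private auto-image
resp = requests.get(
    f"{API_URL}/auto-images/demo/json",  # assumed endpoint
    headers={"Authorization": f"Token {token}"},
    timeout=30,
)
resp.raise_for_status()
```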

All of this is sequential. Nothing prevents updating images in parallel, but that would be hell to debug from the logs in case of issue. At the moment, sequential seems like the best choice.

Note that we reconfigure the multi-proxy as soon as we enter the update script so the homepage is updated quickly. This means it can show links to stuff that is expected but not ready yet. That's what we want.

captive-portals

The captive portal UI is set on reverse_port + 1, so if a demo is at :3080, its captive portal can be seen at [*].demo.hotspot.kiwix.org:3081

If there's any interest in that, we could reverse it (_captive.xxx.demo… ?)

Lower downtimes

The previous code assumed a shortage of disk space and thus switched to maintenance mode as soon as an update was detected: the deployment was removed first, then the new image was downloaded and deployed.

This one assumes there's enough disk space, so on image update it downloads to a tmp file and, only once the download is complete, switches to maintenance mode to undeploy, move the file, and deploy.
This usually takes a few seconds.
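
The flow, sketched with hypothetical helper names:

```python
"""Sketch of the low-downtime image swap; helpers are placeholders."""
from pathlib import Path

def download_to(dest: Path) -> None:
    dest.write_bytes(b"")  # placeholder: real code runs aria2 (next sketch)

def enter_maintenance_mode() -> None: ...
def undeploy() -> None: ...
def deploy() -> None: ...

def update_image(image: Path) -> None:
    tmp = image.with_suffix(image.suffix + ".tmp")
    download_to(tmp)          # long operation; the old demo keeps serving
    enter_maintenance_mode()  # downtime starts only here
    undeploy()
    tmp.replace(image)        # atomic rename on the same filesystem
    deploy()                  # downtime over after a few seconds
```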

To simplify the download part, I removed the use of RPC/daemon mode in the aria2 call. As that was only there to watch download progress, I've enabled aria2 output so progress can be followed in the logs.
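
That is, a plain subprocess call with stdout/stderr left attached (the flags shown are standard aria2c options; the exact invocation in the PR may differ):

```python
import subprocess

def download_with_aria2(url: str, folder: str, filename: str) -> None:
    """Plain aria2c run: no --enable-rpc/daemon mode, and output is not
    captured, so progress lines land in the service logs."""
    subprocess.run(
        ["aria2c", "--dir", folder, "--out", filename, url],
        check=True,  # raise on download failure
    )
```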

systemd

The previous version relied on systemd to start/stop the demo (the hotspot compose). With multi-demo and its changing number of deployments/composes, that felt like a burden with limited value.

Actually, there was a big flaw in the previous version: the demo would not recover from a restart, as the watcher would be happy to see the latest image on disk, but the compose would not be there since it resides on the image itself… which would not have been mounted.

This version keeps systemd for the multi-proxy (a single container started with docker run) and the watchers, but there is no start script anymore for the demo itself. The updated update-watcher now takes care of deploying (and thus mounting) whatever is not deployed.

The main difference is that we now start the compose in daemon mode, which makes it more difficult to assess whether it's working or not. A new function checks that there is at least one container and no dead containers, but it doesn't account for pulling/building images.
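
A sketch of such a check, assuming `docker compose ps --format json` (Compose v2; recent versions emit one JSON object per line):

```python
import json
import subprocess

def is_healthy(compose_file: str) -> bool:
    """At least one container and none dead, as described above (a later
    commit tightens this to running containers only)."""
    proc = subprocess.run(
        ["docker", "compose", "-f", compose_file, "ps", "--all",
         "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    lines = [line for line in proc.stdout.splitlines() if line.strip()]
    containers = [json.loads(line) for line in lines]
    if not containers:
        return False
    return all(c.get("State") != "dead" for c in containers)
```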

Known issues

  • Changing the order of deployments in demos.yaml is risky: already-running demos are configured with certain ports, while the multi-proxy reconfiguration on change expects ports based on order. Should be improved. (Resolved: ports are now deterministic.)
  • When accessing kiwix-serve inside a demo and clicking the home link, the user ends up at the right place but on http://. That's because of the reverse to HTTP. It could be resolved by trusting the upstream server in the hotspot-proxy's Caddy or by enabling http-to-https redirection on the reverse proxy.
  • START_DURATION is set long to accommodate image pulling/building, resulting in longer downtime (2+ min) even though it's not necessary most of the time. Improving the checker, or simply pulling/building before starting the compose, would reduce this to a few seconds. (Resolved: we now pull/build before start, with a 15s start duration.)
  • [new] Added a redirection from http://captive.xxx.demo.hotspot.kiwix.org to the captive UI for each.

@rgaudin self-assigned this on Aug 15, 2024
HTTP ports are now ident-based, hence allowing order changes in config.
The HTTPS port has been removed as it was not used.
A captive_http_port is set based on the HTTP one, and a _captive. endpoint is created.
- demo_start is run after pull and build so it starts faster (container creation + startup)
- is_healthy checks only for running containers, as `created` would show up on a previously deployed but not-running maint
- that's enough for what it's used for
- lighttpd is throwing 400 on reverse for some reason
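
For illustration, an order-independent derivation could hash the ident into a port slot (this is a guess at the approach, not the PR's actual mapping; collision handling is omitted):

```python
import hashlib

def http_port_for(ident: str, base: int = 3080, slots: int = 50) -> int:
    """Stable port for a demo ident, unaffected by its position in
    demos.yaml. Collisions are ignored in this sketch."""
    slot = hashlib.sha256(ident.encode()).digest()[0] % slots
    return base + slot * 1000

def captive_http_port_for(ident: str) -> int:
    return http_port_for(ident) + 1  # served via the _captive. endpoint
```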
@rgaudin requested a review from @benoit74 on August 17, 2024

@benoit74 (Contributor) left a comment


This is HUGE! Well done! I've done my best to review it, but I probably missed a few things.

I think the README misses a section regarding how to add or remove a demo (update demos.yaml and add/remove the asset image?)

I'm not sure hosting the demos.yaml and the homepage assets in this repo is the proper place. It is obviously way simpler, but it should rather be hosted in kiwix/operations from my PoV (or offspot/operations). Maybe not something to do now, but worth mentioning in the #Next readme section.

install.sh is probably not needed anymore; I would remove it from the repo (or do we still need it to install aria2?)

why did you remove setup.py and the demo-setup script from pyproject.toml? The README still mentions it needs to be run for installation; maybe you would prefer to install the systemd services manually?

I think that `enable_portal` in `Deployment` and `index` in `Deployment` are not used anymore; they should be dropped.

src/offspot_demo/deploy.py (review thread: outdated, resolved)

@rgaudin (Member, Author) commented Sep 9, 2024

> I think the README misses a section regarding how to add or remove a demo (update demos.yaml and add/remove the asset image?)

Added

> I'm not sure hosting the demos.yaml and the homepage assets in this repo is the proper place. It is obviously way simpler, but it should rather be hosted in kiwix/operations from my PoV (or offspot/operations). Maybe not something to do now, but worth mentioning in the #Next readme section.

Switched to https://github.com/kiwix/operations/blob/main/demos/demo.offspot.yaml

> install.sh is probably not needed anymore; I would remove it from the repo (or do we still need it to install aria2?)

Removed. Yes, aria2 is still required; I put the couple of install instructions in the README.

> why did you remove setup.py and the demo-setup script from pyproject.toml? The README still mentions it needs to be run for installation; maybe you would prefer to install the systemd services manually?

Yes, it's in the README. I forgot to remove the reference to the script. Done.

> I think that `enable_portal` in `Deployment` and `index` in `Deployment` are not used anymore; they should be dropped.

Good catch; removed.

@benoit74 (Contributor) left a comment


LGTM

@rgaudin merged commit 6505df8 into main on Sep 9, 2024
3 checks passed
@rgaudin deleted the multi-demo branch on September 9, 2024 at 11:59