Revamped for multi-demo & minimal downtime #14
Conversation
Ended up squashing it all as there was too much back-and-forth in the non-trivial commits and none would work independently.

- multi-proxy approach with a single TLS handler; reverses using HTTP; needs all subdomains
- custom homepage, configurable from the repo
- config watcher
- update watcher improved: undeploys existing-but-gone, starts missing, redeploys updated
- authenticates to the imager-service API to use private images as well
- uses python3.11 (upgraded machine to bookworm)
- exposes captive portal UIs on ports `1080`, `2080`, etc.
- defaults to all files under `/data/demo/{compose,images,data}/`, with per-deployment subfolders
- `/var/logs/demo` maintained to have a tmpfs
- new `Deployment` type
- downloads to a temp file to reduce downtime; the actual replacement (with downtime) thus lasts seconds
- downloads without RPC/daemon mode to simplify management
- downloads without capturing output so one can follow progress in the logs
- new `--reuse-image` param for the deploy script
- new `--force-prepare` param for the deploy script (and `--force` for the prepare one)
- post-prepare calls the multi-proxy for regen
- no use for the IS_ONLINE_DEMO trick on the dashboard (will be removed)
- multi-proxy is a single `docker run` systemd service (not compose) because it's one service
- demo-watcher calls two scripts to ease dependencies and use the handy systemd timer
- one maint-compose per demo
- new undeploy script (with `--keep` param) for debugging

See the PR for details: https://github.com/offspot/demo/pulls/14
HTTP ports are now ident-based, hence allowing order changes in the config. The HTTPS port has been removed as it was not used. A captive_http_port is set based on the HTTP one, and a `_captive.` endpoint is created.
- demo_start is run after pull and build so it starts faster (container creation + startup)
- is_healthy checks only for running containers, as `created` would show up on a previously deployed but not running maint; that's enough for what it's used for
- lighttpd is throwing 400 on reverse for some reason
This is HUGE! Well done! I've done my best to review it, but I probably missed a few things.
I think the README misses a section regarding how to add or remove a demo (update demos.yaml and add/remove an asset image?).
I'm not sure hosting the demos.yaml and the assets for the homepage in this repo is the proper place. It is obviously way simpler, but it should rather be hosted in kiwix/operations from my PoV (or offspot/operations). Maybe not something to do now, but worth mentioning in a #Next readme section.
install.sh is probably not needed anymore; I would remove it from the repo (or do we still need it to install aria2?).
Why did you remove setup.py and the demo-setup script from pyproject.toml? The README still mentions it needs to be run for installation; maybe you would prefer to install the systemd services manually?
I think that `enable_portal` in `Deployment` and `index` in `Deployment.using` are not used anymore; they should be dropped.
Added
Switched to https://github.com/kiwix/operations/blob/main/demos/demo.offspot.yaml
Removed. Yes, aria2 is still required. Put the couple of install instructions in the README.
Yes, it's in the README. I forgot to remove the reference to the script. Done.
Good catch; removed.
LGTM
Ended up squashing it all as there was too much back-and-forth in the non-trivial commits and none would work independently.
Fixes #10
Significant changes
- exposes captive portal UIs on ports `1080`, `2080`, etc.
- new `Deployment` type
- new `--reuse-image` param for the deploy script
- new `--force-prepare` param for the deploy script (and `--force` for the prepare one)
- new undeploy script (with `--keep` param) for debugging

Approach
reverse-proxy
The previous version was quite simple: we had a single compose running at a time, which already contained a reverse-proxy to the services. We just had to expose this reverse-proxy's ports (actually it came exposed) and tweak its config to use ACME certificates instead of the Caddy-internal ones for HTTPS.
The multi-demo approach is quite different. First, we need a webserver (:80, :443)
to serve our homepage (list of demos).
This means we cannot expose our hotspot reverse-proxy anymore. We thus have to use that homepage server to reverse-proxy to the hotspot reverse-proxy.
That's easy and serves the demo homepage (xxx.demo.hotspot.kiwix.org) well, but it's not enough, as wildcard certificates only work one level deep: a cert for *.demo.hotspot.kiwix.org works for xxx.demo.hotspot.kiwix.org but not for yyy.xxx.demo.hotspot.kiwix.org.
The solution is thus to have the main webserver (the multi-proxy) know every demo domain and every subdomain of each, and manage all certificates. Caddy's auto_https is a great feature, but its magic makes it difficult to configure such scenarios.
Edit those Caddyfiles with caution 😅
We thus have two scripts in the multi-proxy container: gen-server and caddy-reload. Those are called externally to rewrite the Caddyfile (and the static homepage) and reload Caddy.
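As a rough illustration of that flow, here is a minimal Python sketch of a gen-server-like step; the paths, the site-block template, and the demo data are assumptions for illustration, not the actual gen-server/caddy-reload scripts:

```python
"""Hypothetical sketch: write one explicit Caddy site block per demo
hostname (the demo FQDN plus each of its subdomains, since wildcard
certs would not cover this extra level), then gracefully reload."""
import pathlib
import subprocess

DOMAIN = "demo.hotspot.kiwix.org"
CADDYFILE = pathlib.Path("/etc/caddy/Caddyfile")  # assumed location


def gen_server(demos: dict[str, tuple[list[str], int]]) -> None:
    """demos maps a demo subdomain to (its known subdomains, the host
    HTTP port of its hotspot reverse-proxy)."""
    blocks = []
    for sub, (inner_subs, port) in demos.items():
        # enumerate every hostname explicitly so each gets a certificate
        hosts = [f"{sub}.{DOMAIN}"] + [f"{inner}.{sub}.{DOMAIN}" for inner in inner_subs]
        blocks.append(f"{', '.join(hosts)} {{\n\treverse_proxy localhost:{port}\n}}\n")
    CADDYFILE.write_text("\n".join(blocks))
    # graceful reload keeps existing connections alive, unlike a restart
    subprocess.run(["caddy", "reload", "--config", str(CADDYFILE)], check=True)


gen_server({"alpha": (["kiwix"], 3080), "beta": (["kiwix"], 4080)})
```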
External human-friendly config
Borrowing demo.library's approach of a human-friendly YAML config file, one can update the list of demos by updating demos.yaml:

- `ident` is the auto-image identifier
- `alias` is an optional custom subdomain (`ident` is used otherwise)
- `name` is an optional name to use in the homepage (`ident` otherwise)

All tools continue to work off the /etc/demo/environment file. A new script, config-watcher, is used to query the YAML file and directly update /etc/demo/environment.
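For illustration, a minimal config-watcher-like sketch; the YAML layout and the OFFSPOT_DEMOS_LIST encoding shown are assumptions, not the actual script:

```python
"""Hypothetical config-watcher sketch: read the demos YAML file and
rewrite /etc/demo/environment from it."""
import pathlib
import yaml  # pyyaml

ENVIRONMENT = pathlib.Path("/etc/demo/environment")


def update_environment(yaml_path: pathlib.Path) -> None:
    demos = yaml.safe_load(yaml_path.read_text())["demos"]
    entries = []
    for demo in demos:
        ident = demo["ident"]
        # alias and name both fall back to ident when absent
        entries.append(f"{ident}|{demo.get('alias') or ident}|{demo.get('name') or ident}")
    ENVIRONMENT.write_text(f"OFFSPOT_DEMOS_LIST={','.join(entries)}\n")
```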
Deployment Type
A dataclass holding all information about a demo/deployment. It replaces all the deployment-specific environ variables.
The list of demos with key infos is still stored in a single environ (OFFSPOT_DEMOS_LIST). That's a bit ugly, especially because it looks half-baked, but I wanted to keep as much old code as possible, and it works well for now.
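A minimal sketch of what such a dataclass could look like; only `ident`, `alias`, `name`, and `index` come from this PR's description, the rest is illustrative:

```python
"""Illustrative Deployment dataclass; defaults and fallbacks are
guesses, not the actual type from this PR."""
from dataclasses import dataclass


@dataclass
class Deployment:
    ident: str       # auto-image identifier
    index: int       # position in the list, used to derive host ports
    alias: str = ""  # optional custom subdomain
    name: str = ""   # optional homepage display name

    def __post_init__(self) -> None:
        # both fall back to ident, as described above
        self.alias = self.alias or self.ident
        self.name = self.name or self.ident
```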
Because the multi-proxy reverses for the hotspot-proxy, it needs to access the hotspot-proxy via HTTP.
As each demo lives inside its own compose, a docker network cannot be used to communicate from the multi-proxy container to the hotspot-proxy container.
We are thus opening the HTTP port of the hotspot-proxy on the host directly and using it for the reverse.
It's not a security issue: we are not reversing to prevent direct access but to share a single TLS handler. Users can't use those ports directly anyway, as the hotspot-proxy serves its FQDN only.
Anyway, this means using a different port for each demo. We thus have an index on the deployment that we use to set the ports, so we have :3080, :4080, etc.
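For illustration, the port derivation could look like this; the exact formula is an assumption inferred from the `:3080, :4080` examples above:

```python
def http_port(index: int) -> int:
    """Host port of a deployment's hotspot-proxy HTTP endpoint.
    Assumed mapping: index 0 -> 3080, index 1 -> 4080, and so on."""
    return (index + 3) * 1000 + 80


def captive_http_port(index: int) -> int:
    # the captive portal UI sits on the next port: 3081, 4081, ...
    return http_port(index) + 1
```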
update-watcher
New name of the watcher script. Changed a lot because we now have to track various deployments.
The important new feature is that we make authenticated calls to the API so we can use private images as well.
All of this is sequential. Nothing prevents updating images in parallel, but it would be hell to debug with logs in case of an issue. At the moment, this seems like the best choice.
Note that we reconfigure the multi-proxy as soon as we enter the update script so the homepage is updated quickly. This means it can have links to things that are expected but not ready yet. That's what we want.
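A hedged sketch of that sequential pass, based on the behaviour described above; every helper called here (and the reconfigure step) is hypothetical:

```python
"""Sequential update pass: reconfigure the proxy first, then undeploy
what's gone, start what's missing, and redeploy what's updated.
Deployment refers to the dataclass sketched earlier."""


def update_pass(configured: dict[str, "Deployment"],
                deployed: dict[str, "Deployment"]) -> None:
    # reconfigure the multi-proxy first so the homepage updates quickly,
    # even if it links to demos that are not ready yet
    reconfigure_multi_proxy(configured)
    for ident in sorted(deployed.keys() - configured.keys()):
        undeploy(deployed[ident])           # existing but gone from config
    for ident, demo in configured.items():  # sequential on purpose: easier logs
        if ident not in deployed:
            deploy(demo)                    # missing
        elif image_updated(demo):           # authenticated imager-service check
            redeploy(demo)                  # updated
```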
captive-portals
The captive portal UI is set on reverse_port + 1, so if a demo is at :3080, its captive portal can be seen at [*].demo.hotspot.kiwix.org:3081.
If there's any interest in that, we could reverse it (_captive.xxx.demo...?).
Lower downtimes
The previous code assumed a shortage of disk space and thus switched to maintenance mode as soon as an update was detected. The deployment was removed first, then the new image was downloaded and deployed.
This one assumes there's enough disk space, so on an image update it downloads to a tmp file, and only once the download is complete does it switch to maintenance mode to undeploy/move the file/deploy.
This usually takes a few seconds.
To simplify the download part, I removed the use of RPC/daemon mode in the aria2 call. As that was done to watch download progress, I've enabled aria2 output so the logs can be used to follow progress.
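For illustration, the download-then-swap flow could look like this; the maintenance/deploy helpers are hypothetical, while the aria2c flags are standard:

```python
"""Sketch of download-then-swap under the 'enough disk space'
assumption: fetch the new image to a temp file first, and only enter
maintenance mode for the short undeploy/move/deploy window."""
import pathlib
import subprocess


def fetch_and_swap(url: str, target: pathlib.Path) -> None:
    tmp = target.with_suffix(".tmp")
    # plain aria2c invocation, no RPC/daemon mode; output is not
    # captured, so download progress can be followed in the logs
    subprocess.run(
        ["aria2c", f"--dir={tmp.parent}", f"--out={tmp.name}", url],
        check=True,
    )
    enter_maintenance_mode()  # downtime starts here and lasts seconds
    undeploy_current()
    tmp.replace(target)       # a rename, near-instant on the same filesystem
    deploy(target)
    leave_maintenance_mode()
```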
systemd
The previous version relied on systemd to start/stop the demo (the hotspot compose). With multi-demo and its changing number of deployments/composes, that felt like a burden with limited value.
Actually, there was a big flaw in the previous version: the demo would not recover from a restart, as the watcher would be happy to see the latest image on disk, but the compose would not be there since it resides on the image itself… which would not have been mounted.
This version keeps systemd for the multi-proxy (a single container started with `docker run`) and the watchers, but there is no start script anymore for the demo itself. The updated update-watcher now takes care of deploying (and thus mounting) what's not deployed.
The main difference is that we are now starting the compose in daemon mode, which makes it more difficult to assess whether it is working or not. A new function checks that there's at least one container and that there are no dead containers, but it doesn't account for pulling/building images.
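A sketch of what such a check could look like; it assumes the line-per-object JSON output of recent Compose v2 releases and is not the actual function:

```python
"""Hypothetical health check: at least one container in the compose
project, and all of them running (so 'created' or dead containers
count as unhealthy). Does not account for images still being
pulled or built."""
import json
import subprocess


def is_healthy(project: str) -> bool:
    proc = subprocess.run(
        ["docker", "compose", "--project-name", project,
         "ps", "-a", "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    containers = [json.loads(line) for line in proc.stdout.splitlines() if line.strip()]
    if not containers:  # nothing deployed at all
        return False
    return all(c.get("State") == "running" for c in containers)
```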
Known issues
- Changing the order of deployments in demos.yaml is risky, as already-running demos are configured with certain ports, while the upon-change multi-proxy reconf expects ports based on order. Should be improved. Edit: ports are now deterministic (ident-based).
- START_DURATION is set long to accommodate image pulling/building, resulting in a longer downtime (2+ min) while it's not necessary most of the time. Improving the checker, or simply pulling/building before the compose, would allow reducing this to a few seconds, limiting the downtime. Edit: we now pull/build before start, with a 15 s start duration.
- We could reverse http://captive.xxx.demo.hotspot.kiwix.org to the captive UI for each demo.