Concurrent jobs with configmaps #87

Open
wants to merge 5 commits into master
Changes from 3 commits
70 changes: 36 additions & 34 deletions docs/README.md
@@ -1,22 +1,23 @@
# Table of workloads

| Workload/tooling | Short Description | Minimum Requirements |
|:------------------------------------------------------------------- |:----------------------------------------- | -------------------------------------------- |
| [Tooling](tooling.md) | Setup pbench instrumentation tools | Cluster-admin, Privileged Containers |
| [Test](test.md) | Test/Run your workload from ssh Container | Cluster-admin, Privileged Containers |
| [Baseline](baseline.md) | Baseline metrics capture | Tooling job* |
| [Scale](scale.md) | Scales worker nodes | Cluster-admin |
| [NodeVertical](nodevertical.md) | Node Kubelet Density | Labeling Nodes |
| [PodVertical](podvertical.md) | Max Pod Density | None |
| [MasterVertical](mastervertical.md) | Master Node Stress workload | None |
| [HTTP](http.md) | HTTP ingress TPS/Latency | None |
| [Network](network.md) | TCP/UDP Throughput/Latency | Labeling Nodes, [See below](#network) |
| [Deployments Per Namespace](deployments-per-ns.md) | Maximum Deployments | None |
| [PVCscale](pvscale.md) | PVCScale test | Working storageclass |
| [Conformance](conformance.md) | OCP/Kubernetes e2e tests | None |
| [Namespaces per cluster](namespaces-per-cluster.md) | Maximum Namespaces | None |
| [Services per namespace](services-per-namespace.md) | Maximum services per namespace | None |
| [FIO I/O test](fio.md) | FIO I/O test - stress storage backend | Privileged Containers, Working storage class |
| [Concurrent jobs with configmaps](concurent-jobs-with-configmaps.md)  | Create and run simple jobs concurrently    | None                                          |

* A Baseline job run without a tooled cluster simply idles the cluster. The goal is to capture resource consumption over a period of time to characterize resource requirements, thus tooling is required (for now).

@@ -36,20 +37,21 @@

Each workload will implement a form of pass/fail criteria in order to flag if the tests have failed in CI.

| Workload/tooling | Pass/Fail |
|:------------------------------------------------------------------- |:----------------------------- |
| [Tooling](tooling.md) | NA |
| [Test](test.md) | NA |
| [Baseline](baseline.md) | NA |
| [Scale](scale.md) | Yes: Test Duration |
| [NodeVertical](nodevertical.md) | Yes: Exit Code, Test Duration |
| [PodVertical](podvertical.md) | Yes: Exit Code, Test Duration |
| [MasterVertical](mastervertical.md) | Yes: Exit Code, Test Duration |
| [HTTP](http.md) | No |
| [Network](network.md) | No |
| [Deployments Per Namespace](deployments-per-ns.md) | No |
| [PVCscale](pvscale.md) | No |
| [Conformance](conformance.md) | No |
| [Namespaces per cluster](namespaces-per-cluster.md) | Yes: Exit code, Test Duration |
| [Services per namespace](services-per-namespace.md) | Yes: Exit code, Test Duration |
| [FIO I/O test](fio.md) | No |
| [Concurrent jobs with configmaps](concurent-jobs-with-configmaps.md)  | No                            |
77 changes: 77 additions & 0 deletions docs/concurent-jobs-with-configmaps.md
@@ -0,0 +1,77 @@
# Concurrent Jobs With Configmaps Workload

The Concurrent Jobs with Configmaps test playbook is `workloads/concurrent-jobs-with-configmaps.yml`.
This workload test is designed to check how many concurrently created ConfigMaps and pods it takes to slow down the cluster.

```sh
$ cp workloads/inventory.example inventory
$ # Add orchestration host to inventory
$ # Edit vars in workloads/vars/concurrent-jobs-with-configmaps.yml or define Environment vars (See below)
$ time ansible-playbook -vv -i inventory workloads/concurrent-jobs-with-configmaps.yml
```
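
While the playbook runs, the workload job it creates can be followed from the orchestration host. This is an optional sketch, assuming the kubeconfig in use points at the test cluster; the job name and namespace come from the playbook itself.

```sh
# Watch the workload job created in the scale-ci-tooling namespace.
oc get job scale-ci-concurrent-jobs -n scale-ci-tooling -w

# Follow the workload script output as it creates the concurrent jobs.
oc logs -f job/scale-ci-concurrent-jobs -n scale-ci-tooling
```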

## Environment variables

### PUBLIC_KEY
Default: `~/.ssh/id_rsa.pub`
Public ssh key file for Ansible.

### PRIVATE_KEY
Default: `~/.ssh/id_rsa`
Private ssh key file for Ansible.

### ORCHESTRATION_USER
Default: `root`
User for Ansible to log in as. Must authenticate with PUBLIC_KEY/PRIVATE_KEY.

### WORKLOAD_IMAGE
Default: `quay.io/openshift-scale/scale-ci-workload`
Container image that runs the workload script.

### WORKLOAD_JOB_NODE_SELECTOR
Default: `false`
Enables/disables the node selector that places the workload job on the `workload` node.

### WORKLOAD_JOB_TAINT
Default: `false`
Enables/disables the toleration on the workload job to permit the `workload` taint.
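
If a dedicated workload node is used, it needs the matching label and taint before these options are enabled. A minimal sketch, assuming the usual scale-ci `workload` label/taint convention; adjust the keys to whatever your tooling setup actually applies.

```sh
# Assumed convention: label and taint a node so only the workload job lands on it.
oc label node <node-name> node-role.kubernetes.io/workload=
oc adm taint nodes <node-name> role=workload:NoSchedule
```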

### WORKLOAD_JOB_PRIVILEGED
Default: `true`
Enables/disables running the workload pod as privileged.

### KUBECONFIG_FILE
Default: `~/.kube/config`
Location of kubeconfig on orchestration host.

### PBENCH_INSTRUMENTATION
Default: `false`
Enables/disables running the workload wrapped by pbench-user-benchmark. When enabled, pbench agents can then be enabled (`ENABLE_PBENCH_AGENTS`) for further instrumentation data and pbench-copy-results can be enabled (`ENABLE_PBENCH_COPY`) to export captured data for further analysis.
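
A rough sketch of enabling the full pbench path before invoking the playbook; the server address and key paths are placeholders, and the variable names are the ones documented in this file.

```sh
# Illustrative values only; PBENCH_SERVER must point at your own results server.
export PBENCH_INSTRUMENTATION=true
export ENABLE_PBENCH_AGENTS=true
export ENABLE_PBENCH_COPY=true
export PBENCH_SERVER=pbench.example.com
export PBENCH_SSH_PRIVATE_KEY_FILE=~/.ssh/id_rsa
export PBENCH_SSH_PUBLIC_KEY_FILE=~/.ssh/id_rsa.pub
```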

### ENABLE_PBENCH_AGENTS
Default: `false`
Enables/disables the collection of pbench data on the pbench agent Pods. These Pods are deployed by the tooling playbook.

### ENABLE_PBENCH_COPY
Default: `false`
Enables/disables the copying of pbench data to a remote results server for further analysis.

### PBENCH_SSH_PRIVATE_KEY_FILE
Default: `~/.ssh/id_rsa`
Location of ssh private key to authenticate to the pbench results server.

### PBENCH_SSH_PUBLIC_KEY_FILE
Default: `~/.ssh/id_rsa.pub`
Location of the ssh public key to authenticate to the pbench results server.

### PBENCH_SERVER
Default: There is no public default.
DNS address of the pbench results server.

### NUMBER_OF_CONCURRENT_JOBS
Default: `300`
Number of concurrent jobs with ConfigMaps to create during the workload.

### JOB_COMPLETION_POLL_ATTEMPTS
Default: `360`
Number of retries for Ansible to poll if the workload job has completed. Poll attempts delay 10s between polls with some additional time taken for each polling action depending on the orchestration host setup.
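
Putting the two workload-specific knobs together, a typical invocation that raises the job count and extends the polling window could look like the sketch below; the values are illustrative.

```sh
# 720 attempts at a 10s delay gives the job roughly two hours to complete.
export NUMBER_OF_CONCURRENT_JOBS=500
export JOB_COMPLETION_POLL_ATTEMPTS=720
time ansible-playbook -vv -i inventory workloads/concurrent-jobs-with-configmaps.yml
```
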
140 changes: 140 additions & 0 deletions workloads/concurrent-jobs-with-configmaps.yml
@@ -0,0 +1,140 @@
---
#
# Runs the concurrent jobs with configmaps benchmark on an existing cluster.
#

- name: Runs concurrent jobs with configmaps
  hosts: orchestration
  gather_facts: true
  remote_user: "{{orchestration_user}}"
  vars_files:
    - vars/concurrent-jobs-with-configmaps.yml
  vars:
    workload_job: "concurrent-jobs"
  tasks:
    - name: Create scale-ci-tooling directory
      file:
        path: "{{ansible_user_dir}}/scale-ci-tooling"
        state: directory

    - name: Copy workload files
      copy:
        src: "{{item.src}}"
        dest: "{{item.dest}}"
      with_items:
        - src: scale-ci-tooling-ns.yml
          dest: "{{ansible_user_dir}}/scale-ci-tooling/scale-ci-tooling-ns.yml"
        - src: workload-concurrent-jobs-with-configmaps-script-cm.yml
          dest: "{{ansible_user_dir}}/scale-ci-tooling/workload-concurrent-jobs-with-configmaps-script-cm.yml"

    - name: Slurp kubeconfig file
      slurp:
        src: "{{kubeconfig_file}}"
      register: kubeconfig_file_slurp

    - name: Slurp ssh private key file
      slurp:
        src: "{{pbench_ssh_private_key_file}}"
      register: pbench_ssh_private_key_file_slurp

    - name: Slurp ssh public key file
      slurp:
        src: "{{pbench_ssh_public_key_file}}"
      register: pbench_ssh_public_key_file_slurp

    - name: Template workload templates
      template:
        src: "{{item.src}}"
        dest: "{{item.dest}}"
      with_items:
        - src: pbench-cm.yml.j2
          dest: "{{ansible_user_dir}}/scale-ci-tooling/pbench-cm.yml"
        - src: pbench-ssh-secret.yml.j2
          dest: "{{ansible_user_dir}}/scale-ci-tooling/pbench-ssh-secret.yml"
        - src: kubeconfig-secret.yml.j2
          dest: "{{ansible_user_dir}}/scale-ci-tooling/kubeconfig-secret.yml"
        - src: workload-job.yml.j2
          dest: "{{ansible_user_dir}}/scale-ci-tooling/workload-job.yml"
        - src: workload-env.yml.j2
          dest: "{{ansible_user_dir}}/scale-ci-tooling/workload-{{workload_job}}-env.yml"

    - name: Check if scale-ci-tooling namespace exists
      shell: |
        oc project scale-ci-tooling
      ignore_errors: true
      changed_when: false
      register: scale_ci_tooling_ns_exists

    - name: Ensure any stale scale-ci-concurrent-jobs job is deleted
      shell: |
        oc delete job scale-ci-{{workload_job}} -n scale-ci-tooling
      register: scale_ci_tooling_project
      failed_when: scale_ci_tooling_project.rc == 0
      until: scale_ci_tooling_project.rc == 1
      retries: 60
      delay: 1
      when: scale_ci_tooling_ns_exists.rc == 0

    - name: Ensure project concurrent-jobs-workload from previous workload is deleted
      shell: |
        oc delete project concurrent-jobs-workload
      register: concurrent_jobs_workload_project
      failed_when: concurrent_jobs_workload_project.rc == 0
      until: concurrent_jobs_workload_project.rc == 1
      retries: 60
      delay: 1

    - name: Block for non-existing tooling namespace
      block:
        - name: Create tooling namespace
          shell: |
            oc create -f {{ansible_user_dir}}/scale-ci-tooling/scale-ci-tooling-ns.yml

        - name: Create tooling service account
          shell: |
            oc create serviceaccount useroot -n scale-ci-tooling
            oc adm policy add-scc-to-user privileged -z useroot -n scale-ci-tooling
          when: enable_pbench_agents|bool or workload_job_privileged|bool
      when: scale_ci_tooling_ns_exists.rc != 0

    - name: Create/replace kubeconfig secret
      shell: |
        oc replace --force -n scale-ci-tooling -f "{{ansible_user_dir}}/scale-ci-tooling/kubeconfig-secret.yml"

    - name: Create/replace the pbench configmap
      shell: |
        oc replace --force -n scale-ci-tooling -f "{{ansible_user_dir}}/scale-ci-tooling/pbench-cm.yml"

    - name: Create/replace pbench ssh secret
      shell: |
        oc replace --force -n scale-ci-tooling -f "{{ansible_user_dir}}/scale-ci-tooling/pbench-ssh-secret.yml"

    - name: Create/replace workload script configmap
      shell: |
        oc replace --force -n scale-ci-tooling -f "{{ansible_user_dir}}/scale-ci-tooling/workload-concurrent-jobs-with-configmaps-script-cm.yml"

    - name: Create/replace workload script environment configmap
      shell: |
        oc replace --force -n scale-ci-tooling -f "{{ansible_user_dir}}/scale-ci-tooling/workload-{{workload_job}}-env.yml"

    - name: Create/replace workload job that runs the workload script
      shell: |
        oc replace --force -n scale-ci-tooling -f "{{ansible_user_dir}}/scale-ci-tooling/workload-job.yml"

    - name: Poll until job pod is running
      shell: |
        oc get pods --selector=job-name=scale-ci-{{workload_job}} -n scale-ci-tooling -o json
      register: pod_json
      retries: 60
      delay: 2
      until: pod_json.stdout | from_json | json_query('items[0].status.phase==`Running`')

    - name: Poll until job is complete
      shell: |
        oc get job scale-ci-{{workload_job}} -n scale-ci-tooling -o json
      register: job_json
      retries: "{{job_completion_poll_attempts}}"
      delay: 10
      until: job_json.stdout | from_json | json_query('status.succeeded==`1` || status.failed==`1`')
      failed_when: job_json.stdout | from_json | json_query('status.succeeded==`1`') == false
      when: job_completion_poll_attempts|int > 0
@@ -0,0 +1,72 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: scale-ci-workload-script
data:
  run.sh: |
    #!/bin/bash
    jobs_amount=${NUMBER_OF_CONCURRENT_JOBS}
    function create_jobs()
    {
      for i in $(seq 1 $jobs_amount);
      do
        cat /root/workload/conc_jobs.yaml | sed "s/%JOB_ID%/$i/g" | oc create -f -
      done
    }
    function wait_for_completion()
    {
      running=`oc get pods | grep -c Completed`
      while [ $running -lt $jobs_amount ]; do
        sleep 1
        running=`oc get pods -n concurrent-jobs-workload | grep -E "Completed|OOMKilled" | wc -l`
        echo "$running jobs are completed"
      done
    }

    oc new-project concurrent-jobs-workload
    start_time=`date +%s`
    create_jobs
    wait_for_completion
    end_time=`date +%s`
    total_time=`echo $end_time - $start_time | bc`
    echo "OOMKILLED jobs:"
    oc get pods | grep OOMKilled
    echo "Time taken for creating $jobs_amount concurrent jobs with configmaps $total_time seconds"
  conc_jobs.yaml: |
    # Example from: https://github.com/kubernetes/kubernetes/issues/74412#issue-413387234
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: job-%JOB_ID%
      namespace: concurrent-jobs-workload
    data:
      game.properties: |
        enemies=aliens
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: job-%JOB_ID%
      namespace: concurrent-jobs-workload
    spec:
      template:
        spec:
          containers:
          - name: busybox
            image: busybox
[Review comment from a project member on `image: busybox`]: Can we parameterize the image instead of hard coding it please?
            resources:
              requests:
                memory: "50Mi"
                cpu: "10m"
            command: [ "/bin/echo" ]
            args: [ "Hello, World!" ]
            volumeMounts:
            - name: config-volume
              mountPath: /etc/config
          volumes:
          - name: config-volume
            configMap:
              name: job-%JOB_ID%
          restartPolicy: Never
      backoffLimit: 4
2 changes: 2 additions & 0 deletions workloads/templates/workload-env.yml.j2
@@ -103,4 +103,6 @@ data:
  PROMETHEUS_GRAPH_PERIOD: "{{prometheus_graph_period}}"
  PROMETHEUS_REFRESH_INTERVAL: "{{prometheus_refresh_interval}}"
  PROMETHEUS_SCALE_TEST_PREFIX: "{{prometheus_scale_test_prefix}}"
{% elif workload_job == "concurrent-jobs" %}
  NUMBER_OF_CONCURRENT_JOBS: "{{number_of_concurrent_jobs}}"
{% endif %}