
feat: support AWS EventBridge #4188

Open · wants to merge 24 commits into main
Conversation

@npalm npalm commented Oct 17, 2024

Description

This PR introduces support for AWS EventBridge. The EventBridge integration can be enabled with the option `webhook_mode`, which can be set to either `direct` or `eventbridge`. In `direct` mode the old way of handling events still applies. When setting the mode to `eventbridge`, events will be published to the AWS EventBridge, which is not limited to only the `workflow_job` event with status `queued`. Via a target rule, events relevant for scaling are sent to the dispatcher lambda, which distributes them to an SQS queue for scaling.
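Enabling the new mode could look like the following snippet (a minimal sketch: the module source and the surrounding required inputs are assumed from the existing examples, only the `webhook_mode` line is specific to this PR):

```hcl
module "runners" {
  source = "philips-labs/github-runner/aws" # illustrative source

  # ... existing required inputs (github_app, vpc_id, subnet_ids, ...) ...

  # "direct" keeps the old behaviour; "eventbridge" publishes events
  # to an EventBridge bus before they are dispatched to SQS.
  webhook_mode = "eventbridge"
}
```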

Todo

  • Refactor lambda and add EventBridge
  • Refactor webhook module (TF) to support EventBridge
  • Test example default
  • Test example multi runner
  • Adjust docs
  • Reduce permissions on webhook and dispatcher lambda for eventbridge mode
  • Add configuration for allowed events on the EventBridge
  • Add support for CMK (encryption) for the EventBridge #4192

Migration directions

The change is backwards compatible but will recreate resources managed by the internal module `webhook`. The only resource containing data is the CloudWatch log group. To retain the log group you can run a `terraform state mv`, or add a `moved` block to your deployment.

Migrating to this version

With module defaults or webhook_mode = direct

moved {
   from = module.<runner-module-name>.module.webhook.aws_cloudwatch_log_group.webhook
   to = module.<runner-module-name>.module.webhook.module.direct[0].aws_cloudwatch_log_group.webhook
}

Or with webhook_mode = eventbridge

moved {
   from = module.<runner-module-name>.module.webhook.aws_cloudwatch_log_group.webhook
   to = module.<runner-module-name>.module.webhook.module.eventbridge[0].aws_cloudwatch_log_group.webhook
}

When switching between direct and eventbridge

When enabling mode eventbridge

moved {
  from = module.runners.module.webhook.module.direct[0].aws_cloudwatch_log_group.webhook
  to = module.runners.module.webhook.module.eventbridge[0].aws_cloudwatch_log_group.webhook
}

Or vice versa when moving from eventbridge back to direct.

@npalm npalm marked this pull request as draft October 17, 2024 07:00
examples/default/main.tf
@npalm npalm (Member Author) Oct 17, 2024
Loading the configuration was quite a mess. Replaced by a singleton to load configurations per deployed entrypoint. For each lambda handler function a different `ConfigABC` class is created.

}
}

function validateEventBusName(config: ConfigWebhookEventBridge): void {
Member Author (npalm):

Validation is in general limited, but input is controlled via internal terraform modules.
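As a hypothetical illustration of the kind of minimal guard meant here (the field name on `ConfigWebhookEventBridge` is assumed, not taken from the diff):

```typescript
interface ConfigWebhookEventBridge {
  eventBusName: string | undefined;
}

// Fail fast when the lambda is deployed without a bus name; deeper
// validation is left to the terraform modules that control the input.
function validateEventBusName(config: ConfigWebhookEventBridge): void {
  if (!config.eventBusName) {
    throw new Error('The event bus name is not configured.');
  }
}
```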

@npalm npalm changed the title feat: support event bridge feat: support AWS EventBridge Oct 17, 2024
}

private loadProperty(propertyName: keyof this, value: string) {
try {
Member Author (npalm):

Could not find a better way to parse an object or list into a typed property. In case of an error the input is just parsed as a string.

Not perfect, but also the terraform modules are typed.
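A sketch of that parse-or-fall-back idea (names are illustrative, not the PR's actual `loadProperty` implementation):

```typescript
// Try to JSON-parse the raw environment value so objects, lists,
// numbers and booleans become typed; on failure keep the raw string.
function parseValue(value: string): unknown {
  try {
    return JSON.parse(value);
  } catch {
    return value;
  }
}
```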


export async function githubWebhook(event: APIGatewayEvent, context: Context): Promise<Response> {
export async function directWebhook(event: APIGatewayEvent, context: Context): Promise<Response> {
Member Author (npalm):

Only the configuration object is changed for the old flow (webhook only). This is to accommodate other configurations.


const logger = createChildLogger('handler');

export async function dispatch(event: WorkflowJobEvent, eventType: string, config: Config): Promise<Response> {
export async function dispatch(
Member Author (npalm):

Only real change here is the configuration object.

await verifySignature(headers, body, config.webhookSecret);

const eventType = headers['x-github-event'] as string;
checkEventIsSupported(eventType, config.allowedEvents);
Member Author (npalm):

Adjusted the terraform module to expose configuration. This allows limiting the events accepted on the bus.
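One possible shape for such a check (a sketch under the assumption that an empty allow-list means "accept everything"; the real list is exposed via the terraform module configuration):

```typescript
// Reject events that are not on the configured allow-list.
// An empty list is treated as "all events allowed".
function checkEventIsSupported(eventType: string, allowedEvents: string[]): void {
  if (allowedEvents.length > 0 && !allowedEvents.includes(eventType)) {
    throw new Error(`Event type ${eventType} is not supported.`);
  }
}
```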

@npalm npalm marked this pull request as ready for review October 17, 2024 20:03
npalm commented Oct 18, 2024

With this piece of terraform code all GitHub events can be put in a bucket for further analysis.

Add the code in a `.tf` file in the multi-runner example.

source: https://040code.github.io/2023/01/06/github-event-aws-eventbridge

locals {
  prefix = var.environment
  event_bus_name = module.runners.webhook.eventbridge.event_bus.name
}

resource "aws_cloudwatch_event_rule" "all" {
  name           = "${local.prefix}-github-events-all"
  description    = "Capture all GitHub events"
  event_bus_name = local.event_bus_name
  event_pattern  = <<EOF
{
  "source": [{
    "prefix": "github"
  }]
}
EOF
}

resource "aws_cloudwatch_event_target" "main" {
  rule           = aws_cloudwatch_event_rule.all.name
  arn            = aws_kinesis_firehose_delivery_stream.extended_s3_stream.arn
  event_bus_name = local.event_bus_name
  role_arn       = aws_iam_role.event_rule_firehose_role.arn
}

data "aws_iam_policy_document" "event_rule_firehose_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "event_rule_firehose_role" {
  name               = "${local.prefix}-eventbridge-github-rule"
  assume_role_policy = data.aws_iam_policy_document.event_rule_firehose_role.json
}

data "aws_iam_policy_document" "firehose_stream" {
  statement {
    actions = [
      "firehose:DeleteDeliveryStream",
      "firehose:PutRecord",
      "firehose:PutRecordBatch",
      "firehose:UpdateDestination"
    ]
    resources = [aws_kinesis_firehose_delivery_stream.extended_s3_stream.arn]
  }
}


resource "aws_iam_role_policy" "event_rule_firehose_role" {
  name = "target-event-rule-firehose"
  role = aws_iam_role.event_rule_firehose_role.name
  policy = data.aws_iam_policy_document.firehose_stream.json
}


resource "random_uuid" "firehose_stream" {}

resource "aws_s3_bucket" "firehose_stream" {
  bucket        = "${local.prefix}-${random_uuid.firehose_stream.result}"
  force_destroy = true
}

# resource "aws_s3_bucket_acl" "firehose_stream" {
#   bucket = aws_s3_bucket.firehose_stream.id
#   acl    = "private"
# }

data "aws_iam_policy_document" "firehose_assume_role_policy" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["firehose.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "firehose_role" {
  name               = "${local.prefix}-firehose-role"
  assume_role_policy = data.aws_iam_policy_document.firehose_assume_role_policy.json
}

data "aws_iam_policy_document" "firehose_s3" {
  statement {
    actions = [
      "s3:AbortMultipartUpload",
      "s3:GetBucketLocation",
      "s3:GetObject",
      "s3:ListBucket",
      "s3:ListBucketMultipartUploads",
      "s3:PutObject"
    ]
    resources = [aws_s3_bucket.firehose_stream.arn, "${aws_s3_bucket.firehose_stream.arn}/*"]
  }
}

resource "aws_iam_role_policy" "firehose_s3" {
  name = "${local.prefix}-s3"
  role = aws_iam_role.firehose_role.name
  policy = data.aws_iam_policy_document.firehose_s3.json
}

data "aws_iam_policy_document" "firehose_log" {
  statement {
    actions = [
      "logs:PutLogEvents",
      "logs:CreateLogStream",
      "logs:CreateLogGroup"
    ]
    resources = [aws_cloudwatch_log_group.firehose_delivery_stream.arn]
  }
}

resource "aws_iam_role_policy" "firehose_log" {
  name = "${local.prefix}-log"
  role = aws_iam_role.firehose_role.name
  policy = data.aws_iam_policy_document.firehose_log.json
}

resource "aws_kinesis_firehose_delivery_stream" "extended_s3_stream" {
  name        = "${local.prefix}-stream"
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn   = aws_iam_role.firehose_role.arn
    bucket_arn = aws_s3_bucket.firehose_stream.arn
  }
}

resource "aws_cloudwatch_log_group" "firehose_delivery_stream" {
  name              = "/aws/kinesisfirehose/${aws_kinesis_firehose_delivery_stream.extended_s3_stream.name}"
  retention_in_days = 14
}

docs/configuration.md
docs/index.md
examples/multi-runner/main.tf
webhook_mode = "eventbridge"
# adjust the allow events to only allow specific events, like workflow_job
# eventbridge_allowed_events = ['workflow_job']

# enable this section for tracing
# tracing_config = {
# mode = "Active"
Collaborator:

In the examples there is a lot of commented-out TF; this leaves room for mistakes, as we tend to forget options that are deleted or have been renamed.

Member Author (npalm):

You are right, but feel free to suggest another way. It would be great to do it differently. But the examples are also a good way to check a PR quickly.

main.tf
@mpas mpas (Collaborator) left a comment:

Do not see big problems, only some styling/spelling issues.

@npalm npalm requested a review from mpas October 21, 2024 12:16
@npalm
Copy link
Member Author

npalm commented Oct 21, 2024

@mpas can you check 'my' last commit? I updated the top-level object to configure the feature and marked it beta in the docs.

npalm and others added 2 commits October 21, 2024 20:40
Co-authored-by: Marco Pas <marco.pasopas@gmail.com>
Co-authored-by: Marco Pas <marco.pasopas@gmail.com>
@@ -6,7 +6,7 @@ To be able to support a number of use-cases, the module has quite a lot of confi

- Org vs Repo level. You can configure the module to connect the runners in GitHub on an org level and share the runners in your org, or set the runners on repo level and the module will install the runner to the repo. There can be multiple repos but runners are not shared between repos.
- Multi-Runner module. This modules allows you to create multiple runner configurations with a single webhook and single GitHub App to simplify deployment of different types of runners. Check the detailed module [documentation](modules/public/multi-runner.md) for more information or checkout the [multi-runner example](examples/multi-runner.md).
- Workflow job event. You can configure the webhook in GitHub to send workflow job events to the webhook. Workflow job events were introduced by GitHub in September 2021 and are designed to support scalable runners. We advise using the workflow job event when possible.
- Webhook mode, the module can be deployed in the mode `direct` and `eventbridge` (Experimental). The `direct` mode is the default and will directly distribute to SQS for the scale-up lambda. The `eventbridge` mode will publish the event to an event bus with a target rule the events are sent to a dispatch lambda. The dispatch lambda will send the event to the SQS queue. The `eventbridge` mode is useful when you want to have more control over the events and potentially filter them. The `eventbridge` mode is disabled by default. We expect thhe `eventbridge` mode will be the future direction to build a data lake, build metrics, acto on `workflow_job` job started events, etc.
Collaborator:

Suggested change
- Webhook mode, the module can be deployed in the mode `direct` and `eventbridge` (Experimental). The `direct` mode is the default and will directly distribute to SQS for the scale-up lambda. The `eventbridge` mode will publish the event to an event bus with a target rule the events are sent to a dispatch lambda. The dispatch lambda will send the event to the SQS queue. The `eventbridge` mode is useful when you want to have more control over the events and potentially filter them. The `eventbridge` mode is disabled by default. We expect thhe `eventbridge` mode will be the future direction to build a data lake, build metrics, acto on `workflow_job` job started events, etc.
- Webhook mode, the module can be deployed in the mode `direct` and `eventbridge` (Experimental). The `direct` mode is the default and will directly distribute to SQS for the scale-up lambda. The `eventbridge` mode will publish the event to an event bus with a target rule the events are sent to a dispatch lambda. The dispatch lambda will send the event to the SQS queue. The `eventbridge` mode is useful when you want to have more control over the events and potentially filter them. The `eventbridge` mode is disabled by default. We expect thhe `eventbridge` mode will be the future direction to build a data lake, build metrics, act on `workflow_job` job started events, etc.

@@ -31,7 +31,7 @@ The diagram below shows the architecture of the module, groups are indicating th

### Webhook

The moment a GitHub action workflow requiring a `self-hosted` runner is triggered, GitHub will try to find a runner which can execute the workload. See [additional notes](additional_notes.md) for how the selection is made. This module reacts to GitHub's [`workflow_job` event](https://docs.github.com/en/free-pro-team@latest/developers/webhooks-and-events/webhook-events-and-payloads#workflow_job) for the triggered workflow and creates a new runner if necessary.
The moment a GitHub action workflow requiring a `self-hosted` runner is triggered, GitHub will try to find a runner which can execute the workload. See [additional notes](additional_notes.md) for how the selection is made. The module can be deployed in two modes. One mode called `direct`, after accepting the [`workflow_job` event](https://docs.github.com/en/free-pro-team@latest/developers/webhooks-and-events/webhook-events-and-payloads#workflow_job) event the module will dispatch the event to a SQS queue on which the scale-up function will act. The second mode, `eventbridge` will funnel events via the AWS EventBridge. the EventBridge enables act on other events then only the `workflow_job` event with status `queued`. besides that the EventBridge suppors replay functionality. For future exenstions to act on events or create a data lake we will relay on the EventBridge.
Collaborator:

Suggested change
The moment a GitHub action workflow requiring a `self-hosted` runner is triggered, GitHub will try to find a runner which can execute the workload. See [additional notes](additional_notes.md) for how the selection is made. The module can be deployed in two modes. One mode called `direct`, after accepting the [`workflow_job` event](https://docs.github.com/en/free-pro-team@latest/developers/webhooks-and-events/webhook-events-and-payloads#workflow_job) event the module will dispatch the event to a SQS queue on which the scale-up function will act. The second mode, `eventbridge` will funnel events via the AWS EventBridge. the EventBridge enables act on other events then only the `workflow_job` event with status `queued`. besides that the EventBridge suppors replay functionality. For future exenstions to act on events or create a data lake we will relay on the EventBridge.
The moment a GitHub action workflow requiring a `self-hosted` runner is triggered, GitHub will try to find a runner which can execute the workload. See [additional notes](additional_notes.md) for how the selection is made. The module can be deployed in two modes. One mode called `direct`, after accepting the [`workflow_job` event](https://docs.github.com/en/free-pro-team@latest/developers/webhooks-and-events/webhook-events-and-payloads#workflow_job) event the module will dispatch the event to a SQS queue on which the scale-up function will act. The second mode, `eventbridge` will funnel events via the AWS EventBridge. the EventBridge enables act on other events then only the `workflow_job` event with status `queued`. besides that the EventBridge supports replay functionality. For future extensions to act on events or create a data lake we will relay on the EventBridge.

@@ -132,6 +132,7 @@ module "multi-runner" {
| <a name="input_enable_managed_runner_security_group"></a> [enable\_managed\_runner\_security\_group](#input\_enable\_managed\_runner\_security\_group) | Enabling the default managed security group creation. Unmanaged security groups can be specified via `runner_additional_security_group_ids`. | `bool` | `true` | no |
| <a name="input_enable_metrics_control_plane"></a> [enable\_metrics\_control\_plane](#input\_enable\_metrics\_control\_plane) | (Experimental) Enable or disable the metrics for the module. Feature can change or renamed without a major release. | `bool` | `false` | no |
| <a name="input_enable_workflow_job_events_queue"></a> [enable\_workflow\_job\_events\_queue](#input\_enable\_workflow\_job\_events\_queue) | Enabling this experimental feature will create a secondory sqs queue to which a copy of the workflow\_job event will be delivered. | `bool` | `false` | no |
| <a name="input_eventbridge"></a> [eventbridge](#input\_eventbridge) | Enable the use of EventBridge by the module. By enable this feature events will be putted on the EventBridge bhy the webhook instead of directly dispatchting to queues for sacling. | <pre>object({<br/> enable = optional(bool, false)<br/> accept_events = optional(list(string), [])<br/> })</pre> | `{}` | no |
Collaborator:

Suggested change
| <a name="input_eventbridge"></a> [eventbridge](#input\_eventbridge) | Enable the use of EventBridge by the module. By enable this feature events will be putted on the EventBridge bhy the webhook instead of directly dispatchting to queues for sacling. | <pre>object({<br/> enable = optional(bool, false)<br/> accept_events = optional(list(string), [])<br/> })</pre> | `{}` | no |
| <a name="input_eventbridge"></a> [eventbridge](#input\_eventbridge) | Enable the use of EventBridge by the module. By enable this feature events will be putted on the EventBridge by the webhook instead of directly dispatching to queues for scaling. | <pre>object({<br/> enable = optional(bool, false)<br/> accept_events = optional(list(string), [])<br/> })</pre> | `{}` | no |

{
"Effect": "Allow",
"Action": ["events:PutEvents"],
"Resource": ${resource_arns}
Collaborator:

Suggested change
"Resource": ${resource_arns}
"Resource": "${resource_arns}"

"${github_app_webhook_secret_arn}",
"${parameter_runner_matcher_config_arn}"
]
"Resource": ${resource_arns}
Collaborator:

Suggested change
"Resource": ${resource_arns}
"Resource": "${resource_arns}"

sid = "AllowXRay"
}
}
# data "aws_iam_policy_document" "lambda_xray" {
Collaborator:

Can this be deleted? It's all commented out.

@@ -157,6 +157,7 @@ Talk to the forestkeepers in the `runners-channel` on Slack.
| <a name="input_enable_user_data_debug_logging_runner"></a> [enable\_user\_data\_debug\_logging\_runner](#input\_enable\_user\_data\_debug\_logging\_runner) | Option to enable debug logging for user-data, this logs all secrets as well. | `bool` | `false` | no |
| <a name="input_enable_userdata"></a> [enable\_userdata](#input\_enable\_userdata) | Should the userdata script be enabled for the runner. Set this to false if you are using your own prebuilt AMI. | `bool` | `true` | no |
| <a name="input_enable_workflow_job_events_queue"></a> [enable\_workflow\_job\_events\_queue](#input\_enable\_workflow\_job\_events\_queue) | Enabling this experimental feature will create a secondory sqs queue to which a copy of the workflow\_job event will be delivered. | `bool` | `false` | no |
Collaborator:

Suggested change
| <a name="input_enable_workflow_job_events_queue"></a> [enable\_workflow\_job\_events\_queue](#input\_enable\_workflow\_job\_events\_queue) | Enabling this experimental feature will create a secondory sqs queue to which a copy of the workflow\_job event will be delivered. | `bool` | `false` | no |
| <a name="input_enable_workflow_job_events_queue"></a> [enable\_workflow\_job\_events\_queue](#input\_enable\_workflow\_job\_events\_queue) | Enabling this experimental feature will create a secondary sqs queue to which a copy of the workflow\_job event will be delivered. | `bool` | `false` | no |

@mpas mpas self-requested a review October 22, 2024 07:54