cloudsnorkel / cdk-github-runners

CDK constructs for self-hosted GitHub Actions runners

Home Page: https://constructs.dev/packages/@cloudsnorkel/cdk-github-runners/

License: Apache License 2.0

JavaScript 1.55% TypeScript 90.85% Dockerfile 3.18% Shell 0.86% HTML 0.07% Svelte 3.37% SCSS 0.12%
aws cdk github-actions github-actions-runner aws-cdk

cdk-github-runners's Introduction

GitHub Self-Hosted Runners CDK Constructs

Use this CDK construct to create ephemeral self-hosted GitHub runners on-demand inside your AWS account.

  • 🧩 Easy to configure GitHub integration with a web-based interface
  • 🧠 Customizable runners with decent defaults
  • 🏃🏻 Multiple runner configurations controlled by labels
  • 🔐 Everything fully hosted in your account
  • 🔃 Automatically updated build environment with latest runner version

Self-hosted runners in AWS are useful when:

  • You need easy access to internal resources in your actions
  • You want to pre-install some software for your actions
  • You want to provide some basic AWS API access (but aws-actions/configure-aws-credentials has more security controls)
  • You are using GitHub Enterprise Server

GitHub recommends ephemeral (or on-demand) runners for auto-scaling, and they make sure all jobs run with a clean image. Runners are started on-demand, so you don't pay unless a job is running.

API

The best way to browse API documentation is on Constructs Hub. It is available in all supported programming languages.

Providers

A runner provider creates compute resources on-demand and uses actions/runner to start a runner.

|              | EC2            | CodeBuild                     | Fargate         | ECS            | Lambda          |
|--------------|----------------|-------------------------------|-----------------|----------------|-----------------|
| Time limit   | Unlimited      | 8 hours                       | Unlimited       | Unlimited      | 15 minutes      |
| vCPUs        | Unlimited      | 2, 4, 8, or 72                | 0.25 to 4       | Unlimited      | 1 to 6          |
| RAM          | Unlimited      | 3 GB, 7 GB, 15 GB, or 145 GB  | 512 MB to 30 GB | Unlimited      | 128 MB to 10 GB |
| Storage      | Unlimited      | 50 GB to 824 GB               | 20 GB to 200 GB | Unlimited      | Up to 10 GB     |
| Architecture | x86_64, ARM64  | x86_64, ARM64                 | x86_64, ARM64   | x86_64, ARM64  | x86_64, ARM64   |
| sudo         | ✔              | ✔                             | ✔               | ✔              | ❌              |
| Docker       | ✔              | ✔ (Linux only)                | ❌              | ✔              | ❌              |
| Spot pricing | ✔              | ❌                            | ✔               | ✔              | ❌              |
| OS           | Linux, Windows | Linux, Windows                | Linux, Windows  | Linux, Windows | Linux           |

The best provider to use mostly depends on your current infrastructure. When in doubt, CodeBuild is always a good choice. Execution history and logs are easy to view, and it has no restrictive limits unless you need to run for more than 8 hours.

  • EC2 is useful when you want runners to have complete access to the host
  • ECS is useful when you want to control the infrastructure, like leaving the runner host running for faster startups
  • Lambda is useful for short jobs that can work within time, size and readonly system constraints

You can also create your own provider by implementing IRunnerProvider.
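
The comparison table can also be read as a filter over what a job needs. The sketch below is illustrative only and is not part of the library: `ProviderLimits`, `JobNeeds`, and `candidateProviders` are hypothetical names, and the limits are transcribed from the table above (Lambda's 15 minutes is written as 0.25 hours). It lists the providers whose time limit, OS, and Docker support fit a job.

```typescript
// Provider limits transcribed from the comparison table.
// maxHours: null means there is no time limit.
interface ProviderLimits {
  name: string;
  maxHours: number | null;
  docker: 'yes' | 'linux-only' | 'no';
  windows: boolean;
}

const PROVIDERS: ProviderLimits[] = [
  { name: 'EC2', maxHours: null, docker: 'yes', windows: true },
  { name: 'CodeBuild', maxHours: 8, docker: 'linux-only', windows: true },
  { name: 'Fargate', maxHours: null, docker: 'no', windows: true },
  { name: 'ECS', maxHours: null, docker: 'yes', windows: true },
  { name: 'Lambda', maxHours: 0.25, docker: 'no', windows: false },
];

// Hypothetical description of a job's requirements (not a library type).
interface JobNeeds {
  hours: number;
  docker?: boolean;
  os?: 'linux' | 'windows';
}

// Returns the names of providers that can satisfy the job, in table order.
function candidateProviders(needs: JobNeeds): string[] {
  const os = needs.os ?? 'linux';
  return PROVIDERS.filter((p) => {
    if (p.maxHours !== null && needs.hours > p.maxHours) return false;
    if (os === 'windows' && !p.windows) return false;
    if (needs.docker) {
      if (p.docker === 'no') return false;
      if (p.docker === 'linux-only' && os !== 'linux') return false;
    }
    return true;
  }).map((p) => p.name);
}
```

For example, `candidateProviders({ hours: 10 })` leaves out CodeBuild and Lambda because of their time limits.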

Installation

  1. Install and use the appropriate package

    Python

    Install

    Available on PyPI.

    pip install cloudsnorkel.cdk-github-runners

    Use

    from cloudsnorkel.cdk_github_runners import GitHubRunners
    
    GitHubRunners(self, "runners")

    TypeScript or JavaScript

    Install

    Available on npm.

    npm i @cloudsnorkel/cdk-github-runners

    Use

    import { GitHubRunners } from '@cloudsnorkel/cdk-github-runners';
    
    new GitHubRunners(this, "runners");

    Java

    Install

    Available on Maven.

    <dependency>
       <groupId>com.cloudsnorkel</groupId>
       <artifactId>cdk.github.runners</artifactId>
    </dependency>

    Use

    import com.cloudsnorkel.cdk.github.runners.GitHubRunners;
    
    GitHubRunners.Builder.create(this, "runners").build();

    Go

    Install

    Available on GitHub.

    go get github.com/CloudSnorkel/cdk-github-runners-go/cloudsnorkelcdkgithubrunners

    Use

    import "github.com/CloudSnorkel/cdk-github-runners-go/cloudsnorkelcdkgithubrunners"
    
    NewGitHubRunners(this, jsii.String("runners"))

    .NET

    Install

    Available on Nuget.

    dotnet add package CloudSnorkel.Cdk.Github.Runners

    Use

    using CloudSnorkel;
    
    new GitHubRunners(this, "runners");

  2. Use GitHubRunners construct in your code (starting with default arguments is fine)

  3. Deploy your stack

  4. Look for the status command output similar to aws --region us-east-1 lambda invoke --function-name status-XYZ123 status.json

     ✅  github-runners-test
    
    ✨  Deployment time: 260.01s
    
    Outputs:
    github-runners-test.runnersstatuscommand4A30F0F5 = aws --region us-east-1 lambda invoke --function-name github-runners-test-runnersstatus1A5771C0-mvttg8oPQnQS status.json
    
  5. Execute the status command (you may need to specify --profile too) and open the resulting status.json file

  6. Open the URL in github.setup.url from status.json, or manually set up the GitHub integration as an app or with a personal access token

  7. Run status command again to confirm github.auth.status and github.webhook.status are OK

  8. Trigger a GitHub action that has a self-hosted label with runs-on: [self-hosted, linux, codebuild] or similar

  9. If the action is not successful, see troubleshooting
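
Steps 5 to 7 can be scripted. The following is a minimal sketch, assuming status.json contains the fields named above (github.setup.url, github.auth.status, github.webhook.status) and that a healthy status is the literal string "OK". The `RunnerStatus` type and the `checkStatus` helper are hypothetical and model only those fields; the real file contains more.

```typescript
// Only the fields named in the installation steps are modeled here.
// The exact shape of status.json is an assumption.
interface RunnerStatus {
  github?: {
    setup?: { url?: string };
    auth?: { status?: string };
    webhook?: { status?: string };
  };
}

// Returns a list of human-readable problems.
// An empty list means both checks passed.
function checkStatus(status: RunnerStatus): string[] {
  const problems: string[] = [];
  const auth = status.github?.auth?.status;
  const webhook = status.github?.webhook?.status;
  if (auth !== 'OK') {
    problems.push(`github.auth.status is ${auth ?? 'missing'}`);
  }
  if (webhook !== 'OK') {
    problems.push(`github.webhook.status is ${webhook ?? 'missing'}`);
  }
  return problems;
}
```

With Node.js, the input could come from `JSON.parse(fs.readFileSync('status.json', 'utf8'))` after running the status command.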

Demo

Customizing

The default providers configured by GitHubRunners are useful for testing but probably not for production work. They run in the default VPC or no VPC and have no added IAM permissions, so you would usually want to configure the providers yourself.

For example:

let vpc: ec2.Vpc;
let runnerSg: ec2.SecurityGroup;
let dbSg: ec2.SecurityGroup;
let bucket: s3.Bucket;

// create a custom CodeBuild provider
const myProvider = new CodeBuildRunnerProvider(this, 'codebuild runner', {
   labels: ['my-codebuild'],
   vpc: vpc,
   securityGroups: [runnerSg],
});
// grant some permissions to the provider
bucket.grantReadWrite(myProvider);
dbSg.connections.allowFrom(runnerSg, ec2.Port.tcp(3306), 'allow runners to connect to MySQL database');

// create the runner infrastructure
new GitHubRunners(this, 'runners', {
   providers: [myProvider],
});

Another way to customize runners is by modifying the image used to spin them up. The image contains the runner, any required dependencies, and integration code with the provider. You may choose to customize this image by adding more packages, for example.

const myBuilder = FargateRunnerProvider.imageBuilder(this, 'image builder');
myBuilder.addComponent(
  RunnerImageComponent.custom({ commands: ['apt install -y nginx xz-utils'] }),
);

const myProvider = new FargateRunnerProvider(this, 'fargate runner', {
   labels: ['customized-fargate'],
   imageBuilder: myBuilder,
});

// create the runner infrastructure
new GitHubRunners(this, 'runners', {
   providers: [myProvider],
});

Your workflow will then look like:

name: self-hosted example
on: push
jobs:
  self-hosted:
    runs-on: [self-hosted, customized-fargate]
    steps:
      - run: echo hello world

Windows images can also be customized the same way.

const myWindowsBuilder = FargateRunnerProvider.imageBuilder(this, 'Windows image builder', {
   architecture: Architecture.X86_64,
   os: Os.WINDOWS,
});
myWindowsBuilder.addComponent(
   RunnerImageComponent.custom({
     name: 'Ninja',
     commands: [
       'Invoke-WebRequest -UseBasicParsing -Uri "https://github.com/ninja-build/ninja/releases/download/v1.11.1/ninja-win.zip" -OutFile ninja.zip',
       'Expand-Archive ninja.zip -DestinationPath C:\\actions',
       'del ninja.zip',
     ],
   }),
);

const myProvider = new FargateRunnerProvider(this, 'fargate runner', {
   labels: ['customized-windows-fargate'],
   imageBuilder: myWindowsBuilder,
});

new GitHubRunners(this, 'runners', {
   providers: [myProvider],
});

The runner OS and architecture are determined by the image the runner is set to use. For example, to create a Fargate runner provider for ARM64, set the architecture property to Architecture.ARM64 in the image builder properties.

new GitHubRunners(this, 'runners', {
   providers: [
      new FargateRunnerProvider(this, 'fargate runner', {
         labels: ['arm64', 'fargate'],
         imageBuilder: FargateRunnerProvider.imageBuilder(this, 'image builder', {
            architecture: Architecture.ARM64,
            os: Os.LINUX_UBUNTU,
         }),
      }),
   ],
});

Architecture

Architecture diagram

Troubleshooting

Runners are started in response to a webhook coming in from GitHub. If there are any issues starting the runner, like missing capacity or transient API issues, the provider will keep retrying for 24 hours. Configuration errors, such as pointing to a missing AMI, will not be retried. GitHub itself will cancel the job if it can't find a runner for 24 hours. If your jobs don't start, follow the steps below to examine all parts of this workflow.

  1. Always start with the status function, make sure no errors are reported, and confirm all status codes are OK
  2. Make sure runs-on in the workflow matches the expected labels set in the runner provider
  3. Diagnose relevant executions of the orchestrator step function by visiting the URL in troubleshooting.stepFunctionUrl from status.json
    1. If the execution failed, check your runner provider configuration for errors
    2. If the execution is still running for a long time, check the execution events to see why runner starting is being retried
    3. If there are no relevant executions, move to the next step
  4. Confirm the webhook Lambda was called by visiting the URL in troubleshooting.webhookHandlerUrl from status.json
    1. If it's not called or logs errors, confirm the webhook settings on the GitHub side
    2. If you see too many errors, make sure you're only sending workflow_job events
  5. When using GitHub app, make sure there are active installations in github.auth.app.installations

All logs are saved in CloudWatch.

  • Log group names can be found in status.json for each provider, image builder, and other parts of the system
  • Some useful Logs Insights queries can be enabled with GitHubRunners.createLogsInsightsQueries()

To get status.json, check out the CloudFormation stack output for a command that generates it. The command looks like:

aws --region us-east-1 lambda invoke --function-name status-XYZ123 status.json

Monitoring

There are two important ways to monitor your runners:

  1. Make sure runners don't fail to start. When that happens, jobs may sit and wait. Use GitHubRunners.metricFailed() to get a metric for the number of failed runner starts. You should use this metric to trigger an alarm.
  2. Make sure runner images don't fail to build. Failed runner image builds mean you will get stuck with out-of-date software on your runners. This may lead to security vulnerabilities, or to slower runner start-ups as the runner software itself needs to be updated. Use GitHubRunners.failedImageBuildsTopic() to get an SNS topic that gets notified of failed runner image builds. You should subscribe to this topic.

Other useful metrics to track:

  1. Use GitHubRunners.metricJobCompleted() to get a metric for the number of completed jobs broken down by labels and job success.
  2. Use GitHubRunners.metricTime() to get a metric for the total time a runner is running. This includes the overhead of starting the runner.

Contributing

If you use and love this project, please consider contributing.

  1. 🪳 If you see something, say something. Issues help improve the quality of the project.
    • Include relevant logs and package versions for bugs.
    • When possible, describe the use-case behind feature requests.
  2. 🛠️ Pull requests are welcome.
    • Run npm run build before submitting to make sure all tests pass.
    • Allow edits from maintainers so small adjustments can be made easily.
  3. 💵 Consider sponsoring the project to show your support and optionally get your name listed below.

Other Options

  1. philips-labs/terraform-aws-github-runner if you're using Terraform
  2. actions/actions-runner-controller if you're using Kubernetes

cdk-github-runners's People

Contributors

beeehappyandfree, chrisneal, christophgysin, cloudsnorkelbot, diegoaguilar, jacobhjustice, jaypea, kichik, mgius-ae, pharindoko, quad, quinnypig, toast-gear

cdk-github-runners's Issues

Unrecognized labels cancel entire workflow

From #72 (comment)

In your case it sounds like it's because we expect every job with the self-hosted label to also have a label we recognize. It sounds like you have multiple installations of cdk-github-runners in different accounts, each implementing different labels. That means some of the installations will not recognize the label and cancel the job. We can probably add a flag to disable this behavior. Let's do this in a separate ticket.

@laxgoalie392

Web-based interface to automatically setup GitHub app

We can create GitHub App from manifest. It doesn't include client id and secret, but are those truly required? Can we use API to generate them? It also doesn't install the app, but there is probably a way to direct the user to the right URL automatically. We can then collect the installation webhook event and save installation id from there.

There are security considerations with this. We definitely don't want to let anyone who can guess the URL attach itself to our service. But we want to keep it simple so authentication against Cognito or IdP is a bit too much. Maybe we can settle for a one time setup token that gets deleted once the installation is complete. It can be given to the user in the stack output or with another function like the status function.

FargateRunner: missing property to select subnets

Hey @kichik,

thanks for this amazing piece of software.
I'm missing the option to explicitly set a "subnetSelection" property for the subnets of the Fargate runner.
Is there a reason it was left out?

   const fargateRunner = new FargateRunner(this, "fargate runner", {
      label: "fargate",
      vpc: existingVpc,
      ...
      ...
      subnetSelection: vpcSubnets,

    });

background:

I used this

    const fargateRunner = new FargateRunner(this, "fargate runner", {
      label: "fargate",
      vpc: existingVpc,
      assignPublicIp: false,
      spot: true,
    });

and received this error message

(node:27784) UnhandledPromiseRejectionWarning: Error: There are no 'Private' subnet groups in this VPC. Available types: Isolated,Deprecated_Isolated

br,

flo

Lambda Runners do not allow writing to disk

Hi!

When using a Lambda runner I cannot install dependencies at runtime. It looks like $HOME would need to be set to /tmp so that the Lambda has write privileges.

Error:

npm WARN read-shrinkwrap This version of npm is compatible with lockfileVersion@1, but package-lock.json was generated for lockfileVersion@2. I'll try to do my best with it!
npm ERR! correctMkdir failed to make directory /home/sbx_user1051/.npm/_locks
npm WARN @[email protected] No repository field.

npm ERR! code EROFS
npm ERR! syscall mkdir
npm ERR! path /home/sbx_user1051
npm ERR! errno -30
npm ERR! rofs EROFS: read-only file system, mkdir '/home/sbx_user1051'
npm ERR! rofs Often virtualized file systems, or other file systems
npm ERR! rofs that don't support symlinks, give this error.
Error: Process completed with exit code 226.

To reproduce

Action:

name: Cypress tests - Integration
on: [push]

jobs:
  cypress-tests:
    runs-on: ["self-hosted", "lambda_m1024_s1024"]
    environment:
      name: development
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Prepare environment
        run: |
          npm  install

Runner:

        self.lambda_runner = LambdaRunner(
            scope=self,
            id="LambdaRunner",
            log_retention=logs.RetentionDays.ONE_MONTH,
            labels=["lambda_m1024_s1024"],
            vpc=self.vpc,
            security_group=self.security_group,
            memory_size=1024,
            ephemeral_storage_size=cdk.Size.gibibytes(1),
            timeout=cdk.Duration.minutes(10),
            subnet_selection=ec2.SubnetSelection(subnet_group_name="PrivateSubnet"),
        )

Bursting EC2 starts will fail

Starting many instances at once is error prone, and will fail. This can happen when many jobs are submitted simultaneously. For us, we have several workflows for each PR, and we hit this limit pretty easily.

{
  "resourceType": "aws-sdk:ec2",
  "resource": "runInstances.waitForTaskToken",
  "error": "Ec2.Ec2Exception",
  "cause": "Request limit exceeded. (Service: Ec2, Status Code: 503, Request ID: dde217ee-eb44-4443-98a6-5bd134a8bf0f)"
}

I think a retry needs to be added, that handles this specific error case.
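
A minimal sketch of what such a retry might look like, written as plain TypeScript rather than as the project's actual fix. It assumes the error object has the same "error" and "cause" fields as the JSON above; `isRequestLimitError`, `backoffMs`, and `withRetry` are hypothetical helpers. Inside the state machine itself, a retry policy on the task would be the more natural place for this.

```typescript
// Error shape mirroring the JSON above: an error name and a cause message.
interface TaskError {
  error: string;
  cause: string;
}

// Only the throttling case from this issue is treated as retryable.
function isRequestLimitError(e: unknown): boolean {
  const err = e as Partial<TaskError> | null;
  return (
    !!err &&
    err.error === 'Ec2.Ec2Exception' &&
    typeof err.cause === 'string' &&
    err.cause.indexOf('Request limit exceeded') >= 0
  );
}

// Exponential backoff: base, 2x base, 4x base, ... capped at maxMs.
function backoffMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Runs the operation, retrying only request-limit errors, up to maxAttempts tries.
// The sleep function is injectable so the delay can be skipped in tests.
async function withRetry<T>(
  run: () => Promise<T>,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await run();
    } catch (e) {
      if (!isRequestLimitError(e) || attempt + 1 >= maxAttempts) throw e;
      await sleep(backoffMs(attempt));
    }
  }
}
```

Any other error, or a request-limit error on the last attempt, is rethrown unchanged so the failure stays visible.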

GitHub server does not support configuring a self-hosted runner with 'DisableUpdate' flag.

Trying to use this library with my GHE resulted in the following error message:

The GitHub server does not support configuring a self-hosted runner with 'DisableUpdate' flag.

I'm using the codebuild provider and narrowed it down to the --disableupdate flag.

I'm new to this whole topic, but manually removing it from my buildspec resolved the issue. I'm guessing that the flag might even be irrelevant anyway since the runners are ephemeral?

What are the impacts of removing that flag? Could we make it optional?

EC2 runners

  • Spot pricing
  • Docker support for Windows
  • Only way to get MacOS support?
  • Many more instance size options

Missing export for AmiBuilder

Similar to #161, the exports for AmiBuilder and co are missing.

I would suggest refactoring and simplifying how this project handles exports, to hopefully avoid similar issues in the future.

running docker compose inside codebuild

Scenario is: in the CodeBuild Provider, when I try to run

docker compose up -d

in my GitHub Actions script, I get an error output like this:

unknown shorthand flag: 'd' in -d

I'm using the LINUX_X64_DOCKERFILE_PATH for my docker image, I guess the "issue" is with docker-in-docker.

It's possible to run it with docker-compose up -d but since docker-compose has been deprecated, I think it's a good move to update this. What do you think?

Runner image customization

User should be able to install additional dependencies, setup internal repositories, and in general fully customize the image to their needs.

Mac OS runner

Challenges:

  • Must pay for at least 24 hours of usage of a dedicated host for each test (sponsorship?)
  • Instance can take 15-40 minutes to start (keeping some runners warm is a must?)
  • EC2 Image Builder doesn't support Mac OS X (packer?)

[Question] How do I utilize the role that a codebuild runner is using?

I'm using a codebuild runner and trying to run an action that relies on aws credentials.

i've tried using the following step in my workflow:

- name: Test
  run: aws sts get-caller-identity

but i end up getting the following error:

Unable to locate credentials. You can configure credentials by running "aws configure".

I also tried using the aws configure credentials action but am hitting basically the same error

- name: AWS Secure Access
  uses: aws-actions/configure-aws-credentials@v1
  with:
    aws-region: us-east-1

My first impression was that I would be able to inherit the role that the codebuild runner was using

Make all props structures optional

For example:

constructor(scope: Construct, id: string, props: LambdaRunnerProps) {

Should be:

constructor(scope: Construct, id: string, props?: LambdaRunnerProps) {

feat: multiple labels

Would be nice to have the option to set multiple labels (String Array) instead of just one label for a runner.

ImageBuilder getting stuck due to lack of TZ

When installing extra packages with ImageBuilder, we might come across a tzdata dependency. When that happens, codebuild will get stuck here:

Configuring tzdata
------------------

Please select the geographic area in which you live. Subsequent configuration
questions will narrow this down by presenting a list of cities, representing
the time zones in which they are located.

1. Africa        6. Asia            11. System V timezones
2. America       7. Atlantic Ocean  12. US
3. Antarctica    8. Europe          13. None of the above
4. Australia     9. Indian Ocean
5. Arctic Ocean  10. Pacific Ocean

I suggest setting the TZ to UTC in all docker images, like this:

ENV TZ=UTC
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

Re-use EC2 runner instances

Does it make sense to launch multiple EC2 instances for each workflow?

This really slows the process down. I think ideally it should launch one instance, and then re-use it for all runs, and then when idle - shut down.

Missing export of Ec2Runner and Ec2RunnerProps

Ec2Runner was introduced in the latest version. I wanted to give it a try and came across a missing export of Ec2Runner and Ec2RunnerProps in index.ts

import { Ec2Runner} from "@cloudsnorkel/cdk-github-runners";
lib/github-runners-stack.ts:15:3 - error TS2305: Module '"../node_modules/@cloudsnorkel/cdk-github-runners/lib"' has no exported member 'Ec2Runner'.

Seems the export is missing.

Bug: multi-labels don't work as expected

I added the ec2 runner for test ...

cdk code:

    const cloudPlatformsEc2Runner = new Ec2Runner(
      this,
      `${props.serviceName}-${props.stage}-ec2-runner`,
      {
        labels: ["cloudplatforms", "ec2"],
        vpc: existingVpc,
        subnetSelection: vpcSubnets,
        spot: true,
        instanceType: ec2.InstanceType.of(ec2.InstanceClass.T3, ec2.InstanceSize.MICRO),
      }
    );

workflow.yml:

name: fargate example
on: workflow_dispatch
jobs:
  self-hosted:
    runs-on: [self-hosted,cloudplatforms,ec2]
    steps:
      - run: echo hello world

stepfunctions:

these labels have been transmitted:

  "labels": {
    "self-hosted": true,
    "cloudplatforms": true,
    "ec2": true
  },

result:

it's taking the wrong label :O

image

I assume because of this weird parent label
image

Best way to scope lambda functions Allowed principals

Discussed in #137

Originally posted by diegoaguilar October 28, 2022
I'm deploying hosted runners into an application with some automated policy checks. Both the setup and webhook handler Lambdas are getting flagged because they're pretty open and publicly exposed.

Is there a way to get access to these? Should I resolve these from the outputs and then configure them? I want to avoid doing a sort of manual check and then getting code to invoke by ARN or something.

Should these even be taken and resolved from the outputs somehow?

Expectations

Allow the following configurable and selectable setups:

  1. Default: Create Lambda URL for the function (simplest zero headache approach)
  2. Disable function access
  3. Create API Gateway for the function that limits access to just GitHub webhook IPs (statically defined in code with action on this project that updates it once in a while, assuming they don't change those IPs often)
  4. Create API Gateway for the function that limits access to a given list of IPs/CIDRs
  5. Nice-to-have: private API Gateway that can only be accessed from certain VPC where GitHub Enterprise is installed
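
Options 3 and 4 both come down to checking the caller's address against a list of CIDR blocks. The sketch below shows only that check, for IPv4, as a rough illustration; it is not how API Gateway or the library enforces access. `ipv4ToInt`, `inCidr`, and `isAllowed` are hypothetical helpers, and the addresses in the usage note are reserved documentation ranges, not GitHub's webhook IPs.

```typescript
// Converts a dotted IPv4 address to an unsigned 32-bit number.
function ipv4ToInt(ip: string): number {
  const parts = ip.split('.');
  if (parts.length !== 4) throw new Error(`invalid IPv4 address: ${ip}`);
  let value = 0;
  for (const part of parts) {
    const n = Number(part);
    if (!/^\d{1,3}$/.test(part) || n > 255) {
      throw new Error(`invalid IPv4 address: ${ip}`);
    }
    value = value * 256 + n;
  }
  return value;
}

// True when ip falls inside the block, for example '203.0.113.0/24'.
function inCidr(ip: string, cidr: string): boolean {
  const slash = cidr.indexOf('/');
  if (slash < 0) throw new Error(`invalid CIDR: ${cidr}`);
  const base = cidr.slice(0, slash);
  const prefixText = cidr.slice(slash + 1);
  const prefix = Number(prefixText);
  if (!/^\d{1,2}$/.test(prefixText) || prefix > 32) {
    throw new Error(`invalid CIDR: ${cidr}`);
  }
  // Two addresses share a block when they agree after dropping the host bits.
  const blockSize = Math.pow(2, 32 - prefix);
  return Math.floor(ipv4ToInt(ip) / blockSize) === Math.floor(ipv4ToInt(base) / blockSize);
}

// True when ip matches at least one block in the allow list.
function isAllowed(ip: string, allowList: string[]): boolean {
  return allowList.some((cidr) => inCidr(ip, cidr));
}
```

For example, `isAllowed('203.0.113.7', ['203.0.113.0/24'])` is true, while an address in 203.0.114.0/24 would be rejected.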

Metrics

Label distribution and failures. More?

feat: Ec2Runner VPC selection

I just tried to use the EC2 runners. Our accounts do not have a default VPC, so the stack cannot be deployed.

const ec2Provider = new Ec2Runner(this, "ec2Runner", {
      subnet: subnets[0],
      spot: true,
    })

10:28:42 AM | CREATE_FAILED        | AWS::ImageBuilder::Image                       | Dev/GitHubRunners/...mage Builder/Image
Resource handler returned message: "Error occurred during operation 'No default VPC for this user. GroupName is only supported for EC2-Classic and default VPC.'." (RequestToken: ...., HandlerErrorCode: GeneralServiceException)

Currently the Ec2RunnerProps also do not provide the option to set a VPC.

https://github.com/CloudSnorkel/cdk-github-runners/blob/main/src/providers/ec2.ts#L126-L186

ECR login does not work with inherited role

Related to #113.

I haven't looked into this one yet but i just want to jot it down.

I applied the fix in #114. Initial use of the inherited role worked to get secrets, but I ran into the following in a later step:

Error saving credentials: open /home/runner/.docker/config.json2271103188: permission denied

i'm trying to use this action:

- name: Login to Amazon ECR
  uses: aws-actions/amazon-ecr-login@v1
  with:
    registries: "****"

feels like a file system issue

Image builder status

The status function should return some information on the image builder. Latest build, image digest, recent failures, etc.

Git version problem with Ubuntu 18.04

Problem

Because the images use ubuntu:18.04, the git version is 2.17.1, checkout@v3 throws this warning message:

The repository will be downloaded using the GitHub REST API
To create a local Git repository instead, add Git 2.18 or higher to the PATH

This causes some problems with committing inside the self-hosted runner, meaning you will get an error like this when you try to commit:

fatal: not a git repository (or any of the parent directories): .git

I don't know if there's a specific reason we use 18.04 but is it possible to bump the ubuntu version to 20.04 since it is LTS as well?

Bug: Ec2 runner

When I try to use the EC2 runner I get the following error message in the step function.

"error": {
    "Error": "Ec2.Ec2Exception",
    "Cause": "The provided credentials do not have permission to create the service-linked role for EC2 Spot Instances. (Service: Ec2, Status Code: 403, Request ID: 0ed21094-6fec-467b-bc8f-777207682f73)"
  },

Allow role to be passed in

We have an existing build role that I would like to reuse. Would we be able to allow that to be passed in and used?

Logging into ECR private registry with Docker

When logging into a private ECR, the runner has this error:

Error: Could not login to registry ***: WARNING! Using -*** the CLI is insecure. Use --password-stdin.
WARNING! Your password will be stored unencrypted in /home/runner/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Error saving credentials: open /home/runner/.docker/config.json3597419489: permission denied

any idea why? this is the action I'm using

- name: login to ECR
  env:
    REGION: ${{ secrets.AWS_REGION }}
    ACCOUNT_ID: ${{ secrets.AWS_ACCOUNT_ID }}
  run: aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com

when I run the same step on github servers, I get the same warning, but without the error:

Run aws ecr get-login*** $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com
WARNING! Your password will be stored unencrypted in /home/runner/.docker/config.json.
Configure a credential helper to remove this warning. See
Login Succeeded
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

I'm suspicious of that file name, config.json3597419489, 🤔 maybe I'm missing something in my command?

Better error handling

Problematic scenarios

The following scenarios are not properly handled yet.

  1. User cancels workflow. We currently don't recognize the cancellation and if the user cancels the workflow before the runner boots up, the runner will stay there until it times out or another job comes along. If the runner is finally assigned another job, that means it took it away from another runner that was started just for the new job. So this can cause an endless resource waste cycle where a runner is always on, even when a job is not running. For Fargate that has no time limit, this can mean the runner will basically run forever.
  2. Runner failure like configuration issues, missing capacity, or any random AWS failure. If the runner fails to even get the job, the job will just sit there waiting for the next runner to boot. But as we only create one runner per job, that means the old job is stealing the runner from the new job. The old job had to wait for the new job, and the new job will have to wait for the next job to come up. That could delay jobs for no reason. Our current solution for that is cancelling the workflow so the failure is clear and no job stealing occurs. However this can lead back to scenario number 1 if it happens fast enough.

Another complication to all of this is having to remove runners. There is a limited number of self-hosted runners allowed per-repo, and if we don't clean them up, we can get stuck unable to add runners. The runner usually removes itself, but any error conditions like provider timeout (lambda runs out of time), can prevent removal. That's why we always delete the runner using API on error. But to be able to delete the runner, we have to first stop any jobs running on it. That's another reason why we sadly have to cancel the entire workflow.

Wishlist

If we can get GitHub to make some changes, the following would really help simplify the solution for these corner cases.

  1. GitHub runner should support a timeout configuration that allows us to tell it to only wait a certain amount of time before giving up on getting a job.
  2. GitHub runner should allow us to configure a runner for a specific job and only for that job.
  3. GitHub API should allow us to mark just one job as failed instead of the whole workflow.

Other solutions

Assuming we can't get our wishlist items, here are some incomplete ideas to help resolve these issues.

  • Have the step function monitor the job using /repos/{owner}/{repo}/actions/runs/{run_id}/jobs or the other one that has attempt information too. If we detect the job was cancelled, we can stop the runner. This means we won't be able to use .sync variants and will have to monitor the runners in the step function. This will also not work for Lambda runners as you can't stop a Lambda execution. But Lambdas are limited to 15 minutes which is not too bad.
  • Monitor the job with the method above and stop the runner if the job hasn't started in 5 minutes. That would mean the runner was stolen or the labels don't match or there is another issue. Will this work with multiple jobs running at the same time "stealing" runners from each-other? We already create the runners in the repo scope to limit cross-job stealing.
  • Figure out the undocumented actions API so we can fail a single job. We have the credentials and connection information in the .runner and .credentials files. Another option is creating a special runner fork that fails a single job.
  • Some kind of external to the step function monitor that makes sure all jobs are behaving right and has the power to start more runners if needed so stolen runners can be "fixed".
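
The first two ideas reduce to a small decision made each time the monitor polls the job. The sketch below shows only that decision. It assumes the status and conclusion fields that the jobs endpoint named above reports for each job, and it uses the 5 minute grace period suggested above. `JobState` and `decideRunnerAction` are hypothetical names, and polling and actually stopping the runner are left out.

```typescript
// Subset of the fields the jobs endpoint reports for each job.
interface JobState {
  status: 'queued' | 'in_progress' | 'completed';
  conclusion: string | null; // for example 'success', 'failure', or 'cancelled'
}

type RunnerAction = 'keep' | 'stop';

// Decides what to do with the runner that was started for this job.
// Times are in milliseconds since the epoch.
function decideRunnerAction(
  job: JobState,
  runnerStartedAt: number,
  now: number,
  startGraceMs = 5 * 60 * 1000,
): RunnerAction {
  // A completed job (including a cancelled one) no longer needs its runner.
  if (job.status === 'completed') return 'stop';
  // Still queued after the grace period: the runner was probably stolen,
  // the labels don't match, or something else went wrong.
  if (job.status === 'queued' && now - runnerStartedAt > startGraceMs) return 'stop';
  return 'keep';
}
```

As noted above, this would not help Lambda runners, since a Lambda execution can't be stopped from the outside.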

Setup page is blank (only seems to work in Google Chrome?)

I deployed a simple CodeBuild setup via Python and CDK, but visiting the github.setup.url just shows a blank page. The JavaScript console has these errors (I've replaced sensitive IDs with <my-foo>):

Content Security Policy: The page's settings blocked the loading of a resource at inline ("default-src"). [<my-function>.lambda-url.eu-west-2.on.aws:7:1](https://<my-function>.lambda-url.eu-west-2.on.aws/?token=<my-token>)
Content Security Policy: The page's settings blocked the loading of a resource at inline ("default-src"). [<my-function>.lambda-url.eu-west-2.on.aws:28:1](https://<my-function>.lambda-url.eu-west-2.on.aws/?token=<my-token>)
Content Security Policy: The page's settings blocked the loading of a resource at https://<my-function>.lambda-url.eu-west-2.on.aws/favicon.ico ("img-src"). resource:186:19
Content Security Policy: The page's settings blocked the loading of a resource at inline ("default-src"). moz-extension:1:52727

This was in Firefox. I tried installing Google Chrome, and the page seems to work in that.

Do not try to delete idle providers if no runner label match

There was no label match, yet the state machine still waits to reap idle providers. No real harm here, but I think it'd be more optimized if it didn't wait in the cases where there was no match.

In fact, ideally, I think this selection should happen as early as possible.


Question: About lambda function URLs

I have a question about the webhook Lambda that GitHub calls(?): is there any way to avoid unnecessary invocations? For example, if I have the function URL, I still invoke the function even though I get an unauthorized message every time. Is there a way to limit these invocations?

I've tried creating a security group in a VPC which only allows for github action's IPs to invoke the function, but I get an error due to there being too many IPs 😄.

Upgrade Instructions

What's the "golden path" to upgrade the version of the runner contained within the Docker image (specifically for Lambda, but across the board I suppose)? GitHub is whining about it being outdated on Action invocation...

bug: ContainerImageBuilder - cannot find image for Windows

Hey @kichik,

can't get that image - maybe because I'm in another region .. eu-central-1
I wanted to build a windows container image for codebuild ...

code

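    import { Duration, aws_ec2 as ec2 } from 'aws-cdk-lib';
    import { Architecture, ContainerImageBuilder, Os, RunnerVersion } from '@cloudsnorkel/cdk-github-runners';

    // Runs inside a construct or stack, where `this` and `props` are in scope
    const windowsDefaultImage = new ContainerImageBuilder(
      this,
      `${props.serviceName}-${props.stage}-windows-image`,
      {
        architecture: Architecture.X86_64,
        os: Os.WINDOWS,
        runnerVersion: RunnerVersion.latest(),
        rebuildInterval: Duration.days(14),
        instanceType: ec2.InstanceType.of(
          ec2.InstanceClass.T3A,
          ec2.InstanceSize.LARGE
        ),
      }
    );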

error:

The following required resource 'Image' cannot be found: 'arn:aws:imagebuilder:us-east-1:aws:image/windows-server-2019-x86-core-ltsc2019-amd64/2020.12.8'
