Comments (24)

kichik commented on August 24, 2024

No chance. There are just too many repos. This is no fun even when you have more than 10 repos.

Wouldn't you have the same problem with your proposed solution? You'd have to specify all those labels somewhere. Maybe your CDK code could automatically generate them using gh?
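
A minimal sketch of what that generation could look like, as an assumption rather than an established pattern: shell out to the gh CLI at synth time and derive one label per repository (the org name and label prefix are illustrative).

import { execSync } from 'node:child_process';

// Illustrative only: list the org's repository names with the gh CLI at synth
// time and turn each one into a runner label. Assumes `gh` is installed and
// authenticated wherever `cdk synth` runs.
const repos: { name: string }[] = JSON.parse(
  execSync('gh repo list my-org --limit 1000 --json name').toString(),
);

// e.g. ['repo-frontend', 'repo-backend', ...] to feed into per-repo providers
const repoLabels = repos.map((repo) => `repo-${repo.name}`);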

The schedule option is something I have considered often, for many reasons. I am still not sure exactly what I want it to look like or if it should even be a schedule and not some kind of monitoring system for stolen/missing runners.

You can already define a schedule manually, if you need to. I have done something similar for one of my installations.

// Assumed imports for this snippet (aws-cdk-lib v2):
import * as cdk from 'aws-cdk-lib';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as stepfunctions from 'aws-cdk-lib/aws-stepfunctions';

const sm = runners.node.findChild('Runner Orchestrator') as stepfunctions.StateMachine;
const rule = new events.Rule(this, 'Start Runner Rule', {
  schedule: events.Schedule.rate(cdk.Duration.hours(1)),
  targets: [
    new targets.SfnStateMachine(
      sm, 
      {
        input: events.RuleTargetInput.fromObject({
          "owner": "my-org",
          "repo": "my-repo",
          "jobId": 0,
          "jobUrl": "manual-runner",
          "installationId": xxx, // TODO get from github secret
          "labels": {
            "self-hosted": true,
            "something": true, // TODO update
          },
          "provider": "stack-name/provider-id" // TODO update
        }) 
      }
    )
  ],
});

kichik commented on August 24, 2024

It doesn't happen all the time, right? And for most users it's quite OK how it works, even with org runners.
But for some of them it's not acceptable when it's used for production deployments.
And for those it would be fine if they need to attach additional labels to make it more reliable.

You can create a provider just for those special cases then. Isn't that basically what you were asking for? What am I missing?

Hmm, that's a nice idea. But I can't do this statically in CDK code for all repos. I would get lost.

Since you're registering on organization level, the repo part doesn't really matter. You can leave it as a placeholder. No need for Lambda or anything.

pharindoko commented on August 24, 2024

@kichik I hope you have seen my invitation - I created a private repo for this.

kichik commented on August 24, 2024

Can you please include the specific labels used? It will be easier to discuss with a concrete example.

pharindoko commented on August 24, 2024

As mentioned, this might occur no matter whether the runners are registered on org or repo level.
But it will happen more often on org level because more jobs are executed.

Background:

We now register the runners on org level, which means they are visible to all repositories the GitHub app is configured for in the organization.

The runner has the labels [self-hosted, large-runner].
We have workflows in two different repositories A and B in the same org that use these labels.

We trigger workflow A in repository A using workflow dispatch.
This workflow A gets queued and queued...

In the meantime another workflow B, which had already been waiting 15 hours for a runner with the labels [self-hosted, large-runner], takes over this runner.
There can be several reasons why this happens, e.g. the user forgot to select the organization's runner group setting Allow public repositories while having a public repository.

I can see this directly in the CloudWatch logs, because in this case the workflow job failed and returned Job FrontendDeployment completed with result: Failed.

Job FrontendDeployment is only available in repository B for workflow B.
The step function succeeded, the EC2 instance was terminated...
Workflow A still waits for a runner...

kichik commented on August 24, 2024

That would be one of the cons of organization-level registration:

https://github.com/CloudSnorkel/cdk-github-runners/blob/994f474deb38c904d94d2ad824b0493bac09d942/setup/src/App.svelte#L333C1-L334C49

You can bring up more runners to take over the stolen ones by creating a new step function execution. Open an existing one from the UI and hit the New Execution button.

I'm still not sure what you originally meant with the labels. Where would the label be added to resolve this?

pharindoko commented on August 24, 2024

I assume you can have this issue no matter whether it's an org-level or repo-level runner.
The probability is just higher on org level.
Furthermore, as a user you might not see all repos in the org, and it's quite hard to find jobs that got stuck and are still waiting for a runner if you have a lot of repos in one org.

Here's a sample solution proposal:

  1. Runner in the CDK solution is created with labels: [label-for-instance-type]
  2. User creates a workflow with [self-hosted, label-for-instance-type, label-for-repository]
  3. Workflow gets triggered and webhook.lambda.ts is triggered with labels [self-hosted, label-for-instance-type, label-for-repository];
    matchLabelsToProvider only checks that the runner labels are included: ["self-hosted", "label-for-instance-type"]
  4. Step function determines a runner based on the labels that match a provider: ["self-hosted", "label-for-instance-type"]
  5. All labels [self-hosted, label-for-instance-type, label-for-repository] are injected into the runner and attached to the config.sh command.
  6. Runner is registered in GitHub with labels [self-hosted, label-for-instance-type, label-for-repository]

This would mean that runner labels and workflow labels can be different and you can attach additional workflow labels (a rough sketch of the matching is below).
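
A rough sketch of the matching described in step 3, purely illustrative of this proposal; the function signature and the provider shape are assumptions, not the library's actual code:

// Illustrative subset check: a provider matches a job when every label the
// provider registers with is present in the job's requested labels.
interface ProviderLabels {
  provider: string;
  labels: string[]; // e.g. ['self-hosted', 'label-for-instance-type']
}

function matchLabelsToProvider(
  jobLabels: string[],
  providers: ProviderLabels[],
): ProviderLabels | undefined {
  const requested = new Set(jobLabels.map((label) => label.toLowerCase()));
  return providers.find((p) =>
    p.labels.every((label) => requested.has(label.toLowerCase())),
  );
}

// Example: extra workflow-only labels like 'label-for-repository' still match,
// and would later be passed through to config.sh when registering the runner.
const match = matchLabelsToProvider(
  ['self-hosted', 'label-for-instance-type', 'label-for-repository'],
  [{ provider: 'stack/provider-id', labels: ['self-hosted', 'label-for-instance-type'] }],
);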

I'll try to create a PR and test this to see if it could work.
@kichik please let me know if you have concerns ...

kichik commented on August 24, 2024

That won't prevent other repositories from using label-for-repository and still stealing the runner. Our label situation is already not the simplest to understand. I think this will make it even harder and still won't solve runner stealing.

pharindoko commented on August 24, 2024

Well, it's not like people steal runners intentionally. And label-for-repository is just an example of a tag.

You could use multiple tags and choose whatever you want.
But it could happen that people just copy/paste those labels from another GitHub Actions workflow in another repository by mistake.
It would still heavily reduce the probability that one workflow steals another one's runner.
And you can trace it in the logs as well.

pharindoko commented on August 24, 2024

btw if you know a better solution for this dilemma, please let me know :)

kichik commented on August 24, 2024

But it could happen that people just copy/paste those labels from another GitHub Actions workflow in another repository by mistake.

Exactly.

Have you considered creating multiple providers all using the same settings and image builder but with different labels? That feels like basically what you're asking for. Each runner will start with exactly the labels you want and for just the repositories you want.
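
For reference, a rough sketch of what that could look like with cdk-github-runners, assuming a shared image builder and illustrative per-repo labels; the provider type and label names are placeholders, not a prescribed setup:

import { Construct } from 'constructs';
import {
  CodeBuildRunnerProvider,
  GitHubRunners,
} from '@cloudsnorkel/cdk-github-runners';

export class RunnersSetup extends Construct {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    // One image builder shared by all providers, so only the labels differ.
    const imageBuilder = CodeBuildRunnerProvider.imageBuilder(this, 'image builder');

    // Illustrative per-repo labels; each provider gets its own label set.
    const repoLabels = ['repo-a', 'repo-b'];
    const providers = repoLabels.map(
      (label) =>
        new CodeBuildRunnerProvider(this, `provider ${label}`, {
          imageBuilder,
          labels: ['large-runner', label],
        }),
    );

    new GitHubRunners(this, 'runners', { providers });
  }
}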

btw if you know a better solution for this dilemma, please let me know :)

Repository level registration is the only thing I found. I do wish GitHub would let us aim specific runners at specific jobs, but they have no such option.

pharindoko commented on August 24, 2024

But it could happen that people just copy/paste those labels from another GitHub Actions workflow in another repository by mistake.

Exactly.

Have you considered creating multiple providers all using the same settings and image builder but with different labels? That feels like basically what you're asking for. Each runner will start with exactly the labels you want and for just the repositories you want.

No chance. There are just too many repos. This is no fun even when you have more than 10 repos.

btw if you know a better solution for this dilemma, please let me know :)

Repository level registration is the only thing I found. I do wish GitHub would let us aim specific runners at specific jobs, but they have no such option.

What I struggle with is the administration permission on repo level. A single permission to manage runners on the repository level is still missing.

Another option

Add a schedule feature to start additional runners on a regular basis by

  • querying the CloudWatch logs to get the organizations that used the runner solution in e.g. the last 2 weeks and starting runners
  • querying the GitHub instance to get all jobs that have been queued for longer than e.g. 6 hours and starting runners

The idleTimeout setting will terminate those EC2 instances and deregister the runners by default if no job is picked up after 5 minutes, right?

@kichik please give me feedback.
I could try to create a draft PR for this topic.

pharindoko commented on August 24, 2024

No chance. There are just too many repos. This is no fun even when you have more than 10 repos.

Wouldn't you have the same problem with your proposed solution? You'd have to specify all those labels somewhere. Maybe your CDK code could automatically generate them using gh?

It doesn't happen all the time, right? And for most users it's quite OK how it works, even with org runners.
But for some of them it's not acceptable when it's used for production deployments.
And for those it would be fine if they need to attach additional labels to make it more reliable.

The schedule option is something I have considered often, for many reasons. I am still not sure exactly what I want it to look like or if it should even be a schedule and not some kind of monitoring system for stolen/missing runners.

You can already define a schedule manually, if you need to. I have done something similar for one of my installations.

const sm = runners.node.findChild('Runner Orchestrator') as stepfunctions.StateMachine;
const rule = new events.Rule(this, 'Start Runner Rule', {
  schedule: events.Schedule.rate(cdk.Duration.hours(1)),
  targets: [
    new targets.SfnStateMachine(
      sm, 
      {
        input: events.RuleTargetInput.fromObject({
          "owner": "my-org",
          "repo": "my-repo",
          "jobId": 0,
          "jobUrl": "manual-runner",
          "installationId": xxx, // TODO get from github secret
          "labels": {
            "self-hosted": true,
            "something": true, // TODO update
          },
          "provider": "stack-name/provider-id" // TODO update
        }) 
      }
    )
  ],
});

Hmm, that's a nice idea. But I can't do this statically in CDK code for all repos. I would get lost.

Create a schedule to trigger a Lambda.
Create a Lambda to query CloudWatch Logs Insights to get the owner, repo and labels aggregated over the last 2 weeks.
Go through all rows found and trigger the state machine (a rough sketch of such a Lambda is below).
I'll try.
I still would like to have this as part of the CDK construct :)
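
A rough sketch of what such a Lambda could look like, under the assumption that the orchestrator's log group contains the execution input; the Insights query, field names, and environment variables are illustrative and depend on what the state machine actually logs:

import {
  CloudWatchLogsClient,
  StartQueryCommand,
  GetQueryResultsCommand,
} from '@aws-sdk/client-cloudwatch-logs';
import { SFNClient, StartExecutionCommand } from '@aws-sdk/client-sfn';

const logs = new CloudWatchLogsClient({});
const sfn = new SFNClient({});

export async function handler(): Promise<void> {
  const now = Date.now();

  // Illustrative Insights query: collect distinct owner/repo/label combinations
  // seen in the orchestrator logs over the last two weeks. The real field names
  // depend on how the state machine input is actually logged.
  const { queryId } = await logs.send(new StartQueryCommand({
    logGroupName: process.env.RUNNER_LOG_GROUP!, // assumption: wired up by CDK
    startTime: Math.floor((now - 14 * 24 * 60 * 60 * 1000) / 1000),
    endTime: Math.floor(now / 1000),
    queryString: 'fields owner, repo, labels | dedup owner, repo, labels',
  }));

  // Poll until the query finishes (simplified; no timeout or error handling).
  let results = await logs.send(new GetQueryResultsCommand({ queryId }));
  while (results.status === 'Running' || results.status === 'Scheduled') {
    await new Promise((resolve) => setTimeout(resolve, 1000));
    results = await logs.send(new GetQueryResultsCommand({ queryId }));
  }

  for (const row of results.results ?? []) {
    const field = (name: string) => row.find((c) => c.field === name)?.value;

    // Start a new orchestrator execution per owner/repo combination, mirroring
    // the manual EventBridge example above.
    await sfn.send(new StartExecutionCommand({
      stateMachineArn: process.env.RUNNER_STATE_MACHINE_ARN!, // assumption
      input: JSON.stringify({
        owner: field('owner'),
        repo: field('repo'),
        jobId: 0,
        jobUrl: 'scheduled-runner',
        installationId: 0, // would need to come from the GitHub app secret
        labels: { 'self-hosted': true }, // illustrative; derive from field('labels')
        provider: process.env.RUNNER_PROVIDER!, // assumption: stack-name/provider-id
      }),
    }));
  }
}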

pharindoko commented on August 24, 2024

OK, I found a solution that I think is quite OK to circumvent this problem:

[architecture diagram of the scheduler construct]

I added a construct with a step function and a few Lambdas with the following workflow:

  1. On the left side I get all the information about the transmitted jobs (repo, owner, job, labels) from the state machine logs.
  2. On the right side I query GitHub in a Lambda, authenticated as the GitHub app, to get all the organisation installations.
  3. Based on the "owner"|"organisation" property I merge the information.
  4. I check the jobs and their current state (has the job been queued for longer than 10 minutes?).
  5. I trigger new step function executions to get new runners with the specific labels registered in the right organisation.

I trigger this step function every half hour.

What would be great to get as output from the GithubRunners construct:

  • the state machine ARN (used to trigger new step function executions)
  • the log group name (used for the CloudWatch Insights query)

Currently this is what I need to execute the construct:

const stateMachine = githubRunners.node.children.find(
  (child) => child instanceof stepfunctions.StateMachine,
)! as stepfunctions.StateMachine;
const logGroupName = `${this.stackName}-state-machine`;

new GithubScheduledRunnerHandler(this, 'github-scheduled-runner', {
  internalSubnetIds: props.internalSubnetIds,
  vpcId: props.vpcId,
  logGroupName,
  githubSecretId: githubRunners.secrets.github.secretName,
  githubPrivateKeySecretId: githubRunners.secrets.githubPrivateKey.secretName,
  mappingTable: providers.map((provider) => {
    return { provider: provider.toString(), labels: provider.labels };
  }),
  stateMachineArn: stateMachine.stateMachineArn,
  githubRunnerWaitTimeInMinutes: 10,
});

pharindoko commented on August 24, 2024

Would you be interested in integrating this construct into your solution? Should I open a PR, or should I create a separate construct?
I could share the code in a private repo in a simplified version (not a perfect construct) if you want more details.
Please let me know @kichik.

kichik commented on August 24, 2024

I do want something like this one day, but I don't think we have all the details figured out yet. For example, when using ECS with a limited ASG, a job might be pending for >10 minutes just because the cluster is busy. If we keep submitting more jobs, it may only make the situation worse.

I do like that it doesn't go off just by the webhook. I have seen cases where the webhook was just not called and so runners were missing.

kichik commented on August 24, 2024

Please do share the code, if you can. I think I may have misunderstood the matching you were doing between jobs and the existing state machine. It might already cover the problem I mentioned.

kichik commented on August 24, 2024

@kichik I hope you have seen my invitation - I created a private repo for this.

I did. Thanks.

pharindoko commented on August 24, 2024

After more than a week of running the scheduler step function, I found a clear correlation between the webhook Lambda function error message ExecutionAlreadyExists and new runners triggered by my scheduler step function. It only seems to happen in this case.

Detailed error message in CloudWatch:

__type: com.amazonaws.swf.service.v2.model#ExecutionAlreadyExists
$fault: client
$metadata.attempts: 1
$metadata.httpStatusCode: 400
$metadata.requestId: f1cad492-8906-4967-9c99-f920a010bf90
$metadata.totalRetryDelay: 0
@ingestionTime: 1704816064769
@log: ********:/aws/lambda/gh-runner-ghrunnergithubrunnerWebhookHandlerwebhoo....
@logStream: 2024/01/09/[$LATEST]e18f1646a851483d93a0490663a2cde4
@message: 2024-01-09T16:00:56.047Z f8bf04d7-561b-4d7a-bbd0-b66f23121b91 ERROR Invoke Error {"errorType":"ExecutionAlreadyExists","errorMessage":"Execution Already Exists: 'arn:aws:states:********:********:execution:ghrunnergithubrunnerRunnerOrchestrator********:******'-44b33b20-af08-'","name":"ExecutionAlreadyExists","$fault":"client","$metadata":{"httpStatusCode":400,"requestId":"f1cad492-8906-4967-9c99-f920a010bf90","attempts":1,"totalRetryDelay":0},"__type":"com.amazonaws.swf.service.v2.model#ExecutionAlreadyExists","stack":["ExecutionAlreadyExists: Execution Already Exists: 'arn:aws:states:********:*******:execution:ghrunnergithubrunnerRunnerOrchestrator********:*****'"," at de_ExecutionAlreadyExistsRes (/var/runtime/node_modules/@aws-sdk/client-sfn/dist-cjs/protocols/Aws_json1_0.js:1694:23)"," at de_StartExecutionCommandError (/var/runtime/node_modules/@aws-sdk/client-sfn/dist-cjs/protocols/Aws_json1_0.js:1321:25)"," at process.processTicksAndRejections (node:internal/process/task_queues:95:5)"," at async /var/runtime/node_modules/@aws-sdk/middleware-serde/dist-cjs/deserializerMiddleware.js:7:24"," at async /var/runtime/node_modules/@aws-sdk/middleware-signing/dist-cjs/awsAuthMiddleware.js:14:20"," at async /var/runtime/node_modules/@aws-sdk/middleware-retry/dist-cjs/retryMiddleware.js:27:46"," at async /var/runtime/node_modules/@aws-sdk/middleware-logger/dist-cjs/loggerMiddleware.js:7:26"," at async Runtime.handler (/var/task/index.js:13997:21)"]}
@requestId: f8bf04d7-561b-4d7a-bbd0-b66f23121b91
@timestamp: 1704816056047

kichik commented on August 24, 2024

The name uses GitHub webhook delivery id to remain unique.

let executionName = `${payload.repository.full_name.replace('/', '-')}-${getHeader(event, 'x-github-delivery')}`.slice(0, 64);

Do you maybe have the same webhook hooked up more than once somehow?

pharindoko commented on August 24, 2024

The name uses GitHub webhook delivery id to remain unique.

let executionName = `${payload.repository.full_name.replace('/', '-')}-${getHeader(event, 'x-github-delivery')}`.slice(0, 64);

Do you maybe have the same webhook hooked up more than once somehow?

OK, then I assume it might be the API Gateway/Lambda retry mechanism...
I will check it.

pharindoko commented on August 24, 2024

I assume that this really only happens when the API Gateway (my internal customlambdaaccess implementation) does a retry...

It seems like this occurs for about 1% of the runner requests, and it's always the same correlation:
the step function ExecutionAlreadyExists message <> a 502 Bad Gateway error <> a queued job waiting for a runner in GitHub.

kichik commented on August 24, 2024

Can you tell what the initial failure is? It makes sense for the second webhook retry to fail with the ExecutionAlreadyExists message and return 502. But why would the first one fail and require a retry? Is it a timeout, maybe?

pharindoko commented on August 24, 2024

You're right, I should investigate this :) will do... a timeout could be an option, yes.
