
Comments (11)

PatrickXYS commented on July 19, 2024

Sorry for the late response given the limited bandwidth on my side.

A few things to bring up here:

The timeline of deprecating optional-test-infra:

  1. By May 23rd, I'll start filing PRs to remove presubmit requests to optional-test-infra for existing kubeflow repos.
  2. By June 6th, I'll remove the resources from the AWS account and send a deprecation report to the community.

https://github.com/orgs/kubeflow/teams/wg-automl-leads
https://github.com/orgs/kubeflow/teams/wg-manifests-leads
https://github.com/orgs/kubeflow/teams/wg-notebooks-leads
https://github.com/orgs/kubeflow/teams/wg-training-leads
@yuzisun

Kubeflow WG folks, let's start off finding proper alternatives and migrate to those solutions that comply with the timeline.


To answer the question from @kimwnasptd :

  1. How can someone new get access to the AWS infra, for example, see things via the AWS console?
    A: This is the hardest problem to solve here, because the current AWS account is a personal account. Unless people can obtain the trust of the account owner, AWS, and the Kubeflow community, I don't think it's possible.

  2. Do we have some relevant documentation on the moving parts of this infra? For example:

  • What is the webhook flow, triggered by GitHub?
    A: Yes, there are some webhooks configured in some kubeflow repos; they were set up by previous Google folks. There's no public documentation, given there are no well-defined privacy rules set up.
  • What's the entrypoint code that is run for each Prow Job, which are triggered by PRs?
    A: The answers to those technical questions could vary from person to person; I'd recommend reading https://github.com/kubeflow/testing/tree/master/aws.
  3. Which of these moving parts need periodic care? For example:
  • What's the effort for maintaining the Prow cluster?
    A: Resources, occasional failure checks, and monitoring.
  • How can we update the entrypoint code/image that Prow runs for every PR?
    A: The answers to those technical questions could vary from person to person; I'd recommend reading https://github.com/kubeflow/testing/tree/master/aws.

I think the main thing here is that the account is a personal account, we don't have well-defined privacy rules set up, and it's difficult to transfer ownership to other community folks. Finding alternatives might be a much easier thing to do.

from testing.

terrytangyuan commented on July 19, 2024

That’s unfortunate to hear but best of luck on your new journey.

@kubeflow/wg-training-leads @kubeflow/wg-automl-leads Should we consider switching to GitHub Actions with K3d or minikube since our tests do not require special hardware?

@kubeflow/project-steering-group Any plans of additional resources from Google on this?


surajkota commented on July 19, 2024

Hi all,

Instead of deprecating the infrastructure, can we decouple the funding of the account from the design and implementation of the testing infrastructure running in this account? Creating a new infra might be a big effort.

Funding of account:
I am from AWS and want to clarify that the AWS program for funding the account has not stopped. We did look at the account earlier this year and there were enough credits at that time for the CI to run. Please refer to this comment for the process of renewing the credits: kubeflow/manifests#2099 (comment).

Design and maintenance of testing infra
Thank you @PatrickXYS for owning the initial infrastructure design and implementation. In the long term, we need more than one person to maintain it. IMO the questions to ask are: are there folks interested in owning and driving the efforts required to maintain this infrastructure? What is the effort required to maintain this? What docs/tutorials do we need for new contributors?


thesuperzapper commented on July 19, 2024

@surajkota @terrytangyuan @PatrickXYS I really like the idea of using GitHub actions for most parts of Kubeflow!

This will make our tests portable (not tightly integrated with a specific sponsor's infrastructure).

Considerations:

  • github-hosted runners:
    • have no time-limits for open-source projects (the 2000 minute limit is only for private repos)
    • only have 2-CPU, 7GB-RAM, 14GB-SSD
  • self-hosted runners:
    • can be provided by a sponsor like AWS, with access to larger resources, and GPUs
  • action runners are quite different from the current optional-testing-infra approach:
    • the workflows run on isolated VMs (not Kubernetes clusters), however, we can dynamically spawn kind / k3d clusters in those VMs (if the test requires it)
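
As a rough illustration of the trade-off in the list above, here is a minimal Python sketch that decides whether a test suite fits a GitHub-hosted runner or needs a self-hosted one. The limits are the figures quoted above (2 CPUs, 7 GB RAM, 14 GB SSD); the `SuiteNeeds` type and the `pick_runner` function are hypothetical names made up for this example, not part of any Kubeflow or GitHub tooling.

```python
from dataclasses import dataclass

# Limits of a GitHub-hosted runner as quoted in the comment above.
HOSTED_CPUS = 2
HOSTED_RAM_GB = 7
HOSTED_DISK_GB = 14


@dataclass(frozen=True)
class SuiteNeeds:
    """Resources a test suite needs (hypothetical type, for illustration only)."""
    cpus: int
    ram_gb: float
    disk_gb: float
    needs_gpu: bool = False


def pick_runner(needs: SuiteNeeds) -> str:
    """Return "github-hosted" if the suite fits the quoted limits, else "self-hosted"."""
    fits = (
        not needs.needs_gpu
        and needs.cpus <= HOSTED_CPUS
        and needs.ram_gb <= HOSTED_RAM_GB
        and needs.disk_gb <= HOSTED_DISK_GB
    )
    return "github-hosted" if fits else "self-hosted"
```

For example, a small kind or k3d based E2E suite described as `SuiteNeeds(cpus=2, ram_gb=4, disk_gb=10)` would stay on a GitHub-hosted runner, while anything that sets `needs_gpu=True` would go to a sponsor-provided self-hosted runner.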

The next question is how can an "infra sponsor" (like AWS) provide the self-hosted runners to the Kubeflow org:

  • One possible option is using something like philips-labs/terraform-aws-github-runner to dynamically scale AWS Spot instances for the runners
  • Alternatively, there are many other projects which help people manage their runners, see awesome-runners
  • Finally, if necessary to meet our security requirements, we could create a new project that focuses on securely managing self-hosted runners for "public" projects (we should probably collaborate with GitHub directly on this, as a good solution is needed for other open-source projects)

It's important to highlight that self-hosted runners have security considerations for the infra sponsor, but these can be mitigated if the self-hosted runners are set up in an isolated/ephemeral way (and if workflows require "approval" to run from untrusted people).


surajkota commented on July 19, 2024

Are you suggesting each working group own their own infra and setup?

If not, we are again missing the first step, i.e. finding new maintainers for testing infra.

On a side note, I see no major issues on this repository, which seems to suggest it is working pretty well. So before we jump to a redesign, let's find out if we can have new maintainers, probably a wg-testing?

Regarding portability, isn't the current infra using prow? If yes, I think it is very much portable as well.


thesuperzapper commented on July 19, 2024

Are you suggesting each working group own their own infra and setup?

If not, we are again missing the first step, i.e. finding new maintainers for testing infra.

@surajkota You are correct, I was still suggesting that someone (like AWS) provides the self-hosted GitHub Actions runners, which the rest of the WGs can then use on their kubeflow-org GitHub repos.

On a side note, I see no major issues on this repository, which seems to suggest it is working pretty well. So before we jump to a redesign, let's find out if we can have new maintainers, probably a wg-testing?

I agree that it's not "broken", but GitHub actions is a very nice developer experience, so I think it's worth a look.
But I agree, if we can find new maintainers for the shared infrastructure, this will give us more time to consider our options. I assume at least a few of these new maintainers will have to work at AWS (because AWS is currently providing the physical infrastructure).

To ensure we don't end up with a single-person risk again, we could form a wg-testing, with a mandate similar to that of the Kubernetes sig-testing.

Regarding portability, isn't the current infra using prow? If yes, I think it is very much portable as well.

Yes, we are using prow in most repos (but some WGs are already migrating some things to GitHub actions).


surajkota commented on July 19, 2024

@PatrickXYS can you clarify what you mean by optional-test-infra may stop working around June 2022? Specifically what will stop working and how can this be stopped?


PatrickXYS commented on July 19, 2024

As I said in the issue description, there are two aspects of NOT-WORKING:

  1. The account doesn't have enough resources/credits to continue working
  2. I don't have enough bandwidth to maintain the infra

The Kubeflow community / AWS could invest more credits in the existing test-infra, but I may not be able to continue maintaining it, so I'd prefer not to go with this option.

The option I prefer is for the community to avoid establishing a horizontal team to maintain a centralized test-infra. Instead, allowing WGs to choose their own solutions should be more scalable and maintainable.

Created sub-issues in all repos which consume optional-test-infra as of now.


kimwnasptd commented on July 19, 2024

@PatrickXYS thank you very much for all your efforts on this infra, it has really served us well throughout these years! This is also evident from the low number of open issues, despite the fact that it is being heavily used by at least kubeflow/{kubeflow,katib,training-operator}.

I really agree with @surajkota's approach to this situation. Let's first try to understand what the commitment requirements for maintaining this infra are, as well as which parts of the infra will need periodic care/maintenance. This way we can all better evaluate whether it makes sense for us as a community to stick with this infra or start investing time in other solutions.

@PatrickXYS it's completely understandable that you don't have cycles on this anymore, and if there's no commitment from the rest of the community on helping maintain it then indeed let's deprecate it. But, again, please help us understand the maintenance burden first.

More specifically these are the first questions that come to mind:

  1. How can someone new get access to the AWS infra, for example see things via the AWS console?
  2. Do we have some relevant documentation on the moving parts of this infra? For example:
    1. What is the webhook flow, triggered by GitHub?
    2. What's the entrypoint code that is run for each Prow Job, which are triggered by PRs?
  3. Which of these moving parts need periodic care? For example:
    1. What's the effort for maintaining the Prow cluster?
    2. How can we update the entrypoint code/image that Prow runs for every PR?

These are just some initial questions that come to mind, but I think they can get us far enough for now.


yuzisun commented on July 19, 2024

@PatrickXYS I request holding off the deprecation, as we have not reached a decision on the migration plan yet. Also, as previously discussed with @surajkota, AWS is willing to keep sponsoring the account; I am not sure if anything has changed.


PatrickXYS commented on July 19, 2024

@yuzisun and other Kubeflow WG folks, I posted the deprecation notice on March 4th, trying to provide as much buffer time as possible to the community, so that Kubeflow WGs can find their preferred alternatives for presubmit E2E testing. Also, I tagged all the WGs and created sub-issues in the corresponding repositories.

I'm not sure what the main reason is that has kept the community from finding other options for two months, or what the current progress is.

The AWS account is running out of credits. If we don't deprecate by the end of this month, it will charge my personal bank account (set up as a backup) thousands of dollars per month.

Please take any action to migrate to preferred alternatives ASAP.

