Git Product home page Git Product logo

pre-commit-hooks's Issues

Add hooks to lint conda recipes and pyproject.toml files

Currently all of the non-dependency components of our package files are managed manually. This can lead to problems when some fields need to be specified in certain ways. It should be fairly trivial to implement a pre-commit hook that checks things like the license fields to ensure that they are appropriate. We could add as much complexity as we really need to handle various, although I'd probably err on the side of starting with only linting simple cases like common license typos and only adding more complex checks later if necessary.

[FEA] Add pre-commit hook to verify Markdown links

Is your feature request related to a problem? Please describe.

I recently setup a GitHub action in the Thrust repository to validate links in Markdown docs to ensure links don't get stale/broken.

NVIDIA/thrust#1640

It would be nice to do the same thing for libcudf (and other RAPIDS libraries).

Add copyright hook

The first hook that we would like to support is for checking copyright headers to ensure that they are up to date. The copyright script that we would like to adapt is already present in a few repos such as cudf. The purpose of this issue is to capture discussions that we have had recently regarding issues with this script, discuss potential workarounds, and determine a path forward.

The current script tends to work well for our style checks in CI, but it occasionally runs into issues on local runs where it incorrectly identifies modified files, resulting in updated copyrights for unmodified files. Here are some of the solutions suggested so far:

  • I proposed a modification to the logic that eschews "upstream branch" detection altogether in favor of relying on pre-commit to tell us which files were modified and then use git to query first and last modified times. This approach is far simpler and more robust, but @bdice pointed out that it has a potential weakness due to our usage of squash merges rewriting history and therefore having a different final year after a squash merge than before when a PR opened in one year is merged in the following year. Specifically, the problematic case is the following situation:
    • A PR is created in December, with files foo.cpp and foo.hpp modified and given the copyright year X
    • In January, foo.cpp is modified, resulting in an updated copyright year X + 1
    • The PR is squash-merged, with foo.cpp having the copyright year X + 1 but foo.hpp only having year X. However, due to the squash, the last modified time of foo.hpp is actually year X + 1 according to the git history.
    • All subsequent runs of the copyright hook with --all-files result in errors on foo.hpp
    • Note that the same issue is also present with the current copyright hook logic, but only in the much more limited scenario where a PR is created in December and then merged in January with no additional modifications to files. The primary difference is that with the current approach modifying any file in the PR is sufficient to update the copyright for all of the files in the diff with the main branch, whereas with the above proposal only the subset of actually modified files will be updated.
  • Another proposed alternative was modifying the RAPIDS /merge GH command to account for this discrepancy by performing a squash and copyright update prior to doing the squash merge on GH, but we rejected this solution as requiring too much engineering effort.

Given the above considerations, our current best path forward may be simply using the existing script and trying to identify its weak points. @ajschmidt8 and I considered the possibility of including further git logic into the copyright script to verify whether the relevant git target branches are sufficiently up-to-date. If we went this route, my inclination would be to throw errors rather than do any implicit git actions in the hook, but we may at least be able to provide more robust error modes in this way and avoid incorrect modifications.

verify-copyright should consider local and remote branches when determining target branch upstream commit

Description

When the verify-copy right hook is not provided a target branch via command-line flags or environment variables, and when it finds a branch-{YY}.{MM} one locally, it currently applies the following procedure to find the commit to compare to:

  • determine the upstream branch the local branch-{YY}.{MM} is tracking
  • target_commit = HEAD of that upstream branch

It should instead do this:

  • determine the upstream branch the local branch-{YY}.{MM} is tracking
  • determine the HEAD of that upstream branch
  • determine the HEAD of the local branch
  • target_commit = whichever of those commits is newest

Benefits of this work

  • reduces the risk of CI and local pre-commit disagreeing as a result of your fork being out of date with the upstream repo

Acceptance Criteria

  • when the target branch is not supplied via command line args or environment variables, and when the commit ID on the local copy of the target branch is different from the one on the upstream branch it's tracking, the newer of the 2 commits should be used as the reference to decide what files have changed

Approach

Modify the implementation of get_target_branch_upstream_commit() to accomplish this:

def get_target_branch_upstream_commit(repo, args):

Notes

The verify-copyright hook only modifies copyright headers for files that it considers to have "changed".
"changed" in that hooks definition means "the file content is different between HEAD wherever the hook is being run and the latest commit on the target branch".

"target branch" can be passed to the hook via environment variables or command-line flags, as described here:

The target branch is determined in the following order:

When none of those methods are used, the hook tries to determine the target branch and its latest commit by inspecting the git repo it's being run from and looking for branches that follow the RAPIDS conventions like branch-{YY}.{MM}.

This logic says "take the latest branch-{YY}.{MM} locally, and get the commit ID of the the latest commit on the upstream branch it's tracking".

if target_branch:
target_branch_upstream = target_branch.tracking_branch()
if target_branch_upstream:
return target_branch_upstream.commit

References

Created based on discussions about this: rapidsai/rmm#1564 (comment)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.