Git Product home page Git Product logo

Comments (10)

weiji14 avatar weiji14 commented on June 19, 2024 3

There's currently a movement towards replacing the 'master' branch name with a more neutral name like 'main' (see https://github.com/github/renaming), and this ties in with our code of conduct to be mindful of our language. Such changes are always disruptive, but I think it presents an opportunity for us to 'tidy' up our history here. This is one way it would work:

  1. Create a new branch called 'main' (or some other name) that would be a copy of the current 'master' branch without the big files.
  2. Rebase our 'development' branch (and other work in progress branches) on top of this new 'main' branch
  3. Once the migration is done, remove the 'master' branch (or we can keep it to preserve history).

Theoretically this would be a win-win, since we get the new 'main' branch name, and also prevent a force push from happening. I haven't tested it out yet though, but I think it is a good idea to consider 😄

from icepyx.

JessicaS11 avatar JessicaS11 commented on June 19, 2024

@scottyhq Any ideas, since you helped solve this issue in the hackweek repos?

from icepyx.

scottyhq avatar scottyhq commented on June 19, 2024

@JessicaS11 - yes I'd recommend scrubbing big files from the commit history, which would mean contributors need to re-sync with this upstream repository. It's a bit disruptive, but better than a bloated repository.

I see the following big files (using second answer here https://stackoverflow.com/questions/10622179/how-to-find-identify-large-commits-in-git-history)

84124d0a539d  3.9MiB doc/examples/ICESat-2_ATL08Parse_and_Visualize.ipynb
4ab8cf835487  8.0MiB dev-notebooks/ICESat-2Hackweek_tutorial_locations_draft.ipynb
c17285052462  8.6MiB examples/supporting_files/TDM1_DEM_90m_colombia_DEM_masked_aea_hs.tif
01b7705584a9  8.9MiB dev-notebooks/ICESat-2Hackweek_tutorial_locations_draft.ipynb
19372def82df   26MiB examples/supporting_files/TDM1_DEM_90m_colombia_DEM_masked_aea.tif
caf019c3c4de   42MiB icepyx/data/ATL06_20190101001723_00540210_209_01.h5
5b45ad2b7edb   46MiB icepyx/data/ATL06_20190101003047_00540212_209_01.h5
72cc69254ca8   76MiB icepyx/data/ATL06_20190101002504_00540211_209_01.h5

You'll have to follow this documentation to remove the files https://docs.github.com/en/github/authenticating-to-github/removing-sensitive-data-from-a-repository

A couple other suggestions:

  1. Note that you can add *tif and *h5 to .gitignore to avoid this happening in the future and remove output cells before saving jupyter notebooks.

  2. A versatile solution is to use a git "pre-commit hook" to check file sizes before committing (https://github.com/pre-commit/pre-commit-hooks#hooks-available).

  3. If people create pull requests with large files you'll see them in the list of files, and can therefore have them removed before merging. You could also consider adding a GitHub Action that ensures PR files are <X MB, which has to pass before merging. That way creators of a PR would see the issue in a more automated way.

from icepyx.

JessicaS11 avatar JessicaS11 commented on June 19, 2024

Thanks @scottyhq. These excellent resources are all open tabs for me (that's how I found the big files to begin with!) and I was just hoping for a less potentially disruptive solution. I'm guessing this means that any existing branches will need to be closed or rebased to prevent reintroduction, right? Is the best way to do this going to be fix the master branch directly locally, force push the changes to GH, and then force push the changes up through the other branches, resolving conflicts as we go? I like the idea of a pre-commit hook or GitHub Action to prevent future issues (and I thought that *h5 and *tif were in .gitignore, but they're not!).

from icepyx.

JessicaS11 avatar JessicaS11 commented on June 19, 2024

Great idea @weiji14. I'd been thinking about when/how to switch to the use of "main", so this seems like a great opportunity to make several big changes at once (and even make some of them easier). I've got a few other boxes to check and then hopefully I can get started on this!

from icepyx.

weiji14 avatar weiji14 commented on June 19, 2024

Been having trouble finding the *.h5 files in the git history, does anyone know when they were inserted? Specifically the commit date and/or hash.

  • icepyx/data/ATL06_20190101001723_00540210_209_01.h5
  • icepyx/data/ATL06_20190101003047_00540212_209_01.h5
  • icepyx/data/ATL06_20190101002504_00540211_209_01.h5

I can isolate the two Tandem X DEM *.tif files, but was wondering why the *.h5 files seem so hard to track down. Will need to find that commit (let's call it commit abcdef) so that we can branch off right before those big files were added (i.e. keep the git history intact before commit abcdef).

Just to be careful, I'll create an empty 'main' branch right at commit abcdef, and a mirror main-clean branch which is based on the current master branch (but without the large files). That way, I can open up a Pull Request from main-clean to main and everyone can double-check to make sure things look alright.

Edit: Nevermind, I've found the *.h5 files, they were sitting on a side branch so we don't have to remove them. Also created a draft PR to handle the 'master' to 'main' migration at #130 (ready for comments).

from icepyx.

JessicaS11 avatar JessicaS11 commented on June 19, 2024

@scottyhq I get a fatal error (Cannot open existing pack file 'objects/pack/*.idx') when I run the join one-liner from the SO post you linked, but I've been using the "blazingly fast shell one-liner" answer with success. I've modified the first line of it to run on specific branches (e.g.
git rev-list --objects --branches remotes/origin/master | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sed -n 's/^blob //p' | sort --numeric-sort --key=2 | cut -c 1-12,41- | $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

I'm having trouble understanding the output though. I've isolated the h5 files to the get-variables branch, so they (as expected) don't show up if I run the above command as written (i.e. on master). However, the tif files show up on the main branch, which @weiji14 created (see #130) from a commit prior to their addition. A couple questions are:

  1. If the command is only showing all files (including history) for a given branch, why do those large files show up on the main branch?
  2. Is our assumption that we can use migrating to main as a way to "rewrite" (via rebasing, essentially) the history by branching it from a commit prior to the addition of the large files and ensuring that the commits that included adding the large files are squash merged with a commit removing them (i.e. not included in the history), then ultimately deleting the master branch (and any other branches that have the files in their history) a valid approach? Or are we missing something in how git stores history for deleted branches.

from icepyx.

JessicaS11 avatar JessicaS11 commented on June 19, 2024

UPDATE: I've rebased the current master branch (so up to tag v.0.3.0) onto the main branch (currently sitting at commit deebe98 (origin/main) fix version, without the large .tif files (see #130). I updated .gitignore to include .tif and other large data files. Now, git diff origin/jsmain origin/master shows

diff --git a/.gitignore b/.gitignore
index ea72257..83468ea 100644
--- a/.gitignore
+++ b/.gitignore
@@ -109,11 +109,3 @@ venv.bak/
 
 # directory files
 .DS_Store
-
-# data files
-*.h5
-*.hdf
-*.hdf5
-*.csv
-*.zarr
-*.tif
\ No newline at end of file
diff --git a/examples/supporting_files/TDM1_DEM_90m_colombia_DEM_masked_aea.tif b/examples/supporting_files/TDM1_DEM_90m_colombia_DEM_masked_aea.tif
new file mode 100644
index 0000000..19372de
Binary files /dev/null and b/examples/supporting_files/TDM1_DEM_90m_colombia_DEM_masked_aea.tif differ
diff --git a/examples/supporting_files/TDM1_DEM_90m_colombia_DEM_masked_aea_hs.tif b/examples/supporting_files/TDM1_DEM_90m_colombia_DEM_masked_aea_hs.tif
new file mode 100644
index 0000000..c172850
Binary files /dev/null and b/examples/supporting_files/TDM1_DEM_90m_colombia_DEM_masked_aea_hs.tif differ

as I'd expect. What I'd like to do is force push the jsmain branch to origin/master, essentially rewriting the history. Then I could rebase the development branch onto the updated master and force push those changes to origin/development. Then, any open PRs to the current development branch, as long as they are squash merged, shouldn't have any issues (excepting any files on which there are merge conflicts, which we'll have to resolve as we go).

When GitHub finishes developing tools to make it easy to transition from master to main, we'll be the first in line. I'd prefer to make that transition now (as previously discussed), but if the plan I've outlined to remove the large files from the git history will work, not manually making the switch now will save a lot of work as far as having to [re]set up branch rules, permissions, controls, etc. It will also make the transition more seamless for users, since GitHub plans to include warnings/instructions. @weiji14, how does that plan sound? @scottyhq @jhamman @fperez, am I creating a bigger problem than I'm solving with this approach?

from icepyx.

weiji14 avatar weiji14 commented on June 19, 2024

as I'd expect. What I'd like to do is force push the jsmain branch to origin/master, essentially rewriting the history.

After taking a quick look, it seems like you've done a pretty good job here (essentially rewriting only the history on 'master' after June 2020 but keeping the rest intact), but someone else should double check this. Since you've informed everyone already (via email or otherwise) that this is happening, I'm +1 for going ahead with the rewrite.

Then I could rebase the development branch onto the updated master and force push those changes to origin/development.

Yep, this is ok.

When GitHub finishes developing tools to make it easy to transition from master to main, we'll be the first in line. I'd prefer to make that transition now (as previously discussed), but if the plan I've outlined to remove the large files from the git history will work, not manually making the switch now will save a lot of work as far as having to [re]set up branch rules, permissions, controls, etc. It will also make the transition more seamless for users, since GitHub plans to include warnings/instructions.

I'm happy with this (separating the 'large *.tif file removal' task from the 'master -> main rename task), the other suggestion (combining the two) was really just a throwaway idea. Let's just get on with this as it's been taking up far too much time already 😶

from icepyx.

JessicaS11 avatar JessicaS11 commented on June 19, 2024

I've updated the master and development branches as indicated. For anyone with a current branch open (@annavalentine @bidhya @CharlieHewitt1 @wallinb @fspaolo @trevorskaggs @kelseybisson), so long as your PR to development is SQUASH MERGED, it should not reintroduce the large tif files to the history. However, your PR may show a lot of changes relative to the development branch due to the history rewriting (especially if you have merge conflicts), making it difficult to review (until you've dealt with the merge conflicts). To see what's actually changed and would be included in your PR commit, run git diff origin/<yourbranch> origin/development. Another strategy would be to rebase your branch to the development branch using git rebase -i <your branch name> development and then submit a PR. As always, please get in touch if you have any questions or issues!

from icepyx.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.