Comments (27)

ltrzesniewski avatar ltrzesniewski commented on May 22, 2024 29

Here's a workaround to remove a given directory by path with BFG:

git rev-list --all --objects -- path/to/the/directory/to/delete | git cat-file --batch-check='%(objectname) %(objecttype) %(rest)' | grep -Pe '^\w+ blob' | cut -d' ' -f1 > ./to-delete.txt
java -jar bfg.jar --no-blob-protection --strip-blobs-with-ids ./to-delete.txt

The principle is simple: create a list of object IDs to strip, and input that to BFG. This means that if an object is referenced through a different path it will be nuked nonetheless.

  • git rev-list --all --objects -- path/to/the/directory/to/delete
    This will list all objects in the subdirectory referenced in all commits which modify the given path. The format is objectid filepath.

    You should run this command to check its output matches what you'd expect.

  • git cat-file --batch-check='%(objectname) %(objecttype) %(rest)'
    This will qualify the object with its type. It will turn the previous format objectid filepath into objectid type filepath.

  • grep -Pe '^\w+ blob'
    This will filter out non-blob objects.

  • cut -d' ' -f1 > ./to-delete.txt
    This will extract the object ID and redirect the output into the to-delete.txt file.

  • java -jar bfg.jar --no-blob-protection --strip-blobs-with-ids ./to-delete.txt
    This runs BFG, giving it the list of objects to remove.
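
Where grep -P is not available (the BSD grep shipped with macOS has no -P option, for example), the grep and cut steps can be approximated with awk. This is only a minimal sketch, not part of the original workaround; the function name blob_ids is made up for illustration, and it assumes the objectid type filepath format produced by the cat-file step above.

```shell
# Hypothetical helper: stand-in for `grep -Pe '^\w+ blob' | cut -d' ' -f1`.
# Reads "objectid type filepath" lines on stdin and prints the ID of every blob.
blob_ids() {
  awk '$2 == "blob" { print $1 }'
}

# Usage inside a repository (the path is a placeholder):
# git rev-list --all --objects -- path/to/the/directory/to/delete \
#   | git cat-file --batch-check='%(objectname) %(objecttype) %(rest)' \
#   | blob_ids > ./to-delete.txt
```

As with the original pipeline, it is worth inspecting to-delete.txt before handing it to BFG.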

Needless to say, it's much faster than git filter-branch 😄

from bfg-repo-cleaner.

gqy117 avatar gqy117 commented on May 22, 2024 5

Thanks to @ltrzesniewski for his awesome answer.
In my case, I need to delete 2 files with full path provided.
So I tweaked @ltrzesniewski 's answer:

git rev-list --all --objects | grep -P '^\w+ Path/to/your/file1\.txt$' | cut -d" " -f1 >> ./to-delete.txt
git rev-list --all --objects | grep -P '^\w+ Path/to/your/file2\.txt$' | cut -d" " -f1 >> ./to-delete.txt

java -jar bfg.jar --no-blob-protection --strip-blobs-with-ids ./to-delete.txt
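
Matching a path with a regular expression needs care with escaping and anchoring, so an exact string comparison may be safer. The following is a minimal sketch under that assumption; ids_for_path is a hypothetical helper, and it expects the objectid filepath lines that git rev-list --all --objects prints. One caveat, as far as I understand rev-list: each object is listed once, with the path it was first found under, so a blob that also lives at another path may show up under that other path instead.

```shell
# Hypothetical helper: print the IDs of objects whose path is exactly "$1".
# Input: "objectid filepath" lines (commits have no path and are skipped).
ids_for_path() {
  TARGET="$1" awk '
    {
      i = index($0, " ")                  # split at the first space only,
      if (i == 0) next                    # so paths containing spaces survive
      if (substr($0, i + 1) == ENVIRON["TARGET"]) print substr($0, 1, i - 1)
    }'
}

# Usage inside a repository (paths are placeholders):
# git rev-list --all --objects | ids_for_path 'Path/to/your/file1.txt' >> ./to-delete.txt
# git rev-list --all --objects | ids_for_path 'Path/to/your/file2.txt' >> ./to-delete.txt
```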

kellyohair avatar kellyohair commented on May 22, 2024 1

First off, a great tool, and very much appreciated.

Our company went through a "split repos" stage under SVN. One very large repo (200,000 files?), let's call it repo "a", contained 5 top-level directories: a1, a2, a3, a4, and a5, and it got turned into 5 separate SVN repos: a1, a2, a3, a4, and a5. I wasn't around when this happened, but apparently they copied "a" 5 times, did SVN deletes to trim each copy, then pulled the subdirs up to the top of each repo (e.g. for the "a1" repository: rm -f -r a2 a3 a4 a5 ; mv a1/* . ; rmdir a1).

So now the transition to 5 Git repositories (preserving at least the SVN source code change history, using git-svn) creates 5 rather bloated Git repositories. Some kind of simple "delete any file in the a1 repo whose full path starts with a[2-5]/" option would be nice. For the most part, it's the top-level deleted SVN directories, or a simple prefix on the full path.

If I delete all a1, a2, a3, a4, and a5 directories, that might work, I'll try it, but when you are dealing with old SVN repositories and 100's of engineers with no proper repository rules, who knows what will happen. :(

Of course the biggest bloat comes from jar and zip files people shoved into the SVN repos over the years, but BFG does a great job on that.

rtyley avatar rtyley commented on May 22, 2024

@Fryguy It would be possible to have a switch based on directory name rather than directory path, if that is useful? For instance:

--delete-dirs <glob> - delete directories with the specified names

Can you fill me in with some more context about your use case? Are you removing sensitive/private data, or just want to remove large files to reduce repo size?

Fryguy avatar Fryguy commented on May 22, 2024

@rtyley Given a structure below, and I only want to remove the root-level aaa directory, would it remove both since it's by name?

  • aaa/
  • bbb/
    • aaa/

I guess my use case falls into the "reduce repo size" category. My use case is I'm trying to split a massive repo with a long history into separate repos while keeping history. Most of the split is based on top-level directories. For example, a, b, and c will go to one repo; d to a second repo; and e and f to a third repo. git-filter-branch works ok for a single directory using --subdirectory-filter (when that directory doesn't have much activity in the history), but to do multiple directories I have to use --index-filter 'git rm -rf ', which takes forever. Considering I have to split it into about 5 repos, this approach will take forever.

I found bfg-repo-cleaner and ran it to clean up some large files and was amazed by the performance, hence why I would love to be able to use it.

rtyley avatar rtyley commented on May 22, 2024

Given a structure below, and I only want to remove the root-level aaa directory, would it remove both since it's by name?

The implementation I was thinking of in my head would, yes, which makes it a fairly blunt instrument. Part of the special-sauce that makes the BFG so fast is that it is path-independent, which makes me cautious about adding path-dependent features - unless they can be added without affecting performance. Let me have a think about it, I can see it's a valid feature.

I found bfg-repo-cleaner and ran it to clean up some large files and was amazed by the performance

That makes a good quote - would you mind if I added it to http://rtyley.github.com/bfg-repo-cleaner/#feedback, attributing it to you as 'Jason Frey, Software Engineer at Red Hat'?

Fryguy avatar Fryguy commented on May 22, 2024

That makes a good quote - would you mind if I added it to http://rtyley.github.com/bfg-repo-cleaner/#feedback, attributing it to you as 'Jason Frey, Software Engineer at Red Hat'?

Sure, go for it.

Let me have a think about it, I can see it's a valid feature.

Thanks! I was trying to go through the code to find where the file name checking was done, to see if I could help out with a patch or something, but I couldn't find it. I've also never done Scala before, so it's also a new thing to look at.

rtyley avatar rtyley commented on May 22, 2024

Sure, go for it.

Thanks - I've added your quote, much appreciated.

Thanks! I was trying to go through the code to find where the file name checking was done, to see if I could help out with a patch or something, but I couldn't find it.

These are the lines currently associated with stripping out files that match a particular text-pattern....

https://github.com/rtyley/bfg-repo-cleaner/blob/v1.0.2/src/main/scala/com/madgag/git/bfg/cli/CLIConfig.scala#L118-L124

...but the code really is not set up to deal with paths, it's a non-trivial change to get it there :)

Out of interest, would you want/expect 'empty' commits (ie commits that appear to change nothing, because they relate to stuff from deleted directories) to be completely removed by the BFG, or for them to stay in place, with the commit message intact?

I've also never done Scala before, so it's also a new thing to look at.

If you've got several weeks to invest (!) and you'd like to learn Scala I can really recommend this free course:

https://www.coursera.org/course/progfun

A lot of us in the office took it a few months ago and it really brought us up to speed.

Fryguy avatar Fryguy commented on May 22, 2024

Out of interest, would you want/expect 'empty' commits (ie commits that appear to change nothing, because they relate to stuff from deleted directories) to be completely removed by the BFG, or for them to stay in place, with the commit message intact?

I would expect them to be removed for my use case, but others might(?) want to keep them. git-filter-branch has the --prune-empty option, which I was using.

Fryguy avatar Fryguy commented on May 22, 2024

I know very little about git internals, so I'm not sure this helps, but I know if I do git rev-list --all --objects, I get a list of every object with full path. With my repo of ~365,000 objects it takes only 24 seconds to run (with an unprimed file cache). This can then be easily grepped to get a list of object SHAs, which I assume can be run through the "delete a file from history" method.
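
A rough sketch of that grep step for the "several top-level directories" case, written with awk so that directory names are compared as plain strings rather than regular expressions. The helper name ids_under_dirs is hypothetical, directory names containing spaces are not handled, and the output can include tree IDs as well as blob IDs, so it may still need a type filter (for instance via git cat-file --batch-check) before being used as a delete list.

```shell
# Hypothetical helper: print IDs of objects whose path lies under any of the
# directories given as arguments. Input: "objectid filepath" lines on stdin.
ids_under_dirs() {
  DIRS="$*" awk '
    BEGIN { n = split(ENVIRON["DIRS"], dirs, " ") }
    {
      i = index($0, " ")
      if (i == 0) next                    # commits are listed without a path
      path = substr($0, i + 1)
      for (k = 1; k <= n; k++)
        if (index(path, dirs[k] "/") == 1) {   # path starts with "dir/"
          print substr($0, 1, i - 1)
          break
        }
    }'
}

# Usage inside a repository (directory names are placeholders):
# git rev-list --all --objects | ids_under_dirs a b c > ./to-delete.txt
```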

Would that help?

yeago avatar yeago commented on May 22, 2024

"Fairly blunt instrument" to say the least. So if someone committed a lib directory in the parent, I can't use this tool without removing all lib directories anywhere in the path?

rtyley avatar rtyley commented on May 22, 2024

So if someone committed a lib directory in the parent, I can't use this tool without removing all lib directories anywhere in the path?

Yep, that's true of the --delete-folders [foo] option. Folders named [foo] are removed from anywhere within history (apart from the latest commit, which is 'protected'), and yep, this is a blunt instrument.

Often though, it's reasonable to ask exactly what you're trying to achieve by removing the lib folder. I would guess you're just trying to make your repo smaller. In which case, you can get pretty close to that aim by just using --strip-blobs-bigger-than 10M (or whatever size is appropriate to your repository).
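
Before settling on a threshold, it can help to preview which blobs are larger than it. The following is a minimal sketch, not a BFG feature: big_blobs is a hypothetical helper, it relies on the %(objectsize) placeholder of git cat-file --batch-check, and the threshold is passed in bytes because how BFG itself interprets a suffix such as 10M is not covered here.

```shell
# Hypothetical helper: keep "objectid type size filepath" lines that describe
# a blob strictly larger than "$1" bytes.
big_blobs() {
  awk -v min="$1" '$2 == "blob" && ($3 + 0) > (min + 0)'
}

# Usage inside a repository (10485760 bytes is an example threshold):
# git rev-list --all --objects \
#   | git cat-file --batch-check='%(objectname) %(objecttype) %(objectsize) %(rest)' \
#   | big_blobs 10485760
```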

yeago avatar yeago commented on May 22, 2024

i can't use strip blobs now because i work at a company and i'm not sure about some of the files yet. i am sure about the directory in question.

honestly, in terms of design i don't see how a generic delete directory by name could be very useful. it certainly isn't in my case :P it seems like a loaded gun waiting to destroy current directories... but you said not the latest commit. does that mean currently existing directories called 'lib' won't be touched?

rtyley avatar rtyley commented on May 22, 2024

but you said not the latest commit. does that mean currently existing directories called 'lib' won't be touched?

Sure- here's some further documentation:

http://rtyley.github.io/bfg-repo-cleaner/#protected-commits

lfilho avatar lfilho commented on May 22, 2024

+1 for specifying an absolute path for removal.

My case is that, unfortunately, the team has committed a lot of libs to the repo (since SVN times...) and, until we refactor the project and move them into Maven or something external like that, we have to keep the libs in the repo. But we did identify several libs that are useless by now, which could be removed to save a good amount of space in the repo...

So suppose I have:

/libs/certain-lib/2.1/certain-lib.jar
/libs/certain-lib/3.4/certain-lib.jar

In this example, we're still using version 3.4. So we couldn't delete "certain-lib" folder. And also we couldn't delete "certain-lib.jar" otherwise the 3.4 version would also go.

As for the performance, I think it's fine: for this full path case, you could just warn the users (manual, readme, before running the command...). I wouldn't mind at all.

Nonetheless, congrats and thanks a lot for this great tool!!!!!

rtyley avatar rtyley commented on May 22, 2024

In this example, we're still using version 3.4. So we couldn't delete "certain-lib" folder. And also we couldn't delete "certain-lib.jar" otherwise the 3.4 version would also go.

@lfilho in your example, you should be fine to do:

$ bfg --delete-files '*.jar'

...this will delete all jars that are not in your latest commit - because, by default, the BFG protects the contents of your latest commit. So /libs/certain-lib/2.1/certain-lib.jar will be deleted from your repo (because it's not present in your latest commit) - but /libs/certain-lib/3.4/certain-lib.jar won't be deleted (because it is present in your latest commit).

This command is short and sweet, and should definitely be used unless there's a good reason not to. Although I appreciate that for some use-cases path-dependent action is necessary, for the large majority of cases, it's not. For some of the cases where path-dependent action is necessary, there may actually already be a decent alternative tool (perhaps git-subtree, which is decently performant) that can perform the task.

I'm always very happy to hear explanations of why users do need path-dependent action, and if people explain the need here on this issue, that'll help me prioritise this feature. So far, of the two people who've discussed their requirements, FryGuy had a legitimate use-case, whereas yeago, I believe, would have been served perfectly well by the BFG's protected-commit behaviour.

As for the performance, I think it's fine: for this full path case, you could just warn the users (manual, readme, before running the command...). I wouldn't mind at all.

The cost of implementing path-dependent action:

  • Possibly Performance
  • Definitely a big chunk of dev time - almost certainly my time, unpaid, when I could be creating something more useful to more people.
  • Definitely complexity in the implementation of the BFG :-) The BFG implementation is relatively simple because it does not care about the path; implementing this feature would not be trivial.

Given the cost of implementing the feature, vs the benefit it provides to a limited percentage of users, what would you do!? Personally, I would like to try to implement it, but that will have to be in a world where I have considerably more time.

Nonetheless, congrats and thanks a lot for this great tool!!!!!

Thank you, I appreciate your thanks :-)

xanderdunn avatar xanderdunn commented on May 22, 2024

I would also really love to see bfg support removing of specific subdirectories. This would make it useful in my situation.

javabrett avatar javabrett commented on May 22, 2024

+1 for this feature. Sometimes it is prudent to prune entire paths from history. I imagine that this is a fairly common need. This can be achieved now, but at the moment it requires a lot of pre-BFG scripting to generate a list of objects that are in those delete-target trees but not in HEADs, then feed that to delete-by-objectId using -bi. It works, but it's pretty cumbersome.

Part of the special-sauce that makes the BFG so fast is that it is path-independent, which makes me cautious about adding path-dependent features - unless they can be added without affecting performance. Let me have a think about it, I can see it's a valid feature.

...

...but the code really is not set up to deal with paths, it's a non-trivial change to get it there :)

@rtyley Did you ever draw any performance-conclusions on this? Is it just a coding-exercise without fundamental overwhelming space/time concerns?

It looks like this wants scala-git Tree to be able to maintain maps on both the blob short/relative filename, and the full path, which would require generating the two maps and the storage required for that, but doesn't otherwise seem like a big burden.

bschindler avatar bschindler commented on May 22, 2024

Here's a workaround to remove a given directory by path with BFG:

You have to be careful with this approach. If a file has been copied from another location, this approach will also delete it in the other location as git uses the same hash for different locations. It has happened to me on trial runs a number of times.

The best way to deal with this is to delete the directory in git first, commit and then run your script but without blob-protection. This seemed to have worked for me.
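
One way to spot the problem before running anything destructive is to look for blobs that appear both inside the directory being deleted and somewhere else. This is only a sketch under assumptions: shared_blobs is a hypothetical helper, and it reads the mode type objectid<TAB>path lines that git ls-tree -r prints, because that output names every path a blob is reachable under in a given commit.

```shell
# Hypothetical helper: report blobs found both under the prefix "$1" and at a
# path outside it. Input: `git ls-tree -r` lines ("mode type id<TAB>path").
shared_blobs() {
  PREFIX="$1" awk -F '\t' '
    {
      split($1, meta, " ")                # meta[2] = type, meta[3] = object ID
      if (meta[2] != "blob") next
      if (index($2, ENVIRON["PREFIX"]) == 1) inside[meta[3]] = 1
      else outside[meta[3]] = $2          # remember one outside path per blob
    }
    END {
      for (id in inside) if (id in outside) print id, outside[id]
    }'
}

# Usage inside a repository (the prefix is a placeholder; walking every commit
# can be slow on a large history):
# git rev-list --all | while read -r commit; do git ls-tree -r "$commit"; done \
#   | shared_blobs 'path/to/the/directory/to/delete/'
```

Any ID it prints would also be stripped from the other location if it ends up in the delete list.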

ltrzesniewski avatar ltrzesniewski commented on May 22, 2024

@bschindler yes that's dangerous, that's what I said in bold in my comment.
I made a PR which enables a safe method, see #166 - unfortunately the maintainer doesn't seem to care about PRs.

rtyley avatar rtyley commented on May 22, 2024

I do, I just get through them real slowly

ltrzesniewski avatar ltrzesniewski commented on May 22, 2024

@rtyley oh sorry I just noticed you started working on the project again recently

 avatar commented on May 22, 2024

I erroneously included a folder containing release builds in my git repo. Can I use BFG to undo this, removing that folder and its contents from my git history? It's just been sitting there taking up space for no reason.

Fryguy avatar Fryguy commented on May 22, 2024

@TharosTheDragon If the directory name is consistent throughout history, and doesn't conflict by name with other directories in the tree, then you could use --delete-dirs <glob> - delete directories with the specified names. Note that command is based on name, not path, so if you have multiple directories with the same name, even at different depths, they will both be removed.

 avatar commented on May 22, 2024

What's the difference between --delete-dirs and --delete-folders?

Fryguy avatar Fryguy commented on May 22, 2024

I'm sorry...I copied the wrong string, not paying attention...should be --delete-folders <glob>
