Comments (27)
Here's a workaround to remove a given directory by path with BFG:
git rev-list --all --objects -- path/to/the/directory/to/delete | git cat-file --batch-check='%(objectname) %(objecttype) %(rest)' | grep -Pe '^\w+ blob' | cut -d' ' -f1 > ./to-delete.txt
java -jar bfg.jar --no-blob-protection --strip-blobs-with-ids ./to-delete.txt
The principle is simple: create a list of object IDs to strip, and input that to BFG. This means that if an object is referenced through a different path it will be nuked nonetheless.
- git rev-list --all --objects -- path/to/the/directory/to/delete
  This will list all objects in the subdirectory referenced in all commits which modify the given path. The format is "objectid filepath". You should run this command to check its output matches what you'd expect.
- git cat-file --batch-check='%(objectname) %(objecttype) %(rest)'
  This will qualify the object with its type. It will turn the previous format "objectid filepath" into "objectid type filepath".
- grep -Pe '^\w+ blob'
  This will filter out non-blob objects.
- cut -d' ' -f1 > ./to-delete.txt
  This will extract the object ID and redirect the output into the to-delete.txt file.
- java -jar bfg.jar --no-blob-protection --strip-blobs-with-ids ./to-delete.txt
  This runs BFG, giving it the list of objects to remove.
Needless to say, it's much faster than git filter-branch 😄
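For anyone whose grep lacks -P (it is a GNU grep extension, and as far as I know the stock macOS grep does not have it), the grep and cut stages can probably be folded into a single awk filter. This is only a sketch: the function name blob_ids is made up for illustration, and it assumes its input is in the "objectid type filepath" form produced by the cat-file stage above.

```sh
#!/bin/sh
# blob_ids (hypothetical helper name): reads lines of the form
# "<objectid> <type> <filepath>" on stdin and prints the object ID
# of each blob, once per ID. Non-blob objects (commits, trees, tags)
# are dropped.
blob_ids() {
    awk '$2 == "blob" && !seen[$1]++ { print $1 }'
}

# Intended use inside a repository (shown as a comment, not executed here):
#   git rev-list --all --objects -- path/to/the/directory/to/delete \
#     | git cat-file --batch-check='%(objectname) %(objecttype) %(rest)' \
#     | blob_ids > ./to-delete.txt
```

The resulting to-delete.txt should have the same content as with the grep and cut version, minus any repeated IDs.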
from bfg-repo-cleaner.
Thanks to @ltrzesniewski for his awesome answer.
In my case, I need to delete 2 files, given their full paths.
So I tweaked @ltrzesniewski's answer:
git rev-list --all --objects | grep -P '^\w+ Path/to/your/file1\.txt$' | cut -d" " -f1 >> ./to-delete.txt
git rev-list --all --objects | grep -P '^\w+ Path/to/your/file2\.txt$' | cut -d" " -f1 >> ./to-delete.txt
java -jar bfg.jar --no-blob-protection --strip-blobs-with-ids ./to-delete.txt
from bfg-repo-cleaner.
First off, a great tool, and very much appreciated.
Our company went through a "split repos" stage under SVN, where one very large repo (200,000 files?), let's call it repo "a", containing 5 top-level directories (a1, a2, a3, a4, and a5), got turned into 5 separate SVN repos: a1, a2, a3, a4, and a5. I wasn't around when this happened, but apparently they must have copied "a" 5 times, then did SVN deletes to trim each one, then pulled the subdirs up to the top of each repo (e.g. for the "a1" repository: rm -f -r a2 a3 a4 a5 ; mv a1/* . ; rmdir a1).
So now the transition to 5 Git repositories (preserving at least the SVN source code change history, using git-svn) creates 5 rather bloated Git repositories. So some kind of simple option to delete any a1 repo file with a prefix pattern of "a[2-5]/" in its full path would be nice. For the most part, it's the top-level deleted SVN directories, or a simple prefix on the full path.
If I delete all a1, a2, a3, a4, and a5 directories, that might work, I'll try it, but when you are dealing with old SVN repositories and 100's of engineers with no proper repository rules, who knows what will happen. :(
Of course the biggest bloat comes from jar and zip files people shoved into the SVN repos over the years, but BFG does a great job on that.
from bfg-repo-cleaner.
@Fryguy It would be possible to have a switch based on directory name rather than directory path, if that is useful? For instance:
--delete-dirs <glob> - delete directories with the specified names
Can you fill me in with some more context about your use case? Are you removing sensitive/private data, or just want to remove large files to reduce repo size?
from bfg-repo-cleaner.
@rtyley Given a structure below, and I only want to remove the root-level aaa directory, would it remove both since it's by name?
- aaa/
- bbb/
  - aaa/
I guess my use case falls into the "reduce repo size" category. My use case is I'm trying to split a massive repo with a long history into separate repos while keeping history. Most of the split is based on top-level directories. For example, a, b, and c will go to one repo; d to a second repo; and e and f to a third repo. git-filter-branch works ok for a single directory using --subdirectory-filter (when that directory doesn't have much activity in the history), but to do multiple directories I have to use --index-filter 'git rm -rf ', which takes forever. Considering I have to split it into about 5 repos, this approach will take forever.
I found bfg-repo-cleaner and ran it to clean up some large files and was amazed by the performance, hence why I would love to be able to use it.
from bfg-repo-cleaner.
Given a structure below, and I only want to remove the root-level aaa directory, would it remove both since it's by name?
The implementation I was thinking of in my head would, yes, which makes it a fairly blunt instrument. Part of the special-sauce that makes the BFG so fast is that it is path-independent, which makes me cautious about adding path-dependent features - unless they can be added without affecting performance. Let me have a think about it, I can see it's a valid feature.
I found bfg-repo-cleaner and ran it to clean up some large files and was amazed by the performance
That makes a good quote - would you mind if I added it to http://rtyley.github.com/bfg-repo-cleaner/#feedback, attributing it to you as 'Jason Frey, Software Engineer at Red Hat'?
from bfg-repo-cleaner.
That makes a good quote - would you mind if I added it to http://rtyley.github.com/bfg-repo-cleaner/#feedback, attributing it to you as 'Jason Frey, Software Engineer at Red Hat'?
Sure, go for it.
Let me have a think about it, I can see it's a valid feature.
Thanks! I was trying to go through the code to find where the file name checking was done, to see if I could help out with a patch or something, but I couldn't find it. I've also never done Scala before, so it's also a new thing to look at.
from bfg-repo-cleaner.
Sure, go for it.
Thanks - I've added your quote, much appreciated.
Thanks! I was trying to go through the code to find where the file name checking was done, to see if I could help out with a patch or something, but I couldn't find it.
These are the lines currently associated with stripping out files that match a particular text-pattern...
...but the code really is not set up to deal with paths, it's a non-trivial change to get it there :)
Out of interest, would you want/expect 'empty' commits (ie commits that appear to change nothing, because they relate to stuff from deleted directories) to be completely removed by the BFG, or for them to stay in place, with the commit message intact?
I've also never done Scala before, so it's also a new thing to look at.
If you've got several weeks to invest (!) and you'd like to learn Scala I can really recommend this free course:
https://www.coursera.org/course/progfun
A lot of us in the office took it a few months ago and it really brought us up to speed.
from bfg-repo-cleaner.
Out of interest, would you want/expect 'empty' commits (ie commits that appear to change nothing, because they relate to stuff from deleted directories) to be completely removed by the BFG, or for them to stay in place, with the commit message intact?
I would expect them to be removed for my use case, but others might(?) want to keep them. git-filter-branch has the --prune-empty option, which I was using.
from bfg-repo-cleaner.
I know very little about git internals, so I'm not sure this helps, but I know if I do git rev-list --all --objects, I get a list of every object with full path. With my repo of ~365,000 objects it takes only 24 seconds to run (with an unprimed file cache). This can then be easily grepped to get a list of object SHAs, which I assume can be run through the "delete a file from history" method.
Would that help?
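A rough sketch of the "grep the rev-list output" idea, under a few assumptions: the input is the "objectid filepath" listing that git rev-list --all --objects prints, the helper name ids_under_prefix is made up, and the prefix is compared as a literal string rather than a regex so that dots or brackets in a directory name do not need escaping. As far as I understand, rev-list prints each object only once, under the path it was first reached by, so a blob shared between two paths may show up under only one of them.

```sh
#!/bin/sh
# ids_under_prefix (hypothetical helper name): reads "<objectid> <filepath>"
# lines on stdin and prints the IDs whose path starts with the literal
# prefix given as $1 (for example "a2/").
ids_under_prefix() {
    PREFIX=$1 awk '
        BEGIN { prefix = ENVIRON["PREFIX"] }
        {
            sp = index($0, " ")
            if (sp == 0) next                   # commit lines carry no path
            path = substr($0, sp + 1)           # keeps spaces inside the path
            if (index(path, prefix) == 1)       # literal "starts with" test
                print substr($0, 1, sp - 1)
        }'
}

# Intended use inside a repository (shown as a comment, not executed here):
#   git rev-list --all --objects | ids_under_prefix 'a2/' > ./candidates.txt
```

The output can contain tree IDs as well as blob IDs (subdirectories under the prefix are listed too), so it would still need a type filter before being used as a list of blobs.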
from bfg-repo-cleaner.
"Fairly blunt instrument" to say the least. So if someone committed a lib directory in the parent, I can't use this tool without removing all lib directories anywhere in the path?
from bfg-repo-cleaner.
So if someone committed a lib directory in the parent, I can't use this tool without removing all lib directories anywhere in the path?
Yep, that's true of the --delete-folders [foo] option. Folders named [foo] are removed from anywhere within history (apart from the latest commit, which is 'protected'), and yep, this is a blunt instrument.
Often though, it's reasonable to ask exactly what you're trying to achieve by removing the lib folder. I would guess you're just trying to make your repo smaller. In which case, you can get pretty close to that aim by just using --strip-blobs-bigger-than 10M (or whatever size is appropriate to your repository).
from bfg-repo-cleaner.
I can't use strip blobs now because I work at a company and I'm not sure about some of the files yet. I am sure about the directory in question.
Honestly, in terms of design I don't see how a generic delete directory by name could be very useful. It certainly isn't in my case :P It seems like a loaded gun waiting to destroy current directories... but you said not the latest commit. Does that mean currently existing directories called 'lib' won't be touched?
from bfg-repo-cleaner.
but you said not the latest commit. does that mean currently existing directories called 'lib' won't be touched?
Sure- here's some further documentation:
http://rtyley.github.io/bfg-repo-cleaner/#protected-commits
from bfg-repo-cleaner.
+1 for specifying an absolute path for removal.
My case is that unfortunately the team has committed a lot of libs in the repo (since SVN times...) and, until we refactor the project and put them into Maven or something external like that, we gotta keep the libs in the repo. But we did identify several libs that are useless by now, which could be removed and would already save a good amount of space in the repo...
So suppose I have:
/libs/certain-lib/2.1/certain-lib.jar
/libs/certain-lib/3.4/certain-lib.jar
In this example, we're still using version 3.4. So we couldn't delete "certain-lib" folder. And also we couldn't delete "certain-lib.jar" otherwise the 3.4 version would also go.
As for the performance, I think it's fine: for this full path case, you could just warn the users (manual, readme, before running the command...). I wouldn't mind at all.
Nonetheless, congrats and thanks a lot for this great tool!!!!!
from bfg-repo-cleaner.
In this example, we're still using version 3.4. So we couldn't delete "certain-lib" folder. And also we couldn't delete "certain-lib.jar" otherwise the 3.4 version would also go.
@lfilho in your example, you should be fine to do:
$ bfg --delete-files '*.jar'
...this will delete all jars that are not in your latest commit - because, by default, the BFG protects the contents of your latest commit. So /libs/certain-lib/2.1/certain-lib.jar will be deleted from your repo (because it's not present in your latest commit) - but /libs/certain-lib/3.4/certain-lib.jar won't be deleted (because it is present in your latest commit).
This command is short and sweet, and should definitely be used unless there's a good reason not to. Although I appreciate that for some use-cases path-dependent action is necessary, for the large majority of cases, it's not. For some of the cases where path-dependent action is necessary, there may actually already be a decent alternative tool (perhaps git-subtree, which is decently performant) that can perform the task.
I'm always very happy to hear explanations of why users do need path-dependent action, and if people explain the need here on this issue, that'll help me prioritise this feature. Of the two people who've discussed their requirements so far, FryGuy had a legitimate use-case, whereas yeago, I believe, would have been served perfectly well by the BFG's protected-commit behaviour.
As for the performance, I think it's fine: for this full path case, you could just warn the users (manual, readme, before running the command...). I wouldn't mind at all.
The cost of implementing path-dependent action:
- Possibly Performance
- Definitely a big chunk of dev time - almost certainly my time, unpaid, when I could be creating something more useful to more people.
- Definitely complexity in the implementation of the BFG :-) The BFG implementation is relatively simple because it does not care about the path, the implementation of this feature would not be trivial.
Given the cost of implementing the feature, vs the benefit it provides to a limited percentage of users, what would you do!? Personally, I would like to try to implement it, but that will have to be in a world where I have considerably more time.
Nonetheless, congrats and thanks a lot for this great tool!!!!!
Thank you, I appreciate your thanks :-)
from bfg-repo-cleaner.
I would also really love to see bfg support removing of specific subdirectories. This would make it useful in my situation.
from bfg-repo-cleaner.
+1 for this feature. Sometimes it is prudent to prune entire paths from history. I imagine that this is a fairly common need. This can be achieved now, but at the moment it requires a lot of pre-BFG scripting to generate a list of objects that are in those delete-target trees but not in HEADs, then feed that to delete-by-objectId using -bi. It works, but it's pretty cumbersome.
Part of the special-sauce that makes the BFG so fast is that it is path-independent, which makes me cautious about adding path-dependent features - unless they can be added without affecting performance. Let me have a think about it, I can see it's a valid feature.
...
...but the code really is not set up to deal with paths, it's a non-trivial change to get it there :)
@rtyley Did you ever draw any performance-conclusions on this? Is it just a coding-exercise without fundamental overwhelming space/time concerns?
It looks like this wants scala-git Tree to be able to maintain maps on both the blob short/relative filename, and the full path, which would require generating the two maps and the storage required for that, but doesn't otherwise seem like a big burden.
from bfg-repo-cleaner.
Here's a workaround to remove a given directory by path with BFG:
You have to be careful with this approach. If a file has been copied from another location, this approach will also delete it in the other location as git uses the same hash for different locations. It has happened to me on trial runs a number of times.
The best way to deal with this is to delete the directory in git first, commit and then run your script but without blob-protection. This seemed to have worked for me.
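One possible guard against the shared-blob problem, sketched under assumptions: after deleting the directory and committing, any blob that still appears in the HEAD tree is presumably one you want to keep, so it can be dropped from the list before BFG runs without blob protection. The helper name ids_not_in_head and the output file name to-delete-safe.txt are made up for illustration, and the sketch only looks at HEAD, not at other branches or tags.

```sh
#!/bin/sh
# ids_not_in_head (hypothetical helper name):
#   $1    : file with one object ID per line (e.g. ./to-delete.txt)
#   stdin : output of "git ls-tree -r HEAD", whose lines look like
#           "<mode> <type> <objectid><TAB><path>"
# Prints the IDs from $1 that do not occur anywhere in the HEAD tree.
ids_not_in_head() {
    awk -v ids="$1" '
        { head[$3] = 1 }                        # third field is the object ID
        END {
            while ((getline id < ids) > 0)
                if (!(id in head)) print id
        }'
}

# Intended use inside a repository (shown as a comment, not executed here):
#   git ls-tree -r HEAD | ids_not_in_head ./to-delete.txt > ./to-delete-safe.txt
#   java -jar bfg.jar --no-blob-protection --strip-blobs-with-ids ./to-delete-safe.txt
```

Comparing the two files before running BFG shows which blobs were held back because a copy of them is still in use.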
from bfg-repo-cleaner.
@bschindler yes that's dangerous, that's what I said in bold in my comment.
I made a PR which enables a safe method, see #166 - unfortunately the maintainer doesn't seem to care about PRs.
from bfg-repo-cleaner.
I do, I just get through them real slowly
On 23 Aug 2016 12:00 p.m., "Lucas Trzesniewski" wrote:
@bschindler yes that's dangerous, that's what I said in bold in my comment.
I made a PR which enables a safe method, see #166 - unfortunately the maintainer doesn't seem to care about PRs.
from bfg-repo-cleaner.
@rtyley oh sorry I just noticed you started working on the project again recently
from bfg-repo-cleaner.
I erroneously included a folder containing release builds in my git repo. Can I use BFG to undo this, removing that folder and its contents from my git history? It's just been sitting there taking up space for no reason.
from bfg-repo-cleaner.
@TharosTheDragon If the directory name is consistent throughout history, and doesn't conflict by name with other directories in the tree, then you could use --delete-dirs <glob> - delete directories with the specified names. Note that command is based on name, not path, so if you have multiple directories with the same name, even at different depths, they will both be removed.
from bfg-repo-cleaner.
What's the difference between --delete-dirs and --delete-folders?
from bfg-repo-cleaner.
I'm sorry...I copied the wrong string, not paying attention...should be --delete-folders <glob>
from bfg-repo-cleaner.