git-filter-repo's Introduction

git filter-repo is a versatile tool for rewriting history, which includes capabilities I have not found anywhere else. It roughly falls into the same space of tool as git filter-branch but without the capitulation-inducing poor performance, with far more capabilities, and with a design that scales usability-wise beyond trivial rewriting cases. git filter-repo is now recommended by the git project instead of git filter-branch.

While most users will probably just use filter-repo as a simple command line tool (and likely only use a few of its flags), at its core filter-repo contains a library for creating history rewriting tools. As such, users with specialized needs can leverage it to quickly create entirely new history rewriting tools.

Prerequisites

filter-repo requires:

  • git >= 2.22.0 at a minimum; some features require git >= 2.24.0
  • python3 >= 3.5

How do I install it?

git-filter-repo is a single-file python script, which was done to make installation for basic use on many systems trivial: just place that file into your $PATH.

See INSTALL.md for things beyond basic usage or special cases. The more involved instructions are only needed if one of the following applies:

  • you do not find the above comment about trivial installation intuitively obvious
  • you are working with a python3 executable named something other than "python3"
  • you want to install documentation (beyond the builtin docs shown with -h)
  • you want to run some of the contrib examples
  • you want to create your own python filtering scripts using filter-repo as a module/library

How do I use it?

For comprehensive documentation:

  • see the user manual
  • alternative formatting of the user manual is available on various external sites (example), for those that don't like the htmlpreview.github.io layout, though it may only be up-to-date as of the latest release

If you prefer learning from examples:

Why filter-repo instead of other alternatives?

This was covered in more detail in a Git Rev News article on filter-repo, but some highlights for the main competitors:

filter-branch

BFG Repo Cleaner

  • great tool for its time, but while it makes some things simple, it is limited to a few kinds of rewrites.

  • its architecture is not amenable to handling more types of rewrites.

  • its architecture presents some shortcomings and bugs even for its intended use case.

  • fans of bfg may be interested in bfg-ish, a reimplementation of bfg based on filter-repo which includes several new features and bugfixes relative to bfg.

  • a cheat sheet is available showing how to convert example commands from the manual of BFG Repo Cleaner into filter-repo commands.

Simple example, with comparisons

Let's say that we want to extract a piece of a repository, with the intent of merging just that piece into some other bigger repo. For extraction, we want to:

  • extract the history of a single directory, src/. This means that only paths under src/ remain in the repo, and any commits that only touched paths outside this directory will be removed.
  • rename all files to have a new leading directory, my-module/ (e.g. so that src/foo.c becomes my-module/src/foo.c)
  • rename any tags in the extracted repository to have a 'my-module-' prefix (to avoid any conflicts when we later merge this repo into something else)

Solving this with filter-repo

Doing this with filter-repo is as simple as the following command:

  git filter-repo --path src/ --to-subdirectory-filter my-module --tag-rename '':'my-module-'

(the single quotes are unnecessary, but make it clearer to a human that we are replacing the empty string as a prefix with my-module-)
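
To make the effect of the three flags concrete, here is a minimal Python sketch of the path and tag transformations described above. It only illustrates the intended result and is not filter-repo's implementation; the function names rewrite_path and rename_tag and their parameters are hypothetical.

```python
# Illustration only: hypothetical helpers, not part of filter-repo.

def rewrite_path(path, keep_prefix="src/", new_root="my-module/"):
    """Return the rewritten path, or None if the path is filtered out."""
    # Mirrors the intent of `--path src/`: only paths under src/ remain.
    if not path.startswith(keep_prefix):
        return None
    # Mirrors the intent of `--to-subdirectory-filter my-module`.
    return new_root + path


def rename_tag(ref, old_prefix="", new_prefix="my-module-"):
    """Replace old_prefix with new_prefix on tag names; leave other refs alone."""
    tags = "refs/tags/"
    if not ref.startswith(tags):
        return ref
    name = ref[len(tags):]
    if not name.startswith(old_prefix):
        return ref
    # With an empty old_prefix this simply prepends new_prefix.
    return tags + new_prefix + name[len(old_prefix):]


if __name__ == "__main__":
    for path in ("src/foo.c", "docs/README"):
        print(path, "->", rewrite_path(path))
    print(rename_tag("refs/tags/v1.0"))
```

A path that maps to None corresponds to a file that no longer exists in the rewritten history; a commit that touched only such paths is the kind that gets removed.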

Solving this with BFG Repo Cleaner

BFG Repo Cleaner is not capable of this kind of rewrite; in fact, all three types of wanted changes are outside of its capabilities.

Solving this with filter-branch

filter-branch comes with a pile of caveats (more on that below) even once you figure out the necessary invocation(s):

  git filter-branch \
      --tree-filter 'mkdir -p my-module && \
                     git ls-files \
                         | grep -v ^src/ \
                         | xargs git rm -f -q && \
                     ls -d * \
                         | grep -v my-module \
                         | xargs -I files mv files my-module/' \
      --tag-name-filter 'echo "my-module-$(cat)"' \
      --prune-empty -- --all
  git clone file://$(pwd) newcopy
  cd newcopy
  git for-each-ref --format="delete %(refname)" refs/tags/ \
      | grep -v refs/tags/my-module- \
      | git update-ref --stdin
  git gc --prune=now

Some might notice that the above filter-branch invocation will be really slow due to using --tree-filter; you could alternatively use the --index-filter option of filter-branch, changing the above commands to:

  git filter-branch \
      --index-filter 'git ls-files \
                          | grep -v ^src/ \
                          | xargs git rm -q --cached;
                      git ls-files -s \
                          | sed "s%$(printf \\t)%&my-module/%" \
                          | git update-index --index-info;
                      git ls-files \
                          | grep -v ^my-module/ \
                          | xargs git rm -q --cached' \
      --tag-name-filter 'echo "my-module-$(cat)"' \
      --prune-empty -- --all
  git clone file://$(pwd) newcopy
  cd newcopy
  git for-each-ref --format="delete %(refname)" refs/tags/ \
      | grep -v refs/tags/my-module- \
      | git update-ref --stdin
  git gc --prune=now

However, for either filter-branch command there are a pile of caveats. First, some may be wondering why I list five commands here for filter-branch. Despite the use of --all and --tag-name-filter, and filter-branch's manpage claiming that a clone is enough to get rid of old objects, the extra steps to delete the other tags and do another gc are still required to clean out the old objects and avoid mixing new and old history before pushing somewhere. Other caveats:

  • Commit messages are not rewritten; so if some of your commit messages refer to prior commits by (abbreviated) sha1, after the rewrite those messages will now refer to commits that are no longer part of the history. It would be better to rewrite those (abbreviated) sha1 references to refer to the new commit ids.
  • The --prune-empty flag sometimes misses commits that should be pruned, and it will also prune commits that started empty rather than just ended empty due to filtering. For repositories that intentionally use empty commits for versioning and publishing related purposes, this can be detrimental.
  • The commands above are OS-specific. GNU vs. BSD issues for sed, xargs, and other commands often trip up users; I think I failed to get most folks to use --index-filter since the only example in the filter-branch manpage that both uses it and shows how to move everything into a subdirectory is Linux-specific, and it is not obvious to the reader that it has a portability issue since it silently misbehaves rather than failing loudly.
  • The --index-filter version of the filter-branch command may be two to three times faster than the --tree-filter version, but both filter-branch commands are going to be multiple orders of magnitude slower than filter-repo.
  • Both commands assume all filenames are composed entirely of ASCII characters (even special ASCII characters such as tabs or double quotes will wreak havoc and likely result in missing files or misnamed files).
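
The first caveat, stale commit references in messages, can be pictured with a small Python sketch. It assumes a mapping from old full commit ids to new ones is already available; the function name rewrite_hashes, the regular expression, and the sample ids are hypothetical and chosen only for illustration, not taken from any tool.

```python
import re

# Hypothetical illustration: runs of 7 to 40 lowercase hex digits are treated
# as candidate (abbreviated) commit ids.
HEX_RUN = re.compile(r"\b[0-9a-f]{7,40}\b")


def rewrite_hashes(message, old_to_new):
    """Replace (abbreviated) old commit ids in message with new ids.

    old_to_new maps full old ids to full new ids. An abbreviation is only
    rewritten when it matches exactly one old id; anything else is left alone.
    """
    def replace(match):
        abbrev = match.group(0)
        matches = [old for old in old_to_new if old.startswith(abbrev)]
        if len(matches) != 1:
            return abbrev  # unknown or ambiguous: keep the original text
        # Keep the abbreviation length the author used.
        return old_to_new[matches[0]][:len(abbrev)]

    return HEX_RUN.sub(replace, message)


if __name__ == "__main__":
    mapping = {
        "0123456789abcdef0123456789abcdef01234567":
        "fedcba9876543210fedcba9876543210fedcba98",
    }
    print(rewrite_hashes("This reverts commit 0123456789ab.", mapping))
```

Ordinary words that happen to look like hex (for example "defaced") are left untouched because they do not match any known old id.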

Solving this with fast-export/fast-import

One can kind of hack this together with something like:

  git fast-export --no-data --reencode=yes --mark-tags --fake-missing-tagger \
      --signed-tags=strip --tag-of-filtered-object=rewrite --all \
      | grep -vP '^M [0-9]+ [0-9a-f]+ (?!src/)' \
      | grep -vP '^D (?!src/)' \
      | perl -pe 's%^(M [0-9]+ [0-9a-f]+ )(.*)$%\1my-module/\2%' \
      | perl -pe 's%^(D )(.*)$%\1my-module/\2%' \
      | perl -pe s%refs/tags/%refs/tags/my-module-% \
      | git -c core.ignorecase=false fast-import --date-format=raw-permissive \
            --force --quiet
  git for-each-ref --format="delete %(refname)" refs/tags/ \
      | grep -v refs/tags/my-module- \
      | git update-ref --stdin
  git reset --hard
  git reflog expire --expire=now --all
  git gc --prune=now

But this comes with some nasty caveats and limitations:

  • The various greps and regex replacements operate on the entire fast-export stream and thus might accidentally corrupt unintended portions of it, such as commit messages. If you needed to edit file contents and thus dropped the --no-data flag, it could also end up corrupting file contents.
  • This command assumes all filenames in the repository are composed entirely of ascii characters, and also exclude special characters such as tabs or double quotes. If such a special filename exists within the old src/ directory, it will be pruned even though it was intended to be kept. (In slightly different repository rewrites, this type of editing also risks corrupting filenames with special characters by adding extra double quotes near the end of the filename and in some leading directory name.)
  • This command will leave behind huge numbers of useless empty commits, and has no realistic way of pruning them. (And if you tried to combine this technique with another tool to prune the empty commits, then you now have no way to distinguish between commits which were made empty by the filtering that you want to remove, and commits which were empty before the filtering process and which you thus may want to keep.)
  • Commit messages which reference other commits by hash will now reference old commits that no longer exist. Attempting to edit the commit messages to update them is extraordinarily difficult to add to this kind of direct rewrite.
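
The first caveat can be demonstrated with a short Python sketch. The stream below is a hand-written fragment that only resembles fast-export output, and naive_filter is a hypothetical stand-in for the `grep -vP '^D (?!src/)'` step above; the point is that a line-oriented filter cannot tell a file deletion from a commit message line that happens to start with "D ".

```python
import re

# Hand-written fragment resembling a fast-export commit (an assumption for
# illustration, not real tool output).
MESSAGE = "D is for documentation\n"
STREAM = "".join([
    "commit refs/heads/main\n",
    "mark :2\n",
    "committer A U Thor <author@example.com> 1700000000 +0000\n",
    f"data {len(MESSAGE)}\n",
    MESSAGE,
    "M 100644 :1 src/foo.c\n",
    "D docs/old.txt\n",
])

# Same pattern as the grep above: deletions of paths outside src/.
DELETE_OUTSIDE_SRC = re.compile(r"^D (?!src/)")


def naive_filter(stream):
    """Drop every line that looks like a deletion outside src/."""
    kept = [line for line in stream.splitlines(keepends=True)
            if not DELETE_OUTSIDE_SRC.match(line)]
    return "".join(kept)


if __name__ == "__main__":
    filtered = naive_filter(STREAM)
    print(filtered)
    # The deletion is gone as intended, but so is the commit message, while
    # the "data <count>" header still announces the original byte count.
    print("message survived:", MESSAGE in filtered)
```

After filtering, the byte count in the data header no longer matches what follows it, so the stream is no longer well formed.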

Design rationale behind filter-repo

None of the existing repository filtering tools did what I wanted; they all came up short for my needs. No tool provided any of the first eight traits below I wanted, and no tool provided more than two of the last four traits either:

  1. [Starting report] Provide user an analysis of their repo to help them get started on what to prune or rename, instead of expecting them to guess or find other tools to figure it out. (Triggered, e.g. by running the first time with a special flag, such as --analyze.)

  2. [Keep vs. remove] Instead of just providing a way for users to easily remove selected paths, also provide flags for users to only keep certain paths. Sure, users could work around this by specifying to remove all paths other than the ones they want to keep, but the need to specify all paths that ever existed in any version of the repository could sometimes be quite painful. For filter-branch, using pipelines like git ls-files | grep -v ... | xargs -r git rm might be a reasonable workaround but can get unwieldy and isn't as straightforward for users; plus those commands are often operating-system specific (can you spot the GNUism in the snippet I provided?).

  3. [Renaming] It should be easy to rename paths. For example, in addition to allowing one to treat some subdirectory as the root of the repository, also provide options for users to make the root of the repository just become a subdirectory. And more generally allow files and directories to be easily renamed. Provide sanity checks if renaming causes multiple files to exist at the same path. (And add special handling so that if a commit merely copied oldname->newname without modification, then filtering oldname->newname doesn't trigger the sanity check and die on that commit.)

  4. [More intelligent safety] Writing copies of the original refs to a special namespace within the repo does not provide a user-friendly recovery mechanism. Many would struggle to recover using that. Almost everyone I've ever seen do a repository filtering operation has done so with a fresh clone, because wiping out the clone in case of error is a vastly easier recovery mechanism. Strongly encourage that workflow by detecting and bailing if we're not in a fresh clone, unless the user overrides with --force.

  5. [Auto shrink] Automatically remove old cruft and repack the repository for the user after filtering (unless overridden); this simplifies things for the user, helps avoid mixing old and new history together, and avoids problems where the multi-step process for shrinking the repo documented in the manpage doesn't actually work in some cases. (I'm looking at you, filter-branch.)

  6. [Clean separation] Avoid confusing users (and prevent accidental re-pushing of old stuff) due to mixing old repo and rewritten repo together. (This is particularly a problem with filter-branch when using the --tag-name-filter option, and sometimes also an issue when only filtering a subset of branches.)

  7. [Versatility] Provide the user the ability to extend the tool or even write new tools that leverage existing capabilities, and provide this extensibility in a way that (a) avoids the need to fork separate processes (which would destroy performance), (b) avoids making the user specify OS-dependent shell commands (which would prevent users from sharing commands with each other), (c) takes advantage of rich data structures (because hashes, dicts, lists, and arrays are prohibitively difficult in shell) and (d) provides reasonable string manipulation capabilities (which are sorely lacking in shell).

  8. [Old commit references] Provide a way for users to use old commit IDs with the new repository (in particular via mapping from old to new hashes with refs/replace/ references).

  9. [Commit message consistency] If commit messages refer to other commits by ID (e.g. "this reverts commit 01234567890abcdef", "In commit 0013deadbeef9a..."), those commit messages should be rewritten to refer to the new commit IDs.

  10. [Become-empty pruning] Commits which become empty due to filtering should be pruned. If the parent of a commit is pruned, the first non-pruned ancestor needs to become the new parent. If no non-pruned ancestor exists and the commit was not a merge, then it becomes a new root commit. If no non-pruned ancestor exists and the commit was a merge, then the merge will have one less parent (and thus make it likely to become a non-merge commit which would itself be pruned if it had no file changes of its own). One special thing to note here is that we prune commits which become empty, NOT commits which start empty. Some projects intentionally create empty commits for versioning or publishing reasons, and these should not be removed. (As a special case, commits which started empty but whose parent was pruned away will also be considered to have "become empty".)

  11. [Become-degenerate pruning] Pruning of commits which become empty can potentially cause topology changes, and there are lots of special cases. Normally, merge commits are not removed since they are needed to preserve the graph topology, but the pruning of parents and other ancestors can ultimately result in the loss of one or more parents. A simple case was already noted above: if a merge commit loses enough parents to become a non-merge commit and it has no file changes, then it too can be pruned. Merge commits can also have a topology that becomes degenerate: it could end up with the merge_base serving as both parents (if all intervening commits from the original repo were pruned), or it could end up with one parent which is an ancestor of its other parent. In such cases, if the merge has no file changes of its own, then the merge commit can also be pruned. However, much as we do with empty pruning we do not prune merge commits that started degenerate (which indicates it may have been intentional, such as with --no-ff merges) but only merge commits that become degenerate and have no file changes of their own.

  12. [Speed] Filtering should be reasonably fast
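
The become-empty rule in item 10 can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not filter-repo's code: commits arrive parents-first as (id, parents, changed_paths) tuples, the function name prune_become_empty and the keep predicate are hypothetical, and the degenerate-merge handling from item 11 is deliberately left out.

```python
def prune_become_empty(commits, keep):
    """Prune commits that become empty and reparent their descendants.

    commits: list of (commit_id, parent_ids, changed_paths), parents first.
    keep:    predicate deciding whether a path survives filtering.
    Returns {commit_id: (new_parent_ids, kept_paths)} for surviving commits.
    """
    replacement = {}  # pruned commit id -> surviving ancestor id, or None
    result = {}
    for commit_id, parents, paths in commits:
        parent_pruned = any(p in replacement for p in parents)
        # Map each parent to its first non-pruned ancestor; drop parents
        # that have no surviving ancestor at all.
        new_parents = []
        for parent in parents:
            parent = replacement.get(parent, parent)
            if parent is not None:
                new_parents.append(parent)
        kept = [path for path in paths if keep(path)]
        # "Become empty": had changes that were all filtered away, or started
        # empty but lost a parent to pruning. Commits that started empty and
        # kept their parent are preserved.
        became_empty = not kept and (bool(paths) or parent_pruned)
        if became_empty and len(new_parents) <= 1:
            replacement[commit_id] = new_parents[0] if new_parents else None
            continue
        result[commit_id] = (new_parents, kept)
    return result


if __name__ == "__main__":
    history = [
        ("A", [], ["src/a.c"]),
        ("B", ["A"], ["docs/readme"]),   # becomes empty, pruned
        ("C", ["B"], ["src/b.c"]),       # reparented onto A
        ("D", ["C"], []),                # started empty, kept
    ]
    for cid, info in prune_become_empty(history, lambda p: p.startswith("src/")).items():
        print(cid, info)
```

If a pruned commit has no surviving ancestor, its children simply end up with fewer parents, so a non-merge child becomes a new root commit.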

How do I contribute?

See the contributing guidelines.

Is there a Code of Conduct?

Participants in the filter-repo community are expected to adhere to the same standards as for the git project, so the git Code of Conduct applies.

Upstream Improvements

Work on filter-repo and its predecessor has also driven numerous improvements to fast-export and fast-import (and occasionally other commands) in core git, based on things filter-repo needs to do its work:

git-filter-repo's Issues

subdirectory-filter corrupting repository with nested directory named like its parent

Hello, I am trying to split a repository into subprojects and got unexpected behaviour.
Roughly, my repo has a directory structure like this:

code/
  src/
  (some other files and directories)
tests/
  tests/
    src/
  test-runner/

When I run git filter-repo --subdirectory-filter tests, I expect to end up with:

tests/
  src/
test-runner/

Instead, I got:

tests/
  src/
test-runner/
src/

Project root got all the files and directories from the original tests/ and tests/tests/ directories mixed instead of only the ones from tests/.

If you need more formal reproduction steps or an example repo, I can provide those after the weekend.

Ability to run code formatters

Hi, thanks for this tool. Everything is pointing me here, but I can't seem to find a way to run things like black with this tool. With git filter-branch or filter-lamely, I can use the --tree-filter to run something like:

FILTER_BRANCH_SQUELCH_WARNING=1 time git filter-branch --tree-filter '\
    git show $GIT_COMMIT --name-status\
        | grep "^[AM]" \
        | grep "\.py$" \
        | cut -f2 \
        | xargs black \
          || echo "Error formatting, possibly invalid python" \
' -- --all

Given every command is directing me to use stock git-filter-repo, what's the right way to do this? It'd be nice to only run the command on files matching a pattern (ex: *.py), but still keep unmatched files.

Removing subtrees?

Can this tool be used to remove subtrees? I want to remove their files and all related commits, so that it's as if I had never added subtrees to my repo.

When I created my subtrees, I did not use the --squash option.

My commit history right now looks like this (newest to oldest):

  • Commits to this git repo (outside the subtrees)
  • Group of subtree merge commits (created by git subtree add)
  • Initial commit to this git repo
  • Many earlier commits which came from the subtreed repositories

When I try to run git-filter-repo --invert-paths --path path/to/subtree --path path/to/othersubtree ..., the files are removed but not the commits. I've experimented with using the --replace-refs, --prune-empty and --prune-degenerate flags, but nothing has worked so far. On the plus side, each run is much faster than when I was trying to use git filter-branch for this!

Any ideas? Am I just missing something simple?

"fatal: cannot lock ref" when issuing filter-repo command

Hi,

I'm trying to move one subdirectory out of my git repo, but I keep getting a "fatal: cannot lock ref" error:

chambers:~/tmp/baker (master) $ git filter-repo --path baker/assets
fatal: cannot lock ref 'refs/heads/remote_branch': Unable to create '/Users/chambers/tmp/baker/.git/refs/heads/remote_branch.lock': File exists.

I checked the dir and verified that "remote_branch.lock" is not present.

I am running [email protected] & python 3.7.4 on Mac OSX Mojave.

TypeError: CreateProcess() argument 8 must be str or None, not bytes

On Windows I am seeing the following error:

git filter-repo --mailmap dan-mailmap.txt

Traceback (most recent call last):
  File "git-filter-repo", line 3437, in <module>
  File "git-filter-repo", line 3335, in run
  File "git-filter-repo", line 2904, in _run_sanity_checks
  File "git-filter-repo", line 3154, in results_tmp_dir
  File "git-filter-repo", line 1870, in determine_git_dir
  File "subprocess.py", line 395, in check_output
  File "subprocess.py", line 472, in run
  File "subprocess.py", line 775, in __init__
  File "subprocess.py", line 1178, in _execute_child
TypeError: CreateProcess() argument 8 must be str or None, not bytes
[66484] Failed to execute script git-filter-repo

This is using the latest version of the repo. I've altered the script to force it to use the subprocess wrapper and even forced it not to use the wrapper and changed all "cwd=" statements to use decode and still I am seeing an error. I am using Python 3.7.

error after moving subdirectory to root

I'm trying to pull a subdirectory of a repo to a separate repo with all the files directly in the repo root.
So...

% git-filter-repo --path modules/openapi-generator/src/main/resources/python --path-rename modules/openapi-generator/src/main/resources/python:.
Parsed 13471 commits
New history written in 8.65 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
error: invalid path './README.mustache'
fatal: Could not reset index file to revision 'HEAD'.

Looks like it's a problem when moving the contents of a subdirectory to the repo root, as it works with

% git-filter-repo --path modules/openapi-generator/src/main/resources/python --path-rename modules/openapi-generator/src/main/resources/python:root
Parsed 13471 commits
New history written in 6.59 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
Updating files: 100% (24796/24796), done.
HEAD is now at 96473f9305 [Python] Remove mutable default argument (#4665)
Enumerating objects: 434, done.
Counting objects: 100% (434/434), done.
Delta compression using up to 4 threads
Compressing objects: 100% (172/172), done.
Writing objects: 100% (434/434), done.
Total 434 (delta 209), reused 392 (delta 182)
Completely finished after 9.71 seconds.

lint-history example breaks with --filenames-important with a "delete file" commit

Thanks for this project!
I'm trying to use it to rewrite a repo to only use Unix line endings.
However, I'm running into an issue with lint-history together with --filenames-important (note: it does work when omitting --filenames-important).

Here is a minimal repo with which to reproduce this:

rm -rf /tmp/repo_with_delete
mkdir /tmp/repo_with_delete
cd /tmp/repo_with_delete

git init .
echo "A" > a.cpp
git add a.cpp
git commit -am "1"
git rm a.cpp
git commit -am "2"

export PYTHONPATH=/w/src/git-conversion/3rdparty/git-filter-repo/
/w/src/git-conversion/3rdparty/git-filter-repo/contrib/filter-repo-demos/lint-history --filenames-important dos2unix

I get:

dos2unix: converting file /tmp/tmp8vmqz50x/a.cpp to Unix format...
Traceback (most recent call last):
  File "/w/src/git-conversion/3rdparty/git-filter-repo/contrib/filter-repo-demos/lint-history", line 107, in <module>
    filter.run()
  File "/w/src/git-conversion/3rdparty/git-filter-repo/git_filter_repo.py", line 3813, in run
    self._parser.run(self._input, self._output)
  File "/w/src/git-conversion/3rdparty/git-filter-repo/git_filter_repo.py", line 1396, in run
    self._parse_commit()
  File "/w/src/git-conversion/3rdparty/git-filter-repo/git_filter_repo.py", line 1249, in _parse_commit
    self._commit_callback(commit, aux_info)
  File "/w/src/git-conversion/3rdparty/git-filter-repo/git_filter_repo.py", line 3363, in _tweak_commit
    self._commit_callback(commit, self.callback_metadata(aux_info))
  File "/w/src/git-conversion/3rdparty/git-filter-repo/contrib/filter-repo-demos/lint-history", line 64, in lint_with_real_filenames
    cat_file_process.stdin.write(change.blob_id + b'\n')
TypeError: unsupported operand type(s) for +: 'NoneType' and 'bytes'
fatal: stream ends early
fast-import: dumping crash report to .git/fast_import_crash_22223

I'm on commit 4ea19c0 of git-filter-repo.

filter-repo removes my remotes

I am splitting a subtree, then I want to use filter-repo to rename and delete some files in the new branch. Here is my script:

$ git remote -v
origin	https://github.com/MirrorNG/MirrorNG (fetch)
origin	https://github.com/MirrorNG/MirrorNG (push)

$ git subtree split --prefix=Assets/Mirror -b $BRANCH
1/3257 (0) [0]
2/3257 (1) [0]
...
d262bbf6fd2262d3a551aa3e7cf72ea41d0f8b53
Created branch 'upm_test'

$ git remote -v
origin	https://github.com/MirrorNG/MirrorNG (fetch)
origin	https://github.com/MirrorNG/MirrorNG (push)

$ git gc
Computing commit graph generation numbers:   0% (1/4981)
Computing commit graph generation numbers:   1% (50/4981)
...
Computing commit graph generation numbers: 100% (4981/4981), done.

$ git remote -v
origin	https://github.com/MirrorNG/MirrorNG (fetch)
origin	https://github.com/MirrorNG/MirrorNG (push)

# remove the Tests folder
$ git filter-repo --force --invert-paths --path Tests --refs $BRANCH
Parsed 51 commits
Parsed 265 commits
Parsed 794 commits
Parsed 1251 commits
Parsed 1597 commits
Parsed 1724 commitsHEAD is now at 9f326ba9 What happened to origin?
New history written in 0.66 seconds...
Completely finished after 0.91 seconds.

$ git remote -v
origin	https://github.com/MirrorNG/MirrorNG (fetch)
origin	https://github.com/MirrorNG/MirrorNG (push)


$ git filter-repo --force --path-rename "Examples:Samples~"
Parsed 533 commits
Parsed 1242 commits
Parsed 1906 commits
Parsed 2489 commits
Parsed 3088 commits
Parsed 3567 commits
Parsed 4145 commits
Parsed 4546 commits
Parsed 4924 commitsHEAD is now at d680d139 What happened to origin?

New history written in 0.91 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
Completely finished after 2.68 seconds.

# origin is not there anymore,  why is this messing with my remotes?
$ git remote -v

After the command git filter-repo --force --path-rename "Examples:Samples~" there are no remotes anymore.

Feature Request / Question: Can git-filter-repo merge commits into a repo rather than replace?

I want to use git-filter-repo to do all the awesome stuff it does from a source repository, but then I want the output to merge with the target repository, not overwrite and replace.

I'm trying to create a mono-repo out of a bunch of separate repositories. Some of them merge in fine with other tools, but some only seem to work with git-filter-repo. Plus, git-filter-repo is so much faster and easier to use!

However, I can't use git-filter-repo, at least not alone, as it always wipes the target repo entirely out.

Thank you!

filter-repo --path doesn't preserve history of content in path that earlier had been out of path (multi-path rename or copy detection)

STEPS TO REPRODUCE:

$ git clone https://fuchsia.googlesource.com/fuchsia
$ cd fuchsia
$ git --no-pager log --pretty=oneline --follow ee838c60b2ff5b957898323cf27f0bd92fa7c93d -- src/ledger/bin/app/app.cc|wc
$ git filter-repo --path=src/ledger/
$ git --no-pager log --pretty=oneline --follow 93946749a1 -- src/ledger/bin/app/app.cc|wc

(Here 93946749a1 comes from HEAD is now at 93946749a1 [ledger] Migrate ptr.h to ledger/lib in the output of git filter-repo.)

EXPECTED RESULTS:
The first number in the output of the two git log commands will be the same, indicating that all of the history of src/ledger/bin/app/app.cc that was available before filtering the repo remains available.

OBSERVED RESULTS:
The numbers are 144 and 25, indicating that a great deal of the history of src/ledger/bin/app/app.cc was not preserved. Additionally, browsing the new log for the file indicates that history for the file only goes back as far as when the file was moved into src/ledger/.

Am I holding it wrong? I would have thought from "extracting wanted paths and their history (stripping everything else)" that the full history of the content at the given path at HEAD would be retained, even if the content at the given path at HEAD had not always been located at the given path.

Rewrite only non-master branches

Hi,

I'm busy with an SVN to Git migration of a large repo. My current trunk branch on the SVN repo is located in /trunk/src/ and my branches are in e.g. /branches/branch-name/.

In the new git repo, I would like to keep the ./src directory and have all the code in there. After the git svn migration (using standard tools) the master branch's files end up correctly in ./src, but the branches' files end up in ./, so I would like to rewrite their history to move them into ./src.

I started by playing with git filter-branch, but, as you know, that is incredibly slow.

I tried this tool (thanks!), but it seems these operations are done on a repo level, not specific to branches.

Am I able to rewrite only non-master branches' history?

--commit-callback - delete file fails

Hi
I am getting the following error when trying to delete a file using the commit-callback:

git filter-repo --force --commit-callback "if not commit.parents:commit.file_changes.append(FileChange(b'D', b'$(git hash-object -w '.tfignore')' ,b'.tfignore', b'100644'))"

Traceback (most recent call last):
  File "C:/Program Files/Git/mingw64/libexec/git-core\git-filter-repo", line 3840, in <module>
    filter.run()
  File "C:/Program Files/Git/mingw64/libexec/git-core\git-filter-repo", line 3777, in run
    self._parser.run(self._input, self._output)
  File "C:/Program Files/Git/mingw64/libexec/git-core\git-filter-repo", line 1396, in run
    self._parse_commit()
  File "C:/Program Files/Git/mingw64/libexec/git-core\git-filter-repo", line 1249, in _parse_commit
    self._commit_callback(commit, aux_info)
  File "C:/Program Files/Git/mingw64/libexec/git-core\git-filter-repo", line 3328, in _tweak_commit
    self._commit_callback(commit, self.callback_metadata(aux_info))
  File "<string>", line 2, in callback
  File "C:/Program Files/Git/mingw64/libexec/git-core\git-filter-repo", line 583, in __init__
    assert id_ is None and mode is None
AssertionError
fatal: stream ends early

--path-rename with spaces

Hi,

I can't get the --path-rename option to work if there is a space in the path.

Works:
git filter-repo --path new.folder/ --path-rename new.folder/:'' --force --dry-run

Fails:
git filter-repo --path 'new folder/' --path-rename 'new folder/':'' --force --dry-run

This simply throws the exception:

Traceback (most recent call last):
  File "C:\Python38\Scripts\\git-filter-repo", line 3871, in <module>  
    args = FilteringOptions.parse_args(sys.argv[1:])
  File "C:\Python38\Scripts\\git-filter-repo", line 2141, in parse_args
    args = parser.parse_args(input_args)
  File "C:\Python38\lib\argparse.py", line 1768, in parse_args
    args, argv = self.parse_known_args(args, namespace)
  File "C:\Python38\lib\argparse.py", line 1800, in parse_known_args
    namespace, args = self._parse_known_args(args, namespace)
  File "C:\Python38\lib\argparse.py", line 2006, in _parse_known_args
    start_index = consume_optional(start_index)
  File "C:\Python38\lib\argparse.py", line 1946, in consume_optional
    take_action(action, args, option_string)
  File "C:\Python38\lib\argparse.py", line 1874, in take_action
    action(self, namespace, argument_values, option_string)
  File "C:\Python38\Scripts\\git-filter-repo", line 1611, in __call__
    if values[0] and values[1] and not (
IndexError: list index out of range

Is this a python version issue? I have tried all kinds of combinations, double quotes, escape chars, etc.
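
The `IndexError` at `values[1]` is consistent with the `--path-rename` value arriving without a `:` in it, which could happen if the shell splits the quoted argument at the space (an assumption; the reporter's shell is not stated). One way to take shell quoting out of the picture is to pass the argument vector directly from Python. This is a minimal sketch: `build_argv` and `run_filter_repo` are hypothetical helper names, and the flags are only the ones used in the commands above.

```python
import subprocess


def build_argv(old_prefix, new_prefix=""):
    """Build the argument list; each element reaches the program unsplit."""
    return [
        "git", "filter-repo",
        "--path", old_prefix,
        "--path-rename", "{}:{}".format(old_prefix, new_prefix),
        "--force", "--dry-run",
    ]


def run_filter_repo(old_prefix, new_prefix=""):
    """Run the command without a shell, so spaces need no quoting."""
    return subprocess.run(build_argv(old_prefix, new_prefix), check=False)


# Inspect the arguments before running anything.
print(build_argv("new folder/"))
```

Calling `run_filter_repo("new folder/")` inside a clone would then run the same operation as the failing command. Whether this avoids the error in the reporter's environment has not been verified.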

incorrect results when filtering non-master branch

git clone https://github.com/google/gvisor -b go
cd gvisor
ls -la pkg/ilist

total 20K
drwxr-x---  2 tamird primarygroup 4.0K Oct  3 15:07 .
drwxr-x--- 38 tamird primarygroup 4.0K Oct  3 15:07 ..
-rwxr-x---  1 tamird primarygroup  819 Oct  3 15:07 ilist_state_autogen.go
-rwxr-x---  1 tamird primarygroup 4.5K Oct  3 15:07 interface_list.go

~/src/git-filter-repo/git-filter-repo --path pkg/ilist
ls -la pkg/ilist

total 40K
drwxr-x--- 2 tamird primarygroup 4.0K Oct  3 15:06 .
drwxr-x--- 3 tamird primarygroup 4.0K Oct  3 15:06 ..
-rw-r----- 1 tamird primarygroup 1.2K Oct  3 15:06 BUILD
-rwxr-x--- 1 tamird primarygroup  819 Oct  3 15:06 ilist_state_autogen.go
-rwxr-x--- 1 tamird primarygroup 4.5K Oct  3 15:06 interface_list.go
-rw-r----- 1 tamird primarygroup 5.1K Oct  3 15:06 list.go
-rw-r----- 1 tamird primarygroup 4.4K Oct  3 15:06 list_test.go

Somehow, git-filter-repo brought those files back from the dead. Note that those files exist on the master branch.

formatting repo with git-filter-repo

Good day,

First of all, thanks for the tool 🙇‍♂️. Recently git started to print a deprecation warning for git filter-branch, so I ended up here while googling how to reformat my repo without a big format commit 🔗. What I want to do is rewrite the project's history, applying the new formatting to each commit.

After reading through the docs, I understand that I can apply formatting by using --blob-callback, and that I can do some conditioning based on the filenames using --filename-callback or --commit-callback. But what I really want to do is apply formatting only to certain files, for example those with the extension .java.

My question:
Given that the filename is not present in a blob, and that the blobs are not present in a commit, how can I change file content during the history rewrite based on the file name and other conditions?
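
One way to think about this is to drive the rewrite from the commit side: each file change in a commit pairs a filename with a blob id, so the filename can decide whether that blob gets looked up and reformatted. The sketch below is a plain-Python model of that idea, not git-filter-repo's API. The dict layout, `reformat_history`, and the toy formatter are all hypothetical; only the blob-id computation follows git's actual SHA-1 object format.

```python
import hashlib


def git_blob_id(data):
    """Compute the object id git assigns to a blob with this content (SHA-1)."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def strip_trailing_whitespace(data):
    """Toy formatter standing in for a real code formatter."""
    return b"\n".join(line.rstrip() for line in data.split(b"\n"))


def reformat_history(commits, blobs, wanted, formatter):
    """Rewrite blobs, but only for file changes whose name matches.

    commits: list of commits, each a list of {"filename": ..., "blob_id": ...}
    blobs:   dict mapping blob id -> content (bytes)
    """
    for file_changes in commits:
        for change in file_changes:
            if not wanted(change["filename"]):
                continue  # other files keep their original blob
            new_data = formatter(blobs[change["blob_id"]])
            new_id = git_blob_id(new_data)
            blobs[new_id] = new_data
            change["blob_id"] = new_id


java_src = b"class A {}   \n"
notes = b"keep   \n"
blobs = {git_blob_id(java_src): java_src, git_blob_id(notes): notes}
commits = [[
    {"filename": "src/A.java", "blob_id": git_blob_id(java_src)},
    {"filename": "notes.txt", "blob_id": git_blob_id(notes)},
]]
reformat_history(commits, blobs, lambda name: name.endswith(".java"),
                 strip_trailing_whitespace)
```

After the call, the `.java` entry points at a reformatted blob while `notes.txt` still points at its original content. In a real rewrite the blob contents would have to be read from and written back to the repository, which this model leaves out.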

Regards,

Rewrite a branch with merges to delete a file

This is more a question about git-filter-repo capabilities.

I recently found myself having to clean a branch to remove a file. This branch was never merged into master, but master was merged in it several times after the faulty commit.

I have been looking for a way to rewrite this branch history, replaying the same merges, but without rewriting master history. I think this could be summarized as rewriting the "right hand side of the branch history" (or left hand side?).

It turns out that it was more a task for git rebase --rebase-merges. However, I couldn’t find a way to ask git to limit itself to the "right hand side", so I had to play with the interactive mode, writing the operations by myself.

I think this could be automated, and I wanted to know if I missed something in your tool.

o_o_o_o_o_o_o_o_o_o_o_o_o_o_o_o_o_o_o_o_o_o_master
       \     \           \                 \                               
        o_o_o_M_o_x_o_o_o_M_o_o_o_o_o_o_o_o_M_feature
                  ^
                  faulty commit

Thanks a lot

PS: I am not sure if it relates to #39. But I don’t think so.

File renaming caused colliding pathnames

Hi,

I tried this with git-filter-repo v2.25:

$ git clone https://github.com/dlang/dmd.git
$ cd dmd
$ git-filter-repo --subdirectory-filter test
Parsed 22266 commitsFile renaming caused colliding pathnames!
  Commit: b'4473071207c78d691f7ed2ec4713172663f3bea8'
  Filename: b'README.md'
fatal: stream ends early
fast-import: dumping crash report to .git/fast_import_crash_1734

I've then tried to work around it; removing the trailing slash passes a --dry-run (no errors during commits parsing), but then yields

$ git-filter-repo --path test/ --path-rename test/:
same as above
$ git-filter-repo --path test --path-rename test:
Parsed 333 commitsfatal: Empty path component found in input
fast-import: dumping crash report to .git/fast_import_crash_2000
Traceback (most recent call last):
...
BrokenPipeError: [Errno 32] Broken pipe

A similar git filter-branch cmdline I'm trying to replace works:

git filter-branch --tag-name-filter cat --prune-empty --subdirectory-filter test -- --all

using git filter-repo to migrate out of LFS while maintaining history

I have a somewhat large repo (20 GB, 20k commits, from a video game project), that was using LFS for big assets.
I'd like to reuse the engine part for another project, while removing the game-specific part and not carry that history (especially the assets). I also want to ditch LFS (which proved unsatisfying for various reasons).

I figure using filter-repo to remove game-specific directories will be trivial, and very few lfs objects should be left over after that, so perhaps I could also use filter-repo to reinject those objects as non-lfs ones? Any pointers on how to go about it?

I know lfs has a "migrate export" command but from what I understand it will only inject the latest version; I'd prefer to preserve the whole history, so I could if necessary rollback to any point in time and have the actual files instead of dummy pointers.

I understand filter-repo is still somewhat fresh, as I can't find that much info about it (btw I apologize if this is not the proper place to ask for help).
BTW, there are quite a few open issues in git-lfs along the lines of "OMG I want out, how do I untangle myself from this thing now??!", and there doesn't seem to be a clear consensus 😄 ! (That might also be because most people aren't ready to rewrite history, but I am.)

Upload to PyPI

Hi. Would you consider uploading this to PyPI?

Thanks.

Wrong commit branch refs from git fast export?

Hello, I'm currently working on creating a separate repo from a subfolder in my main repo,
but I got some unexpected 'commit.branch' references while using your script.

Let's say I have a simple repo with these commits on the master branch:

 A--B--C--D--E 
        \
        TAG_XXX

So, if I run git-filter-repo on the repo with this callback:

git_filter_repo.py --force --commit-callback '
  print("[%s] %s" % (str(commit.branch), str(commit.message)))
'

I would expect an output like:
[b'refs/heads/master'] b'rev1: Commit A\n'
[b'refs/heads/master'] b'rev2: Commit B\n'
[b'refs/tags/TAG_XXX'] b'rev3: Commit C\n'
[b'refs/heads/master'] b'rev4: Commit D\n'
[b'refs/heads/master'] b'rev5: Commit E\n'

But I get this output:
[b'refs/tags/TAG_XXX'] b'rev1: Commit A\n'
[b'refs/tags/TAG_XXX'] b'rev2: Commit B\n'
[b'refs/tags/TAG_XXX'] b'rev3: Commit C\n'
[b'refs/heads/master'] b'rev4: Commit D\n'
[b'refs/heads/master'] b'rev5: Commit E\n'

Basically, it looks like all 'commit.branch' references before any tag are set on the tag reference itself (see commits A,B).

If I disable the tag creation, I get the refs I expect:
[b'refs/heads/master'] b'rev1: Commit A\n'
[b'refs/heads/master'] b'rev2: Commit B\n'
[b'refs/heads/master'] b'rev3: Commit C\n'
[b'refs/heads/master'] b'rev4: Commit D\n'
[b'refs/heads/master'] b'rev5: Commit E\n'

I would expect commits A and B to maintain the same ref regardless of whether the tag is created.

I also checked the output of 'git fast-export', and it looks like the references I get are copied directly from its 'commit' rows.

So I would like to know whether these references are OK and I'm just missing something.

Steps to reproduce the problem:

  • Unpack the attached zip file
  • To rebuild the sample repo with or without tags, just edit the ENABLE_TAGS line in 'build-sample-repo' script and then run it
  • To print commit.branch and commit.message, just run 'print-commit-branch' script
  • To get git fast-export output, just run 'get-filter-output' script
    (git fast-export output is already provided in 'output_with_tags.txt' and 'output_no_tags.txt')

test_refs.zip

Regards,
Andrea

Fresh clone not recognized as such

While trying to prepare test case for #37 I got myself into another problem:
After cloning https://github.com/zyzyzyryxy/git-filter-repo-not-clean I cannot run filter-repo without --force:

> git filter-repo --subdirectory-filter tests
Aborting: Refusing to overwrite repo history since this does not
look like a fresh clone.
  (expected freshly packed repo)
To override, use --force.

git version 2.24.0.windows.2

OS Name: Microsoft Windows 10 Pro
OS Version: 10.0.18362 N/A Build 18362

Multiple --path and --paths-from-file not doing the same

Hi

It seems to me that running git filter-repo with multiple --path options and with --paths-from-file does not do the same thing.

I generated a list of --path options, with full paths to the files I wanted to keep. When running, the resulting repo was completely empty, no files at all.

When I instead wrote the same paths to a file, and used this file with --paths-from-file, it worked as expected.

By the way, thank you for this awesome tool! After a day of struggling with (and waiting for) git filter-branch, this did the job in just seconds.

No output, error, or history rewriting in Windows

I understand other users have gotten git-filter-repo to work on Windows. I've proven unable to, however.

I've modified the git-filter-repo's #! line to point to my Python directory (via Anaconda) and executable (python, not python3 as noted elsewhere). Given that modification, I no longer get an error when executing git filter-repo commands—but neither do I get any output, nor see my git history rewritten. I've tried both --path and --analyze arguments.

I fully realize this isn't much to go on, so you may not be able to provide much direction. Obviously, if I had an error message, that would provide some pointers on what's failing. Are there log files somewhere I might look for to help provide more details?

FWIW: I'm running on:

  • git: v2.16.1
  • git-filter-repo: v2.24.0
  • python: 3.6.1

Unexpected results on repo with filename case changes

Hello.

I've encountered strange results after running filter-repo on a repo that contains commits with filename case changes.

I'm using latest stable releases of git, python and filter-repo on Windows 10 1909.

$ git --version
git version 2.24.0.windows.2

$ python --version
Python 3.8.0

$ python git_filter_repo.py --version
f3e8e0f8a87c

Steps to reproduce:

  1. Initialize new repo.
git init test
cd test
  1. Create test.txt and trash.txt files.
echo test > test.txt
echo trash > trash.txt
git add .
git commit -m "add test.txt and trash.txt"
  1. Rename test.txt to Test.txt. This has to be done via multiple renames and amending, because I don't know another way to do it, at least on Windows.
git mv test.txt Test1.txt
git commit -m "rename test.txt -> Test.txt"
git mv Test1.txt Test.txt
git commit --amend --no-edit
  1. Remove trash.txt file.
git rm trash.txt
git commit -m "remove trash.txt"
  1. Modify Test.txt.
echo more >> Test.txt
git add .
git commit -m "add more to Test.txt"
  1. Rename Test.txt to test.txt back.
git mv Test.txt test2.txt
git commit -m "rename Test.txt -> test.txt"
git mv test2.txt test.txt
git commit --amend --no-edit
  1. Modify test.txt again.
echo new >> test.txt
git add .
git commit -m "add new to test.txt"
  1. Show the history.
git log --oneline -p

It should be similar to this:

f7e0103 (HEAD -> master) add new to test.txt
diff --git a/test.txt b/test.txt
index 9fbef1c..eae8904 100644
--- a/test.txt
+++ b/test.txt
@@ -1,2 +1,3 @@
 test
 more
+new

715bc1e rename Test.txt -> test.txt
diff --git a/Test.txt b/test.txt
similarity index 100%
rename from Test.txt
rename to test.txt

799cdea add more to Test.txt
diff --git a/Test.txt b/Test.txt
index 9daeafb..9fbef1c 100644
--- a/Test.txt
+++ b/Test.txt
@@ -1 +1,2 @@
 test
+more

695b3d9 remove trash.txt
diff --git a/trash.txt b/trash.txt
deleted file mode 100644
index fad67c0..0000000
--- a/trash.txt
+++ /dev/null
@@ -1 +0,0 @@
-trash

f375c0e rename test.txt -> Test.txt
diff --git a/test.txt b/Test.txt
similarity index 100%
rename from test.txt
rename to Test.txt

be4133d add test.txt and trash.txt
diff --git a/test.txt b/test.txt
new file mode 100644
index 0000000..9daeafb
--- /dev/null
+++ b/test.txt
@@ -0,0 +1 @@
+test
diff --git a/trash.txt b/trash.txt
new file mode 100644
index 0000000..fad67c0
--- /dev/null
+++ b/trash.txt
@@ -0,0 +1 @@
+trash
  1. Show the tree's id and contents of HEAD commit.
$ git log --format=raw -1 | grep tree | cut -d ' ' -f 2
b90b63e43b3accb1add5108e94f8f394bf4f4146

$ git ls-tree $(git log --format=raw -1 | grep tree | cut -d ' ' -f 2)
100644 blob eae8904154c5ee09ed95ad74668597f83b8059fc    test.txt
  1. Now run filter-repo to completely remove trash.txt from history.
python git_filter_repo.py --path trash.txt --invert-paths --force

Expected results:

  • trash.txt is completely gone from history;
  • tree object of current HEAD is the same as it was before running filter-repo;
  • test.txt is in tree of HEAD and its history contains single lines additions in appropriate commits.

Actual results:

  • trash.txt is completely gone from history: everything is OK here;
  • tree object of current HEAD differs: see below;
  • test.txt is not in tree of HEAD, it is replaced with Test.txt and its history contains multi-line additions: have a look at the "rename test.txt -> Test.txt" and "rename Test.txt -> test.txt" commit diffs below.

Here are git log and git ls-tree outputs on modified repo:

a73a788 (HEAD -> master) add new to test.txt
diff --git a/Test.txt b/Test.txt
index 9fbef1c..eae8904 100644
--- a/Test.txt
+++ b/Test.txt
@@ -1,2 +1,3 @@
 test
 more
+new

f1af009 rename Test.txt -> test.txt
003e481 add more to Test.txt
diff --git a/Test.txt b/Test.txt
new file mode 100644
index 0000000..9fbef1c
--- /dev/null
+++ b/Test.txt
@@ -0,0 +1,2 @@
+test
+more

2369434 rename test.txt -> Test.txt
diff --git a/test.txt b/test.txt
deleted file mode 100644
index 9daeafb..0000000
--- a/test.txt
+++ /dev/null
@@ -1 +0,0 @@
-test

a5ec46d add test.txt and trash.txt
diff --git a/test.txt b/test.txt
new file mode 100644
index 0000000..9daeafb
--- /dev/null
+++ b/test.txt
@@ -0,0 +1 @@
+test
$ git log --format=raw -1 | grep tree | cut -d ' ' -f 2
b90b63e43b3accb1add5108e94f8f394bf4f4146

$ git ls-tree $(git log --format=raw -1 | grep tree | cut -d ' ' -f 2)
100644 blob eae8904154c5ee09ed95ad74668597f83b8059fc    Test.txt

I think this behavior of filter-repo is not intended. Or am I missing something?
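
When reproducing a report like this, it can help to first list which paths in the history differ only by letter case, since those are the ones a case-insensitive filesystem cannot tell apart. This is a minimal sketch; `case_only_collisions` is a hypothetical helper, and the input is assumed to be a list of paths collected from the history by whatever means is convenient (for example from `git log --name-only` output).

```python
from collections import defaultdict


def case_only_collisions(paths):
    """Group paths that are equal after case folding but differ as written."""
    groups = defaultdict(set)
    for path in paths:
        groups[path.casefold()].add(path)
    # Keep only groups with more than one spelling, in a stable order.
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


history_paths = ["test.txt", "trash.txt", "Test.txt", "test.txt"]
print(case_only_collisions(history_paths))
```

For the reproduction steps above, the only group reported would be the two spellings of test.txt.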

Thanks in advance.

Remove filtered paths from original repo

Still working on re-structuring some of our repositories.. what is the best way to effectively remove the filtered portion from the original repo?

Considered this, but i've not tried it at all:-

git clone <path to clone> my-fresh-repo
git filter-repo --path 'Some.Path/' --force
git remote add origin <path to new origin>
git push -u origin --all

Then, to tidy the original..

git clone <path to clone> my-fresh-repo-inverse
git filter-repo --path 'Some.Path/' --invert-paths --force
git pull <original remote> master
git remote add origin <original remote>
git push -u origin master

I am concerned that rewriting the history is going to do some very unpleasant things to the original repo though, especially as people are still working on it. I am looking for a clean split, as if it were never there, and kind of thinking out loud. Recommendations welcome, ty!

Edit: I tried it on a test repo, and all my history got duplicated... so i guess there is a 'better' way? Maybe I'll simply delete the split paths by hand, and merge it as a 'normal' change.

filter-repo chokes on parsing a shell script

Got an error while running filter-repo, it seemed suspicious so I ran it again with GIT_TRACE=1, and yup it stopped while filtering a version-controlled shell script (here it's a script that's meant to inject options in git hooks):

23:59:09.987141 trace git-lfs: filepathfilter: accepting "Scripts/build on checkout/post-checkout-OFF"
Error reading line 0: #!/bin/sh

Unfortunately I don't have a callstack so not sure where to go about fixing that...

Moving files from super-repo to submodule

Hi,

I have a scenario where I've used this tool to separate a part of my project into a separate repo that's now included in my main project repo as a submodule.

Now I sometimes need to move files from the super-repo into the submodule, including the history related to these files, along with all branch information and tags related to the files' history.

What's the best approach to do this using this tool?

Right now my approach is the following:

  1. Filter the files i want to move.
  2. Add prefix to all tags.
  3. Add the submodule's repo as a remote repo.
  4. Sync all branches with the submodule remote repo.
  5. Push all tags to the remote repo (which leads to some duplicated tags).

Thanks in advance.

Question: migrating binaries to Git LFS

This library looks great.

I was wondering if you had any advice on how to use it to migrate files to Git LFS and clean the history to reduce the repository size.

A specific use case would be a Unity project that has large scenes as binary files.

Issues is probably the wrong place to put this, so feel free to delete or ignore it.

Improve handling for --prune-degenerate and projects with a strict no-fast-forwards policy

While working on the prune-empty and prune-degenerate stuff, I realized that someone who had a strict no-fast-forwards policy could potentially be unhappy with my pruning of degenerate & empty commits (since I pruned commits merging something with its own ancestor and providing no special changes, naturally that would include fast-forward commits). However, I of course also had the stipulation that I would only remove commits that became degenerate. That seemed like a good enough compromise to handle nearly all cases. However, a Google search on "filter-repo" turned up https://www.reddit.com/r/git/comments/dza81v/removing_folders_from_history_while_preserving/, and it's not that long after 2.24.0 was released, so I should probably revisit.

Just in case that link goes dead, let's say there are two independent commits, B & C, which both build on A. Someone then does 'git merge --no-ff B && git merge --no-ff C'. Later, someone comes along and runs filter-repo, specifying to remove the only path(s) modified in B. Then B of course gets pruned, the commit merging B into A gets pruned, and the commit that merged C in started as a real merge commit but became degenerate (it merges C with its ancestor A now) so it gets pruned. For folks that want a --no-ff workflow, that's bad.

Not sure if I want to make this an option, or if I just want to modify the "become degenerate" to ignore the non-fast-forward cases (i.e. a merge commit is degenerate if it merges something with itself, or it merges a second parent with an ancestor of the second parent; the merge commit would not be considered degenerate if it merged the first parent with an ancestor of the first parent).
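
The A/B/C scenario can be modeled with a tiny commit graph to see which parent ends up being an ancestor of which after filtering. This is only an illustrative sketch: the graph is a plain dict from commit name to parent list, `ancestors` and `classify_merge` are hypothetical helpers, and the code deliberately reports the shape of a merge without deciding whether it should be pruned.

```python
def ancestors(parents, commit):
    """Return every commit reachable from `commit` through parent links."""
    seen = set()
    stack = list(parents[commit])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def classify_merge(parents, merge):
    """Describe how the two parents of a merge commit relate to each other."""
    first, second = parents[merge]
    if first == second:
        return "merges a commit with itself"
    if first in ancestors(parents, second):
        return "first parent is an ancestor of second parent"
    if second in ancestors(parents, first):
        return "second parent is an ancestor of first parent"
    return "real merge"


# Before filtering: M1 = merge --no-ff B, M2 = merge --no-ff C.
original = {"A": [], "B": ["A"], "C": ["A"], "M1": ["A", "B"], "M2": ["M1", "C"]}
# After B and M1 are pruned, M2's first parent is rewritten to A.
filtered = {"A": [], "C": ["A"], "M2": ["A", "C"]}

print(classify_merge(original, "M2"))
print(classify_merge(filtered, "M2"))
```

M2 starts out as a real merge and, after filtering, has the same shape that M1 had from the beginning as a `--no-ff` merge. A policy for `--no-ff` workflows would need to treat that shape separately from the other cases.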

paths-from-file keeping unrelated files

I compiled a list with 568 file names to use with --paths-from-file. After using git filter-repo I ended up with a repo with files that were not on the list.

For example, all the files in the list are under Assets/AL/Common/Scripts/ but now I have files under Assets/AL/LevelDesign/, which should have been stripped out.

This is the command I'm running:
git filter-repo --paths-from-file /tmp/git_rewrite_temp/PRESERVE --force --replace-refs delete-no-add

Unfortunately, I can't provide the repo, so it's hard to create a repro. Can you please advise on how to debug? Thank you.

Delete file history but keep the latest commit

Hello,

I was using BFG to prune the history of large files in my repo. It has an option to delete the content of a file from earlier commits, while preserving the last commit:

By default the BFG doesn't modify the contents of your latest commit on your master (or 'HEAD') branch, even though it will clean all the commits before it.

How can I achieve this with git-filter-repo?

When I use --strip-blobs-with-ids option, it removes files completely.

Please help.

TypeError: a bytes-like object is required, not 'str'

Hi,

I'm getting an error with the bfg-ish script on Windows. The Python version is 3.7.1.

Traceback (most recent call last):
  File "./bfg-ish", line 437, in <module>
    bfg.run()
  File "./bfg-ish", line 370, in run
    preserve_refs = self.get_preservation_info(bfg_args.preserve_ref_tips)
  File "./bfg-ish", line 314, in get_preservation_info
    output = subprocess.check_output(['git', 'rev-parse'] + ref_trees)
  File "C:\Tools\Anaconda\lib\subprocess.py", line 389, in check_output
    **kwargs).stdout
  File "C:\Tools\Anaconda\lib\subprocess.py", line 466, in run
    with Popen(*popenargs, **kwargs) as process:
  File "C:\Tools\Anaconda\lib\subprocess.py", line 769, in __init__
    restore_signals, start_new_session)
  File "C:\Tools\Anaconda\lib\subprocess.py", line 1113, in _execute_child
    args = list2cmdline(args)
  File "C:\Tools\Anaconda\lib\subprocess.py", line 524, in list2cmdline
    needquote = (" " in arg) or ("\t" in arg) or not arg
TypeError: a bytes-like object is required, not 'str'

This is line 314:

output = subprocess.check_output(['git', 'rev-parse'] + ref_trees)

It works when I decode byte-string to string:

output = subprocess.check_output(['git', 'rev-parse'] + [r.decode("utf-8") for r in ref_trees])

But then it fails in other places:

Traceback (most recent call last):
  File "./git-cleaner", line 438, in <module>
    bfg.run()
  File "./git-cleaner", line 425, in run
    self.revert_tree_changes(preserve_refs)
  File "./git-cleaner", line 329, in revert_tree_changes
    output = subprocess.check_output('git cat-file -p'.split()+[ref])
  File "C:\Tools\Anaconda\lib\subprocess.py", line 389, in check_output
    **kwargs).stdout
  File "C:\Tools\Anaconda\lib\subprocess.py", line 466, in run
    with Popen(*popenargs, **kwargs) as process:
  File "C:\Tools\Anaconda\lib\subprocess.py", line 769, in __init__
    restore_signals, start_new_session)
  File "C:\Tools\Anaconda\lib\subprocess.py", line 1113, in _execute_child
    args = list2cmdline(args)
  File "C:\Tools\Anaconda\lib\subprocess.py", line 524, in list2cmdline
    needquote = (" " in arg) or ("\t" in arg) or not arg
TypeError: a bytes-like object is required, not 'str'
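
Both tracebacks end in `list2cmdline`, which is handed a bytes argument where it expects text. Rather than decoding at each call site, one option is to convert the whole argument list once, right before it is passed to `subprocess`. This is a minimal sketch: `to_text_args` is a hypothetical helper, the ref name is made up for illustration, and whether this is the right fix for the script is for its maintainers to judge.

```python
import os


def to_text_args(args):
    """Return a copy of args with any bytes elements decoded to str."""
    # os.fsdecode leaves str unchanged and decodes bytes using the
    # filesystem encoding, so mixed lists are handled uniformly.
    return [os.fsdecode(arg) for arg in args]


ref_trees = [b"refs/heads/master^{tree}"]  # illustrative value only
print(to_text_args(["git", "rev-parse"] + ref_trees))
```

The converted list could then be passed to `subprocess.check_output` in both places that currently fail.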

Question about IDE on Windows

Hello,

I'm trying to edit the script on Windows. Tried VS Code and PyCharm. Both have problems with recognizing git-filter-repo import in splice_repos.py - either they don't find the file, or they don't recognize members like fr.Blob

Any idea why?

SyntaxError: invalid syntax in git-filter-repo in line 3146 **extra items}

Hi newren,
I am getting the following error when trying to execute the command $ git filter-repo
[Screenshot: Python error]

I have Python33 installed on my office system. Also, I am new to git as well as Python (installed it for the very first time), so apologies if this is a trivial issue.

Thanks in advance

--commit-callback fails when trying to add files - Git version 2.24

Hi,
Trying to add a file to the root of the branch fails with the following error:

git filter-repo --force --commit-callback "if not commit.parents: commit.file_changes.append(FileChange(b'M', 'C:\MDC\MDC.7z', $(git hash-object -w 'C:\MDC\MDC.7z'), 100644))"
Traceback (most recent call last):
  File "C:/Program Files/Git/mingw64/libexec/git-core\git-filter-repo", line 3839, in <module>
    filter = RepoFilter(args)
  File "C:/Program Files/Git/mingw64/libexec/git-core\git-filter-repo", line 2661, in __init__
    self._handle_arg_callbacks()
  File "C:/Program Files/Git/mingw64/libexec/git-core\git-filter-repo", line 2763, in _handle_arg_callbacks
    handle('commit')
  File "C:/Program Files/Git/mingw64/libexec/git-core\git-filter-repo", line 2756, in handle
    setattr(self, callback_field, make_callback(type, code_string))
  File "C:/Program Files/Git/mingw64/libexec/git-core\git-filter-repo", line 2741, in make_callback
    exec('def callback({}, _do_not_use_this_var = None):\n'.format(argname)+
  File "<string>", line 2
    if not commit.parents: commit.file_changes.append(FileChange(b'M', 'C:\MDC\MDC.7z', 3d5fb68077a1d627a7ec3b18f335713c4262fbf0, 100644))
                                                                                         ^
SyntaxError: invalid syntax

Any advice would be great
Tried on 3 different repos
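
The caret in the traceback points at the blob id, which was substituted into the callback as a bare token (`3d5fb68...`) rather than a quoted value, so Python cannot parse it. The sketch below only checks whether a callback body parses; it does not run git-filter-repo. `callback_body` and `compiles` are hypothetical helpers, and passing every `FileChange` argument as bytes in the order type, filename, id, mode is an assumption to check against the documentation. The filename is shortened to `MDC.7z` here on the further assumption that a path inside the repository is expected.

```python
def callback_body(filename, blob_id, mode="100644"):
    """Build callback source that quotes every FileChange argument as bytes."""
    return (
        "if not commit.parents: commit.file_changes.append("
        "FileChange(b'M', b'{}', b'{}', b'{}'))".format(filename, blob_id, mode)
    )


def compiles(source):
    """Return True if the source parses as Python, False on SyntaxError."""
    try:
        compile(source, "<callback>", "exec")
        return True
    except SyntaxError:
        return False


blob_id = "3d5fb68077a1d627a7ec3b18f335713c4262fbf0"  # id from the traceback
# Shaped like the failing command: unquoted id, integer mode.
broken = ("if not commit.parents: commit.file_changes.append("
          "FileChange(b'M', 'MDC.7z', {}, 100644))".format(blob_id))
fixed = callback_body("MDC.7z", blob_id)
print(compiles(broken), compiles(fixed))
```

On the command line, the equivalent change would be to wrap the `$(git hash-object -w ...)` substitution in `b'...'` inside the callback string.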

`Makefile` issues

I found two issues with the Makefile:

  • It fails if the docs branch has not previously been checked out locally. This can be easily fixed by using origin/docs, but I can imagine that having some surprising behavior for the maintainer when actually trying to modify the docs. git branch -u origin/docs docs should ensure the branch exists locally without actually changing the working directory, but for some reason that isn't working on my machine so I'm hesitant to open a PR with that change.
  • The mkdir -p commands for the documentation directory don't actually create the leaf dirs, man and html. This is a pretty simple fix.

If you're okay with using git show origin/docs..., or if I figure out why git branch -u isn't working on my machine, I can submit a simple PR fixing these issues.

Considerations for multiple branches

Given the scenario where there are branches master and develop, as well as a bunch of feature/cool-feature-1 remotes, I can see that when running a command such as:

git filter-repo --path folder1/ --path folder2/ --path-rename 'folder1':'src' --path-rename 'folder2':'src'

the filtering is done across all branches, regardless of whether I have actively checked them out or not. They all end up on my machine and it works really nicely. You can then push them back to a (new) remote and carry on with your day.

Is there a performance cost to this? A repo I am working with has hundreds of remote branches in various states of decay. Is there a way I can exclude them?

Is there a way to define which branches to consider? Either explicitly, or maybe 'only the ones that I have checked out'?

preserve git notes

Given a simple "remove this file from repo" operation, the git notes get discarded:

$ git filter-repo --invert-paths --path-glob 'Auth*.php'

Adding submodules to each commit

Hello, I'm looking to migrate old svn repos that heavily use svn:externals to Git. Since there is no official solution to this problem, we are looking for other solutions. We came up with the following idea: importing the svn repo to git with subgit (www.subgit.com). This tool has an option to add a file .gitsvnextmodules to each commit, which includes all svn:externals info such as path and revision. We now have the idea to loop through this repo and change each commit to add submodules for each entry in the .gitsvnextmodules file. This would end up in a Git repo equivalent to the svn repo.

Would this be possible with your tool? Is there perhaps another way of solving this particular problem?

Thank you in advance for your answer!
Best Regards,
Oliver

Expected runtime on large repos

Hi there,
this is not a bug but I'm wondering what runtime to expect from git-filter-repo. It claims to be much faster than filter-branch but I have used neither so don't know what to expect.

I'm trying to remove empty commits from github.com/freebsd/freebsd, of which it has many, as they were created by the cvs2svn conversion that was run years ago. The packed repo is about 1.6GB and has about 260k commits.

I've let git-filter-repo --analyze run overnight and it took a few hours to get to this:

$ git-filter-repo --analyze
Processed 5222675 blob sizes
Processed 250309 commits

And now it's stuck there. Python is spinning at 100% but I don't know how to poke inside it to see what it is doing. I could imagine there's some inefficient data structure that doesn't scale well to that many commits (or is git-filter-repo known to handle repos this large without problems?)

The process only uses up to 1GB, so I fear it might not have all the metadata in RAM? Is it also possible to use multiple cores maybe to speed up some processing?

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
70172 git_conv 1 103 0 1202M 1138M CPU0 0 918:03 100.00% python3.7

Thanks!

Suggestion to change the default replacement text

First off, thanks for this amazing tool. I can't really say how much I appreciate your work on this!

Now to the issue's main point: I'm suggesting to change the default replacement text ***REMOVED***, as * is more often than not treated as a special character. For example, in Symfony's config files, * denotes a reference to another variable, thus simply replacing a credential with ***REMOVED*** would potentially break the app.

I know that we can change the replacement text into something safer, e.g. ___REMOVED___, by appending ==>___REMOVED___ to each line in the input expression file, but I'd argue that it's rather clunky and a better solution would be either:

  • Changing the default text from the library level altogether, or
  • Making this a CLI option e.g. git-filter-repo --replace-text expressions.txt --replacement=___REMOVED___
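
Until something like that exists, the per-line workaround described above can be automated. This is a minimal sketch, assuming the expressions file has one expression per line and that `==>` separates an expression from its replacement, as described in this post; `add_default_replacement` is a hypothetical helper name.

```python
def add_default_replacement(lines, replacement="___REMOVED___"):
    """Append '==><replacement>' to every expression that lacks one."""
    result = []
    for line in lines:
        expression = line.rstrip("\n")
        # Leave blank lines and lines with an explicit replacement alone.
        if expression and "==>" not in expression:
            expression += "==>" + replacement
        result.append(expression)
    return result


expressions = ["s3cret\n", "token==>XXX\n", "\n"]
print("\n".join(add_default_replacement(expressions)))
```

The rewritten lines can then be saved to a new file and passed to `--replace-text` in place of the original.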

What do you think?
