git-sizer's Introduction

Happy Git repositories are all alike; every unhappy Git repository is unhappy in its own way. —Linus Tolstoy

git-sizer

Is your Git repository bursting at the seams?

git-sizer computes various size metrics for a local Git repository, flagging those that might cause you problems or inconvenience. For example:

  • Is the repository too big overall? Ideally, Git repositories should be under 1 GiB, and (without special handling) they start to get unwieldy over 5 GiB. Big repositories take a long time to clone and repack, and take a lot of disk space. Suggestions:

    • Avoid storing generated files (e.g., compiler output, JAR files) in Git. It would be better to regenerate them when necessary, or store them in a package registry or even a fileserver.

    • Avoid storing large media assets in Git. You might want to look into Git-LFS or git-annex, which allow you to version your media assets in Git while actually storing them outside of your repository.

    • Avoid storing file archives (e.g., ZIP files, tarballs) in Git, especially if compressed. Different versions of such files don't delta well against each other, so Git can't store them efficiently. It would be better to store the individual files in your repository, or store the archive elsewhere.

  • Does the repository have too many references (branches and/or tags)? They all have to be transferred to the client for every fetch, even if your clone is up-to-date. Try to limit them to a few tens of thousands at most. Suggestions:

    • Delete unneeded tags and branches.

    • Avoid pushing your "remote-tracking" branches to a shared repository.

    • Consider using "git notes" rather than tags to attach auxiliary information to commits (for example, CI build results).

    • Perhaps store some of your rarely-needed tags and branches in a separate fork of your repository that is not fetched from by normal developers.

  • Does the repository include too many objects? The more objects, the longer it takes for Git to traverse the repository's history, for example when garbage-collecting. Suggestions:

    • Think about whether you are storing very many tiny files that could easily be collected into a few bigger files.

    • Consider breaking your project up into multiple subprojects.

  • Does the repository include gigantic blobs (files)? Git works best with small- to medium-sized files. It's OK to have a few files in the megabyte range, but they should generally be the exception. Suggestions:

    • Consider using Git-LFS for storing your large files, especially those (e.g., media assets) that don't diff and merge usefully.

    • See also the section "Is the repository too big overall?"

  • Does the repository include many, many versions of large text files, each one slightly changed from the one before? Such files delta very well, so they might not cause your repository to grow alarmingly. But it is expensive for Git to reconstruct the full files and to diff them, which it needs to do internally for many operations. Suggestions:

    • Avoid storing log files and database dumps in Git.

    • Avoid storing giant data files (e.g., enormous XML files) in Git, especially if they are modified frequently. Consider using a database instead.

  • Does the repository include gigantic trees (directories)? Every time a file is modified, Git has to create a new copy of every tree (i.e., every directory in the path) leading to the file. Huge trees make this expensive. Moreover, it is very expensive to traverse through history that contains huge trees, for example for git blame. Suggestions:

    • Avoid creating directories with more than a couple of thousand entries each.

    • If you must store very many files, it is better to shard them into a hierarchy of multiple, smaller directories.

  • Does the repository have the same (or very similar) files repeated over and over again at different paths in a single commit? If so, the repository might have a reasonable overall size, but when you check it out it balloons into an enormous working copy. (Taken to an extreme, this is called a "git bomb"; see below.) Suggestions:

    • Perhaps you can achieve your goals more effectively by using tags and branches or a build-time configuration system.

  • Does the repository include absurdly long path names? That's probably not going to work well with other tools. One or two hundred characters should be enough, even if you're writing Java.

  • Are there other bizarre and questionable things in the repository?

    • Annotated tags pointing at one another in long chains?

    • Octopus merges with dozens of parents?

    • Commits with gigantic log messages?

git-sizer computes many size-related statistics about your repository that can help reveal all of the problems described above. These practices are not wrong per se, but the more that you stretch Git beyond its sweet spot, the less you will be able to enjoy Git's legendary speed and performance. Especially if your Git repository statistics seem out of proportion to your project size, you might be able to make your life easier by adjusting how you use Git.
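The earlier suggestion to shard very many files into a hierarchy of smaller directories can be sketched as follows. This is only an illustration and not something git-sizer does: the function name `sharded_path`, the choice of SHA-1 over the file name, and the two-level, two-character layout are all assumptions made for the example. (Git itself uses a similar one-level scheme for loose objects under `.git/objects`.)

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(name, levels=2, width=2):
    """Return a relative path that spreads files over nested directories.

    Hypothetical helper: the directory names are taken from the leading
    hex digits of the SHA-1 of the file name, so the same name always
    lands in the same shard.
    """
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return str(PurePosixPath(*parts, name))


# With two levels of two hex digits there are 256 * 256 leaf directories,
# so even millions of files leave each directory with few entries.
path = sharded_path("report-000123.json")
```

With the defaults, each directory has at most 256 subdirectories, which stays well below the "couple of thousand entries" guideline mentioned in the list.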

Getting started

  1. Make sure that you have the Git command-line client installed, version >= 2.6. NOTE: git-sizer invokes git commands to examine the contents of your repository, so it is required that the git command be in your PATH when you run git-sizer.

  2. Install git-sizer. Either:

    a. Install a released version of git-sizer (recommended):

    1. Go to the releases page and download the ZIP file corresponding to your platform.
    2. Unzip the file.
    3. Move the executable file (git-sizer or git-sizer.exe) into your PATH.

    b. Build and install from source. See the instructions in docs/BUILDING.md.

  3. Change to the directory containing a full, non-shallow clone of the Git repository that you'd like to analyze. Then run

    git-sizer [<option>...]
    

    No options are required. You can learn about available options by typing git-sizer -h or by reading on.

Pro tip: If you add git-sizer to your PATH, then you can run it by typing either git-sizer or git sizer. In the latter case, it is found and run for you by Git, and you can add extra Git options between the two words, like git -C /path/to/my/repo sizer. If you don't add git-sizer to your PATH, then of course you need to type its full path and filename to run it; e.g., /path/to/bin/git-sizer. In either case, the git executable must be in your PATH.

Usage

By default, git-sizer outputs its results in tabular format. For example, let's use it to analyze the Linux repository, using the --verbose option so that all statistics are output:

$ git-sizer --verbose
Processing blobs: 1652370
Processing trees: 3396199
Processing commits: 722647
Matching commits to trees: 722647
Processing annotated tags: 534
Processing references: 539
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Commits                    |           |                                |
|   * Count                    |   723 k   | *                              |
|   * Total size               |   525 MiB | **                             |
| * Trees                      |           |                                |
|   * Count                    |  3.40 M   | **                             |
|   * Total size               |  9.00 GiB | ****                           |
|   * Total tree entries       |   264 M   | *****                          |
| * Blobs                      |           |                                |
|   * Count                    |  1.65 M   | *                              |
|   * Total size               |  55.8 GiB | *****                          |
| * Annotated tags             |           |                                |
|   * Count                    |   534     |                                |
| * References                 |           |                                |
|   * Count                    |   539     |                                |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Commits                    |           |                                |
|   * Maximum size         [1] |  72.7 KiB | *                              |
|   * Maximum parents      [2] |    66     | ******                         |
| * Trees                      |           |                                |
|   * Maximum entries      [3] |  1.68 k   | *                              |
| * Blobs                      |           |                                |
|   * Maximum size         [4] |  13.5 MiB | *                              |
|                              |           |                                |
| History structure            |           |                                |
| * Maximum history depth      |   136 k   |                                |
| * Maximum tag depth      [5] |     1     |                                |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories  [6] |  4.38 k   | **                             |
| * Maximum path depth     [7] |    13     | *                              |
| * Maximum path length    [8] |   134 B   | *                              |
| * Number of files        [9] |  62.3 k   | *                              |
| * Total size of files    [9] |   747 MiB |                                |
| * Number of symlinks    [10] |    40     |                                |
| * Number of submodules       |     0     |                                |

[1]  91cc53b0c78596a73fa708cceb7313e7168bb146
[2]  2cde51fbd0f310c8a2c5f977e665c0ac3945b46d
[3]  4f86eed5893207aca2c2da86b35b38f2e1ec1fc8 (refs/heads/master:arch/arm/boot/dts)
[4]  a02b6794337286bc12c907c33d5d75537c240bd0 (refs/heads/master:drivers/gpu/drm/amd/include/asic_reg/vega10/NBIO/nbio_6_1_sh_mask.h)
[5]  5dc01c595e6c6ec9ccda4f6f69c131c0dd945f8c (refs/tags/v2.6.11)
[6]  1459754b9d9acc2ffac8525bed6691e15913c6e2 (589b754df3f37ca0a1f96fccde7f91c59266f38a^{tree})
[7]  78a269635e76ed927e17d7883f2d90313570fdbc (dae09011115133666e47c35673c0564b0a702db7^{tree})
[8]  ce5f2e31d3bdc1186041fdfd27a5ac96e728f2c5 (refs/heads/master^{tree})
[9]  532bdadc08402b7a72a4b45a2e02e5c710b7d626 (e9ef1fe312b533592e39cddc1327463c30b0ed8d^{tree})
[10] f29a5ea76884ac37e1197bef1941f62fda3f7b99 (f5308d1b83eba20e69df5e0926ba7257c8dd9074^{tree})

The output is a table showing the thing that was measured, its numerical value, and a rough indication of which values might be a cause for concern. In all cases, only objects that are reachable from references are included (i.e., not unreachable objects, nor objects that are reachable only from the reflogs).

The "Overall repository size" section includes repository-wide statistics about distinct objects, not including repetition. "Total size" is the sum of the sizes of the corresponding objects in their uncompressed form, measured in bytes. The overall uncompressed size of all objects is a good indication of how expensive commands like git gc --aggressive (and git repack [-f|-F] and git pack-objects --no-reuse-delta), git fsck, and git log [-G|-S] will be. The uncompressed size of trees and commits is a good indication of how expensive reachability traversals will be, including clones and fetches and git gc.

The "Biggest objects" section provides information about the biggest single objects of each type, anywhere in the history.

In the "History structure" section, "maximum history depth" is the longest chain of commits in the history, and "maximum tag depth" reports the longest chain of annotated tags that point at other annotated tags.
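To make "tag depth" concrete, here is a small sketch of how a chain of annotated tags could be measured. It is an assumption about the counting convention (a tag pointing directly at a commit has depth 1), not git-sizer's actual implementation; the function name and the object ids are hypothetical.

```python
def tag_depth(target, tag_targets):
    """Count how many annotated tags are peeled before reaching a non-tag.

    `tag_targets` maps the id of each annotated tag object to the id of
    the object it points at (hypothetical in-memory representation).
    """
    depth = 0
    current = target
    while current in tag_targets:
        depth += 1
        current = tag_targets[current]
    return depth


# Hypothetical objects: tagA -> tagB -> commit1
tags = {"tagA": "tagB", "tagB": "commit1"}
deepest = max(tag_depth(t, tags) for t in tags)
```

Here `deepest` is 2 because `tagA` points at another annotated tag. A repository whose tags all point straight at commits would report 1, as in the Linux example above.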

The "Biggest checkouts" section is about the sizes of commits as checked out into a working copy. "Maximum path depth" is the largest number of path components for files in the working copy, and "maximum path length" is the longest path in terms of bytes. "Total size of files" is the sum of all file sizes in the single biggest commit, including multiplicities if the same file appears multiple times.
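The checkout metrics described above can be illustrated with a short sketch. It assumes a checkout is given as a mapping from path to blob id plus a table of blob sizes; these data structures and function names are hypothetical and are not how git-sizer represents a commit internally.

```python
def path_metrics(paths):
    """Return (maximum path depth, maximum path length in bytes)."""
    max_depth = max((len(p.split("/")) for p in paths), default=0)
    # Length is measured in bytes, so non-ASCII characters count more than once.
    max_length = max((len(p.encode("utf-8")) for p in paths), default=0)
    return max_depth, max_length


def checkout_size(entries, blob_sizes):
    """Sum file sizes over every path, counting repeated blobs each time."""
    return sum(blob_sizes[blob] for blob in entries.values())


# Hypothetical checkout: the same blob appears at two different paths.
entries = {"a/logo.png": "blob1", "b/logo.png": "blob1", "README.md": "blob2"}
blob_sizes = {"blob1": 1000, "blob2": 200}

with_multiplicity = checkout_size(entries, blob_sizes)              # 2200
distinct_only = sum(blob_sizes[b] for b in set(entries.values()))   # 1200
```

The gap between the two totals is what distinguishes "Total size of files" in a checkout from the distinct-object totals in the "Overall repository size" section.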

The "Value" column displays counts, using units "k" (thousand), "M" (million), "G" (billion) etc., and sizes, using units "B" (bytes), "KiB" (1024 bytes), "MiB" (1024 KiB), etc. Note that if a value overflows its counter (which should only happen for malicious repositories), the corresponding value is displayed as ∞ in tabular form, or truncated to 2³²-1 or 2⁶⁴-1 (depending on the size of the counter) in JSON mode.
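As a rough illustration of the unit conventions, the following sketch formats counts with decimal prefixes and sizes with binary prefixes, keeping three significant digits. It is not git-sizer's own formatting code; the function names and the rounding rule are assumptions, and values that round up to the next power (for example 999.6) are not handled specially.

```python
COUNT_UNITS = ["", "k", "M", "G", "T"]
SIZE_UNITS = ["B", "KiB", "MiB", "GiB", "TiB"]


def humanize(value, units, base, overflowed=False):
    """Format `value` with about three significant digits and a unit."""
    if overflowed:
        return "∞"  # the tabular marker for an overflowed counter
    scaled = float(value)
    index = 0
    while scaled >= base and index < len(units) - 1:
        scaled /= base
        index += 1
    if index == 0:
        text = str(value)          # small values are shown exactly
    elif scaled < 10:
        text = f"{scaled:.2f}"
    elif scaled < 100:
        text = f"{scaled:.1f}"
    else:
        text = f"{scaled:.0f}"
    return f"{text} {units[index]}".strip()


def humanize_count(value, overflowed=False):
    return humanize(value, COUNT_UNITS, 1000, overflowed)


def humanize_size(value, overflowed=False):
    return humanize(value, SIZE_UNITS, 1024, overflowed)
```

Applied to the progress counters in the Linux example, `humanize_count(1652370)` gives `1.65 M` and `humanize_count(722647)` gives `723 k`, which agrees with the blob and commit counts in the table.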

The "Level of concern" column uses asterisks to indicate values that seem high compared with "typical" Git repositories. The more asterisks, the more inconvenience this aspect of your repository might be expected to cause. Exclamation points indicate values that are extremely high (i.e., equivalent to more than 30 asterisks).
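A minimal sketch of how such a column could be rendered, assuming a numeric level of concern has already been computed. How git-sizer derives that level from each statistic is not described here, so the function takes it as an input; the name `concern_marker` and the column width of 30 (taken from the sample tables) are assumptions.

```python
def concern_marker(level, width=30):
    """Render a level of concern as asterisks, or as exclamation points
    when it exceeds what fits in the column."""
    stars = int(level)
    if stars <= 0:
        return ""
    if stars > width:
        return "!" * width
    return "*" * stars


# Levels from the Linux example: "Maximum parents" shows six asterisks.
marker = concern_marker(6)   # "******"
```

Any level above 30 collapses to the same row of exclamation points, so the tabular output cannot distinguish between extreme values; the JSON output carries the exact numbers.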

The footnotes list the SHA-1s of the "biggest" objects referenced in the table, along with a more human-readable <commit>:<path> description of where that object is located in the repository's history. Given the name of a large object, you could, for example, type

git cat-file -p <commit>:<path>

at the command line to view the contents of the object. (Use --names=none if you'd rather omit these footnotes.)

By default, only statistics above a minimal level of concern are reported. Use --verbose (as above) to request that all statistics be output. Use --threshold=<value> to suppress the reporting of statistics below a specified level of concern. (<value> is interpreted as a numerical value corresponding to the number of asterisks.) Use --critical to report only statistics with a critical level of concern (equivalent to --threshold=30).
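The interaction of these options can be sketched as a single threshold that filters the statistics. This is an illustration only: the default threshold of 1, the precedence between the options, and the function names are assumptions rather than documented behavior, and the levels in the sample data are made up for the example.

```python
def effective_threshold(verbose=False, threshold=None, critical=False, default=1):
    """Combine the reporting options into one minimum level of concern.

    Assumed precedence: --critical, then an explicit --threshold,
    then --verbose, then the default.
    """
    if critical:
        return 30
    if threshold is not None:
        return threshold
    if verbose:
        return 0
    return default


def select(stats, minimum):
    """Keep only the (name, level) pairs at or above the minimum level."""
    return [(name, level) for name, level in stats if level >= minimum]


# Hypothetical statistics and levels.
stats = [("Maximum parents", 6), ("Number of submodules", 0), ("Number of files", 40)]

default_view = select(stats, effective_threshold())
critical_view = select(stats, effective_threshold(critical=True))
```

In this sketch the default view drops the statistic with level 0, and the critical view keeps only the one at or above 30.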

If you'd like the output in machine-readable format, including exact numbers, use the --json option. You can use --json-version=1 or --json-version=2 to choose between old and new style JSON output.
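For scripting, the JSON report can be reduced to a plain mapping of metric name to value. The sketch below assumes the version-2 layout quoted in one of the issue reports further down, where each metric is an object with `description` and `value` fields; the sample is abbreviated, and a real report contains more metrics and possibly more fields per metric. The helper names are hypothetical.

```python
import json
import subprocess

# Abbreviated sample in the version-2 layout (two metrics only).
SAMPLE = """
{
    "maxCheckoutBlobSize": {
        "description": "The maximum sum of file sizes in any checkout",
        "value": 3976127840
    },
    "uniqueBlobSize": {
        "description": "The total size of all distinct blob objects",
        "value": 3976127840
    }
}
"""


def metric_values(json_text):
    """Map each metric name to its exact numeric value."""
    report = json.loads(json_text)
    return {
        name: entry["value"]
        for name, entry in report.items()
        if isinstance(entry, dict) and "value" in entry
    }


def run_git_sizer(repo_path):
    """Run git-sizer in `repo_path` and return its JSON text.

    Not called here; requires git-sizer and git to be in PATH.
    """
    result = subprocess.run(
        ["git-sizer", "--json", "--json-version=2"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return result.stdout


values = metric_values(SAMPLE)
```

Because the JSON output carries exact numbers rather than rounded ones, it is the better input for comparisons between repositories or over time.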

To get a list of other options, run

git-sizer -h

The Linux repository is large by most standards. As you can see, it is pushing some of Git's limits. And indeed, some Git operations on the Linux repository (e.g., git fsck, git gc) do take a while. But due to its sane structure, none of its dimensions are wildly out of proportion to the size of the code base, so the kernel project is managed successfully using Git.

Here is the non-verbose output for one of the famous "git bomb" repositories:

$ git-sizer
[...]
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Biggest checkouts            |           |                                |
| * Number of directories  [1] |  1.11 G   | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |
| * Maximum path depth     [1] |    11     | *                              |
| * Number of files        [1] |     ∞     | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |
| * Total size of files    [2] |  83.8 GiB | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |

[1]  c1971b07ce6888558e2178a121804774c4201b17 (refs/heads/master^{tree})
[2]  d9513477b01825130c48c4bebed114c4b2d50401 (18ed56cbc5012117e24a603e7c072cf65d36d469^{tree})

This repository is mischievously constructed to have a pathological tree structure, with the same directories repeated over and over again. As a result, even though the entire repository is less than 20 kb in size, when checked out it would explode into over a billion directories containing over ten billion files. (git-sizer prints ∞ for the blob count because the true number has overflowed the 32-bit counter used for that field.)
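The arithmetic behind such a repository can be reproduced with a toy model. The sketch below is not the actual "git bomb" repository and not git-sizer's algorithm; it builds a hypothetical structure of nine directory levels with a fan-out of ten, where every entry at a level points at the same subtree, and counts what a checkout would contain by visiting each distinct tree only once.

```python
from functools import lru_cache

UINT32_MAX = 2**32 - 1


def build_bomb(levels, fanout):
    """Return (trees, root_id) for a toy repository.

    `trees` maps a tree id to its entries, each a (kind, child_id) pair.
    The bottom tree holds `fanout` references to one blob; every higher
    tree holds `fanout` references to the tree below it.
    """
    trees = {"t0": [("blob", "b")] * fanout}
    for i in range(1, levels + 1):
        trees[f"t{i}"] = [("tree", f"t{i - 1}")] * fanout
    return trees, f"t{levels}"


def checkout_counts(trees, root):
    """Return (directories, files) in a checkout, memoizing per tree."""
    @lru_cache(maxsize=None)
    def visit(tree_id):
        dirs = files = 0
        for kind, child in trees[tree_id]:
            if kind == "tree":
                sub_dirs, sub_files = visit(child)
                dirs += 1 + sub_dirs
                files += sub_files
            else:
                files += 1
        return dirs, files
    return visit(root)


trees, root = build_bomb(levels=9, fanout=10)
dirs, files = checkout_counts(trees, root)
# Only 10 distinct trees and 1 blob are stored, yet the checkout is huge.
saturated_files = min(files, UINT32_MAX)
```

In this model, ten trees and one blob expand to 1,111,111,110 directories and 10,000,000,000 files. The file count exceeds 2³²-1, which is why a 32-bit counter cannot hold it.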

Contributing

git-sizer is in regular use and is still under active development. If you would like to help out, please see CONTRIBUTING.md.

git-sizer's People

Contributors

cleancut, dbast, dscho, elhmn, ferhatelmas, mhagger, migue, owbone, pawamoy, pnsk, rajhawaldar, tgummerer, yarikoptic

git-sizer's Issues

Add new convenience options like `--no-branches`, `--stash`, etc.

All of the following can be accomplished using reference filtering (#75), but we could add new options to make them a tad more convenient (and these options would nicely complement --branches, --tags, and --remotes):

  • --no-branches — equivalent to --exclude=refs/heads
  • --no-tags — equivalent to --exclude=refs/tags
  • --no-remotes — equivalent to --exclude=refs/remotes
  • --notes / --no-notes — equivalent to --include=refs/notes / --exclude=refs/notes
  • --stash / --no-stash — equivalent to --include=refs/stash / --exclude=refs/stash

Bad CPU type in executable

I've installed the latest release, git-sizer-1.3.0-darwin-386.zip, on my new Mac running macOS Catalina. When I run it I get this error:

-bash: /usr/local/bin/git-sizer: Bad CPU type in executable

The same error also occurs on a brand-new one-day-old MacBook Pro as well.

Is this perhaps an issue where macOS Catalina is no longer compatible with 32-bit apps?

Installation

Hello, this may be a noob question, but I'm just starting with Git.

I'm having trouble understanding the installation process. In the guide, step 3 tells us to add the .exe to our PATH. What exactly does that mean?

Thanks and sorry for the noob question

Crash when trying to run

Running against a checkout of [email protected]:jpschewe/fll-sw.git.

>uname -a
Linux jon-laptop 4.13.0-36-generic #40-Ubuntu SMP Fri Feb 16 20:07:48 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

>git-sizer  
Processing blobs: 22223
Processing trees: 24428
Processing commits: 5194
Matching commits to trees: 5194
Processing annotated tags: 94
Processing references: 156
panic: 8 tag records remain!
goroutine 1 [running]:
github.com/github/git-sizer/sizes.(*Graph).HistorySize(0xc420001800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /home/mhagger/github/proj/git-sizer/git/.gopath/src/github.com/github/git-sizer/sizes/graph.go:398 +0x2c9
github.com/github/git-sizer/sizes.ScanRepositoryUsingGraph(0xc420010510, 0x5687c0, 0x2, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /home/mhagger/github/proj/git-sizer/git/.gopath/src/github.com/github/git-sizer/sizes/graph.go:335 +0x1df6
main.mainImplementation(0x0, 0x0)
        /home/mhagger/github/proj/git-sizer/git/.gopath/src/github.com/github/git-sizer/git-sizer.go:150 +0x8ef
main.main()
        /home/mhagger/github/proj/git-sizer/git/.gopath/src/github.com/github/git-sizer/git-sizer.go:46 +0x26

For large repos, why is maxCheckoutBlobSize.value == uniqueBlobSize.value ?

For large repos with binary blobs, I always found that maxCheckoutBlobSize.value == uniqueBlobSize.value. How is it that they're always equal for such repos? Note that I'm not necessarily using GitHub with such repos. An example is:

$ git-sizer --json --json-version 2 | egrep -A2 'maxCheckoutBlobSize|uniqueBlobSize'
Processing blobs: 5527                        
Processing trees: 32                        
Processing commits: 10                        
Matching commits to trees: 10                        
Processing annotated tags: 0                        
Processing references: 2                        
    "maxCheckoutBlobSize": {
        "description": "The maximum sum of file sizes in any checkout",
        "value": 3976127840,
--
    "uniqueBlobSize": {
        "description": "The total size of all distinct blob objects",
        "value": 3976127840,

For small repos, there was always a meaningful difference, with uniqueBlobSize.value being larger. An example is:

$ git-sizer --json --json-version 2 | egrep -A2 'maxCheckoutBlobSize|uniqueBlobSize'
Processing blobs: 1653                        
Processing trees: 3481                        
Processing commits: 1630                        
Matching commits to trees: 1630                        
Processing annotated tags: 0                        
Processing references: 5                        
    "maxCheckoutBlobSize": {
        "description": "The maximum sum of file sizes in any checkout",
        "value": 165999,
--
    "uniqueBlobSize": {
        "description": "The total size of all distinct blob objects",
        "value": 85867069,

Running 'make test' dies due to no .git when building from a .tar.gz

If you follow these build instructions and run make test on one of the .tar.gz releases you get:

$ make test; echo $?
[...]
        Messages:       command failed; output: "error: couldn't open Git repository: git rev-parse failed: fatal: not a git repository (or any of the parent directories): .git\n\n"
[...]
make: *** [gotest] Error 1
2

Just running git init in the build directory works around it, but I don't know how much this breaks the tests, if at all:

$ make test; echo $?
[...]
ok      github.com/github/git-sizer     0.123s
ok      github.com/github/git-sizer/counts      (cached)
?       github.com/github/git-sizer/git [no test files]
?       github.com/github/git-sizer/isatty      [no test files]
?       github.com/github/git-sizer/meter       [no test files]
?       github.com/github/git-sizer/sizes       [no test files]
0

So now I do git init in my packaging script, but it would be neat if this worked out of the box.

git cat-file: error: unknown option `buffer'

Just downloaded git-sizer 1.0.0 and tried it on one of my repos.
My git is 2.1.4.

Here is the message:

$ git-sizer
error: unknown option `buffer'
usage: git cat-file (-t|-s|-e|-p|<type>|--textconv) <object>
   or: git cat-file (--batch|--batch-check) < <list_of_objects>

<type> can be one of: blob, tree, commit, tag
    -t                    show object type
    -s                    show object size
    -e                    exit with zero when there's no error
    -p                    pretty-print object's content
    --textconv            for blob objects, run textconv on object's content
    --batch[=<format>]    show info and content of objects fed from the standard input
    --batch-check[=<format>]
                          show info about objects fed from the standard input

Processing blobs: 0                        
error: error scanning repository: read |0: file already closed

Release 1.4.0 Windows builds flagged as malicious by multiple vendors

git-sizer.exe in both git-sizer-1.4.0-windows-386.zip and git-sizer-1.4.0-windows-amd64.zip, along with the zip files themselves, are flagged as malicious by multiple vendors, per VirusTotal.

error: error scanning repository: <commit-id> has no header separator

I am running git-sizer on https://source.codeaurora.org/quic/qsdk/oss/kernel/linux-msm repository (2.4GB in size) and I get the following error.

error: error scanning repository: e34c586c99f12db9ebf61598a960a27f4e0359cd has no header separator

Line in source
git.go:578: return ObjectHeaderIter{}, fmt.Errorf("%s has no header separator", name)

I suspect the issue is with the repository and not with the tool. Is there a way to skip this and still run git-sizer on the repo?

TiA
zer0-0ne

Recommend the help and README mention that this is for local git repos only.

Nice tool.
Now, considering that Git is often used both locally and remotely, I think some users would find it helpful if there were a brief mention that git-sizer collects statistics about a local Git repository only.

A future enhancement suggestion may be to have git-sizer optionally interrogate a remote git
repository for the purpose of determining ahead of time if the user is about to clone
a very large repository.

Thanks again for making git-sizer available.
mrT

Doc Request: The importance of `Total size`

From the README:

"Total size" is the sum of the sizes of the corresponding objects in their uncompressed form, measured in bytes.

Why is this metric important?

Run on our repo, it gives 41 GB but the packfile is only 5 GB. The 41 GB uncompressed size gives me a 4/10 concern level.

I'd imagine the important stats are the size of the packfile (for cloning), the # of objects (for traversal), and the size of the checkout both in files and size (for normal dev workflows). Those numbers don't currently give me cause for concern in our daily usage.

So why should I be concerned about the uncompressed size of all the blobs in my repo? Wouldn't that include blobs in the history that may not be relevant anymore unless the user asks for them specifically? My best guess is that this metric somehow predicts how git gc will perform.

Detect and report when run outside of a Git repository

Currently, Git reports errors to stderr but git-sizer as a whole succeeds, reporting no problems. Instead, there should be an explicit test that the repository exists, and if not, the error should be reported more clearly. (The same check could also be used to make sure that a git executable is available.)

Add suggestion on how to resolve spotted issues

As an example, when the biggest blob is found, git-sizer could suggest how to remove it. There is a tool named git-forget-blob that could be used to fix the problem if the blob turns out to be a mistake.

For example in our case, someone had committed/pushed a cscope.out in their own branch.

When there is a possible solution like this, a "suggestion" on how to proceed next would be awesome.

Add "delta chain length" metric?

@MrChrisW made me aware of @peff's great analysis regarding delta chain length performance implications. I wrote a script based on @peff's method and the results are mostly consistent (see output below if you are curious).

My question is: Would it make sense to add a "delta chain length" metric to git-sizer? I take it that this could change with any repack. In that sense this kind of metric would be different from the existing ones.


#### REPO PACK BENCHMARK ###
git version 2.16.2.windows.1
System:    MINGW64_NT-10.0
Repo URL:  https://xyz

### TEST RUN ###
Start:     Fri, Apr 13, 2018  7:20:06 AM
Depth:     250
Size:      750M .
rev-list:  1m29.612s
log -Sfoo: 18m43.435s


### TEST RUN ###
Start:     Fri, Apr 13, 2018  8:21:54 AM
Depth:     100
Size:      752M .
rev-list:  1m27.198s
log -Sfoo: 19m55.479s     


### TEST RUN ###
Start:     Fri, Apr 13, 2018  9:29:05 AM
Depth:     50
Size:      756M .
rev-list:  1m49.097s         <--- This is a weird outlier although I report best of 5.
log -Sfoo: 15m14.464s


### TEST RUN ###
Start:     Fri, Apr 13, 2018 10:28:00 AM
Depth:     10
Size:      779M .
rev-list:  1m24.860s
log -Sfoo: 13m16.779s

Unsurprisingly benchmarking the same repo on Linux is way faster...

#### REPO PACK BENCHMARK ###
git version 2.16.2
System:    Linux
Repo URL:  https://xyz

### TEST RUN ###
Start:     Fri Apr 13 10:09:08 PDT 2018
Depth:     250
Size:      750M .
rev-list:  0m26.308s
log -Sfoo: 10m39.158s


### TEST RUN ###
Start:     Fri Apr 13 11:22:07 PDT 2018
Depth:     100
Size:      752M .
rev-list:  0m23.450s
log -Sfoo: 9m10.427s


### TEST RUN ###
Start:     Fri Apr 13 12:25:13 PDT 2018
Depth:     50
Size:      756M .
rev-list:  0m22.240s
log -Sfoo: 7m43.298s


### TEST RUN ###
Start:     Fri Apr 13 13:19:24 PDT 2018
Depth:     10
Size:      779M .
rev-list:  0m20.231s
log -Sfoo: 5m41.928s      <-- Woah, searching all commits takes only half the time :-)

[Discussion] Is overwriting a big json file continuously okay?

I am familiar with a repository named metakgp/naraad which overwrites a JSON file around 4-5 times a day. One can check the commit history to get more useful information.

As a result, the JSON file is not only big, but the repository also accumulates many big (non-binary) objects, so the problem goes undetected.

Should such repositories be a problem? Personally it's a big headache for me to clone, and work on it, and I get around it by just cloning the tip.

Improve NewObjectIter method

Hi @mhagger

To get all objects in git repo, I wonder if git cat-file --batch-check='%(objectname) %(objecttype) %(objectsize)' --batch-all-objects is more effective and easier than git rev-list --objects --stdin && git cat-file --batch-check --buffer.
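To show what consuming that output could look like, here is a sketch that tallies object counts and total sizes per type from lines in the `%(objectname) %(objecttype) %(objectsize)` format. The function name and the sample object names are hypothetical. One difference worth keeping in mind: `--batch-all-objects` lists every object in the repository, including unreachable ones, whereas git-sizer reports only objects reachable from references.

```python
from collections import defaultdict


def summarize_batch_check(lines):
    """Return {object type: (count, total size)} for batch-check output."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _name, objtype, size = line.split()
        totals[objtype][0] += 1
        totals[objtype][1] += int(size)
    return {objtype: tuple(pair) for objtype, pair in totals.items()}


# Hypothetical sample output (object names are placeholders).
sample = """\
1111111111111111111111111111111111111111 blob 120
2222222222222222222222222222222222222222 blob 4096
3333333333333333333333333333333333333333 tree 68
4444444444444444444444444444444444444444 commit 245
"""

summary = summarize_batch_check(sample.splitlines())
```

For the sample above, `summary["blob"]` is `(2, 4216)`.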

Error: panic: commit is not available

When trying out git-sizer on an internal repo I am getting the following error:

Processing blobs: 424652
Processing trees: 407283
panic: commit is not available

goroutine 1 [running]:
github.com/github/git-sizer/sizes.(*Graph).GetCommitSize(0xc420001380, 0xc37d1f0ae042c3b1, 0x181a181366861aad, 0xd4440d70, 0xb300000010)
	/home/mhagger/github/proj/git-sizer/git/.gopath/src/github.com/github/git-sizer/sizes/graph.go:630 +0xc6
github.com/github/git-sizer/sizes.(*Graph).RegisterCommit(0xc420001380, 0x775ac44301b3a032, 0x39b7ad7765e2bb53, 0xc4e6d0cfba, 0xc421f50180)
	/home/mhagger/github/proj/git-sizer/git/.gopath/src/github.com/github/git-sizer/sizes/graph.go:656 +0x1e1
github.com/github/git-sizer/sizes.ScanRepositoryUsingGraph(0xc42000e5d0, 0x116c6f0, 0x2, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/mhagger/github/proj/git-sizer/git/.gopath/src/github.com/github/git-sizer/sizes/graph.go:281 +0x11e8
main.mainImplementation(0x0, 0x0)
	/home/mhagger/github/proj/git-sizer/git/.gopath/src/github.com/github/git-sizer/git-sizer.go:150 +0x8ef
main.main()
	/home/mhagger/github/proj/git-sizer/git/.gopath/src/github.com/github/git-sizer/git-sizer.go:46 +0x26

Idea: have `git-sizer` read some defaults from the gitconfig

It would be nice to have git-sizer read some of its default settings from gitconfig; for example:

  • Which version of JSON output to use by default
  • What --names option to use by default
  • Whether to use --verbose by default
  • Default minimum --threshold
  • Possibly even some reference-selection options, maybe using different names for different selections

/cc @terrorobe, who suggested a use for this.

A tag depth of 1 isn't a cause for concern

Somehow I hadn't noticed before, but a "maximum tag depth" of 1 earns one "concern" star. But every tag has a depth of at least 1, so it's absurd to consider this at all concerning.

I think it would be better for depth 2 to get one star, depth 3 two stars, etc.

Add "Level of concern" to the json output

It would be nice to have "Level of concern" column content in JSON output for more friendly automation.

We are going to write a script that iterates across all our repos and reports every major level of concern.
As things stand, we need to run git-sizer twice per repo: once to get the "Level of concern" column and once to get the data as JSON.

What do you think?
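Until the JSON output carries this information, one workaround is to read the star column out of the table output. The sketch below assumes the three-column table layout shown elsewhere on this page; `parse_concern_table` is a hypothetical helper.

```python
def parse_concern_table(text):
    """Return a list of (metric, value, stars) from git-sizer's table output.

    Header, separator, and section rows (those without a value) are skipped.
    Metric names can repeat across sections, so a list is returned, not a dict.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # progress lines such as "Processing blobs: ..."
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 3:
            continue
        name, value, concern = cells
        if not value or name == "Name" or set(value) == {"-"}:
            continue
        # Drop the leading bullet and any footnote marker such as "[5]".
        name = name.lstrip("* ").split("[")[0].strip()
        rows.append((name, value, concern.count("*")))
    return rows


SAMPLE = """\
Processing references: 24
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Blobs                      |           |                                |
|   * Total size               |  12.8 GiB | *                              |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Maximum path length    [5] |   232 B   | **                             |
"""
```

Parsing human-oriented output is fragile, which is part of why having the field in the JSON output would be preferable.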

Does git-sizer count objects managed by Git LFS?

I have a largish bare repo with Git LFS installed (SVN to Git migration):

proj.git (BARE:master) $ git-sizer
Processing blobs: 1107392
Processing trees: 178226
Processing commits: 29412
Matching commits to trees: 29412
Processing annotated tags: 0
Processing references: 24
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Blobs                      |           |                                |
|   * Total size               |  12.8 GiB | *                              |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Trees                      |           |                                |
|   * Maximum entries      [1] |  1.96 k   | *                              |
| * Blobs                      |           |                                |
|   * Maximum size         [2] |   113 MiB | ***********                    |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories  [3] |  13.3 k   | ******                         |
| * Maximum path depth     [4] |    18     | *                              |
| * Maximum path length    [5] |   232 B   | **                             |
| * Number of files        [6] |   910 k   | ******************             |
| * Total size of files    [7] |  3.37 GiB | ***                            |

I've written a little git lfs ls-files helper, git_lfs_calculate_size_by_type.py, which reports this for the proj.git repo:

Git LFS objects summary:
.lib:   count: 1111     size: 8764.66 MB
.dll:   count: 749      size: 1427.98 MB
.pdb:   count: 612      size: 2814.09 MB
.exe:   count: 786      size: 2005.72 MB
.zip:   count: 24       size: 1153.65 MB
Total:  count: 3282     size: 16166.11 MB

Does the latter figure (16166.11 MB) relate to the former (12.8 GiB) in any way?
Or is the grand total for the repo, covering both Git and Git LFS objects, the sum of the two figures?

Can Git Sizer be used to compute a size of a directory including its history?

I assume the answer to this question is no, based on the reading of the - command, but I wonder if there might be a way anyway or perhaps you could direct me to a different tool / a Git command which is able to achieve this? I have a repository with a directory which is big and I plan on purging its (and only its, not the rest of the repository) history, but I want to first check how large it really is including all of its history and how much this operation will save me. It would also be useful to compare how large all the top level directories including their history are as compared to one another.
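One way to approximate this outside git-sizer is to list every blob in history together with its path and size, then total the sizes per top-level directory. The sketch below assumes input lines shaped like `<type> <objectname> <size> <path>`, which is what `git rev-list --objects --all` piped into `git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)'` prints. The function name is hypothetical.

```python
from collections import defaultdict


def size_by_top_level_dir(lines):
    """Sum uncompressed blob sizes per top-level directory.

    Each blob ID is counted once, under the first path it is seen with.
    Files in the repository root are grouped under ".".
    """
    totals = defaultdict(int)
    seen = set()
    for line in lines:
        parts = line.split(" ", 3)
        if len(parts) != 4 or parts[0] != "blob":
            continue  # commits, trees, and tags are not counted
        _type, oid, size, path = parts
        if oid in seen:
            continue  # identical content already counted
        seen.add(oid)
        top = path.split("/", 1)[0] if "/" in path else "."
        totals[top] += int(size)
    return dict(totals)


# Illustrative input only; object IDs are shortened placeholders.
SAMPLE = [
    "commit aaa 200 ",
    "tree bbb 60 assets",
    "blob ccc 1000 assets/logo.png",
    "blob ddd 50 src/main.go",
    "blob ccc 1000 docs/logo.png",
    "blob eee 10 README.md",
]
```

These are uncompressed sizes, so they overstate what a history purge would save on disk. Swapping `%(objectsize)` for `%(objectsize:disk)` gives on-disk sizes instead, though delta compression makes per-object disk figures only a rough guide.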

Detect & report tmp_pack files

When git repack fails, it's possible that .git/objects/pack/tmp_pack_* files are left over. Since these are likely to be large, I suggest that git-sizer users could benefit from being informed about them, for example in the Biggest objects section.
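A minimal sketch of such a check, assuming the leftover files match `tmp_pack_*` directly inside `objects/pack` as described above. `find_tmp_packs` is a hypothetical helper, not part of git-sizer.

```python
from pathlib import Path


def find_tmp_packs(git_dir):
    """Return (path, size in bytes) for tmp_pack_* files, largest first."""
    pack_dir = Path(git_dir) / "objects" / "pack"
    found = [
        (path, path.stat().st_size)
        for path in pack_dir.glob("tmp_pack_*")
        if path.is_file()
    ]
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Calling `find_tmp_packs(".git")` from the top of a working tree would list any candidates. A file with this name can also belong to a repack or fetch that is still running, so a report should probably describe such files as possibly stale instead of recommending deletion outright.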

Question: How is logical file size calculated (Total size of files)

I am trying to understand how git-sizer calculates logical file size. I have included some observations below, using finagle as an example: the size reported by GitHub, the output of du -h, and the size of the pack file. These all come to roughly the same figure.

My question is about logical size. I found a script and compared its output with git-sizer's; the two differ, and I was wondering if someone could help me understand why. I assume it has something to do with deletions and with evaluating the size by walking the tree.

Example project: https://github.com/twitter/finagle
GitHub reported size: 101.81MB
Git pack size (.pack file): 102MB (using ls -lh rounded up from 106701022 bytes)
Bash script using du -h: 132M (I understand this to be the contents of the .git directory along with the files at the version they are at in the working directory)
Git-Sizer: Total size of files [8] | 159 MiB (I understand this to be logical size and not considering any of the compression techniques)
*Bash script using git rev-list and git cat-file: 175M (deduplicating by filename)
**Bash script using git rev-list and git cat-file: 291M (summing all file sizes as they appear, without deduplicating by filename)

* script: git rev-list --no-walk --all --objects --date-order | sort -u -t' ' -k2r | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | grep ^blob | cut -d ' ' -f3 | paste -s -d + - | bc | numfmt --field 1 --to=iec

** script: git rev-list --no-walk --all --objects --date-order | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | grep ^blob | cut -d ' ' -f3 | paste -s -d + - | bc | numfmt --field 1 --to=iec
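A small sketch of why the three figures can legitimately differ. It rests on an assumption that is suggested by the metric's placement under "Biggest checkouts" but not confirmed in this thread: that `Total size of files` describes the single largest checkout, while the two scripts aggregate over the tips of all refs (`--no-walk --all`). All names and numbers below are invented for illustration.

```python
def compare_size_rules(checkouts):
    """Compare three ways of totalling file sizes.

    checkouts maps a ref tip to its files as {path: (object_id, size)}.
    """
    per_checkout = {
        tip: sum(size for _oid, size in files.values())
        for tip, files in checkouts.items()
    }
    unique_objects = {}
    unique_paths = {}
    for files in checkouts.values():
        for path, (oid, size) in files.items():
            unique_objects[oid] = size          # every distinct blob once
            unique_paths.setdefault(path, size)  # one blob per distinct path
    return {
        "biggest_checkout": max(per_checkout.values(), default=0),
        "unique_paths": sum(unique_paths.values()),
        "unique_objects": sum(unique_objects.values()),
    }


# Two tips that share a path but not its content.
EXAMPLE = {
    "main": {"a.txt": ("o1", 100), "b.txt": ("o2", 50)},
    "v1.0": {"a.txt": ("o3", 80), "c.txt": ("o4", 20)},
}
```

With this toy input the largest single checkout is 150 bytes, deduplicating by path gives 170, and counting every distinct blob gives 250, which mirrors the ordering of the 159 MiB, 175M, and 291M figures above.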

git-sizer error: "running 'git config': exit status 129"

Not sure what the problem is. I downloaded the zip file for Linux AMD64 but get the above error when running:

$ git-sizer

error: running 'git config': exit status 129

It appears that git config is being passed an invalid parameter, but it is called by the git-sizer command, which should in theory know how to call git with the correct parameters.

If it helps:

$ type -a git
git is /usr/bin/git

$ type -a git-sizer
git-sizer is /home/rick/bin/git-sizer

$ git-sizer --version
git-sizer release 1.5.0

$ git --version
git version 2.7.4

Any suggestions would be appreciated as it would be nice to know how much GitHub resources my repo is consuming.

Don't consider objects under `refs/notes/*` when looking for "worst" items

Git objects reachable only from refs/notes/ are peculiar and not likely to be checked out, so probably it would be better not to include them for the items under "biggest checkouts". OTOH it probably does make sense to include them in the other statistics.

Are there other reference namespaces in common use that should get similar special treatment?
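The filtering itself could be as simple as a prefix check on the reference name. This is a sketch of the proposal only: the tuple of excluded prefixes is an assumption that could grow if other namespaces turn out to deserve the same treatment.

```python
# Namespaces whose objects are unlikely to be checked out (assumed list).
NON_CHECKOUT_PREFIXES = ("refs/notes/",)


def counts_toward_checkouts(refname, excluded=NON_CHECKOUT_PREFIXES):
    """True if objects reachable from this ref should feed the
    "Biggest checkouts" statistics under the proposal above."""
    return not refname.startswith(excluded)
```

References rejected by this check would still contribute to the overall size statistics.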

error: Could not read changeset

When we ran the git-sizer binary on our repository, which is 1.8 GiB in size, we got the error shown below:

fatal: revision walk setup failed
Processing blobs: 0                        
error: error scanning repository: exit status 128
[root@f258e46eaf0a zohobooks_server.git]# 

Git version - git version 2.15.1

Kindly help us to resolve the issue.

Thanks,

Kaleeswaran

--version flag or similar

I just checked in on this project to see if there was a newer release than the one I have in my ~/bin folder. I saw there was a second release, but I had no idea which version I had locally. I mean, I think the latest release was newer than the one I had, but I was surprised to find that running git-sizer --version gave me unknown flag: --version. So I just had to overwrite the binary whether I actually needed to or not. 🤷‍♂️

Advice for alleviating found issues?

TL;DR

I really like this tool. One thing I would like, though, would be if it could suggest ways to fix or otherwise deal with the issues it finds. (I don't know if this is always possible for every type of issue this tool can detect.)

The Long Version

I'm a programmer who uses git every day and I'm familiar with how to do the common stuff like cloning, branching, committing, merging, etc. I've been using it for years. I read Version Control with Git, so I have a decent understanding of how git works. I've written a program that uses Rugged, the Ruby bindings to libgit2. I'm even familiar enough that I can do more advanced stuff like interactive rebasing for editing commit messages or author info, and for squashing commits. I can do non-interactive rebasing to clean up branch histories.

Point is, I'm familiar with how git works and how to use it. But I'm not familiar with git's performance profile. That's how I've been using git-sizer the past few days. I've been running it on the various repos to which I contribute to see what pops up.

But once it tells me that certain objects/trees/tags/whatever are potentially problematic, then what? Is there something I can do to fix that, or is the tool just telling me my repo is screwed? 😉

Feature: Set nonzero exit code with --critical

For automation purposes, it would be a great feature to be able to analyze repositories and get a machine readable exit code that states that critical issues were found.

One way is to directly have --critical set the exit code or add another flag that does that if critical issues were found.
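In the meantime, a wrapper script can derive an exit code from the table output. The sketch below counts the asterisks in the last table column and compares the worst row against a caller-supplied threshold. The threshold that corresponds to "critical" is left as a parameter because it is not stated in this issue, and `exit_code_for` is a hypothetical name.

```python
import re

# Matches a final table cell that consists only of asterisks.
_CONCERN_CELL = re.compile(r"\|\s*(\*+)\s*\|\s*$")


def exit_code_for(table_text, min_stars):
    """Return 1 if any row shows at least min_stars asterisks, else 0."""
    worst = 0
    for line in table_text.splitlines():
        match = _CONCERN_CELL.search(line)
        if match:
            worst = max(worst, len(match.group(1)))
    return 1 if worst >= min_stars else 0
```

A wrapper would capture the standard output of git-sizer, pass it to `exit_code_for`, and hand the result to `sys.exit`.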

Thanks for creating this great tool.

Can it work on a bare repo?

Hi,
Thanks for this wonderful solution!
We are thinking of running the analysis directly on the hosting service (Bitbucket).
Since the hosting service stores all repositories as bare repos, would it be possible to run git-sizer on all of those bare repos?

The idea would be to gather analytics organization-wide and then, based on any anomalies found, take a closer look at (and correct) individual repositories.
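For the "all repos on the server" part, a sketch of how the bare repositories might be discovered before running git-sizer in each one. The detection rule (a directory that directly contains `HEAD`, `objects/`, and `refs/`) is a heuristic assumed here, and the function names are hypothetical. The on-disk layout of a particular hosting product may differ.

```python
import os
from pathlib import Path


def looks_like_bare_repo(path):
    """Heuristic: HEAD, objects/ and refs/ sit directly in the directory."""
    p = Path(path)
    return (p / "HEAD").is_file() and (p / "objects").is_dir() and (p / "refs").is_dir()


def find_bare_repos(root):
    """Yield directories under root that look like bare repositories."""
    for dirpath, dirnames, _filenames in os.walk(root):
        if looks_like_bare_repo(dirpath):
            dirnames[:] = []  # do not descend into the repository itself
            yield Path(dirpath)


# Possible use (not executed here):
#   for repo in find_bare_repos("/path/to/repositories"):
#       subprocess.run(["git-sizer"], cwd=repo, check=False)
```

The same heuristic also matches the `.git` directory of a non-bare clone, which is harmless for this purpose.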

Thanks!

go get: installing executables with 'go get' in module mode is deprecated

Following the "compile from source" section in docs/BUILDING.md, I ran go get and received the following output:

$ go get github.com/github/git-sizer
go: downloading github.com/github/git-sizer v1.5.0
go: downloading github.com/spf13/pflag v1.0.5
go: downloading github.com/cli/safeexec v1.0.0
go: downloading golang.org/x/sync v0.0.0-20210220032951-036812b2e83c
go get: installing executables with 'go get' in module mode is deprecated.
	Use 'go install pkg@version' instead.
	For more information, see https://golang.org/doc/go-get-install-deprecation
	or run 'go help get' or 'go help install'.

I am using Go version 1.17.1.

"git count-objects" return way smaller size than git-sizer

Git-sizer returns the output at the bottom and states that we have 13.9GB of blobs. The .git folder is only 210MB and when running

git count-objects -vH

I also get

C:\Code\TestRepo>git count-objects -vH
count: 0
size: 0 bytes
in-pack: 554928
packs: 1
size-pack: 210.25 MiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes

What am I missing?

C:\Code\TestRepo>git-sizer.exe --verbose
Processing blobs: 159789
Processing trees: 356816
Processing commits: 38323
Matching commits to trees: 38323
Processing annotated tags: 0
Processing references: 3
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Commits                    |           |                                |
|   * Count                    |  38.3 k   |                                |
|   * Total size               |  12.0 MiB |                                |
| * Trees                      |           |                                |
|   * Count                    |   357 k   |                                |
|   * Total size               |   404 MiB |                                |
|   * Total tree entries       |  9.12 M   |                                |
| * Blobs                      |           |                                |
|   * Count                    |   160 k   |                                |
|   * Total size               |  13.7 GiB | *                              |
| * Annotated tags             |           |                                |
|   * Count                    |     0     |                                |
| * References                 |           |                                |
|   * Count                    |     3     |                                |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Commits                    |           |                                |
|   * Maximum size         [1] |  1.71 KiB |                                |
|   * Maximum parents      [2] |     2     |                                |
| * Trees                      |           |                                |
|   * Maximum entries      [3] |  1.18 k   | *                              |
| * Blobs                      |           |                                |
|   * Maximum size         [4] |  7.82 MiB |                                |
|                              |           |                                |
| History structure            |           |                                |
| * Maximum history depth      |  38.3 k   |                                |
| * Maximum tag depth          |     0     |                                |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories  [5] |  5.12 k   | **                             |
| * Maximum path depth     [6] |    13     | *                              |
| * Maximum path length    [7] |   219 B   | **                             |
| * Number of files        [6] |  34.9 k   |                                |
| * Total size of files    [6] |   490 MiB |                                |
| * Number of symlinks         |     0     |                                |
| * Number of submodules       |     0     |                                |

Run from subdirectory of the repository

Trying to run from a subdirectory of the git repository doesn't work. In the example here "working-dir" is the top of the repository.

>git-sizer 
fatal: Cannot change to 'home/jpschewe/projects/fll-sw/working-dir/.git': No such file or directory
fatal: Cannot change to 'home/jpschewe/projects/fll-sw/working-dir/.git': No such file or directory
fatal: Cannot change to 'home/jpschewe/projects/fll-sw/working-dir/.git': No such file or directory
Processing blobs: 0                        
error: error scanning repository: exit status 128
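The path in the error message lacks its leading `/`, which suggests the repository location is being derived incorrectly when the current directory is not the top level. As a sketch of one way to locate the repository from any subdirectory, the function below walks up the directory tree and returns an absolute path. It is an illustration, not git-sizer's actual lookup code, and `find_git_dir` is a hypothetical name.

```python
from pathlib import Path


def find_git_dir(start):
    """Walk up from start until a directory containing '.git' is found.

    Returns the absolute path to '.git', or None if no parent has one.
    """
    current = Path(start).resolve()
    for candidate in (current, *current.parents):
        git_dir = candidate / ".git"
        if git_dir.exists():
            return git_dir  # absolute, so the leading '/' is preserved
    return None
```

This does not follow `.git` files (as used by worktrees and submodules) or honor `GIT_DIR`; asking Git itself, for example with `git rev-parse --git-dir`, covers those cases.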
