Comments (19)

NorthIsUp avatar NorthIsUp commented on June 1, 2024 2

I wrote this to fetch the repos; unfortunately, it only downloads shallow clones.

#!/usr/bin/env python3
import json
import re
from http.client import TOO_MANY_REQUESTS
from pathlib import Path
from time import sleep, time
from typing import List, Literal, TypeVar, Union

import click
import requests
from rich import print
from rich import print_json as _print_json
from rich import traceback

traceback.install()

T = TypeVar("T")


def inspect(any: T) -> T:
    print(any)
    return any


def inspect_json(any: T) -> T:
    _print_json(any)
    return any


def jprint(obj):
    # create a formatted string of the Python JSON object
    print(json.dumps(obj, sort_keys=True, indent=4))


def fetch(
    url: str,
    *path: Union[str, int],
    method: Literal["GET", "POST"] = "GET",
    pretty: bool = True,
    backoff: int = 0,
):
    print(f"===> fetching {method} {url}")
    resp = requests.request(method, url)
    if resp.status_code == TOO_MANY_REQUESTS:
        sleep_until = int(resp.headers["X-RateLimit-Reset"])
        sleep_for = int(sleep_until - time()) + 1
        print(f"backing off until {sleep_until} (i.e. {sleep_for} seconds)")
        sleep(sleep_for)
        return fetch(url, *path, method=method, pretty=pretty, backoff=backoff + 1)

    resp.raise_for_status()
    json = resp.json()
    if pretty:
        inspect(json)

    for p in path:
        json = json[p]

    return json


def fetch_repo(repo: str, no_wait: bool = False):
    # Download a bare-git tarball of the repo's HEAD revision from the Software Heritage vault.
    destination = Path(f'~/Downloads/{repo.replace("/", "---")}.git.tar').expanduser()
    # if destination.exists():
    #     print(f"skipping {repo}, already fetched")
    #     return

    assert re.match(r"\w+/\w+", repo), "repo must be of format ORG/NAME"
    visits_url = f"https://archive.softwareheritage.org/api/1/origin/https://github.com/{repo}/visits/"

    snapshot_url = fetch(visits_url, 0, "snapshot_url")
    head_url = fetch(snapshot_url, "branches", "HEAD", "target_url")
    # HEAD resolves to a revision, which is what the swh:1:rev: identifier expects.
    revision_id = fetch(head_url, "id")

    vault_url = (
        f"https://archive.softwareheritage.org/api/1/vault/git-bare/swh:1:rev:{revision_id}/"
    )
    
    meal_json = fetch(vault_url, method="POST")

    if no_wait:
        print("not waiting")
        return meal_json

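    while meal_json["status"] != "done":
        if meal_json["status"] == "failed":
            # Assumes the vault reports "failed" when cooking cannot finish;
            # without this check the loop would poll forever.
            print(f"cooking failed for {repo}, giving up")
            break
        # Sleep before polling again so a finished bundle is downloaded right away.
        sleep(60)
        meal_json = fetch(vault_url, pretty=False)
        print(meal_json)
    else:
        print("downloading!")
        resp = requests.get(meal_json["fetch_url"])
        resp.raise_for_status()
        print(f"saving at {destination}")
        print(resp.headers)
        destination.write_bytes(resp.content)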


@click.command()
@click.argument("repos", nargs=-1)
@click.option("--no-wait/--wait")
def main(repos: List[str], no_wait: bool):
    for repo in repos:
        try:
            fetch_repo(repo, no_wait=no_wait)
        except Exception as e:
            print(e)


if __name__ == "__main__":
    main()
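
One detail in the backoff branch of fetch() may be worth guarding: if the X-RateLimit-Reset time has already passed when the response arrives, the computed delay goes negative, and time.sleep() raises ValueError on negative values. A minimal sketch of a clamped version follows; the helper name seconds_until_reset and its now parameter are hypothetical, and it assumes (as the script does) that the header carries a Unix epoch timestamp.

```python
import time
from typing import Optional


def seconds_until_reset(reset_epoch: str, now: Optional[float] = None) -> int:
    """Return how long to sleep before retrying, never less than one second.

    reset_epoch is the raw header value, assumed to be a Unix timestamp.
    now is injectable so the function can be checked without waiting.
    """
    current = time.time() if now is None else now
    # Same arithmetic as the script, clamped so a stale header cannot
    # produce a negative sleep.
    return max(1, int(int(reset_epoch) - current) + 1)
```

Inside fetch() this would replace the two lines that compute sleep_for.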

from i_want_to_help.

NorthIsUp avatar NorthIsUp commented on June 1, 2024 2

ok, here we go, should be basically all of it: https://github.com/northismirror/


pyrox0 avatar pyrox0 commented on June 1, 2024 1

Here's a list of useful urls to look at:

Edit 1:


dmitmel avatar dmitmel commented on June 1, 2024 1

I figured out how to clone a full Git repository from their archive. You need to send a POST request to https://archive.softwareheritage.org/api/1/vault/git-bare/swh:1:rev:<REVID>/, then poll the same URL until it finishes cooking (polling every second is too frequent, to be honest). Then you will be able to download a Tar archive, inside of which there will be a bare Git repository. The repository can be "unpacked" by doing git clone repo-bare.git repo-unpacked, but honestly, if all you need is to set the remote and push, then those commands can be run from inside of the bare repository.
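
The cook-then-poll flow described above can be sketched roughly as follows. This is not the archive's official client: cook_and_wait and its request parameter are hypothetical names, the HTTP call is passed in as a function so the loop can be exercised without network access, and the "failed" status check is an assumption about what the vault reports when cooking cannot complete.

```python
import time
from typing import Callable, Dict

VAULT_URL = "https://archive.softwareheritage.org/api/1/vault/git-bare/{swhid}/"


def cook_and_wait(
    swhid: str,
    request: Callable[[str, str], Dict],
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Ask the vault to cook a bare-git bundle and return its fetch_url.

    request(method, url) must return the decoded JSON body as a dict.
    """
    url = VAULT_URL.format(swhid=swhid)
    status = request("POST", url)  # start (or re-attach to) the cooking job
    while status["status"] != "done":
        if status["status"] == "failed":  # assumed failure marker
            raise RuntimeError(f"cooking failed for {swhid}")
        sleep(poll_seconds)  # poll slowly; cooking can take minutes
        status = request("GET", url)
    return status["fetch_url"]
```

With the requests library, the request argument could be something like lambda method, url: requests.request(method, url).json().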


dmitmel avatar dmitmel commented on June 1, 2024 1

By the way, I want to mention that I have recreated the wikis of zdharma/zui and zdharma/declare-zsh from what was in the Web Archive: https://gist.github.com/dmitmel/895d65776f4d61bc74ef21454f221dc2 and https://gist.github.com/dmitmel/6af23be49547024e55a34a13cae99e16


pschmitt avatar pschmitt commented on June 1, 2024 1

> By the way, I want to mention that I have recreated the wikis of zdharma/zui and zdharma/declare-zsh from what was in the Web Archive: gist.github.com/dmitmel/895d65776f4d61bc74ef21454f221dc2 and gist.github.com/dmitmel/6af23be49547024e55a34a13cae99e16

Sweet.
I restored declare-zsh and its wiki: https://github.com/zdharma-continuum/declare-zsh

Unfortunately, I don't have access to the wiki section of https://github.com/zdharma-continuum/zui
@alichtman Can you restore it or grant me access?


alichtman avatar alichtman commented on June 1, 2024 1

@pschmitt Upgraded you to write access. Let me know if that does the trick for you.


NorthIsUp avatar NorthIsUp commented on June 1, 2024

(Updated the script to fetch the git tars. Teamwork makes the dream work!)


pschmitt avatar pschmitt commented on June 1, 2024

The script works great, I've used it to fetch the commits for zdharma-continuum/zsh-package-any-gem, zdharma-continuum/zsh-package-any-node and zdharma-continuum/zsh-package-firefox-dev.
Great stuff!


pschmitt avatar pschmitt commented on June 1, 2024

> @pschmitt Upgraded you to write access. Let me know if that does the trick for you.

That worked.

@dmitmel
ZUI wiki restored: https://github.com/zdharma-continuum/zui/wiki


pschmitt avatar pschmitt commented on June 1, 2024

Here's the spaghetti code I used to migrate repos btw:

repo="$1"

if [[ -z "$repo" ]]
then
  echo "Missing repo name. Please source again. source $0 ORG/REPO" >&2
  return 2
fi

mkdir -p data && \
python fetch.py "$repo" && mv ~/Downloads/${repo//\//---}.git.tar data && \
  cd ./data && { unsetopt nomatch; rm -rf swh* repo.git && setopt nomatch } && \
  tar xf ${repo//\//---}.git.tar && \
  git clone swh* repo.git && \
  cd repo.git && \
  git remote remove origin && \
  git checkout -b main && \
  gh repo create --public -y zdharma-continuum/$(basename $repo) && \
  git push origin main

Usage: source migrate.zsh ORG/REPO

Edit: @NorthIsUp I had to edit the regex in your script to make this work for some of the repos I migrated

assert re.match(r".+/.+", repo), "repo must be of format ORG/NAME"
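
To illustrate why the original pattern rejected some repos: \w does not match a hyphen, so an organisation name such as zdharma-continuum fails before the slash is reached. The relaxed .+/.+ pattern accepts those names, but it also accepts inputs with extra slashes. A possible middle ground is sketched below; the helper name is_org_repo and the exact character class are assumptions, not something taken from the script.

```python
import re

# Letters, digits, underscore, dot and hyphen on each side of a single slash.
ORG_REPO = re.compile(r"[\w.-]+/[\w.-]+")


def is_org_repo(repo: str) -> bool:
    """Return True when repo looks like ORG/NAME with exactly one slash."""
    # fullmatch anchors both ends, unlike re.match which only anchors the start.
    return ORG_REPO.fullmatch(repo) is not None
```

In the script this would be used as assert is_org_repo(repo), "repo must be of format ORG/NAME".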


dmitmel avatar dmitmel commented on June 1, 2024

@pschmitt Repos might contain other branches besides main. I think, to handle that you need to remove the git checkout -b main, and use git push origin --all (git push origin --tags after that won't hurt).
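
The suggestion above could be scripted along these lines. This is only a sketch: push_commands and push_everything are hypothetical helper names, and it assumes the remote is already configured in the repository. Note that git push --all pushes local branches (refs/heads/*), so it is likely to be most useful when run from inside the extracted bare repository, where every archived branch is a local branch.

```python
import subprocess
from typing import List


def push_commands(remote: str = "origin") -> List[List[str]]:
    """Build the two pushes: every branch first, then every tag."""
    return [
        ["git", "push", remote, "--all"],
        ["git", "push", remote, "--tags"],
    ]


def push_everything(
    repo_dir: str, remote: str = "origin", dry_run: bool = True
) -> List[List[str]]:
    """Run (or, with dry_run, only print) the pushes inside repo_dir."""
    commands = push_commands(remote)
    for command in commands:
        if dry_run:
            print("would run:", " ".join(command))
        else:
            # check=True stops at the first failing push.
            subprocess.run(command, cwd=repo_dir, check=True)
    return commands
```

Calling push_everything("repo.git") prints the commands without touching the network; pass dry_run=False to actually push.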


pschmitt avatar pschmitt commented on June 1, 2024

Right, I blanked on that one. Damn.


pschmitt avatar pschmitt commented on June 1, 2024

Hm. I've re-downloaded a few (30+) zdharma repos from softwareheritage.org and it looks like softwareheritage.org only archives the main/master branch...


dmitmel avatar dmitmel commented on June 1, 2024

Not true: https://archive.softwareheritage.org/browse/origin/branches/?origin_url=https://github.com/neovim/neovim


pschmitt avatar pschmitt commented on June 1, 2024

> Not true: archive.softwareheritage.org/browse/origin/branches/?origin_url=https://github.com/neovim/neovim

My bad. I guess I was only (un)lucky with my picks then. Thanks for letting me know.


pschmitt avatar pschmitt commented on June 1, 2024

Now I understand why I only got a single branch from my archives:

    head_url = fetch(snapshot_url, "branches", "HEAD", "target_url")

Will try to improve the script so that it fetches all branches.

Here it is:

#!/usr/bin/env python3
import json
import re
from http.client import TOO_MANY_REQUESTS
from os.path import basename
from pathlib import Path
from time import sleep, time
from typing import List, Literal, TypeVar, Union

import click
import requests
from rich import print
from rich import print_json as _print_json
from rich import traceback

traceback.install()

T = TypeVar("T")


def inspect(any: T) -> T:
    print(any)
    return any


def inspect_json(any: T) -> T:
    _print_json(any)
    return any


def jprint(obj):
    # create a formatted string of the Python JSON object
    print(json.dumps(obj, sort_keys=True, indent=4))


def fetch(
    url: str,
    *path: Union[str, int],
    method: Literal["GET", "POST"] = "GET",
    pretty: bool = True,
    backoff: int = 0,
):
    print(f"===> fetching {method} {url}")
    resp = requests.request(method, url)
    if resp.status_code == TOO_MANY_REQUESTS:
        sleep_until = int(resp.headers["X-RateLimit-Reset"])
        sleep_for = int(sleep_until - time()) + 1
        print(f"backing off until {sleep_until} (i.e. {sleep_for} seconds)")
        sleep(sleep_for)
        return fetch(url, *path, method=method, pretty=pretty, backoff=backoff + 1)

    resp.raise_for_status()
    json = resp.json()
    if pretty:
        inspect(json)

    for p in path:
        json = json[p]

    return json


def fetch_repo(repo: str, no_wait: bool = False):
    # Download one bare-git tarball per branch of the repo's latest snapshot.

    assert re.match(r".+/.+", repo), "repo must be of format ORG/NAME"
    visits_url = f"https://archive.softwareheritage.org/api/1/origin/https://github.com/{repo}/visits/"

    snapshot_url = fetch(visits_url, 0, "snapshot_url")
    branches = fetch(snapshot_url, "branches")
    for branch, branch_data in branches.items():
        print(f"Processing branch {branch}")
        if branch_data.get("target_type") == "alias":
            print(f"SKIP alias branch {branch}")
            continue
        # The snapshot listing already carries each branch's target URL,
        # so the snapshot does not need to be requested again per branch.
        branch_url = branch_data["target_url"]
        revision_id = fetch(branch_url, "id")

        vault_url = f"https://archive.softwareheritage.org/api/1/vault/git-bare/swh:1:rev:{revision_id}/"

        destination = Path(
            f'~/Downloads/{repo.replace("/", "---")}---{basename(branch)}.git.tar'
        ).expanduser()
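        if destination.exists():
            # continue, not return: the remaining branches still need fetching.
            print(f"skipping {repo} branch {branch}, already fetched")
            continue

        meal_json = fetch(vault_url, method="POST")

        if no_wait:
            # Only request cooking, then move on to the next branch.
            print("not waiting")
            continue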

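        while meal_json["status"] != "done":
            if meal_json["status"] == "failed":
                # Assumes the vault reports "failed" when cooking cannot finish;
                # without this check the loop would poll forever.
                print(f"cooking failed for {repo} branch {branch}, skipping")
                break
            print("Sleeping for 30s")
            sleep(30)
            meal_json = fetch(vault_url, pretty=False)
            print(meal_json)
        else:
            print("downloading!")
            resp = requests.get(meal_json["fetch_url"])
            resp.raise_for_status()
            print(f"saving at {destination}")
            print(resp.headers)
            destination.write_bytes(resp.content)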


@click.command()
@click.argument("repos", nargs=-1)
@click.option("--no-wait/--wait")
def main(repos: List[str], no_wait: bool):
    for repo in repos:
        try:
            fetch_repo(repo, no_wait=no_wait)
        except Exception as e:
            print(e)


if __name__ == "__main__":
    main()
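
One caveat about the branch loop: it skips aliases but builds a swh:1:rev: identifier for every other entry, including tags. As far as I understand the snapshot format, entries can also point at release objects (annotated tags) rather than revisions, and a release id would presumably not be valid in a revision identifier. A rough sketch of a stricter filter follows; the function name revision_branches and the sample data are hypothetical, and the "release" and null entries are assumptions about what the API can return.

```python
from typing import Dict, List, Optional


def revision_branches(branches: Dict[str, Optional[dict]]) -> List[str]:
    """Return the names of snapshot entries that point directly at a revision."""
    return sorted(
        name
        for name, data in branches.items()
        # Skip null entries, aliases such as HEAD, and non-revision targets.
        if data and data.get("target_type") == "revision"
    )


# Hypothetical snapshot excerpt, shaped like the "branches" mapping the script reads.
sample = {
    "HEAD": {"target_type": "alias", "target": "refs/heads/master"},
    "refs/heads/master": {"target_type": "revision", "target": "aaa"},
    "refs/tags/v1": {"target_type": "release", "target": "bbb"},
    "refs/heads/gone": None,
}
print(revision_branches(sample))
```

Iterating over revision_branches(branches) instead of branches.items() would restrict the loop to entries that are safe to cook as revisions.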


pschmitt avatar pschmitt commented on June 1, 2024

Update: I've added the missing branches, for some of the repos at least. If there's anything missing please ping me to add it.

Also I pushed some of the original tags - well I had to re-create them manually, didn't see another way.


alichtman avatar alichtman commented on June 1, 2024

I believe we're done with the migration. Closing.
