⚒️ Migration scripts about i_want_to_help HOT 19 CLOSED

zdharma-continuum commented on June 1, 2024 5

⚒️ Migration scripts

from i_want_to_help.

Comments (19)

NorthIsUp commented on June 1, 2024 2

I wrote this to fetch the repos. Unfortunately, it only downloads shallow clones.

#!/usr/bin/env python3
import json
import re
from http.client import TOO_MANY_REQUESTS
from pathlib import Path
from time import sleep, time
from typing import List, Literal, TypeVar, Union

import click
import requests
from rich import print
from rich import print_json as _print_json
from rich import traceback

traceback.install()

T = TypeVar("T")


def inspect(any: T) -> T:
    print(any)
    return any


def inspect_json(any: T) -> T:
    _print_json(any)
    return any


def jprint(obj):
    # create a formatted string of the Python JSON object
    print(json.dumps(obj, sort_keys=True, indent=4))


def fetch(
    url: str,
    *path: Union[str, int],
    method: Literal["GET", "POST"] = "GET",
    pretty: bool = True,
    backoff: int = 0,
):
    print(f"===> fetching {method} {url}")
    resp = requests.request(method, url)
    if resp.status_code == TOO_MANY_REQUESTS:
        sleep_until = int(resp.headers["X-RateLimit-Reset"])
        sleep_for = int(sleep_until - time()) + 1
        print(f"backing off until {sleep_until} (i.e. {sleep_for} seconds)")
        sleep(sleep_for)
        return fetch(url, *path, method=method, pretty=pretty, backoff=backoff + 1)

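    resp.raise_for_status()
    data = resp.json()  # named `data` so the imported json module is not shadowed
    if pretty:
        inspect(data)

    # Walk down the decoded JSON along the given keys/indices.
    for p in path:
        data = data[p]

    return data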


def fetch_repo(repo: str, no_wait: bool = False):
    # Fetch the revision behind HEAD as a git-bare tarball from the Software Heritage vault.
    destination = Path(f'~/Downloads/{repo.replace("/", "---")}.git.tar').expanduser()
    # if destination.exists():
    #     print(f"skipping {repo}, already fetched")
    #     return

    assert re.match(r"\w+/\w+", repo), "repo must be of format ORG/NAME"
    visits_url = f"https://archive.softwareheritage.org/api/1/origin/https://github.com/{repo}/visits/"

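    # First visit in the list -> its snapshot -> the target that HEAD points to.
    snapshot_url = fetch(visits_url, 0, "snapshot_url")
    head_url = fetch(snapshot_url, "branches", "HEAD", "target_url")
    # Despite the name, this is used as a revision id (swh:1:rev:<id>) below.
    directory_id = fetch(head_url, "id")

    vault_url = (
        f"https://archive.softwareheritage.org/api/1/vault/git-bare/swh:1:rev:{directory_id}/"
    )

    # POST asks the vault to start cooking the bundle; the GETs below poll its status.
    meal_json = fetch(vault_url, method="POST")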

    if no_wait:
        print("not waiting")
        return meal_json

    while meal_json["status"] != "done":
        meal_json = fetch(vault_url, pretty=False)
        print(meal_json)
        sleep(60)
    else:
        print("downloading!")
        resp = requests.get(meal_json["fetch_url"])
        print(f"saving at {destination}")
        print(resp.headers)
        destination.write_bytes(resp.content)


@click.command()
@click.argument("repos", nargs=-1)
@click.option("--no-wait/--wait")
def main(repos: List[str], no_wait: bool):
    for repo in repos:
        try:
            fetch_repo(repo, no_wait=no_wait)
        except Exception as e:
            print(e)


if __name__ == "__main__":
    main()

NorthIsUp commented on June 1, 2024 2

ok, here we go, should be basically all of it: https://github.com/northismirror/

pyrox0 commented on June 1, 2024 1

Here's a list of useful urls to look at:

Edit 1:

All of the zinit-zsh org's repos

dmitmel commented on June 1, 2024 1

I figured out how to clone a full Git repository from their archive. You need to send a request to https://archive.softwareheritage.org/api/1/vault/git-bare/swh:1:rev:<REVID>/, and wait for it to finish cooking (sleep for 1 second is too frequent, to be honest). Then you will be able to download a Tar archive, inside of which there will be a bare Git repository. The repository can be "unpacked" by doing git clone repo-bare.git repo-unpacked, but honestly, if all you need is to set the remote and push, then those commands can be run from inside of the bare repository.
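
The request, wait, download cycle described above could be sketched roughly like this. It is only an illustration: `wait_until_cooked`, the `poll` callable, the 30-second interval and the `max_polls` limit are made-up names and values. The `"done"` status and the `fetch_url` field are the ones the fetch script in this thread relies on; treating `"failed"` as a terminal status is an assumption about the vault API.

```python
import time
from typing import Callable, Dict


def wait_until_cooked(
    poll: Callable[[], Dict],
    interval: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a vault status endpoint until the bundle is ready.

    `poll` is a hypothetical callable returning the decoded JSON status,
    for example: lambda: requests.get(vault_url).json()
    """
    for _ in range(max_polls):
        status = poll()
        if status["status"] == "done":
            return status["fetch_url"]
        if status["status"] == "failed":  # assumed terminal status
            raise RuntimeError(f"cooking failed: {status}")
        sleep(interval)
    raise TimeoutError(f"bundle not ready after {max_polls} polls")


if __name__ == "__main__":
    # Fake status sequence standing in for real API responses.
    responses = iter(
        [
            {"status": "new"},
            {"status": "pending"},
            {"status": "done", "fetch_url": "https://example.invalid/bundle.tar"},
        ]
    )
    print(wait_until_cooked(lambda: next(responses), sleep=lambda s: None))
```

Passing `poll` and `sleep` in as callables keeps the loop testable without touching the network, and the poll limit avoids waiting forever if a bundle never finishes cooking.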

dmitmel commented on June 1, 2024 1

By the way, I want to mention that I have recreated the wikis of zdharma/zui and zdharma/declare-zsh from what was in the Web Archive: https://gist.github.com/dmitmel/895d65776f4d61bc74ef21454f221dc2 and https://gist.github.com/dmitmel/6af23be49547024e55a34a13cae99e16

pschmitt commented on June 1, 2024 1

> By the way, I want to mention that I have recreated the wikis of zdharma/zui and zdharma/declare-zsh from what was in the Web Archive: gist.github.com/dmitmel/895d65776f4d61bc74ef21454f221dc2 and gist.github.com/dmitmel/6af23be49547024e55a34a13cae99e16

Sweet.
I restored declare-zsh and its wiki: https://github.com/zdharma-continuum/declare-zsh

Unfortunately, I don't have access to the wiki section of https://github.com/zdharma-continuum/zui
@alichtman Can you restore it or grant me access?

alichtman commented on June 1, 2024 1

@pschmitt Upgraded you to write access. Let me know if that does the trick for you.

NorthIsUp commented on June 1, 2024

(updated the script to fetch the git tars, teamwork makes the dream work!)

pschmitt commented on June 1, 2024

The script works great, I've used it to fetch the commits for zdharma-continuum/zsh-package-any-gem, zdharma-continuum/zsh-package-any-node and zdharma-continuum/zsh-package-firefox-dev.
Great stuff!

pschmitt commented on June 1, 2024

> @pschmitt Upgraded you to write access. Let me know if that does the trick for you.

That worked.

@dmitmel
ZUI wiki restored: https://github.com/zdharma-continuum/zui/wiki

pschmitt commented on June 1, 2024

Here's the spaghetti code I used to migrate repos btw:

repo="$1"

if [[ -z "$repo" ]]
then
  echo "Missing repo name. Please source again. source $0 ORG/REPO" >&2
  return 2
fi

mkdir -p data && \
python fetch.py "$repo" && mv ~/Downloads/${repo//\//---}.git.tar data && \
  cd ./data && { unsetopt nomatch; rm -rf swh* repo.git && setopt nomatch } && \
  tar xf ${repo//\//---}.git.tar && \
  git clone swh* repo.git && \
  cd repo.git && \
  git remote remove origin && \
  git checkout -b main && \
  gh repo create --public -y zdharma-continuum/$(basename $repo) && \
  git push origin main

Usage: source migrate.zsh ORG/REPO

Edit: @NorthIsUp I had to edit the regex in your script to make this work for some of the repos I migrated

assert re.match(r".+/.+", repo), "repo must be of format ORG/NAME"
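
A small check shows why the original pattern needed relaxing. `\w` matches only letters, digits and underscores, so an organization name containing a hyphen stops the match before the slash. The `accepts` helper is made up for this demonstration; the repository names are ones mentioned in this thread.

```python
import re

OLD_PATTERN = r"\w+/\w+"  # pattern from the original script
NEW_PATTERN = r".+/.+"    # relaxed pattern from the fix above


def accepts(pattern: str, repo: str) -> bool:
    """Return True if the script's assert would pass for this repo name."""
    return re.match(pattern, repo) is not None


if __name__ == "__main__":
    for repo in ["zdharma/zui", "zdharma/declare-zsh", "zdharma-continuum/zui"]:
        print(repo, accepts(OLD_PATTERN, repo), accepts(NEW_PATTERN, repo))
```

Because `re.match` only anchors at the start of the string, a hyphen in the repository part (as in `zdharma/declare-zsh`) slips through the old pattern. Only a hyphen or dot in the organization part trips it.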

dmitmel commented on June 1, 2024

@pschmitt Repos might contain other branches besides main. I think, to handle that you need to remove the git checkout -b main, and use git push origin --all (git push origin --tags after that won't hurt).
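
The suggestion above, sketched in Python to match the fetch script. `push_commands`, `push_everything` and the `dry_run` flag are hypothetical names. The sketch assumes `git` is on the PATH and that an `origin` remote pointing at the new repository is already configured.

```python
import subprocess
from typing import List


def push_commands(remote: str = "origin") -> List[List[str]]:
    """Commands that publish every local branch, then every tag."""
    return [
        ["git", "push", remote, "--all"],
        ["git", "push", remote, "--tags"],
    ]


def push_everything(repo_dir: str, remote: str = "origin", dry_run: bool = True) -> None:
    """Print the push commands and, unless dry_run is set, run them in repo_dir."""
    for cmd in push_commands(remote):
        print("+", " ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing push.
            subprocess.run(cmd, cwd=repo_dir, check=True)


if __name__ == "__main__":
    push_everything("repo.git")  # dry run: only prints the commands
```

Note that `--all` pushes only refs under `refs/heads/`, so this is probably best run inside the extracted bare repository rather than in a fresh clone of it, where only the checked-out branch exists locally.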

pschmitt commented on June 1, 2024

Right, I blanked on that one. Damn.

pschmitt commented on June 1, 2024

Hm. I've re-downloaded a few (30+) zdharma repos from softwareheritage.org and it looks like softwareheritage.org only archives the main/master branch...

dmitmel commented on June 1, 2024

Not true: https://archive.softwareheritage.org/browse/origin/branches/?origin_url=https://github.com/neovim/neovim

pschmitt commented on June 1, 2024

> Not true: archive.softwareheritage.org/browse/origin/branches/?origin_url=https://github.com/neovim/neovim

My bad. I guess I was only (un)lucky with my picks then. Thanks for letting me know.

pschmitt commented on June 1, 2024

Now I understand why I only got a single branch from my archives:

    head_url = fetch(snapshot_url, "branches", "HEAD", "target_url")

Will try to improve the script so that it fetches all branches.

Here it is:

#!/usr/bin/env python3
import json
import re
from http.client import TOO_MANY_REQUESTS
from os.path import basename
from pathlib import Path
from time import sleep, time
from typing import List, Literal, TypeVar, Union

import click
import requests
from rich import print
from rich import print_json as _print_json
from rich import traceback

traceback.install()

T = TypeVar("T")


def inspect(any: T) -> T:
    print(any)
    return any


def inspect_json(any: T) -> T:
    _print_json(any)
    return any


def jprint(obj):
    # create a formatted string of the Python JSON object
    print(json.dumps(obj, sort_keys=True, indent=4))


def fetch(
    url: str,
    *path: Union[str, int],
    method: Literal["GET", "POST"] = "GET",
    pretty: bool = True,
    backoff: int = 0,
):
    print(f"===> fetching {method} {url}")
    resp = requests.request(method, url)
    if resp.status_code == TOO_MANY_REQUESTS:
        sleep_until = int(resp.headers["X-RateLimit-Reset"])
        sleep_for = int(sleep_until - time()) + 1
        print(f"backing off until {sleep_until} (i.e. {sleep_for} seconds)")
        sleep(sleep_for)
        return fetch(url, *path, method=method, pretty=pretty, backoff=backoff + 1)

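    resp.raise_for_status()
    data = resp.json()  # named `data` so the imported json module is not shadowed
    if pretty:
        inspect(data)

    # Walk down the decoded JSON along the given keys/indices.
    for p in path:
        data = data[p]

    return data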


def fetch_repo(repo: str, no_wait: bool = False):
    # Cook and download one git-bare tarball per non-alias branch of the archived snapshot.

    assert re.match(r".+/.+", repo), "repo must be of format ORG/NAME"
    visits_url = f"https://archive.softwareheritage.org/api/1/origin/https://github.com/{repo}/visits/"

    snapshot_url = fetch(visits_url, 0, "snapshot_url")
    branches = fetch(snapshot_url, "branches")
    for branch, branch_data in branches.items():
        print(f"Processing branch {branch}")
        if branch_data.get("target_type") == "alias":
            print(f"SKIP alias branch {branch}")
            continue
        branch_url = fetch(snapshot_url, "branches", branch, "target_url")
        directory_id = fetch(branch_url, "id")

        # target_url = fetch(directory_url, 0, "target_url")
        vault_url = f"https://archive.softwareheritage.org/api/1/vault/git-bare/swh:1:rev:{directory_id}/"

        destination = Path(
            f'~/Downloads/{repo.replace("/", "---")}---{basename(branch)}.git.tar'
        ).expanduser()
        if destination.exists():
            print(f"skipping {repo} branch {branch}, already fetched")
            continue  # move on to the next branch instead of leaving the function

        meal_json = fetch(vault_url, method="POST")

        if no_wait:
            print("not waiting")
            continue  # still request cooking for the remaining branches

        while meal_json["status"] != "done":
            meal_json = fetch(vault_url, pretty=False)
            print(meal_json)
            print("Sleeping for 30s")
            sleep(30)
        else:
            print("downloading!")
            resp = requests.get(meal_json["fetch_url"])
            print(f"saving at {destination}")
            print(resp.headers)
            destination.write_bytes(resp.content)


@click.command()
@click.argument("repos", nargs=-1)
@click.option("--no-wait/--wait")
def main(repos: List[str], no_wait: bool):
    for repo in repos:
        try:
            fetch_repo(repo, no_wait=no_wait)
        except Exception as e:
            print(e)


if __name__ == "__main__":
    main()

pschmitt commented on June 1, 2024

Update: I've added the missing branches, for some of the repos at least. If there's anything missing please ping me to add it.

Also I pushed some of the original tags - well I had to re-create them manually, didn't see another way.

alichtman commented on June 1, 2024

I believe we're done with the migration. Closing.
