Comments (19)
I wrote this to fetch the repos; unfortunately, it only downloads shallow clones.
```python
#!/usr/bin/env python3
import json
import re
from http.client import TOO_MANY_REQUESTS
from pathlib import Path
from time import sleep, time
from typing import List, Literal, TypeVar, Union

import click
import requests
from rich import print
from rich import print_json as _print_json
from rich import traceback

traceback.install()

T = TypeVar("T")


# Utilities to pretty-print values and JSON.
def inspect(any: T) -> T:
    print(any)
    return any


def inspect_json(any: T) -> T:
    _print_json(any)
    return any


def jprint(obj):
    # create a formatted string of the Python JSON object
    print(json.dumps(obj, sort_keys=True, indent=4))


def fetch(
    url: str,
    *path: Union[str, int],
    method: Literal["GET", "POST"] = "GET",
    pretty: bool = True,
    backoff: int = 0,
):
    """Request `url`, then walk the decoded JSON along the keys/indices in `path`."""
    print(f"===> fetching {method} {url}")
    resp = requests.request(method, url)
    if resp.status_code == TOO_MANY_REQUESTS:
        # Rate limited: sleep until the reset time given in the headers, then retry.
        sleep_until = int(resp.headers["X-RateLimit-Reset"])
        sleep_for = int(sleep_until - time()) + 1
        print(f"backing off until {sleep_until} (i.e. {sleep_for} seconds)")
        sleep(sleep_for)
        return fetch(url, *path, method=method, pretty=pretty, backoff=backoff + 1)
    resp.raise_for_status()
    data = resp.json()  # named `data` so the `json` module is not shadowed
    if pretty:
        inspect(data)
    for p in path:
        data = data[p]
    return data


def fetch_repo(repo: str, no_wait: bool = False):
    destination = Path(f'~/Downloads/{repo.replace("/", "---")}.git.tar').expanduser()
    # if destination.exists():
    #     print(f"skipping {repo}, already fetched")
    #     return
    assert re.match(r"\w+/\w+", repo), "repo must be of format ORG/NAME"

    # Walk the API: first listed visit -> snapshot -> HEAD branch -> target ID.
    visits_url = f"https://archive.softwareheritage.org/api/1/origin/https://github.com/{repo}/visits/"
    snapshot_url = fetch(visits_url, 0, "snapshot_url")
    head_url = fetch(snapshot_url, "branches", "HEAD", "target_url")
    directory_id = fetch(head_url, "id")  # used below as the revision ID
    # target_url = fetch(directory_url, 0, "target_url")

    # Ask the vault to cook a git-bare bundle for that revision.
    vault_url = (
        f"https://archive.softwareheritage.org/api/1/vault/git-bare/swh:1:rev:{directory_id}/"
    )
    meal_json = fetch(vault_url, method="POST")
    if no_wait:
        print("not waiting")
        return meal_json

    # Poll until the bundle is ready, then download it.
    while meal_json["status"] != "done":
        meal_json = fetch(vault_url, pretty=False)
        print(meal_json)
        sleep(60)
    else:
        print("downloading!")
        resp = requests.get(meal_json["fetch_url"])
        print(f"saving at {destination}")
        print(resp.headers)
        destination.write_bytes(resp.content)


@click.command()
@click.argument("repos", nargs=-1)
@click.option("--no-wait/--wait")
def main(repos: List[str], no_wait: bool):
    for repo in repos:
        try:
            fetch_repo(repo, no_wait=no_wait)
        except Exception as e:
            print(e)


if __name__ == "__main__":
    main()
```
from i_want_to_help.
ok, here we go, should be basically all of it: https://github.com/northismirror/
Here's a list of useful urls to look at:
Edit 1:
I figured out how to clone a full Git repository from their archive. You need to send a POST request to https://archive.softwareheritage.org/api/1/vault/git-bare/swh:1:rev:<REVID>/ and wait for it to finish cooking (polling every second is too frequent, to be honest). Then you will be able to download a tar archive containing a bare Git repository. The repository can be "unpacked" by running `git clone repo-bare.git repo-unpacked`, but honestly, if all you need is to set the remote and push, those commands can be run from inside the bare repository.
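The cook-then-poll flow described above can be sketched roughly as follows. This is only an illustration, not the script used in this thread: the names `vault_url`, `http_json` and `cook_and_wait` are hypothetical, the `status` and `fetch_url` fields are taken from the scripts posted here, and the `"failed"` status value, the 30-second interval and the poll limit are assumptions.

```python
import json
import time
import urllib.request
from typing import Callable, Dict

VAULT_API = "https://archive.softwareheritage.org/api/1/vault/git-bare"


def vault_url(rev_id: str) -> str:
    """Build the vault URL for a revision, following the URL pattern quoted above."""
    return f"{VAULT_API}/swh:1:rev:{rev_id}/"


def http_json(url: str, method: str = "GET") -> Dict:
    """Send one request and decode the JSON response (no rate-limit handling)."""
    request = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def cook_and_wait(
    rev_id: str,
    request: Callable[[str, str], Dict] = http_json,
    sleep: Callable[[float], None] = time.sleep,
    interval: float = 30.0,
    max_polls: int = 120,
) -> str:
    """Ask for a git-bare bundle, poll until it is cooked, return its download URL."""
    url = vault_url(rev_id)
    status = request(url, "POST")  # POST asks the vault to start cooking
    polls = 0
    while status["status"] != "done":
        if status["status"] == "failed":  # assumed failure value
            raise RuntimeError(f"cooking failed for revision {rev_id}")
        if polls >= max_polls:
            raise TimeoutError(f"bundle for {rev_id} not ready after {polls} polls")
        sleep(interval)  # be gentle: polling every second is too frequent
        status = request(url, "GET")  # GET only reads the current status
        polls += 1
    return status["fetch_url"]
```

Passing `request` and `sleep` as parameters keeps the loop easy to try out with fake responses before pointing it at the real archive.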
By the way, I want to mention that I have recreated the wikis of zdharma/zui and zdharma/declare-zsh from what was in the Web Archive: https://gist.github.com/dmitmel/895d65776f4d61bc74ef21454f221dc2 and https://gist.github.com/dmitmel/6af23be49547024e55a34a13cae99e16
> By the way, I want to mention that I have recreated the wikis of zdharma/zui and zdharma/declare-zsh from what was in the Web Archive: gist.github.com/dmitmel/895d65776f4d61bc74ef21454f221dc2 and gist.github.com/dmitmel/6af23be49547024e55a34a13cae99e16
Sweet.
I restored declare-zsh and its wiki: https://github.com/zdharma-continuum/declare-zsh
Unfortunately, I don't have access to the wiki section of https://github.com/zdharma-continuum/zui
@alichtman Can you restore it or grant me access?
@pschmitt Upgraded you to write access. Let me know if that does the trick for you.
(Updated the script to fetch the git tars; teamwork makes the dream work!)
The script works great, I've used it to fetch the commits for zdharma-continuum/zsh-package-any-gem, zdharma-continuum/zsh-package-any-node and zdharma-continuum/zsh-package-firefox-dev.
Great stuff!
> @pschmitt Upgraded you to write access. Let me know if that does the trick for you.
That worked.
@dmitmel ZUI wiki restored: https://github.com/zdharma-continuum/zui/wiki
Here's the spaghetti code I used to migrate repos btw:
```zsh
repo="$1"
if [[ -z "$repo" ]]
then
  echo "Missing repo name. Please source again. source $0 ORG/REPO" >&2
  return 2
fi

mkdir -p data && \
python fetch.py "$repo" && mv ~/Downloads/${repo//\//---}.git.tar data && \
cd ./data && { unsetopt nomatch; rm -rf swh* repo.git && setopt nomatch } && \
tar xf ${repo//\//---}.git.tar && \
git clone swh* repo.git && \
cd repo.git && \
git remote remove origin && \
git checkout -b main && \
gh repo create --public -y zdharma-continuum/$(basename $repo) && \
git push origin main
```
Usage: `source migrate.zsh ORG/REPO`
Edit: @NorthIsUp I had to edit the regex in your script to make this work for some of the repos I migrated
assert re.match(r".+/.+", repo), "repo must be of format ORG/NAME"
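To illustrate why the original pattern needed editing: `\w` does not match a hyphen, so `\w+/\w+` rejects names such as `zdharma-continuum/zui`. The sketch below compares the two patterns from this thread with a stricter alternative; `STRICT` and `is_repo_slug` are hypothetical names introduced here for illustration, not something the script uses.

```python
import re

ORIGINAL = re.compile(r"\w+/\w+")        # pattern from the first script
RELAXED = re.compile(r".+/.+")           # pattern from the edit above
STRICT = re.compile(r"[\w.-]+/[\w.-]+")  # hypothetical stricter alternative


def is_repo_slug(repo: str) -> bool:
    """Return True if `repo` looks like ORG/NAME with exactly one slash."""
    return STRICT.fullmatch(repo) is not None


# The hyphen in the org name stops `\w+` before it reaches the slash.
print(ORIGINAL.match("zdharma-continuum/zui"))  # no match
# The relaxed pattern accepts it, but also accepts almost anything with a slash.
print(RELAXED.match("zdharma-continuum/zui") is not None)
print(RELAXED.match("not a repo / at all") is not None)
# The stricter variant allows hyphens and dots while still rejecting stray input.
print(is_repo_slug("zdharma-continuum/zsh-package-any-gem"))
print(is_repo_slug("not a repo / at all"))
```

The relaxed `.+/.+` check is fine for a one-off migration; the stricter variant only matters if the input is less trusted.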
@pschmitt Repos might contain other branches besides `main`. I think, to handle that, you need to remove the `git checkout -b main` and use `git push origin --all` (`git push origin --tags` after that won't hurt).
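The suggestion above could be scripted along these lines. This is a minimal sketch under stated assumptions: `push_commands` and `push_everything` are hypothetical helpers, the remote URL is supplied by the caller, and no remote named `origin` is assumed to be configured yet in the repository.

```python
import subprocess
from typing import Callable, List


def push_commands(remote_url: str, remote: str = "origin") -> List[List[str]]:
    """Return the git commands that publish every branch and tag to `remote_url`."""
    return [
        ["git", "remote", "add", remote, remote_url],
        ["git", "push", remote, "--all"],   # all branches, not just main
        ["git", "push", remote, "--tags"],  # tags are not covered by --all
    ]


def push_everything(
    repo_dir: str,
    remote_url: str,
    run: Callable[..., object] = subprocess.run,
) -> None:
    """Run the commands inside `repo_dir`, stopping at the first failure."""
    for command in push_commands(remote_url):
        run(command, cwd=repo_dir, check=True)
```

Since the commands only need a Git directory, `repo_dir` can be the extracted bare repository itself, as noted earlier in the thread.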
Right, I blanked on that one. Damn.
Hm. I've re-downloaded a few (30+) zdharma repos from softwareheritage.org and it looks like softwareheritage.org only archives the main/master branch...
> Not true: archive.softwareheritage.org/browse/origin/branches/?origin_url=https://github.com/neovim/neovim

My bad. I guess I was just (un)lucky with my picks then. Thanks for letting me know.
Now I understand why I only got a single branch from my archives:
head_url = fetch(snapshot_url, "branches", "HEAD", "target_url")
Will try to improve the script so that it fetches all branches.
Here it is:
```python
#!/usr/bin/env python3
import json
import re
from http.client import TOO_MANY_REQUESTS
from os.path import basename
from pathlib import Path
from time import sleep, time
from typing import List, Literal, TypeVar, Union

import click
import requests
from rich import print
from rich import print_json as _print_json
from rich import traceback

traceback.install()

T = TypeVar("T")


# Utilities to pretty-print values and JSON.
def inspect(any: T) -> T:
    print(any)
    return any


def inspect_json(any: T) -> T:
    _print_json(any)
    return any


def jprint(obj):
    # create a formatted string of the Python JSON object
    print(json.dumps(obj, sort_keys=True, indent=4))


def fetch(
    url: str,
    *path: Union[str, int],
    method: Literal["GET", "POST"] = "GET",
    pretty: bool = True,
    backoff: int = 0,
):
    """Request `url`, then walk the decoded JSON along the keys/indices in `path`."""
    print(f"===> fetching {method} {url}")
    resp = requests.request(method, url)
    if resp.status_code == TOO_MANY_REQUESTS:
        # Rate limited: sleep until the reset time given in the headers, then retry.
        sleep_until = int(resp.headers["X-RateLimit-Reset"])
        sleep_for = int(sleep_until - time()) + 1
        print(f"backing off until {sleep_until} (i.e. {sleep_for} seconds)")
        sleep(sleep_for)
        return fetch(url, *path, method=method, pretty=pretty, backoff=backoff + 1)
    resp.raise_for_status()
    data = resp.json()  # named `data` so the `json` module is not shadowed
    if pretty:
        inspect(data)
    for p in path:
        data = data[p]
    return data


def fetch_repo(repo: str, no_wait: bool = False):
    assert re.match(r".+/.+", repo), "repo must be of format ORG/NAME"
    visits_url = f"https://archive.softwareheritage.org/api/1/origin/https://github.com/{repo}/visits/"
    snapshot_url = fetch(visits_url, 0, "snapshot_url")
    branches = fetch(snapshot_url, "branches")

    # Cook and download one git-bare bundle per branch in the snapshot.
    for branch, branch_data in branches.items():
        print(f"Processing branch {branch}")
        if branch_data.get("target_type") == "alias":
            print(f"SKIP alias branch {branch}")
            continue

        branch_url = fetch(snapshot_url, "branches", branch, "target_url")
        directory_id = fetch(branch_url, "id")  # used below as the revision ID
        # target_url = fetch(directory_url, 0, "target_url")
        vault_url = f"https://archive.softwareheritage.org/api/1/vault/git-bare/swh:1:rev:{directory_id}/"

        destination = Path(
            f'~/Downloads/{repo.replace("/", "---")}---{basename(branch)}.git.tar'
        ).expanduser()
        if destination.exists():
            print(f"skipping {repo} ({branch}), already fetched")
            continue  # was `return`, which skipped all remaining branches

        meal_json = fetch(vault_url, method="POST")
        if no_wait:
            print("not waiting")
            continue  # was `return`, which left later branches unrequested

        # Poll until the bundle is ready, then download it.
        while meal_json["status"] != "done":
            meal_json = fetch(vault_url, pretty=False)
            print(meal_json)
            print("Sleeping for 30s")
            sleep(30)
        else:
            print("downloading!")
            resp = requests.get(meal_json["fetch_url"])
            print(f"saving at {destination}")
            print(resp.headers)
            destination.write_bytes(resp.content)


@click.command()
@click.argument("repos", nargs=-1)
@click.option("--no-wait/--wait")
def main(repos: List[str], no_wait: bool):
    for repo in repos:
        try:
            fetch_repo(repo, no_wait=no_wait)
        except Exception as e:
            print(e)


if __name__ == "__main__":
    main()
```
Update: I've added the missing branches, for some of the repos at least. If anything is missing, please ping me and I'll add it.
I also pushed some of the original tags. Well, I had to re-create them manually; I didn't see another way.
I believe we're done with the migration. Closing.