Comments (17)

joeyh commented on June 1, 2024

At least some of this sounds like it would be best handled by building an external special remote. http://git-annex.branchable.com/design/external_special_remote_protocol/

That seems more likely to handle some of the unusual use cases, rather than trying to shoe-horn everything through URL handling.

from datalad.

yarikoptic commented on June 1, 2024

On Tue, 02 Dec 2014, Joey Hess wrote:

> At least some of this sounds like it would be best handled by building an
> external special remote.
> http://git-annex.branchable.com/design/external_special_remote_protocol/
> That seems more likely to handle some of the unusual use cases, rather
> than trying to shoe-horn everything through URL handling.

yes indeed, thanks for the reminder about this feature! And it sounds
like there could also be a middle case where we could provide a local
adapter serving as an "external special remote", e.g. for datasets
already served from S3 buckets with versioning enabled. There, ETags
correspond to the MD5 sums of the content, so if we keep a table of
translations from the underlying annex backend (e.g. SHA256) to MD5
sums (hoping for no collisions, or adding size as an additional
check), then such an adapter could request content from the S3
buckets. But I wonder whether that would be beneficial over e.g.
populating files with URLs pointing to the HTTP frontend of S3, where
we could request a specific version of a file anyway?
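A minimal sketch of the SHA256-to-MD5 translation table described above (all names here are hypothetical, and it assumes single-part S3 uploads, where the ETag equals the object's MD5):

```python
import hashlib

# Hypothetical lookup table from the annexed SHA256 key to the
# (md5, size) pair a versioned S3 bucket exposes via its ETag.
TABLE = {}

def register(content: bytes) -> str:
    """Record the SHA256 -> (md5, size) translation for some content."""
    sha = hashlib.sha256(content).hexdigest()
    TABLE[sha] = (hashlib.md5(content).hexdigest(), len(content))
    return sha

def matches_etag(sha: str, etag: str, size: int) -> bool:
    """Check a candidate S3 object against the table; the size acts as
    an additional guard against (unlikely) md5 collisions."""
    return TABLE.get(sha) == (etag, size)
```

An adapter built this way would consult the table when asked for a SHA256 key, then fetch whichever S3 object version carries the matching ETag.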

Yaroslav O. Halchenko, Ph.D.
http://neuro.debian.net http://www.pymvpa.org http://www.fail2ban.org
Research Scientist, Psychological and Brain Sciences Dept.
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834 Fax: +1 (603) 646-1419
WWW: http://www.linkedin.com/in/yarik

joeyh commented on June 1, 2024

Yaroslav Halchenko wrote:

> yes indeed, thanks for the reminder about this feature! And it sounds
> like there could also be a middle case where we could provide a local
> adapter serving as an "external special remote", e.g. for datasets
> already served from S3 buckets with versioning enabled. There, ETags
> correspond to the MD5 sums of the content, so if we keep a table of
> translations from the underlying annex backend (e.g. SHA256) to MD5
> sums (hoping for no collisions, or adding size as an additional
> check), then such an adapter could request content from the S3
> buckets. But I wonder whether that would be beneficial over e.g.
> populating files with URLs pointing to the HTTP frontend of S3, where
> we could request a specific version of a file anyway?

The S3 case seems just as well handled by git annex addurl, with an
URL pointing at the right version of the file in the S3 bucket.

External special remotes make sense for cases like extraction from
locally present archives. I think it also makes sense for cases where
a data source requires some credentials to use.

As for extending git-annex's url handling to support eg, .torrent or
other pluggable extensions, I think there is room for a simple pluggable
interface there. Indeed, git-annex already has some basics toward one.

What git-annex already has is that an url can be prefixed with
"quvi:", and then git-annex knows to use quvi for that url. git-annex addurl
checks if the url is one quvi supports, and if so, records the url as
"quvi:url". This way, git-annex always consistently downloads that url
with quvi (or fails if quvi is not installed).

That could be extended, by having some way to register a downloader.
For example:

git config annex.downloader.torrent.command 'aria2c %url $file'
git config annex.downloader.torrent.regexp '(^magnet:|\.torrent$)'

Then git-annex addurl would check if the url matches the regexp, and
if so, record the url as (in the above example) "torrent:$url".

Then when downloading "torrent:$url", git-annex would look for the
appropriate annex.downloader.torrent.command and run it.
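The proposed lookup could be simulated like this (a toy sketch, not git-annex's actual implementation; the DOWNLOADERS dict stands in for the hypothetical annex.downloader.* config keys):

```python
import re

# Toy registry mirroring the proposed git config keys
# annex.downloader.<name>.regexp and .command (both hypothetical).
DOWNLOADERS = {
    "torrent": {
        "regexp": r"(^magnet:|\.torrent$)",
        "command": "aria2c %url $file",
    },
}

def record_url(url: str) -> str:
    """Return the url as addurl would record it: prefixed with the
    downloader's name when its regexp matches, otherwise unchanged."""
    for name, cfg in DOWNLOADERS.items():
        if re.search(cfg["regexp"], url):
            return f"{name}:{url}"
    return url  # plain url: left to the built-in web remote
```

A later download of a "torrent:"-prefixed url would then look up annex.downloader.torrent.command and run it, substituting the url and destination file.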

(Torrent files are actually a difficult case, since an individual torrent
can contain multiple files, and git-annex expects to get exactly one
file when it downloads an url. So a torrent downloader would need to be
used with only single-file torrents, or somehow pick out the file that
is wanted from a multi-file torrent. Perhaps specified by appending
something like '?wanted-file=foo' to the torrent's url.)

see shy jo

joeyh commented on June 1, 2024

Hmm, for magnet: torrents, it would also need a separate command to check if the torrent is still available (to the extent that can be checked at all for torrent swarms where peers come and go). Defaulting to just checking if the url is available seems reasonable in general though.

joeyh commented on June 1, 2024

Hmm, it kind of seems that torrents are sufficiently complicated that they would be better implemented inside git-annex (running aria2c with the correct options) than as a plug-in downloader.

So, some examples of other sorts of urls or uris that could be handled by this proposed downloader interface would be useful, so I can check if I'm being sufficiently general and know it would really be useful.

joeyh commented on June 1, 2024

Looking at http://datalad.org/datalad_crawl_design.html ...

The dl:cmd: AIUI, the idea is to download some url (or perhaps
git annex get something), filter its contents in some way with the cmd
(uncompress it, for example), and that results in the annexed object.
dl:extract: is similar.

My design above could be used for this. But let's consider implementing it using an external special remote instead. An external special remote can record its own persistent state in the git-annex branch about a git-annex key, so it could record whatever info is needed to download the file, filter or extract it, etc.

But there is a missing piece: While git annex addurl can inject an existing url into the repo, using the web as a remote, there's no way to tell git-annex to inject something that already exists on an external special remote into the repo.

So, one way would be: git annex addremote CERN $uri

That would need an extension of the external special remote protocol, so that git-annex can ask the CERN external special remote to add $uri, including downloading the data from it, and recording the $uri (or whatever it needs to record) in the git-annex branch.

Doable, I suppose. But there is a nice elegance in just using git annex addurl CERN:uri,
and then treating that as a downloader as I posted 2 comments above.

But, I see in https://github.com/datalad/datalad/issues?q=is%3Aopen+is%3Aissue+label%3Anew-dataprovider-platform that some of the data sources have things like custom protocols with authentication, that are much better handled by external special remotes. (git-annex can keep an external special remote program running and reuse it for several downloads of files, so it doesn't need to re-authenticate each time.)

Hmm, another option would be to make the external special remote be used as the downloader. So, when git-annex wanted to get an url from CERN:uri, it would look for an external special remote named "CERN" and use it. Instead of the simple git configuration I sketched out above. Perhaps this is the best approach; it lets external special remotes be used in all their glory, while keeping the addurl interface as-is.

yarikoptic commented on June 1, 2024

N.B. email reply quotation leveling seems to be ruined, so my responses need to be dug out from the greyed-out text

> My design above could be used for this. But let's consider implementing it
> using an external special remote instead. An external special remote can
> record its own persistent state in the git-annex branch about a git-annex
> key, so it could record whatever info is needed to download the file,
> filter or extract it, etc.

yeap -- I was reading up on this as we speak -- you mean the aaa/bbb/*.log.rmt files, right?

I see one little inconvenience in having only a single line per key. I
expect some files to become available from multiple (versions of) a
tarball, which might require multiple entries. Sure, those could be fed
into a single line, but it might become a bit long and not easily
"mergeable" should additions be done independently (unlikely but
possible). Or did I misunderstand, and it could be extended to multiple
lines/entries?

> But there is a missing piece: While git annex addurl can inject an
> existing url into the repo, using the web as a remote, there's no way to
> tell git-annex to inject something that already exists on an external
> special remote into the repo.
>
> So, one way would be: git annex addremote CERN $uri
>
> That would need an extension of the external special remote protocol, so
> that git-annex can ask the CERN external special remote to add $uri,
> including downloading the data from it, and recording the $uri (or
> whatever it needs to record) in the git-annex branch.

yeap -- I also foresaw the need to extend the interface so that an
external special remote (ESR) could request that some key be 'annex
get'ed first. An interesting case (which I haven't encountered yet
among our targets) would be a nested archive. In that case the same ESR
might get another request while it is awaiting a response from annex to
its own request... So it might be something to keep in mind while
crafting such an ESR

> Doable, I suppose. But there is a nice elegance in just using git annex
> addurl CERN:uri,
> and then treating that as a downloader as I posted 2 comments above.

yeap.

> But, I see in
> https://github.com/datalad/datalad/issues?q=is%3Aopen+is%3Aissue+label%3Anew-dataprovider-platform
> that some of the data sources have things like custom protocols with
> authentication, that are much better handled by external special remotes.
> (git-annex can keep an external special remote program running and reuse
> it for several downloads of files, so it doesn't need to re-authenticate
> each time.)

concur too

> Hmm, another option would be to make the external special remote be used
> as the downloader. So, when git-annex wanted to get an url from CERN:uri,
> it would look for an external special remote named "CERN" and use it.
> Instead of the simple git configuration I sketched out above. Perhaps this
> is the best approach; it lets external special remotes be used in all
> their glory, while keeping the addurl interface as-is.

yeap -- such a possibility crossed my mind too yesterday... to
summarize: we can potentially build any custom downloader necessary for
our needs, interfacing it through an ESR or a collection of those, with
their metadata stored in the git-annex branch. The only catch
might be the necessary extension of the ESR API as discussed above.
I guess I will give it a try first with a simple "tarballs" ESR

joeyh commented on June 1, 2024

http://git-annex.branchable.com/todo/extensible_addurl/

yarikoptic commented on June 1, 2024

On Wed, 03 Dec 2014, Joey Hess wrote:

> That could be extended, by having some way to register a downloader.
> For example:
>
> git config annex.downloader.torrent.command 'aria2c %url $file'
> git config annex.downloader.torrent.regexp '(^magnet:|\.torrent$)'

it would also need at least a custom command for 'checking'
whether a link is still alive, right? (I remember trying a custom
downloader command with some obscure url, but it didn't work since wget
was still used for testing the link)

related question: is there a way to grow git/annex configuration by
simply dropping additional files under some directory? i.e. instead of
running the commands above, just having a file like

[annex "downloader.torrent"]
command = aria2c %url $file
...

dropped e.g. under /etc/gitconfig.d?


joeyh commented on June 1, 2024

> I see one little inconvenience of having only a single line for a key.

That is indeed a problem with the external special remote's SETSTATE/GETSTATE. And it's not very easily solved, since we don't really know what kind of data an external special remote might store.

I think my revised design avoids that problem, because git-annex addurl would record the downloader:uri in git-annex's url storage. The external special remote could then just be passed the uri to download, and would not need to do any state storage of its own.

joeyh commented on June 1, 2024

I don't think that git-config supports .d directories, but it's certainly possible to set global git config settings in /etc/gitconfig or ~/.gitconfig

yarikoptic commented on June 1, 2024

You know how much we in Debian would love one package (e.g. datalad) modifying the /etc content of another package (e.g. git), should we decide to extend the configuration "automagically" ;-)

joeyh commented on June 1, 2024

Today I implemented the necessary support in git-annex.

So, to summarize what an external special remote needs to do, let's consider an external special remote that handles torrents.

  1. When it receives "CLAIMURL magnet:*" or "CLAIMURL *.torrent", it should respond with CLAIMURL-SUCCESS. (And respond with CLAIMURL-FAILURE for any urls it cannot handle.)
  2. When it receives "CHECKURL $url", it should check actively that the url still works. If not, it should respond "CHECKURL-FAILURE message". If it's possible to do so inexpensively, it should get the size in bytes of the content, and respond "CHECKURL-SIZE $size". If getting the size is not practical, it should respond "CHECKURL-SIZEUNKNOWN"
  3. When it receives "TRANSFER RETRIEVE $key $file", it needs to download the content of the torrent file. To do so, it needs to find out the torrent url. So, it can send "GETURLS $key". git-annex will respond to that with "VALUE $url" (repeated once per url), followed by "VALUE " to indicate the end of the list. Since this is a torrent url handler, it will want to look for urls that are magnet:* or *.torrent. Then it will perform the torrent download, sending progress messages if possible, and once complete, send "TRANSFER-SUCCESS RETRIEVE $key"

Implementation of special remotes for e.g. CERN would be similar. Except in that case,
you might want to have the user run "git annex addurl CERN:$url", and then CLAIMURL can just look for the "CERN:" prefix and know it's supposed to handle this url. Also, when it needs to look up the url for a key, it can use "GETURLS $key CERN:" to get back only the CERN:-prefixed urls.

yarikoptic commented on June 1, 2024

Sounds great! Thank you Joey - I will give it a shot!

joeyh commented on June 1, 2024

This example is probably a good starting place for writing other custom
downloaders.

http://git-annex.branchable.com/special_remotes/external/git-annex-remote-torrent

see shy jo

yarikoptic commented on June 1, 2024

On Thu, 11 Dec 2014, Joey Hess wrote:

> This example is probably a good starting place for writing other custom
> downloaders.
> http://git-annex.branchable.com/special_remotes/external/git-annex-remote-torrent

Thank you Joey!

joeyh commented on June 1, 2024

I think we can close this one, git-annex has had the changes for some time, and afaik datalad is using them.
