Comments (5)
I am not sure what change this needs on the git-annex side. You can already run multiple `git annex get` commands in parallel.
from datalad.
Sorry for not being clear. The idea is not to request downloads in parallel independently (e.g. parallel runs of wget requesting different files), but rather to provide the full list of files to be 'get'ed at once, which would then nicely serve requests like 'get files X Y Z from archive blah.tar.gz'. This would differ from running 3 independent processes in parallel.
Also, I have no clue (yet) how such a request should be specified, since each of those files might need its own destination path/filename; e.g. I do not see anything in the tar command line that allows extracting multiple files into arbitrary destination locations.
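To illustrate the kind of batched request meant here, below is a minimal sketch using Python's standard `tarfile` module. The function name `extract_members` and the member-to-destination mapping are hypothetical, made up for illustration; this is not existing datalad or git-annex code. It reads the archive once and writes each requested member to its own destination.

```python
import shutil
import tarfile
from pathlib import Path


def extract_members(archive_path, targets):
    """Copy selected archive members to arbitrary destinations in one pass.

    `targets` maps a member name inside the archive to a destination path
    (a hypothetical request format). Returns the set of requested member
    names that were not found as regular files.
    """
    remaining = dict(targets)
    with tarfile.open(archive_path) as tar:
        for member in tar:  # sequential scan, the archive is read only once
            if member.name not in remaining or not member.isfile():
                continue
            dest = Path(remaining.pop(member.name))
            dest.parent.mkdir(parents=True, exist_ok=True)
            with tar.extractfile(member) as src, open(dest, "wb") as out:
                shutil.copyfileobj(src, out)
            if not remaining:
                break  # everything requested was found, stop scanning
    return set(remaining)
```

A call such as `extract_members("blah.tar.gz", {"X": "out/a/x", "Z": "out/b/z"})` would then serve 'get files X and Z from blah.tar.gz' with a single scan, instead of one scan per file.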
from datalad.
Ok, sounds more like caching resources so they can be reused for multiple transfers.
So, there's potentially some overlap with the resource management I recently added to git-annex to allow reuse of, e.g., HTTP connections when downloading multiple chunks of a chunked key.
Expanding that to support reusing connections (or reusing a downloaded tarball in your example) when downloading multiple keys needs a solution to the question: How long should the cached resource be kept around? Certainly only until the end of the `git annex get` command, but ideally less time than that; if a lot of files are being transferred, we want to be able to examine the set of transfers and reorder the ones that can reuse the same resources, etc. There's a tension here with wanting `git annex get` to still start the first transfer promptly, as it does now, and not need to buffer a great many transfers in memory.
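One way to picture that tension is a bounded reordering window. The sketch below is only an illustration of the trade-off, not how git-annex implements anything; `reorder_in_windows`, `resource_of`, and the `window` size are hypothetical names chosen here. It buffers at most `window` transfers, groups those that share a resource, and emits them before reading further input.

```python
from itertools import islice


def reorder_in_windows(transfers, resource_of, window=100):
    """Yield transfers grouped by shared resource within bounded windows.

    At most `window` transfers are buffered at a time, so memory stays
    bounded and the first transfer is emitted after reading at most
    `window` items. `resource_of` maps a transfer to the resource it
    needs (e.g. an archive or a connection).
    """
    it = iter(transfers)
    while True:
        chunk = list(islice(it, window))
        if not chunk:
            return
        groups = {}  # insertion-ordered: resource -> transfers needing it
        for transfer in chunk:
            groups.setdefault(resource_of(transfer), []).append(transfer)
        for group in groups.values():
            yield from group
```

A small window starts the first transfer quickly but misses reuse across windows; a large window finds more reuse at the cost of buffering and a later start.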
from datalad.
well:
- git annex now supports parallel downloads with the `-J` switch, so kinda "solved" on the annex side (removing git-annex label)
- there is an outstanding issue to debug/fix for requesting multiple files from the same archive (#451)
- our `install` command allows for multiple targets for installation ATM, and the rest of the logic for analyzing what the most efficient 'annex get' operations would be is TODO. See https://git-annex.branchable.com/todo/wishlist__58___--dry-run_option_for_all_commands/ ; the particular command would be `git annex find --not --in here -j [paths]`, which would also return the keys in question in its JSON records
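As a rough sketch of how those JSON records could feed the analysis, the snippet below parses newline-delimited JSON into a file-to-key mapping. It assumes each record carries `file` and `key` fields; `keys_by_file` is a hypothetical helper, and the keys in the example are made up.

```python
import json


def keys_by_file(json_lines):
    """Map file path -> annex key from newline-delimited JSON records.

    Assumes every record has "file" and "key" fields (an assumption
    about the record layout, not verified against a specific version).
    """
    result = {}
    for line in json_lines.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        result[record["file"]] = record["key"]
    return result


# Made-up sample records, for illustration only.
sample = (
    '{"file": "sub/X.dat", "key": "EXAMPLE-KEY-1"}\n'
    '{"file": "sub/Y.dat", "key": "EXAMPLE-KEY-2"}\n'
)
print(keys_by_file(sample))
```

The input text could come from something like `subprocess.run(["git", "annex", "find", "--not", "--in", "here", "-j"], capture_output=True, text=True).stdout`, after which the keys could be grouped by the archive or remote that provides them.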
from datalad.
I think it was largely solved, not clear what else we should possibly do here, thus closing
from datalad.