Comments (14)
I'm able to reproduce and confirm that this is a regression introduced in #9246. See the reproduction script below. It creates a source repo with a shared cache. The initial dvc add takes ~10 minutes on my machine. Then it creates a new repo to import the source data, and the import takes ~15 minutes, even though it uses a shared cache that already contains all the data. Before that regression, it took seconds.
set -eux
echo "setup source repo"
CACHE=$(mktemp -d)
REPO_SOURCE=$(mktemp -d)
cd $REPO_SOURCE
git init
dvc init -q
dvc cache dir $CACHE
dvc config cache.shared group
dvc config cache.type symlink
echo "generate data"
mkdir dir
for i in {1..100}; do
head -c 1000000000 < /dev/urandom > dir/${i};
done
echo "dvc add data"
time dvc add dir
git add .
git commit -m "add data"
echo "dvc import data with shared cache"
REPO_IMPORT=$(mktemp -d)
cd $REPO_IMPORT
git init
dvc init -q
dvc cache dir $CACHE
dvc config cache.shared group
dvc config cache.type symlink
time dvc import $REPO_SOURCE dir
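One way to tell whether the import linked from the shared cache or copied the data is to count symlinks versus regular files in the imported directory. This is only a sketch: the helper names count_links and count_copies are hypothetical (not part of DVC), and it assumes that with cache.type symlink each file under dir would normally be a symlink into the cache.

```shell
# Hypothetical helpers, not part of DVC.
# count_links DIR: number of symlinks under DIR (files linked from a cache).
count_links() {
  find "$1" -type l | wc -l | tr -d ' '
}

# count_copies DIR: number of regular files under DIR (real copies on disk).
# `find -type f` does not follow symlinks, so links are not counted here.
count_copies() {
  find "$1" -type f | wc -l | tr -d ' '
}
```

Run from $REPO_IMPORT after the import, count_links dir and count_copies dir show how many of the files are links and how many are full copies taking up space in the workspace.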
from dvc.
Yes, as you noted, it's not actually downloading from a remote; the data is being copied from your existing cache. Ideally we should preserve the link type in this case, and we can look into fixing that (but not using the symlink here is not a regression).
cc @efiop
from dvc.
I guess it's similar to a regression that was already resolved in the past: #9385?
from dvc.
Another note.
When I use dvc==3.x to import data that was created with dvc==2.x, it works as expected.
If I try to do the same with data created with dvc==3.x, it downloads the data from the remote every time (even if the exact files are in the cache).
from dvc.
This just bit me as well. I thought the reason was my mix of dvc2/3 caches, so I recreated everything in 3.0. Glad you found easy repro steps.
from dvc.
Hello,
Is there any update on this issue please?
Were you able to reproduce it, or do you need more input from me?
from dvc.
I'm unable to reproduce this in the latest DVC release (3.45.0). Can you try updating and verify whether or not you still see the issue?
from dvc.
This is so odd. I am still seeing a "Downloading" message in my test case with 3.45.0, and I am running out of disk space, so I am quite sure the data gets downloaded from the remote and a symlink is only generated later. But this happens only in my case and not with the repo mentioned (although it looks to me like exactly the same situation). I'm not sure I'll have more time to look into this. I have been suffering from this issue for many months: https://discuss.dvc.org/t/help-with-upgrading-imported-via-dvc2-x-dvc-data-with-dvc3-0/1750/22
from dvc.
Not sure if the following just confuses this issue, but I noticed that when I do the first import (similar to the repro steps), I get asked for the remote's password and then the download starts. On the second import (similar to the repro steps) I do NOT get asked for the password; "downloading" is displayed, but it's a lot faster (which makes me think it copies/downloads the data from the external cache). However (and this is the main issue), it still copies the files before creating a symlink, which is bad because I run out of disk space in my real-world use case even though the data could simply be symlinked from a shared cache.
from dvc.
Thanks for your quick reply. Was that behaviour already in place with dvc 2.0? The good news is that once the import has happened, "dvc checkout" behaves much better, as it does not download from the cache before creating the symlink.
So the problem I am facing is:
a) I have a projectA that stores ~4TB of data in a shared external cache (on a large external drive).
b) Many different projects live on a smaller drive (<4TB) and would like to import the data from projectA.
This should all just "work" because the data is already in the cache. But because "import" "downloads" the data from the cache before creating the symlink, I run out of disk space and the "import" fails.
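Until import links directly from the cache, a rough pre-check can at least show in advance whether a copy would fit on the smaller drive. The sketch below is an assumption-laden illustration, not a DVC feature: fits_on_disk is a hypothetical helper, and it simply compares the size of a source directory (following symlinks, so cache-linked data is counted) with the free space on the target filesystem.

```shell
# Hypothetical helper, not part of DVC.
# fits_on_disk SRC DEST: succeed if the data under SRC is smaller than the
# free space on the filesystem that holds DEST.
fits_on_disk() {
  # -L follows symlinks so that data linked from a cache is counted;
  # -k reports sizes in 1 KiB blocks.
  need_kb=$(du -sLk "$1" | awk '{print $1}')
  # -P gives one POSIX-format line per filesystem; column 4 is "Available".
  free_kb=$(df -Pk "$2" | awk 'NR==2 {print $4}')
  [ "$need_kb" -lt "$free_kb" ]
}
```

For example, fits_on_disk /path/to/projectA/data . || echo "a full copy would not fit here" (the path is a placeholder).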
Ideally we should be preserving the link type in this case and we can look into fixing that (but not using the symlink here is not a regression)
That would be really great. I can't see the workflow to import large sets of data from a shared external cache otherwise.
How likely is this to happen? I am still confused about why I didn't run into the issue with dvc 2.x.
Kindest regards
from dvc.
Sorry for my late response, I've been quite busy lately.
From my experiments, it's the other way around now.
With dvc 3.x I am able to skip downloading a file created by dvc 3.x if it already exists in the cache, which is great.
But if I want to import files created by dvc 2.x, the cache is ignored and it always downloads from the remote.
So I guess this state is actually better than the one previously reported, since we can migrate the data from 2.x to 3.x.
But I guess it would be worth fixing this issue too?
from dvc.
Here's the output of dvc doctor:
DVC version: 3.48.0 (pip)
-------------------------
Platform: Python 3.11.7 on Linux-5.15.0-97-generic-x86_64-with-glibc2.31
Subprojects:
dvc_data = 3.13.0
dvc_objects = 5.1.0
dvc_render = 1.0.1
dvc_task = 0.3.0
scmrepo = 3.2.0
Supports:
http (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
https (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
s3 (s3fs = 2024.2.0, boto3 = 1.34.51)
Config:
Global: /root/.config/dvc
System: /etc/xdg/dvc
Cache types: symlink
Cache directory: xfs on /dev/mapper/data-srv
Caches: local
Remotes: None
Workspace directory: overlay on overlay
Repo: dvc (no_scm)
Repo.site_cache_dir: /var/tmp/dvc/repo/de89edf83a919aae8b7ee93ba17c75e0
from dvc.
I don't think the local cache is being ignored here. @dberenbaum, in the above script, DVCFileSystem is copying from workspace files. If you add rm -rf dir after git commit, it'll use the cache.
Using --rev will force dvc to import from a specific git revision. This behaviour does not happen with remote repositories.
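As a sketch of the --rev variant, the import step of the script could be wrapped as below. The function name import_at_rev and the DRY_RUN switch are hypothetical conveniences for illustration, and using HEAD as the default revision is an assumption; only dvc import --rev itself comes from the comment above.

```shell
# Hypothetical wrapper around `dvc import --rev`, for illustration only.
# import_at_rev REPO PATH [REV]: import PATH from REPO at a committed
# revision (default: HEAD) instead of from the source repo's workspace.
import_at_rev() {
  repo=$1
  path=$2
  rev=${3:-HEAD}
  if [ "${DRY_RUN:-0}" = 1 ]; then
    # Only print the command that would run.
    echo "dvc import --rev $rev $repo $path"
  else
    dvc import --rev "$rev" "$repo" "$path"
  fi
}
```

In the reproduction script, the last line would then become something like time import_at_rev "$REPO_SOURCE" dir, if the shell in use allows timing a function call.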
from dvc.
If you add rm -rf dir after git commit, it'll use the cache.
Yes, it will copy from the cache. However, this doesn't solve the underlying problem that this copy operation takes way longer than checkout. You can adjust the script above to just a few files (as long as each is large) and still see the difference between now and either before #9246 or with #10388.
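A minimal sketch of that adjustment is to parameterize the data-generation loop so the number and size of files can be changed. The function name generate_data is hypothetical, and the example values in the usage note are assumptions chosen for illustration, not numbers from this thread.

```shell
# Hypothetical replacement for the "generate data" step of the script.
# generate_data DIR COUNT BYTES: create COUNT files of BYTES random bytes
# each, named 1..COUNT, under DIR.
generate_data() {
  dir=$1
  count=$2
  bytes=$3
  mkdir -p "$dir"
  i=1
  while [ "$i" -le "$count" ]; do
    head -c "$bytes" < /dev/urandom > "$dir/$i"
    i=$((i + 1))
  done
}
```

For instance, generate_data dir 3 1000000000 would create three files of about 1 GB each in place of the original hundred.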
from dvc.