
Comments (32)

wangxingjun778 commented on July 1, 2024

> It looks like it will take 3 months to download the data. :/ Download is really slow, 114GB in 4 days with university internet from Europe. Is there something config-wise I can adjust on my end?

One thing that may help: we will add the environment variable ENABLE_DATASET_ACCELERATE in the next version (to be released soon). Until then, you can apply the change yourself:

step 1: pull the latest source code from the branch release/1.1

step 2: in modelscope.msdatasets.utils.oss_utils.OssUtilities._do_init(), change the endpoint line to: self.endpoint = 'https://oss-accelerate.aliyuncs.com'

step 3: uninstall modelscope

step 4: install modelscope from the source code

Please give it a try.
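
A rough sketch of how an environment switch such as ENABLE_DATASET_ACCELERATE could select the endpoint. Only the variable name and the accelerate URL come from this thread; the helper name `resolve_oss_endpoint`, the accepted truthy values, and the default endpoint passed in are assumptions for illustration, and the real SDK may handle this differently.

```python
import os

# Acceleration endpoint quoted in this thread.
ACCELERATE_ENDPOINT = 'https://oss-accelerate.aliyuncs.com'


def resolve_oss_endpoint(default_endpoint, env=None):
    """Return the accelerate endpoint when the switch is on, else the default.

    Hypothetical helper: the real SDK may parse the variable differently.
    """
    env = os.environ if env is None else env
    flag = env.get('ENABLE_DATASET_ACCELERATE', '').strip().lower()
    if flag in ('1', 'true', 'yes'):
        return ACCELERATE_ENDPOINT
    return default_endpoint
```

With a switch like this, no source edit would be needed: exporting the variable before starting the download script would be enough.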

from modelscope.

lhatsk commented on July 1, 2024

Thanks!

Maybe a mix would be good. Zip the data in chunks and unzip them asynchronously once they are downloaded.
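
A minimal sketch of that idea, assuming the data were shipped as several zip chunks (it is not today). Each chunk is extracted into its own subdirectory by a thread pool; the function names and the one-directory-per-chunk layout are made up for illustration. In a real pipeline each chunk would be submitted to the pool as soon as its download finishes.

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_chunk(archive_path, dest_dir):
    """Extract one zip chunk into dest_dir/<archive stem>; return member count."""
    target = Path(dest_dir) / Path(archive_path).stem
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(target)
        return len(zf.namelist())


def extract_chunks(archive_paths, dest_dir, workers=4):
    """Extract already-downloaded chunks in parallel; return total members."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(extract_chunk, p, dest_dir) for p in archive_paths]
        return sum(f.result() for f in futures)
```

Giving every chunk its own output directory avoids two threads creating the same folder at the same time.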

wangxingjun778 commented on July 1, 2024

> For some reason it's getting much slower again. Managed to download only 2GB in the last 5 days. :/ Close to 800k files downloaded now.

Hm, we tested the download speed on an Alibaba Cloud ECS instance in London: it is about 2.24MB/s with the default configuration and about 5.02MB/s with dataset acceleration enabled.

However, there seems to be a difference between the ECS network and university internet in Europe ...

wangxingjun778 commented on July 1, 2024

This is already fixed; please refer to the latest version: https://github.com/modelscope/modelscope/tree/release/1.1
You can install it from source, and the image will be released in the next few days. :)

WeianMao commented on July 1, 2024

> Already fixed, please refer to the latest version : https://github.com/modelscope/modelscope/tree/release/1.1 You can use source code to install it, and the image will be released in the next few days. :)

fantastic! i will try it now. thank you very much

WeianMao commented on July 1, 2024

@wangxingjun778 sorry to bother you again. I tried the latest version, 1.1:

>>> import modelscope
>>> modelscope.__version__
'1.1.0'

However, the error is still the same:
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='www.modelscope.cn', port=80): Max retries exceeded with url: /api/v1/datasets/DPTech/Uni-Fold-Data/oss/tree/?MaxLimit=-1&Revision=master&Recursive=True&FilterDir=True (Caused by ReadTimeoutError("HTTPConnectionPool(host='www.modelscope.cn', port=80): Read timed out. (read timeout=60)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "scripts/download/download_training_data.py", line 7, in <module>
ds = MsDataset.load(dataset_name='Uni-Fold-Data', namespace='DPTech', split='train')
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/ms_dataset.py", line 192, in load
return MsDataset._load_ms_dataset(
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/ms_dataset.py", line 254, in _load_ms_dataset
dataset = MsDataset._load_from_ms(
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/ms_dataset.py", line 301, in _load_from_ms
meta_map, file_map, args_map = get_dataset_files(
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/utils/dataset_utils.py", line 173, in get_dataset_files
objects = list_dataset_objects(
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/utils/dataset_utils.py", line 98, in list_dataset_objects
objects = hub_api.list_oss_dataset_objects(
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/hub/api.py", line 613, in list_oss_dataset_objects
resp = self.session.get(url=url, cookies=cookies)
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/requests/sessions.py", line 600, in get
return self.request("GET", url, **kwargs)
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/requests/adapters.py", line 565, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.modelscope.cn', port=80): Max retries exceeded with url: /api/v1/datasets/DPTech/Uni-Fold-Data/oss/tree/?MaxLimit=-1&Revision=master&Recursive=True&FilterDir=True (Caused by ReadTimeoutError("HTTPConnectionPool(host='www.modelscope.cn', port=80): Read timed out. (read timeout=60)"))

WeianMao commented on July 1, 2024

If possible, could you please tell me where the bug is? I can fix it myself. Thank you very much.

WeianMao commented on July 1, 2024

I found the bug; please refer to this commit: c1d2972

When I change the line 'resp = self.session.get(url=url, cookies=cookies)' to 'resp = self.session.get(url=url, cookies=cookies, timeout=1800)', the error goes away.
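
A small sketch of the same idea with separate connect and read timeouts, which requests accepts as a (connect, read) tuple in seconds. The helper name and the 10-second connect timeout are assumptions for illustration; the 1800-second read timeout is the value from the fix above.

```python
def get_with_timeout(session, url, cookies=None,
                     connect_timeout=10, read_timeout=1800):
    """Issue a GET with explicit connect/read timeouts.

    `session` is any object with a requests-style `get` method,
    for example `requests.Session()`.
    """
    return session.get(url=url, cookies=cookies,
                       timeout=(connect_timeout, read_timeout))
```

Splitting the two values keeps a dead host from blocking for 30 minutes while still allowing the slow object listing to finish.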

wangxingjun778 commented on July 1, 2024

Yes, when you download the whole dataset, MsDataset.load fetches the full list of objects from the modelscope dataset hub, which can take a while (usually a few minutes, depending on your environment). So we raised the timeout to leave some buffer.

WeianMao commented on July 1, 2024

Sorry, although the download ran successfully for one night, I then hit another error:
100% | 0/1
Traceback (most recent call last):
File "scripts/download/download_training_data.py", line 7, in <module>
ds = MsDataset.load(dataset_name='Uni-Fold-Data', namespace='DPTech', split='train')
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/ms_dataset.py", line 192, in load
return MsDataset._load_ms_dataset(
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/ms_dataset.py", line 254, in _load_ms_dataset
dataset = MsDataset._load_from_ms(
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/ms_dataset.py", line 330, in _load_from_ms
builder.download_and_prepare(
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/datasets/builder.py", line 807, in download_and_prepare
self._download_and_prepare(
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/datasets/builder.py", line 876, in _download_and_prepare
split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/utils/dataset_builder.py", line 98, in _split_generators
zip_data_files = dl_manager.download_and_extract(self.zip_data_files)
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/datasets/download/download_manager.py", line 433, in download_and_extract
return self.extract(self.download(url_or_urls))
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/datasets/download/download_manager.py", line 310, in download
downloaded_path_or_paths = map_nested(
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 428, in map_nested
mapped = [
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 429, in <listcomp>
_single_map_nested((function, obj, types, None, True, None))
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 348, in _single_map_nested
mapped = [_single_map_nested((function, v, types, None, True, None)) for v in pbar]
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 348, in <listcomp>
mapped = [_single_map_nested((function, v, types, None, True, None)) for v in pbar]
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 330, in _single_map_nested
return function(data_struct)
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/utils/download_utils.py", line 41, in _download
return self.oss_utilities.download(
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/utils/oss_utils.py", line 76, in download
oss2.resumable_download(
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/oss2/resumable.py", line 155, in resumable_download
result = bucket.head_object(key, params=params, headers=valid_headers)
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/oss2/api.py", line 970, in head_object
resp = self.__do_object('HEAD', key, headers=headers, params=params)
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/oss2/api.py", line 2652, in __do_object
return self._do(method, self.bucket_name, key, **kwargs)
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/oss2/api.py", line 239, in _do
raise e
oss2.exceptions.ServerError: {'status': 403, 'x-oss-request-id': '639B4CE653726E343124C54A', 'details': {}}

WeianMao commented on July 1, 2024

When I download the dataset again, there is another error:
Traceback (most recent call last):
File "scripts/download/download_training_data.py", line 7, in <module>
ds = MsDataset.load(dataset_name='Uni-Fold-Data', namespace='DPTech', split='train')
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/ms_dataset.py", line 192, in load
return MsDataset._load_ms_dataset(
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/ms_dataset.py", line 254, in _load_ms_dataset
dataset = MsDataset._load_from_ms(
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/ms_dataset.py", line 301, in _load_from_ms
meta_map, file_map, args_map = get_dataset_files(
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/utils/dataset_utils.py", line 173, in get_dataset_files
objects = list_dataset_objects(
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/utils/dataset_utils.py", line 98, in list_dataset_objects
objects = hub_api.list_oss_dataset_objects(
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/hub/api.py", line 614, in list_oss_dataset_objects
resp = self.session.get(url=url, cookies=cookies, timeout=1800)
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/requests/sessions.py", line 600, in get
return self.request("GET", url, **kwargs)
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/requests/adapters.py", line 565, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.modelscope.cn', port=80): Max retries exceeded with url: /api/v1/datasets/DPTech/Uni-Fold-Data/oss/tree/?MaxLimit=-1&Revision=master&Recursive=True&FilterDir=True (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')))

wangxingjun778 commented on July 1, 2024

This looks like an unstable connection plus an expired auth token. We are trying to reproduce it and will fix the issue ASAP.
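
Until a fix lands, a script could wrap the flaky call in a retry loop. This is a generic sketch with exponential backoff, not part of the modelscope SDK; the name `retry` and its defaults are illustrative.

```python
import time


def retry(func, attempts=5, base_delay=1.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(); on a listed exception wait base_delay * 2**n and retry."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
```

Note that `requests.exceptions.ConnectionError` does not derive from the built-in `ConnectionError`, so to catch the error in the traceback above it would need to be passed explicitly via `retry_on`.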

wangxingjun778 commented on July 1, 2024

This has been updated on release/1.1; you can install it from source:
https://github.com/modelscope/modelscope/tree/release/1.1

Notes:

  1. The use cases have been updated: https://modelscope.cn/datasets/DPTech/Uni-Fold-Data/summary
  2. If an earlier download of this dataset was interrupted, please remove the old meta file with the following command:
    rm ~/.cache/modelscope/hub/datasets/downloads/DPTech/Uni-Fold-Data/master/Uni-Fold-Data.json

Please refer to this commit for the changes: 2042c6f
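
For anyone who prefers to do the cleanup from the download script itself, here is a small Python equivalent of the rm command in note 2. The path is the one quoted there; the function name is made up for illustration.

```python
from pathlib import Path

# Cache path quoted in note 2.
META_FILE = (Path.home() / '.cache/modelscope/hub/datasets/downloads'
             / 'DPTech/Uni-Fold-Data/master/Uni-Fold-Data.json')


def remove_stale_meta(path=META_FILE):
    """Delete the cached meta file if present; return True if it existed."""
    path = Path(path)
    existed = path.exists()
    path.unlink(missing_ok=True)  # missing_ok requires Python 3.8+
    return existed
```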

lhatsk commented on July 1, 2024

Thanks! I was now able to download parts of the data (8GB). After 10 hours I ran into the following problem:


    353     cookies = ModelScopeConfig.get_cookies()
    354     if cookies is None:
--> 355         raise ValueError('Token does not exist, please login first.')
    356 return cookies

ValueError: Token does not exist, please login first. 

wangxingjun778 commented on July 1, 2024

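The temporary token may have expired. Please add the following code to your script and continue the download:

from modelscope.hub.api import HubApi
from modelscope.msdatasets import MsDataset

api = HubApi()
api.login('your-sdk-token')  # replace with your own SDK token
dataset = MsDataset.load(dataset_name='Uni-Fold-Data', namespace='DPTech', split='train')

You can get 'your-sdk-token' on modelscope.cn: after signing up -> Personal Center -> Access Tokens -> Create new SDK token -> copy & paste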

lhatsk commented on July 1, 2024

Hm, that's unfortunate. Google translate doesn't properly work on this website and I can't complete the checksum thingy. Would be great if you could provide a new token. Given that download speed and the size of the data, wouldn't this take weeks to download?

wangxingjun778 commented on July 1, 2024

> Hm, that's unfortunate. Google translate doesn't properly work on this website and I can't complete the checksum thingy. Would be great if you could provide a new token. Given that download speed and the size of the data, wouldn't this take weeks to download?

Emm, sorry about that, we don't have an English version of the website yet ... and I can't share my personal token, but you can sign up for free on modelscope.cn and follow the steps shown in this screenshot:
[screenshot]

Also, the dataset is about 2TB in size, so it may take a few days to download, depending on your connection speed. :)
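
To put numbers on "depends on your connection speed", here is a back-of-the-envelope estimate using figures reported in this thread: a 2TB dataset, the London ECS rates of 2.24MB/s and 5.02MB/s, and the observed 114GB in 4 days. Decimal units and a constant rate are assumptions.

```python
def eta_days(total_bytes, bytes_per_second):
    """Days needed to move total_bytes at a constant rate."""
    return total_bytes / bytes_per_second / 86400


# Decimal units assumed.
MB = 10 ** 6
GB = 10 ** 9
TB = 10 ** 12

# Rates measured on the London ECS instance, as reported in this thread.
print(round(eta_days(2 * TB, 2.24 * MB), 1))  # default endpoint
print(round(eta_days(2 * TB, 5.02 * MB), 1))  # acceleration enabled

# Observed rate from the report of 114 GB in 4 days.
observed = 114 * GB / (4 * 86400)
print(round(eta_days(2 * TB, observed), 1))
```

This works out to roughly 10 days, 4.6 days, and 70 days respectively, so the observed rate (about 0.33MB/s) is far below what the ECS test achieved.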

wangxingjun778 commented on July 1, 2024

> Hm, that's unfortunate. Google translate doesn't properly work on this website and I can't complete the checksum thingy. Would be great if you could provide a new token. Given that download speed and the size of the data, wouldn't this take weeks to download?

Besides, please pull the latest code from the branch release/1.1 and reinstall the SDK; an error log issue has been fixed.

lhatsk commented on July 1, 2024

Thank you for your help! I managed to create the account and get the token.

Connection should be fairly good. My guess is 100MB/s.

lhatsk commented on July 1, 2024

It looks like it will take 3 months to download the data. :/ Download is really slow, 114GB in 4 days with university internet from Europe. Is there something config-wise I can adjust on my end?

lhatsk commented on July 1, 2024

Thank you! I made the adjustments and will report back. Since I interrupted the download, will modelscope just pick up where I left off?

My feeling was that the slow download speed is in part also because it's downloading many very small files.

wangxingjun778 commented on July 1, 2024

> Thank you! I made the adjustments and will report back. Since I interrupted the download, will modelscope just pick up where I left off?
>
> My feeling was that the slow download speed is in part also because it's downloading many very small files.

Yes, modelscope will resume the download where you left off, but it first checks whether each object exists, which may take a while ...
You can modify the following file to improve performance: https://github.com/modelscope/modelscope/blob/release/1.1/modelscope/msdatasets/utils/oss_utils.py

  1. Skip the object_exists check: change line 80 to: file_oss_key = candidate_key
  2. Modify line 98: if e.__dict__.get('status') == 403: ...
    (will be released in the next version)

Indeed, the Uni-Fold dataset contains 1600k+ small files, which could be part of the reason for the slow download speed. We are also considering zipping all the files, but unzipping such a large archive is a problem of its own and could take a long time...

lhatsk commented on July 1, 2024

For some reason it's getting much slower again. Managed to download only 2GB in the last 5 days. :/ Close to 800k files downloaded now.

lhatsk commented on July 1, 2024

Thanks for checking! At least within Germany I get up to 75MB/s, probably the small files then. However, I think they are using compression on the filesystem level, which would make the download size a bad indicator then. I will keep monitoring the number of downloaded files.
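
Since the on-disk size can mislead, counting files is a simple progress signal. A minimal sketch; the directory to scan is whatever cache or output path the download writes to, which is not specified here.

```python
import os


def count_files(root):
    """Count non-directory entries under root, including subdirectories."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total
```

Running this periodically and comparing successive counts gives a files-per-hour rate that is independent of filesystem compression.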

lhatsk commented on July 1, 2024

Unfortunately, for the last couple of days, the download crashes 2-3x a day with the following error:

NotFound: {'status': 404, 'x-oss-request-id': '63CD92F2174899D20E3EE020', 'details': {}}

lhatsk commented on July 1, 2024

I am now at 1632397 files but can no longer restart the download, the rest seems to be missing/ broken.

NotFound: {'status': 404, 'x-oss-request-id': '63D244E1E1EC5099825CFCFF', 'details': {}}

wangxingjun778 commented on July 1, 2024

It looks like the whole dataset has been downloaded, but there is an issue that may cause the 404 error. Although it has been fixed in release/1.2, we still recommend staying on release/1.1.
Please modify the following source code (based on release/1.1):

  1. To ignore an unexpected separator in the meta config files of the Uni-Fold dataset, edit this function:
    def fetch_single_csv_script(self, script_url: str):
    Change: text_list = resp.text.split('\n')
    to: text_list = resp.text.strip().split('\n')
  2. To make listing objects more efficient, edit https://github.com/modelscope/modelscope/blob/70114a13e2b3063c0e2ce6a946d29e94233b16c8/modelscope/msdatasets/utils/oss_utils.py
    Change: file_oss_key = candidate_key if self.bucket.object_exists(candidate_key) else candidate_key_backup
    to: file_oss_key = candidate_key

Next, uninstall the modelscope SDK and reinstall it from source.
Please give it a try :)
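
A quick illustration of why the first change matters, using a made-up response body that ends with a newline. Without `strip()`, the split leaves an empty trailing entry, which may then be treated as an object name.

```python
text = 'a.csv\nb.csv\n'  # hypothetical response body ending in a newline

print(text.split('\n'))          # ['a.csv', 'b.csv', '']
print(text.strip().split('\n'))  # ['a.csv', 'b.csv']
```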

lhatsk commented on July 1, 2024

Thanks! So with this change, the download finishes, but after a day the uncompressing fails:

EOFError: Compressed file ended before the end-of-stream marker was reached

wangxingjun778 commented on July 1, 2024

Hmm, we have never run into this kind of issue. Maybe check Stack Overflow, for example: https://stackoverflow.com/questions/40877781/eoferror-compressed-file-ended-before-the-end-of-stream-marker-was-reached-mn

If that still doesn't solve the problem, just let us know and we will try to reproduce it.

Note: DO NOT USE download_mode=DownloadMode.FORCE_REDOWNLOAD in the MsDataset.load() function, otherwise the local cache of the Uni-Fold dataset will be removed!

lhatsk commented on July 1, 2024

There the solution just seems to be to re-download the data which would take another month. Aren't the corrupt files from the 404's we skipped earlier? I am going to try to find and delete them, so maybe only those can be downloaded again, but even getting to this stage takes 1-2 days with the efficiency fix for listing objects.
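
One way to locate truncated files is to try decompressing each one and record those that fail. The sketch below assumes the cached files are gzip streams with a .gz extension, which this thread does not confirm; the same EOFError message is also raised by Python's bz2 and lzma modules, so the opener may need to be swapped. Function names are illustrative.

```python
import gzip
import zlib
from pathlib import Path


def is_readable_gzip(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream decompresses without error."""
    try:
        with gzip.open(path, 'rb') as fh:
            while fh.read(chunk_size):
                pass
        return True
    except (EOFError, OSError, zlib.error):
        return False


def find_corrupt_gzip(root, pattern='*.gz'):
    """List files under root matching pattern that fail to decompress."""
    return sorted(p for p in Path(root).rglob(pattern)
                  if not is_readable_gzip(p))
```

Deleting only the files this returns, then restarting the download, would avoid fetching everything again, provided the downloader re-fetches missing files.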

github-actions commented on July 1, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions commented on July 1, 2024

This issue was closed because it has been stalled for 5 days with no activity.
