Comments (32)
It looks like it will take 3 months to download the data. :/ Download is really slow, 114GB in 4 days with university internet from Europe. Is there something config-wise I can adjust on my end?
There is a way that may help: we will add the env variable ENABLE_DATASET_ACCELERATE
in the next version (to be released soon), or you can do it yourself:
step 1: pull the latest source code from branch release/1.1
step 2: in modelscope.msdatasets.utils.oss_utils.OssUtilities._do_init(), modify this line: self.endpoint = 'https://oss-accelerate.aliyuncs.com'
step 3: uninstall modelscope
step 4: reinstall modelscope from source
Please try it ~~
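For illustration, the env-variable switch could look roughly like this (a sketch only: the function name and the default endpoint below are placeholders I made up; only the accelerate endpoint and the variable name come from the steps above):

```python
import os

# Sketch of an ENABLE_DATASET_ACCELERATE switch. The accelerate endpoint is
# the one named in the steps above; everything else is a hypothetical example.
ACCEL_ENDPOINT = 'https://oss-accelerate.aliyuncs.com'

def pick_endpoint(default_endpoint):
    """Return the OSS transfer-acceleration endpoint when the env var is set."""
    if os.environ.get('ENABLE_DATASET_ACCELERATE', '').lower() in ('1', 'true'):
        return ACCEL_ENDPOINT
    return default_endpoint
```

The manual patch in step 2 hard-codes the same result that this switch would choose dynamically.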
from modelscope.
Thanks!
Maybe a mix would be good. Zip the data in chunks and unzip them asynchronously once they are downloaded.
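A minimal sketch of that mix (all names hypothetical; a real downloader would fetch each chunk over the network where the comment indicates, so extraction of finished chunks overlaps with the next download):

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor

def extract_chunk(path, dest_dir):
    """Unzip one chunk archive into dest_dir."""
    with zipfile.ZipFile(path) as zf:
        zf.extractall(dest_dir)
    return path

def download_and_extract(chunk_paths, dest_dir):
    """Hand each finished chunk to a background thread for extraction,
    so unzipping runs while the (hypothetical) next download proceeds."""
    futures = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        for path in chunk_paths:
            # In a real pipeline, the download of `path` would happen here.
            futures.append(pool.submit(extract_chunk, path, dest_dir))
    return [f.result() for f in futures]
```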
For some reason it's getting much slower again. Managed to download only 2GB in the last 5 days. :/ Close to 800k files downloaded now.
Hm, we have tested the download speed on an Alibaba Cloud ECS instance in London: it's about 2.24 MB/s with the default configuration, and about 5.02 MB/s with dataset acceleration enabled.
However, there does seem to be quite a difference between the ECS network and university internet in Europe ...
Already fixed, please refer to the latest version: https://github.com/modelscope/modelscope/tree/release/1.1
You can install it from source; the image will be released in the next few days. :)
fantastic! i will try it now. thank you very much
@wangxingjun778 Sorry to bother you again. I tried the latest version 1.1:
>>> import modelscope
>>> modelscope.__version__
'1.1.0'
However, the error is still the same:
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='www.modelscope.cn', port=80): Max retries exceeded with url: /api/v1/datasets/DPTech/Uni-Fold-Data/oss/tree/?MaxLimit=-1&Revision=master&Recursive=True&FilterDir=True (Caused by ReadTimeoutError("HTTPConnectionPool(host='www.modelscope.cn', port=80): Read timed out. (read timeout=60)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scripts/download/download_training_data.py", line 7, in <module>
    ds = MsDataset.load(dataset_name='Uni-Fold-Data', namespace='DPTech', split='train')
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/ms_dataset.py", line 192, in load
    return MsDataset._load_ms_dataset(
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/ms_dataset.py", line 254, in _load_ms_dataset
    dataset = MsDataset._load_from_ms(
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/ms_dataset.py", line 301, in _load_from_ms
    meta_map, file_map, args_map = get_dataset_files(
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/utils/dataset_utils.py", line 173, in get_dataset_files
    objects = list_dataset_objects(
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/utils/dataset_utils.py", line 98, in list_dataset_objects
    objects = hub_api.list_oss_dataset_objects(
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/hub/api.py", line 613, in list_oss_dataset_objects
    resp = self.session.get(url=url, cookies=cookies)
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/requests/sessions.py", line 600, in get
    return self.request("GET", url, **kwargs)
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/requests/adapters.py", line 565, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.modelscope.cn', port=80): Max retries exceeded with url: /api/v1/datasets/DPTech/Uni-Fold-Data/oss/tree/?MaxLimit=-1&Revision=master&Recursive=True&FilterDir=True (Caused by ReadTimeoutError("HTTPConnectionPool(host='www.modelscope.cn', port=80): Read timed out. (read timeout=60)"))
If possible, could you please tell me where the bug is? I can fix it myself. Thank you very much.
I found the bug, please refer to the link: c1d2972
When I change the line from 'resp = self.session.get(url=url, cookies=cookies)' to 'resp = self.session.get(url=url, cookies=cookies, timeout=1800)', it is fixed.
Yes, when you download the whole dataset, MsDataset.load first retrieves the full list of objects from the modelscope dataset hub, which can take a while (usually a few minutes, depending on your environment). So we increased the timeout to leave some buffer.
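The general pattern here, sketched generically (this is not modelscope's actual implementation), is a generous timeout combined with a small retry loop around the long-running listing call:

```python
import time

def call_with_retries(fn, attempts=3, base_delay=2.0, exceptions=(OSError,)):
    """Call fn(), retrying with exponential backoff on transient errors.
    A hypothetical helper illustrating the timeout/retry idea above."""
    for attempt in range(attempts):
        try:
            return fn()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))
```

With requests this pairs naturally with the timeout=1800 fix above, e.g. wrapping the session.get call in such a loop.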
Sorry, although the download ran successfully overnight, there is another error:
Traceback (most recent call last):
  File "scripts/download/download_training_data.py", line 7, in <module>
    ds = MsDataset.load(dataset_name='Uni-Fold-Data', namespace='DPTech', split='train')
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/ms_dataset.py", line 192, in load
    return MsDataset._load_ms_dataset(
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/ms_dataset.py", line 254, in _load_ms_dataset
    dataset = MsDataset._load_from_ms(
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/ms_dataset.py", line 330, in _load_from_ms
    builder.download_and_prepare(
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/datasets/builder.py", line 807, in download_and_prepare
    self._download_and_prepare(
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/datasets/builder.py", line 876, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/utils/dataset_builder.py", line 98, in _split_generators
    zip_data_files = dl_manager.download_and_extract(self.zip_data_files)
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/datasets/download/download_manager.py", line 433, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/datasets/download/download_manager.py", line 310, in download
    downloaded_path_or_paths = map_nested(
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 428, in map_nested
    mapped = [
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 429, in <listcomp>
    _single_map_nested((function, obj, types, None, True, None))
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 348, in _single_map_nested
    mapped = [_single_map_nested((function, v, types, None, True, None)) for v in pbar]
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 348, in <listcomp>
    mapped = [_single_map_nested((function, v, types, None, True, None)) for v in pbar]
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 330, in _single_map_nested
    return function(data_struct)
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/utils/download_utils.py", line 41, in _download
    return self.oss_utilities.download(
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/utils/oss_utils.py", line 76, in download
    oss2.resumable_download(
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/oss2/resumable.py", line 155, in resumable_download
    result = bucket.head_object(key, params=params, headers=valid_headers)
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/oss2/api.py", line 970, in head_object
    resp = self.__do_object('HEAD', key, headers=headers, params=params)
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/oss2/api.py", line 2652, in __do_object
    return self._do(method, self.bucket_name, key, **kwargs)
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/oss2/api.py", line 239, in _do
    raise e
oss2.exceptions.ServerError: {'status': 403, 'x-oss-request-id': '639B4CE653726E343124C54A', 'details': {}}
When I download the dataset again, there is another error:
Traceback (most recent call last):
  File "scripts/download/download_training_data.py", line 7, in <module>
    ds = MsDataset.load(dataset_name='Uni-Fold-Data', namespace='DPTech', split='train')
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/ms_dataset.py", line 192, in load
    return MsDataset._load_ms_dataset(
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/ms_dataset.py", line 254, in _load_ms_dataset
    dataset = MsDataset._load_from_ms(
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/ms_dataset.py", line 301, in _load_from_ms
    meta_map, file_map, args_map = get_dataset_files(
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/utils/dataset_utils.py", line 173, in get_dataset_files
    objects = list_dataset_objects(
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/msdatasets/utils/dataset_utils.py", line 98, in list_dataset_objects
    objects = hub_api.list_oss_dataset_objects(
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/modelscope-1.1.0-py3.8.egg/modelscope/hub/api.py", line 614, in list_oss_dataset_objects
    resp = self.session.get(url=url, cookies=cookies, timeout=1800)
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/requests/sessions.py", line 600, in get
    return self.request("GET", url, **kwargs)
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/wayne/miniconda3/envs/uni/lib/python3.8/site-packages/requests/adapters.py", line 565, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.modelscope.cn', port=80): Max retries exceeded with url: /api/v1/datasets/DPTech/Uni-Fold-Data/oss/tree/?MaxLimit=-1&Revision=master&Recursive=True&FilterDir=True (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')))
Seems like an unstable connection plus an expired auth token; we are trying to reproduce this state and solve the issue ASAP.
Already updated on release/1.1; you can install it from source:
https://github.com/modelscope/modelscope/tree/release/1.1
Notes:
- The use cases have been updated: https://modelscope.cn/datasets/DPTech/Uni-Fold-Data/summary
- If a previous download of this dataset was interrupted, please use the following command to remove the old meta file:
rm ~/.cache/modelscope/hub/datasets/downloads/DPTech/Uni-Fold-Data/master/Uni-Fold-Data.json
Please refer to the commit for the changes: 2042c6f
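If you prefer to do the cleanup from Python rather than the shell, a small helper equivalent to the rm command above might look like this (the helper name is made up; the path is the one given above):

```python
from pathlib import Path

def remove_stale_meta(cache_root='~/.cache/modelscope'):
    """Delete the old Uni-Fold meta file so the next load starts clean.
    Returns True if a file was removed, False if none existed."""
    meta = (Path(cache_root).expanduser()
            / 'hub/datasets/downloads/DPTech/Uni-Fold-Data/master/Uni-Fold-Data.json')
    if meta.exists():
        meta.unlink()
        return True
    return False
```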
Thanks! I was now able to download parts of the data (8GB). After 10 hours I ran into the following problem:
353 cookies = ModelScopeConfig.get_cookies()
354 if cookies is None:
--> 355 raise ValueError('Token does not exist, please login first.')
356 return cookies
ValueError: Token does not exist, please login first.
The temporary token may have expired; please add the following code to your script and continue the download:
from modelscope.hub.api import HubApi
api = HubApi()
api.login('your-sdk-token')
dataset = MsDataset.load(xxx)
You can get 'your-sdk-token' on modelscope.cn: after signing up -> Personal Center (个人中心) -> Access Tokens (访问令牌) -> Create SDK Token (新建SDK令牌) -> copy & paste.
Hm, that's unfortunate. Google translate doesn't properly work on this website and I can't complete the checksum thingy. Would be great if you could provide a new token. Given that download speed and the size of the data, wouldn't this take weeks to download?
Emm, sorry, we don't have an English version of the website yet ... and I can't share my personal token, but you can sign up for free on modelscope.cn and follow the steps in the screenshot below:
Also, the dataset is about 2 TB, so it may take a few days to download, depending on your connection speed. :)
Besides, please pull the latest code from branch release/1.1 and reinstall the SDK; an error-log issue has been fixed.
Thank you for your help! I managed to create the account and get the token.
Connection should be fairly good. My guess is 100MB/s.
Thank you! I made the adjustments and will report back. Since I interrupted the download, will modelscope just pick up where I left off?
My feeling was that the slow download speed is in part also because it's downloading many very small files.
Yes, modelscope will resume the download where you left off, but it verifies whether each object already exists, which may take a while ...
You can modify the following file to improve performance: https://github.com/modelscope/modelscope/blob/release/1.1/modelscope/msdatasets/utils/oss_utils.py
- cancel the object_exists check: modify line 80 to: file_oss_key = candidate_key
- modify line 98 to: if e.__dict__.get('status') == 403: ...
(this will be released in the next version)
Indeed, since the Uni-Fold dataset contains 1,600k+ small files, that could be part of the reason for the slow download speed. We are also considering zipping all the files, but unzipping such a large archive is another problem; it could take a long time...
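The idea behind skipping the per-object existence check can be sketched with a local manifest of already-downloaded keys (all names here are hypothetical, not modelscope's API): instead of asking OSS about each of the 1,600k+ objects over the network, trust a record written by earlier runs.

```python
import os
import json

def load_manifest(path):
    """Set of object keys already downloaded, as recorded by earlier runs
    (hypothetical format: a JSON list of key strings)."""
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def pending_keys(all_keys, manifest_path):
    """Keys still to download, decided locally with no per-object
    network existence check."""
    done = load_manifest(manifest_path)
    return [k for k in all_keys if k not in done]
```

A set membership test per key is effectively free, whereas a HEAD request per key costs a network round trip each.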
Thanks for checking! At least within Germany I get up to 75 MB/s, so it's probably the small files. However, I think they use compression at the filesystem level, which would make the downloaded size a bad indicator. I will keep monitoring the number of downloaded files instead.
Unfortunately, for the last couple of days, the download crashes 2-3x a day with the following error:
NotFound: {'status': 404, 'x-oss-request-id': '63CD92F2174899D20E3EE020', 'details': {}}
I am now at 1632397 files but can no longer restart the download; the rest seems to be missing/broken.
NotFound: {'status': 404, 'x-oss-request-id': '63D244E1E1EC5099825CFCFF', 'details': {}}
It looks like the whole dataset has been downloaded, but there is an issue that may cause the 404 error. Although it has been fixed in release/1.2, we still recommend staying on release/1.1.
Please modify the following source code (based on release/1.1):
- Modify this line to ignore the unexpected separator in the meta config files of the Uni-Fold dataset: refer to modelscope/hub/api.py, line 569 in 70114a1
- Improve the efficiency of listing objects: refer to https://github.com/modelscope/modelscope/blob/70114a13e2b3063c0e2ce6a946d29e94233b16c8/modelscope/msdatasets/utils/oss_utils.py and change: file_oss_key = candidate_key if self.bucket.object_exists(candidate_key) else candidate_key_backup to: file_oss_key = candidate_key
Next, please uninstall the modelscope SDK and reinstall it from source.
Please try :)
Thanks! So with this change the download finishes, but after a day the decompression fails:
EOFError: Compressed file ended before the end-of-stream marker was reached
Hmm, we have never run into this kind of issue; maybe check Stack Overflow, e.g.: https://stackoverflow.com/questions/40877781/eoferror-compressed-file-ended-before-the-end-of-stream-marker-was-reached-mn
If that doesn't solve the problem, just let us know and we will try to reproduce it.
Note: DO NOT USE download_mode=DownloadMode.FORCE_REDOWNLOAD in MsDataset.load(), otherwise the local cache of the Uni-Fold dataset will be removed!
There the suggested solution is just to re-download the data, which would take another month. Aren't the corrupt files from the 404s we skipped earlier? I am going to try to find and delete them, so that ideally only those get downloaded again, but even getting to that stage takes 1-2 days, even with the efficiency fix for listing objects.
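One way to locate truncated archives locally, assuming the chunks are gzip files (a hypothetical helper; reading each file to EOF forces the stream check, so a file cut off mid-stream raises the same "end-of-stream marker" EOFError seen above):

```python
import gzip
import os

def find_corrupt_gzips(root):
    """Return paths of .gz files under root that fail a full
    decompression pass, so only those need re-downloading."""
    bad = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith('.gz'):
                continue
            path = os.path.join(dirpath, name)
            try:
                with gzip.open(path, 'rb') as f:
                    while f.read(1 << 20):  # read to EOF to verify the stream
                        pass
            except (EOFError, OSError):
                bad.append(path)
    return bad
```

Deleting the returned paths before resuming should let the downloader fetch just the broken objects again, rather than the whole 2 TB.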
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.