m3dev / thunderbolt
gokart file manager
License: MIT License
I am trying to fetch the task list from S3 and wrote the code below.
Running it raises AttributeError: 'dict' object has no attribute 'split'.
from thunderbolt.client.s3_client import S3Client
workspace_directory = 's3://bucket_name/prefix'
client = S3Client(workspace_directory=workspace_directory)
tasks_list = client.get_tasks()
The error occurs in client.get_tasks().
The problematic part of client.get_tasks() is the call to local_cache.get(x).
Its argument x must be a str containing the file name.
However, the x that s3_client.get_tasks() actually passes to local_cache.get() is a dict like this:
{'Key': '/resources/log/task_log/${task_name}.pkl', 'LastModified': datetime.datetime()}
Changing the problematic line as follows gave the intended behavior.
cache = self.local_cache.get(x['Key'])
As noted above, x is a dict and x['Key'] is the file name.
After changing the call so that x['Key'] is passed to local_cache.get(), I got output in the following format.
[{'task_name': '',
'task_params': {},
'task_log': {'file_path': ['s3://***.pkl']},
'last_modified': datetime.datetime(),
'task_hash': ''},
...
]
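The mismatch described above can be reproduced in isolation. This is only a minimal sketch, not thunderbolt's actual code: `to_file_name` is a hypothetical helper, the task names are made up, and the entry shape (a dict with a 'Key' field) is taken from the example in this issue.

```python
import datetime
from typing import Any, Dict, Union


def to_file_name(entry: Union[str, Dict[str, Any]]) -> str:
    """Return the file name for a listing entry (hypothetical helper).

    Accepts either a plain str key or a dict that carries the key under 'Key'.
    """
    if isinstance(entry, dict):
        return entry['Key']
    return entry


# Made-up entries: one dict-shaped (as in this issue) and one plain str.
entries = [
    {'Key': '/resources/log/task_log/TaskA_0123.pkl', 'LastModified': datetime.datetime(2020, 1, 1)},
    '/resources/log/task_log/TaskB_4567.pkl',
]
file_names = [to_file_name(e) for e in entries]

# str methods such as split() now work on every element.
base_names = [name.split('/')[-1] for name in file_names]
```

Normalizing at the boundary like this keeps anything that expects a str file name unchanged, at the cost of silently accepting both shapes.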
Apologies if I am misusing the library in the first place.
S3 initialization takes too long,
because keys are fetched from S3 recursively.
The example is outdated :(
Publish to PyPI using TravisCI.
There is a mistake in a comment:
https://github.com/m3dev/thunderbolt/blob/master/thunderbolt/client/gcs_client.py
Please use $GCS_CREDENTIAL.
ref : https://github.com/m3dev/gokart/blob/master/gokart/gcs_config.py#L17
IMO, it would be great if we could get the most recent "k" items rather than only the most recent one.
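One way this could look, as a rough sketch only: `recent_tasks` is a hypothetical function, not part of thunderbolt, and the task dicts are made up to mirror the 'task_name' / 'last_modified' / 'task_hash' fields shown elsewhere on this page.

```python
import datetime
import heapq
from typing import Any, Dict, List


def recent_tasks(tasks: List[Dict[str, Any]], task_name: str, k: int = 1) -> List[Dict[str, Any]]:
    """Return up to k tasks named task_name, newest first (hypothetical helper)."""
    matching = [t for t in tasks if t['task_name'] == task_name]
    return heapq.nlargest(k, matching, key=lambda t: t['last_modified'])


# Made-up task records for illustration.
tasks = [
    {'task_name': 'Hoge', 'task_hash': 'a', 'last_modified': datetime.datetime(2020, 1, 1)},
    {'task_name': 'Hoge', 'task_hash': 'b', 'last_modified': datetime.datetime(2020, 1, 3)},
    {'task_name': 'Piyo', 'task_hash': 'c', 'last_modified': datetime.datetime(2020, 1, 4)},
    {'task_name': 'Hoge', 'task_hash': 'd', 'last_modified': datetime.datetime(2020, 1, 2)},
]
latest_two = recent_tasks(tasks, 'Hoge', k=2)
```

With k=1 this reduces to the current "most recent one" behavior, so a default of 1 would keep existing callers working.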
redshells dumps .zip files.
https://github.com/m3dev/redshells
But we can't open .zip files.
gokart/file_processor.py in make_file_processor(file_path)
187
188 extension = os.path.splitext(file_path)[1]
--> 189 assert extension in extension2processor, f'{extension} is not supported. The supported extensions are {list(extension2processor.keys())}.'
190 return extension2processor[extension]
AssertionError: .zip is not supported. The supported extensions are ['.txt', '.csv', '.tsv', '.pkl', '.gz', '.json', '.xml', '.npz'].
We have to support unzip -> load.
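A rough sketch of the unzip -> load idea using only the standard library. `load_zip` is a hypothetical name, and returning raw bytes per member is an assumption; how the extracted files should actually be deserialized depends on what redshells put in the archive.

```python
import os
import tempfile
import zipfile
from typing import Dict


def load_zip(file_path: str) -> Dict[str, bytes]:
    """Extract a .zip archive to a temporary directory and read each member file."""
    contents: Dict[str, bytes] = {}
    with tempfile.TemporaryDirectory() as tmp_dir:
        with zipfile.ZipFile(file_path) as archive:
            archive.extractall(tmp_dir)
            for name in archive.namelist():
                member_path = os.path.join(tmp_dir, name)
                # Skip directory entries; only regular files are read.
                if os.path.isfile(member_path):
                    with open(member_path, 'rb') as f:
                        contents[name] = f.read()
    # The temporary directory is removed automatically on exit from the with block.
    return contents


# Usage: build a small archive, then load it back.
with tempfile.TemporaryDirectory() as work_dir:
    zip_path = os.path.join(work_dir, 'model.zip')
    with zipfile.ZipFile(zip_path, 'w') as archive:
        archive.writestr('weights.txt', 'hello')
    loaded = load_zip(zip_path)
```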
Can't load files from S3.
using poetry
support deleted file
'task_params': pickle.load(self.gcs_client.download(x.replace('task_log', 'task_params'))),
'task_log': pickle.load(self.gcs_client.download(x)),
Because GitHub changed its pricing structure, we need to go back to TravisCI from GitHub Actions.
_人人人人人人人人人人人_
> Deadline is 11/13. <
 ̄Y^Y^Y^Y^Y^Y^Y^Y^YY^ ̄
use github actions
issue: #26
# normal speed
tb = Thunderbolt()
# speed up
tb = Thunderbolt(task_filters=['Hoge', 'Piyo'])
# ideal
tb = Thunderbolt()
data = tb.get_data('Hoge') # get log and param data, so get_data
df = tb.get_task_df(task_filters=['Hoge', 'Piyo']) # get all log and param data, so get task_df
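The speed-up from task_filters presumably comes from skipping log files before they are downloaded. A minimal sketch of that idea, assuming substring matching on the file name; `filter_log_files` is a hypothetical helper, and thunderbolt's actual matching rule may differ.

```python
from typing import List, Optional


def filter_log_files(files: List[str], task_filters: Optional[List[str]] = None) -> List[str]:
    """Keep only files whose name contains one of the filter strings.

    With no filters, every file is kept (the "normal speed" case).
    """
    if not task_filters:
        return list(files)
    return [f for f in files if any(name in f for name in task_filters)]


# Made-up file names for illustration.
files = [
    'log/task_log/Hoge_aaa.pkl',
    'log/task_log/Piyo_bbb.pkl',
    'log/task_log/Fuga_ccc.pkl',
]
selected = filter_log_files(files, task_filters=['Hoge', 'Piyo'])
```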
support gcs
Often, we want to get the newest gokart output.
I don't want to look up the task_id with get_task_df.
A convenient method is needed!
We must use an absolute path.
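Until relative paths are handled, one workaround is to normalize the workspace directory before passing it in. A small sketch under assumptions: `to_absolute` is a hypothetical helper, and treating anything containing '://' as a remote URI that should be left alone is my guess, not documented behavior.

```python
import os


def to_absolute(workspace_directory: str) -> str:
    """Return an absolute local path; leave remote URIs untouched."""
    if '://' in workspace_directory:  # e.g. s3://bucket_name/prefix
        return workspace_directory
    return os.path.abspath(os.path.expanduser(workspace_directory))


local_dir = to_absolute('./resources')
remote_dir = to_absolute('s3://bucket_name/prefix')
```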
Speed up by caching logs and params locally.
def _get_gcs_object_info(self, x: str) -> Dict[str, str]:
    bucket, obj = self.gcs_client._path_to_bucket_and_key(x)  # use cache
    return self.gcs_client.client.objects().get(bucket=bucket, object=obj).execute()
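To illustrate the local-cache idea, here is a minimal pickle-backed sketch. `SketchCache` and its `get`/`dump` methods are hypothetical and not thunderbolt's implementation; keying the cache on the last path component of the remote file name is an assumption.

```python
import os
import pickle
import tempfile
from typing import Any, Optional


class SketchCache:
    """Tiny on-disk cache keyed by remote file name (hypothetical)."""

    def __init__(self, cache_dir: str) -> None:
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, file_name: str) -> str:
        # Use only the last path component so remote keys map to flat local files.
        return os.path.join(self.cache_dir, file_name.split('/')[-1])

    def get(self, file_name: str) -> Optional[Any]:
        """Return the cached object, or None on a cache miss."""
        path = self._path(file_name)
        if not os.path.exists(path):
            return None
        with open(path, 'rb') as f:
            return pickle.load(f)

    def dump(self, file_name: str, obj: Any) -> None:
        """Store obj so later get() calls skip the remote download."""
        with open(self._path(file_name), 'wb') as f:
            pickle.dump(obj, f)


cache = SketchCache(tempfile.mkdtemp())
key = 'gs://bucket/log/task_params/Hoge_aaa.pkl'
miss = cache.get(key)            # nothing cached yet
cache.dump(key, {'param': 1})    # would normally follow a remote download
hit = cache.get(key)
```

A cache like this never invalidates, which is probably fine if log and param files are immutable once written, but that is worth confirming.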
~/.pyenv/versions/3.7.3/lib/python3.7/site-packages/thunderbolt/client/gcs_client.py in get_tasks(self)
52 continue
53
---> 54 if len(tasks_list) != len(files):
55 warnings.warn(f'[NOT FOUND LOGS] target file: {len(files)}, found log file: {len(tasks_list)}')
56
TypeError: object of type 'generator' has no len()
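The TypeError above is what Python raises whenever len() is applied to a generator. A minimal reproduction and one possible fix, materializing the generator into a list once; `iter_log_files` is a made-up stand-in for whatever listing call returns the generator.

```python
import warnings
from typing import Iterator, List


def iter_log_files() -> Iterator[str]:
    """Stand-in for a listing call that yields file names lazily."""
    yield 'task_log/Hoge_aaa.pkl'
    yield 'task_log/Piyo_bbb.pkl'


# Materialize once so len() works and the files can be iterated more than once.
files: List[str] = list(iter_log_files())
tasks_list = [f for f in files if 'Hoge' in f]

if len(tasks_list) != len(files):
    warnings.warn(f'[NOT FOUND LOGS] target file: {len(files)}, found log file: {len(tasks_list)}')
```

The trade-off is that the full listing is held in memory; counting while iterating would avoid that if the listing is very large.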
If all_data is set to False, a KeyError is raised when the workspace is empty.
Additionally, when the workspace is empty, the returned DataFrame should still contain the correct columns even though it has no rows. Right now the DataFrame contains no columns when the workspace is empty and all_data is set to True.
turn off tqdm. It's noisy.
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-3-c64798315aa3> in <module>
----> 1 d = tb.get_data('RecommendForDoctor')
2 doctor = tb.get_data('AddDoctorQueryInfo')
3 d['flag'] = d['system_cd'].apply(lambda x: len(get_words(x))>5)
4 print(len(d[d['flag']]))
~/.pyenv/versions/3.6.8/lib/python3.6/site-packages/thunderbolt/thunderbolt.py in get_data(self, task_name)
70 df = self.get_task_df()
71 df = df.sort_values(by='last_modified', ascending=False)
---> 72 return self.load(df.query(f'task_name=="{task_name}"')['task_id'].iloc[0])
73
74 def load(self, task_id: int) -> Union[list, Any]:
~/.pyenv/versions/3.6.8/lib/python3.6/site-packages/thunderbolt/thunderbolt.py in load(self, task_id)
82 The return value is data or data list. This is because it may be divided when dumping by gokart.
83 """
---> 84 data = [self._target_load(x) for x in self.tasks[task_id]['task_log']['file_path']]
85 data = data[0] if len(data) == 1 else data
86 return data
~/.pyenv/versions/3.6.8/lib/python3.6/site-packages/thunderbolt/thunderbolt.py in <listcomp>(.0)
82 The return value is data or data list. This is because it may be divided when dumping by gokart.
83 """
---> 84 data = [self._target_load(x) for x in self.tasks[task_id]['task_log']['file_path']]
85 data = data[0] if len(data) == 1 else data
86 return data
~/.pyenv/versions/3.6.8/lib/python3.6/site-packages/thunderbolt/thunderbolt.py in _target_load(self, file_name)
105 shutil.rmtree(tmp_path)
106 return model
--> 107 return gokart.target.make_target(file_path=file_path).load()
~/.pyenv/versions/3.6.8/lib/python3.6/site-packages/gokart-0.3.6-py3.6.egg/gokart/target.py in load(self)
55
56 def load(self) -> Any:
---> 57 with self._target.open('r') as f:
58 return self._processor.load(f)
59
~/.pyenv/versions/3.6.8/lib/python3.6/site-packages/luigi/local_target.py in open(self, mode)
163
164 elif rwmode == 'r':
--> 165 fileobj = FileWrapper(io.BufferedReader(io.FileIO(self.path, mode)))
166 return self.format.pipe_reader(fileobj)
167
FileNotFoundError: [Errno 2] No such file or directory:
Can't use file path from $TASK_WORKSPACE_DIRECTORY.
https://github.com/m3dev/thunderbolt/blob/master/thunderbolt/thunderbolt.py#L27
To make data analysis easier, it would be great if thunderbolt supported running a gokart task directly, like this:
def gokart_run(self, task_name, options: List[str]):
    gokart.run([task_name, *options])
    return self.get_data(task_name)