thunderbolt's People

Contributors

dependabot[bot], hirosassa, kohei-shinden, qqpann, vaaaaanquish, yuta100101

thunderbolt's Issues

[bug] Wrong input of local_cache.get in S3Client.get_tasks

Bug

I am trying to fetch the task list from S3 and implemented the code below.
Running it raises AttributeError: 'dict' object has no attribute 'split'.

from thunderbolt.client.s3_client import S3Client
workspace_directory = 's3://bucket_name/prefix'
client = S3Client(workspace_directory=workspace_directory)
tasks_list = client.get_tasks()

The error occurs in client.get_tasks().

Problem location

The following line in client.get_tasks() is where the problem occurs.

cache = self.local_cache.get(x)

Here, the signature of local_cache.get() shows that its input x must be a file name of type str.

def get(self, file_name: str) -> Optional[dict]:

However, the x that s3_client.get_tasks() actually passes to local_cache.get() is a dict like the following.

{'Key': '/resources/log/task_log/${task_name}.pkl', 'LastModified': datetime.datetime()}

Solution

Changing the problematic line as follows gave the expected behavior.

cache = self.local_cache.get(x['Key'])

As noted above, x is a dict and x['Key'] is the file name.
Passing x['Key'] to local_cache.get() therefore produced output in the following form.

[{'task_name': '',
  'task_params': {},
  'task_log': {'file_path': ['s3://***.pkl']},
  'last_modified': datetime.datetime(),
  'task_hash': ''},
  ...
]
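The mismatch described in this issue can be reproduced without S3. The sketch below is only an illustration: FakeLocalCache and cached_tasks are hypothetical stand-ins, not thunderbolt's real classes, and the cache is assumed to call split() on the file name, which is what the reported AttributeError suggests.

```python
import datetime
from typing import Dict, List, Optional


class FakeLocalCache:
    """Hypothetical stand-in for thunderbolt's local cache; only get() is modeled."""

    def __init__(self, store: Dict[str, dict]):
        self._store = store

    def get(self, file_name: str) -> Optional[dict]:
        # split() exists on str only, so passing a dict raises AttributeError.
        base_name = file_name.split('/')[-1]
        return self._store.get(base_name)


def cached_tasks(objects: List[dict], cache: FakeLocalCache) -> List[Optional[dict]]:
    # Each S3 listing entry is a dict; the file name is stored under 'Key'.
    return [cache.get(x['Key']) for x in objects]


objects = [{'Key': '/resources/log/task_log/Hoge.pkl',
            'LastModified': datetime.datetime(2020, 1, 1)}]
cache = FakeLocalCache({'Hoge.pkl': {'task_name': 'Hoge'}})
result = cached_tasks(objects, cache)
```

Calling cache.get(objects[0]) instead of cache.get(objects[0]['Key']) fails in the same way as the original report.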

Confirmation

Apologies if I am using this in an unintended way in the first place.

Can't load .zip

redshells dumps .zip files.
https://github.com/m3dev/redshells

However, we cannot open .zip files.

gokart/file_processor.py in make_file_processor(file_path)
    187 
    188     extension = os.path.splitext(file_path)[1]
--> 189     assert extension in extension2processor, f'{extension} is not supported. The supported extensions are {list(extension2processor.keys())}.'
    190     return extension2processor[extension]

AssertionError: .zip is not supported. The supported extensions are ['.txt', '.csv', '.tsv', '.pkl', '.gz', '.json', '.xml', '.npz'].

We have to support unzip -> load.
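One way the "unzip -> load" step could look is sketched below. This is not gokart's or thunderbolt's implementation: load_zip is a hypothetical helper, and it assumes the archive contains pickle files, which the issue does not confirm. It should only be used on trusted archives, since both extraction and unpickling execute on untrusted input at your own risk.

```python
import os
import pickle
import tempfile
import zipfile
from typing import Any, Dict


def load_zip(file_path: str) -> Dict[str, Any]:
    """Unzip into a temporary directory and unpickle every .pkl member (assumed layout)."""
    loaded = {}
    with tempfile.TemporaryDirectory() as tmp_dir:
        with zipfile.ZipFile(file_path) as zf:
            zf.extractall(tmp_dir)
            names = [n for n in zf.namelist() if n.endswith('.pkl')]
        for name in names:
            with open(os.path.join(tmp_dir, name), 'rb') as f:
                loaded[name] = pickle.load(f)
    # The temporary directory is removed automatically on exit.
    return loaded
```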

not found log file

Support the case where a log file has been deleted.

                'task_params': pickle.load(self.gcs_client.download(x.replace('task_log', 'task_params'))),
                'task_log': pickle.load(self.gcs_client.download(x)),
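A minimal sketch of skipping deleted files around the two lines above follows. The download callable is a hypothetical stand-in for gcs_client.download, and it is assumed to raise FileNotFoundError for a missing object; the real client may raise a different exception.

```python
import io
import pickle
import warnings
from typing import Callable, Dict, List


def collect_tasks(files: List[str], download: Callable[[str], io.BytesIO]) -> List[Dict]:
    """Load params and log for each task, skipping tasks whose files are gone."""
    tasks = []
    for x in files:
        try:
            tasks.append({
                'task_params': pickle.load(download(x.replace('task_log', 'task_params'))),
                'task_log': pickle.load(download(x)),
            })
        except FileNotFoundError:
            # The file was deleted after listing; skip it instead of failing.
            warnings.warn(f'[NOT FOUND LOGS] skipped: {x}')
    return tasks
```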

Return of TravisCI

Because GitHub changed its rate structure, we need to go back from GitHub Actions to TravisCI.

_人人人人人人人人人人人_
> Deadline is 11/13. <
 ̄Y^Y^Y^Y^Y^Y^Y^Y^YY^ ̄

don't search task log when __init__

issue: #26

# normal speed
tb = Thunderbolt()

# speed up
tb = Thunderbolt(task_filters=['Hoge', 'Piyo'])

# ideal
tb = Thunderbolt()
data = tb.get_data('Hoge')    # fetch log and param data here, at get_data
df = tb.get_task_df(task_filters=['Hoge', 'Piyo'])    # fetch all log and param data here, at get_task_df
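The "ideal" behavior above amounts to deferring the log search from __init__ to the first data request. A rough sketch of that pattern, using a hypothetical LazyTasks wrapper and a hypothetical search callable rather than thunderbolt's actual internals:

```python
from typing import Callable, Dict, List, Optional, Tuple


class LazyTasks:
    """Hypothetical wrapper that defers the task-log search until first use."""

    def __init__(self, search: Callable[[Optional[List[str]]], Dict[str, dict]]):
        self._search = search  # expensive scan of the workspace; not called here
        self._cache: Dict[Optional[Tuple[str, ...]], Dict[str, dict]] = {}

    def get_tasks(self, task_filters: Optional[List[str]] = None) -> Dict[str, dict]:
        key = None if task_filters is None else tuple(task_filters)
        if key not in self._cache:
            # Search only once per distinct filter set.
            self._cache[key] = self._search(task_filters)
        return self._cache[key]
```

Constructing LazyTasks costs nothing; the search runs on the first get_tasks call and its result is reused for later calls with the same filters.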

local cache

Speed up by caching logs and params locally.

    def _get_gcs_object_info(self, x: str) -> Dict[str, str]:
        bucket, obj = self.gcs_client._path_to_bucket_and_key(x)     # use cache
        return self.gcs_client.client.objects().get(bucket=bucket, object=obj).execute()
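As an illustration of the idea, a local cache could store each fetched result on disk under a key derived from the remote file name. The LocalCache class, its dump method, and fetch_with_cache below are assumptions made for this sketch; only the get(file_name) shape mirrors the signature quoted in another issue on this page.

```python
import hashlib
import os
import pickle
from typing import Callable, Optional


class LocalCache:
    """Hypothetical on-disk cache keyed by remote file name."""

    def __init__(self, cache_dir: str):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, file_name: str) -> str:
        digest = hashlib.sha256(file_name.encode()).hexdigest()
        return os.path.join(self.cache_dir, f'{digest}.pkl')

    def get(self, file_name: str) -> Optional[dict]:
        path = self._path(file_name)
        if not os.path.exists(path):
            return None
        with open(path, 'rb') as f:
            return pickle.load(f)

    def dump(self, file_name: str, value: dict) -> None:
        with open(self._path(file_name), 'wb') as f:
            pickle.dump(value, f)


def fetch_with_cache(file_name: str, cache: LocalCache,
                     fetch: Callable[[str], dict]) -> dict:
    cached = cache.get(file_name)
    if cached is not None:
        return cached
    value = fetch(file_name)  # slow remote call, made only on a cache miss
    cache.dump(file_name, value)
    return value
```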

[bug] generator has no len()

~/.pyenv/versions/3.7.3/lib/python3.7/site-packages/thunderbolt/client/gcs_client.py in get_tasks(self)
     52                 continue
     53 
---> 54         if len(tasks_list) != len(files):
     55             warnings.warn(f'[NOT FOUND LOGS] target file: {len(files)}, found log file: {len(tasks_list)}')
     56 

TypeError: object of type 'generator' has no len()
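The error goes away once the generator is materialized before len() is called. The traceback does not show which of the two variables is the generator, so the sketch below simply assumes it is files; count_mismatch is a hypothetical helper, not code from gcs_client.py.

```python
from typing import Iterable, List


def count_mismatch(files: Iterable[str], tasks_list: List[dict]) -> int:
    # A generator has no len() and can be consumed only once,
    # so convert it to a list before counting.
    files = list(files)
    return len(files) - len(tasks_list)


files = (name for name in ['a.pkl', 'b.pkl', 'c.pkl'])
missing = count_mismatch(files, [{'task_name': 'a'}, {'task_name': 'b'}])
```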

[Bug] get_task_df issues if workspace empty

If all_data is set to false, a KeyError exception is raised when the workspace is empty.

Additionally, when workspace is empty the empty dataframe should contain the correct columns even if there are no rows. Right now the dataframe contains no columns when workspace is empty and all_data is set to true.

FileNotFoundError relative path

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-3-c64798315aa3> in <module>
----> 1 d = tb.get_data('RecommendForDoctor')
      2 doctor = tb.get_data('AddDoctorQueryInfo')
      3 d['flag'] = d['system_cd'].apply(lambda x: len(get_words(x))>5)
      4 print(len(d[d['flag']]))

~/.pyenv/versions/3.6.8/lib/python3.6/site-packages/thunderbolt/thunderbolt.py in get_data(self, task_name)
     70         df = self.get_task_df()
     71         df = df.sort_values(by='last_modified', ascending=False)
---> 72         return self.load(df.query(f'task_name=="{task_name}"')['task_id'].iloc[0])
     73 
     74     def load(self, task_id: int) -> Union[list, Any]:

~/.pyenv/versions/3.6.8/lib/python3.6/site-packages/thunderbolt/thunderbolt.py in load(self, task_id)
     82             The return value is data or data list. This is because it may be divided when dumping by gokart.
     83         """
---> 84         data = [self._target_load(x) for x in self.tasks[task_id]['task_log']['file_path']]
     85         data = data[0] if len(data) == 1 else data
     86         return data

~/.pyenv/versions/3.6.8/lib/python3.6/site-packages/thunderbolt/thunderbolt.py in <listcomp>(.0)
     82             The return value is data or data list. This is because it may be divided when dumping by gokart.
     83         """
---> 84         data = [self._target_load(x) for x in self.tasks[task_id]['task_log']['file_path']]
     85         data = data[0] if len(data) == 1 else data
     86         return data

~/.pyenv/versions/3.6.8/lib/python3.6/site-packages/thunderbolt/thunderbolt.py in _target_load(self, file_name)
    105             shutil.rmtree(tmp_path)
    106             return model
--> 107         return gokart.target.make_target(file_path=file_path).load()

~/.pyenv/versions/3.6.8/lib/python3.6/site-packages/gokart-0.3.6-py3.6.egg/gokart/target.py in load(self)
     55 
     56     def load(self) -> Any:
---> 57         with self._target.open('r') as f:
     58             return self._processor.load(f)
     59 

~/.pyenv/versions/3.6.8/lib/python3.6/site-packages/luigi/local_target.py in open(self, mode)
    163 
    164         elif rwmode == 'r':
--> 165             fileobj = FileWrapper(io.BufferedReader(io.FileIO(self.path, mode)))
    166             return self.format.pipe_reader(fileobj)
    167 

FileNotFoundError: [Errno 2] No such file or directory: 

Support running gokart task.

To make data analysis easier, it would be great if thunderbolt supported running a gokart task, for example:

def gokart_run(self, task_name, options: List[str]):
    gokart.run([task_name, *options])
    return self.get_data(task_name)
