
daliuge's Introduction

Data Activated 流 Graph Engine

License: LGPL v2.1

DALiuGE is a workflow graph development, management and execution framework, specifically designed to support very large scale processing graphs for the reduction of interferometric radio astronomy data sets. DALiuGE has already been used for processing large astronomical datasets in existing radio astronomy projects. It originated from a prototyping activity as part of the SDP Consortium called Data Flow Management System (DFMS). DFMS aimed to prototype the execution framework of the proposed SDP architecture.

For more information about the installation and usage of the system, please refer to the documentation.

Development and maintenance of DALiuGE is currently hosted at ICRAR and is performed by the DIA team.

See the docs/ directory for more information, or visit our online documentation

[1]流 (pronounced Liu) is the Chinese character for "flow".


daliuge's Issues

Install issue for dev

I successfully installed and compiled the daliuge code (option dev) following the instructions in https://daliuge.readthedocs.io/en/latest/installing.html

But when I try to run it, it fails in the translator. Output below, but any ideas?
dep works ...

rdodson@dep66340 daliuge-engine % ./run_engine.sh dev
./run_engine.sh: line 12: nvidia-docker: command not found
Running Engine development version in background...
docker run -td --shm-size=2g --ipc=shareable --rm --name daliuge-engine -v /var/run/docker.sock:/var/run/docker.sock -p 5555:5555 -p 6666:6666 -p 8000:8000 -p 8001:8001 -p 8002:8002 -p 9000:9000 --user 503:20 --group-add 20 -v /Users/rdodson/dlg/workspace/settings/passwd:/etc/passwd -v /Users/rdodson/dlg/workspace/settings/group:/etc/group -v /Users/rdodson/dlg:/Users/rdodson/dlg --env DLG_ROOT=/Users/rdodson/dlg icrar/daliuge-engine:master
fac46404c22aa459a1682755d9c8a26875c7bf0497085d37ab74aa7b3aca3add
{"pid": 45}% r

dodson@dep66340 daliuge-engine % cd ../daliuge-translator
rdodson@dep66340 daliuge-translator % ./run_translator.sh dev
Running Translator development version in foreground...
    Traceback (most recent call last):
      File "/dlg/bin/dlg", line 8, in <module>
        sys.exit(run())
      File "/dlg/lib/python3.8/site-packages/dlg/common/tool.py", line 166, in run
        commands[cmd][1](...)
      File "/dlg/lib/python3.8/site-packages/dlg/common/tool.py", line 106, in wrapped
        f(parser, *args, **kwargs)
      File "/dlg/lib/python3.8/site-packages/dlg/common/tool.py", line 99, in __call__
        module = importlib.import_module(modname)
      File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
      File "<frozen importlib._bootstrap>", line 991, in _find_and_load
      File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 848, in exec_module
      File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
      File "/dlg/lib/python3.8/site-packages/dlg/dropmake/web/translator_rest.py", line 968, in <module>
        default=AlgoParams(),
      File "/dlg/lib/python3.8/site-packages/pydantic/main.py", line 159, in __init__
        __pydantic_self__.__pydantic_validator__.validate_python(data, self_instance=__pydantic_self__)
    pydantic_core._pydantic_core.ValidationError: 9 validation errors for AlgoParams
    min_goal
      Field required [type=missing, input_value={}, input_type=dict]
        For further information visit https://errors.pydantic.dev/2.1/v/missing
    ptype
      Field required [type=missing, input_value={}, input_type=dict]
        For further information visit https://errors.pydantic.dev/2.1/v/missing
    max_load_imb
      Field required [type=missing, input_value={}, input_type=dict]
        For further information visit https://errors.pydantic.dev/2.1/v/missing
    max_cpu
      Field required [type=missing, input_value={}, input_type=dict]
        For further information visit https://errors.pydantic.dev/2.1/v/missing
    time_greedy
      Field required [type=missing, input_value={}, input_type=dict]
        For further information visit https://errors.pydantic.dev/2.1/v/missing
    deadline
      Field required [type=missing, input_value={}, input_type=dict]
        For further information visit https://errors.pydantic.dev/2.1/v/missing
    topk
      Field required [type=missing, input_value={}, input_type=dict]
        For further information visit https://errors.pydantic.dev/2.1/v/missing
    swarm_size
      Field required [type=missing, input_value={}, input_type=dict]
        For further information visit https://errors.pydantic.dev/2.1/v/missing
    max_mem
      Field required [type=missing, input_value={}, input_type=dict]
        For further information visit https://errors.pydantic.dev/2.1/v/missing
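The failure above is standard pydantic v2 behaviour: fields without defaults are required, and validation runs at construction time, so calling AlgoParams() with no arguments raises. A minimal sketch reproducing (and fixing) the error with a reduced stand-in model; the field names are taken from the error output, and the real AlgoParams class is richer than this:

```python
from pydantic import BaseModel, ValidationError


class AlgoParams(BaseModel):
    # Reduced stand-in for the real AlgoParams: only two of the nine
    # reported fields are shown. No defaults, so both are required.
    min_goal: int
    ptype: int


try:
    AlgoParams()  # every field is "missing"
except ValidationError as exc:
    print(len(exc.errors()))  # 2


class AlgoParamsFixed(BaseModel):
    # Giving each field a default restores the old "instantiate with
    # no arguments" behaviour relied upon by translator_rest.py.
    min_goal: int = 0
    ptype: int = 0


AlgoParamsFixed()  # constructs without error
```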

Erroneous, unlinked node produced when graph containing Groupby/Scatter is unrolled.

I am translating a graph which has a Scatter component, in which there are application and data drops. These drops are properly unrolled; however, there is an additional node that is created, but is not connected to anything.

When unrolled, it looks like the following:

  {
    "oid": "1_-14_0",
    "type": "app",
    "app": "dlg.apps.simple.SleepApp",
    "loop_cxt": null,
    "sleepTime": 4,
    "tw": 4,
    "num_cpus": 1,
    "iid": "0",
    "lg_key": -14,
    "dt": "Component",
    "nm": "SolveNEScatter",
    "node": "#0",
    "island": "#0"
  }

I believe this is the result of a placeholder function that has not been deprecated in the dlg.dropmake.pg_generator module.

In the class pg_generator.LG, the method lgn_to_pgn calls make_single_drop(miid) in the following context:


    if extra_links_drops and not lgn.is_loop():  # make GroupBy and Gather drops
        src_gdrop = lgn.make_single_drop(miid)
        self._drop_dict[lgn.id].append(src_gdrop)
        if lgn.is_groupby():
            self._drop_dict['new_added'].append(src_gdrop['grp-data_drop'])
        elif lgn.is_gather():
            pass

Following through to make_single_drop, we see a function call to _create_test_drop_spec(oid, kwargs):

        dropSpec = self._create_test_drop_spec(oid, kwargs)

There is a comment in this code stating that it is a test function, and it hard-codes 'app': 'dlg.apps.simple.SleepApp' into a node.
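As a diagnostic, the stray drop can be spotted mechanically: it appears in the unrolled physical graph but neither references nor is referenced by any other drop. A rough stdlib-only sketch (the link-field names checked here are assumptions about the PGT schema, not an exhaustive list):

```python
# Hypothetical helper: find drops in an unrolled PGT that link to nothing
# and that nothing links to. LINK_KEYS is an assumed subset of the schema.
LINK_KEYS = ("inputs", "outputs", "producers", "consumers")


def find_unlinked(drops):
    referenced = set()
    for drop in drops:
        for key in LINK_KEYS:
            for entry in drop.get(key, []):
                # Entries may be plain oids or {oid: port} dicts.
                referenced.update(entry if isinstance(entry, dict) else [entry])
    return [
        d for d in drops
        if d["oid"] not in referenced
        and not any(d.get(k) for k in LINK_KEYS)
    ]


drops = [
    {"oid": "A", "outputs": ["B"]},
    {"oid": "B", "producers": ["A"]},
    {"oid": "1_-14_0", "app": "dlg.apps.simple.SleepApp"},  # the stray node
]
print([d["oid"] for d in find_unlinked(drops)])  # ['1_-14_0']
```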

Any information on how to stop this from occurring would be greatly appreciated!

Updates to PGT cause deploy graph failure for `node_list` with only one node.

Environment

EAGLE: eagle.icrar.org
DALiuGE: Translator, Node Manager, Data Island Manager (all local).

Issue

Changes introduced in 6a7bf50 lead to the following error when attempting to translate and deploy a graph locally:

"Failed to deploy physical graph: Invalid new_num_parts 0"

Specifically, it looks like the issue starts in the PGT class prior to partitioning, with how we initialise the nm_list.

    is_list = node_list[0:num_islands]
    nm_list = node_list[num_islands:]

I believe this is an off-by-one error, with Python's list slicing partly to blame: an IndexError is not thrown if we slice from an index that is out of range; the slice just returns an empty list.

If num_islands=1, and we have node_list = ['node1', 'node2'], we will end up with:

  • is_list == ['node1']
  • nm_list == ['node2']

However, if we have only node_list = ['node1']:

  • is_list == ['node1'] # index [0:1]
  • nm_list == [] # index [1:], which is past the end, so the slice is silently empty
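The behaviour is easy to confirm in isolation, since out-of-range slices silently return empty lists rather than raising. A quick sketch, including the effect of sliding the second slice back by one:

```python
num_islands = 1

# Two nodes: island manager and node manager split as intended.
node_list = ['node1', 'node2']
assert node_list[0:num_islands] == ['node1']   # is_list
assert node_list[num_islands:] == ['node2']    # nm_list

# One node: the nm_list slice starts past the end and is silently empty.
node_list = ['node1']
assert node_list[0:num_islands] == ['node1']   # is_list
assert node_list[num_islands:] == []           # nm_list: no node managers!

# Starting the slice at num_islands - 1 keeps nm_list non-empty, at the
# cost of reusing 'node1' as both island manager and node manager.
assert node_list[num_islands - 1:] == ['node1']
```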

Solution

I have implemented a workaround with the following changes:

    is_list = node_list[0:num_islands]
    nm_list = node_list[num_islands-1:] 

However, I may be missing some nuance in the construction of the island manager/node manager list, so am interested to hear more on the preferred solution.

Inconsistent translator naming convention for Categories.SCATTER drops

Context

When running the translation from an LGT designed in EAGLE, there are inconsistencies in the named output in the translated PGT.

In a *.graph file, a normal application node (e.g. PythonApp) will have the name stored in the text field:

"nodeDataArray": [
            ....
            "category": "PythonApp",
            "categoryType": "Application",
            ...
            "text": "UpdateGSM",

After translation, this produces the following (abridged) output:

...
"dt": "PythonApp", 
"nm": "UpdateGSM",
...

This is a nice form of 'output sugar', as it makes keeping track of the applications in other contexts easier than following the OIDs, which have no contextual meaning.

When the node is a Scatter or Gather drop, the following occurs instead, in the dlg.dropmake.dm_utils.convert_construct function:

if has_app[0] == "i":
    app_node["text"] = node["inputApplicationName"]
else:
    app_node["text"] = node["outputApplicationName"]

This leads to an ambiguous naming convention for the Scatter drops in the PGT file:

# Scatter drop naming in PGT JSON output
"dt": "PythonApp", 
"nm": "Python App"

Request

There is no clear context for why the app_node["text"] assignment (and the subsequent name change in the drop's 'nm' field) requires this behaviour. There may be a historical reason for it, but updating to the following makes no difference to the translation:

if has_app[0] == "i":
    app_node["text"] = node["text"]
else:
    app_node["text"] = node["text"]

Any discussion or pointers on this would be greatly appreciated.

metis>=0.2a3 not in pypi

Heya,

When I try to install DALiuGE in my virtualenv with pip install ., it crashes at the step of installing metis because:

Collecting metis>=0.2a3 (from daliuge==0.4.0)
  Could not find a version that satisfies the requirement metis>=0.2a3 (from daliuge==0.4.0) (from versions: 0.2a1, 0.2a2)
No matching distribution found for metis>=0.2a3 (from daliuge==0.4.0)

I guess I can hack around it, but you probably want to know :)

Example SDP Graph translator error

The example SDP continuum imaging graph found in the EAGLE-graph-repo does not translate, giving the following error:

dlg.dropmake.pg_generator.GInvalidNode: Loop 'Major Cycle' should have at least one Start Component and one End Data

This breaks using both the dlg unroll command line utility in a local python3 venv, and the Docker-based daliuge-translator.

python-cwlgen is deprecated (used in daliuge-translator)

The python-cwlgen repository is archived as read-only by the maintainers: https://github.com/common-workflow-lab/python-cwlgen

It is recommended to update to cwl-utils moving forward: https://github.com/common-workflow-language/cwl-utils

In daliuge-translator/setup.py

install_requires = [
    "bottle",
    "cwlgen",
    "daliuge-common==%s" % (VERSION,),
    "metis>=0.2a3",
    # Python 3.6 is only supported in NetworkX 2 and above
    # But we are not compatible with 2.4 yet, so we need to constrain that
    "networkx<2.4; python_version<'3.6'",
    "networkx<2.4,>= 2.0; python_version>='3.6.0'",
    "numpy",
    "psutil",
    "pyswarm",
    # 1.10 contains an important race-condition fix on lazy-loaded modules
    "six>=1.10",
]
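A possible direction, assuming the CWL-generating code is also migrated to the cwl-utils API (this is not a drop-in replacement, so swapping the dependency alone is not sufficient):

```python
install_requires = [
    "bottle",
    "cwl-utils",  # replaces the archived cwlgen / python-cwlgen
    "daliuge-common==%s" % (VERSION,),
    # ... remaining entries unchanged ...
]
```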

NetworkX outdated (<2.4)

The daliuge-translator requirements.txt file requires a version of NetworkX < 2.4.

Given that 2.4 was released in 2019, this is a breaking issue for code bases that have updated to any version of NetworkX released since then.

How to get the input

I learned from an SDP memo that you use data generated by OSKAR as the input. Does that mean the DAG can be generated from a .fits file, or is there another way to obtain the DAG?

Updated port names should break existing tests

The change to 'port names' described in 01d11f4 should be reflected in the test cases, yet the existing test cases still pass. This demonstrates a defect in the coupling and test management of the translator.

The previous result in translating from a Logical Graph (LG) to a Physical Graph Template (PGT) led to:

    "producers": [
      "1_-13_0/3/1/1"
    ],
    "consumers": [
      "1_-19_0/3/1/1"
    ]

Now the approach is:

"producers": 
    [{"1_-13_0/0/0/0": "event"}],
"consumers": 
    [{"1_-19_0/0/0/0": "event"}]

This appears to pass all tests in the translator, but that is because test_pg_gen.py does not test the output of the translation. In fact, most of the tests contain no assertions at all, meaning all they confirm is that no runtime errors occur.

The reason for this failure is that the test cases use older graphs (daliuge-engine/test/graphs). The only places I can see where "consumers"/"producers" are invoked are in the daliuge-engine code base (e.g. test_dm.py). However, these drops are created as dictionaries rather than created as mocked drop objects based on dropdict in daliuge-common. If there is a change to a foundational class like this, it should be reflected in failing tests across the codebases.

For example, an example hard-coded reflection of dropdict in test_dm.py passes just fine:

      {
          "oid": "C",
          "type": "plain",
          "storage": Categories.MEMORY,
          "producers": ["B"],
      },

However, these tests should fail given that they use a now-outdated approach to managing "ports".
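One way to make this failure visible is a structural assertion over the translated output, checking for the new port form rather than merely running the translation. A hedged sketch (the function and data shapes are illustrative, not the project's actual test API):

```python
# Hypothetical structural check: in the new format, producers/consumers
# entries are {oid: event} dicts rather than bare oid strings.
def uses_new_port_format(drop):
    for key in ("producers", "consumers"):
        for entry in drop.get(key, []):
            if not isinstance(entry, dict):
                return False
    return True


old_style = {"oid": "C", "producers": ["1_-13_0/3/1/1"]}
new_style = {"oid": "C", "producers": [{"1_-13_0/0/0/0": "event"}]}

assert not uses_new_port_format(old_style)   # should now fail validation
assert uses_new_port_format(new_style)
```

A translator test built on a check like this would have failed when 01d11f4 landed, instead of passing silently.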

(As an aside, is it possible to entertain the idea of changing "ports" as the collective term for producers and consumers? Given the codebase also contains network and server operations, in which ports have a functionally different meaning, a better naming convention could perhaps be devised.)

DLM.DropChecker 'disappeared' condition causing possible graph execution race condition

Issue

Note: This is reproducible, but succeeds intermittently, making it difficult to determine what is the exact root cause across multiple sessions.

I was testing whether I can get a translated SubGraph running with the DIM/NM, prior to updating the node dropclass. To do so, I've used a basic SleepApp as the InputApplication for the SubGraph construct and a CopyApp as the OutputApplication.

In running the graph locally (which translates, partitions, and runs the 'main graph' fine), it has revealed a race condition where sometimes the middle File data drop (see below) is marked as disappeared, even though it has yet to be created by the SleepApp drop:

(Screenshot omitted: the File drop shown in the 'disappeared' state.)

This leads to the drop being removed from the DLM, and any future attempt to interact with it causes the rest of the graph to fail due to a missing drop.uid key:

  File "/home/rwb/github/daliuge-icrar/daliuge-engine/dlg/lifecycle/dlm.py", line 475, in handleOpenedDrop
    drop = self._drops[uid]
KeyError: '2024-07-08T16:10:07_-5_0/0'
Outcome graphs: one pass run and one fail run (screenshots omitted).

Note: These graphs were run immediately after each other.

It appears the race condition is caused by the timeout of the DropChecker overlapping with the SleepApp, but it's unclear why this occurs for this SleepApp/File combination, and not others.

To confirm that this isn't a result of the subgraph-specific work, I've tested this with a similarly-structured 'normal' graph (attached), and replicated the failure.

Proposed solution

I don't know if this is actually a bug, or expected behaviour with something like the SleepApp, given that this code is from way back when the project originated. I suspect that the issue may be user error, and I've missed a configuration option. I have attached the alternative graph I made to replicate the bug.

TestSleepRaceNoSubGraph.txt

Emptying a text box in drop settings

I have an issue with the DALiuGE graph editor in Google Chrome. I have a graph with several drops. I wanted to change the dirname in a File drop, but accidentally set the filepath instead. So I click the File drop and empty the filepath text box in the DROP Parameters. After pressing return, the box is nicely empty, as expected. When clicking the Accept button, however, the text reappears in the settings box. There is essentially no way (that I know of) to unset a set variable in the DROP Parameters in the graph editor.

Install issues with DALiuGE

I have been trying to use the daliuge-translator to do some unrolling/partitioning of files, and have been having issues.

When I install DALiuGE from the main directory, using the local copy cloned from the master branch, it uses an old version of the common/__init__.py file. However, when I install the components separately (i.e., entering the daliuge-translator/, daliuge-common/, and daliuge-runtime/ directories and installing each component from there), it works fine and uses the correct libraries.

I think the setup.py file is getting an older version of DALiuGE (maybe from PyPi?), and this is causing the conflicts.

Installing them all separately stops this from being a 'block', however.
