dask-docker's Introduction

Dask

Dask is a flexible parallel computing library for analytics. See documentation for more information.

LICENSE

New BSD. See License File.

dask-docker's People

Contributors

abduhbm, alanderex, amcnicho, asyd, charlesbluca, chrisroat, danielfrg, dask-bot, detroyejr, dimitar-petrov, github-actions[bot], holdenk, ian-r-rose, jacobtomlinson, jakirkham, javabrett, jcrist, jeffreybreen, jorandox, jrbourbeau, jsignell, kmadathil, mrocklin, ogrisel, raybellwaves, rileymcdowell, rmccorm4, tautvis, tomaugspurger, tonywangcn

dask-docker's Issues

Move default branch from "master" -> "main"

@jrbourbeau and I are in the process of moving the default branch for this repo from master to main.

  • Changed in GitHub
  • Merged PR to change branch name in code. (#138)

What you'll see

Once the name on GitHub is changed (the first box above is checked, or this issue is closed), when you try to git pull you'll get:

Your configuration specifies to merge with the ref 'refs/heads/master'
from the remote, but no such ref was fetched.

What you need to do

First: head to your fork and rename the default branch there
Then:

git branch -m master main
git fetch origin
git branch -u origin/main main

Error: Unable to instantiate java compiler

What happened:

After installing Java and dask-sql using pip, whenever I try to run a SQL query from my python code I get the following error:

...
File "/home/vquery/.local/lib/python3.8/site-packages/dask_sql/context.py", line 378, in sql
    rel, select_names, _ = self._get_ral(sql)
  File "/home/vquery/.local/lib/python3.8/site-packages/dask_sql/context.py", line 515, in _get_ral
    nonOptimizedRelNode = generator.getRelationalAlgebra(validatedSqlNode)
java.lang.java.lang.IllegalStateException: java.lang.IllegalStateException: Unable to instantiate java compiler
...
...
File "JaninoRelMetadataProvider.java", line 426, in org.apache.calcite.rel.metadata.JaninoRelMetadataProvider.compile
  File "CompilerFactoryFactory.java", line 61, in org.codehaus.commons.compiler.CompilerFactoryFactory.getDefaultCompilerFactory
java.lang.java.lang.NullPointerException: java.lang.NullPointerException

What you expected to happen:

I should get a dataframe as a result.

Minimal Complete Verifiable Example:

# The cluster/client setup is done first, in another module not the one executing the SQL query
# Also tried other cluster/scheduler types with the same error
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(
    n_workers=4,
    threads_per_worker=1,
    processes=False,
    dashboard_address=':8787',
    asynchronous=False,
    memory_limit='1GB'
    )
client = Client(cluster)

# The SQL code is executed in its own module
import dask.dataframe as dd
from dask_sql import Context

c = Context()
df = dd.read_parquet('/vQuery/files/results/US_Accidents_June20.parquet') 
c.register_dask_table(df, 'df')
df = c.sql("""select ID, Source from df""") # This line fails with the error reported

Anything else we need to know?:

As mentioned in the code snippet above, due to the way my application is designed, the Dask client/cluster setup is done before dask-sql context is created.

Environment:

  • Dask version:
    • dask: 2020.12.0
    • dask-sql: 0.3.1
  • Python version:
    • Python 3.8.5
  • Operating System:
    • Ubuntu 20.04.1 LTS
  • Install method (conda, pip, source):
    • pip

Install steps

$ sudo apt install default-jre

$ sudo apt install default-jdk

$ java -version
openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)

$ javac -version
javac 11.0.10

$ echo $JAVA_HOME
/usr/lib/jvm/java-11-openjdk-amd64

$ pip install dask-sql

$ pip list | grep dask-sql
dask-sql               0.3.1

Improve prepare.sh by adding apt install

Hi,

can you add something like:

if [ "$EXTRA_APT_PACKAGES" ]; then
    echo "EXTRA_APT_PACKAGES environment variable found.  Installing."
    apt install -y $EXTRA_APT_PACKAGES
fi
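For context, usage could then look like this (a minimal sketch, assuming prepare.sh keeps running at container start as it does today; the package names are only illustrative):

docker run -e EXTRA_APT_PACKAGES="graphviz git" daskdev/dask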

Thanks

Add new image for Dask hub

In order for a Docker image to be used with the Daskhub helm chart it needs dask-gateway and jupyterhub-singleuser to be installed.

Neither of our official images have those packages so I propose we either add them to the notebook image or create a new image specifically for Daskhub.
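A minimal sketch of what a dedicated Daskhub image could look like, assuming we extend the existing notebook image and pull both packages from conda-forge (the jupyterhub package is assumed to provide the jupyterhub-singleuser entrypoint):

FROM daskdev/dask-notebook:latest

# Daskhub needs dask-gateway (to talk to the gateway) and the
# jupyterhub-singleuser entrypoint (so JupyterHub can spawn the container)
RUN conda install -y -c conda-forge dask-gateway jupyterhub \
    && conda clean -afy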

update_graph() got an unexpected keyword argument 'actors'

I am not sure what would be causing the following error, as I haven't changed anything; it just seems to have started happening after rebuilding my images.

dask-scheduler_1  | distributed.core - ERROR - update_graph() got an unexpected keyword argument 'actors'
dask-scheduler_1  | Traceback (most recent call last):
dask-scheduler_1  |   File "/opt/conda/lib/python3.6/site-packages/distributed/core.py", line 321, in handle_comm
dask-scheduler_1  |     result = yield result
dask-scheduler_1  |   File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
dask-scheduler_1  |     value = future.result()
dask-scheduler_1  |   File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
dask-scheduler_1  |     yielded = self.gen.send(value)
dask-scheduler_1  |   File "/opt/conda/lib/python3.6/site-packages/distributed/scheduler.py", line 1923, in add_client
dask-scheduler_1  |     yield self.handle_stream(comm=comm, extra={'client': client})
dask-scheduler_1  |   File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
dask-scheduler_1  |     value = future.result()
dask-scheduler_1  |   File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
dask-scheduler_1  |     yielded = self.gen.send(value)
dask-scheduler_1  |   File "/opt/conda/lib/python3.6/site-packages/distributed/core.py", line 375, in handle_stream
dask-scheduler_1  |     handler(**merge(extra, msg))
dask-scheduler_1  | TypeError: update_graph() got an unexpected keyword argument 'actors'
dask-scheduler_1  | distributed.scheduler - INFO - Receive client connection: Client-b183835c-d2e2-11e8-8001-0242ac130007
dask-scheduler_1  | distributed.core - INFO - Starting established connection

here is my docker compose file

version: '3.3'
services:
    dask-scheduler:
        image: daskdev/dask:0.18.1
        command: ["dask-scheduler"]
        volumes:
            - ./mnt:/mnt
        env_file:
            - .env
            - .env.local
        ports:
            - 8786:8786
            - 8787:8787
    worker:
        image: daskdev/dask:0.18.1
        command: ["dask-worker", "dask-scheduler:8786"]
        volumes:
            - ./mnt:/mnt
        env_file:
            - .env
            - .env.local
        depends_on:
            - dask-scheduler

I have tried deleting all images, containers, and networks and then rebuilding everything, but when I start up my service I get the error above. I haven't changed my code; I get access to the client with the following:

def get_client() -> Client:
    client = Client('dask-scheduler:8786')
    while not client.scheduler_info()['workers']:
        print('Workers are asleep')
        time.sleep(1)
    client.restart()
    client.upload_file('/usr/app/src/hash.py')
    return client

I have taken all the code after the client out and it seems to make no difference; it throws this error the moment I try to use the client, and all the futures return as cancelled.

Here is example code I am running which seems to trigger the error:

futures = client.map(run_hash(dir), batches)

for future in as_completed(futures):
    if future.status == 'finished':
        files += future.result()
    else:
        print('{} - {}'.format(future.status, future.exception()))

I am using 0.18.1 because the helm charts currently use dask version 0.18.1

any help would be appreciated
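This error usually means the client is running a newer distributed than the 0.18.1 scheduler/worker images, so the client sends keywords (such as actors) that the scheduler does not recognise. A hedged way to confirm and work around it, assuming conda is on PATH inside the image:

# see which dask/distributed versions the image actually ships
docker run --rm daskdev/dask:0.18.1 conda list distributed

# then pin the client environment to the same versions, e.g.
pip install "dask==0.18.1" "distributed==<version reported above>"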

No 2.12.0 tag

Hello,

could you please add a 2.12.0 tag?

Thank you!

Is there any way to automate this, since I have seen other issues about missing tags?

Worker command in compose file fails

The command specified for the worker in docker-compose.yml fails with a "command not found" error and the worker service crashes. Setting it to a two-item list (i.e. ["dask-worker", "scheduler:8786"]) solves this problem.

Host system: OS X 10.13.6
Docker: 18.06.1-ce-mac73 (26764)
Compose: 1.22.0
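For reference, a hedged sketch of the worker service with the command written as a two-item list (service and image names follow the repository's docker-compose.yml):

  worker:
    image: daskdev/dask
    command: ["dask-worker", "scheduler:8786"]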

ImportError: Can not find the shared library: libhdfs3.so

Hi,

I'm trying to use hdfs3 with a distributed notebook to save/read parquets files.

However, after adding hdfs3 fastparquet to EXTRA_CONDA_PACKAGES, when I try to run the notebook it fails with:

ImportError: Can not find the shared library: libhdfs3.so

But the file exists: /opt/conda/lib/libhdfs3.so

Maybe we should add this directory via a file in /etc/ld.so.conf.d/?
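A minimal sketch of that idea, assuming the conda prefix is /opt/conda as in the current images (the file name conda.conf is illustrative):

# register the conda lib directory with the dynamic linker so shared
# libraries such as libhdfs3.so are found without setting LD_LIBRARY_PATH
RUN echo "/opt/conda/lib" > /etc/ld.so.conf.d/conda.conf \
    && ldconfig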

Thanks

Docker Compose fails build on Win 10 and macOS

docker-compose build base-notebook
Building base-notebook
#1 [internal] load git source github.com/jupyter/docker-stacks.git#master:base-notebook

#1 ERROR: subdir not supported yet
------
 > [internal] load git source github.com/jupyter/docker-stacks.git#master:base-notebook:
------
failed to solve with frontend dockerfile.v0: failed to read dockerfile: failed to load cache key: subdir not supported yet
Service 'base-notebook' failed to build : Build failed

Support of extra jupyterlab extention

It would be nice to support jupyterlab extensions in dask-notebook.
Something like this in prepare.sh:

if [ "$EXTRA_JL_EXTENTIONS" ]; then  
    echo "EXTRA_JL_EXTENTIONS environment variable found.  Installing."  
    jupyter labextension install $EXTRA_JL_EXTENTIONS  
fi
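If that were added, usage might look like the following (variable name as proposed above; the extension name is only an example):

docker run -e EXTRA_JL_EXTENSIONS="@jupyter-widgets/jupyterlab-manager" daskdev/dask-notebook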

joblib error with docker workers

This is the script I'm running:

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
import numpy as np
import pandas as pd


X, y = make_classification(n_samples=1000, random_state=0)
X[:5]

param_grid = {"C": np.logspace(-3, 1, 30),
              "gamma": [0.05, 0.5, 2],
              "kernel": ['rbf', 'poly', 'sigmoid'],
              "shrinking": [True, False]}

grid_search = GridSearchCV(SVC(gamma='auto', random_state=0, probability=True),
                           param_grid=param_grid,
                           return_train_score=False,
                           iid=True,
                           cv=3,
                           n_jobs=-1)



from sklearn.externals import joblib
dask_scheduler='172.17.0.1'

with joblib.parallel_backend('dask',scheduler_host=dask_scheduler+":8786", scatter=[X, y]):
    grid_search.fit(X, y)

I get the following error

distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x04\x95Q\x03\x00\x00\x00\x00\x00\x00\x8c\x0cjoblib._dask\x94\x8c\x05Batch\x94\x93\x94]\x94\x8c#sklearn.model_selection._validation\x94\x8c\x0e_fit_and_score\x94\x93\x94]\x94(\x8c\x13sklearn.svm.classes\x94\x8c\x03SVC\x94\x93\x94)\x81\x94}\x94(\x8c\x17decision_function_shape\x94\x8c\x03ovr\x94\x8c\x06kernel\x94\x8c\x03rbf\x94\x8c\x06degree\x94K\x03\x8c\x05gamma\x94\x8c\x04auto\x94\x8c\x05coef0\x94G\x00\x00\x00\x00\x00\x00\x00\x00\x8c\x03tol\x94G?PbM\xd2\xf1\xa9\xfc\x8c\x01C\x94G?\xf0\x00\x00\x00\x00\x00\x00\x8c\x02nu\x94G\x00\x00\x00\x00\x00\x00\x00\x00\x8c\x07epsilon\x94G\x00\x00\x00\x00\x00\x00\x00\x00\x8c\tshrinking\x94\x88\x8c\x0bprobability\x94\x88\x8c\ncache_size\x94K\xc8\x8c\x0cclass_weight\x94N\x8c\x07verbose\x94\x89\x8c\x08max_iter\x94J\xff\xff\xff\xff\x8c\x0crandom_state\x94K\x00\x8c\x10_sklearn_version\x94\x8c\x060.21.2\x94ub\x8c\x11distributed.utils\x94\x8c\nitemgetter\x94\x93\x94K\x00\x85\x94R\x94h$K\x01\x85\x94R\x94e}\x94(\x8c\x05train\x94h$K\x02\x85\x94R\x94\x8c\x04test\x94h$K\x03\x85\x94R\x94\x8c\nparameters\x94}\x94(h\x16\x8c\x15numpy.core.multiarray\x94\x8c\x06scalar\x94\x93\x94\x8c\x05numpy\x94\x8c\x05dtype\x94\x93\x94\x8c\x02f8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01<\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94bC\x08\xfc\xa9\xf1\xd2MbP?\x94\x86\x94R\x94h\x12G?\xa9\x99\x99\x99\x99\x99\x9ah\x0fh\x10h\x19\x88u\x8c\x06scorer\x94}\x94\x8c\x05score\x94\x8c\x16sklearn.metrics.scorer\x94\x8c\x13_passthrough_scorer\x94\x93\x94s\x8c\nfit_params\x94}\x94\x8c\x12return_train_score\x94\x89\x8c\x15return_n_test_samples\x94\x88\x8c\x0creturn_times\x94\x88\x8c\x11return_parameters\x94\x89\x8c\x0berror_score\x94\x8c\x11raise-deprecating\x94h\x1dK\x00u\x87\x94a\x85\x94R\x94.'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 61, in loads
    return pickle.loads(x)
ModuleNotFoundError: No module named 'joblib'
distributed.worker - WARNING - Could not deserialize task
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1272, in add_task
    self.tasks[key] = _deserialize(function, args, kwargs, task)
  File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 3060, in _deserialize
    function = pickle.loads(function)
  File "/opt/conda/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 61, in loads
    return pickle.loads(x)
ModuleNotFoundError: No module named 'joblib'
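The traceback says the workers cannot import joblib, so a hedged workaround is to install it (and scikit-learn) into the worker containers at start-up via the images' EXTRA_CONDA_PACKAGES hook, assuming prepare.sh still runs before the worker command:

docker run -e EXTRA_CONDA_PACKAGES="joblib scikit-learn" daskdev/dask dask-worker tcp://172.17.0.1:8786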

Automate updates with dask releases

It would be nice for this to be kept in line with dask releases automatically.

Perhaps some CI task which raises a PR to update versions?
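A hedged sketch of what such a job could look like as a scheduled GitHub Actions workflow; the version lookup, the file edits and the third-party PR action are illustrative, not a working pipeline:

name: check-dask-release
on:
  schedule:
    - cron: "0 6 * * *"   # once a day
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Look up the latest dask release on PyPI
        run: |
          latest=$(curl -s https://pypi.org/pypi/dask/json | python -c "import sys, json; print(json.load(sys.stdin)['info']['version'])")
          echo "latest dask release: $latest"
          # update the pinned versions in the Dockerfiles / build config here
      - name: Open a PR with the version bump
        uses: peter-evans/create-pull-request@v3   # assumption: third-party action
        with:
          title: Bump dask version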

CMD includes prepare and dask-scheduler

Currently the ENTRYPOINT and CMD entries in the base image look like the following:

ENTRYPOINT ["/usr/local/bin/dumb-init", "--"] 
CMD ["bash", "-c", "/usr/bin/prepare.sh && exec dask-scheduler"]

Often we use this image also for the dask-worker process, and replace the command with dask-worker. My understanding is that this will stop the prepare script from running properly. Is my understanding correct? If so then is there a clean way to move some of the arguments in the CMD line up to ENTRYPOINT?
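One hedged way to keep the preparation step is to make prepare.sh part of the entrypoint and have it exec whatever command is passed, so overriding the command no longer skips it (sketch only; it assumes prepare.sh ends with exec "$@"):

ENTRYPOINT ["/usr/local/bin/dumb-init", "--", "/usr/bin/prepare.sh"]
CMD ["dask-scheduler"]

# with prepare.sh ending in
#   exec "$@"
# `docker run daskdev/dask dask-worker scheduler:8786` still runs prepare.sh first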

Error pulling image configuration: unknown blob

Current pulls are failing with the error: error pulling image configuration: unknown blob.

$ docker pull daskdev/dask:latest
latest: Pulling from daskdev/dask
b8f262c62ec6: Already exists
0a43c0154f16: Already exists
906d7b5da8fb: Already exists
3568180997ed: Pulling fs layer
555c313ecf5a: Pulling fs layer
218fd3c9fea3: Pulling fs layer
error pulling image configuration: unknown blob

Reproduced locally and on CI for dask-kubernetes.

Default version of Python does not match the jupyter image

When using this container as part of the dask-distributed helm chart one faces the issue of inconsistent python versions:

  • 2.7 for this image
  • 3.5 (currently) for jupyter

I think we should stop providing 2.7 by default and instead use Python 3.6 for this image. Furthermore, it would be great to find a way to pass an environment.yml that is used to initialise both the jupyter image and the dask workers (and scheduler), perhaps via the Dockerfile ENTRYPOINT, so that all containers share compatible versions of Python and of shared libraries (dask and distributed in particular) and we avoid weird pickling and protocol issues.
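On the environment.yml point, the startup logs quoted in other issues here show prepare.sh checking for /opt/app/environment.yml, so a hedged way to share one environment across all containers is to mount the same file into each of them (assuming prepare.sh installs it when present):

docker run -v "$(pwd)/environment.yml:/opt/app/environment.yml" daskdev/dask dask-scheduler
docker run -v "$(pwd)/environment.yml:/opt/app/environment.yml" daskdev/dask dask-worker scheduler:8786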

sudo access for 'jovyan' not provided

It would appear that sudo access has not been provided for the default 'jovyan' user, or the credentials are missing from the documentation, but my docker research tells me that this is intentional for security reasons. The bigger issue is how to run the dask docker image with sudo access granted, which does not currently seem to be possible, as every variation of docker run I have tried exits with a 'no environment.yml' message:

>>> docker run --rm -it -p 8888:8888 -e GRANT_SUDO="yes" --user root daskdev/dask
+ '[' '' ']'
+ '[' -e /opt/app/environment.yml ']'
+ echo 'no environment.yml'
no environment.yml
+ '[' '' ']'
+ '[' '' ']'

and I have not been able to find any documentation explaining how to achieve sudo access in the docker-compose.yml file.
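For what it's worth, GRANT_SUDO comes from the jupyter/docker-stacks images that the notebook image is built on, so (hedged, untested here) it is the notebook image rather than daskdev/dask where it would be expected to have any effect:

docker run --rm -it -p 8888:8888 --user root -e GRANT_SUDO=yes daskdev/dask-notebook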

docker-compose up error: /usr/bin/prepare.sh: No such file or directory

I am trying to compose up the three docker images but I get the following errors:

C:\cygwin64\home\fcgr\code\dask-docker>docker-compose up
Starting dask-docker_worker_1     ... done
Starting dask-docker_scheduler_1  ... done
Recreating dask-docker_notebook_1 ... done
Attaching to dask-docker_worker_1, dask-docker_notebook_1, dask-docker_scheduler_1
worker_1       |  [dumb-init] /usr/bin/prepare.sh: No such file or directory
notebook_1   |  [FATAL tini (6)] exec /usr/bin/prepare.sh failed: No such file or directory
scheduler_1   |  [dumb-init] /usr/bin/prepare.sh: No such file or directory
dask-docker_worker_1 exited with code 2
dask-docker_notebook_1 exited with code 127
dask-docker_scheduler_1 exited with code 2

I am using Docker Version 18.06.1-ce-win73

AttributeError: 'MaterializedLayer' object has no attribute 'pack_annotations' when running example notebooks

What happened:
I am not fully sure whether this is a bug or due to an incorrect setup/installation.
However, I am using the provided docker-compose to test a local dockerized instance of dask, and I can't execute any job on it.

Currently, I simply tried a few of the provided example notebooks (e.g. number 4), and they did not run correctly. The following error is returned: AttributeError: 'MaterializedLayer' object has no attribute 'pack_annotations'
Here is the stack trace:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-a7bc8667f5ea> in <module>
----> 1 x = x.persist()
      2 progress(x)

/opt/conda/lib/python3.8/site-packages/dask/base.py in persist(self, **kwargs)
    253         dask.base.persist
    254         """
--> 255         (result,) = persist(self, traverse=False, **kwargs)
    256         return result
    257 

/opt/conda/lib/python3.8/site-packages/dask/base.py in persist(*args, **kwargs)
    754             else:
    755                 if client.get == schedule:
--> 756                     results = client.persist(
    757                         collections, optimize_graph=optimize_graph, **kwargs
    758                     )

/opt/conda/lib/python3.8/site-packages/distributed/client.py in persist(self, collections, optimize_graph, workers, allow_other_workers, resources, retries, priority, fifo_timeout, actors, **kwargs)
   2942         names = {k for c in collections for k in flatten(c.__dask_keys__())}
   2943 
-> 2944         futures = self._graph_to_futures(
   2945             dsk,
   2946             names,

/opt/conda/lib/python3.8/site-packages/distributed/client.py in _graph_to_futures(self, dsk, keys, workers, allow_other_workers, priority, user_priority, resources, retries, fifo_timeout, actors)
   2541                 dsk = HighLevelGraph.from_collections(id(dsk), dsk, dependencies=())
   2542 
-> 2543             dsk = highlevelgraph_pack(dsk, self, keyset)
   2544 
   2545             annotations = {}

/opt/conda/lib/python3.8/site-packages/distributed/protocol/highlevelgraph.py in highlevelgraph_pack(hlg, client, client_keys)
    113                 "__module__": None,
    114                 "__name__": None,
--> 115                 "state": _materialized_layer_pack(
    116                     layer,
    117                     hlg.get_all_external_keys(),

/opt/conda/lib/python3.8/site-packages/distributed/protocol/highlevelgraph.py in _materialized_layer_pack(layer, all_keys, known_key_dependencies, client, client_keys)
     63     }
     64 
---> 65     annotations = layer.pack_annotations()
     66     all_keys = all_keys.union(dsk)
     67     dsk = {stringify(k): stringify(v, exclusive=all_keys) for k, v in dsk.items()}

AttributeError: 'MaterializedLayer' object has no attribute 'pack_annotations'

What you expected to happen:
Computation should start on the dask cluster

Minimal Complete Verifiable Example:
Run docker-compose up, connect to Jupyter Notebook and exec e.g. notebook 04, or paste this:

from dask.distributed import Client, progress
c = Client()

import dask.array as da
x = da.random.random(size=(10000, 10000), chunks=(1000, 1000))

x = x.persist()
progress(x)

Anything else we need to know?:

Environment:
Printing the distributed client object returns the following:

/opt/conda/lib/python3.8/site-packages/distributed/client.py:1135: VersionMismatchWarning: Mismatched versions found

+---------+---------------+---------------+---------------+
| Package | client        | scheduler     | workers       |
+---------+---------------+---------------+---------------+
| blosc   | 1.10.2        | 1.9.2         | 1.9.2         |
| lz4     | 3.1.3         | 3.1.1         | 3.1.1         |
| msgpack | 1.0.2         | 1.0.0         | 1.0.0         |
| python  | 3.8.6.final.0 | 3.8.0.final.0 | 3.8.0.final.0 |
+---------+---------------+---------------+---------------+
Notes: 
-  msgpack: Variation is ok, as long as everything is above 0.6
  warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))
  • Dask version: 2021.2.0 (from conda-forge)
  • Python version: 3.8
  • Operating System: Ubuntu 18.04 (but I run dask in docker)
  • Install method (conda, pip, source): docker
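The VersionMismatchWarning above is the most likely culprit: the error suggests the dask and distributed releases in play disagree about the MaterializedLayer API (either between the notebook's own dask and distributed, or between the notebook and the scheduler/worker images). A hedged first step is to have the client report and check versions on every side (scheduler address as in the repo's docker-compose setup):

from dask.distributed import Client

client = Client("tcp://scheduler:8786")
# returns the package versions seen by the client, scheduler and each worker,
# and raises if check=True finds a mismatch
print(client.get_versions(check=True))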

Auto-tag CI job failure

Our new auto-tag GitHub action job is failing with the following error message (see this CI build)

Warning: Unexpected input(s) 'GITHUB_TOKEN', valid inputs are ['source_file', 'extraction_regex', 'tag_format', 'tag_message']
Run jaliborc/action-general-autotag@1.0.0
  with:
    GITHUB_TOKEN: ***
    source_file: .github/workflows/build.yml
    extraction_regex: \s*"release"\s*:\s*"([\d\.]+)"\s*
    tag_format: {version}
(node:1574) Warning: require() of ES modules is not supported.
require() of /home/runner/work/_actions/jaliborc/action-general-autotag/1.0.0/main.js is an ES module file as it is a .js file whose nearest parent package.json contains "type": "module" which defines all .js files in that package scope as ES modules.
Instead rename main.js to end in .cjs, change the requiring code to use import(), or remove "type": "module" from /home/runner/work/_actions/jaliborc/action-general-autotag/1.0.0/package.json.
Error: no match was found for the regex '/\s*"release"\s*:\s*"([\d\.]+)"\s*/'.

Note there is both a warning about GITHUB_TOKEN (which stems from the action being used; see Jaliborc/action-general-autotag#3) and an error about no matching regex being found.

cc @jacobtomlinson

miniconda version bump

Is there a particular reason for having

FROM continuumio/miniconda3:4.7.12

Instead of

FROM continuumio/miniconda3:4.8.2

Customized Jupyter Lab Parameters

I'm using your fantastic dask notebook image, thanks for creating it!

I would like to pass some extra parameters to the notebook as command line arguments. This doesn't seem to be possible using the standard prepare.sh script, so I need to customize the image with an additional layer to enable this.

  1. Did I miss something and there is a good way to pass parameters without creating another image layer?
  2. If not, is there any interest here in a pull request to support this? An easy implementation would be an environment variable, say JUPYTERLAB_ARGS that contains any additional command line args to be passed through or an empty string otherwise.

The line of code in question: link

docker workers fails to deserialize with rasterio

Hi,

after deploying a Dask chart on GC using Helm, I've updated the environment with some extra conda packages, specifically xarray and rasterio.
If I try to run my code I get the following error back from the worker logs.

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 661, in handle_scheduler
    self.ensure_computing])
  File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 386, in handle_stream
    msgs = yield comm.read()
  File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/tcp.py", line 206, in read
    deserializers=deserializers)
  File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 209, in wrapper
    yielded = next(result)
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/utils.py", line 82, in from_frames
    res = _from_frames()
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/utils.py", line 68, in _from_frames
    deserializers=deserializers)
  File "/opt/conda/lib/python3.7/site-packages/distributed/protocol/core.py", line 132, in loads
    value = _deserialize(head, fs, deserializers=deserializers)
  File "/opt/conda/lib/python3.7/site-packages/distributed/protocol/serialize.py", line 184, in deserialize
    return loads(header, frames)
  File "/opt/conda/lib/python3.7/site-packages/distributed/protocol/serialize.py", line 57, in pickle_loads
    return pickle.loads(b''.join(frames))
  File "/opt/conda/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 59, in loads
    return pickle.loads(x)
  File "/opt/conda/lib/python3.7/site-packages/rasterio/__init__.py", line 22, in <module>
    from rasterio._base import gdal_version
ImportError: libzstd.so.1: cannot open shared object file: No such file or directory
From my understanding the problem seems to be a missing or corrupted libzstd library, am I right?
Any idea?

Migrate CI to GitHub Actions

Due to changes in the Travis CI billing, the Dask org is migrating CI to GitHub Actions.

This repo contains a .travis.yml file which needs to be replaced with an equivalent .github/workflows/ci.yml file.

See dask/community#107 for more details.
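A hedged starting point for the replacement workflow, building the images on pushes and pull requests; the directory names used as build contexts are assumptions about the repo layout:

# .github/workflows/ci.yml (sketch)
name: CI
on: [push, pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        image: [base, notebook]
    steps:
      - uses: actions/checkout@v2
      - name: Build ${{ matrix.image }} image
        run: docker build -t daskdev/dask-${{ matrix.image }}:ci ${{ matrix.image }}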

scheduler_1 | KeyError: 'op'

Platform: ubuntu 16.04, Docker version 18.03.0-ce, docker-compose version 1.21.0

Exception when docker-compose up

scheduler_1  | Future exception was never retrieved
scheduler_1  | future: <Future finished exception=KeyError('op',)>
scheduler_1  | Traceback (most recent call last):
scheduler_1  |   File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
scheduler_1  |     yielded = self.gen.send(value)
scheduler_1  |   File "/opt/conda/lib/python3.6/site-packages/distributed/core.py", line 276, in handle_comm
scheduler_1  |     op = msg.pop('op')
scheduler_1  | KeyError: 'op'
scheduler_1  | Future exception was never retrieved
scheduler_1  | future: <Future finished exception=KeyError('op',)>
scheduler_1  | Traceback (most recent call last):
scheduler_1  |   File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
scheduler_1  |     yielded = self.gen.send(value)
scheduler_1  |   File "/opt/conda/lib/python3.6/site-packages/distributed/core.py", line 276, in handle_comm
scheduler_1  |     op = msg.pop('op')
scheduler_1  | KeyError: 'op'
scheduler_1  | Future exception was never retrieved
scheduler_1  | future: <Future finished exception=KeyError('op',)>
scheduler_1  | Traceback (most recent call last):
scheduler_1  |   File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
scheduler_1  |     yielded = self.gen.send(value)
scheduler_1  |   File "/opt/conda/lib/python3.6/site-packages/distributed/core.py", line 276, in handle_comm
scheduler_1  |     op = msg.pop('op')
scheduler_1  | KeyError: 'op'

OSError: Timed out trying to connect to 'tcp://scheduler:8786' after 10 s

Testing a docker-compose setup with one scheduler, one worker, and one client

docker-compose.yml:

version: "3.1"

services:
  scheduler:
    image: daskdev/dask
    hostname: dask-scheduler
    ports:
      - "8786:8786"
      - "8787:8787"
    command: ["dask-scheduler"]

  worker:
    image: daskdev/dask
    hostname: dask-worker
    command: ["dask-worker", "tcp://scheduler:8786"]

  client:
    build: client
    environment:
        - DASK_SCHEDULER_ADDRESS=scheduler:8786
    command: ["python", "script.py"]

client Dockerfile

FROM python:3.8-slim

ENV VIRTUAL_ENV=/opt/env
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

WORKDIR /app

COPY ./requirements.txt /app/requirements.txt
RUN apt-get update \
    && apt-get install gcc -y \
    && apt-get clean

RUN pip install -r /app/requirements.txt \
    && rm -rf /root/.cache/pip

COPY . /app/

client script

import os
from dask.distributed import Client

dask_scheduler = os.getenv("DASK_SCHEDULER_ADDRESS")
cl = Client(dask_scheduler)
print(cl)

repo available here: https://github.com/hcorrada/test-dask

What happened:
Client could not connect to scheduler:

Traceback (most recent call last):
  File "/opt/env/lib/python3.8/site-packages/distributed/comm/core.py", line 313, in connect
    _raise(error)
  File "/opt/env/lib/python3.8/site-packages/distributed/comm/core.py", line 266, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://scheduler:8786' after 10 s: connect() didn't finish in time

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "script.py", line 22, in <module>
    cl = Client(dask_scheduler)
  File "/opt/env/lib/python3.8/site-packages/distributed/client.py", line 744, in __init__
    self.start(timeout=timeout)
  File "/opt/env/lib/python3.8/site-packages/distributed/client.py", line 949, in start
    sync(self.loop, self._start, **kwargs)
  File "/opt/env/lib/python3.8/site-packages/distributed/utils.py", line 339, in sync
    raise exc.with_traceback(tb)
  File "/opt/env/lib/python3.8/site-packages/distributed/utils.py", line 323, in f
    result[0] = yield future
  File "/opt/env/lib/python3.8/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/env/lib/python3.8/site-packages/distributed/client.py", line 1046, in _start
    await self._ensure_connected(timeout=timeout)
  File "/opt/env/lib/python3.8/site-packages/distributed/client.py", line 1103, in _ensure_connected
    comm = await connect(
  File "/opt/env/lib/python3.8/site-packages/distributed/comm/core.py", line 325, in connect
    _raise(error)
  File "/opt/env/lib/python3.8/site-packages/distributed/comm/core.py", line 266, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://scheduler:8786' after 10 s: Timed out trying to connect to 'tcp://scheduler:8786' after 10 s: connect() didn't finish in time

What you expected to happen:

Client to connect

Minimal Complete Verifiable Example:

above

Environment:

  • Docker version: Docker version 19.03.12, build 48a66213fe
  • Operating System: MacOS Darwin Kernel Version 19.5.0
  • Install method (conda, pip, source): pip (via docker)
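One hedged explanation is a startup race: compose starts the client before the scheduler is listening, and Client gives up after the default 10 s. depends_on only orders container start, it does not wait for the scheduler to be ready, so a retry in the client script usually covers it; a sketch:

import os
import time
from dask.distributed import Client

address = os.getenv("DASK_SCHEDULER_ADDRESS", "tcp://scheduler:8786")

# retry for up to about a minute while the scheduler container finishes starting
client = None
for attempt in range(6):
    try:
        client = Client(address, timeout=10)
        break
    except OSError:
        time.sleep(10)
if client is None:
    raise RuntimeError(f"could not reach the scheduler at {address}")
print(client)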

How to start multiple workers on a single machine?

docker-compose scale worker=10

ERROR: for daskdocker_worker_6  driver failed programming external connectivity on endpoint daskdocker_worker_6 (2e9c2498d0638292dadc59aa550b310103d7b9c82c258f929f568db0c994269a): Bind for 0.0.0.0:8789 failed: port is already allocated
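The bind error comes from every scaled worker trying to publish the same host port. Workers only need to reach the scheduler over the compose network, so a hedged fix is simply not to publish any ports for the worker service:

  worker:
    image: daskdev/dask
    command: ["dask-worker", "scheduler:8786"]
    # no `ports:` section -- nothing needs to be bound on the host, so
    # `docker-compose up --scale worker=10` no longer collides on 8789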

ValueError on last cell of examples/03-dataframes-timeseries.ipynb

Running through the example notebooks in daskdev/dask-notebook:1.1.1, and encountered this error on the last cell of examples/03-dataframes-timeseries.ipynb:

df_small.rolling(100).mean().visualize(rankdir='LR')
Traceback

/opt/conda/lib/python3.7/site-packages/dask/dataframe/utils.py:390: FutureWarning: Creating a DatetimeIndex by passing range endpoints is deprecated.  Use `pandas.date_range` instead.
  tz=idx.tz, name=idx.name)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/pandas/core/window.py in _prep_values(self, values, kill_inf)
    210             try:
--> 211                 values = ensure_float64(values)
    212             except (ValueError, TypeError):

pandas/_libs/algos_common_helper.pxi in pandas._libs.algos.ensure_float64()

ValueError: could not convert string to float: 'foo'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-13-0432fe458397> in <module>
----> 1 df_small.rolling(100).mean().visualize(rankdir='LR')

/opt/conda/lib/python3.7/site-packages/dask/dataframe/rolling.py in mean(self)
    261     @derived_from(pd_Rolling)
    262     def mean(self):
--> 263         return self._call_method('mean')
    264 
    265     @derived_from(pd_Rolling)

/opt/conda/lib/python3.7/site-packages/dask/dataframe/rolling.py in _call_method(self, method_name, *args, **kwargs)
    229         rolling_kwargs = self._rolling_kwargs()
    230         meta = pandas_rolling_method(self.obj._meta_nonempty, rolling_kwargs,
--> 231                                      method_name, *args, **kwargs)
    232 
    233         if self._has_single_partition:

/opt/conda/lib/python3.7/site-packages/dask/dataframe/rolling.py in pandas_rolling_method(df, rolling_kwargs, name, *args, **kwargs)
    180 def pandas_rolling_method(df, rolling_kwargs, name, *args, **kwargs):
    181     rolling = df.rolling(**rolling_kwargs)
--> 182     return getattr(rolling, name)(*args, **kwargs)
    183 
    184 

/opt/conda/lib/python3.7/site-packages/pandas/core/window.py in mean(self, *args, **kwargs)
   1726     def mean(self, *args, **kwargs):
   1727         nv.validate_rolling_func('mean', args, kwargs)
-> 1728         return super(Rolling, self).mean(*args, **kwargs)
   1729 
   1730     @Substitution(name='rolling')

/opt/conda/lib/python3.7/site-packages/pandas/core/window.py in mean(self, *args, **kwargs)
   1070     def mean(self, *args, **kwargs):
   1071         nv.validate_window_func('mean', args, kwargs)
-> 1072         return self._apply('roll_mean', 'mean', **kwargs)
   1073 
   1074     _shared_docs['median'] = dedent("""

/opt/conda/lib/python3.7/site-packages/pandas/core/window.py in _apply(self, func, name, window, center, check_minp, **kwargs)
    839         results = []
    840         for b in blocks:
--> 841             values = self._prep_values(b.values)
    842 
    843             if values.size == 0:

/opt/conda/lib/python3.7/site-packages/pandas/core/window.py in _prep_values(self, values, kill_inf)
    212             except (ValueError, TypeError):
    213                 raise TypeError("cannot handle this type -> {0}"
--> 214                                 "".format(values.dtype))
    215 
    216         if kill_inf:

TypeError: cannot handle this type -> object

Version mismatch between worker/scheduler and notebook in images from docker hub

I pulled the latest images of daskdev/dask and daskdev/dask-notebook from Docker hub. It sounds like there is a version mismatch between them

[screenshot: dask-error]

Moving forward to investigate my dask dataframe, it raised an error. I'm new to dask and not sure this version mismatch is really the root cause, but that's what the error message complained about!

Base and notebook images have different sources for some libraries

For example, in the base image python comes from main

$ docker run -it --rm daskdev/dask conda list python
...
+ conda list python
# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
msgpack-python            0.5.6            py36h6bb024c_1
python                    3.6.5                hc3d631a_2
python-blosc              1.5.1            py36h14c3975_2
python-dateutil           2.7.3                    py36_0

But in the notebook image, it comes from conda-forge.

docker run -it --rm daskdev/dask-notebook conda list python
...
+ conda list python
# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
ipython                   6.5.0                    py36_0    conda-forge
ipython_genutils          0.2.0                      py_1    conda-forge
msgpack-python            0.5.6            py36h2d50403_3    conda-forge
python                    3.6.5                         1    conda-forge
python-blosc              1.4.4                    py36_0    conda-forge
python-dateutil           2.7.3                      py_0    conda-forge
python-editor             1.0.3                    py36_0    conda-forge
python-oauth2             1.0.1                    py36_0    conda-forge

no 2.6.0 tag

It seems only latest/dev are tagged; adding a proper version tag like 2.6.0 would be great.

Cache user-specified dependencies

I have an environment.yml file. It installs 13 packages from conda and 23 packages from PyPI. Installing takes time. It'd be nice if this could be accelerated on repeated builds. I think this would be possible by caching the downloads/installs.
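A hedged sketch of the usual layer-caching approach: copy only environment.yml before anything that changes often, so the install layer is reused until the file itself changes (base image and paths are illustrative):

FROM daskdev/dask:latest

# this layer, and the install below, stay cached until environment.yml changes
COPY environment.yml /opt/app/environment.yml
RUN conda env update -n base -f /opt/app/environment.yml && conda clean -afy

# application code changes often but no longer invalidates the install layer
COPY . /opt/app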

Error docker-compose up: error creating new backup file '/var/lib/dpkg/status-old'

Hello, I tried docker-compose up and got the following:

Get:93 http://archive.ubuntu.com/ubuntu bionic/main amd64 vim-runtime all 2:8.0.1453-1ubuntu1 [5,437 kB]

Get:94 http://archive.ubuntu.com/ubuntu bionic/main amd64 vim amd64 2:8.0.1453-1ubuntu1 [1,152 kB]

debconf: delaying package configuration, since apt-utils is not installed

Fetched 31.8 MB in 22s (1,446 kB/s)
Selecting previously unselected package multiarch-support.
(Reading database ... 5067 files and directories currently installed.)
Preparing to unpack .../multiarch-support_2.27-3ubuntu1_amd64.deb ...
Unpacking multiarch-support (2.27-3ubuntu1) ...

dpkg: error: error creating new backup file '/var/lib/dpkg/status-old': 
    Invalid cross-device link

E: Sub-process /usr/bin/dpkg returned an error code (2)

ERROR: Service 'notebook' failed to build: The command '/bin/sh -c apt-get update   
     && apt-get install -yq --no-install-recommends graphviz git vim   
     && apt-get clean   && rm -rf /var/lib/apt/lists/*' returned a non-zero code: 100

I use Docker version 18.09.0-ce, build 4d60db472b

Thanks

lz4 dependency

As noted in dask/distributed#3209, the helm/kubernetes documentation for dask leads to an issue if the client computer has lz4 installed.

It may be better to include the lz4 conda package in the daskdev/dask image to avoid this.

APT install part doesn't work

I'm trying to install the build-essential package for some of my pip requirements, but can't get past the error below.

EXTRA_APT_PACKAGES environment variable found.  Installing.
+ echo 'EXTRA_APT_PACKAGES environment variable found.  Installing.'
+ apt update -y

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Reading package lists...
E: List directory /var/lib/apt/lists/partial is missing. - Acquire (13: Permission denied)
+ apt install -y build-essential

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?
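The "are you root?" lines suggest prepare.sh is running apt as a non-root user (most likely the notebook image, which runs as jovyan). A hedged workaround until the images handle this is to run the container as root, assuming nothing else depends on the default user:

docker run --user root -e EXTRA_APT_PACKAGES="build-essential" daskdev/dask-notebook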

Add dask-ml notebook

@TomAugspurger I'd like to add your dask-ml notebook to our standard helm install, which uses this docker image by default. Is the notebook that you placed in the pangeo docker image the correct one to copy here as well? I can do this, I just wanted to check with you that you haven't made changes since then.

Add cachey dependency

I know this feature is considered experimental, but is there any chance of making cachey a dependency?

custom-delayed notebook error

I got an error like this when I ran docker-compose up and tested the custom-delayed notebook. Can anyone help me?


NameError                                 Traceback (most recent call last)
<ipython-input> in <module>
----> 1 z

NameError: name 'z' is not defined

[screenshot]

Add more Python versions

Today I've been debugging something which ended up being a Python 3.7 v 3.8 issue where my cluster was using Docker and 3.8 and my local conda environment was 3.7.

I wonder if it would be useful to build multiple images with different Python versions?
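A hedged sketch of how that could work with a build argument, so one Dockerfile can produce e.g. py3.7 and py3.8 tags (the ARG name and base image tag are assumptions):

FROM continuumio/miniconda3:4.8.2
ARG PYTHON_VERSION=3.8
RUN conda install -y "python=${PYTHON_VERSION}" dask distributed \
    && conda clean -afy

# built as, for example:
#   docker build --build-arg PYTHON_VERSION=3.7 -t daskdev/dask:py3.7 .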

Missing libdouble-conversion.so.3 when using pyarrow

Hello folks,

I want to install pyarrow as an additional conda dependency. The installation works, but when I try to import it the following error appears: ImportError: libdouble-conversion.so.3: cannot open shared object file: No such file or directory

This does not happen with daskdev/dask.

Below are the commands to reproduce the error and the full console output of the error (I'm sorry for installing s3fs also, but it shouldn't make a difference):

Commands:

sudo docker run -it -e EXTRA_CONDA_PACKAGES="pyarrow s3fs -c conda-forge" daskdev/dask-notebook:2.16.0 bash
python
import pyarrow

Console output:

sudo docker run -it -e EXTRA_CONDA_PACKAGES="pyarrow s3fs -c conda-forge" daskdev/dask-notebook:2.16.0 bash
+ '[' '' ']'
+ '[' -e /opt/app/environment.yml ']'
+ echo 'no environment.yml'
no environment.yml
+ '[' 'pyarrow s3fs -c conda-forge' ']'
+ echo 'EXTRA_CONDA_PACKAGES environment variable found.  Installing.'
EXTRA_CONDA_PACKAGES environment variable found.  Installing.
+ /opt/conda/bin/conda install -y pyarrow s3fs -c conda-forge
Collecting package metadata (current_repodata.json): done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 4.7.12
  latest version: 4.8.3

Please update conda by running

    $ conda update -n base conda



## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - pyarrow
    - s3fs


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    arrow-cpp-0.13.0           |   py37hdbb9910_4         3.5 MB  conda-forge
    boost-cpp-1.70.0           |       h8e57a91_2        21.1 MB  conda-forge
    boto3-1.13.16              |     pyh9f0ad1d_0          69 KB  conda-forge
    botocore-1.16.16           |     pyh9f0ad1d_0         3.8 MB  conda-forge
    brotli-1.0.7               |    he1b5a44_1001         1.0 MB  conda-forge
    bzip2-1.0.8                |       h516909a_2         396 KB  conda-forge
    docutils-0.15.2            |           py37_0         736 KB  conda-forge
    gflags-2.2.2               |    he1b5a44_1002         175 KB  conda-forge
    glog-0.4.0                 |       h49b9bf7_3         104 KB  conda-forge
    jmespath-0.10.0            |     pyh9f0ad1d_0          21 KB  conda-forge
    libevent-2.1.10            |       h72c5cf5_0         1.3 MB  conda-forge
    libprotobuf-3.7.1          |       h8b12597_0         4.6 MB  conda-forge
    lz4-3.0.2                  |   py37hb076c26_1          43 KB  conda-forge
    lz4-c-1.8.3                |    he1b5a44_1001         187 KB  conda-forge
    parquet-cpp-1.5.1          |                2           3 KB  conda-forge
    pyarrow-0.13.0             |   py37h8b68381_2         2.2 MB  conda-forge
    re2-2019.08.01             |       he6710b0_0         456 KB  defaults
    s3fs-0.4.2                 |             py_0          21 KB  conda-forge
    s3transfer-0.3.3           |   py37hc8dfbb8_1          90 KB  conda-forge
    snappy-1.1.8               |       he1b5a44_1          39 KB  conda-forge
    thrift-cpp-0.12.0          |    hf3afdfd_1004         2.4 MB  conda-forge
    ------------------------------------------------------------
                                           Total:        42.2 MB

The following NEW packages will be INSTALLED:

  arrow-cpp          conda-forge/linux-64::arrow-cpp-0.13.0-py37hdbb9910_4
  boost-cpp          conda-forge/linux-64::boost-cpp-1.70.0-h8e57a91_2
  boto3              conda-forge/noarch::boto3-1.13.16-pyh9f0ad1d_0
  botocore           conda-forge/noarch::botocore-1.16.16-pyh9f0ad1d_0
  brotli             conda-forge/linux-64::brotli-1.0.7-he1b5a44_1001
  bzip2              conda-forge/linux-64::bzip2-1.0.8-h516909a_2
  docutils           conda-forge/linux-64::docutils-0.15.2-py37_0
  gflags             conda-forge/linux-64::gflags-2.2.2-he1b5a44_1002
  glog               conda-forge/linux-64::glog-0.4.0-h49b9bf7_3
  jmespath           conda-forge/noarch::jmespath-0.10.0-pyh9f0ad1d_0
  libevent           conda-forge/linux-64::libevent-2.1.10-h72c5cf5_0
  libprotobuf        conda-forge/linux-64::libprotobuf-3.7.1-h8b12597_0
  parquet-cpp        conda-forge/noarch::parquet-cpp-1.5.1-2
  pyarrow            conda-forge/linux-64::pyarrow-0.13.0-py37h8b68381_2
  re2                pkgs/main/linux-64::re2-2019.08.01-he6710b0_0
  s3fs               conda-forge/noarch::s3fs-0.4.2-py_0
  s3transfer         conda-forge/linux-64::s3transfer-0.3.3-py37hc8dfbb8_1
  snappy             conda-forge/linux-64::snappy-1.1.8-he1b5a44_1
  thrift-cpp         conda-forge/linux-64::thrift-cpp-0.12.0-hf3afdfd_1004

The following packages will be DOWNGRADED:

  lz4                                  3.0.2-py37h5a7ed16_2 --> 3.0.2-py37hb076c26_1
  lz4-c                                    1.9.2-he1b5a44_1 --> 1.8.3-he1b5a44_1001



Downloading and Extracting Packages
arrow-cpp-0.13.0     | 3.5 MB    | ############################################### | 100% 
botocore-1.16.16     | 3.8 MB    | ############################################### | 100% 
s3fs-0.4.2           | 21 KB     | ############################################### | 100% 
lz4-c-1.8.3          | 187 KB    | ############################################### | 100% 
boost-cpp-1.70.0     | 21.1 MB   | ############################################### | 100% 
pyarrow-0.13.0       | 2.2 MB    | ############################################### | 100% 
s3transfer-0.3.3     | 90 KB     | ############################################### | 100% 
libevent-2.1.10      | 1.3 MB    | ############################################### | 100% 
bzip2-1.0.8          | 396 KB    | ############################################### | 100% 
jmespath-0.10.0      | 21 KB     | ############################################### | 100% 
glog-0.4.0           | 104 KB    | ############################################### | 100% 
gflags-2.2.2         | 175 KB    | ############################################### | 100% 
docutils-0.15.2      | 736 KB    | ############################################### | 100% 
boto3-1.13.16        | 69 KB     | ############################################### | 100% 
re2-2019.08.01       | 456 KB    | ############################################### | 100% 
thrift-cpp-0.12.0    | 2.4 MB    | ############################################### | 100% 
lz4-3.0.2            | 43 KB     | ############################################### | 100% 
brotli-1.0.7         | 1.0 MB    | ############################################### | 100% 
snappy-1.1.8         | 39 KB     | ############################################### | 100% 
libprotobuf-3.7.1    | 4.6 MB    | ############################################### | 100% 
parquet-cpp-1.5.1    | 3 KB      | ############################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
+ '[' '' ']'
+ exec bash
jovyan@2df202a75cc8:~$ python
Python 3.7.3 (default, Mar 27 2019, 22:11:17) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.7/site-packages/pyarrow/__init__.py", line 47, in <module>
    from pyarrow.lib import cpu_count, set_cpu_count
ImportError: libdouble-conversion.so.3: cannot open shared object file: No such file or directory
>>> exit()

Update docker images to dask 2.6

We should probably update dask, and also bump to numpy=1.17.2 and pandas=0.25.1.

I tried doing this locally but ended up running into some jupyterlab conflicts, so I thought I'd raise an issue.

@jacobtomlinson any interest in handling this? Also, any interest in automating this process?

Docker best practices and R packages

I'd like some clarification about best practices for these Docker images. I'm referring to this file.

  1. Will it eventually be possible to use an env.yml file instead of listing all packages to install after RUN conda install --yes \? If so, does prepare.sh deal with the environment name, or should I add an extra step?
  2. It took me quite some time to understand how to build a Docker image with a non-conda R package. The problem is that if I try
FROM continuumio/miniconda3:4.7.12

RUN conda install --yes \
    -c conda-forge \
    python=3.7.3 \
    python-blosc \
    cytoolz \
    dask==2.14.0 \
    msgpack-python=1.0.0 \
    nomkl \
    numpy=1.17.5 \ 
    pandas==0.25.3 \
    numba=0.49.1 \
    pyarrow=0.16.0 \
    cython=0.29.21 \
    boto3==1.14.39 \
    s3fs==0.4.2 \
    tini==0.18.0 \
    rpy2=2.9.1 \
    r-base=3.5.0 \
    r-devtools=2.3.0\
    r-stringr=1.4.0 \
    r-stringi=1.4.3 \
    r-curl=4.3 \
    && conda clean -tipsy \
    && find /opt/conda/ -type f,l -name '*.a' -delete \
    && find /opt/conda/ -type f,l -name '*.pyc' -delete \
    && find /opt/conda/ -type f,l -name '*.js.map' -delete \
    && find /opt/conda/lib/python*/site-packages/bokeh/server/static -type f,l -name '*.js' -not -name '*.min.js' -delete \
    && rm -rf /opt/conda/pkgs

# Install R libraries
COPY packages.R /packages.R
RUN ln -s /bin/tar /bin/gtar
RUN Rscript packages.R

COPY prepare.sh /usr/bin/prepare.sh

RUN mkdir /opt/app

where packages.R is

options(Ncpus = parallel::detectCores())
library(devtools)
devtools::install_github("robjhyndman/anomalous")

I get an error compiling several R libraries, but if I comment out the following lines (or move them after the R installation)

    && conda clean -tipsy \
    && find /opt/conda/ -type f,l -name '*.a' -delete \
    && find /opt/conda/ -type f,l -name '*.pyc' -delete \
    && find /opt/conda/ -type f,l -name '*.js.map' -delete \
    && find /opt/conda/lib/python*/site-packages/bokeh/server/static -type f,l -name '*.js' -not -name '*.min.js' -delete \
    && rm -rf /opt/conda/pkgs

it works fine. I don't quite understand why this is happening, but I think it would be great to document this better or comment the Dockerfile accordingly. I could help with it.
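A hedged reading of why the ordering matters: the cleanup lines delete static libraries, the conda package cache and other build-time files that the toolchain may still need while devtools compiles R packages from source, so running them before Rscript packages.R removes files the build depends on. Deferring the cleanup to the same (final) layer as the R install keeps both the smaller image and a working build, roughly:

# install R libraries first, then clean up in the same layer
COPY packages.R /packages.R
RUN ln -s /bin/tar /bin/gtar \
    && Rscript packages.R \
    && conda clean -tipsy \
    && find /opt/conda/ -type f,l -name '*.a' -delete \
    && rm -rf /opt/conda/pkgs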

Tornado and Python version mismatch between notebook and base images

Due to a version mismatch, a client created in the notebook image cannot connect to a scheduler from the base image. This is related to #36 and likely a symptom of the fact that the base images are different.

Using a different base image or pinning dependencies is a possible remedy.

notebook:

  • python 3.7.3
  • tornado 6.0.3

base:

  • python 3.7.4
  • tornado 6.0.4

Bump numpy and pandas versions

We should probably update the numpy and pandas versions that we pin to at some point. We should verify that examples.dask.org continues to build and run nicely with these updates.
