replicate / cog
Containers for machine learning
Home Page: https://cog.run
License: Apache License 2.0
It's not clear that it's just a string attached to your commit message. It looks scary. https://github.com/replicate/cog/blob/main/CONTRIBUTING.md
The double behavior of default is not obvious. This might want to be optional=True instead, so that inputs are required by default. If inputs were optional by default, users normally wouldn't bother marking them as required, and required inputs that aren't marked as required will cause breakage.
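The required-by-default behavior proposed above could look like this minimal sketch. The names (InputSpec, validate) are invented for illustration and are not Cog's actual API: an input is optional only when explicitly marked, rather than whenever a default happens to be present.

```python
class InputSpec:
    """Hypothetical input declaration: required unless optional=True."""

    def __init__(self, name, default=None, optional=False):
        self.name = name
        self.default = default
        self.optional = optional


def validate(specs, values):
    """Resolve provided values, raising if a required input is missing."""
    resolved = {}
    for spec in specs:
        if spec.name in values:
            resolved[spec.name] = values[spec.name]
        elif spec.optional:
            resolved[spec.name] = spec.default
        else:
            raise ValueError(f"missing required input: {spec.name}")
    return resolved
```

With this shape, a default never silently makes an input optional; the two concerns are separate flags.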
From #118, the end to end tests can't connect to the bridge IP on macOS:
This should either be run in a consistent dev environment, or if we actually want to run the end to end tests on macOS (which probably makes sense in CI?), then we might need some kind of OS-based switch in there.
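An OS-based switch could be as simple as the sketch below, assuming the tests reach containers via ports published on localhost under Docker Desktop on macOS, and via the docker0 bridge on Linux. The bridge address shown is the usual Docker default, not something Cog guarantees.

```python
import platform


def e2e_test_host(system=None):
    """Pick the address the end-to-end tests should connect to.

    On macOS the Linux bridge IP is unreachable from the host, so fall
    back to localhost (assuming published ports).
    """
    system = system or platform.system()
    if system == "Darwin":
        return "localhost"
    return "172.17.0.1"  # default docker0 bridge on Linux (illustrative)
```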
cog build
═══╡ Uploading /Users/tekumara/code3/cog-examples/inst-colorization to localhost:8080/examples/inst-colorization
⠋ uploading (925 MB, 269.985 MB/s) ═══╡ Building model...
═══╡ Received model
═══╡ Building cpu image
═══╡ * Installing Python prerequisites
═══╡ * Installing Python 3.8
═══╡ * Installing system packages
═══╡ * Installing Python packages
═══╡ * Installing Cog
═══╡ * Copying code
═══╡ Successfully built 507cf5936fd9
═══╡ Pushing localhost:5000/inst-colorization:507cf5936fd9 to registry
═══╡ Building gpu image
═══╡ * Installing Python prerequisites
═══╡ * Installing Python 3.8
═══╡ ---> Using cache
═══╡ ---> 68aac6e4699f
═══╡ Step 8/20 : RUN curl https://pyenv.run | bash && git clone https://github.com/momo-lab/pyenv-install-latest.git "$(pyenv root)"/plugins/pyenv-install-latest && pyenv
│ install-latest "3.8" && pyenv global $(pyenv install-latest --print "3.8")
═══╡ ---> Running in ae5b74d815ca
═══╡ % Total % Received % Xferd Average Speed Time Time Time Current
═══╡ Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 285 100 285 0 0 198 0 0:00:01 0:00:01 --:--:-- 198 0
═══╡ Cloning into '/root/.pyenv'...
═══╡ Cloning into '/root/.pyenv/plugins/pyenv-doctor'...
═══╡ Cloning into '/root/.pyenv/plugins/pyenv-installer'...
═══╡ Cloning into '/root/.pyenv/plugins/pyenv-update'...
═══╡ fatal: unable to access 'https://github.com/pyenv/pyenv-update.git/': gnutls_handshake() failed: The TLS connection was non-properly terminated.
═══╡ Failed to git clone https://github.com/pyenv/pyenv-update.git
═══╡ Error: Failed to build Docker image: exit status 255
High CPU usage during the build. pip install while building the GPU image throws OOM with the standard 4GB of memory. (Within Docker, not outside.)
═══╡ Building gpu image
═══╡ * Installing Python prerequisites
═══╡ * Installing Python 3.8
═══╡ * Installing system packages
═══╡ * Installing Python packages
═══╡ #11 sha256:6ba92f3047b5dec04235ade8528c87bc142e66bb38015765dc4f9cbb7d185cd8
═══╡ #11 DONE 0.5s
═══╡
═══╡ #12 [ 9/15] RUN pip install -f
│ https://dl.fbaipublicfiles.com/detectron2/wheels/cpu/index.html -f
│ https://download.pytorch.org/whl/cu101/torch_stable.html
│ --extra-index-url=git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI
│ cachetools==4.1.0 chardet==3.0.4 future==0.18.2 fvcore==0.1.dev200506
│ idna==2.9 importlib-metadata==1.6.0 jsonpatch==1.25 jsonpointer==2.0
│ markdown==3.2.2 mock==4.0.2 opencv-python==4.3.0.38 portalocker==1.7.0
│ pyasn1==0.4.8 pyasn1-modules==0.2.8 pydot==1.4.1 requests==2.23.0
│ requests-oauthlib==1.3.0 rsa==4.0 tabulate==0.8.7 termcolor==1.1.0
│ urllib3==1.25.8 visdom==0.1.8.9 websocket-client==0.57.0 werkzeug==1.0.1
│ yacs==0.1.7 zipp==3.1.0 cython==0.29.22 pyyaml==5.1 dominate==2.4.0
│ detectron2==0.1.2 torch==1.5.0 torchvision==0.6.0 pycocotools==2.0.2
│ ipython==7.21.0 scikit-image==0.18.1
═══╡ #12 sha256:84e81bafd53e4595c28450182c770765e490c57fe351dd48fbad3418b7ad1697
═══╡ #12 1.283 Looking in indexes: https://pypi.org/simple,
│ git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI
═══╡ #12 1.283 Looking in links:
│ https://dl.fbaipublicfiles.com/detectron2/wheels/cpu/index.html,
│ https://download.pytorch.org/whl/cu101/torch_stable.html
═══╡ #12 1.370 WARNING: Cannot look at git URL
│ git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI/cachetools/
│ because it does not support lookup as web pages.
═══╡ #12 2.202 Collecting cachetools==4.1.0
═══╡ #12 2.235 Downloading cachetools-4.1.0-py3-none-any.whl (10 kB)
═══╡ #12 2.265 WARNING: Cannot look at git URL
│ git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI/chardet/
│ because it does not support lookup as web pages.
═══╡ #12 2.942 Collecting chardet==3.0.4
═══╡ #12 2.950 Downloading chardet-3.0.4-py2.py3-none-any.whl (133 kB)
......
Console ═══╡ messages are for displaying messages to the user, not for large amounts of debugging output. The build output should be displayed as plain text, clearly delineated from the informational messages.
If I'm running my own server, I should just be able to point at http://10.1.1.1/hotdog-detector
In the cog build log, there is no line saying it is pushing the model, which is a time-consuming process.
If you don't define a setup() function, you get an incomprehensible error:
═══╡ Traceback (most recent call last):
│ File "/usr/bin/cog-http-server", line 8, in <module>
│ cog.HTTPServer(Model()).start_server()
│ TypeError: Can't instantiate abstract class Model with abstract methods setup
│
═══╡ Container exited unexpectedly
My feeling is we should require setup() to encourage users to do the right thing, but there should be a clearer error message.
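A clearer error could come from checking the abstract methods before instantiating the user's class, instead of letting abc raise its raw TypeError. This is a sketch only; the BaseModel/Model names follow the traceback above, and check_model_class is hypothetical.

```python
from abc import ABC, abstractmethod


class BaseModel(ABC):
    @abstractmethod
    def setup(self): ...

    @abstractmethod
    def run(self): ...


def check_model_class(cls):
    """Raise a readable error if abstract methods were left undefined."""
    missing = sorted(getattr(cls, "__abstractmethods__", frozenset()))
    if missing:
        names = ", ".join(f"{m}()" for m in missing)
        raise RuntimeError(
            f"{cls.__name__} must define {names}. "
            "If there is nothing to set up, define setup() with a pass body."
        )


class Model(BaseModel):  # a user model that forgot to define setup()
    def run(self):
        return "ok"
```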
There is preinstall, but the name implies that it comes before installing other things, which it doesn't -- it comes before copying code. We should:
Server should probably delete all local images besides the most recent one. That way caching still works, but we don't infinitely use up disk space.
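The keep-only-the-newest policy could be sketched like this, with (repository, tag, created) tuples standing in for whatever docker images reports; the function name is invented for illustration.

```python
def images_to_delete(images):
    """Given (repository, tag, created) tuples, return every image except
    the most recently created one per repository, so layer caching still
    works while disk usage stays bounded."""
    newest = {}
    for repo, tag, created in images:
        if repo not in newest or created > newest[repo][2]:
            newest[repo] = (repo, tag, created)
    return [img for img in images if img != newest[img[0]]]
```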
Currently invalid keys are silently ignored.
For example, a cog.yaml that contains this:
buildd:
  python_version: "3.8"
  python_packages:
    - "torch==1.8.0"
will silently fail to install torch, instead of complaining that the key buildd doesn't exist.
We may also want to validate the values at some point. For example, ensuring python_version is a string and matches a given pattern. (If you omit the quotes, 3.1 and 3.10 are indistinguishable! Thanks, YAML!)
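Rejecting unknown top-level keys could look like this sketch, which also suggests the closest valid key for typos like buildd. The set of known keys here is illustrative, not Cog's full schema.

```python
import difflib

# Illustrative subset of valid top-level cog.yaml keys, not the real schema.
KNOWN_KEYS = ["build", "image", "predict"]


def validate_config(config):
    """Return a list of error strings for unknown top-level keys."""
    errors = []
    for key in config:
        if key not in KNOWN_KEYS:
            guess = difflib.get_close_matches(key, KNOWN_KEYS, n=1)
            hint = f" (did you mean {guess[0]!r}?)" if guess else ""
            errors.append(f"unknown key {key!r}{hint}")
    return errors
```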
(Edited by @andreasjansson and @bfirsh.)
We ran into this yesterday but haven't reproduced locally. Needs confirmation.
It's run() in Python but cog infer on the CLI. We need to decide on the verb and stick to it.
It could be a fixed filename, but maybe it shouldn't be, so as not to cause a race condition with multiple builds. Perhaps it could be a hash of the content.
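The content-addressed option could be as simple as this sketch: name the file after a hash of its bytes, so concurrent builds producing the same content agree on one name and different content never collides into a fixed filename. The function name and suffix are illustrative.

```python
import hashlib


def content_filename(data: bytes, suffix: str = ".zip") -> str:
    """Derive a filename from a SHA-256 hash of the content."""
    return hashlib.sha256(data).hexdigest()[:16] + suffix
```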
It is unexpected behavior that the server behaves differently in different working directories. Simplest solution here might be that you need to explicitly specify where data is stored.
There is a lot of string concatenation without escaping and things like that. This feels like it needs a DSL.
Perhaps the simplest way would be to put all generated data inside environment variables, which can easily be escaped, then carefully input that data in fixed commands through the rest of the Dockerfile.
We can also do stuff like use the RUN ["foo", ...] form.
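Both ideas above could be sketched as small helpers: route generated values through ENV with shell quoting so they can be referenced safely later, and emit RUN in its exec (JSON) form so no shell parsing happens at all. The helper names are invented for illustration.

```python
import json
import shlex


def env_line(name, value):
    """Emit an ENV instruction with the value shell-quoted."""
    return f"ENV {name}={shlex.quote(value)}"


def run_exec_form(args):
    """Emit a RUN instruction in exec (JSON array) form: no shell involved."""
    return "RUN " + json.dumps(args)
```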
Ideally we wouldn't use Dockerfiles at all, which is a larger piece of work in #21.
Currently depends on Python dependencies installed globally.
Use case: I have a folder of images, I want them all colorized. At the moment you have to wait a minute for the model to boot for each image.
It should also support reading from a file, as requested by @DeNeutoy.
(written by @andreasjansson @bfirsh)
I think we decided on this, but doesn't seem to be the case.
Steps to reproduce:
It now stalls saying ⠙ uploading (3.1 kB, 0.494 kB/s). Perhaps it's calculating some hashes, or something. Whatever it's doing, it should show progress instead of looking broken.
I want to rename /infer to /predict, but we can't change it because all the old models have it, and there is no way for Cog to detect what version it is and what it should call.
Alternatively, perhaps it is the client which should detect what version of Cog the model has been made with, and adjust its API calls as appropriate.
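The client-side alternative could be sketched as probing which endpoint the model's server actually exposes and caching the answer. The probe is injected as a callable so the logic stays testable; a real client would make an HTTP request instead. The function name is hypothetical.

```python
def prediction_path(endpoint_exists):
    """Return the first prediction endpoint this model supports,
    preferring the new /predict and falling back to the old /infer."""
    for path in ("/predict", "/infer"):
        if endpoint_exists(path):
            return path
    raise RuntimeError("model exposes neither /predict nor /infer")
```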
Cog's is relative to /code, Docker's is absolute. This is quite confusing, particularly when cog run gets involved.
As far as I can see, the intention of cog.yaml's workdir is two-fold: to set PYTHONPATH correctly, and as a shortcut to set the directory for post_install.
When I'm iterating, I have to wait for two images to build before I know it's completely broken.
Currently Cog requires you to upload an input file, then it's processed and results are returned. But there are cases when you might want to stream a continuous input to the model. For example, if you have a model that does audio event detection, you might want to display the current event as it happens.
This is clumsy to explain in getting started, and means we have to have extra imports in all the model definition docs. Also a thing users will stumble on.
It should use the same TerminalLogger as cog run.
Would love to see some kind of notebook integration. Could we perhaps expose the environment built in Docker as a Jupyter Notebook kernel?
This will stop models from breaking when we update our default.
Steps to reproduce:
cog push
touch test
cog push
This will produce a version with the same ID. (And will presumably fail once #90 is in.)
Sorting is significant.
If you don't pass -o, it should just print the output to stdout. This is a regression; it was working at some point.
It's common for models to download weights in the setup() function. This isn't reproducible (models might run without a network connection and weights files can disappear from the internet) so we should discourage it. In cases where you actually need network access we can provide a config option to allow the model to hit the network.
The tricky bit is allowing incoming access for the HTTP server while disallowing outgoing connections. On a cursory search, there seems to be no simple way to do this without iptables rules on the host, or connecting the container to a private network and using another container as a proxy. Some creativity might be needed.
As a start, perhaps we could bodge it inside the container. That way we can guarantee it isn't downloading any files for reproducibility reasons, but doesn't have any security guarantees.
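One such bodge, sketched here under the assumption that the model server runs Python: wrap socket connections so anything other than loopback is refused. This is good enough for reproducibility (no silent weight downloads), but as noted, it is not a security boundary.

```python
import socket

# Keep a reference to the real implementation before patching it.
_real_connect = socket.socket.connect


def _guarded_connect(self, address):
    """Refuse any connection that isn't to the loopback interface."""
    host = address[0] if isinstance(address, tuple) else address
    if host not in ("localhost", "127.0.0.1", "::1"):
        raise OSError(f"outgoing network access is blocked (tried {host})")
    return _real_connect(self, address)


socket.socket.connect = _guarded_connect
```

This is the same trick libraries like pytest-socket use to fence off network access in tests.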
There are some cases where network access might be needed. A few ideas: an option in cog.yaml to allow network access. However, untrusted models not having network access is a neat security feature, so it would be a shame to allow model creators to break this.
[This issue has been authored by @andreasjansson and @bfirsh.]
The server should be able to say "this client is not supported, you should upgrade!" elegantly.
Proposal here: https://consoledonottrack.com/
Instead of using Dockerfiles. Concatenating strings is fragile. https://www.docker.com/blog/compiling-containers-dockerfiles-llvm-and-buildkit/
Note that we use buildkit to build Dockerfiles for CPU. This is about calling the buildkit API directly instead of going via a Dockerfile.
We could also call the Docker API to create and commit containers, emulating the build process.
A nice side-effect of using Dockerfiles is we can generate a Dockerfile for users if they want to "eject" from Cog.
Related to #165
Cog should search up the file tree for cog.yaml, like Keepsake does.
For example, if /home/ben/hotdog-detector/cog.yaml exists, then I should be able to run cog predict in /home/ben/hotdog-detector/subdir/ and it should do what I expect.
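The upward search could look like this minimal sketch: walk from the starting directory toward the filesystem root and return the first directory that contains cog.yaml, similar to how git locates its repository. The function name is illustrative.

```python
from pathlib import Path


def find_project_root(start):
    """Return the nearest ancestor directory (including start itself)
    containing cog.yaml, or None if there isn't one."""
    start = Path(start)
    for directory in [start, *start.parents]:
        if (directory / "cog.yaml").exists():
            return directory
    return None
```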
There is some nuance here with cog run. Should the working directory be the relative current directory inside the container?
As part of implementing local mode and on server (#18) we need a sensible way of managing local images.
docker system prune
docker images
For clarity, this is additional work on top of #18. This also includes picking a sensible name when running locally and you aren't pointed at a registry.
As suggested by @zeke.
.npmignore defaults to .gitignore, but there is a dangerous silent failure in that: suppose .gitignore ignores secrets.json. If you then add .npmignore with something new you want to ignore, it stops inheriting from .gitignore, therefore unignoring secrets.json.
There is also an additional consideration for machine learning models: .gitignore will normally ignore your model weights, but you want to include those for Cog, so maybe in this case the default would always be not what you want. In which case, it probably shouldn't be the default.
Maybe we need some sensible defaults that are clear to the user? Maybe there's something clever we can do based on .gitignore?
"Project" is used nowhere else.
Currently hangs.
We don't use the word "arguments" anywhere else. Should this be "input types" or "inputs" or something along those lines?