sklearn-build-lambda's Introduction

sklearn-build-lambda

Building scikit-learn for AWS Lambda

This repo contains a build.sh script that's intended to be run in an Amazon Linux docker container, and build scikit-learn, numpy, and scipy for use in AWS Lambda. For more info about how the script works, and how to use it, see my blog post on deploying sklearn to Lambda.

An older version of this repo, now archived in the ec2-build-process branch, used an EC2 instance to perform the build and an Ansible playbook to execute it. That version still works, but the new dockerized version doesn't require you to launch a remote instance.

To build the zipfile, pull the Amazon Linux image and run the build script in it.

$ docker pull amazonlinux:2016.09
$ docker run -v $(pwd):/outputs -it amazonlinux:2016.09 \
      /bin/bash /outputs/build.sh

That will make a file called venv.zip in the local directory that's around 40MB.

Once you run this, you'll have a zipfile containing sklearn and its dependencies. To use them, add your handler file to the zip, and add the lib directory so it can be used for shared libs. The minimum viable sklearn handler would thus look like:

import os
import ctypes

for d, _, files in os.walk('lib'):
    for f in files:
        if f.endswith('.a'):
            continue
        ctypes.cdll.LoadLibrary(os.path.join(d, f))

import sklearn

def handler(event, context):
    # do sklearn stuff here
    return {'yay': 'done'}
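
The step of adding your handler file to the zip can be sketched in Python as follows. This is a minimal sketch, not part of the repo: the function name add_handler and the file name handler.py are hypothetical, and it assumes venv.zip was already produced by the build.

```python
import os
import zipfile


def add_handler(zip_path, handler_path):
    """Append handler_path to the root of the archive at zip_path."""
    arcname = os.path.basename(handler_path)
    # Mode "a" appends to the existing archive instead of overwriting it.
    with zipfile.ZipFile(zip_path, "a", zipfile.ZIP_DEFLATED) as zf:
        if arcname in zf.namelist():
            raise ValueError("%s is already in %s" % (arcname, zip_path))
        zf.write(handler_path, arcname)
    return arcname
```

Calling add_handler("venv.zip", "handler.py") would place handler.py at the top level of the zip, next to the packaged libraries.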

Extra Packages

To add extra packages to the build, create a requirements.txt file alongside the build.sh in this repo. All packages listed there will be installed in addition to sklearn, numpy, and related dependencies.

Sizing and Future Work

With just compression and stripped binaries, the full sklearn stack weighs in at 39 MB, and could probably be reduced further by:

  1. Pre-compiling all .pyc files and deleting their source
  2. Removing test files
  3. Removing documentation

For my purposes, 39 MB is sufficiently small. If you have any improvements to share, pull requests or issues are welcome.
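
The first two reductions listed above could be sketched like this. It is an untested-against-this-build sketch under stated assumptions: it assumes Python 3 (the legacy flag of compileall does not exist in Python 2.7), it assumes test code lives in directories named tests or test, and the function name slim_build_tree is hypothetical. Some packages import helpers from their test directories, so check that your imports still work afterwards.

```python
import compileall
import os
import shutil


def slim_build_tree(root):
    """Drop test directories, then replace .py sources with .pyc files.

    Returns the number of .py files removed.
    """
    # Remove test directories first so they are never compiled.
    for dirpath, dirnames, _ in os.walk(root):
        for name in list(dirnames):
            if name in ("tests", "test"):
                shutil.rmtree(os.path.join(dirpath, name))
                dirnames.remove(name)  # do not descend into the deleted directory

    # legacy=True writes foo.pyc next to foo.py rather than into __pycache__,
    # which is the layout Python needs to import a module without its source.
    if not compileall.compile_dir(root, quiet=1, legacy=True):
        return 0  # something failed to compile; keep all sources

    removed = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".py"):
                os.remove(os.path.join(dirpath, name))
                removed += 1
    return removed
```

The .pyc files are only valid for the Python version that compiled them, so this would need to run with the same version that Lambda uses.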

License

This project is MIT licensed. For license info on the numpy, scipy, and sklearn packages, see their respective sites. The full text of the MIT license is in LICENSE.txt.

sklearn-build-lambda's People

Contributors

ryansb

sklearn-build-lambda's Issues

no such option: --use-wheel

Whenever I run

$ docker run -v $(pwd):/outputs -it amazonlinux:2016.09 \
      /bin/bash /outputs/build.sh

most things seem to work, but then I get the message

no such option: --use-wheel

Then I try pip install wheel while in the sklearn-build-lambda directory (not sure that matters), and I get the response

Requirement already satisfied: wheel in /Users/spencer/anaconda3/lib/python3.6/site-packages

So wheel is on my local machine, but not in the Docker container. I've researched wheel and how to install python modules within Docker, but no luck. Any help is much appreciated!

windows

Hi,
do these instructions work on Windows 7?
Thanks

OSError: libquadmath.so.0: cannot open shared object file: No such file or directory

Hey there, I was having trouble with numpy in my serverless project and came across this repo/blog post. I lifted the version of numpy from your zip file, along with the C dependencies included in your lib folder. However, when I deploy my serverless function I'm getting this error:

Traceback (most recent call last):
  File "/var/task/add-return-items/handler.py", line 20, in handler
    ctypes.cdll.LoadLibrary(os.path.join(d, f))
  File "/usr/lib64/python2.7/ctypes/__init__.py", line 438, in LoadLibrary
    return self._dlltype(name)
  File "/usr/lib64/python2.7/ctypes/__init__.py", line 360, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libquadmath.so.0: cannot open shared object file: No such file or directory

which seems to be related to this for loop:

import ctypes

for d, _, files in os.walk('vendored/lib'):
    for f in files:
        if f.endswith('.a'):
            continue
        ctypes.cdll.LoadLibrary(os.path.join(d, f))

As far as I can tell, ctypes needs libquadmath.so.0; however, this is one of the files included in the lib folder I'm trying to load. Any ideas on how to address this?
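
One possible explanation, not confirmed anywhere in this thread, is load order: os.walk may hand ctypes a library that depends on libquadmath.so.0 before libquadmath.so.0 itself has been loaded. Under that assumption, a loader that retries failed libraries after the others have loaded might help. The sketch below is illustrative only; load_shared_libs is a hypothetical name, and the load parameter exists just so the loop can be exercised without real libraries.

```python
import ctypes
import os


def load_shared_libs(lib_dir, load=ctypes.cdll.LoadLibrary):
    """Load every shared library under lib_dir, retrying failures.

    Returns the list of paths that could not be loaded.
    """
    pending = []
    for d, _, files in os.walk(lib_dir):
        for f in files:
            if not f.endswith('.a'):  # skip static archives, as in the README loop
                pending.append(os.path.join(d, f))

    while pending:
        failed = []
        for path in pending:
            try:
                load(path)
            except OSError:
                failed.append(path)  # a dependency may not be loaded yet
        if len(failed) == len(pending):
            return failed  # no progress in this pass, so stop retrying
        pending = failed
    return []
```

In the handler this would replace the loop with load_shared_libs('vendored/lib'); a non-empty return value lists the libraries that still failed, which points to a cause other than ordering.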

how does this work

Hi

  1. thanks for making this, it really helps!
  2. I'm new to Lambda. When you say "to use them add your handler file to the zip, and add the lib directory so it can be used for shared libs", how exactly does that work? Say I create a main.py with a handler function: where do I place main.py?

lib file

Hello,
can you explain the sentence
"and add the lib directory so it can be used for shared libs"
what exactly is "lib"?

thanks.

Yum not working

Apologies in advance, I'm new to Docker.

I suspect this has nothing to do with your script, but I'm not able to run yum with the latest amazonlinux image. The same command works just fine if I run on the latest version of centos.

$ docker run -it amazonlinux yum
There was a problem importing one of the Python modules
required to run yum. The error leading to this problem was:

   No module named yum

Please install a package which provides this module, or
verify that the module is installed correctly.

It's possible that the above module doesn't match the
current version of Python, which is:
2.7.12 (default, Sep  1 2016, 22:14:00)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)]

If you cannot solve this problem yourself, please go to
the yum faq at:
  http://yum.baseurl.org/wiki/Faq

In fact, the python version seems pretty crippled as well:

$ docker run -it amazonlinux python -c 'import os.path'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named path

I'm running docker version 17.03.1-ce-mac5, and amazonlinux with image ID 8ae6f52035b5

failure: repodata/filelists.sqlite.bz2 from amzn-main: [Errno 256] No more mirrors to try.

Hi, thanks for setting this up. It looks very useful and promising for my team's Lambda use. I've been following the blog post below, but I am getting a few errors when I run the docker command.

https://serverlesscode.com/post/scikitlearn-with-amazon-linux-container/

docker run -v $(pwd):/outputs -it amazonlinux:2017.09 \
      /bin/bash /outputs/build.sh

Here is the stacktrace

 One of the configured repositories failed (amzn-main-Base),
 and yum doesn't have enough cached data to continue. At this point the only
 safe thing yum can do is fail. There are a few ways to work "fix" this:

     1. Contact the upstream for the repository and get them to fix the problem.

     2. Reconfigure the baseurl/etc. for the repository, to point to a working
        upstream. This is most often useful if you are using a newer
        distribution release than is supported by the repository (and the
        packages for the previous distribution release still work).

     3. Disable the repository, so yum won't use it by default. Yum will then
        just ignore the repository until you permanently enable it again or use
        --enablerepo for temporary usage:

            yum-config-manager --disable amzn-main

     4. Configure the failing repository to be skipped, if it is unavailable.
        Note that yum will try to contact the repo. when it runs most commands,
        so will have to try and fail each time (and thus. yum will be be much
        slower). If it is a very temporary problem though, this is often a nice
        compromise:

            yum-config-manager --save --setopt=amzn-main.skip_if_unavailable=true

failure: repodata/filelists.sqlite.bz2 from amzn-main: [Errno 256] No more mirrors to try.
http://packages.us-west-1.amazonaws.com/2017.09/main/154a6dd467e2/x86_64/repodata/filelists.sqlite.bz2?instance_id=fail&region=HTTPError: [Errno 12] Timeout on http://packages.us-west-1.amazonaws.com/2017.09/main/154a6dd467e2/x86_64/repodata/filelists.sqlite.bz2?instance_id=fail&region=HTTPError: (28, 'Operation too slow. Less than 1000 bytes/sec transferred the last 5 seconds')
http://packages.eu-west-1.amazonaws.com/2017.09/main/154a6dd467e2/x86_64/repodata/filelists.sqlite.bz2?instance_id=fail&region=HTTPError: [Errno 12] Timeout on http://packages.eu-west-1.amazonaws.com/2017.09/main/154a6dd467e2/x86_64/repodata/filelists.sqlite.bz2?instance_id=fail&region=HTTPError: (28, 'Operation too slow. Less than 1000 bytes/sec transferred the last 5 seconds')
http://packages.ap-northeast-1.amazonaws.com/2017.09/main/154a6dd467e2/x86_64/repodata/filelists.sqlite.bz2?instance_id=fail&region=HTTPError: [Errno 12] Timeout on http://packages.ap-northeast-1.amazonaws.com/2017.09/main/154a6dd467e2/x86_64/repodata/filelists.sqlite.bz2?instance_id=fail&region=HTTPError: (28, 'Operation too slow. Less than 1000 bytes/sec transferred the last 5 seconds')
http://packages.ap-northeast-2.amazonaws.com/2017.09/main/154a6dd467e2/x86_64/repodata/filelists.sqlite.bz2?instance_id=fail&region=HTTPError: [Errno 12] Timeout on http://packages.ap-northeast-2.amazonaws.com/2017.09/main/154a6dd467e2/x86_64/repodata/filelists.sqlite.bz2?instance_id=fail&region=HTTPError: (28, 'Operation too slow. Less than 1000 bytes/sec transferred the last 5 seconds')
http://packages.sa-east-1.amazonaws.com/2017.09/main/154a6dd467e2/x86_64/repodata/filelists.sqlite.bz2?instance_id=fail&region=HTTPError: [Errno 12] Timeout on http://packages.sa-east-1.amazonaws.com/2017.09/main/154a6dd467e2/x86_64/repodata/filelists.sqlite.bz2?instance_id=fail&region=HTTPError: (28, 'Operation too slow. Less than 1000 bytes/sec transferred the last 5 seconds')
http://packages.ap-southeast-2.amazonaws.com/2017.09/main/154a6dd467e2/x86_64/repodata/filelists.sqlite.bz2?instance_id=fail&region=HTTPError: [Errno 12] Timeout on http://packages.ap-southeast-2.amazonaws.com/2017.09/main/154a6dd467e2/x86_64/repodata/filelists.sqlite.bz2?instance_id=fail&region=HTTPError: (28, 'Operation too slow. Less than 1000 bytes/sec transferred the last 5 seconds')
http://packages.eu-central-1.amazonaws.com/2017.09/main/154a6dd467e2/x86_64/repodata/filelists.sqlite.bz2?instance_id=fail&region=HTTPError: [Errno 12] Timeout on http://packages.eu-central-1.amazonaws.com/2017.09/main/154a6dd467e2/x86_64/repodata/filelists.sqlite.bz2?instance_id=fail&region=HTTPError: (28, 'Operation too slow. Less than 1000 bytes/sec transferred the last 5 seconds')
http://packages.us-west-2.amazonaws.com/2017.09/main/154a6dd467e2/x86_64/repodata/filelists.sqlite.bz2?instance_id=fail&region=HTTPError: [Errno 12] Timeout on http://packages.us-west-2.amazonaws.com/2017.09/main/154a6dd467e2/x86_64/repodata/filelists.sqlite.bz2?instance_id=fail&region=HTTPError: (28, 'Operation too slow. Less than 1000 bytes/sec transferred the last 5 seconds')
http://packages.ap-southeast-1.amazonaws.com/2017.09/main/154a6dd467e2/x86_64/repodata/filelists.sqlite.bz2?instance_id=fail&region=HTTPError: [Errno 12] Timeout on http://packages.ap-southeast-1.amazonaws.com/2017.09/main/154a6dd467e2/x86_64/repodata/filelists.sqlite.bz2?instance_id=fail&region=HTTPError: (28, 'Operation too slow. Less than 1000 bytes/sec transferred the last 5 seconds')

Strip failed: BFD: /home/ec2-user/sklearn_build/lib64/python2.7/site-packages/numpy/.libs/stV4bcuq: Not enough room for program headers, try linking with -N

I tried running the script on an Amazon Linux instance but the strip part failed with a "Not enough room for program headers, try linking with -N" error.

BFD: /home/ec2-user/sklearn_build/lib64/python2.7/site-packages/numpy/.libs/stV4bcuq: Not enough room for program headers, try linking with -N
strip:/home/ec2-user/sklearn_build/lib64/python2.7/site-packages/numpy/.libs/stV4bcuq[.note.gnu.build-id]: Bad value

Do you have an idea how to fix this problem?

Thanks

$(pwd):

I followed along fine until this step:

$ docker run -v $(pwd):/outputs -it amazonlinux:2016.09 \
      /bin/bash /outputs/build.sh

My local path is e:/Science_git_sklearn-build-lambda. What is the right value for $(pwd)?
Thank you.
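
For context, $(pwd) is shell command substitution: the shell replaces it with the current working directory before docker runs. A minimal Python sketch of the same command, built without relying on the shell, is shown below. The function name docker_build_command is hypothetical, and whether Docker on Windows accepts a path such as the one above in this form depends on the Docker setup and is not confirmed here.

```python
import os


def docker_build_command(host_dir=None, image="amazonlinux:2016.09"):
    """Return the docker argument list with host_dir mounted at /outputs."""
    # Equivalent of $(pwd): default to the absolute current working directory.
    host_dir = os.path.abspath(host_dir or os.getcwd())
    return [
        "docker", "run",
        "-v", "%s:/outputs" % host_dir,
        "-it", image,
        "/bin/bash", "/outputs/build.sh",
    ]
```

Run from inside the repo directory, the returned list can be passed to subprocess.call to start the build.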

removing test/doc files

  1. Thanks for making this, it really helps!
  2. I now have a situation where I need numpy and some other libraries, and all of this built with your docker solution is still too large. I just deleted the tests via a crude find . -name 'test*' -exec rm {} \; and now the zip file seems small enough. Would it be an idea to add this to the bash script? I'd be happy to have a go at it (but I have little experience with docker).
  3. Would a similar solution for docs perhaps also be an idea?

How can I extend this to Python 3.6 ?

AWS Lambda now supports Python 3.6 and I want to use it, but I'm not sure whether it is simple to extend this build to use Python 3.6 instead.

Any guidance would be appreciated.

Thanks.
