
Comments (9)

aaronsteers commented on June 9, 2024

@betizad - I think I see the issue here. From the logs, I see you are running in Databricks/Spark, and their runtime apparently does not support the venv library - or the venv CLI is not findable.

I'm glad to hear you have a temporary workaround, but we'd still like to find a solution that works for Databricks users broadly.

(Updated the title of this issue to reflect what I now think is the root cause.)

Can you provide the specifics to your runtime?

And can you try the workaround which we applied to Colab?

In Colab, our examples like this one start with !apt-get install -qq python3.10-venv. I'm not confident this same workaround would work on Databricks, but it seems worth trying.
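One way to tell up front whether a given runtime will hit this failure is to probe the venv machinery directly. This is a minimal stdlib-only sketch (not part of PyAirbyte itself) that mirrors the `python -m venv` subprocess call from the traceback below:

```python
import subprocess
import sys
import tempfile


def venv_available() -> bool:
    """Return True if `python -m venv` can create an environment here.

    A non-zero exit code is the same failure mode seen in the
    Databricks logs in this thread.
    """
    with tempfile.TemporaryDirectory() as tmp:
        result = subprocess.run(
            [sys.executable, "-m", "venv", f"{tmp}/probe"],
            capture_output=True,
        )
        return result.returncode == 0
```

Running `venv_available()` in a notebook cell before calling `get_source()` would show whether the runtime supports venv creation at all.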

from pyairbyte.

mattppal commented on June 9, 2024

I'm running into a similar problem on another platform (Replit).

Replit is built on Nix and I suspect there are some permissions / config issues with trying to install venvs into the project folder.

ERROR: Can not perform a '--user' install. User site-packages are not visible in this virtualenv.

My workaround:

  import os
  import airbyte as ab

  SOURCE_GOOGLE_SHEETS = "source-google-sheets"
  
  source = ab.get_source(
      name=SOURCE_GOOGLE_SHEETS,
      local_executable=f".pythonlibs/bin/{SOURCE_GOOGLE_SHEETS}"
  )

  source.set_config({
      "credentials": {
          "auth_type": "Service",
          "service_account_info": os.environ["SERVICE_ACCOUNT_JSON"]
      },
      "spreadsheet_id": SPREADSHEET_ID
  })

Of course, that presents its own challenges, because now there are dependency issues 😅

Would love to find a solution for environments with challenging venv configurations.


aaronsteers commented on June 9, 2024

@betizad - thanks for creating this issue!

Have you tried skipping the install of the connector? PyAirbyte is able to install your connectors in their own dedicated virtual environments and it does this by default in order to prevent version conflicts.
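For context, the per-connector isolation PyAirbyte performs is conceptually just this (a stdlib sketch of the idea, not PyAirbyte's actual code; paths assume a POSIX layout):

```python
import venv
from pathlib import Path


def create_connector_env(name: str, base_dir: str = ".", with_pip: bool = True) -> Path:
    """Create a dedicated virtualenv for one connector and return the
    path where its CLI entry point would land after `pip install`."""
    env_dir = Path(base_dir) / f".venv-{name}"
    # with_pip=True runs ensurepip - the step that fails on runtimes
    # without full venv support, as in the traceback later in this thread.
    venv.EnvBuilder(with_pip=with_pip).create(env_dir)
    return env_dir / "bin" / name
```

Each connector gets its own environment, so its pinned dependencies can't conflict with PyAirbyte's or with other connectors'.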


aaronsteers commented on June 9, 2024

Alternatively, you can use a tool like pipx to install your connector if it's available. Pipx installs each CLI application into its own isolated environment, but I haven't used it in a notebook environment before, so I can't say for sure whether it would work in your case.


betizad commented on June 9, 2024

I tried letting airbyte install the needed library, but it did not work. I get the following error:

source = airbyte.get_source('source-linkedin-ads', version="0.7.0")
AirbyteSubprocessFailedError: AirbyteSubprocessFailedError: Subprocess failed.
    Run Args: ['/local_disk0/.ephemeral_nfs/envs/pythonEnv-cb8b9359-8554-45c6-bb46-ae96bbd591dd/bin/python', '-m', 'venv', '/home/spark-cb8b9359-8554-45c6-bb46-ae/.venv-source-linkedin-ads']
    Exit Code: 1

My current workaround is:

  • First install source-linkedin-ads, then airbyte.
  • Ignore the warnings and errors.
  • Then get the executable for LinkedIn: LINKEDIN_EXEC = subprocess.Popen("which source-linkedin-ads", shell=True, stdout=subprocess.PIPE).stdout.read().decode().replace("\n","")
  • Use the executable instead of installing automatically: source = airbyte.get_source('source-linkedin-ads', local_executable=LINKEDIN_EXEC)


betizad commented on June 9, 2024

@betizad - I think I see the issue here. From the logs, I see you are running in Databricks/Spark, and their runtime apparently does not support the venv library - or the venv CLI is not findable.

I'm glad to hear you have a temporary workaround, but we'd still like to find a solution that works for Databricks users broadly.

(Updated the title of this issue to reflect what I now think is the root cause.)

Can you provide the specifics to your runtime?

And can you try the workaround which we applied to Colab?

In Colab, our examples like this one start with !apt-get install -qq python3.10-venv. I'm not confident this same workaround would work on Databricks, but it seems worth trying.

It took me a while to get back to this.

I'm using:
DBR 13.3LTS
Python '3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]'
pyAirbyte 0.10.4
airbyte-source-linkedin-ads 2.0.0

The workaround in colab does not work in DBX. If I run !apt-get install -qq python3.10-venv I get a permission error:

E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?


aaronsteers commented on June 9, 2024

@betizad and @mattppal - Thank you both for sharing more about your context and execution requirements.

I did a bit of digging (mostly ChatGPT 🙄) and I believe I've confirmed that in both the Spark and also the Replit runtimes, there is no ability to create an 'isolated virtual env' - which we would need to ensure proper dependency isolation.

If we don't want to roll the dice on a per-connector basis about whether the connectors will have conflicts with each other and/or with PyAirbyte or other libraries that you are using in these environments, I can think of two decent paths forward:

Option 1: Leverage Conda across connectors and PyAirbyte to align dependency versions

This requires net new work on the side of Airbyte, and it would (probably?) also require some work from the user in terms of interacting with Conda or building a Conda environment.

This has an added benefit of streamlining usage in other environments that have Conda-based delivery integration - for instance with Snowflake's Snowpark Python runtime.

Option 2: Use a tool like Shiv or PyOxidizer to pre-build the connector executable

In this approach, we would design a process to build connectors into CLI executables - and the executable itself would handle delivery of dependencies and the needed environment isolation.

I believe this would work well in the case of Replit, where the executable would be uploaded to the Replit environment and then invoked/called by PyAirbyte. But getting this working correctly in a Spark cluster could be more complicated - since you'd need to ensure the CLI executable is available to all nodes in the cluster. (Not impossible, but also probably not a trivial effort.)
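To make Option 2 concrete, here's a rough sketch of what a shiv-based build step could look like in a Dockerfile. The package and entry-point names follow the connector naming convention above, but the exact invocation is an assumption on my part - I haven't verified it end to end:

```dockerfile
# Bundle the connector plus all of its dependencies into a single
# self-contained .pyz executable; -c names the console-script entry point.
RUN pip install --no-cache-dir shiv && \
    shiv -c source-github -o /usr/local/bin/source-github.pyz airbyte-source-github

# Sanity check: the bundled executable should respond to `spec`
RUN /usr/local/bin/source-github.pyz spec
```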

@betizad and @mattppal - I'm curious about your thoughts on both of these approaches. Let me know if one or both seem like they could be a good fit, and/or if you have any other ideas not mentioned above.

Thanks! 🙏


aaronsteers commented on June 9, 2024

Circling back to this thread - A few other runtimes have been requested since my last post.

Cethan in Slack has reported difficulty deploying on www.render.com, and separately we've made some progress getting this to work with Airflow.

The trick that worked in Airflow was to use a Dockerfile that handles the isolation of installing the connectors into their own virtualenvs:

# Pre-install the connector(s) in their own virtualenv
RUN python -m venv source_github && source source_github/bin/activate &&\
    pip install --no-cache-dir airbyte-source-github && deactivate

# ... repeat for other connectors ...

# Test that the executable works and we can find it
RUN source_github/bin/source-github spec

# Go ahead and install PyAirbyte as usual
RUN python -m venv pyairbyte_venv && source pyairbyte_venv/bin/activate &&\
    pip install --no-cache-dir airbyte==0.10.4 && deactivate

If pipx is preinstalled on the image, this is slightly easier:

# pipx handles the virtual-env and auto-adds the connector CLI to PATH:
RUN pipx install airbyte-source-github
RUN pipx install airbyte-source-faker

# Test that the executables work and we can find them on PATH
RUN source-github spec
RUN source-faker spec

# Go ahead and install PyAirbyte as usual
RUN python -m venv pyairbyte_venv && source pyairbyte_venv/bin/activate &&\
    pip install --no-cache-dir airbyte==0.10.4 && deactivate


aaronsteers commented on June 9, 2024

Hello, @betizad, @mattppal -

Circling back here again. 👋

Very happy to announce that we have a new "yaml" installation option that works for ~135 different API source connectors - along with all custom connectors built with our no-code Connector Builder. We're also investing heavily in migrating python connectors to the no-code/low-code framework, which means the number of supported connectors will continue to grow.

Here is a Loom I recorded to walk through the feature:

