🐛 Bug: Read from GitHub is really slow (~3 records per second) (pyairbyte, CLOSED)

aaronsteers commented on May 31, 2024

🐛 Bug: Read from GitHub is *really* slow (~3 records per second)

from pyairbyte.

Comments (3)

aaronsteers commented on May 31, 2024

This will be merged shortly:

Also - after some additional testing, I can report back that while the first stream in the 'select all' scenario has around 2-3 records per second, the performance for issues and pull requests is about 10x faster. 😌


aaronsteers commented on May 31, 2024

I've found that performance is much faster if we filter for just the streams we care about. For instance, selecting just issues and pull_requests gives about 10x the performance. Still not fast, but not a bug-level defect at that speed.

For tests and benchmarking, I'm going to start using airbytehq/quickstarts rather than airbytehq/airbyte.
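For anyone wanting to reproduce numbers like "2-3 records per second", here is a minimal sketch of how a records-per-second figure could be measured over any iterable of records. The names `records_per_second` and `fake_stream` are hypothetical helpers made up for this illustration; they are not part of PyAirbyte, and this is not necessarily how the figures in this thread were obtained.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def records_per_second(
    records: Iterable[dict],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[int, float]:
    """Consume `records` and return (record count, records per second).

    `clock` is injectable so the measurement can be tested deterministically.
    """
    start = clock()
    count = 0
    for _ in records:
        count += 1
    elapsed = clock() - start
    # Guard against a zero-length interval on very fast or empty streams.
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate


def fake_stream(n: int, delay: float = 0.0) -> Iterator[dict]:
    """Hypothetical stand-in for a stream of records (assumption, not a real connector)."""
    for i in range(n):
        if delay:
            time.sleep(delay)
        yield {"id": i}


if __name__ == "__main__":
    count, rate = records_per_second(fake_stream(20, delay=0.01))
    print(f"{count} records at {rate:.1f} records/sec")
```

Measuring each stream separately, rather than the whole sync, would make it easier to see which streams are the slow ones.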

Regarding DX:

The developer experience of auto-selecting all streams unless the user requests otherwise is probably not scalable, and it sets users up for a frustrating time. Other similar libraries, such as those in LangChain, require users to pick a single stream.

I'm going to suggest we fail if users have not requested any specific streams. The failure message will list what streams are available - so it's easy to remedy the omission. We can also add a "select_all_streams()" method so that if that's what the user wants, they can still quickly achieve it.

In the GitHub example, the recommended added step would be:

# Create the source as before
source_github = get_source(...)

# Add this step to pick the streams we want:
source_github.set_streams(["issues", "pull_requests"])

# Now we sync as usual
read_result = ...
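The fail-fast behavior proposed above can be sketched roughly as follows. This is a toy model under stated assumptions, not PyAirbyte's implementation: `SketchSource` and `NoStreamsSelectedError` are hypothetical names, and the method names `select_streams()` and `select_all_streams()` are borrowed from the discussion in this thread only for readability.

```python
from typing import Iterable, List, Optional


class NoStreamsSelectedError(Exception):
    """Hypothetical error raised when read() is called with no streams selected."""


class SketchSource:
    """Toy stand-in for a source; not the real PyAirbyte class."""

    def __init__(self, available_streams: Iterable[str]) -> None:
        self._available: List[str] = list(available_streams)
        self._selected: Optional[List[str]] = None

    def select_streams(self, streams: Iterable[str]) -> None:
        """Select specific streams, rejecting names the source does not offer."""
        requested = list(streams)
        unknown = [s for s in requested if s not in self._available]
        if unknown:
            raise ValueError(
                f"Unknown streams: {unknown}. Available: {self._available}"
            )
        self._selected = requested

    def select_all_streams(self) -> None:
        """Explicit opt-in for users who really do want every stream."""
        self._selected = list(self._available)

    def read(self) -> List[str]:
        """Fail fast if nothing was selected; otherwise return the selected names."""
        if not self._selected:
            raise NoStreamsSelectedError(
                "No streams selected. Call select_streams([...]) or "
                "select_all_streams() first. Available streams: "
                + ", ".join(self._available)
            )
        # A real source would sync records here; the sketch only echoes the selection.
        return list(self._selected)


if __name__ == "__main__":
    source = SketchSource(["issues", "pull_requests", "commits"])
    try:
        source.read()
    except NoStreamsSelectedError as err:
        print(err)
    source.select_streams(["issues", "pull_requests"])
    print(source.read())
```

Putting the available stream names in the error message is what makes the failure cheap to recover from: the user can copy the names they need straight into the selection call.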


aaronsteers commented on May 31, 2024

Confirmed today that our performance is back in an acceptable range, using the DuckDB default cache strategy. There are still some slow streams, but this is mitigated by now requiring users to run either select_streams() or select_all_streams().

Closing as resolved.

