Comments (3)
This will be merged shortly:
Also, after some additional testing, I can report that while the first stream in the 'select all' scenario syncs at around 2-3 records per second, the issues and pull requests streams are about 10x faster. 😌
from pyairbyte.
I've found that performance is much faster if we filter for just the streams we care about. For instance, selecting just issues and pull_requests gives about 10x the performance. Still not fast, but not a bug-level defect at that speed.
For tests and benchmarking, I'm going to start using `airbytehq/quickstarts` rather than `airbytehq/airbyte`.
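To compare the 'select all' case against a filtered selection, a small timing helper is enough. The following is a generic sketch, not part of PyAirbyte: `records_per_second` is a hypothetical name, and it assumes only that the records can be consumed as a plain Python iterable.

```python
import time
from typing import Any, Iterable


def records_per_second(records: Iterable[Any]) -> float:
    """Consume an iterable of records and return the observed throughput.

    Hypothetical helper for illustration; it is not a PyAirbyte function.
    """
    start = time.perf_counter()
    count = 0
    for _ in records:
        count += 1
    elapsed = time.perf_counter() - start

    if count == 0:
        # Nothing was read, so report zero throughput.
        return 0.0
    # Guard against a zero-length interval on very fast inputs.
    return count / elapsed if elapsed > 0 else float("inf")
```

Running it once per stream, rather than once over the whole sync, makes it easier to see which streams are the slow ones.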
Regarding DX:
Auto-selecting all streams unless the user requests otherwise is probably not scalable as a developer experience, and it sets users up for a frustrating time. Similar libraries, such as those in LangChain, require users to pick a single stream.
I'm going to suggest we fail if users have not requested any specific streams. The failure message will list what streams are available - so it's easy to remedy the omission. We can also add a "select_all_streams()" method so that if that's what the user wants, they can still quickly achieve it.
In the GitHub example, the recommended added step would be:
```python
# Create the source as before
source_github = get_source(...)

# Add this step to pick the streams we want:
source_github.set_streams(["issues", "pull_requests"])

# Now we sync as usual
read_result = ...
```
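The fail-fast behavior proposed above could look roughly like the sketch below. It is a standalone illustration under stated assumptions, not PyAirbyte's implementation: `StreamSelection`, `NoStreamsSelectedError`, and `require_selection` are hypothetical names, and only `set_streams` and `select_all_streams` come from this thread.

```python
from __future__ import annotations


class NoStreamsSelectedError(Exception):
    """Raised when a read is attempted before any stream was selected."""


class StreamSelection:
    """Hypothetical stand-in for a source's stream-selection state."""

    def __init__(self, available_streams: list[str]) -> None:
        self._available = list(available_streams)
        self._selected: list[str] = []

    def set_streams(self, streams: list[str]) -> None:
        # Reject names the source does not offer, and list the valid ones.
        unknown = [s for s in streams if s not in self._available]
        if unknown:
            raise ValueError(
                f"Unknown streams: {unknown}. "
                f"Available streams: {self._available}"
            )
        self._selected = list(streams)

    def select_all_streams(self) -> None:
        # Explicit opt-in for users who really do want everything.
        self._selected = list(self._available)

    def require_selection(self) -> list[str]:
        # Called before a read: fail with a message that shows the remedy.
        if not self._selected:
            raise NoStreamsSelectedError(
                "No streams selected. Call set_streams([...]) or "
                f"select_all_streams(). Available streams: {self._available}"
            )
        return list(self._selected)
```

Because the error message includes the available stream names, a user who forgot the selection step can copy the names they need straight from the traceback.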
from pyairbyte.
Confirmed today that our performance is back in an acceptable range, using the DuckDB default cache strategy. There are still some slow streams, but this is mitigated by now requiring users to run either `select_streams()` or `select_all_streams()`.
Closing as resolved.
from pyairbyte.