Comments (8)
Hey, I'm adding my notes here on COPY support.
I think some of the points here could relate to #61 @jasonmp85 , if you see items in here that are relevant for #61, could you copy+paste them?
- How does one invoke the COPY operation? In cstore_fdw, we intercept the utility hook. If we have a COPY command, we then route that logic to our insert function.
- How do we process options (format, delimiter, null character, etc.) to copy? If we intercept the utility hook, this happens automatically.
- What happens when we observe a failure? I imagine two types of failures: (a) bad data, and (b) can't send request to any of the replicas. The first error happens much more frequently.
On the last item, this has been heavily discussed in the context of PostgreSQL too: https://wiki.postgresql.org/wiki/Error_logging_in_COPY
Proprietary databases that extend PostgreSQL usually set a threshold for COPY errors. For example, if the COPY observes 5 errors in one file (or 1% of rows), it stops altogether. Otherwise, COPY tries to continue data loading.
from pg_shard.
I think the most difficult problem to overcome will be "what happens when a replica fails partway through", not "what happens when you can't send data to any replica". Do we rollback entirely, or do we support partial data loads (which is a feature not directly supported by the existing COPY
interface within PostgreSQL)?
We can mark a shard as bad if it has a failure, but what about the other shards? Do we finish ingesting the data to them all? If so, how does the user fill in the missing data while omitting the shards that have already been processed? These are questions we'll need to answer for any usable implementation.
from pg_shard.
@marcocitus mentioned pgloader the other day… maybe we can look at it for inspiration re: partial failures or ignore-and-continue semantics.
from pg_shard.
Hi,
I'm following this issue and #61 with interest. Have you decided on the failure modes?
With the current version of pg_shard, we're planning to INSERT
one row at a time instead of using a COPY
trigger, because it's hard to recover from a failed COPY
trigger. I'd like to mirror your COPY
failure modes with our INSERT
failure modes, so that it'll be easier to migrate our INSERT
to a COPY
when the time comes.
from pg_shard.
What makes recovering from a failed COPY
trigger difficult? I believe we were careful to output the number of rows copied, which should allow a caller to resume at a certain row number to continue the operation. That's the short-term plan at the moment (output total number of rows copied, to reflect the contiguous successes from the beginning of the input file).
from pg_shard.
We couldn't safely parallelize different instances of the copy_to_insert
function, so we couldn't keep track of the count of rows copied. If we create the function only once, we get safe parallelism, but we lose the counts.
This answer is related to my comments on #61 last week. Maybe it'd be more on-topic there?
from pg_shard.
Yeah let's move there.
from pg_shard.
+1 I wanted to combine bdr with pg_shard to have a multi-cluster setup. But bdr uses copy for at least the initial data dump and thus preventing me from setting this up.
from pg_shard.
Related Issues (20)
- Allow IMMUTABLE functions in modification commands HOT 8
- DISTINCT and JSON field selection HOT 2
- Does pg_shard support user defined functions? HOT 3
- Does pg_shard encrypt the messages between nodes? HOT 3
- build issue HOT 2
- Support UNLOGGED tables HOT 2
- Schema support incomplete? HOT 1
- Support RETURNING in modification commands HOT 3
- sync-metadata does not support quoted column names HOT 1
- Zapier Test
- Default to CitusDB SELECT plan when within CitusDB
- Let CitusDB invalidate pg_shard's shard interval cache
- Make zero-shard SELECT consistent with CitusDB
- Support "UPSERT"
- Support prepared statements HOT 2
- 1.2.1 build error against PostgreSQL 9.4 HOT 5
- 1.2.1 build error against PostgreSQL 9.5 HOT 2
- cant read data from master [issue moved to pg_shard mailing list]
- Cannot read data in partition sets HOT 3
- Handling bigserial / auto-increment [Moved to mailing list] HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from pg_shard.