Comments (12)
The default in the loader was changed to autoCommit: true in ab820c9.
from tipoca-stream.
Message commit scenarios tested after upgrading the Redshift cluster:
- Test1: maxSize: 3, autoCommit: true ==> works ✅ (3 × 10k messages)
- Test2: maxSize: 100, autoCommit: true ==> works ✅ (100 × 10k messages)
- Test3: maxSize: 100, autoCommit: false ==> works ✅ (100 × 10k messages)
Cannot reproduce.
The weird thing is how the Redshift cluster is related to the Kafka commit problem. The only explanation I can think of is that the Kafka consumer session is expiring due to slowness in the resource-starved Redshift cluster. Since sarama's Commit() does not return an error, we cannot tell whether anything went wrong.
One more observation with autoCommit: false, maxSize: 100:
The first consumeClaim called Commit(), but the commit did not take effect.
When the manager started a fresh consumeClaim and called Commit(), it did.
Reproduced.
- Test4: maxSize: 1000, autoCommit: true ==> fails ❌ (1000 × 10k messages)
I0303 04:16:16.658142 1 load_processor.go:151] customers, offset: 3367, marking
I0303 04:16:16.658147 1 load_processor.go:158] customers, offset: 3367, marked and commited
I0303 04:16:16.658151 1 load_processor.go:679] customers, batchId:1, startOffset:2367, endOffset:3366: Processed
The next consumer session starts and the same thing happens: this big batch is not getting committed.
Issue reproduced with auto commit true.
Finally, at 04:26, the consumer group's committed offset started showing as 3367.
So the problem is the delay before a commit is reflected in Kafka when the batch size is huge, and this happens only when auto commit is true.
Solution
This is a distributed system problem and the loader code should be resilient.
- If auto commit is true, prevent reprocessing of a batch by checking whether the last committed offset has been reflected yet.
- Or simply use autoCommit=false and always commit explicitly. The loader defaults to false, so it is not an issue there, but the batcher defaults to true.
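The first option above can be sketched as a small in-process guard: remember the highest offset this process has already committed per topic-partition, and skip any batch whose end offset is not beyond it. Names here are illustrative, not the loader's actual API:

```go
package main

import "fmt"

// batchGuard remembers, per topic-partition, the highest offset this
// process has already committed (illustrative, not the loader's API).
type batchGuard struct {
	lastCommitted map[string]int64
}

func newBatchGuard() *batchGuard {
	return &batchGuard{lastCommitted: map[string]int64{}}
}

func tpKey(topic string, partition int32) string {
	return fmt.Sprintf("%s-%d", topic, partition)
}

// shouldProcess is false when we already committed past this batch,
// i.e. the broker just has not reflected the commit yet.
func (g *batchGuard) shouldProcess(topic string, partition int32, endOffset int64) bool {
	return endOffset > g.lastCommitted[tpKey(topic, partition)]
}

func (g *batchGuard) markCommitted(topic string, partition int32, endOffset int64) {
	if endOffset > g.lastCommitted[tpKey(topic, partition)] {
		g.lastCommitted[tpKey(topic, partition)] = endOffset
	}
}
```

This only protects against reprocessing within one process lifetime; it does not replace the commit itself.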
The issue was seen with maxSize=400 and autoCommit=true,
and then with maxSize=400 and autoCommit=false as well.
Deleting one of the three Kafka pods and retrying maxSize=3 with autoCommit=false
has started moving the offsets...
Tracing what happens in sarama's session.Commit(), I saw the following error:
offset_manager.go:340 kafka server: The provided member is not known in the current generation.
Then I found IBM/sarama#1310, which says you get this generation error when trying to commit from a closed session.
This is why big batches were not working.
Increasing the Kafka consumer timeouts should help; they should also be configurable.
cfg.Consumer.Group.Session.Timeout = 20 * time.Second
cfg.Consumer.Group.Heartbeat.Interval = 6 * time.Second
cfg.Consumer.MaxProcessingTime = 500 * time.Millisecond
(only maxProcessingTime was tried)
Also, sarama logging is turned off by the operator; it should be turned on so that this kind of tracing is not needed.
Increasing sarama's MaxProcessingTime
6ea5b68
This solves this issue.
The issue is still happening even after MaxProcessingTime was increased. It now happens irrespective of batch size.
Deleted the Kafka pod and things started working! Only to fail again.
Trying session retry (5cc9052) as mentioned in IBM/sarama#1685.
Things have been working well for us since re-establishing sessions. At present we think sessions were getting closed due to rebalancing, because every time it happens we see this error: loop check partition number coroutine will stop to consume.
https://stackoverflow.com/questions/39730126/difference-between-session-timeout-ms-and-max-poll-interval-ms-for-kafka-0-10
Every consumer in a consumer group is assigned one or more topic partitions exclusively, and Rebalance is the re-assignment of partition ownership among consumers.
A Rebalance happens when:
- a consumer JOINS the group
- a consumer SHUTS DOWN cleanly
- a consumer is considered DEAD by the group coordinator. This may happen after a crash, or when the consumer is busy with long-running processing, meaning no heartbeats have been sent by the consumer to the group coordinator within the configured session interval
- new partitions are added
Will keep an eye on which case exactly is triggering the rebalance; is it due to slow loader consumers?
This was due to #160.