Git Product home page Git Product logo

trihydro / jpo-ode Goto Github PK

View Code? Open in Web Editor NEW

This project forked from usdot-jpo-ode/jpo-ode

1.0 1.0 0.0 99.1 MB

US Department of Transportation (USDOT) Intelligent Transportation Systems Operational Data Environment (ITS ODE). This is the main repository that integrates and coordinates ODE Submodules.

Dockerfile 0.17% HTML 2.39% Makefile 0.03% Shell 0.25% C 0.16% Java 96.20% Batchfile 0.06% JavaScript 0.11% CSS 0.06% Python 0.56%

jpo-ode's People

Contributors

0111sandesh avatar abey-yoseph avatar alexsobledotgov avatar bbrotsos avatar codygarver avatar dan-du-car avatar danrasband avatar dependabot[bot] avatar dmccoystephenson avatar drewjj avatar hmusavi avatar iyourshaw avatar jtbaird avatar mgarramo avatar michael7371 avatar mvs5465 avatar paulbourelly999 avatar paynebrandon avatar saikrishnabairamoni avatar schwartz-matthew-bah avatar snallamothu avatar southernsun avatar tonychen091 avatar tonyenglish avatar toryb1 avatar trevor-trou avatar

Stargazers

 avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

jpo-ode's Issues

Asn1EncodedDataRouter Reprocessing Data

Bug Report

We recently discovered that the Asn1EncodedDataRouter was re-processing messages that had been published to topic.Asn1EncoderOutput in our ODE deployment. We've been able to identify the root cause, and reproduce the issue in a new, test consumer subscribed to the same topic.

Symptoms

ODE Logs

Below is an excerpt from the ODE logs (the full 5,000 lines captured can be found in
ode_logs_abbreviated.txt:

[Lines 441-443]

2020-02-10 21:29:13 [pool-3-thread-1] WARN  ConsumerCoordinator - Auto offset commit failed for group AsnCodecRouterServiceController: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records. 
2020-02-10 21:29:14 [pool-3-thread-1] DEBUG MessageConsumer - Asn1EncoderConsumer consuming 406 message(s) 
2020-02-10 21:29:14 [pool-3-thread-1] DEBUG Asn1EncodedDataRouter - Consumed: <?xml version="1.0"?><OdeAsn1Data> ... <streamId>43afff7d-7c47-4941-b89e-9455e52d4687</streamId> ... <odeReceivedAt>2020-02-08T22:09:40.335Z</odeReceivedAt> ... <recordId>BB875239</recordId> ... </OdeAsn1Data> 

...

[Lines 3712-3714]

2020-02-10 21:38:24 [pool-3-thread-1] WARN  ConsumerCoordinator - Auto offset commit failed for group AsnCodecRouterServiceController: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records. 
2020-02-10 21:38:27 [pool-3-thread-1] DEBUG MessageConsumer - Asn1EncoderConsumer consuming 406 message(s) 
2020-02-10 21:38:27 [pool-3-thread-1] DEBUG Asn1EncodedDataRouter - Consumed: <?xml version="1.0"?><OdeAsn1Data> ... <streamId>43afff7d-7c47-4941-b89e-9455e52d4687</streamId> ... <odeReceivedAt>2020-02-08T22:09:40.335Z</odeReceivedAt> ... <recordId>BB875239</recordId> ... </OdeAsn1Data> 

On line 442, the Asn1EncodedDataRouter received 406 messages after polling the topic. The next time it polls the topic (on line 3713), it again consumed 406 messages (the same messages, in fact). To emphasize the point the the same messages are being re-processed by the Asn1EncodedDataRouter, I've included an abbreviated version of the first message consumed in both events (lines 443 and 3714). In each case, the first message processed has the same streamId, odeReceivedAt and recordId values.

AsnCodecRouterServiceController Group Status

To further illustrate that messages are being re-processed by the Asn1EncodedDataRouter, I queried the AsnCodecRouterServiceController group status, as shown below:

% ./kafka-consumer-groups.sh --bootstrap-server 10.145.9.204:9092 --describe --group AsnCodecRouterServiceController

GROUP                           TOPIC                   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                     HOST            CLIENT-ID
AsnCodecRouterServiceController topic.Asn1DecoderOutput 0          1892835         1892835         0               consumer-1-dacf3c3a-e5e5-42d2-af89-92cb49230b4c /172.18.0.1     consumer-1
AsnCodecRouterServiceController topic.Asn1EncoderOutput 0          4728            46788           42060           consumer-2-0e2fad59-4dc7-436a-9494-5719b6f9f540 /172.18.0.1     consumer-2

The consumer of topic.Asn1EncoderOutput is 42060 messages behind, apparently "stuck".

Root Cause

As indicated in the ODE logs (lines 441 and 3712), the auto offset commit is continually failing. This is because Kafka 0.10.1.0 has a default max.poll.interval.ms property value of 300000 (5 minutes). If the consumer takes longer than 5 minutes between polls(), Kafka will assume the consumer is down. Then when the consumer attempts an auto commit during the next poll(), it will fail. Since the consumer's progress in the topic's commit log isn't being recorded, on the next poll(), the consumer will receive the same messages it had processed previously.

So with the default max.poll.interval.ms value of 5 minutes, and a default max.poll.records value of 500, the Asn1EncodedDataRouter needs to be capable of processing up to 500 messages in a 5 minute period, between polls(). According to the ODE logs, it took the Asn1EncodedDataRouter 9:13 to process 406 messages. Since it's taking > 5 minutes, it's stuck re-processing data.

Possible solutions

The easy solution would be to adjust the max.poll.interval.ms and max.poll.records values accordingly (maybe 600000 and 100, respectively). I would also argue these (and other configuration values) shouldn't be hardcoded in MessageConsumer.java. Rather, they should be set in a configuration file that gets read at runtime.

Another solution that would probably be better long term would be to extract the "deposit to RSUs" logic into it's own module. I think this is the main reason it's taking 9+ minutes to process 406 messages. This would also be preferable, as deposits to the SDW are currently blocked by the logic that deposits messages to RSUs. The "deposit to SDW" logic is already abstracted away from the Asn1EncodedDataRouter, and exists in the jpo-ode-sdw-depositor module. I'd argue the Asn1EncodedDataRouter should behave similarly for RSU deposits. As it does with SDW messages, it could publish messages destined for RSUs to a topic like topic.RSUDepositorInput, then another module (say, jpo-ode-rsu-depositor) handles the details of pushing those messages out to RSUs.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.