
Comments (17)

ejona86 commented on August 27, 2024

I've posted a warning to grpc-io, updated the 1.64.0 release notes, and notified Netty folks.

from grpc-java.

XN137 commented on August 27, 2024

another datapoint:

in our maven build we have this in the root pom:

<dependencyManagement>
(...)
      <dependency>
        <groupId>io.netty</groupId>
        <artifactId>netty-bom</artifactId>
        <version>4.1.109.Final</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>
      <dependency>
        <groupId>io.grpc</groupId>
        <artifactId>grpc-bom</artifactId>
        <version>1.64.0</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>
(...)
</dependencyManagement>

When upgrading to 4.1.111.Final, many tests (running on a Linux box) show errors like:

[2024-06-12T13:11:15.494Z] 13:11:15.361 [grpc-default-worker-ELG-3-22] WARN  io.grpc.netty.NettyServerStream - Exception processing message
[2024-06-12T13:11:15.494Z] io.grpc.StatusRuntimeException: INTERNAL: Encountered end-of-stream mid-frame
[2024-06-12T13:11:15.494Z] 	at io.grpc.Status.asRuntimeException(Status.java:525)
[2024-06-12T13:11:15.494Z] 	at io.grpc.internal.AbstractServerStream$TransportState.deframerClosed(AbstractServerStream.java:247)
[2024-06-12T13:11:15.494Z] 	at io.grpc.netty.NettyServerStream$TransportState.deframerClosed(NettyServerStream.java:139)
[2024-06-12T13:11:15.494Z] 	at io.grpc.internal.MessageDeframer.close(MessageDeframer.java:234)
[2024-06-12T13:11:15.494Z] 	at io.grpc.internal.MessageDeframer.deliver(MessageDeframer.java:301)
[2024-06-12T13:11:15.494Z] 	at io.grpc.internal.MessageDeframer.request(MessageDeframer.java:162)
[2024-06-12T13:11:15.494Z] 	at io.grpc.internal.AbstractStream$TransportState$1RequestRunnable.run(AbstractStream.java:264)
[2024-06-12T13:11:15.495Z] 	at io.grpc.netty.NettyServerStream$TransportState$1.run(NettyServerStream.java:174)
[2024-06-12T13:11:15.495Z] 	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
[2024-06-12T13:11:15.495Z] 	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
[2024-06-12T13:11:15.495Z] 	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
[2024-06-12T13:11:15.495Z] 	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:405)
[2024-06-12T13:11:15.495Z] 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994)
[2024-06-12T13:11:15.495Z] 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
[2024-06-12T13:11:15.495Z] 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[2024-06-12T13:11:15.495Z] 	at java.base/java.lang.Thread.run(Thread.java:829)

and the tests eventually time out...

upgrading to 4.1.110.Final does not run into these problems.
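Until the incompatibility is resolved, one workaround consistent with the observation above is to pin the Netty BOM back to the last known-good release in the root pom's dependencyManagement. A sketch mirroring the block quoted earlier; the version choice is based solely on this thread:

```xml
<!-- Pin the Netty BOM to the last release that, per this thread,
     does not exhibit the corruption with grpc-java 1.64.0. -->
<dependency>
  <groupId>io.netty</groupId>
  <artifactId>netty-bom</artifactId>
  <version>4.1.110.Final</version>
  <type>pom</type>
  <scope>import</scope>
</dependency>
```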


ejona86 commented on August 27, 2024

Given the recent changes to ByteBufs, and the fact that there's now corruption, I'm strongly suspicious of gRPC's NettyAdaptiveCumulator being a contributing factor. Once I can reproduce this, the first thing to try is to disable the cumulator and see if the problem resolves.


normanmaurer commented on August 27, 2024

@ejona86 so just to summarize: you agree that what we did in Netty is correct? I think it is, but wanted to double-check :)


ejona86 commented on August 27, 2024

@normanmaurer, I think Netty's new "duplicating a slice returns a slice" behavior is appropriate. The previous behavior confused me. I'll need to add a comment to netty/netty#12844. I think the failure is that a short-term workaround became a permanent fix (until it broke).


ejona86 commented on August 27, 2024

You wrote a lot, but very little about what we actually need to be told. The tests in grpc-java run fine with 4.1.111. We could run the wider tests that we do when upgrading Netty, but right now this report is essentially useless to us.

Could you at least tell us what you were doing when it hung, whether it was a reliable hang or random, and how long of a hang it is (1s, 100s?)? Since you are controlling the Netty version, it'd be helpful to know if 4.1.110 had the issue as well.

As a hint, in the past when we've seen random hangs on upgrade, it generally meant you need a newer netty-tcnative or it was a netty-tcnative bug.


ejona86 commented on August 27, 2024

Note: I was testing on Linux, so maybe this is a Mac problem.


lhotari commented on August 27, 2024

You wrote a lot, but very little about what we actually need to be told. The tests in grpc-java run fine with 4.1.111. We could run the wider tests that we do when upgrading Netty, but right now this report is essentially useless to us.

Sure, it doesn't have relevant details. You can ignore it in the meantime. :)

I created the issue also to see if others are running into similar issues in the upgrade. I'll share more relevant details when I get back to debugging the issue.


lhotari commented on August 27, 2024

@ejona86 Here's the reproducer with more details: https://github.com/lhotari/pulsar/blob/lh-grpc-netty-4.1.111/README.md

There are 2 cases which fail (described in the reproducer README).

The libraries used under the covers are jetcd 0.8.2, vertx-grpc 4.5.8, and grpc-java 1.64.0 (versions aligned with grpc-bom), plus Netty 4.1.111.Final (versions aligned with netty-bom, which also pins other libs such as netty-tcnative, at 2.0.65.Final).

I also tested on my Linux laptop (x86_64) and the problem reproduces in the same way as on Mac OS (arm64).


ejona86 commented on August 27, 2024

DO NOT USE NETTY 4.1.111 WITH GRPC. It appears there is corruption. There are no guarantees the corruption would be detected.


@lhotari, thanks for the reproducer. Unfortunately I can't easily run it because it requires docker. I've got podman and don't have time to fight getting rootless docker or spinning up a new VM at the moment.

Based on the logs (thanks!), I'm worried this is corruption. 00000201040000009d is 00 00020104 0000009d which means a 0x20104 byte message. And protobuf wouldn't start a message with a bunch of zeros (compare to the previous response 00 0000001c 0a1a08b2, and the 0a1a part). It is even more severe in the second log with 00 00020104 0000001b88c000018300000000001b000000017. That's a lot of zeros, which could make sense for protobuf bytes, but it wouldn't be the very start of the message.
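For readers following the hex: gRPC length-prefixes each message with 5 bytes, a 1-byte compressed flag plus a 4-byte big-endian length. A minimal sketch (a hypothetical helper, not grpc-java's actual deframer) that decodes the prefixes quoted above:

```java
// Sketch: decode the 5-byte gRPC message prefix (1 flag byte + 4-byte
// big-endian length). Illustrative helper only; grpc-java's real parsing
// lives in io.grpc.internal.MessageDeframer.
public class GrpcFramePrefix {
    /** Returns the message length encoded in bytes 1..4 of the prefix. */
    public static long messageLength(byte[] prefix) {
        if (prefix.length < 5) {
            throw new IllegalArgumentException("prefix must be at least 5 bytes");
        }
        return ((prefix[1] & 0xFFL) << 24)
             | ((prefix[2] & 0xFFL) << 16)
             | ((prefix[3] & 0xFFL) << 8)
             |  (prefix[4] & 0xFFL);
    }

    /** True if the compressed flag (bit 0 of the first byte) is set. */
    public static boolean isCompressed(byte[] prefix) {
        return (prefix[0] & 0x01) != 0;
    }

    public static void main(String[] args) {
        // The suspicious prefix from the log: 00 00020104 -> a 0x20104-byte message.
        byte[] corrupted = {0x00, 0x00, 0x02, 0x01, 0x04};
        System.out.println(Long.toHexString(messageLength(corrupted))); // 20104
        // The healthy earlier response: 00 0000001c -> a 28-byte message.
        byte[] healthy = {0x00, 0x00, 0x00, 0x00, 0x1c};
        System.out.println(messageLength(healthy)); // 28
    }
}
```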

000002 01 04 0000001b 88c0 looks like the response headers HTTP/2 frame that came immediately before. LENGTH=2, HEADERS (0x1), flags=END_HEADERS, streamid=27, indexed header :status 200, indexed dynamic header at position 3 (64 - 61 static headers).
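The 9-byte frame header being decoded by eye above can be parsed mechanically. A minimal sketch (a hypothetical helper class, not part of grpc-java or Netty) following the RFC 7540 §4.1 layout: 3-byte length, 1-byte type, 1-byte flags, 4-byte stream identifier with the high bit reserved:

```java
// Sketch: decode a 9-byte HTTP/2 frame header. Illustrative only.
public class Http2FrameHeader {
    public final int length;   // 24-bit payload length
    public final int type;     // e.g. 0x1 = HEADERS
    public final int flags;    // e.g. 0x4 = END_HEADERS
    public final int streamId; // 31-bit stream identifier

    public Http2FrameHeader(byte[] h) {
        if (h.length < 9) {
            throw new IllegalArgumentException("need 9 header bytes");
        }
        length = ((h[0] & 0xFF) << 16) | ((h[1] & 0xFF) << 8) | (h[2] & 0xFF);
        type = h[3] & 0xFF;
        flags = h[4] & 0xFF;
        streamId = ((h[5] & 0x7F) << 24) | ((h[6] & 0xFF) << 16)
                 | ((h[7] & 0xFF) << 8) | (h[8] & 0xFF);
    }

    public static void main(String[] args) {
        // The bytes from the log: 000002 01 04 0000001b
        byte[] raw = {0x00, 0x00, 0x02, 0x01, 0x04, 0x00, 0x00, 0x00, 0x1b};
        Http2FrameHeader h = new Http2FrameHeader(raw);
        // LENGTH=2, HEADERS (0x1), END_HEADERS (0x4), stream 27
        System.out.println(h.length + " " + h.type + " " + h.flags + " " + h.streamId);
    }
}
```

Running it on the logged bytes reproduces the reading given above: LENGTH=2, HEADERS, END_HEADERS, stream 27.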


sergiitk commented on August 27, 2024

I suspect NettyAdaptiveCumulator got broken by netty/netty#14093.
Because of netty/netty#12844, which never got resolved, we had a temporary workaround to correct reader/writer indexes:

// TODO(https://github.com/netty/netty/issues/12844): remove when we use Netty with
// the issue fixed.
// In certain cases, removing the CompositeByteBuf component, and then adding it back
// isn't idempotent. An example is provided in https://github.com/netty/netty/issues/12844.
// This happens because the buffer returned by composite.component() has out-of-sync
// indexes. Under the hood the CompositeByteBuf returns a duplicate() of the underlying
// buffer, but doesn't set the indexes.
//
// To get the right indexes we use the fact that composite.internalComponent() returns
// the slice() into the readable portion of the underlying buffer.
// We use this implementation detail (internalComponent() returning a *SlicedByteBuf),
// and combine it with the fact that SlicedByteBuf duplicates have their indexes
// adjusted so they correspond to the readable portion of the slice.
//
// Hence composite.internalComponent().duplicate() returns a buffer with the
// indexes that should've been on the composite.component() in the first place.
// Until the issue is fixed, we manually adjust the indexes of the removed component.

Without this fix, we saw rare but persistent HTTP/2 frame corruption in certain situations, similar to this one.

We should consider how to make it work with pre-4.1.111 and post-4.1.111 Netty versions. The NettyAdaptiveCumulator itself was introduced in #9558, and released in v1.51.0. At the moment unshaded Netty 4.1.111.Final can't be used with gRPC >= v1.51.0.


sergiitk commented on August 27, 2024

netty/netty#12844 has a very detailed description of what's wrong with CompositeByteBuf and the nature of our workaround. component(int) is returning slice(), which provides an incorrect view into the data. Our fix relied on the fact that slice().duplicate() returned correct indexes (IIRC because duplicate did proper unwrapping). Our workaround corrects ByteBuf indexes by getting them from the slice duplicate:

// Hence composite.internalComponent().duplicate() returns a buffer with the
// indexes that should've been on the composite.component() in the first place.
// Until the issue is fixed, we manually adjust the indexes of the removed component.
ByteBuf sliceDuplicate = composite.internalComponent(tailComponentIndex).duplicate();
newTail.setIndex(sliceDuplicate.readerIndex(), sliceDuplicate.writerIndex());

netty/netty#14093 changes that, and apparently the indexes we're now getting from the slice duplicate are wrong. Probably because the unwrapping was removed, but it's too early for conclusions. I won't be surprised if the root cause was actually fixed by that PR.

Relevant excerpt from netty/netty#12844 - read the issue description for the full context.


Let's explore the result of CompositeByteBuf.component(0):

###### component
Full:
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 2d 2d 2d 30 31 32 33 34                         |---01234        |
+--------+-------------------------------------------------+----------------+
Readable:
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 2d 2d 2d 30 31 32 33 34                         |---01234        |
+--------+-------------------------------------------------+----------------+
UnpooledDuplicatedByteBuf(ridx: 0, widx: 8, cap: 32, unwrapped: PooledUnsafeHeapByteBuf(ridx: 0, widx: 8, cap: 32))

Note that the component still contains the bytes discarded by the CompositeByteBuf(). This contradicts what's returned by internalComponent(), which returns the slice() under the hood:

###### internalComponent
Full:
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 30 31 32 33 34                                  |01234           |
+--------+-------------------------------------------------+----------------+
Readable:
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 30 31 32 33 34                                  |01234           |
+--------+-------------------------------------------------+----------------+
UnpooledSlicedByteBuf(ridx: 0, widx: 5, cap: 5/5, unwrapped: PooledUnsafeHeapByteBuf(ridx: 0, widx: 8, cap: 32))

What's even more interesting, if we call .duplicate() on the slice, it'll return the full view of the buffer, and with correctly offset indexes!

###### internalComponentDup
Full:
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 2d 2d 2d 30 31 32 33 34                         |---01234        |
+--------+-------------------------------------------------+----------------+
Readable:
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 30 31 32 33 34                                  |01234           |
+--------+-------------------------------------------------+----------------+
UnpooledDuplicatedByteBuf(ridx: 3, widx: 8, cap: 32, unwrapped: PooledUnsafeHeapByteBuf(ridx: 0, widx: 8, cap: 32))


ejona86 commented on August 27, 2024

@larry-safran informed me unit tests did fail with 4.1.111. I re-tried (twice; third time was the charm) and now I'm seeing the failures. And they are all in NettyAdaptiveCumulatorTest. I've also been able to see a RESOURCE_EXHAUSTED: gRPC message exceeds maximum size 16777216: 39256064 error from our interop tests, which is clear corruption. For those wondering, the corruption happens in the reading code path.

So this really seems triggered by netty/netty#14093, but only because of the workaround for netty/netty#12844. We're probably going to need Netty version-detection logic to avoid going through a corrupting code path. And it seems well past time to fix netty/netty#12844.
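A sketch of what such version-detection might look like, comparing a Netty version string against 4.1.111, the first release with the changed slice/duplicate behavior. This is purely illustrative, not the logic grpc-java ships; real code would likely consult io.netty.util.Version at runtime rather than a hard-coded string:

```java
// Sketch: gate a workaround on the resolved Netty version. Illustrative
// only; the 4.1.111 threshold comes from this thread's findings.
public class NettyVersionGate {
    /** Parses a "4.1.111.Final"-style string into {major, minor, patch}. */
    static int[] parse(String version) {
        String[] parts = version.split("\\.");
        return new int[] {
            Integer.parseInt(parts[0]),
            Integer.parseInt(parts[1]),
            Integer.parseInt(parts[2])
        };
    }

    /** True if the given version is 4.1.111 or later. */
    public static boolean sliceDuplicateBehaviorChanged(String version) {
        int[] v = parse(version);
        if (v[0] != 4) return v[0] > 4;
        if (v[1] != 1) return v[1] > 1;
        return v[2] >= 111;
    }

    public static void main(String[] args) {
        System.out.println(sliceDuplicateBehaviorChanged("4.1.110.Final")); // false
        System.out.println(sliceDuplicateBehaviorChanged("4.1.111.Final")); // true
    }
}
```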


ludoch commented on August 27, 2024

If you want another internal datapoint, you can look at failure in https://screenshot.googleplex.com/59NDiYjVoeWcd6N
solved with this rollback: cr/643373868


chrisvest commented on August 27, 2024

@ejona86 @larry-safran What are the release and back-port plans for the fix?


cfredri4 commented on August 27, 2024

We're probably going to need Netty version-detection logic to avoid going through a corrupting code path.

@ejona86 Not fully following all the details, but instead of detecting specific Netty versions, would it be possible to look at the slice and/or duplicate buffer indexes to detect the old behaviour and not apply the workaround?

