Comments (18)
Thanks Mike. I have seen the java.lang.OutOfMemoryError: Direct buffer memory
on Linux too recently. I did some investigation by printing out the refCnt on each buffer and it seemed a little strange: buffers had high refCnt values (e.g. 5 or 6) which would often descend monotonically. I can replicate the bug in much the same way, by streaming a large local file.
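For context on those refCnt values: Netty ByteBufs are reference-counted, and a buffer is only returned to the pool once its count reaches zero. A minimal sketch of the retain/release contract (plain Java, a toy class for illustration, not Netty's actual implementation):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of Netty-style reference counting: each retain() must be
// balanced by a release(); the buffer is freed only when refCnt hits 0.
class RefCountedBuffer {
    private final AtomicInteger refCnt = new AtomicInteger(1); // starts at 1
    private boolean freed = false;

    RefCountedBuffer retain() {
        refCnt.incrementAndGet();
        return this;
    }

    // Returns true when this call dropped the count to zero and freed the buffer.
    boolean release() {
        int cnt = refCnt.decrementAndGet();
        if (cnt == 0) { freed = true; return true; }
        if (cnt < 0) throw new IllegalStateException("refCnt underflow");
        return false;
    }

    int refCnt() { return refCnt.get(); }
    boolean isFreed() { return freed; }
}

public class RefCntDemo {
    public static void main(String[] args) {
        RefCountedBuffer buf = new RefCountedBuffer();
        buf.retain().retain();              // e.g. two pipeline stages hold it
        System.out.println(buf.refCnt());   // 3
        buf.release();                      // counts descend monotonically...
        buf.release();
        System.out.println(buf.refCnt());   // 1
        buf.release();                      // ...and the buffer is freed at 0
        System.out.println(buf.isFreed());  // true
    }
}
```

On this model, a refCnt that climbs to 5 or 6 and then descends just means several stages retained the buffer; a leak is a buffer whose count never reaches zero before it goes out of reach.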
My current approach is to consider that the bug is in either yada (bY), manifold (bM), aleph (bA) or netty (bN).
It could be in bY, but the issue also manifests itself outside of multipart (iirc), since I've seen the same thing with large files streaming directly to a temp dir, with no multipart involved. The only yada code involved here is minimal. See https://github.com/juxt/yada/blob/master/dev/src/yada/dev/upload.clj which adds a :consumer that yada calls directly with the stream of Netty ByteBuf instances it gets from Aleph. So I'm not sure how the bug could be in yada.
At this stage I'd like to reach out to @ztellman for any wisdom he can impart but since this is OSS I can't exactly escalate this through management ;)
So let's assume it's a bug in manifold, aleph or netty. To discount manifold I'm working on a replacement for aleph and netty (undertow and xnio) on the undertow branch. This is nearly done and should point to whether it's bM or (bA or bN). In any case, it should be relatively straightforward to create a failing test for manifold+aleph to present to Zach. Or he might indicate that yada is not calling manifold correctly.
I will prioritise this issue for work over the Easter break.
Also see clj-commons/aleph#214
I have raised clj-commons/aleph#224 and will follow up in due course with a smaller failing test that doesn't involve yada. @mfikes thanks for this and sorry I haven't got a better answer for you yet.
Thanks @malcolmsparks. I appreciate your help!
Can you try -Dio.netty.allocator.numDirectArenas=0 as a JVM flag, and see if the issue disappears?
@ztellman Yes, I can confirm that that JVM flag causes the issue to disappear. (I tried it on OS X.)
Interestingly, I've seen that exact error on OS X, but never on Linux. I've used that flag to work around it on my dev box. Honestly, I never dug into it too much, because it didn't affect production. You might try -XX:MaxDirectMemorySize=1g (or whatever is appropriate for your machine) and see what happens. If that doesn't help, it might be worth talking to the Netty folks to see if there's an official workaround.
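For anyone following along, a sketch of how these two suggestions look on the command line (the jar name is hypothetical, and the memory cap is machine-dependent):

```shell
# Disable Netty's pooled direct arenas entirely, forcing it off the
# direct-buffer pooling path that triggers the symptom in this thread:
java -Dio.netty.allocator.numDirectArenas=0 -jar yada-app.jar

# Or cap direct memory, so the failure mode is an earlier, clearer OOM
# rather than unbounded native memory growth:
java -XX:MaxDirectMemorySize=1g -jar yada-app.jar
```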
I have some failing tests that I'll try over the weekend which stress the stack quite hard. I'll report on Sunday.
Malcolm Sparks
Director
Email: [email protected]
Web: https://juxt.pro
JUXT LTD.
Software Consulting, Delivery, Training
I've tried the MaxDirectMemorySize=1g before; it doesn't help (or at least, it only delays the problem).
And just to be clear, I've seen this issue with a vanilla Java Netty application, so I'm fairly sure the issue is not with the Clojure elements of the stack.
EDIT: Though it is possible that the Clojure stuff is exacerbating the problem, somehow.
That's good to know. Zach, thanks for your help here - netty still feels a bit like the 'dark arts' to me.
When you do your tests, you should add (aleph.netty/leak-detector-level! :paranoid) and see what sort of "I got GCed before my reference count went to zero" warnings you get.
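That Clojure call sets the level of Netty's ResourceLeakDetector. If it's easier to set from the outside, the same effect can be had with a JVM system property (the jar name is hypothetical; the property has been spelled io.netty.leakDetectionLevel in older Netty 4.x releases and io.netty.leakDetection.level in newer ones):

```shell
# Paranoid leak detection: every allocated ByteBuf is tracked, and Netty
# logs a warning with an allocation trace if one is GCed with refCnt > 0.
java -Dio.netty.leakDetectionLevel=paranoid -jar yada-app.jar
```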
I found this issue while working through my own unrelated problem with netty direct buffer pooling.
I don't use yada, and have only a passing familiarity with aleph, but I have been using clojure and netty for years so thought I might chip in some info.
My first observation is that the issue is intermittent. Sometimes memory usage would blow out almost immediately, sometimes after a minute or two of execution, and sometimes the message would be received fully with zero issue.
My second observation is that message consumption is very slow. Consuming a 1GB file fully takes about 10 minutes. In comparison a simple netty server with only the HTTP codec that drops all chunks on the floor will receive that same 1GB file in about 3 seconds.
Thirdly, I tried briefly reproducing this error with the latest versions of bidi/yada/aleph and all three trial runs completed without issue. Maybe I was lucky, or maybe this error has been scrubbed out somewhere recently.
Diagnosing:
There is some background info on Netty buffer pooling in my link above.
If we start a new REPL we will have the default max direct memory of 2GB, which results in 16 PoolArenas, each starting with zero PoolChunks.
(.directArenas (PooledByteBufAllocator/DEFAULT))
=>
[#object[io.netty.buffer.PoolArena$DirectArena
0x157c4c39
"Chunk(s) at 0~25%:
none
Chunk(s) at 0~50%:
none
Chunk(s) at 25~75%:
none
Chunk(s) at 50~100%:
none
Chunk(s) at 75~100%:
none
Chunk(s) at 100%:
none
tiny subpages:
small subpages:
"]
#object[io.netty.buffer.PoolArena$DirectArena
0x45cbb12c
"Chunk(s) at 0~25%:
none
Chunk(s) at 0~50%:
none
Chunk(s) at 25~75%:
none
Chunk(s) at 50~100%:
none
Chunk(s) at 75~100%:
none
Chunk(s) at 100%:
none
tiny subpages:
small subpages:
"]
...
...
At this point it's also a good idea to fire up jvisualvm with the Buffer Pools plugin.
We expect netty to allocate all the memory required for a single message from a single arena, since there's an event-loop thread -> pool arena ThreadLocal cache, so one event-loop thread always uses the same arena.
We also expect that we will likely only allocate a single 16MB (default) PoolChunk to that arena, since Netty caches buffers in the same ThreadLocal cache, and we intend to receive an http-chunk, drop it, release it, receive another http-chunk, etc. We should only really use a single ByteBuffer (and this is the behaviour we see with a bare Netty server with HttpServerCodec dropping all http-chunks received for a single full http request).
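The thread-to-arena affinity described above can be sketched in plain Java. This is illustrative only: Netty actually assigns each thread the least-used arena, whereas this toy uses round-robin, and the arena count of 16 is the default mentioned earlier in this thread:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of Netty's thread -> PoolArena affinity: each thread is pinned
// to one arena via a ThreadLocal, so all buffers allocated on one
// event-loop thread come from the same arena.
public class ArenaAffinity {
    static final int NUM_ARENAS = 16;  // default with 2GB max direct memory
    static final AtomicInteger NEXT = new AtomicInteger();
    static final ThreadLocal<Integer> ARENA =
        ThreadLocal.withInitial(() -> NEXT.getAndIncrement() % NUM_ARENAS);

    static int arenaFor() { return ARENA.get(); }

    public static void main(String[] args) throws InterruptedException {
        // The same thread always resolves to the same arena index:
        System.out.println(arenaFor() == arenaFor());  // true

        // A different thread is pinned to its own arena:
        Thread t = new Thread(() ->
            System.out.println("other thread's arena: " + arenaFor()));
        t.start();
        t.join();
    }
}
```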
If I then start a 1GB upload I see my direct memory usage (via visualvm) jump to 16MB as the PoolChunk is allocated, and for quite some time (several minutes generally) this remains the case. At any point we can check the state of the PoolArenas, and we see a single arena with a single PoolChunk, with a small amount of memory allocated (basically enough for one 64k buffer and a couple of smaller ones).
(.directArenas (PooledByteBufAllocator/DEFAULT))
=>
[#object[io.netty.buffer.PoolArena$DirectArena
0x31bd8884
"Chunk(s) at 0~25%:
Chunk(46d7fd67: 1%, 114688/16777216)
Chunk(s) at 0~50%:
none
Chunk(s) at 25~75%:
none
Chunk(s) at 50~100%:
none
Chunk(s) at 75~100%:
none
Chunk(s) at 100%:
none
tiny subpages:
16: (2049: 1/32, offset: 8192, length: 8192, elemSize: 256)
small subpages:
1: (2051: 1/8, offset: 24576, length: 8192, elemSize: 1024)
"]
Now, at some point the issue occurs, and all of the remaining http-chunks are read immediately into memory within a few seconds, with none of the allocated buffers being released.
I've no idea what causes this, but effectively it looks as if each remaining http-chunk in the request is read by netty and allocated to a ByteBuffer that is never released; the next http-chunk is then read and allocated, and so on. Remember netty is capable of performing that read for the entire 1GB in about 3s. This allocation without release is the large vertical spike in the image below.
It can also be seen by inspecting the PoolArena directly, where we see a slew of allocated PoolChunks:
(.directArenas (PooledByteBufAllocator/DEFAULT))
=>
[#object[io.netty.buffer.PoolArena$DirectArena
0x19bda70e
"Chunk(s) at 0~25%:
Chunk(81c0942: 5%, 688128/16777216)
Chunk(s) at 0~50%:
none
Chunk(s) at 25~75%:
none
Chunk(s) at 50~100%:
Chunk(21913aed: 54%, 9027584/16777216)
Chunk(44bc617a: 99%, 16760832/16777216)
Chunk(s) at 75~100%:
Chunk(3e5b6a9d: 99%, 16760832/16777216)
Chunk(s) at 100%:
Chunk(56e5320d: 100%, 16777216/16777216)
Chunk(7e8b2c6e: 100%, 16777216/16777216)
Chunk(599b2574: 100%, 16777216/16777216)
Chunk(3cf1b358: 100%, 16777216/16777216)
Chunk(43fb8b91: 100%, 16777216/16777216)
Chunk(643427dc: 100%, 16777216/16777216)
Chunk(64840eab: 100%, 16777216/16777216)
Chunk(55de0a10: 100%, 16777216/16777216)
Chunk(4a66b439: 100%, 16777216/16777216)
Chunk(f29606e: 100%, 16777216/16777216)
Chunk(4b1cd676: 100%, 16777216/16777216)
Chunk(6eb7db8f: 100%, 16777216/16777216)
Chunk(118f6d1c: 100%, 16777216/16777216)
Chunk(55614553: 100%, 16777216/16777216)
Chunk(786386ef: 100%, 16777216/16777216)
Chunk(28d851cf: 100%, 16777216/16777216)
Chunk(42d53e45: 100%, 16777216/16777216)
Chunk(1dc1b091: 100%, 16777216/16777216)
Chunk(79710938: 100%, 16777216/16777216)
Chunk(13cd79f0: 100%, 16777216/16777216)
...
...
...
At this point if you have sent a file > max direct memory size you will see an OOM.
In this test case I have sent a 1GB file (which somehow corresponds to almost 2GB memory usage; not entirely sure how that's doubled up - is @ztellman's byte-streams smart enough to convert a direct ByteBuf to a direct ByteBuffer?).
Anyway, since I have already processed some small amount of the message (pre-spike), I can fit the remaining amount in memory. You'll see from the image above that the yada listener continues to consume these allocated bytebuffers, unaware that the one-by-one http-chunk processing has been overwhelmed and the entire http request has been read into memory. It will complete successfully within about ten minutes.
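The two regimes described above (releases keeping pace vs. releases stalling) reduce to simple arithmetic over the default sizes mentioned in this thread (16MB PoolChunks, 2GB max direct memory). A back-of-envelope sketch, not Netty's allocator:

```java
public class ChunkMath {
    static final long MB = 1024 * 1024;
    static final long CHUNK = 16 * MB;          // default PoolChunk size
    static final long MAX_DIRECT = 2048 * MB;   // default max direct memory

    // Chunks alive at once if each http-chunk buffer is released before the
    // next is allocated: the pool just keeps reusing one chunk.
    static long chunksWithRelease() { return 1; }

    // Chunks needed if NO buffer is released while `total` bytes stream in.
    static long chunksWithoutRelease(long total) {
        return (total + CHUNK - 1) / CHUNK;     // ceiling division
    }

    public static void main(String[] args) {
        long oneGB = 1024 * MB;
        System.out.println(chunksWithRelease());           // 1 chunk, ~16MB resident
        System.out.println(chunksWithoutRelease(oneGB));   // 64 chunks, 1GB resident
        // A file larger than max direct memory cannot be held un-released:
        System.out.println(
            chunksWithoutRelease(4096 * MB) * CHUNK > MAX_DIRECT);  // true -> OOM
    }
}
```

This matches the observations: a healthy run holds one ~16MB chunk, the pathological run holds the whole file, and an OOM follows exactly when the file exceeds the direct memory cap.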
Hope some of this has been of use. As I said, I'm not sure if this issue persists since I couldn't reproduce it with the latest versions; if it does, then you have some more info to go forward with.
@d-t-w, many thanks for this very thorough write-up. Your description of the problem concurs with what I was seeing when testing with large files, and I failed in my attempt to get to the cause of the issue. Since then some dependencies have changed, but I still have the original file-upload code and some large files that I can test with, to see if the issue remains.
No problem @malcolmsparks. If you find the issue remains and you want help isolating it to netty or beyond, just ping me - happy to look at it further; I was just passing by yesterday.
I've done some testing with some large 4GB files today.
With the current yada master 3963d26, which uses aleph 0.1.4, I am not seeing the issue any more, either with or without -Dio.netty.allocator.numDirectArenas=0.
I'm only seeing about 10MB/s throughput, however, but that may be because I'm writing the file. I've seen yada stream an upload at about 1GB/s before, so I need to keep investigating. However, at least it's reliably working now, even if performance could be improved.
Ah, I discovered I was still printing out to the console on every buffer. Now I've removed that I'm up to about 80MB/s. I'm streaming a 4GB file, without chunked transfer encoding, and writing it to an NVMe SSD disk. I can stream the whole file this way in 52 seconds. For comparison, a cp manages the same thing in 7 secs on my system, so yada seems pretty fast now.
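As a sanity check on those figures (taking the 4GB file as 4096MB):

```java
public class Throughput {
    public static void main(String[] args) {
        double fileMB = 4 * 1024;  // 4GB upload, in MB
        // 4096MB / 52s ~= 79MB/s, matching the ~80MB/s observed via yada:
        System.out.printf("%.0f MB/s via yada%n", fileMB / 52);
        // 4096MB / 7s ~= 585MB/s for the local cp on the same disk:
        System.out.printf("%.0f MB/s via cp%n", fileMB / 7);
    }
}
```

So yada's streaming path is within an order of magnitude of a raw local copy on that hardware.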
Closing as per above comments.