thebusby / iota
A simple IO library for using Clojure's reducers
Hi,
Simply trying to create a chunk-seq on a ~10 GB file, I get an IndexOutOfBoundsException.
I can read the file using other functions.
Versions used are:
See the stacktrace below:
user=> (iota/chunk-seq "/path/to/a_big_file.csv" 1024)
IndexOutOfBoundsException java.nio.Buffer.checkBounds (Buffer.java:567)
user=> *e
#error {
:cause nil
:via
[{:type java.lang.IndexOutOfBoundsException
:message nil
:at [java.nio.Buffer checkBounds "Buffer.java" 567]}]
:trace
[[java.nio.Buffer checkBounds "Buffer.java" 567]
[java.nio.ByteBuffer get "ByteBuffer.java" 686]
[java.nio.DirectByteBuffer get "DirectByteBuffer.java" 285]
[iota.Mmap get "Mmap.java" 79]
[iota.FileRecordSeq nextChunkEnd "FileRecordSeq.java" 116]
[iota.FileRecordSeq split "FileRecordSeq.java" 139]
[iota.FileChunkSeq first "FileChunkSeq.java" 35]
[clojure.lang.RT nthFrom "RT.java" 942]
[clojure.lang.RT nth "RT.java" 897]
[clojure.core$print_sequential invokeStatic "core_print.clj" 58]
[clojure.core$fn__6072 invokeStatic "core_print.clj" 153]
[clojure.core$fn__6072 invoke "core_print.clj" 153]
[clojure.lang.MultiFn invoke "MultiFn.java" 233]
[clojure.tools.nrepl.middleware.pr_values$pr_values$fn$reify__907 send "pr_values.clj" 35]
[clojure.tools.nrepl.middleware.interruptible_eval$evaluate$fn__941$fn__954 invoke "interruptible_eval.clj" 113]
[clojure.main$repl$read_eval_print__7408 invoke "main.clj" 241]
[clojure.main$repl$fn__7417 invoke "main.clj" 258]
[clojure.main$repl invokeStatic "main.clj" 258]
[clojure.main$repl doInvoke "main.clj" 174]
[clojure.lang.RestFn invoke "RestFn.java" 1523]
[clojure.tools.nrepl.middleware.interruptible_eval$evaluate$fn__941 invoke "interruptible_eval.clj" 87]
[clojure.lang.AFn applyToHelper "AFn.java" 152]
[clojure.lang.AFn applyTo "AFn.java" 144]
[clojure.core$apply invokeStatic "core.clj" 646]
[clojure.core$with_bindings_STAR_ invokeStatic "core.clj" 1881]
[clojure.core$with_bindings_STAR_ doInvoke "core.clj" 1881]
[clojure.lang.RestFn invoke "RestFn.java" 425]
[clojure.tools.nrepl.middleware.interruptible_eval$evaluate invokeStatic "interruptible_eval.clj" 85]
[clojure.tools.nrepl.middleware.interruptible_eval$evaluate invoke "interruptible_eval.clj" 55]
[clojure.tools.nrepl.middleware.interruptible_eval$interruptible_eval$fn__986$fn__989 invoke "interruptible_eval.clj" 222]
[clojure.tools.nrepl.middleware.interruptible_eval$run_next$fn__981 invoke "interruptible_eval.clj" 190]
[clojure.lang.AFn run "AFn.java" 22]
[java.util.concurrent.ThreadPoolExecutor runWorker "ThreadPoolExecutor.java" 1142]
[java.util.concurrent.ThreadPoolExecutor$Worker run "ThreadPoolExecutor.java" 617]
[java.lang.Thread run "Thread.java" 745]]}
I'm working with a synthetic dataset that I generated. It's about 2 GB, with about 496 characters per field and 40 tab-delimited fields per line. Lines are newline-separated.
Much to my surprise, iota just chokes on it. I'm using 1.1.2, although it looks like FileSeq hasn't changed apart from the chunked fileseq support. The issues and context are detailed at:
https://gist.github.com/joinr/5cbdaf6c66924129f398. I assume that the issues persist in 1.3, but I have not tested that yet (for my use case, I believe the seq functionality and buffer size are unchanged).
At least one concrete issue is that the default buffer size is too large (for my machine at least; I may be an outlier). I know the standard libs default to a much smaller buffer. On my machine, the iota default just kills performance. I understand that the caller can set the buffer size, but that ends up being non-obvious. There are perhaps deeper performance issues, described in the gist, that show up when comparing performance with alternate implementations.
Thanks for iota.
The following does what I expect:
(let [f "iotavectest-succeed"]
  (spit f "0\n1")
  (let [s (iota/seq f)
        v (iota/vec f)]
    (doseq [xs [s v]]
      (println (type xs))
      (doseq [x xs]
        (printf "|%s|\n" x)))))
; iota.FileSeq
; |0|
; |1|
; iota.FileVector
; |0|
; |1|
The following (using an arbitrary separator) does not do what I expect:
(let [f "iotavectest-fail"]
  (spit f "0.1")
  (let [s (iota/seq f 0x40000 \.)
        v (iota/vec f 10 \.)]
    (doseq [xs [s v]]
      (println (type xs))
      (doseq [x xs]
        (printf "|%s|\n" x)))))
; iota.FileSeq
; |0|
; |1|
; iota.FileVector
; IndexOutOfBoundsException getLine() failure: Chunk #0's [1] is not within 0...1  iota.FileVector.getLine (FileVector.java:152)
; |0.1|
My assumption is that I should be able to use an arbitrary byte other than newline to delimit records/lines in my file. Am I mistaken?
Thanks!
Hi, I'm tempted to use Iota but I have large gzipped files.
Does Iota support this?
Thanks,
Avram
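Whether iota itself handles gzipped input is not stated in this thread. As a fallback that does not use iota at all, here is a minimal sketch that streams a gzipped text file line by line with the JDK's GZIPInputStream and folds over the lines eagerly. The function name reduce-gzip-lines is hypothetical. This is a plain sequential reduce, not a memory-mapped or parallel one.

```clojure
(require '[clojure.java.io :as io])
(import 'java.util.zip.GZIPInputStream)

(defn reduce-gzip-lines
  "Hypothetical helper (not part of iota): reduces f over the lines of the
  gzipped text file at path, starting from init. The reduce is eager, so it
  finishes before with-open closes the reader."
  [f init path]
  (with-open [rdr (io/reader (GZIPInputStream. (io/input-stream path)))]
    (reduce f init (line-seq rdr))))

;; Example: count the lines of a gzipped file.
;; (reduce-gzip-lines (fn [n _line] (inc n)) 0 "/path/to/file.gz")
```

Because the data is decompressed as a stream, memory use stays bounded by the line length rather than the file size, at the cost of random access.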
(def mem-file (iota/vec "path/to/moby-dic.txt"))
(defn words [coll]
  (mapcat
    (fn [line]
      (re-seq #"[a-z]+" line))
    coll))
(frequencies (words mem-file))
I get a
NullPointerException java.util.regex.Matcher.getTextLength (Matcher.java:1234)
But if I call
(words mem-file)
I get a beautiful list of words.
I guess it has something to do with the machinery of lazy sequences in Clojure, which get calculated in batches.
So when there's no lazy seq to return, this doesn't happen.
Otherwise java.util.regex.Matcher.getTextLength gets called, and this mmap-based representation of the file returns a null pointer.
But I didn't look at the code; these are wild guesses on my side.
Can you help me?
Thanks
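One possible cause, which is an assumption and not confirmed in this thread: calling re-seq on nil throws a NullPointerException from java.util.regex.Matcher, so the same error would appear if any element of the collection is nil (for example, if empty lines come back as nil). The sketch below uses a plain vector instead of an iota vector to show a nil-tolerant variant. The name words-nil-safe is hypothetical.

```clojure
(defn words-nil-safe
  "Like words, but treats a nil line as an empty string so that re-seq
  never receives nil."
  [coll]
  (mapcat (fn [line] (re-seq #"[a-z]+" (or line ""))) coll))

;; A plain vector with a nil element stands in for the file contents.
(frequencies (words-nil-safe ["call me ishmael" nil "me"]))
;; => {"call" 1, "me" 2, "ishmael" 1}
```

If the error disappears with this variant, nil elements are the likely culprit. Evaluating only part of a lazy sequence can stop before the first nil is reached, which would explain why a partial result looks fine.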