andyferris / dictionaries.jl Goto Github PK
View Code? Open in Web Editor NEWAn alternative interface for dictionaries in Julia, for improved productivity and performance
License: Other
An alternative interface for dictionaries in Julia, for improved productivity and performance
License: Other
just getting to know Dictionaries
the very first thing which feels uncomfortable when coming from Base.Dict
is that there is no support for something like Dict(:a => 1, :b => 2)
.
If I understand the interfaces correctly, this would be rather easy to add to the dictionary
function, because so far it only supports a single argument call with something iterable.
Is there anything preventing the support of dictionary(:a => 1, :b => 2, ...)
?
Using v0.2.1
and julia v1.3.1
, I've run into a method error when trying to construct a dictionary from an array I'm pretty sure should work. Here's a reduced example:
julia> using Dictionaries
julia> a = ["a" => [1,2,3], "b" => ['a', 'b', 'c']]
2-element Array{Pair{String,Array{T,1} where T},1}:
"a" => [1, 2, 3]
"b" => ['a', 'b', 'c']
julia> b = convert(Array{Pair{String,_A} where _A,1}, a)
2-element Array{Pair{String,_A} where _A,1}:
Pair{String,Array{T,1} where T}("a", [1, 2, 3])
Pair{String,Array{T,1} where T}("b", ['a', 'b', 'c'])
julia> dictionary(a) # Works fine
2-element HashDictionary{String,Array{T,1} where T}
"b" │ ['a', 'b', 'c']
"a" │ [1, 2, 3]
julia> dictionary(b)
ERROR: MethodError: no method matching _dictionary(::Type{Pair{String,_A} where _A}, ::Array{Pair{String,_A} where _A,1})
Closest candidates are:
_dictionary(::Type{Pair{I,T}}, ::Any) where {I, T} at /Users/isaac/.julia/packages/Dictionaries/cZc5T/src/HashDictionary.jl:117
Stacktrace:
[1] dictionary(::Array{Pair{String,_A} where _A,1}) at /Users/isaac/.julia/packages/Dictionaries/cZc5T/src/HashDictionary.jl:111
[2] top-level scope at REPL[9]:1
julia> typeof(a)
Array{Pair{String,Array{T,1} where T},1}
julia> typeof(b)
Array{Pair{String,_A} where _A,1}
Ah, I think I see what's happening now:
julia> Tuple{eltype(b), typeof(b)} <: Tuple{Pair{I, T} where {I, T}, Any}
true
julia> Tuple{Type{eltype(b)}, typeof(b)} <: Tuple{Type{Pair{I, T} where {I, T}}, Any}
false
But I can't seem to fix it without making inference much worse. Hm.
dictionary(Dict(b))
does the trick as a workaround.
I think that the intersection of Indices
can be made trivially faster by ordering of arguments.
a = distinct(rand(Int, 10000))
b = distinct(rand(Int, 100))
julia> @btime intersect(a, b)
134.240 μs (8 allocations: 284.55 KiB)
0-element Indices{Int64}
julia> @btime intersect(b, a)
1.238 μs (5 allocations: 3.94 KiB)
0-element Indices{Int64}
julia> @btime length(a) > length(b) ? intersect(b, a) : intersect(a, b)
1.283 μs (6 allocations: 3.95 KiB)
0-element Indices{Int64}
Would it make sense?
It is time.
@JuliaRegistrator register
With dictionaries, I'm used to being able to write stuff like this:
d = Dict((i,j) => j^i for i in 1:10, j in ["a","b"])
d[1,"a"] # returns "a"
This isn't possible with this package:
d = dictionary((i,j) => j^i for i in 1:10, j in ["a","b"])
d[1,"a"]
ERROR: MethodError: no method matching getindex(::HashDictionary{Tuple{Int64,String},String}, ::Int64, ::String)
d[(1,"a")] # This is fine
Maybe this isn't possible or not what this package is designed to do, but I thought I'd bring it up.
It seems like "f is idempotent" is being used to mean f(..., x) == x
, but I understand it to mean (at least when describing a function) that f(f(x)) = f(x)
. I see that there is perhaps also a notion of "idempotent element of a set", which I am not familiar with.
Unfortunately, I do not have a term to suggest for this. "f is the identity function after partial application" doesn't roll off the tongue.
Any plan to implement samplings from Indices
and Dictionary
?
Do you have a plan to remove all return_type
from the API-visible part (including empty input)? Could you share your plan if you have one?
Also, I don't think it's possible to specify the eltype
for Iterators.map(f, d::AbstractDictionary)
. I'd suggest using EltypeUnknown
.
This came up on discourse. It's pretty important to have indices that are mutable for a lot of things methods to work (append!, push!, etc.), but there are times it's worth improving performance and losing mutating functions. I mentioned my work on StaticRanges.jl may be helpful. It provides mutable ranges as well with protective setproperty!
methods so that users don't accidently make bizarre changes to a mutable range.
This could be as simple as:
const StaticIndicesRange{T,L} = Indices{T,OneToSRange{T,L}}
const DynamicIndicesRange{T} = Indices{T,OneToMRange{T}}
const FixedIndicesRange{T} = Indices{T,OneTo{T}}
If this is something you're interested in I'd be interested in further discussing how to implement it.
Hello! I'm playing with this cool package on Christmas and happened upon an error.
The docstring for Dictionary
says that it is possible to construct a Dictionary from two arbitrary Julia iterables:
Dictionary(indices, values)
Construct a Dictionary <: AbstractDictionary from two arbitrary Julia iterables, one specifying the indices and one specifying the values. Lookup uses naive
iteration.
It looks like right now you can indeed construct such a dictionary where the keys and/or values are given as a UnitRange
, but then there's an error upon show
:
julia> Dictionary([1,2,3], [4,5,6])
3-element Dictionary{Int64,Int64,Array{Int64,1},Array{Int64,1}}
1 │ 4
2 │ 5
3 │ 6
julia> Dictionary(1:3, 4:6); # You can construct one, just not show it.
julia> Dictionary(1:3, 4:6)
3-element Dictionary{Int64,Int64,UnitRange{Int64},UnitRange{Int64}}Error showing value of type Dictionary{Int64,Int64,UnitRange{Int64},UnitRange{Int64}}:
ERROR: Every settable AbstractDictionary type must define a method for `isassigned`: Dictionary{Int64,Int64,UnitRange{Int64},UnitRange{Int64}}
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] isassigned(::Dictionary{Int64,Int64,UnitRange{Int64},UnitRange{Int64}}, ::Int64) at /Users/yurivish/.julia/packages/Dictionaries/mjUAn/src/AbstractDictionary.jl:260
[3] show(::IOContext{REPL.Terminals.TTYTerminal}, ::MIME{Symbol("text/plain")}, ::Dictionary{Int64,Int64,UnitRange{Int64},UnitRange{Int64}}) at /Users/yurivish/.julia/packages/Dictionaries/mjUAn/src/show.jl:76
[4] display(::REPL.REPLDisplay, ::MIME{Symbol("text/plain")}, ::Any) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/REPL/src/REPL.jl:132
[5] display(::REPL.REPLDisplay, ::Any) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/REPL/src/REPL.jl:136
[6] display(::Any) at ./multimedia.jl:323
[7] #invokelatest#1 at ./essentials.jl:709 [inlined]
[8] invokelatest at ./essentials.jl:708 [inlined]
[9] print_response(::IO, ::Any, ::Bool, ::Bool, ::Any) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/REPL/src/REPL.jl:156
[10] print_response(::REPL.AbstractREPL, ::Any, ::Bool, ::Bool) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/REPL/src/REPL.jl:141
[11] (::REPL.var"#do_respond#38"{Bool,REPL.var"#48#57"{REPL.LineEditREPL,REPL.REPLHistoryProvider},REPL.LineEditREPL,REPL.LineEdit.Prompt})(::Any, ::Any, ::Any) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/REPL/src/REPL.jl:719
[12] #invokelatest#1 at ./essentials.jl:709 [inlined]
[13] invokelatest at ./essentials.jl:708 [inlined]
[14] run_interface(::REPL.Terminals.TextTerminal, ::REPL.LineEdit.ModalInterface, ::REPL.LineEdit.MIState) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/REPL/src/LineEdit.jl:2306
[15] run_frontend(::REPL.LineEditREPL, ::REPL.REPLBackendRef) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/REPL/src/REPL.jl:1045
[16] run_repl(::REPL.AbstractREPL, ::Any) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/REPL/src/REPL.jl:201
[17] (::Base.var"#770#772"{Bool,Bool,Bool,Bool})(::Module) at ./client.jl:382
[18] #invokelatest#1 at ./essentials.jl:709 [inlined]
[19] invokelatest at ./essentials.jl:708 [inlined]
[20] run_main_repl(::Bool, ::Bool, ::Bool, ::Bool, ::Bool) at ./client.jl:366
[21] exec_options(::Base.JLOptions) at ./client.jl:304
[22] _start() at ./client.jl:460
Hi, sorry if this question is somehow inbetween an discourse question and an issue/feature request.
Is it possible to create a an empty AbstractDictionary
given the concrete type of the system?
I am asking because it seems within AbstractDictionaries
/AbstractDict
it is especially common to have totally different kind of constructors which hence do not provide a common interface.
I see that I can construct empty AbstractDictionary
, given an instance, but I wasn't able to find one which takes the type instead.
julia> using Dictionaries
julia> d = Dictionary([:a], [Dictionary([:b], [:c])])
1-element Dictionary{Symbol,Dictionary{Symbol,Symbol}}
:a │ {:b │ :c}
julia> values(d)
1-element Dictionary{Symbol,Dictionary{Symbol,Symbol}}
:a │ {:b │ :c}
julia> collect(values(d))
1-element Array{Dictionary{Symbol,Symbol},1}:
{:b │ :c}
I find it quite surprising that values(d)
prints the keys of d
and prints as if values(d)
has the same type as d
. The keys don't seem to be present if you collect
them, however. I'm not sure if this is a bug?
Edit: Once I filed this I had the thought that this might just be a consequence of the fact that a dictionary iterates its values. So I don't even need values
here.
I guess it's just as strange as pairs(Dict(:a => :b))
returning the dict itself!
It would be nice if some of the traits here were replaced with those in https://github.com/JuliaDiffEq/ArrayInterface.jl. For example, issettable
could be replace with can_setindex
.
I feel like LittleDict benifits from many of the same advantages as Dictionaries.jl has in general.
Common operation I do on LittleDict is to strip off the keys, and map over the values and then reattach the keys. Which is much cleaner if done in the Dictionaries.jl way.
LittleDict is basically just a pair of two vectors.
One for keys and one for values.
It searchs the first to find the key position (with isequal
).
Then it uses that to index into the vector of values
Right now, if you access an Indices with a wrong index, you get a helpful error message:
julia> Indices(1:3)[4]
ERROR: IndexError("Index 4 not found in indices {1, 2, 3}")
However, the error message is not so helpful if your datastructure is large:
julia> Indices(1:3000)[4000]
ERROR: IndexError("Index 4000 not found in indices {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96,
[ etc etc etc]
It's probably better to avoid interpolating these datastructures into the error strings.
Sometimes one wants to group
, then map
, then group
again, but after map
, we have a dictionary and can no longer group
. Using values
seems natural (we're choosing a new set of keys to group by, so forget everything but the values) but since values
takes dictionaries to dictionaries, it doesn't work. For example,
julia> using SplitApplyCombine, Dictionaries
julia> words = [ join(rand('a':'z', 5)) for _ = 1:100]
100-element Array{String,1}:
"jplzj"
"honfd"
"uqgsh"
"ktlwr"
"gxcce"
"teyon"
"lgrru"
"yffkc"
"udllu"
"uxyas"
⋮
"ivvan"
"uyifb"
"wlnlx"
"iycii"
"mxxeu"
"rljux"
"tawgf"
"kwaal"
"rqtcb"
"dmlum"
julia> g = group(w -> (first(w), last(w)), words)
91-element HashDictionary{Tuple{Char,Char},Array{String,1}}
('u', 't') │ ["uyrmt"]
('g', 'w') │ ["gfvvw"]
('z', 'i') │ ["zuhbi", "zlcdi"]
('g', 'h') │ ["gjsuh"]
('d', 'm') │ ["dmlum"]
('l', 'u') │ ["lgrru"]
('a', 'f') │ ["axbhf"]
('o', 'h') │ ["odyzh"]
('m', 'z') │ ["mhrwz"]
('y', 'c') │ ["yffkc"]
('j', 'k') │ ["jxbok"]
('k', 'e') │ ["kuzre"]
('p', 'v') │ ["pamxv"]
('d', 'f') │ ["doudf"]
('e', 'q') │ ["erqzq"]
('r', 'b') │ ["rqtcb"]
('l', 'm') │ ["lbkmm"]
('a', 'g') │ ["azlzg"]
('g', 'b') │ ["glqab"]
('e', 'p') │ ["edfbp"]
⋮ │ ⋮
julia> m = map(join, g)
91-element HashDictionary{Tuple{Char,Char},String}
('u', 't') │ "uyrmt"
('g', 'w') │ "gfvvw"
('z', 'i') │ "zuhbizlcdi"
('g', 'h') │ "gjsuh"
('d', 'm') │ "dmlum"
('l', 'u') │ "lgrru"
('a', 'f') │ "axbhf"
('o', 'h') │ "odyzh"
('m', 'z') │ "mhrwz"
('y', 'c') │ "yffkc"
('j', 'k') │ "jxbok"
('k', 'e') │ "kuzre"
('p', 'v') │ "pamxv"
('d', 'f') │ "doudf"
('e', 'q') │ "erqzq"
('r', 'b') │ "rqtcb"
('l', 'm') │ "lbkmm"
('a', 'g') │ "azlzg"
('g', 'b') │ "glqab"
('e', 'p') │ "edfbp"
⋮ │ ⋮
julia> g2 = group(first, m)
ERROR: type MappedDictionary has no field maps
Stacktrace:
[1] getproperty(::MappedDictionary{Tuple{Char,Char},Char,typeof(first),Tuple{HashDictionary{Tuple{Char,Char},String}}}, ::Symbol) at ./Base.jl:33
[2] istokenizable(::MappedDictionary{Tuple{Char,Char},Char,typeof(first),Tuple{HashDictionary{Tuple{Char,Char},String}}}) at /Users/eh540/.julia/packages/Dictionaries/cZc5T/src/MappedDictionary.jl:29
[3] sharetokens(::MappedDictionary{Tuple{Char,Char},Char,typeof(first),Tuple{HashDictionary{Tuple{Char,Char},String}}}, ::HashDictionary{Tuple{Char,Char},String}) at /Users/eh540/.julia/packages/Dictionaries/cZc5T/src/tokens.jl:225
[4] group(::MappedDictionary{Tuple{Char,Char},Char,typeof(first),Tuple{HashDictionary{Tuple{Char,Char},String}}}, ::HashDictionary{Tuple{Char,Char},String}) at /Users/eh540/.julia/packages/SplitApplyCombine/mA0eV/src/group.jl:85
[5] group(::Function, ::HashDictionary{Tuple{Char,Char},String}) at /Users/eh540/.julia/packages/SplitApplyCombine/mA0eV/src/group.jl:26
[6] top-level scope at REPL[22]:1
[7] eval(::Module, ::Any) at ./boot.jl:331
[8] eval_user_input(::Any, ::REPL.REPLBackend) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
[9] run_backend(::REPL.REPLBackend) at /Users/eh540/.julia/packages/Revise/WkyNB/src/Revise.jl:1023
[10] top-level scope at none:0
julia> g2 = group(first, values(m))
ERROR: type MappedDictionary has no field maps
Stacktrace:
[1] getproperty(::MappedDictionary{Tuple{Char,Char},Char,typeof(first),Tuple{HashDictionary{Tuple{Char,Char},String}}}, ::Symbol) at ./Base.jl:33
[2] istokenizable(::MappedDictionary{Tuple{Char,Char},Char,typeof(first),Tuple{HashDictionary{Tuple{Char,Char},String}}}) at /Users/eh540/.julia/packages/Dictionaries/cZc5T/src/MappedDictionary.jl:29
[3] sharetokens(::MappedDictionary{Tuple{Char,Char},Char,typeof(first),Tuple{HashDictionary{Tuple{Char,Char},String}}}, ::HashDictionary{Tuple{Char,Char},String}) at /Users/eh540/.julia/packages/Dictionaries/cZc5T/src/tokens.jl:225
[4] group(::MappedDictionary{Tuple{Char,Char},Char,typeof(first),Tuple{HashDictionary{Tuple{Char,Char},String}}}, ::HashDictionary{Tuple{Char,Char},String}) at /Users/eh540/.julia/packages/SplitApplyCombine/mA0eV/src/group.jl:85
[5] group(::Function, ::HashDictionary{Tuple{Char,Char},String}) at /Users/eh540/.julia/packages/SplitApplyCombine/mA0eV/src/group.jl:26
[6] top-level scope at REPL[23]:1
[7] eval(::Module, ::Any) at ./boot.jl:331
[8] eval_user_input(::Any, ::REPL.REPLBackend) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
[9] run_backend(::REPL.REPLBackend) at /Users/eh540/.julia/packages/Revise/WkyNB/src/Revise.jl:1023
[10] top-level scope at none:0
julia> g2 = group(first, collect(m))
25-element HashDictionary{Char,Array{String,1}}
'n' │ ["napjn"]
'w' │ ["wlnlx", "wfynuwwotuwvpfu", "wybpk", "wqjho", "wvsrq"]
'd' │ ["dmlum", "doudf", "dmfjl", "dsqow", "ddqbo"]
'e' │ ["erqzq", "edfbp", "edltw", "ethto"]
'o' │ ["odyzh", "oxtav", "oizai", "oawqd"]
'j' │ ["jxbok", "jplzj", "jpcfy"]
'y' │ ["yffkc", "yasme", "yohvs", "yunyj"]
's' │ ["stbwl", "siisdspzsd"]
'k' │ ["kuzre", "ktlwr", "kwaal"]
'r' │ ["rqtcb", "rdxmi", "rutih", "rztuxrljux", "rdrhz", "rubuc"]
't' │ ["teutbtblyb", "tffci", "tawgf", "teyontilen", "tyeqd", "tkjlr"]
'i' │ ["izkguicvau", "ivvan", "iatuo", "iycii"]
'q' │ ["qqciz"]
'h' │ ["hsyrn", "honfdhjjrd"]
'c' │ ["coqna", "ctslh", "crntk", "ccgyj", "czmvf", "cwtbg"]
'a' │ ["axbhf", "azlzg", "ayhwi"]
'p' │ ["pamxv", "perla", "pnage"]
'm' │ ["mhrwz", "mffrf", "mxxeu"]
'z' │ ["zuhbizlcdi", "zkump", "zfbfm", "ziieu"]
'g' │ ["gfvvw", "gjsuh", "glqab", "gxcce", "gqlsz"]
⋮ │ ⋮
I see collect(m)
works, so maybe that's the right way to do it. I just wanted to bring this up to see if there's a better way to do these kind of transformations.
I think this mostly comes up when you group
, then process the groups somehow so in some sense you now have single records again (like e.g. join
did above), and you want to forget the old grouped structure to continue processing. For me, I had data with both an id field and a time, and I wanted to group by both id and time first, map those groups to some struct which represents a collection of annotations at a particular time within a single id, and then regroup by id.
Just leaving this request here to raise some awareness of multidics and for a possible consideration.
Thx
While reviewing the interfaces here with @nickrobinson251, we found it curious that issettable
is only expected to be defined on an instance instead of as a "trait" that can be defined on the type. From the docs, it says that if mutability is supported, then issettable
should return true
, and setindex!
should be defined. It seems like the majority of implementations would be able to define issettable
statically on a type once, instead of on instances.
One pattern we've followed in the Tables.jl interface for these kind of type traits is to allow defining the trait function on either instances or types, with the encouragement to define on the type where possible. But users/callers of the trait, should always call on the instance, while having a generic fallback like issettable(x::T) where {T} = issettable(T)
. That allows the compiler to correctly statically infer traits where possible, and rely on runtime checking when necessary.
Just curious how intentional the current issettable
design was and if so, what the context is around the decision.
I found this behavior to be potentially confusing:
julia> d1 = dictionary(["b" => 3, "a" => 1])
2-element Dictionary{String,Int64}
"b" │ 3
"a" │ 1
julia> d2 = d1 .+ 4
2-element Dictionary{String,Int64}
"b" │ 7
"a" │ 5
julia> insert!(d2, "t", 9)
3-element Dictionary{String,Int64}
"b" │ 7
"a" │ 5
"t" │ 9
julia> d1
3-element Dictionary{String,Int64}
"b" │ 3
"a" │ 1
"t" │ #undef
The readme mentions that this is to take advantage of coiteration, but I wonder whether it's better to make this explicit. I think it's reasonable to expect this to behave like for collections in base, e.g. not to share the same memory, otherwise this could have unexpected consequence somewhere else in users' code.
julia> using Dictionaries, Random, BenchmarkTools
julia> ks = shuffle(1:20_000);
julia> vs = shuffle(ks);
julia> d1 = Dict(ks .=> vs);
julia> d2 = Dictionary(ks, vs);
julia> @btime getindices(d1, 1:20_000);
155.533 μs (4 allocations: 156.38 KiB)
julia> @btime getindices(d2, 1:20_000);
86.961 ms (4 allocations: 156.38 KiB)
julia> @btime getindex.(Ref(d2), 1:20_000);
86.091 ms (5 allocations: 156.39 KiB)
Currently, Base.Dict
allows
julia> convert(Dict{Int,Int}, Dict())
Dict{Int64, Int64}()
but this fails with Dictionary
:
julia> convert(Dictionary{Int,Int}, Dictionary())
ERROR: MethodError: Cannot `convert` an object of type
Dictionary{Any,Any} to an object of type
Dictionary{Int64,Int64}
Closest candidates are:
convert(::Type{T}, ::T) where T at essentials.jl:218
Dictionary{I, T}(::Any) where {I, T} at /home/belmant/.julia/packages/Dictionaries/khydh/src/Dictionary.jl:60
Dictionary{I, T}(::Any, ::Any) where {I, T} at /home/belmant/.julia/packages/Dictionaries/khydh/src/Dictionary.jl:108
Stacktrace:
[1] top-level scope
@ REPL[8]:1
Would it be possible to define conversion operators that would allow it? Is it a deliberate choice not to define them? It is useful for initializing structs that have dictionary fields, which currently require typing in Dictionary{K,V}()
with K
and V
potentially long types.
Since I believe keys and values are stored in vectors, we could delegate the conversion to conversion between vector types.
julia> d = dictionary(:a => 1, :b => 2)
2-element HashDictionary{Symbol,Int64}
:a │ 1
:b │ 2
julia> e = dictionary(:a => 3, :c => 4)
2-element HashDictionary{Symbol,Int64}
:a │ 3
:c │ 4
julia> d .+ e
2-element HashDictionary{Symbol,Int64}
:a │ 4
:b │ 6
Maybe it should do what mergewith
does? However, each kind of join also makes sense. I think slightly more manual API for specifying the join would be nice JuliaLang/julia#25904 (comment)
The behavior seems to depend in an odd way on the order in which keys are added in to the dictionary. For example, I would expect that the following two dictionaries would be identical:
julia> d1 = Dictionary(["a","b"],[1;2])
2-element Dictionary{String,Int64}
"a" │ 1
"b" │ 2
julia> d2 = Dictionary(["b","a"],[1;2])
2-element Dictionary{String,Int64}
"b" │ 1
"a" │ 2
However, d1 .+ d2 gives an error:
julia> d1 .+ d2
ERROR: IndexError("Indices do not match")
Stacktrace:
[1] _sharetokens at /Users/spielman/.julia/packages/Dictionaries/uSQZZ/src/broadcast.jl:68 [inlined]
[2] BroadcastedDictionary(::Function, ::Tuple{Dictionary{String,Int64},Dictionary{String,Int64}}) at /Users/spielman/.julia/packages/Dictionaries/uSQZZ/src/broadcast.jl:12
[3] broadcasted at /Users/spielman/.julia/packages/Dictionaries/uSQZZ/src/broadcast.jl:116 [inlined]
[4] broadcasted(::Function, ::Dictionary{String,Int64}, ::Dictionary{String,Int64}) at ./broadcast.jl:1238
[5] top-level scope at REPL[25]:1
[6] eval(::Module, ::Any) at ./boot.jl:331
[7] eval_user_input(::Any, ::REPL.REPLBackend) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
[8] run_backend(::REPL.REPLBackend) at /Users/spielman/.julia/packages/Revise/kqqw8/src/Revise.jl:1163
[9] top-level scope at none:0
Whereas, I do not get that error if I create the dictionary in the other order.
julia> d2 = Dictionary(["a","b"],[1;2])
2-element Dictionary{String,Int64}
"a" │ 1
"b" │ 2
julia> d1 .+ d2
2-element Dictionary{String,Int64}
"a" │ 2
"b" │ 4
I would like to enforce a particular partial order upon dict construction and for that I'd like to have an AbstractIndices
object backed by any collection that supports a notion of indexing (e.g. getindex
, setindex!
, firstindex
, nextind
and lastindex
) instead of being limited to Vector
s like Indices
.
My first thought was to just make the types of hashes
and values
in Indices
parametric. Looking at the implementation, I believe one difficulty would be that there needs to be some way of figuring out a sentinel value to denote an empty slot in slots
, but perhaps that can just default to something like firstindex(x) - 1
and be overloadable in any cases where that doesn't work? Another problem would probably the use of resize!
since that assumes new items always get inserted at the end, which wouldn't hold true in the case I'm thinking of.
Another option might be to separate all the hashtable code from the rest of the Indices
impkementation, so it's easy to reuse in custom subtypes of AbstractIndices
. Any thoughts? Is this something you thought about before?
The speed difference is quite real, but the examples in the documentation shortchange the Base
implementation of Dict
by a factor of about 2
. The base implementation is as efficient at iterating over (key,value) pairs as it is at iterating over keys by themselves, i.e. no hashing required, so the proper idiomatic way to deal with the function g
from the documentation example is
function g2(d1, d2)
out = sizehint!(Dict{Int64,Int64}(),length(d1)*3÷2+1)
for (k,v) in d1
out[k] = v + d2[k]
end
return out
end
Faster still, we can avoid hashing out
as well, by deep copying d1
using the 8-argument constructor of Dict{K,V}
, so out
and d1
initially share indices. Then we only need to use the getkey
version of hashing on d2
.
function combinematchingdictionaries(op,d1::Dict{K,V},d2::Dict{K,V}) where {K,V}
out = Dict{K,V}(
copy(d1.slots),copy(d1.keys),Array{V}(undef,length(d1.slots)),
d1.ndel,d1.count,d1.age,d1.idxfloor,d1.maxprobe
)
for (i,(k,v)) in enumerate(d1)
out.vals[i] = op(v,d2[k])
end
return out
end
@btime combinematchingdictionaries((+),dict1,dict2)
for a further significant savings in my benchmarking.
I think the overall goal of Dictionaries
is great.
Here's a dictionary where values contain collections of length not equal to 1. Broadcasting only
should result in an error, but I get the first element instead:
julia> d = Dictionary([:a, :b], [[1,2], [3,4]])
2-element Dictionary{Symbol, Vector{Int64}}
:a │ [1, 2]
:b │ [3, 4]
julia> only.(d)
2-element Dictionary{Symbol, Int64}
:a │ 1
:b │ 3
Digging a little through the dispatch chain leads to
julia> bc = Base.broadcasted(only, d);
julia> out = similar(bc, Int)
2-element Dictionary{Symbol, Int64}
:a │ 1
:b │ 7
julia> map!(identity, out, bc)
2-element Dictionary{Symbol, Int64}
:a │ 1
:b │ 3
which further leads to the @inbounds
assertion within the map!
implementation, and the @propagate_inbounds
within gettokenvalue(d::BroadcastedDictionary, t)
. This removes the @boundscheck
within only
when only
is inlined. So while we have the expected:
julia> Dictionaries.gettokenvalue(bc, (0,1))
ERROR: ArgumentError: Collection has multiple elements, must contain exactly 1 element
Stacktrace:
[1] only
@ ./iterators.jl:1327 [inlined]
When this is optimized within an inbounds context (within map!
), the bounds check disappears:
julia> foo(bc) = @inbounds Dictionaries.gettokenvalue(bc, (0,1))
julia> foo(bc)
1
I'm a happy user of your package, in our line of work we process many many independent files to make a summary histograms or extract parts of the data. In the end I simply do a reduce((x,y) -> append!.(x,y), results)
to collect the results together without manually tracking the order of things.
However, it's rather difficult if I want to put them into a table or anything because Dictionary
doesn't conform with Table interface (, expected), but also can't go back to Dict:
10-element Dictionaries.Dictionary{Symbol, Vector{Float32}}
:lep1_pt │ Float32[33638.438, 92686.78, 38112.855, 110358.19, 164663.92, 9687…
:lep1_eta │ Float32[0.45212966, -0.83190763, -1.0084606, 0.20597617, 1.0637895…
:lep1_phi │ Float32[-1.4313298, -1.067748, -1.569317, 0.66097116, 2.2409034, -…
:lep1_pid │ Float32[13.0, -11.0, 13.0, -11.0, 13.0, 13.0, -11.0, 11.0, -13.0, …
:lep2_pt │ Float32[26518.098, 51955.1, 33665.395, 67624.08, 78728.49, 75583.3…
:lep2_eta │ Float32[-2.23238, -0.5876625, -2.065182, -1.0818828, 2.0865402, 0.…
:lep2_phi │ Float32[-2.4220688, 2.4145992, 0.50862205, 2.3646975, 0.4607052, 2…
:lep2_pid │ Float32[-13.0, 11.0, 11.0, -13.0, 13.0, -13.0, -13.0, -13.0, 13.0,…
:MET │ Float32[64662.047, 39246.2, 90002.63, 121876.12, 41813.074, 125130…
:mass_4l │ Float32[154441.69, 289828.94, 148317.48, 248640.03, 339248.1, 2372…
julia> Dict(SIGhist)
Dict{Float32, Float32} with 10 entries:
13.0 => -11.0
-2.23238 => -0.587663
64662.0 => 39246.2
33638.4 => 92686.8
26518.1 => 51955.1
0.45213 => -0.831908
-13.0 => 11.0
1.54442f5 => 2.89829f5
-1.43133 => -1.06775
-2.42207 => 2.4146
What's the recommended workflow?
Hi (sorry for the dense opening of issues, I hope you appreciate my attempt to just help)
Just stumbled upon Base.haskey
, which is quite standard for Base.AbstractDict
and makes especially sense for AbstractDictionary
as well, because the iterator does not go over the keys, but the values, so plain in
is still not working.
Would be great to have it for completenes
Array:
Array{Type}(undef, size)
Dictionary:
Dictionary{Key,Value}(inds, undef)
what was the reason for flipping the argument order? OffsetArrays are the most obvious case as it fits into both domains.
Hi,
Use case: we create an index/dict over the content of a file. The file has a header with the number of records. Could be 1.000, 1mio or 100mio, whatever. Currently we are using index() to iteratively scan each record, determine the key and create a key->record entry in the dict. For large files with many records, that probably means that the Dictionary internal data structures need to grow on demand. In our use case, I know the size (number of keys) upfront. Can I somehow use index() with pre-allocate structures?
thanks a lot
Hi, I just registered a small package SplittablesBase.jl which is used from Transducers.jl. Essentially, you can support parallel reduce and everything that can be derived from it (e.g., map) by defining a single function halve(collection)
. I wonder if you are interested in defining it.
Hi!
I'm not sure if I overlooked this, but I can't find a method in the README that would allow me to remove all items in a Dictionary
.
This way gives me behaviour that seems a bit arbitrary:
julia> dict = Dictionary(["a", "b", "c"], [1, 2, 3])
3-element Dictionary{String,Int64}
"a" │ 1
"b" │ 2
"c" │ 3
julia> for k in keys(dict)
unset!(dict, k) # Or delete!
end
julia> dict
1-element Dictionary{String,Int64}
"b" │ 2
I was just reading along and was curious if it would make sense to reuse the position in the value vector (because indices are unique) if it corresponds to a deleted index.
Dictionaries.jl/src/Indices.jl
Line 391 in 6918361
insert!(dict, k, val)
with a dictionary means something very different from insert!
with a Vector
. The Vector API from Base is more or less sequence- or iterable-like: insert something at a given index of the sequence, and move the remaining values to make room, preserving iteration order.
I wondered whether this is an oversight, or whether there's a plan to deal with this in some way?
Using julia 1.6.1 with Dictionaries v0.3.9, I get the following error when executing part of the README.md
julia> using Dictionaries
julia> dict = Dictionary(["a", "b", "c"], [1, 2, 3])
3-element Dictionary{String, Int64}
"a" │ 1
"b" │ 2
"c" │ 3
julia> similar(dict, Vector{Int})
Error showing value of type Dictionary{String, Vector{Int64}}:
ERROR: UndefRefError: access to undefined reference
Stacktrace:
[1] getindex
@ ./array.jl:801 [inlined]
[2] gettokenvalue
@ ~/.julia/packages/Dictionaries/YpAxR/src/Dictionary.jl:288 [inlined]
[3] iterate
@ ~/.julia/packages/Dictionaries/YpAxR/src/AbstractDictionary.jl:52 [inlined]
[4] isempty
@ ./essentials.jl:767 [inlined]
[5] show(io::IOContext{Base.TTY}, #unused#::MIME{Symbol("text/plain")}, d::Dictionary{String, Vector{Int64}})
@ Dictionaries ~/.julia/packages/Dictionaries/YpAxR/src/show.jl:127
[6] (::REPL.var"#38#39"{REPL.REPLDisplay{REPL.LineEditREPL}, MIME{Symbol("text/plain")}, Base.RefValue{Any}})(io::Any)
@ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:220
[7] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
@ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:462
[8] display(d::REPL.REPLDisplay, mime::MIME{Symbol("text/plain")}, x::Any)
@ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:213
[9] display(d::REPL.REPLDisplay, x::Any)
@ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:225
[10] display(x::Any)
@ Base.Multimedia ./multimedia.jl:328
[11] #invokelatest#2
@ ./essentials.jl:708 [inlined]
[12] invokelatest
@ ./essentials.jl:706 [inlined]
[13] print_response(errio::IO, response::Any, show_value::Bool, have_color::Bool, specialdisplay::Union{Nothing, AbstractDisplay})
@ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:247
[14] (::REPL.var"#40#41"{REPL.LineEditREPL, Pair{Any, Bool}, Bool, Bool})(io::Any)
@ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:231
[15] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
@ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:462
[16] print_response(repl::REPL.AbstractREPL, response::Any, show_value::Bool, have_color::Bool)
@ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:229
[17] (::REPL.var"#do_respond#61"{Bool, Bool, REPL.var"#72#82"{REPL.LineEditREPL, REPL.REPLHistoryProvider}, REPL.LineEditREPL, REPL.LineEdit.Prompt})(s::REPL.LineEdit.MIState, buf::Any, ok::Bool)
@ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:798
[18] #invokelatest#2
@ ./essentials.jl:708 [inlined]
[19] invokelatest
@ ./essentials.jl:706 [inlined]
[20] run_interface(terminal::REPL.Terminals.TextTerminal, m::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
@ REPL.LineEdit /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/LineEdit.jl:2441
[21] run_frontend(repl::REPL.LineEditREPL, backend::REPL.REPLBackendRef)
@ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:1126
[22] (::REPL.var"#44#49"{REPL.LineEditREPL, REPL.REPLBackendRef})()
@ REPL ./task.jl:411
Here my full versioninfo:
julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, icelake-client)
Environment:
JULIA_EDITOR = code-insiders
JULIA_NUM_THREADS =
Hello, and thanks for developing Dictionaries!
I noticed in the source code for Indices that, when checking for the presence of a key K which hashes to hash H, there is no check if H is in hashes.
In Python's new dict implementation in 3.6, this optimization was mentioned by the contributer. It does incur one extra array index (which can even incur a cache miss), but if the keys are expensive to compare with isequal, potentially a lot of time can be saved. Also, when the key do not match, there is overwhelming probability the hashes will not match either, and the next hash can be compared instead - which may speed up the check even further.
Not sure if you have already tried implementing it, but it may be a low hanging fruit!
I ran into this issue a while ago trying to add/remove keys from a dictionary and having the memory build up despite the dictionary itself not growing in size. I was curious how the behavior might be different with Dictionaries.jl but I'm running into the same difference in behavior from Windows to Linux that I experienced here.
For the following code I have it saved in a file called dict_vs_Dictionaries_allocation_test.jl. I have it in an environment with BenchmarkTools and Dictionaries installed
using Pkg
Pkg.activate(@__DIR__)
using Dictionaries
using BenchmarkTools
using InteractiveUtils
versioninfo()
function test(use_dict)
use_dict ? println("Using Base dict") : println("Using Dictionaries.jl")
totmem = Int64(Sys.total_memory())
getfreemem() = Int64(Sys.free_memory())
# println()
# if isempty(ARGS)
# println("Testing in global space")
# else
# println("Testing inside a function")
# end
# println()
function create_test_dict(totmem)
testdict = Dict{Int64, Matrix{Float64}}()
#insert 100k elements
for i in 1:100000
push!(testdict, i => ones(Float64, 100, 100))
end
freemem = totmem - getfreemem()
#delete the first 50k elements
for i in 1:50000
delete!(testdict, i)
end
return freemem, testdict
end
function create_test_dictionary(totmem)
testdict = Dictionary{Int64, Matrix{Float64}}()
#insert 100k elements
for i in 1:100000
insert!(testdict, i, ones(Float64, 100, 100))
end
freemem = totmem - getfreemem()
#delete the first 50k elements
for i in 1:50000
delete!(testdict, i)
end
return freemem, testdict
end
initialfreemem = totmem - getfreemem()
mem2, testdict = if use_dict
@time create_test_dict(totmem)
else
@time create_test_dictionary(totmem)
end
#try to garbage collect parts of dictionary now that it is no longer needed
GC.gc()
post_gc_mem = totmem - getfreemem()
println("Memory added from dictionary is $((mem2 - initialfreemem)/(10^9)) gigabytes")
println("Memory freed from clearing half of dictionary is $((mem2 - post_gc_mem)/(10^9)) gigabytes")
println("Final memory usage as a percent of allocated is $(100*(post_gc_mem - initialfreemem)/(mem2 - initialfreemem))% (ideally would be 50%)")
#verify that testdict is missing the first 50k indicies
try
testdict[1]
catch
println("Could not access index 1 as expected")
end
println("Waiting 10 seconds to see if gc occurs")
1 + 1 #random code execution to see if it affects gc
sleep(10)
println("Final memory usage as a percent of allocated is $(100*((totmem - getfreemem()) - initialfreemem)/(mem2 - initialfreemem))% (ideally would be 50%)")
println("_____________________________________________________________________________")
end
#toggle to use dict or Dictionaries.jl
use_dict = isempty(ARGS) ? true : false
test(use_dict)
Project.toml
[deps]
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
Dictionaries = "85a47980-9c8c-11e8-2b9f-f7ca1fa99fb4"
Manifest.toml
# This file is machine-generated - editing it directly is not advised
[[BenchmarkTools]]
deps = ["JSON", "Logging", "Printf", "Statistics", "UUIDs"]
git-tree-sha1 = "c31ebabde28d102b602bada60ce8922c266d205b"
uuid = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
version = "1.1.1"
[[Dates]]
deps = ["Printf"]
uuid = "ade2ca70-3891-5945-98fb-dc099432e06a"
[[Dictionaries]]
deps = ["Indexing", "Random"]
git-tree-sha1 = "011d10589449681ad631c0f866a2ce72771648c6"
uuid = "85a47980-9c8c-11e8-2b9f-f7ca1fa99fb4"
version = "0.3.10"
[[Indexing]]
git-tree-sha1 = "ce1566720fd6b19ff3411404d4b977acd4814f9f"
uuid = "313cdc1a-70c2-5d6a-ae34-0150d3930a38"
version = "1.1.1"
[[JSON]]
deps = ["Dates", "Mmap", "Parsers", "Unicode"]
git-tree-sha1 = "81690084b6198a2e1da36fcfda16eeca9f9f24e4"
uuid = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
version = "0.21.1"
[[Libdl]]
uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb"
[[LinearAlgebra]]
deps = ["Libdl"]
uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
[[Logging]]
uuid = "56ddb016-857b-54e1-b83d-db4d58db5568"
[[Mmap]]
uuid = "a63ad114-7e13-5084-954f-fe012c677804"
[[Parsers]]
deps = ["Dates"]
git-tree-sha1 = "94bf17e83a0e4b20c8d77f6af8ffe8cc3b386c0a"
uuid = "69de0a69-1ddd-5017-9359-2bf0b02dc9f0"
version = "1.1.1"
[[Printf]]
deps = ["Unicode"]
uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7"
[[Random]]
deps = ["Serialization"]
uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
[[SHA]]
uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce"
[[Serialization]]
uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b"
[[SparseArrays]]
deps = ["LinearAlgebra", "Random"]
uuid = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
[[Statistics]]
deps = ["LinearAlgebra", "SparseArrays"]
uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
[[UUIDs]]
deps = ["Random", "SHA"]
uuid = "cf7118a7-6976-5b1a-9a39-7adc72f591a4"
[[Unicode]]
uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"
These are the results of running the script in both Windows and Linux (I used WSL but experienced the same problem in native Linux).
$ julia-1.6 dict_vs_Dictionaries_alloaction_test.jl true
Activating environment at `E:\Dropbox (Personal)\Misc. Programs\Julia Code Testing\DictionaryTesting\Project.toml`
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Using Dictionaries.jl
3.976637 seconds (519.21 k allocations: 7.478 GiB, 19.32% gc time, 2.74% compilation time)
Memory added from dictionary is 8.004358144 gigabytes
Memory freed from clearing half of dictionary is 3.968114688 gigabytes
Final memory usage as a percent of allocated is 50.42557296146892% (ideally would be 50%)
Could not access index 1 as expected
Waiting 10 seconds to see if gc occurs
Final memory usage as a percent of allocated is 50.33003460770683% (ideally would be 50%)
_____________________________________________________________________________
$ julia-1.6 dict_vs_Dictionaries_alloaction_test.jl
Activating environment at `E:\Dropbox (Personal)\Misc. Programs\Julia Code Testing\DictionaryTesting\Project.toml`
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Using Base dict
4.037212 seconds (544.58 k allocations: 7.478 GiB, 20.08% gc time, 2.76% compilation time)
Memory added from dictionary is 8.015466496 gigabytes
Memory freed from clearing half of dictionary is 3.960201216 gigabytes
Final memory usage as a percent of allocated is 50.593003938369904% (ideally would be 50%)
Could not access index 1 as expected
Waiting 10 seconds to see if gc occurs
Final memory usage as a percent of allocated is 50.54691065107535% (ideally would be 50%)
_____________________________________________________________________________
$ julia-1.6 dict_vs_Dictionaries_alloaction_test.jl
Activating environment at `/mnt/e/Dropbox (Personal)/Misc. Programs/Julia Code Testing/DictionaryTesting/Project.toml`
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Using Base dict
5.088407 seconds (547.23 k allocations: 7.478 GiB, 21.41% gc time, 2.21% compilation time)
Memory added from dictionary is 8.02164736 gigabytes
Memory freed from clearing half of dictionary is -0.001548288 gigabytes
Final memory usage as a percent of allocated is 100.01930137203139
Could not access index 1
Waiting 10 seconds to see if gc occurs
Final memory usage as a percent of allocated is 100.01930137203139
_____________________________________________________________________________
$ julia-1.6 dict_vs_Dictionaries_alloaction_test.jl true
Activating environment at `/mnt/e/Dropbox (Personal)/Misc. Programs/Julia Code Testing/DictionaryTesting/Project.toml`
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Using Dictionaries.jl
5.269116 seconds (519.20 k allocations: 7.478 GiB, 20.45% gc time, 2.21% compilation time)
Memory added from dictionary is 8.024260608 gigabytes
Memory freed from clearing half of dictionary is -0.001032192 gigabytes
Final memory usage as a percent of allocated is 100.01286339078982
Could not access index 1 as expected
Waiting 10 seconds to see if gc occurs
Final memory usage as a percent of allocated is 100.01929508618471 (ideally would be 50%)
_____________________________________________________________________________
In Linux the memory is not being freed and this occurs with removing the entire dictionary as well which was the previous discourse test case. Also I'm interested in hearing if for this type of use case there is a more efficient way to implement it with Dictionaries.jl
I noticed:
Simultaneously, we are pushing the performance of working with dictionaries to be closer to that of working with arrays.
I'm looking into replacing Dict with OrderedDict (already done successfully in Base) or other such variant.
My work should maybe be independent of yours, unless you know it might be helpful to incorporate it too. I'll look more closely laer at this package.
Minimal broken example:
julia> dict = HashDictionary(["a", "b", "c"], [1, 2, 3])
3-element HashDictionary{String,Int64}
"c" │ 3
"b" │ 2
"a" │ 1
julia> sizehint!(dict, 100)
3-element HashDictionary{String,Int64}
"c" │ 3
"b" │ #undef
"a" │ #undef
julia> insert!(dict, "d", 4)
4-element HashDictionary{String,Int64}
"c" │ 3
"d" │ #undef
"b" │ #undef
"a" │ #undef
I noticed this when experimenting with a few benchmarks against Base.Dict
for some of my use cases (mostly initializing an empty dict, and then doing a bunch of upserts).
HashDictionary
objects seem to allocate once per insert!
, while Base.Dict
seem to have some growth factor.
I honestly don't know much about dictionary implementations, but was curious how they would compare for such use cases. sizehint!
isn't actually that useful for me, because I rarely know in advance just how many elements the dict will have.
HashDictionary does work, but not Dictionary:
julia> d = Dictionaries.Dictionary([1], [:a])
1-element Dictionary{Int64,Symbol,Array{Int64,1},Array{Symbol,1}}
1 │ :a
julia> collect(d)
ERROR: MethodError: no method matching size(::Dictionary{Int64,Symbol,Array{Int64,1},Array{Symbol,1}})
For some reason, get(dict, k, default)
tries to convert default
to the dict value type. Clearly, this fails in the following typical case:
dict = Dictionary(Dict(:a => 1, :b => 2))
get(dict, :a, -1) # == 1, ok
get(dict, :x, -1) # ==-1, ok
get(dict, :x, nothing) # MethodError: Cannot `convert` an object of type Nothing to an object of type Int64
julia> using CSV, DataFrames, SplitApplyCombine, Dictionaries
julia> data = CSV.read("minimal.csv", DataFrame);
julia> samples = index(last, pairs(groupcount(data.Date)));
zsh: segmentation fault julia
This sometimes results in a StackOverflowError
instead:
ERROR: StackOverflowError:
Stacktrace:
[1] Array at ./boot.jl:406 [inlined]
[2] similar at /Users/yurivish/.julia/packages/Dictionaries/M6hc8/src/Dictionary.jl:339 [inlined]
[3] similar at /Users/yurivish/.julia/packages/Dictionaries/M6hc8/src/AbstractDictionary.jl:351 [inlined]
[4] copy at /Users/yurivish/.julia/packages/Dictionaries/M6hc8/src/AbstractDictionary.jl:480 [inlined]
[5] __dictionary(::typeof(last), ::typeof(identity), ::Dictionary{Int64,Pair{Union{Missing, String},Int64}}, ::Dictionaries.PairDictionary{Union{Missing, String},Int64,Dictionary{Union{Missing, String},Int64}}, ::Int64) at /Users/yurivish/.julia/packages/Dictionaries/M6hc8/src/Dictionary.jl:225
[6] __dictionary(::typeof(last), ::typeof(identity), ::Dictionary{Int64,Pair{Union{Missing, String},Int64}}, ::Dictionaries.PairDictionary{Union{Missing, String},Int64,Dictionary{Union{Missing, String},Int64}}, ::Int64) at /Users/yurivish/.julia/packages/Dictionaries/M6hc8/src/Dictionary.jl:230 (repeats 29087 times)
[7] _dictionary(::typeof(last), ::typeof(identity), ::Type{Dictionary}, ::Dictionaries.PairDictionary{Union{Missing, String},Int64,Dictionary{Union{Missing, String},Int64}}) at /Users/yurivish/.julia/packages/Dictionaries/M6hc8/src/Dictionary.jl:204
[8] index(::Function, ::Dictionaries.PairDictionary{Union{Missing, String},Int64,Dictionary{Union{Missing, String},Int64}}) at /Users/yurivish/.julia/packages/Dictionaries/M6hc8/src/Dictionary.jl:265
[9] top-level scope at REPL[20]:1
The file minimal.csv is here.
julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin18.7.0)
CPU: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml
to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
If you'd like for me to do this for you, comment TagBot fix
on this issue.
I'll open a PR within a few hours, please be patient!
Hi,
I extracted the below snippet into a separate file, and insert it from REPL to validate/demonstrate the issue. I'm on a Win10 laptop running Julia 1.5.3, the issue is in my env 100% repeatable.
using Dictionaries
function unique_index_from_in_memory(len::Int)
sizehint = floor(Int, len * 1.2)
d = Dictionary{Int,Int}(; sizehint=sizehint)
count = 1
@show sizehint, len
for value in 1:len # values(this, field)
println(Base.stderr, "$count: $value")
#insert!(d, value, count)
set!(d, value, count)
count += 1
end
return d
end
function test_fn()
unique_index_from_in_memory(76525)
end
test_fn()
What is the issue: The app blocks/stops progressing forever at a certain point.
My output is as follows. I stopped the app after 5 mins, by hitting ctrl-c several times (hitting it ones is not enough). I don't get any stacktrace.
The app always stops at the same position: 153
It depends on the value of sizehint, where it stops.
Without any sizehint, it runs without issues.
The issue is independent from set!() and insert!()
148: 148
149: 149
150: 150
151: 151
152: 152
153: 153
WARNING: Force throwing a SIGINT
Any idea why that might happen?
UPDATE: properly formatted the code
Awesome package!
I was trying out how sharing of the indices works, and came upon the following weird example:
julia> d1 = Dictionary([:a, :b], 1:2)
2-element Dictionary{Symbol, Int64}
:a │ 1
:b │ 2
julia> d2 = d1 .+ 10
2-element Dictionary{Symbol, Int64}
:a │ 11
:b │ 12
julia> set!(d2, :c, 20)
3-element Dictionary{Symbol, Int64}
:a │ 11
:b │ 12
:c │ 20
julia> d1
3-element Dictionary{Symbol, Int64}
:a │ 1
:b │ 2
:c │ #undef
julia> d1[:c]
0
julia> set!(d1, :c, 42)
3-element Dictionary{Symbol, Int64}
:a │ 1
:b │ 2
:c │ #undef
julia> d1
3-element Dictionary{Symbol, Int64}
:a │ 1
:b │ 2
:c │ #undef
julia> d1[:c]
42
What's going on there? Is it just printing?
julia> dict = Dictionary(["a", "b"], [rand(10), rand(10)])
2-element Dictionary{String, Vector{Float64}}
"a" │ [0.6754523471882936, 0.9575574532471981, 0.6527535430527817, 0.898230973122484, 0.886…
"b" │ [0.11071336894931672, 0.04832475875008868, 0.954036463626111, 0.6666379467912544, 0.5…
julia> d2 = Dictionary(["a", "b"], [rand(10), rand(10)])
2-element Dictionary{String, Vector{Float64}}
"a" │ [0.15383191039124666, 0.985055480434822, 0.2125298315455063, 0.08767770837077182, 0.2…
"b" │ [0.26587711060898767, 0.3254898431836074, 0.9454838456008, 0.1076792806156961, 0.4809…
julia> mergewith!(dict,d2)
ERROR: MethodError: no method matching mergewith!(::Dictionary{String, Vector{Float64}}, ::Dictionary{String, Vector{Float64}})
julia> mergewith(dict,d2)
ERROR: MethodError: no method matching mergewith(::Dictionary{String, Vector{Float64}}, ::Dictionary{String, Vector{Float64}})
Hi,
I just updated Dictionaries and a couple of my jldoctest failed right away. This itself is annoying, and inspecting it further showed that the current printing of Dictionaries is inconsistent and will probably break again in the future...
julia> using Dictionaries
julia> [dictionary([:a => 1, :b => 2])] # new in Dictionaries v0.3.10
1-element Vector{Dictionary{Symbol, Int64}}:
{:a = 1, :b = 2}
julia> dictionary([:a => 1, :b => 2]) # old style, still present in Dictionaries v0.3.10
2-element Dictionary{Symbol, Int64}
:a │ 1
:b │ 2
EDIT: It is about using =
or |
for separating key and value.
Hi Andy, thanks for the great package. We're currently using Dictionaries.jl
in our dictionary-of-keys data structure in our block sparse tensor type inside ITensors.jl since I found that it is much faster than Base.Dict
for iteration. I also like that the interface is closer to the Array
interface, once I started using this package it's tough going back to using Base.Dict
without getting totally confused by the interface for iteration and other operations.
A small issue I came across is the following:
julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) E-2176M CPU @ 2.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
JULIA_EDITOR = vim
(@v1.7) pkg> st Dictionaries
Status `~/.julia/environments/v1.7/Project.toml`
[85a47980] Dictionaries v0.3.15
julia> d = dictionary(["A" => 1, :B => 2])
ERROR: MethodError: Cannot `convert` an object of type Symbol to an object of type String
Closest candidates are:
convert(::Type{String}, ::String) at ~/software/julia-1.7.0/share/julia/base/essentials.jl:223
convert(::Type{T}, ::T) where T<:AbstractString at ~/software/julia-1.7.0/share/julia/base/strings/basic.jl:231
convert(::Type{T}, ::AbstractString) where T<:AbstractString at ~/software/julia-1.7.0/share/julia/base/strings/basic.jl:232
...
Stacktrace:
[1] gettoken!(d::Dictionary{String, Int64}, i::Symbol)
@ Dictionaries ~/.julia/packages/Dictionaries/mxSnY/src/insertion.jl:46
[2] __dictionary(key::typeof(first), value::typeof(last), dict::Dictionary{String, Int64}, iter::Vector{Pair{Any, Int64}}, s::Int64)
@ Dictionaries ~/.julia/packages/Dictionaries/mxSnY/src/Dictionary.jl:225
[3] _dictionary(key::typeof(first), value::typeof(last), #unused#::Type{Dictionary}, iter::Vector{Pair{Any, Int64}})
@ Dictionaries ~/.julia/packages/Dictionaries/mxSnY/src/Dictionary.jl:210
[4] dictionary(iter::Vector{Pair{Any, Int64}})
@ Dictionaries ~/.julia/packages/Dictionaries/mxSnY/src/Dictionary.jl:193
[5] top-level scope
@ REPL[10]:1
I was hoping this would promote the keys to a common data type (in this case Any
). Instead it looks like it is basing the key type on the first key it sees.
Writing JuliaLang/julia#34744 (comment), I started wonder if ==
on your set types (AbstractIndices
) would be defined in terms of strong_equal(a, b) = isequal(a, b) && a == b
as they are the mapping to itself? Also, wouldn't this solve the O(N^2) problem for defining ==
you mentioned?
@JuliaRegistrator register master
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.