andyferris / dictionaries.jl Goto Github PK

View Code? Open in Web Editor NEW

275.0 275.0 28.0 387 KB

An alternative interface for dictionaries in Julia, for improved productivity and performance

License: Other

Julia 100.00%

dictionaries.jl's People

Contributors

Stargazers

Watchers

dictionaries.jl's Issues

support `dictionary(pair1, pair2, pair3, ...)`

just getting to know Dictionaries the very first thing which feels uncomfortable when coming from Base.Dict is that there is no support for something like Dict(:a => 1, :b => 2).

If I understand the interfaces correctly, this would be rather easy to add to the dictionary function, because so far it only supports a single argument call with something iterable.

Is there anything preventing the support of dictionary(:a => 1, :b => 2, ...)?

MethodError for `dictionary`

Using v0.2.1 and julia v1.3.1, I've run into a method error when trying to construct a dictionary from an array I'm pretty sure should work. Here's a reduced example:

julia> using Dictionaries

julia> a = ["a" => [1,2,3], "b" => ['a', 'b', 'c']]
2-element Array{Pair{String,Array{T,1} where T},1}:
 "a" => [1, 2, 3]      
 "b" => ['a', 'b', 'c']

julia> b = convert(Array{Pair{String,_A} where _A,1}, a)
2-element Array{Pair{String,_A} where _A,1}:
 Pair{String,Array{T,1} where T}("a", [1, 2, 3])      
 Pair{String,Array{T,1} where T}("b", ['a', 'b', 'c'])

julia> dictionary(a) # Works fine
2-element HashDictionary{String,Array{T,1} where T}
 "b" │ ['a', 'b', 'c']
 "a" │ [1, 2, 3]

julia> dictionary(b)
ERROR: MethodError: no method matching _dictionary(::Type{Pair{String,_A} where _A}, ::Array{Pair{String,_A} where _A,1})
Closest candidates are:
  _dictionary(::Type{Pair{I,T}}, ::Any) where {I, T} at /Users/isaac/.julia/packages/Dictionaries/cZc5T/src/HashDictionary.jl:117
Stacktrace:
 [1] dictionary(::Array{Pair{String,_A} where _A,1}) at /Users/isaac/.julia/packages/Dictionaries/cZc5T/src/HashDictionary.jl:111
 [2] top-level scope at REPL[9]:1

julia> typeof(a)
Array{Pair{String,Array{T,1} where T},1}

julia> typeof(b)
Array{Pair{String,_A} where _A,1}

Ah, I think I see what's happening now:

julia> Tuple{eltype(b), typeof(b)} <: Tuple{Pair{I, T} where {I, T}, Any}
true

julia> Tuple{Type{eltype(b)}, typeof(b)} <: Tuple{Type{Pair{I, T} where {I, T}}, Any}
false

But I can't seem to fix it without making inference much worse. Hm.

dictionary(Dict(b)) does the trick as a workaround.

Faster intersection of Indices

I think that the intersection of Indices can be made trivially faster by ordering of arguments.

 a = distinct(rand(Int, 10000))
 b = distinct(rand(Int, 100))
julia> @btime intersect(a, b)
  134.240 μs (8 allocations: 284.55 KiB)
0-element Indices{Int64}

julia> @btime intersect(b, a)
  1.238 μs (5 allocations: 3.94 KiB)
0-element Indices{Int64}

julia> @btime length(a) > length(b) ? intersect(b, a) : intersect(a, b)
  1.283 μs (6 allocations: 3.95 KiB)
0-element Indices{Int64}

Would it make sense?

Register Dictionaries.jl v0.1.0

It is time.

@JuliaRegistrator register

Indexing with tuples as keys

With dictionaries, I'm used to being able to write stuff like this:

d = Dict((i,j) => j^i for i in 1:10, j in ["a","b"])
d[1,"a"] # returns "a"

This isn't possible with this package:

d = dictionary((i,j) => j^i for i in 1:10, j in ["a","b"])
d[1,"a"]
ERROR: MethodError: no method matching getindex(::HashDictionary{Tuple{Int64,String},String}, ::Int64, ::String)
d[(1,"a")] # This is fine

Maybe this isn't possible or not what this package is designed to do, but I thought I'd bring it up.

term "idempotent" might not apply

It seems like "f is idempotent" is being used to mean f(..., x) == x, but I understand it to mean (at least when describing a function) that f(f(x)) = f(x). I see that there is perhaps also a notion of "idempotent element of a set", which I am not familiar with.

Unfortunately, I do not have a term to suggest for this. "f is the identity function after partial application" doesn't roll off the tongue.

Sampling

Any plan to implement samplings from Indices and Dictionary?

Typocalypse-free mapping

Do you have a plan to remove all return_type from the API-visible part (including empty input)? Could you share your plan if you have one?

Also, I don't think it's possible to specify the eltype for Iterators.map(f, d::AbstractDictionary). I'd suggest using EltypeUnknown.

Enabling dynamic and fixed indices.

This came up on discourse. It's pretty important to have indices that are mutable for a lot of things methods to work (append!, push!, etc.), but there are times it's worth improving performance and losing mutating functions. I mentioned my work on StaticRanges.jl may be helpful. It provides mutable ranges as well with protective setproperty! methods so that users don't accidently make bizarre changes to a mutable range.

This could be as simple as:

const StaticIndicesRange{T,L} = Indices{T,OneToSRange{T,L}}
const DynamicIndicesRange{T} = Indices{T,OneToMRange{T}}
const FixedIndicesRange{T} = Indices{T,OneTo{T}}

If this is something you're interested in I'd be interested in further discussing how to implement it.

Constructing a Dictionary from ranges

Hello! I'm playing with this cool package on Christmas and happened upon an error.

The docstring for Dictionary says that it is possible to construct a Dictionary from two arbitrary Julia iterables:

Dictionary(indices, values)

  Construct a Dictionary <: AbstractDictionary from two arbitrary Julia iterables, one specifying the indices and one specifying the values. Lookup uses naive
  iteration.

It looks like right now you can indeed construct such a dictionary where the keys and/or values are given as a UnitRange, but then there's an error upon show:

julia> Dictionary([1,2,3], [4,5,6])
3-element Dictionary{Int64,Int64,Array{Int64,1},Array{Int64,1}}
 1 │ 4
 2 │ 5
 3 │ 6

julia> Dictionary(1:3, 4:6); # You can construct one, just not show it.

julia> Dictionary(1:3, 4:6)
3-element Dictionary{Int64,Int64,UnitRange{Int64},UnitRange{Int64}}Error showing value of type Dictionary{Int64,Int64,UnitRange{Int64},UnitRange{Int64}}:
ERROR: Every settable AbstractDictionary type must define a method for `isassigned`: Dictionary{Int64,Int64,UnitRange{Int64},UnitRange{Int64}}
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] isassigned(::Dictionary{Int64,Int64,UnitRange{Int64},UnitRange{Int64}}, ::Int64) at /Users/yurivish/.julia/packages/Dictionaries/mjUAn/src/AbstractDictionary.jl:260
 [3] show(::IOContext{REPL.Terminals.TTYTerminal}, ::MIME{Symbol("text/plain")}, ::Dictionary{Int64,Int64,UnitRange{Int64},UnitRange{Int64}}) at /Users/yurivish/.julia/packages/Dictionaries/mjUAn/src/show.jl:76
 [4] display(::REPL.REPLDisplay, ::MIME{Symbol("text/plain")}, ::Any) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/REPL/src/REPL.jl:132
 [5] display(::REPL.REPLDisplay, ::Any) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/REPL/src/REPL.jl:136
 [6] display(::Any) at ./multimedia.jl:323
 [7] #invokelatest#1 at ./essentials.jl:709 [inlined]
 [8] invokelatest at ./essentials.jl:708 [inlined]
 [9] print_response(::IO, ::Any, ::Bool, ::Bool, ::Any) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/REPL/src/REPL.jl:156
 [10] print_response(::REPL.AbstractREPL, ::Any, ::Bool, ::Bool) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/REPL/src/REPL.jl:141
 [11] (::REPL.var"#do_respond#38"{Bool,REPL.var"#48#57"{REPL.LineEditREPL,REPL.REPLHistoryProvider},REPL.LineEditREPL,REPL.LineEdit.Prompt})(::Any, ::Any, ::Any) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/REPL/src/REPL.jl:719
 [12] #invokelatest#1 at ./essentials.jl:709 [inlined]
 [13] invokelatest at ./essentials.jl:708 [inlined]
 [14] run_interface(::REPL.Terminals.TextTerminal, ::REPL.LineEdit.ModalInterface, ::REPL.LineEdit.MIState) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/REPL/src/LineEdit.jl:2306
 [15] run_frontend(::REPL.LineEditREPL, ::REPL.REPLBackendRef) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/REPL/src/REPL.jl:1045
 [16] run_repl(::REPL.AbstractREPL, ::Any) at /Users/sabae/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/REPL/src/REPL.jl:201
 [17] (::Base.var"#770#772"{Bool,Bool,Bool,Bool})(::Module) at ./client.jl:382
 [18] #invokelatest#1 at ./essentials.jl:709 [inlined]
 [19] invokelatest at ./essentials.jl:708 [inlined]
 [20] run_main_repl(::Bool, ::Bool, ::Bool, ::Bool, ::Bool) at ./client.jl:366
 [21] exec_options(::Base.JLOptions) at ./client.jl:304
 [22] _start() at ./client.jl:460

How to construct an empty AbstractDictionary given the Dictionary Type?

Hi, sorry if this question is somehow inbetween an discourse question and an issue/feature request.

Is it possible to create a an empty AbstractDictionary given the concrete type of the system?
I am asking because it seems within AbstractDictionaries/AbstractDict it is especially common to have totally different kind of constructors which hence do not provide a common interface.

I see that I can construct empty AbstractDictionary, given an instance, but I wasn't able to find one which takes the type instead.

`values` printing is strange

julia> using Dictionaries

julia> d = Dictionary([:a], [Dictionary([:b], [:c])])
1-element Dictionary{Symbol,Dictionary{Symbol,Symbol}}
 :a │ {:b │ :c}

julia> values(d)
1-element Dictionary{Symbol,Dictionary{Symbol,Symbol}}
 :a │ {:b │ :c}

julia> collect(values(d))
1-element Array{Dictionary{Symbol,Symbol},1}:
 {:b │ :c}

I find it quite surprising that values(d) prints the keys of d and prints as if values(d) has the same type as d. The keys don't seem to be present if you collect them, however. I'm not sure if this is a bug?

Edit: Once I filed this I had the thought that this might just be a consequence of the fact that a dictionary iterates its values. So I don't even need values here.

I guess it's just as strange as pairs(Dict(:a => :b)) returning the dict itself!

Merge traits with ArrayInterface.

It would be nice if some of the traits here were replaced with those in https://github.com/JuliaDiffEq/ArrayInterface.jl. For example, issettable could be replace with can_setindex.

Is there a equivelent of OrderedCollections.LittleDict?

I feel like LittleDict benifits from many of the same advantages as Dictionaries.jl has in general.
Common operation I do on LittleDict is to strip off the keys, and map over the values and then reattach the keys. Which is much cleaner if done in the Dictionaries.jl way.

LittleDict is basically just a pair of two vectors.
One for keys and one for values.
It searchs the first to find the key position (with isequal).
Then it uses that to index into the vector of values

Do not print arbitrary-length datastructures in error messages

Right now, if you access an Indices with a wrong index, you get a helpful error message:

julia> Indices(1:3)[4]
ERROR: IndexError("Index 4 not found in indices {1, 2, 3}")

However, the error message is not so helpful if your datastructure is large:

julia> Indices(1:3000)[4000]
ERROR: IndexError("Index 4000 not found in indices {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96,

[ etc etc etc]

It's probably better to avoid interpolating these datastructures into the error strings.

Interaction with `values` and `group` can be awkward

Sometimes one wants to group, then map, then group again, but after map, we have a dictionary and can no longer group. Using values seems natural (we're choosing a new set of keys to group by, so forget everything but the values) but since values takes dictionaries to dictionaries, it doesn't work. For example,

julia> using SplitApplyCombine, Dictionaries

julia> words = [ join(rand('a':'z', 5)) for _ = 1:100]
100-element Array{String,1}:
 "jplzj"
 "honfd"
 "uqgsh"
 "ktlwr"
 "gxcce"
 "teyon"
 "lgrru"
 "yffkc"
 "udllu"
 "uxyas"
 ⋮
 "ivvan"
 "uyifb"
 "wlnlx"
 "iycii"
 "mxxeu"
 "rljux"
 "tawgf"
 "kwaal"
 "rqtcb"
 "dmlum"

julia> g = group(w -> (first(w), last(w)), words)
91-element HashDictionary{Tuple{Char,Char},Array{String,1}}
 ('u', 't') │ ["uyrmt"]
 ('g', 'w') │ ["gfvvw"]
 ('z', 'i') │ ["zuhbi", "zlcdi"]
 ('g', 'h') │ ["gjsuh"]
 ('d', 'm') │ ["dmlum"]
 ('l', 'u') │ ["lgrru"]
 ('a', 'f') │ ["axbhf"]
 ('o', 'h') │ ["odyzh"]
 ('m', 'z') │ ["mhrwz"]
 ('y', 'c') │ ["yffkc"]
 ('j', 'k') │ ["jxbok"]
 ('k', 'e') │ ["kuzre"]
 ('p', 'v') │ ["pamxv"]
 ('d', 'f') │ ["doudf"]
 ('e', 'q') │ ["erqzq"]
 ('r', 'b') │ ["rqtcb"]
 ('l', 'm') │ ["lbkmm"]
 ('a', 'g') │ ["azlzg"]
 ('g', 'b') │ ["glqab"]
 ('e', 'p') │ ["edfbp"]
          ⋮ │ ⋮

julia> m = map(join, g)
91-element HashDictionary{Tuple{Char,Char},String}
 ('u', 't') │ "uyrmt"
 ('g', 'w') │ "gfvvw"
 ('z', 'i') │ "zuhbizlcdi"
 ('g', 'h') │ "gjsuh"
 ('d', 'm') │ "dmlum"
 ('l', 'u') │ "lgrru"
 ('a', 'f') │ "axbhf"
 ('o', 'h') │ "odyzh"
 ('m', 'z') │ "mhrwz"
 ('y', 'c') │ "yffkc"
 ('j', 'k') │ "jxbok"
 ('k', 'e') │ "kuzre"
 ('p', 'v') │ "pamxv"
 ('d', 'f') │ "doudf"
 ('e', 'q') │ "erqzq"
 ('r', 'b') │ "rqtcb"
 ('l', 'm') │ "lbkmm"
 ('a', 'g') │ "azlzg"
 ('g', 'b') │ "glqab"
 ('e', 'p') │ "edfbp"
          ⋮ │ ⋮

julia> g2 = group(first, m)
ERROR: type MappedDictionary has no field maps
Stacktrace:
 [1] getproperty(::MappedDictionary{Tuple{Char,Char},Char,typeof(first),Tuple{HashDictionary{Tuple{Char,Char},String}}}, ::Symbol) at ./Base.jl:33
 [2] istokenizable(::MappedDictionary{Tuple{Char,Char},Char,typeof(first),Tuple{HashDictionary{Tuple{Char,Char},String}}}) at /Users/eh540/.julia/packages/Dictionaries/cZc5T/src/MappedDictionary.jl:29
 [3] sharetokens(::MappedDictionary{Tuple{Char,Char},Char,typeof(first),Tuple{HashDictionary{Tuple{Char,Char},String}}}, ::HashDictionary{Tuple{Char,Char},String}) at /Users/eh540/.julia/packages/Dictionaries/cZc5T/src/tokens.jl:225
 [4] group(::MappedDictionary{Tuple{Char,Char},Char,typeof(first),Tuple{HashDictionary{Tuple{Char,Char},String}}}, ::HashDictionary{Tuple{Char,Char},String}) at /Users/eh540/.julia/packages/SplitApplyCombine/mA0eV/src/group.jl:85
 [5] group(::Function, ::HashDictionary{Tuple{Char,Char},String}) at /Users/eh540/.julia/packages/SplitApplyCombine/mA0eV/src/group.jl:26
 [6] top-level scope at REPL[22]:1
 [7] eval(::Module, ::Any) at ./boot.jl:331
 [8] eval_user_input(::Any, ::REPL.REPLBackend) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
 [9] run_backend(::REPL.REPLBackend) at /Users/eh540/.julia/packages/Revise/WkyNB/src/Revise.jl:1023
 [10] top-level scope at none:0

julia> g2 = group(first, values(m))
ERROR: type MappedDictionary has no field maps
Stacktrace:
 [1] getproperty(::MappedDictionary{Tuple{Char,Char},Char,typeof(first),Tuple{HashDictionary{Tuple{Char,Char},String}}}, ::Symbol) at ./Base.jl:33
 [2] istokenizable(::MappedDictionary{Tuple{Char,Char},Char,typeof(first),Tuple{HashDictionary{Tuple{Char,Char},String}}}) at /Users/eh540/.julia/packages/Dictionaries/cZc5T/src/MappedDictionary.jl:29
 [3] sharetokens(::MappedDictionary{Tuple{Char,Char},Char,typeof(first),Tuple{HashDictionary{Tuple{Char,Char},String}}}, ::HashDictionary{Tuple{Char,Char},String}) at /Users/eh540/.julia/packages/Dictionaries/cZc5T/src/tokens.jl:225
 [4] group(::MappedDictionary{Tuple{Char,Char},Char,typeof(first),Tuple{HashDictionary{Tuple{Char,Char},String}}}, ::HashDictionary{Tuple{Char,Char},String}) at /Users/eh540/.julia/packages/SplitApplyCombine/mA0eV/src/group.jl:85
 [5] group(::Function, ::HashDictionary{Tuple{Char,Char},String}) at /Users/eh540/.julia/packages/SplitApplyCombine/mA0eV/src/group.jl:26
 [6] top-level scope at REPL[23]:1
 [7] eval(::Module, ::Any) at ./boot.jl:331
 [8] eval_user_input(::Any, ::REPL.REPLBackend) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
 [9] run_backend(::REPL.REPLBackend) at /Users/eh540/.julia/packages/Revise/WkyNB/src/Revise.jl:1023
 [10] top-level scope at none:0

julia> g2 = group(first, collect(m))
25-element HashDictionary{Char,Array{String,1}}
 'n' │ ["napjn"]
 'w' │ ["wlnlx", "wfynuwwotuwvpfu", "wybpk", "wqjho", "wvsrq"]
 'd' │ ["dmlum", "doudf", "dmfjl", "dsqow", "ddqbo"]
 'e' │ ["erqzq", "edfbp", "edltw", "ethto"]
 'o' │ ["odyzh", "oxtav", "oizai", "oawqd"]
 'j' │ ["jxbok", "jplzj", "jpcfy"]
 'y' │ ["yffkc", "yasme", "yohvs", "yunyj"]
 's' │ ["stbwl", "siisdspzsd"]
 'k' │ ["kuzre", "ktlwr", "kwaal"]
 'r' │ ["rqtcb", "rdxmi", "rutih", "rztuxrljux", "rdrhz", "rubuc"]
 't' │ ["teutbtblyb", "tffci", "tawgf", "teyontilen", "tyeqd", "tkjlr"]
 'i' │ ["izkguicvau", "ivvan", "iatuo", "iycii"]
 'q' │ ["qqciz"]
 'h' │ ["hsyrn", "honfdhjjrd"]
 'c' │ ["coqna", "ctslh", "crntk", "ccgyj", "czmvf", "cwtbg"]
 'a' │ ["axbhf", "azlzg", "ayhwi"]
 'p' │ ["pamxv", "perla", "pnage"]
 'm' │ ["mhrwz", "mffrf", "mxxeu"]
 'z' │ ["zuhbizlcdi", "zkump", "zfbfm", "ziieu"]
 'g' │ ["gfvvw", "gjsuh", "glqab", "gxcce", "gqlsz"]
   ⋮ │ ⋮

I see collect(m) works, so maybe that's the right way to do it. I just wanted to bring this up to see if there's a better way to do these kind of transformations.

I think this mostly comes up when you group, then process the groups somehow so in some sense you now have single records again (like e.g. join did above), and you want to forget the old grouped structure to continue processing. For me, I had data with both an id field and a time, and I wanted to group by both id and time first, map those groups to some struct which represents a collection of annotations at a particular time within a single id, and then regroup by id.

[FR] multidict implementation

Just leaving this request here to raise some awareness of multidics and for a possible consideration.
Thx

Should `issettable` be allowed to be defined on a type?

While reviewing the interfaces here with @nickrobinson251, we found it curious that issettable is only expected to be defined on an instance instead of as a "trait" that can be defined on the type. From the docs, it says that if mutability is supported, then issettable should return true, and setindex! should be defined. It seems like the majority of implementations would be able to define issettable statically on a type once, instead of on instances.

One pattern we've followed in the Tables.jl interface for these kind of type traits is to allow defining the trait function on either instances or types, with the encouragement to define on the type where possible. But users/callers of the trait, should always call on the instance, while having a generic fallback like issettable(x::T) where {T} = issettable(T). That allows the compiler to correctly statically infer traits where possible, and rely on runtime checking when necessary.

Just curious how intentional the current issettable design was and if so, what the context is around the decision.

Should broadcasting copy indices?

I found this behavior to be potentially confusing:

julia> d1 = dictionary(["b" => 3, "a" => 1])
2-element Dictionary{String,Int64}
 "b" │ 3
 "a" │ 1

julia> d2 = d1 .+ 4
2-element Dictionary{String,Int64}
 "b" │ 7
 "a" │ 5

julia> insert!(d2, "t", 9)
3-element Dictionary{String,Int64}
 "b" │ 7
 "a" │ 5
 "t" │ 9

julia> d1
3-element Dictionary{String,Int64}
 "b" │ 3
 "a" │ 1
 "t" │ #undef

The readme mentions that this is to take advantage of coiteration, but I wonder whether it's better to make this explicit. I think it's reasonable to expect this to behave like for collections in base, e.g. not to share the same memory, otherwise this could have unexpected consequence somewhere else in users' code.

Performance of getindices significantly slower on Dictionary than Dict

julia> using Dictionaries, Random, BenchmarkTools

julia> ks = shuffle(1:20_000);

julia> vs = shuffle(ks);

julia> d1 = Dict(ks .=> vs);

julia> d2 = Dictionary(ks, vs);

julia> @btime getindices(d1, 1:20_000);
  155.533 μs (4 allocations: 156.38 KiB)

julia> @btime getindices(d2, 1:20_000);
  86.961 ms (4 allocations: 156.38 KiB)

julia> @btime getindex.(Ref(d2), 1:20_000);
  86.091 ms (5 allocations: 156.39 KiB)

Automatic conversion between dictionaries

Currently, Base.Dict allows

julia> convert(Dict{Int,Int}, Dict())
Dict{Int64, Int64}()

but this fails with Dictionary:

julia> convert(Dictionary{Int,Int}, Dictionary())

ERROR: MethodError: Cannot `convert` an object of type 
  Dictionary{Any,Any} to an object of type 
  Dictionary{Int64,Int64}
Closest candidates are:
  convert(::Type{T}, ::T) where T at essentials.jl:218
  Dictionary{I, T}(::Any) where {I, T} at /home/belmant/.julia/packages/Dictionaries/khydh/src/Dictionary.jl:60
  Dictionary{I, T}(::Any, ::Any) where {I, T} at /home/belmant/.julia/packages/Dictionaries/khydh/src/Dictionary.jl:108
Stacktrace:
 [1] top-level scope
   @ REPL[8]:1

Would it be possible to define conversion operators that would allow it? Is it a deliberate choice not to define them? It is useful for initializing structs that have dictionary fields, which currently require typing in Dictionary{K,V}() with K and V potentially long types.

Since I believe keys and values are stored in vectors, we could delegate the conversion to conversion between vector types.

Broadcasting dictionaries with unmatched keys gives confusing result

julia> d = dictionary(:a => 1, :b => 2)
2-element HashDictionary{Symbol,Int64}
 :a │ 1
 :b │ 2

julia> e = dictionary(:a => 3, :c => 4)
2-element HashDictionary{Symbol,Int64}
 :a │ 3
 :c │ 4

julia> d .+ e
2-element HashDictionary{Symbol,Int64}
 :a │ 4
 :b │ 6

Maybe it should do what mergewith does? However, each kind of join also makes sense. I think slightly more manual API for specifying the join would be nice JuliaLang/julia#25904 (comment)

Behavior seems to depend on order items added to dictionary

The behavior seems to depend in an odd way on the order in which keys are added in to the dictionary. For example, I would expect that the following two dictionaries would be identical:

julia> d1 = Dictionary(["a","b"],[1;2])
2-element Dictionary{String,Int64}
 "a" │ 1
 "b" │ 2

julia> d2 = Dictionary(["b","a"],[1;2])
2-element Dictionary{String,Int64}
 "b" │ 1
 "a" │ 2

However, d1 .+ d2 gives an error:

julia> d1 .+ d2
ERROR: IndexError("Indices do not match")
Stacktrace:
 [1] _sharetokens at /Users/spielman/.julia/packages/Dictionaries/uSQZZ/src/broadcast.jl:68 [inlined]
 [2] BroadcastedDictionary(::Function, ::Tuple{Dictionary{String,Int64},Dictionary{String,Int64}}) at /Users/spielman/.julia/packages/Dictionaries/uSQZZ/src/broadcast.jl:12
 [3] broadcasted at /Users/spielman/.julia/packages/Dictionaries/uSQZZ/src/broadcast.jl:116 [inlined]
 [4] broadcasted(::Function, ::Dictionary{String,Int64}, ::Dictionary{String,Int64}) at ./broadcast.jl:1238
 [5] top-level scope at REPL[25]:1
 [6] eval(::Module, ::Any) at ./boot.jl:331
 [7] eval_user_input(::Any, ::REPL.REPLBackend) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
 [8] run_backend(::REPL.REPLBackend) at /Users/spielman/.julia/packages/Revise/kqqw8/src/Revise.jl:1163
 [9] top-level scope at none:0

Whereas, I do not get that error if I create the dictionary in the other order.

julia> d2 = Dictionary(["a","b"],[1;2])
2-element Dictionary{String,Int64}
 "a" │ 1
 "b" │ 2

julia> d1 .+ d2
2-element Dictionary{String,Int64}
 "a" │ 2
 "b" │ 4

support any indexable collection in `Indices`?

I would like to enforce a particular partial order upon dict construction and for that I'd like to have an AbstractIndices object backed by any collection that supports a notion of indexing (e.g. getindex, setindex!, firstindex, nextind and lastindex) instead of being limited to Vectors like Indices.
My first thought was to just make the types of hashes and values in Indices parametric. Looking at the implementation, I believe one difficulty would be that there needs to be some way of figuring out a sentinel value to denote an empty slot in slots, but perhaps that can just default to something like firstindex(x) - 1 and be overloadable in any cases where that doesn't work? Another problem would probably the use of resize! since that assumes new items always get inserted at the end, which wouldn't hold true in the case I'm thinking of.
Another option might be to separate all the hashtable code from the rest of the Indices impkementation, so it's easy to reuse in custom subtypes of AbstractIndices. Any thoughts? Is this something you thought about before?

Example in the documentation shortchanges `Dict`

The speed difference is quite real, but the examples in the documentation shortchange the Base implementation of Dict by a factor of about 2. The base implementation is as efficient at iterating over (key,value) pairs as it is at iterating over keys by themselves, i.e. no hashing required, so the proper idiomatic way to deal with the function g from the documentation example is

function g2(d1, d2)
    out = sizehint!(Dict{Int64,Int64}(),length(d1)*3÷2+1)
    for (k,v) in d1
        out[k] = v + d2[k]
    end
    return out
end

Faster still, we can avoid hashing out as well, by deep copying d1 using the 8-argument constructor of Dict{K,V}, so out and d1 initially share indices. Then we only need to use the getkey version of hashing on d2.

function combinematchingdictionaries(op,d1::Dict{K,V},d2::Dict{K,V}) where {K,V}
    out = Dict{K,V}(
        copy(d1.slots),copy(d1.keys),Array{V}(undef,length(d1.slots)),
        d1.ndel,d1.count,d1.age,d1.idxfloor,d1.maxprobe
    )
    for (i,(k,v)) in enumerate(d1)
        out.vals[i] = op(v,d2[k])
    end
    return out
end
@btime combinematchingdictionaries((+),dict1,dict2)

for a further significant savings in my benchmarking.

I think the overall goal of Dictionaries is great.

Broadcasting `only` omits checks due to inbounds propagation

Here's a dictionary where values contain collections of length not equal to 1. Broadcasting only should result in an error, but I get the first element instead:

julia> d = Dictionary([:a, :b], [[1,2], [3,4]])
2-element Dictionary{Symbol, Vector{Int64}}
 :a │ [1, 2]
 :b │ [3, 4]

julia> only.(d)
2-element Dictionary{Symbol, Int64}
 :a │ 1
 :b │ 3

Digging a little through the dispatch chain leads to

julia> bc = Base.broadcasted(only, d);

julia> out = similar(bc, Int)
2-element Dictionary{Symbol, Int64}
 :a │ 1
 :b │ 7

julia> map!(identity, out, bc)
2-element Dictionary{Symbol, Int64}
 :a │ 1
 :b │ 3

which further leads to the @inbounds assertion within the map! implementation, and the @propagate_inbounds within gettokenvalue(d::BroadcastedDictionary, t). This removes the @boundscheck within only when only is inlined. So while we have the expected:

julia> Dictionaries.gettokenvalue(bc, (0,1))
ERROR: ArgumentError: Collection has multiple elements, must contain exactly 1 element
Stacktrace:
 [1] only
   @ ./iterators.jl:1327 [inlined]

When this is optimized within an inbounds context (within map!), the bounds check disappears:

julia> foo(bc) = @inbounds Dictionaries.gettokenvalue(bc, (0,1))

julia> foo(bc)
1

Go back to a base `Dict` or some kind of table interface

I'm a happy user of your package, in our line of work we process many many independent files to make a summary histograms or extract parts of the data. In the end I simply do a reduce((x,y) -> append!.(x,y), results) to collect the results together without manually tracking the order of things.

However, it's rather difficult if I want to put them into a table or anything because Dictionary doesn't conform with Table interface (, expected), but also can't go back to Dict:

10-element Dictionaries.Dictionary{Symbol, Vector{Float32}}
  :lep1_pt │ Float32[33638.438, 92686.78, 38112.855, 110358.19, 164663.92, 9687…
 :lep1_eta │ Float32[0.45212966, -0.83190763, -1.0084606, 0.20597617, 1.0637895…
 :lep1_phi │ Float32[-1.4313298, -1.067748, -1.569317, 0.66097116, 2.2409034, -…
 :lep1_pid │ Float32[13.0, -11.0, 13.0, -11.0, 13.0, 13.0, -11.0, 11.0, -13.0, …
  :lep2_pt │ Float32[26518.098, 51955.1, 33665.395, 67624.08, 78728.49, 75583.3…
 :lep2_eta │ Float32[-2.23238, -0.5876625, -2.065182, -1.0818828, 2.0865402, 0.…
 :lep2_phi │ Float32[-2.4220688, 2.4145992, 0.50862205, 2.3646975, 0.4607052, 2…
 :lep2_pid │ Float32[-13.0, 11.0, 11.0, -13.0, 13.0, -13.0, -13.0, -13.0, 13.0,…
      :MET │ Float32[64662.047, 39246.2, 90002.63, 121876.12, 41813.074, 125130…
  :mass_4l │ Float32[154441.69, 289828.94, 148317.48, 248640.03, 339248.1, 2372…

julia> Dict(SIGhist)
Dict{Float32, Float32} with 10 entries:
  13.0      => -11.0
  -2.23238  => -0.587663
  64662.0   => 39246.2
  33638.4   => 92686.8
  26518.1   => 51955.1
  0.45213   => -0.831908
  -13.0     => 11.0
  1.54442f5 => 2.89829f5
  -1.43133  => -1.06775
  -2.42207  => 2.4146

What's the recommended workflow?

Feature-request: Support Base.haskey

Hi (sorry for the dense opening of issues, I hope you appreciate my attempt to just help)

Just stumbled upon Base.haskey, which is quite standard for Base.AbstractDict and makes especially sense for AbstractDictionary as well, because the iterator does not go over the keys, but the values, so plain in is still not working.

Would be great to have it for completenes

[Suggestion] mirror Array{Type}(undef, size) style

Array:
Array{Type}(undef, size)

Dictionary:
Dictionary{Key,Value}(inds, undef)

what was the reason for flipping the argument order? OffsetArrays are the most obvious case as it fits into both domains.

pre-allocate storage before index()

Hi,

Use case: we create an index/dict over the content of a file. The file has a header with the number of records. Could be 1.000, 1mio or 100mio, whatever. Currently we are using index() to iteratively scan each record, determine the key and create a key->record entry in the dict. For large files with many records, that probably means that the Dictionary internal data structures need to grow on demand. In our use case, I know the size (number of keys) upfront. Can I somehow use index() with pre-allocate structures?

thanks a lot

Define a divide-and-conquer parallelism API?

Hi, I just registered a small package SplittablesBase.jl which is used from Transducers.jl. Essentially, you can support parallel reduce and everything that can be derived from it (e.g., map) by defining a single function halve(collection). I wonder if you are interested in defining it.

Removing all elements from a Dictionary

Hi!

I'm not sure if I overlooked this, but I can't find a method in the README that would allow me to remove all items in a Dictionary.

This way gives me behaviour that seems a bit arbitrary:

julia> dict = Dictionary(["a", "b", "c"], [1, 2, 3])
3-element Dictionary{String,Int64}
 "a" │ 1
 "b" │ 2
 "c" │ 3

julia> for k in keys(dict)
           unset!(dict, k) # Or delete!
       end

julia> dict
1-element Dictionary{String,Int64}
 "b" │ 2

Reuse deleted places in value vector

I was just reading along and was curious if it would make sense to reuse the position in the value vector (because indices are unique) if it corresponds to a deleted index.

Dictionaries.jl/src/Indices.jl

Line 391 in 6918361

push!(_hashes(indices), full_hash)

Semantic clash with `Base.insert!`

insert!(dict, k, val) with a dictionary means something very different from insert! with a Vector. The Vector API from Base is more or less sequence- or iterable-like: insert something at a given index of the sequence, and move the remaining values to make room, preserving iteration order.

I wondered whether this is an oversight, or whether there's a plan to deal with this in some way?

Bug-report: Example `similar` does not work

Using julia 1.6.1 with Dictionaries v0.3.9, I get the following error when executing part of the README.md

julia> using Dictionaries

julia> dict = Dictionary(["a", "b", "c"], [1, 2, 3])
3-element Dictionary{String, Int64}
 "a" │ 1
 "b" │ 2
 "c" │ 3

julia> similar(dict, Vector{Int})
Error showing value of type Dictionary{String, Vector{Int64}}:
ERROR: UndefRefError: access to undefined reference
Stacktrace:
  [1] getindex
    @ ./array.jl:801 [inlined]
  [2] gettokenvalue
    @ ~/.julia/packages/Dictionaries/YpAxR/src/Dictionary.jl:288 [inlined]
  [3] iterate
    @ ~/.julia/packages/Dictionaries/YpAxR/src/AbstractDictionary.jl:52 [inlined]
  [4] isempty
    @ ./essentials.jl:767 [inlined]
  [5] show(io::IOContext{Base.TTY}, #unused#::MIME{Symbol("text/plain")}, d::Dictionary{String, Vector{Int64}})
    @ Dictionaries ~/.julia/packages/Dictionaries/YpAxR/src/show.jl:127
  [6] (::REPL.var"#38#39"{REPL.REPLDisplay{REPL.LineEditREPL}, MIME{Symbol("text/plain")}, Base.RefValue{Any}})(io::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:220
  [7] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:462
  [8] display(d::REPL.REPLDisplay, mime::MIME{Symbol("text/plain")}, x::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:213
  [9] display(d::REPL.REPLDisplay, x::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:225
 [10] display(x::Any)
    @ Base.Multimedia ./multimedia.jl:328
 [11] #invokelatest#2
    @ ./essentials.jl:708 [inlined]
 [12] invokelatest
    @ ./essentials.jl:706 [inlined]
 [13] print_response(errio::IO, response::Any, show_value::Bool, have_color::Bool, specialdisplay::Union{Nothing, AbstractDisplay})
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:247
 [14] (::REPL.var"#40#41"{REPL.LineEditREPL, Pair{Any, Bool}, Bool, Bool})(io::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:231
 [15] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:462
 [16] print_response(repl::REPL.AbstractREPL, response::Any, show_value::Bool, have_color::Bool)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:229
 [17] (::REPL.var"#do_respond#61"{Bool, Bool, REPL.var"#72#82"{REPL.LineEditREPL, REPL.REPLHistoryProvider}, REPL.LineEditREPL, REPL.LineEdit.Prompt})(s::REPL.LineEdit.MIState, buf::Any, ok::Bool)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:798
 [18] #invokelatest#2
    @ ./essentials.jl:708 [inlined]
 [19] invokelatest
    @ ./essentials.jl:706 [inlined]
 [20] run_interface(terminal::REPL.Terminals.TextTerminal, m::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
    @ REPL.LineEdit /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/LineEdit.jl:2441
 [21] run_frontend(repl::REPL.LineEditREPL, backend::REPL.REPLBackendRef)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:1126
 [22] (::REPL.var"#44#49"{REPL.LineEditREPL, REPL.REPLBackendRef})()
    @ REPL ./task.jl:411

Here my full versioninfo:

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, icelake-client)
Environment:
  JULIA_EDITOR = code-insiders
  JULIA_NUM_THREADS =

Compare hashes when gettoken

Hello, and thanks for developing Dictionaries!

I noticed in the source code for Indices that, when checking for the presence of a key K which hashes to hash H, there is no check if H is in hashes.

In Python's new dict implementation in 3.6, this optimization was mentioned by the contributer. It does incur one extra array index (which can even incur a cache miss), but if the keys are expensive to compare with isequal, potentially a lot of time can be saved. Also, when the key do not match, there is overwhelming probability the hashes will not match either, and the next hash can be compared instead - which may speed up the check even further.

Not sure if you have already tried implementing it, but it may be a low hanging fruit!

Garbage collection of deleted keys

I ran into this issue a while ago trying to add/remove keys from a dictionary and having the memory build up despite the dictionary itself not growing in size. I was curious how the behavior might be different with Dictionaries.jl but I'm running into the same difference in behavior from Windows to Linux that I experienced here.

For the following code I have it saved in a file called dict_vs_Dictionaries_allocation_test.jl. I have it in an environment with BenchmarkTools and Dictionaries installed

using Pkg
Pkg.activate(@__DIR__)

using Dictionaries
using BenchmarkTools
using InteractiveUtils

versioninfo()

function test(use_dict)
	use_dict ? println("Using Base dict") : println("Using Dictionaries.jl")
	totmem = Int64(Sys.total_memory())

	getfreemem() = Int64(Sys.free_memory())

	# println()
	# if isempty(ARGS)
	# 	println("Testing in global space")
	# else
	# 	println("Testing inside a function")
	# end
	# println()

	function create_test_dict(totmem)
		testdict = Dict{Int64, Matrix{Float64}}()
		#insert 100k elements
		for i in 1:100000
			push!(testdict, i => ones(Float64, 100, 100))
		end
		freemem = totmem - getfreemem()
		
		#delete the first 50k elements
		for i in 1:50000
			delete!(testdict, i)
		end
		return freemem, testdict
	end

	function create_test_dictionary(totmem)
		testdict = Dictionary{Int64, Matrix{Float64}}()
		#insert 100k elements
		for i in 1:100000
			insert!(testdict, i, ones(Float64, 100, 100))
		end
		freemem = totmem - getfreemem()
		
		#delete the first 50k elements
		for i in 1:50000
			delete!(testdict, i)
		end
		return freemem, testdict
	end


	initialfreemem = totmem - getfreemem()


	mem2, testdict = if use_dict
		@time create_test_dict(totmem)
	else
		@time create_test_dictionary(totmem)
	end


	#try to garbage collect parts of dictionary now that it is no longer needed
	GC.gc()

	post_gc_mem = totmem - getfreemem()

	println("Memory added from dictionary is $((mem2 - initialfreemem)/(10^9)) gigabytes")
	println("Memory freed from clearing half of dictionary is $((mem2 - post_gc_mem)/(10^9)) gigabytes")
	println("Final memory usage as a percent of allocated is $(100*(post_gc_mem - initialfreemem)/(mem2 - initialfreemem))% (ideally would be 50%)")
	#verify that testdict is missing the first 50k indicies
	try 
		testdict[1]
	catch
		println("Could not access index 1 as expected")
	end
	println("Waiting 10 seconds to see if gc occurs")
	1 + 1 #random code execution to see if it affects gc
	sleep(10)
	println("Final memory usage as a percent of allocated is $(100*((totmem - getfreemem()) - initialfreemem)/(mem2 - initialfreemem))% (ideally would be 50%)")
	println("_____________________________________________________________________________")
end

#toggle to use dict or Dictionaries.jl
use_dict = isempty(ARGS) ? true : false
test(use_dict)

Project.toml

[deps]
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
Dictionaries = "85a47980-9c8c-11e8-2b9f-f7ca1fa99fb4"

Manifest.toml

# This file is machine-generated - editing it directly is not advised

[[BenchmarkTools]]
deps = ["JSON", "Logging", "Printf", "Statistics", "UUIDs"]
git-tree-sha1 = "c31ebabde28d102b602bada60ce8922c266d205b"
uuid = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
version = "1.1.1"

[[Dates]]
deps = ["Printf"]
uuid = "ade2ca70-3891-5945-98fb-dc099432e06a"

[[Dictionaries]]
deps = ["Indexing", "Random"]
git-tree-sha1 = "011d10589449681ad631c0f866a2ce72771648c6"
uuid = "85a47980-9c8c-11e8-2b9f-f7ca1fa99fb4"
version = "0.3.10"

[[Indexing]]
git-tree-sha1 = "ce1566720fd6b19ff3411404d4b977acd4814f9f"
uuid = "313cdc1a-70c2-5d6a-ae34-0150d3930a38"
version = "1.1.1"

[[JSON]]
deps = ["Dates", "Mmap", "Parsers", "Unicode"]
git-tree-sha1 = "81690084b6198a2e1da36fcfda16eeca9f9f24e4"
uuid = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
version = "0.21.1"

[[Libdl]]
uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb"

[[LinearAlgebra]]
deps = ["Libdl"]
uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"

[[Logging]]
uuid = "56ddb016-857b-54e1-b83d-db4d58db5568"

[[Mmap]]
uuid = "a63ad114-7e13-5084-954f-fe012c677804"

[[Parsers]]
deps = ["Dates"]
git-tree-sha1 = "94bf17e83a0e4b20c8d77f6af8ffe8cc3b386c0a"
uuid = "69de0a69-1ddd-5017-9359-2bf0b02dc9f0"
version = "1.1.1"

[[Printf]]
deps = ["Unicode"]
uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7"

[[Random]]
deps = ["Serialization"]
uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"

[[SHA]]
uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce"

[[Serialization]]
uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b"

[[SparseArrays]]
deps = ["LinearAlgebra", "Random"]
uuid = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"

[[Statistics]]
deps = ["LinearAlgebra", "SparseArrays"]
uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"

[[UUIDs]]
deps = ["Random", "SHA"]
uuid = "cf7118a7-6976-5b1a-9a39-7adc72f591a4"

[[Unicode]]
uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"

These are the results of running the script in both Windows and Linux (I used WSL but experienced the same problem in native Linux).

$ julia-1.6 dict_vs_Dictionaries_alloaction_test.jl true
  Activating environment at `E:\Dropbox (Personal)\Misc. Programs\Julia Code Testing\DictionaryTesting\Project.toml`
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Using Dictionaries.jl
  3.976637 seconds (519.21 k allocations: 7.478 GiB, 19.32% gc time, 2.74% compilation time)
Memory added from dictionary is 8.004358144 gigabytes
Memory freed from clearing half of dictionary is 3.968114688 gigabytes
Final memory usage as a percent of allocated is 50.42557296146892% (ideally would be 50%)
Could not access index 1 as expected
Waiting 10 seconds to see if gc occurs
Final memory usage as a percent of allocated is 50.33003460770683% (ideally would be 50%)
_____________________________________________________________________________

$ julia-1.6 dict_vs_Dictionaries_alloaction_test.jl
  Activating environment at `E:\Dropbox (Personal)\Misc. Programs\Julia Code Testing\DictionaryTesting\Project.toml`
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Using Base dict
  4.037212 seconds (544.58 k allocations: 7.478 GiB, 20.08% gc time, 2.76% compilation time)
Memory added from dictionary is 8.015466496 gigabytes
Memory freed from clearing half of dictionary is 3.960201216 gigabytes
Final memory usage as a percent of allocated is 50.593003938369904% (ideally would be 50%)
Could not access index 1 as expected
Waiting 10 seconds to see if gc occurs
Final memory usage as a percent of allocated is 50.54691065107535% (ideally would be 50%)
_____________________________________________________________________________

$ julia-1.6 dict_vs_Dictionaries_alloaction_test.jl
  Activating environment at `/mnt/e/Dropbox (Personal)/Misc. Programs/Julia Code Testing/DictionaryTesting/Project.toml`
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Using Base dict
  5.088407 seconds (547.23 k allocations: 7.478 GiB, 21.41% gc time, 2.21% compilation time)
Memory added from dictionary is 8.02164736 gigabytes
Memory freed from clearing half of dictionary is -0.001548288 gigabytes
Final memory usage as a percent of allocated is 100.01930137203139
Could not access index 1
Waiting 10 seconds to see if gc occurs
Final memory usage as a percent of allocated is 100.01930137203139
_____________________________________________________________________________

$ julia-1.6 dict_vs_Dictionaries_alloaction_test.jl true
  Activating environment at `/mnt/e/Dropbox (Personal)/Misc. Programs/Julia Code Testing/DictionaryTesting/Project.toml`
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Using Dictionaries.jl
  5.269116 seconds (519.20 k allocations: 7.478 GiB, 20.45% gc time, 2.21% compilation time)
Memory added from dictionary is 8.024260608 gigabytes
Memory freed from clearing half of dictionary is -0.001032192 gigabytes
Final memory usage as a percent of allocated is 100.01286339078982
Could not access index 1 as expected
Waiting 10 seconds to see if gc occurs
Final memory usage as a percent of allocated is 100.01929508618471 (ideally would be 50%)
_____________________________________________________________________________

In Linux the memory is not being freed and this occurs with removing the entire dictionary as well which was the previous discourse test case. Also I'm interested in hearing if for this type of use case there is a more efficient way to implement it with Dictionaries.jl

Should this package be in Base? Or as stdlib?

I noticed:

Simultaneously, we are pushing the performance of working with dictionaries to be closer to that of working with arrays.

I'm looking into replacing Dict with OrderedDict (already done successfully in Base) or other such variant.

My work should maybe be independent of yours, unless you know it might be helpful to incorporate it too. I'll look more closely laer at this package.

sizehint! invalidates values

Minimal broken example:

julia> dict = HashDictionary(["a", "b", "c"], [1, 2, 3])
3-element HashDictionary{String,Int64}
 "c" │ 3
 "b" │ 2
 "a" │ 1

julia> sizehint!(dict, 100)
3-element HashDictionary{String,Int64}
 "c" │ 3
 "b" │ #undef
 "a" │ #undef

julia> insert!(dict, "d", 4)
4-element HashDictionary{String,Int64}
 "c" │ 3
 "d" │ #undef
 "b" │ #undef
 "a" │ #undef

I noticed this when experimenting with a few benchmarks against Base.Dict for some of my use cases (mostly initializing an empty dict, and then doing a bunch of upserts).
HashDictionary objects seem to allocate once per insert!, while Base.Dict seem to have some growth factor.
I honestly don't know much about dictionary implementations, but was curious how they would compare for such use cases. sizehint! isn't actually that useful for me, because I rarely know in advance just how many elements the dict will have.

collect(::Dictionary) doesn't work

HashDictionary does work, but not Dictionary:

julia> d = Dictionaries.Dictionary([1], [:a])
1-element Dictionary{Int64,Symbol,Array{Int64,1},Array{Symbol,1}}
 1 │ :a

julia> collect(d)
ERROR: MethodError: no method matching size(::Dictionary{Int64,Symbol,Array{Int64,1},Array{Symbol,1}})

get() with a default of different type

For some reason, get(dict, k, default) tries to convert default to the dict value type. Clearly, this fails in the following typical case:

dict = Dictionary(Dict(:a => 1, :b => 2))
get(dict, :a, -1)       # == 1, ok
get(dict, :x, -1)       # ==-1, ok
get(dict, :x, nothing)  # MethodError: Cannot `convert` an object of type Nothing to an object of type Int64

segfault in `index`

julia> using CSV, DataFrames, SplitApplyCombine, Dictionaries

julia> data = CSV.read("minimal.csv", DataFrame);

julia> samples = index(last, pairs(groupcount(data.Date)));
zsh: segmentation fault  julia

This sometimes results in a StackOverflowError instead:


ERROR: StackOverflowError:
Stacktrace:
 [1] Array at ./boot.jl:406 [inlined]
 [2] similar at /Users/yurivish/.julia/packages/Dictionaries/M6hc8/src/Dictionary.jl:339 [inlined]
 [3] similar at /Users/yurivish/.julia/packages/Dictionaries/M6hc8/src/AbstractDictionary.jl:351 [inlined]
 [4] copy at /Users/yurivish/.julia/packages/Dictionaries/M6hc8/src/AbstractDictionary.jl:480 [inlined]
 [5] __dictionary(::typeof(last), ::typeof(identity), ::Dictionary{Int64,Pair{Union{Missing, String},Int64}}, ::Dictionaries.PairDictionary{Union{Missing, String},Int64,Dictionary{Union{Missing, String},Int64}}, ::Int64) at /Users/yurivish/.julia/packages/Dictionaries/M6hc8/src/Dictionary.jl:225
 [6] __dictionary(::typeof(last), ::typeof(identity), ::Dictionary{Int64,Pair{Union{Missing, String},Int64}}, ::Dictionaries.PairDictionary{Union{Missing, String},Int64,Dictionary{Union{Missing, String},Int64}}, ::Int64) at /Users/yurivish/.julia/packages/Dictionaries/M6hc8/src/Dictionary.jl:230 (repeats 29087 times)
 [7] _dictionary(::typeof(last), ::typeof(identity), ::Type{Dictionary}, ::Dictionaries.PairDictionary{Union{Missing, String},Int64,Dictionary{Union{Missing, String},Int64}}) at /Users/yurivish/.julia/packages/Dictionaries/M6hc8/src/Dictionary.jl:204
 [8] index(::Function, ::Dictionaries.PairDictionary{Union{Missing, String},Int64,Dictionary{Union{Missing, String},Int64}}) at /Users/yurivish/.julia/packages/Dictionaries/M6hc8/src/Dictionary.jl:265
 [9] top-level scope at REPL[20]:1

The file minimal.csv is here.

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

sizehint causing the app to block?

Hi,

I extracted the below snippet into a separate file, and insert it from REPL to validate/demonstrate the issue. I'm on a Win10 laptop running Julia 1.5.3, the issue is in my env 100% repeatable.

    using Dictionaries

    function unique_index_from_in_memory(len::Int)
        sizehint = floor(Int, len * 1.2)
        d = Dictionary{Int,Int}(; sizehint=sizehint)

        count = 1
        @show sizehint, len
        for value in 1:len # values(this, field)
            println(Base.stderr, "$count: $value")
            #insert!(d, value, count)
            set!(d, value, count)
            count += 1
        end

        return d
    end

    function test_fn()
        unique_index_from_in_memory(76525)
    end

    test_fn()

What is the issue: The app blocks/stops progressing forever at a certain point.
My output is as follows. I stopped the app after 5 mins, by hitting ctrl-c several times (hitting it ones is not enough). I don't get any stacktrace.
The app always stops at the same position: 153
It depends on the value of sizehint, where it stops.
Without any sizehint, it runs without issues.
The issue is independent from set!() and insert!()

     148: 148
     149: 149
     150: 150
     151: 151
     152: 152
     153: 153
     WARNING: Force throwing a SIGINT

Any idea why that might happen?

UPDATE: properly formatted the code

Shared indices not properely updated?

Awesome package!

I was trying out how sharing of the indices works, and came upon the following weird example:

julia> d1 = Dictionary([:a, :b], 1:2)
2-element Dictionary{Symbol, Int64}
 :a │ 1
 :b │ 2

julia> d2 = d1 .+ 10
2-element Dictionary{Symbol, Int64}
 :a │ 11
 :b │ 12

julia> set!(d2, :c, 20)
3-element Dictionary{Symbol, Int64}
 :a │ 11
 :b │ 12
 :c │ 20

julia> d1
3-element Dictionary{Symbol, Int64}
 :a │ 1
 :b │ 2
 :c │ #undef

julia> d1[:c]
0

julia> set!(d1, :c, 42)
3-element Dictionary{Symbol, Int64}
 :a │ 1
 :b │ 2
 :c │ #undef

julia> d1
3-element Dictionary{Symbol, Int64}
 :a │ 1
 :b │ 2
 :c │ #undef

julia> d1[:c]
42

What's going on there? Is it just printing?

Support for `mergewith` and `mergewith!`

julia> dict = Dictionary(["a", "b"], [rand(10), rand(10)])
2-element Dictionary{String, Vector{Float64}}
 "a" │ [0.6754523471882936, 0.9575574532471981, 0.6527535430527817, 0.898230973122484, 0.886…
 "b" │ [0.11071336894931672, 0.04832475875008868, 0.954036463626111, 0.6666379467912544, 0.5…

julia> d2 = Dictionary(["a", "b"], [rand(10), rand(10)])
2-element Dictionary{String, Vector{Float64}}
 "a" │ [0.15383191039124666, 0.985055480434822, 0.2125298315455063, 0.08767770837077182, 0.2…
 "b" │ [0.26587711060898767, 0.3254898431836074, 0.9454838456008, 0.1076792806156961, 0.4809…

julia> mergewith!(dict,d2)
ERROR: MethodError: no method matching mergewith!(::Dictionary{String, Vector{Float64}}, ::Dictionary{String, Vector{Float64}})

julia> mergewith(dict,d2)
ERROR: MethodError: no method matching mergewith(::Dictionary{String, Vector{Float64}}, ::Dictionary{String, Vector{Float64}})

FeatureRequest: consistent show

Hi,

I just updated Dictionaries and a couple of my jldoctest failed right away. This itself is annoying, and inspecting it further showed that the current printing of Dictionaries is inconsistent and will probably break again in the future...

julia> using Dictionaries

julia> [dictionary([:a => 1, :b => 2])]  # new in Dictionaries v0.3.10
1-element Vector{Dictionary{Symbol, Int64}}:
 {:a = 1, :b = 2}

julia> dictionary([:a => 1, :b => 2])  # old style, still present in Dictionaries v0.3.10
2-element Dictionary{Symbol, Int64}
 :a │ 1
 :b │ 2

EDIT: It is about using = or | for separating key and value.

consider unifying these two kinds of presentations
Please consider making show/print a stable part of the API, because it is really decisive when using jldoctest.

`dictionary` with mixture of key types

Hi Andy, thanks for the great package. We're currently using Dictionaries.jl in our dictionary-of-keys data structure in our block sparse tensor type inside ITensors.jl since I found that it is much faster than Base.Dict for iteration. I also like that the interface is closer to the Array interface, once I started using this package it's tough going back to using Base.Dict without getting totally confused by the interface for iteration and other operations.

A small issue I came across is the following:

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) E-2176M  CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = vim

(@v1.7) pkg> st Dictionaries
      Status `~/.julia/environments/v1.7/Project.toml`
  [85a47980] Dictionaries v0.3.15

julia> d = dictionary(["A" => 1, :B => 2])
ERROR: MethodError: Cannot `convert` an object of type Symbol to an object of type String
Closest candidates are:
  convert(::Type{String}, ::String) at ~/software/julia-1.7.0/share/julia/base/essentials.jl:223
  convert(::Type{T}, ::T) where T<:AbstractString at ~/software/julia-1.7.0/share/julia/base/strings/basic.jl:231
  convert(::Type{T}, ::AbstractString) where T<:AbstractString at ~/software/julia-1.7.0/share/julia/base/strings/basic.jl:232
  ...
Stacktrace:
 [1] gettoken!(d::Dictionary{String, Int64}, i::Symbol)
   @ Dictionaries ~/.julia/packages/Dictionaries/mxSnY/src/insertion.jl:46
 [2] __dictionary(key::typeof(first), value::typeof(last), dict::Dictionary{String, Int64}, iter::Vector{Pair{Any, Int64}}, s::Int64)
   @ Dictionaries ~/.julia/packages/Dictionaries/mxSnY/src/Dictionary.jl:225
 [3] _dictionary(key::typeof(first), value::typeof(last), #unused#::Type{Dictionary}, iter::Vector{Pair{Any, Int64}})
   @ Dictionaries ~/.julia/packages/Dictionaries/mxSnY/src/Dictionary.jl:210
 [4] dictionary(iter::Vector{Pair{Any, Int64}})
   @ Dictionaries ~/.julia/packages/Dictionaries/mxSnY/src/Dictionary.jl:193
 [5] top-level scope
   @ REPL[10]:1

I was hoping this would promote the keys to a common data type (in this case Any). Instead it looks like it is basing the key type on the first key it sees.

(HashIndices([1, missing]) == HashIndices([1, missing])) == missing?

Writing JuliaLang/julia#34744 (comment), I started wonder if == on your set types (AbstractIndices) would be defined in terms of strong_equal(a, b) = isequal(a, b) && a == b as they are the mapping to itself? Also, wouldn't this solve the O(N^2) problem for defining == you mentioned?

Register v0.2.0

@JuliaRegistrator register master

andyferris / dictionaries.jl Goto Github PK

dictionaries.jl's People

Contributors

Stargazers

Watchers

Forkers

dictionaries.jl's Issues

Recommend Projects

Recommend Topics

Recommend Org