strs.jl's People

Contributors

juliatagbot, oxinabox, pallharaldsson, rfourquet, scottpjones, sschelm

strs.jl's Issues

specialize join and *

Currently join in Base uses sprint. We should have a specialized method for UniStr.
The same goes for *, which uses string.

Strs clashes with Compat

As follows:

julia> using Strs

julia> using Compat
WARNING: Method definition in(Any) in module Strs overwritten in module Compat
WARNING: Method definition ==(Any) in module Strs overwritten in module Compat
WARNING: Method definition contains(AbstractString, Base.Regex) in module Strs overwritten in module Compat
WARNING: Method definition isequal(Any) in module Strs overwritten in module Compat

Roadmap for Strs.jl

  • Add built-in SubStr support
  • Add hashed string support
  • Write PCRE2.jl package using BinaryBuilder/BinaryProvider, creating all 3 PCRE2 libraries
  • Move regex support to StrRegex package
  • Change the ranges returned to end with the next codeunit index - 1, so that constant thisind/prevind calls are not needed
  • Possibly move optimized utf8 and utf16 support into StrUTF8 and StrUTF16 packages, to streamline main package
  • Add in functionality from StringLiterals, but in such a way that the entity tables can be dynamically loaded, as well as the formatting support, and make it extensible, so that other extensions can be added.
  • Support short strings
  • Support storing type information at end of string, have single concrete type instead of UniStr.

Caching of hash

Do you have a clear idea how you want to implement caching of hashes?

I see two options (both have pros and cons - which we could discuss, but maybe there is a third way):

  1. actually mutate the immutable struct (this is possible, as I am sure you know 😄)
  2. make the field holding both cached and hash a mutable container (there are several options for this; probably the simplest is a mutable struct)
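
A minimal sketch of option 2, under the assumption that the cache lives in a small mutable struct held by an otherwise immutable string type. HashCache, CachedStr and cached_hash are hypothetical names used only for illustration; they are not part of Strs.jl.

```julia
# Hypothetical mutable container holding the cached hash and a validity flag.
mutable struct HashCache
    cached::Bool
    hash::UInt
end

# Hypothetical immutable wrapper; only the contents of `cache` can be mutated.
struct CachedStr
    data::String
    cache::HashCache
end

# Convenience constructor that starts with an empty cache.
CachedStr(s::AbstractString) = CachedStr(String(s), HashCache(false, zero(UInt)))

# Compute the hash on first use, then reuse the stored value.
function cached_hash(s::CachedStr)
    c = s.cache
    if !c.cached
        c.hash = hash(s.data)
        c.cached = true
    end
    return c.hash
end
```

The cost of this layout is one extra heap allocation per string, which is one of the trade-offs to weigh against option 1.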

Constructing Str from SubString fails

Currently Str(substring) falls back to a convert method which fails.

Additionally, the currently recommended design in Julia (AFAIK) is to define constructors and have convert methods call them, not the other way around (though this is probably not crucial).
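
To illustrate the pattern, here is a sketch with a hypothetical Wrapped type (not a Strs.jl type): the constructor does the actual work, and convert only delegates to it.

```julia
# Hypothetical wrapper type used only to show the constructor/convert layering.
struct Wrapped
    data::String
end

# The constructor accepts any AbstractString (including SubString) and does the work.
# A plain String argument still dispatches to the default constructor.
Wrapped(s::AbstractString) = Wrapped(String(s))

# convert simply forwards to the constructor.
Base.convert(::Type{Wrapped}, s::AbstractString) = Wrapped(s)

sub = SubString("hello world", 1, 5)
w = Wrapped(sub)          # direct construction from a SubString
v = Wrapped[]
push!(v, sub)             # push! goes through convert, which calls the constructor
```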

slow hash of custom string types

Currently hash for custom strings falls back to the hash of String, which is painfully slow.
This is crucial to fix if we want dictionaries or sets to work fast with the Str family.
Of course, we should ensure that we return the same hash values (which might be hard; I have not thought about it in detail).

Julia 0.6.2 compatibility

Current master fails to load on Julia 0.6.2 due to a problem in line 20 of src/compat.jl (missing Base.SamplerType).

Printing of UniStr

Currently on 0.7 and 0.6.2 the following fail to print:

UniStr(string(Char(0xb5)))

and

UniStr(string(Char(0x00010000)))

(for different reasons)

Hashing of Strs.AbstractChar

Here is an example of problematic behavior:

julia> x = Str("12")
"12"

julia> c
1

julia> hash(c)
ERROR: UndefVarError: hash_uint64 not defined
Stacktrace:
 [1] hash(::Strs.ASCIIChr) at .\hashing.jl:5

as it calls hash_uint64 which is not imported or qualified with Base. (and the method is called with two arguments which is invalid in ).

Whatever we do I think that hash should return the same hash as for corresponding Char.

Conversions between Char and Strs.CodePoint

Currently, direct conversions between Char and the concrete Strs.CodePoint types do not seem to be supported in either direction (you have to go through UInt32 in the middle). Is this intentional?

Design of Str

I understand why Str needs internal fields like cache etc. (100% support 👍), but I do not understand why their types have to be part of the type signature rather than fixed. What is the value of this flexibility?

In short, why is the signature struct Str{T} <: AbstractString not enough for this type?

UniStr type performance

I like the UniStr type, but I have a few small performance concerns with it.

The first is benchmarking:

  • do benchmarks really show that a union of 4 types on 0.7 can actually be considered small 😄 (I know 2 is small, but I have not seen benchmarks for 4)
  • actually this union might be 5 types, not 4; how does it behave then, e.g. Union{UniStr, Missing} and Union{UniStr, Nothing}?

The second is broadcasting and mapping. The compiler does not properly detect the required Union-type:

julia> UniStr.(["a","ą","∀"])
3-element Array{Strs.Str{T,Void,Void,Void} where T,1}:
 "a"
 "ą"
 "∀"

julia> map(UniStr, ["a","ą","∀"])
3-element Array{Strs.Str{T,Void,Void,Void} where T,1}:
 "a"
 "ą"
 "∀"

julia> UniStr[UniStr(s) for s in ["a","ą","∀"]]
3-element Array{UniStr,1}:
 "a"
 "ą"
 "∀"

Any thoughts on fixing this (maybe some promotion rules should be added)? Additionally, such promotion rules should take Missing into account (but maybe this will be handled automatically on 0.7).

Interestingly Set works correctly:

julia> Set{UniStr}(UniStr[UniStr(s) for s in ["a","ą","∀"]])
Set(Union{ASCIIStr, _LatinStr, _UCS2Str, _UTF32Str}["ą", "a", "∀"])

julia> Set{UniStr}(UniStr.(["a","ą","∀"]))
Set(Union{ASCIIStr, _LatinStr, _UCS2Str, _UTF32Str}["ą", "a", "∀"])

(although the type signature is lost)

broadcast on Str for mixed concrete types fails

Here is the problem:

julia> x = ["1", "∀"]
2-element Array{String,1}:
 "1"
 "∀"

julia> Str.(x)
ERROR: InexactError()
Stacktrace:
 [1] & at .\promotion.jl:286 [inlined]
 [2] _str_encode(::_UCS2Str, ::Int64, ::UInt64) at D:\DEV\Julia\Strs.jl\src\encode.jl:90
 [3] convert(::Type{Strs.Str{T,Void,Void,Void} where T}, ::_UCS2Str) at D:\DEV\Julia\Strs.jl\src\encode.jl:103
 [4] setindex!(::Array{Strs.Str{T,Void,Void,Void} where T,1}, ::_UCS2Str, ::Int64) at .\array.jl:583
 [5] setindex! at .\multidimensional.jl:300 [inlined]
 [6] macro expansion at .\broadcast.jl:243 [inlined]
 [7] _broadcast!(::Type{Strs.Str}, ::Array{ASCIIStr,1}, ::Tuple{Tuple{Bool}}, ::Tuple{Tuple{Int64}}, ::Tuple{Array{String,1}}, ::Type{Val{1}}, ::CartesianRange{CartesianIndex{1}}, ::CartesianIndex{1}, ::Int64) at .\broadcast.jl:219
 [8] broadcast_t(::Type{T} where T, ::Type{Any}, ::Tuple{Base.OneTo{Int64}}, ::CartesianRange{CartesianIndex{1}}, ::Array{String,1}) at .\broadcast.jl:265
 [9] broadcast_c at .\broadcast.jl:321 [inlined]
 [10] broadcast(::Type{T} where T, ::Array{String,1}) at .\broadcast.jl:455

The issue is that we are mixing ASCIIStr (which broadcast infers from the first element of the array) with the following _UCS2Str.

vcat on mixed Str fails

Example:

julia> vcat(Str("1"), Str("ą"))
ERROR: TypeError: setindex!: in typeassert, expected String, got Strs.Str{Strs.CSE{CharSet{UniPlus},Encoding{UTF8}()},Void,Void,Void}
Stacktrace:
 [1] setindex!(::Array{String,1}, ::_UCS2Str, ::UnitRange{Int64}) at .\array.jl:591
 [2] _cat(::Array{String,1}, ::Tuple{Int64}, ::Tuple{Bool}, ::ASCIIStr, ::Vararg{Any,N} where N) at .\abstractarray.jl:1225
 [3] cat_t(::Type{T} where T, ::Type{T} where T, ::ASCIIStr, ::Vararg{Any,N} where N) at .\abstractarray.jl:1208
 [4] vcat(::ASCIIStr, ::_UCS2Str) at .\abstractarray.jl:1260

The reason is a missing promotion rule:

julia> promote_rule(typeof(Str("1")), typeof(Str("ą")))
String
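
As a rough sketch of what such a rule could look like, here is an example with two hypothetical stand-in types, NarrowStr and WideStr. They are not the Strs.jl types, and the choice of the wider type as the promotion target is an assumption made for illustration.

```julia
# Hypothetical stand-ins for a narrow and a wide string representation.
struct NarrowStr
    data::String
end

struct WideStr
    data::String
end

# Promotion picks the wider type; only one direction needs to be defined.
Base.promote_rule(::Type{NarrowStr}, ::Type{WideStr}) = WideStr

# Promotion relies on convert to actually widen the value.
Base.convert(::Type{WideStr}, s::NarrowStr) = WideStr(s.data)

# An array literal with mixed element types now gets the promoted element type.
v = [NarrowStr("1"), WideStr("ą")]
```

With both the rule and the conversion in place, containers built from mixed values get a single concrete element type instead of falling back to an unrelated one.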

How is this collection of packages intended to be used?

Hello!

This seems like a great collection of utilities to fill the gap in the standard library, but I can't figure out how you intended for it to be used? Is Strs.jl the top level package to add, or is StrAPI.jl the one to use? Is there any module level documentation? Poking around on juliahub I wasn't seeing anything in the Strs or StrAPI packages. Are there any examples anywhere?

Also, the website link just redirects to the Github org page: http://juliastring.org/.

Lastly, I see this roadmap from 2018: #97, with no updates. Is this set of packages still under active development or in maintenance only mode? (In other words, do you want big PRs that change APIs, or should these be forked?)

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Make test/bench.jl user independent

Currently test/bench.jl relies on a concrete user machine configuration, e.g.:

const userdir = "/Users/scott/"

It would be good to make it runnable without modifying sources in the future.
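
One possible approach, sketched here with a hypothetical environment variable name (STRS_BENCH_DIR is not something the package defines), is to read the directory from the environment and fall back to the current user's home directory:

```julia
# STRS_BENCH_DIR is a hypothetical variable name chosen for this sketch.
# homedir() returns the home directory of whoever runs the benchmark.
const userdir = get(ENV, "STRS_BENCH_DIR", homedir())

# Build paths with joinpath rather than string concatenation, so that
# a missing trailing separator in userdir does not matter.
const example_path = joinpath(userdir, "data")
```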

Interoperation of UniStr and String

A small issue when mixing UniStr and String:

julia> y = [Strs.UniStr("ąęł")]
1-element Array{_UCS2Str,1}:
 "ąęł"

julia> x = ["ael"]
1-element Array{String,1}:
 "ael"

julia> [x y]
ERROR: TypeError: in copyto!, in typeassert, expected String, got UTF8Str
Stacktrace:
 [1] setindex! at .\array.jl:688 [inlined]
 [2] copyto!(::Array{String,2}, ::Int64, ::Array{_UCS2Str,1}, ::Int64, ::Int64) at .\abstractarray.jl:729
 [3] typed_hcat(::Type{String}, ::Array{String,1}, ::Array{_UCS2Str,1}) at .\abstractarray.jl:1162
 [4] hcat(::Array{String,1}, ::Array{_UCS2Str,1}) at \sparsevector.jl:1046
 [5] top-level scope

UniStr comparison

  1. When hash caching is introduced, we could use it as a first pass in string comparison: if the cached hashes are set, compare them first and do the more expensive comparison only if they match.
  2. When comparing the same string stored as different types (e.g. ASCIIStr and _UCS2Str, which is possible in general), === returns false. This is probably what we want (though normally identical strings compare as true under Julia 0.7 even if they have a different memory location), but I just want to make sure that this is intended.

join fails to run

If you run the benchmark at https://github.com/bkamins/JuliaStrBenchmark

you will see that join fails to run. The reason is that lines 411 and 413 of io.jl have a typo (d instead of delim).

Additionally, if you fix that, they still fail, at least on my machine, because the wmemcpy function is not found (but maybe I do not have the proper version of Julia, as it is a fast-moving target).

PS. @ScottPJones Apart from this, again on my machine, Strs.jl is sometimes slower than strings from Base (I have noted in the test file the cases where I found this). This is an initial implementation of the benchmark; I stopped here because of the join bug.

Regex and UniStr

Working with regexes on UniStr will be slow, as currently everything has to be converted to String to work (the regex as well as the string being searched).

UniStr constructor is not type stable

This will probably be a problem when dynamically constructing UniStr:

julia> @code_warntype Strs.UniStr("1")
Variables:
  str::String

Body:
  begin
      # meta: location D:\DEV\Julia\Strs.jl\src\encode.jl convert 82
      Core.SSAValue(1) = $(Expr(:invoke, MethodInstance for _str(::String), :(Main.Strs._str), :(str)))::Any
      # meta: pop location
      return Core.SSAValue(1)
  end::Any
