Comments (19)

kkos commented on May 30, 2024
  1. Outside character classes, characters that must be escaped to be used literally are: ^$.?*+()[|
    Yes.
  2. In character classes, characters that must be escaped to be used literally are: ^-]
    plus [
  3. But in some other implementations (e.g. Atom) that use Oniguruma, [ can be used unescaped in classes but ] can't.
    No.
    I don't know what syntax Atom uses, but I suppose it is ONIG_SYNTAX_DEFAULT.
    Inside a character class, [ starts a nested character class, so you must escape [ to use it as a literal character.
    You must also escape ] inside a character class unless it is the first character of the class. Even then, an unescaped ] triggers a warning, as the following example shows:
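#include <stdio.h>
#include "oniguruma.h"

/* Print Oniguruma warnings to stderr. */
static void warn_func(const char* s)
{
  fprintf(stderr, "WARN: %s\n", s);
}

extern int main(int argc, char* argv[])
{
  onig_set_warn_func(warn_func);

  /* Pattern "[]a]" applied to the string "]". */
  exec(ONIG_ENCODING_UTF8, ONIG_ENCODING_UTF8, ONIG_OPTION_NONE,
       "[]a]", "]");

  return 0;
}

(exec() code omitted)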

result:
WARN: character class has ']' without escape: /[]a]/
match at 0  (UTF-8)
0: (0-1)
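
The two escaping lists above can be turned into a small helper. What follows is a minimal sketch in plain C and is not part of Oniguruma: `escape_literal`, `escape_outside_class`, and `escape_in_class` are hypothetical names, and the backslash is added to both lists as an assumption because the thread does not mention it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Characters that need a backslash to be literal, per the lists above.
   The trailing backslash is an assumption added for this sketch. */
static const char OUTSIDE_CLASS[] = "^$.?*+()[|\\";
static const char INSIDE_CLASS[]  = "^-][\\";

/* Hypothetical helper: copies src into dst, putting a backslash before
   every character found in `special`. Returns the number of characters
   written (excluding the terminating NUL), or (size_t)-1 if dst is too
   small. */
size_t escape_literal(const char *src, const char *special,
                      char *dst, size_t dst_size)
{
    size_t n = 0;

    assert(src != NULL && special != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        size_t need = (strchr(special, *src) != NULL) ? 2 : 1;
        if (n + need >= dst_size)
            return (size_t)-1;          /* no room for char(s) plus NUL */
        if (need == 2)
            dst[n++] = '\\';
        dst[n++] = *src;
    }
    if (n >= dst_size)
        return (size_t)-1;              /* covers dst_size == 0 */
    dst[n] = '\0';
    return n;
}

/* Escape text for literal use outside a character class. */
size_t escape_outside_class(const char *src, char *dst, size_t dst_size)
{
    return escape_literal(src, OUTSIDE_CLASS, dst, dst_size);
}

/* Escape text for literal use inside a character class. */
size_t escape_in_class(const char *src, char *dst, size_t dst_size)
{
    return escape_literal(src, INSIDE_CLASS, dst, dst_size);
}
```

For example, calling `escape_outside_class("a.b", buf, sizeof buf)` with a sufficiently large `buf` writes `a\.b` and returns 4. Escaping ] everywhere inside a class, even in first position, avoids the warning shown in the result above.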

from oniguruma.

hediyi commented on May 30, 2024

✨ Thanks very much for the clarification! ✨

https://github.com/kkos/oniguruma/wiki/Characters-That-Must-Be-Escaped-to-Be-Used-Literally Did I miss anything?

And could you keep this open just in case I have further questions? :))

kkos commented on May 30, 2024

Thank you.
The content is correct.

hediyi commented on May 30, 2024

Isn't \w equivalent to [0-9A-Z_a-z]? I thought it was, but I found it can match much much more than that.

kkos commented on May 30, 2024

\w is equivalent to [0-9A-Z_a-z] if you are using an ASCII encoding.
In Unicode encodings (UTF-8, UTF-16, UTF-32), \w matches many more code points.
The Unicode word code point data is defined in CR_Word[] in src/unicode_property_data.c

hediyi commented on May 30, 2024

So in UTF-8 encoding, \w can match the characters that are mapped to the code points defined in CR_Word[], right?

kkos commented on May 30, 2024

Yes. And 654 is the number of code ranges.
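
To make the "number of code ranges" remark concrete, here is a minimal sketch of a lookup over such a table. It assumes the layout is a leading count followed by ascending (from, to) pairs, which is what the reply above suggests. `SAMPLE_WORD_RANGES`, `CodePoint`, and `in_code_ranges` are hypothetical names, and the table is a tiny hand-written sample, not the real CR_Word[] data.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int CodePoint;   /* stand-in for the library's code point type */

/* Hypothetical, heavily truncated sample in the assumed layout:
   element 0 is the number of ranges, then (from, to) pairs in
   ascending order. The real CR_Word[] has far more ranges. */
static const CodePoint SAMPLE_WORD_RANGES[] = {
    5,
    0x0030, 0x0039,   /* 0-9 */
    0x0041, 0x005a,   /* A-Z */
    0x005f, 0x005f,   /* _   */
    0x0061, 0x007a,   /* a-z */
    0x00c0, 0x00d6,   /* a block of accented Latin letters */
};

/* Binary search over the range pairs.
   Returns 1 if code falls inside any range, 0 otherwise. */
int in_code_ranges(const CodePoint *table, CodePoint code)
{
    size_t lo = 0, hi;

    assert(table != NULL);
    hi = table[0];                        /* number of ranges */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        CodePoint from = table[1 + 2 * mid];
        CodePoint to   = table[2 + 2 * mid];

        if (code < from)
            hi = mid;
        else if (code > to)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}
```

With only the first four ranges, the table describes the ASCII case [0-9A-Z_a-z]. A Unicode table simply has many more pairs after them, which is why \w matches far more characters in UTF-8.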

hediyi commented on May 30, 2024

Thanks for your replies. They really helped me a lot!

So similarly, in UTF-8 encoding, \d can match stuff specified in CR_Digit[], right? And can I assume [0-9_A-Za-z] is "faster" than \w in UTF-8?

Do you mind if I help edit the doc of RE? It would just be some small refinements to make it easier to understand.

kkos commented on May 30, 2024

I do not know how much the speed differs.
Because single-byte code ranges and multi-byte code ranges are kept separate in the compiled regexp code, the difference may not be large.

I have added you as a collaborator.
Please edit the files in doc.

kkos commented on May 30, 2024

And please edit the develop branch, not the master branch.

hediyi commented on May 30, 2024

👌 😉

hediyi commented on May 30, 2024

About \G, does it mean "where the current match attempt begins", i.e., either \A or "where the last match left off"?

kkos commented on May 30, 2024

onig_search() has search-range arguments (start, range) and string arguments (str, end).
The search range is the range of positions at which a match may start.

\G means the start position of the search range.
\A means the start position of the string.
In most cases, the search range covers the whole string, so \G == \A.
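
The difference between the two anchors can be modeled without the library. The following is a toy sketch, not the Oniguruma implementation: `toy_search_digits` and `enum anchor` are hypothetical, the "pattern" is hard-coded as one ASCII digit, and the search range is assumed to be half-open, so match starts are tried at positions p with start <= p < range.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum anchor { ANCHOR_NONE, ANCHOR_A, ANCHOR_G };

/* Toy stand-in for a regex search over (str, end) with a search range
   (start, range). The "pattern" is a single ASCII digit, optionally
   preceded by \A or \G.
   Returns the offset of the match start from str, or -1 if none. */
long toy_search_digits(const char *str, const char *end,
                       const char *start, const char *range,
                       enum anchor anchor)
{
    const char *p;

    assert(str <= start && start <= range && range <= end);
    for (p = start; p < range; p++) {
        if (anchor == ANCHOR_A && p != str)
            continue;               /* \A: only the start of the string */
        if (anchor == ANCHOR_G && p != start)
            continue;               /* \G: only the start of the search range */
        if (isdigit((unsigned char)*p))
            return (long)(p - str);
    }
    return -1;
}
```

With the string "12 ab 34" and start == str, both anchors match at offset 0. If a second search begins at offset 6, the \G variant matches there while the \A variant fails. If it begins at offset 3, the unanchored search skips ahead to offset 6 but the \G variant fails, because no digit sits exactly at the start of the search range.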

hediyi commented on May 30, 2024

I tried digging around the definition of onig_search, but I could understand only part of it.

I was trying to expand on/clarify some definitions in doc/RE, one of which is \G:

\G matching start position

\G is useful in a regexp that is applied to the same string more than once. To define it from the user's standpoint: the first time the regexp is applied, \G matches the beginning of the string; on later applications it matches where the last match ended, right? So I want to change the definition to something like

beginning of the current search attempt

Does it look correct to you?

kkos commented on May 30, 2024

Yes. You are right.

hediyi commented on May 30, 2024

I was wondering: apart from the fact that [:...:] is only available inside a character class, what is really different between Unicode properties and the POSIX notation? Why do we need both? I mean, Unicode properties are so much more powerful, so why do we still need the POSIX notation?

kkos commented on May 30, 2024

Thank you for improving the document.

POSIX brackets are less capable than Unicode properties.
They are no longer necessary.
But they were included in the GNU regex library.
My first goal was to make a GNU regex compatible library, and after that I introduced character properties and other features from Perl.

hediyi commented on May 30, 2024

👌 No problem, and I can get to know more about how Oniguruma processes regexps along the way. 😌 So are you planning to deprecate POSIX brackets?

New question: in doc/RE,

In the back reference by the multiplex definition name,

What is a multiplex definition name?

Mind if I email you with the questions instead?

kkos commented on May 30, 2024

I will remove POSIX brackets in version 7.0 if they are removed in Perl 6.
But I don't know whether they are removed or not.

You can assign one name to two or more groups.
