Comments (12)

gilmoreorless avatar gilmoreorless commented on May 22, 2024 12

Given the generally-fluid answer to "what exactly is an emoji?" (official answer: it depends), I think following the spec was the only sensible course of action for this project.

My frustration has been with tr51 defining the keycap base characters (0-9, * and #) as having the Emoji=Yes property. I understand why they did it, not least because it makes defining the formal grammar much easier and more consistent. That doesn't stop me being frustrated about it though, since even with the U+FE0F presentation selector, no system displays those characters as "colorful and perhaps whimsical shapes".

@mathiasbynens I wonder if it would be worth creating a separate "loose" regex for this sort of use case. I'm thinking of a version of the text regex which excludes any standalone characters with the property Emoji_Component=Yes. Specifically that would mean these characters (from the 11.0 emoji-data.txt):

0023          ; Emoji_Component      #  1.1  [1] (#️)       number sign
002A          ; Emoji_Component      #  1.1  [1] (*️)       asterisk
0030..0039    ; Emoji_Component      #  1.1 [10] (0️..9️)    digit zero..digit nine
200D          ; Emoji_Component      #  1.1  [1] (‍)        zero width joiner
20E3          ; Emoji_Component      #  3.0  [1] (⃣)       combining enclosing keycap
FE0F          ; Emoji_Component      #  3.2  [1] ()        VARIATION SELECTOR-16
1F1E6..1F1FF  ; Emoji_Component      #  6.0 [26] (🇦..🇿)    regional indicator symbol letter a..regional indicator symbol letter z
1F3FB..1F3FF  ; Emoji_Component      #  8.0  [5] (🏻..🏿)    light skin tone..dark skin tone
1F9B0..1F9B3  ; Emoji_Component      # 11.0  [4] (🦰..🦳)    red-haired..white-haired
E0020..E007F  ; Emoji_Component      #  3.1 [96] (󠀠..󠁿)      tag space..cancel tag

Those characters would still be correctly matched in their respective sequences.
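One way to read the "exclude standalone Emoji_Component characters" idea is as a post-filter on whatever the regex matched. The following is only a sketch, assuming a JavaScript engine with Unicode property escapes; `dropStandaloneComponents` is a hypothetical helper and not part of emoji-regex.

```javascript
// Hypothetical helper: discard matches that are exactly one code point
// with Emoji_Component=Yes (e.g. "#", "5", a lone skin-tone modifier).
const LONE_COMPONENT = /^\p{Emoji_Component}$/u;

function dropStandaloneComponents(matches) {
  return matches.filter((match) => !LONE_COMPONENT.test(match));
}

// "#" and "5" are dropped; the keycap sequence and the face are kept,
// because a multi-code-point sequence never matches LONE_COMPONENT.
const kept = dropStandaloneComponents(['#', '5', '#\uFE0F\u20E3', '\u{1F600}']);
console.log(kept.length); // 2
```

Components that appear inside a longer sequence (keycaps, flags, ZWJ sequences) are untouched, since the filter only looks at whole matches.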

Additionally, a "loose" regex could define the flag sequences as just [\u{1F1E6}-\u{1F1FF}]{2} (or even \p{Regional_Indicator}{2}), which would cut down the regex size at the cost of potentially matching invalid sequences.
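To make the trade-off concrete, here is a small sketch of the two loose flag patterns mentioned above. The variable names are made up for illustration, and the example assumes an engine that supports the `u` flag and `\p{...}` property escapes.

```javascript
// Loose flag patterns: any two regional indicators, valid flag or not.
const looseFlag = /\p{Regional_Indicator}{2}/u;
const looseFlagByRange = /[\u{1F1E6}-\u{1F1FF}]{2}/u;

const france = '\u{1F1EB}\u{1F1F7}'; // regional indicators F + R
const bogus = '\u{1F1FD}\u{1F1FD}';  // regional indicators X + X, not a flag

console.log(looseFlag.test(france), looseFlagByRange.test(france)); // true true
console.log(looseFlag.test(bogus));       // true: the cost of the loose form
console.log(looseFlag.test('\u{1F1EB}')); // false: a lone indicator
```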

I haven't actually tested this idea, mainly just thinking out loud.

(Edit: After looking at the proposed changes for the 11.0 spec, it seems that the new Extended_Pictographic=Yes property covers my use case rather neatly. "The Extended_Pictographic characters contain all the Emoji characters except for some Emoji_Components.")
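As a rough illustration of that property, assuming a JavaScript engine whose Unicode data includes Extended_Pictographic (the variable name is arbitrary):

```javascript
// Extended_Pictographic leaves out the keycap bases that Emoji=Yes includes.
const pictographic = /\p{Extended_Pictographic}/u;

console.log(pictographic.test('#'));         // false
console.log(pictographic.test('7'));         // false
console.log(pictographic.test('\u{1F5E1}')); // true (dagger)
console.log(pictographic.test('\u{1F600}')); // true (grinning face)
```

This only tests single code points; it does not validate sequences such as flags or ZWJ combinations.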

from emoji-regex.

mathiasbynens avatar mathiasbynens commented on May 22, 2024 4

This is not a bug. # and 0-9 are Emoji characters with a text representation by default, per the Unicode Standard.
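The distinction can be checked directly with Unicode property escapes. This is a sketch with arbitrary variable names, assuming an engine that supports `\p{Emoji}` and `\p{Emoji_Presentation}`:

```javascript
// Emoji=Yes says nothing about default presentation;
// Emoji_Presentation=Yes marks characters shown as emoji by default.
const anyEmoji = /\p{Emoji}/u;
const emojiByDefault = /\p{Emoji_Presentation}/u;

console.log(anyEmoji.test('#'));               // true
console.log(emojiByDefault.test('#'));         // false: text by default
console.log(emojiByDefault.test('\u{1F600}')); // true
```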

astoilkov avatar astoilkov commented on May 22, 2024

It also fails with special characters like #.

astoilkov avatar astoilkov commented on May 22, 2024

@gilmoreorless Do you have any suggestions or ideas?

astoilkov avatar astoilkov commented on May 22, 2024

First, I want to thank you for the amazing library. We have been using it for some time and it helps us a lot. Keep up the good work.

I ask this without any negative feelings: don't you think the library should support the characters that people consider emoji, rather than those a specification defines as emoji? For example, a person would say that 🗡 is an emoji, but not # or the digits. In our case we would have to hardcode some additional rules to fix this behavior, as we can't tell our users that # and 0-9 are emoji.

We will probably find a way to work around such issues by adding an extra layer on top of emoji-regex. However, I wanted to share my thoughts in order to help the library become better and as well known as it deserves to be.

mathiasbynens avatar mathiasbynens commented on May 22, 2024

@astoilkov What people consider to be an emoji depends on their operating system and the fonts they have installed. It’s impossible to create a static regular expression that takes the user’s environment into consideration.

So this project does the next best thing: it uses the Unicode Standard as the single source of truth. Whenever implementations (e.g. emoji on macOS) deviate from the standard (e.g. #28), there will always be a mismatch between what is matched and what you’d expect based on the OS behavior. There is no way around this.

We could apply the hacky workaround from #28 (comment) to emoji-regex, and it would make such mismatches less common at the cost of being less technically correct — but it would still not fully solve the problem.

astoilkov avatar astoilkov commented on May 22, 2024

@mathiasbynens Thanks for the lengthy explanation. I now understand the problem in more detail. I think we can close this issue.

If I were in your position, I would probably create another file (like text.js) that captures such scenarios without following the specification, and then describe that in the readme. This way you could fine-tune it little by little.

josephrocca avatar josephrocca commented on May 22, 2024

Perhaps the readme could be updated to warn people about the counter-intuitive parts of this module? Just something like "watch out for these weird things about the unicode spec: ..."

In any case, here are all the symbols that emoji-regex/text misses (whether on purpose or not), in case there are any here which it is supposed to match:

⚲⚨⚮⚭⚥⚬⚢⚤⚯⚘⚦⚚⚩⚣⚐⚍⚎⚊⚌⚏⚋⚑⚇⚄♶♽♸☖♼⚉⚃⚆⚂♷♳♺⚈⚁♴⚀♹☗⚅♲♵☙♱♰☟☬♖✐♩☜♆☱☞♘☴♬☾☤♃☇☏☥♪♇☛☌☧★♚♞☒♯♜☚☋♄☶♧❦☼♗☽☍♁☡☷☰♫☲☭♙♭♕♔☐☓♛☨☳☻♅♤☵☩☊♡☈☫❧♮✎❥☉♢♝ ‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍󠀯󠁃󠁀󠀠󠁈󠁂󠁌󠁕󠀮󠁄󠁾󠀧󠀡󠀾󠁋󠁗󠁍󠁚󠁎️󠁉󠁖󠀥󠁽󠀿󠁓󠁁󠀻󠁊󠀭󠁏󠁠󠁟⃣󠁇󠀽󠀦󠁅󠀼‍󠁝󠀪󠀨󠁻󠁒󠁜󠁞󠁑󠀩󠁙󠁆󠀤󠁼󠀺󠁘󠀫󠀢󠀣󠁐󠁔󠀬󠁛‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍󠁧󠁢󠁥󠁮󠁧󠁿󠁧󠁢󠁳󠁣󠁴󠁿󠁧󠁢󠁷󠁬󠁳󠁿󠀷󠁳󠁢󠁵󠁷󠁬󠁥󠀲󠁹󠁤󠀶󠁴󠀵󠁨󠁲󠁮󠁰󠁱󠀰󠁡󠀳󠁩󠁯󠁭󠀹󠁸󠁧󠁣󠁺󠀴󠁶󠁫󠁿󠀸󠁪󠁦󠀱‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍⃣⃣️⃣‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍₿🕬🗔🗫🗮🗉🗠🖢🗀🗪🖈🕈🗬🖀🖗🛆🖟🗲🕫🖯🕇🕪🗅🖰🌢🗩🗴🔿🗰🕱🎘🗶🖽🗤🖿🖻🗕🕼🛨🎝🔾🖘🖠🖎🕩🖫🖬🗘🖸🛦🖡🖜🖷🛉🏲🛱🕨🗁🗈🗌🗢🖳🎕🕅🗗🗚🗱🗋🖞🕭🗙🗹🖵🗐🛪🖏🖙🗧🎔🌣🖉🖹🗦🖧🖛🖪🛧🖚🖮🖆🗸🖦🎜🗇🛈🗵🖃🖾🛇🖺🖓🛊🕻🏱🕄🕾🖄🖝🖒🕲🗆🏶🖅🗍🗟🗖🗛🖩🕽🖴🕿🖂🗥🖑📾🕮🖣🛲🖶🗎🖔🗊🕆🗷🗭🖭🗏🖁🕀🕂🕁🕃⛥⛢⛤⛦⛧⛻⛾⛚⛆⛙⛕⚿⛒⛉⛊⛫⛘⛛⛖⛮⛬⛨⚞⛿⛜⛗⛣⛋⛝⛟⛐⛯⛼⛌⛶⛍⛡⛠⛞⛇⛭⚟🀦🀜🀓⚴🀚⚶🀩🀝🀆🀐🀋🀨🀉🀀🀂🀖🀅🀗🀢⛂⛃🀊🀠🀤⚼🀛🀑🀈⚝🀔🀎⚻🀡⛁🀫⚹🀕🀘🀙⚸🀏🀣⛀⚷🀪🀍⚵🀒⚺🀞߷🀃⚳🀇🀥🀧🀌🀁🀟

I made a module that matches these and also incorporates @gilmoreorless's variation selector fix: https://github.com/josephrocca/emoji-and-symbol-regex

ChurchTao avatar ChurchTao commented on May 22, 2024

Given the generally-fluid answer to "what exactly is an emoji?" (official answer: it depends), I think following the spec was the only sensible course of action for this project.

My frustration has been with tr51 defining the keycap base characters (0-9, * and #) as having the Emoji=Yes property. I understand why they did it, not least because it makes defining the formal grammar much easier and more consistent. That doesn't stop me being frustrated about it though, since even with the U+FE0F presentation selector, no system displays those characters as "colorful and perhaps whimsical shapes".

@mathiasbynens I wonder if it would be worth creating a separate "loose" regex for this sort of use case. I'm thinking of a version of the text regex which excludes any standalone characters with the property Emoji_Component=Yes. Specifically that would mean these characters (from the 11.0 emoji-data.txt):

0023          ; Emoji_Component      #  1.1  [1] (#️)       number sign
002A          ; Emoji_Component      #  1.1  [1] (*️)       asterisk
0030..0039    ; Emoji_Component      #  1.1 [10] (0️..9️)    digit zero..digit nine
200D          ; Emoji_Component      #  1.1  [1] (‍)        zero width joiner
20E3          ; Emoji_Component      #  3.0  [1] (⃣)       combining enclosing keycap
FE0F          ; Emoji_Component      #  3.2  [1] ()        VARIATION SELECTOR-16
1F1E6..1F1FF  ; Emoji_Component      #  6.0 [26] (🇦..🇿)    regional indicator symbol letter a..regional indicator symbol letter z
1F3FB..1F3FF  ; Emoji_Component      #  8.0  [5] (🏻..🏿)    light skin tone..dark skin tone
1F9B0..1F9B3  ; Emoji_Component      # 11.0  [4] (🦰..🦳)    red-haired..white-haired
E0020..E007F  ; Emoji_Component      #  3.1 [96] (󠀠..󠁿)      tag space..cancel tag

I had the same idea as you, so I made a non-regex module based on https://www.unicode.org/Public/emoji/13.0/emoji-test.txt.

https://github.com/ChurchTao/emoji-js

dezren39 avatar dezren39 commented on May 22, 2024
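
import _emojiRegex from 'emoji-regex/es2015/text.js';
// Strip "#", "*" and 0-9 from the pattern source (.source, so the /.../g delimiters are not embedded), then also match a bare U+FE0F and/or U+20E3.
const emojiRegex = () => new RegExp('(' + _emojiRegex().source.replace(/#\\\*0-9/gu, '') + '|\uFE0F\u20E3|\uFE0F|\u20E3)', 'gu');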

I did this. It doesn't count 0-9, # or *. The part at the end strips the enclosing boxes from actual number emoji but keeps the numbers themselves, which is what I wanted in my case. I'm pretty sure '|\uFE0F\u20E3|\uFE0F' is unneeded and just '|\u20E3' would be sufficient. There are better ways to solve this, and I thought of more complex ones, but this is one extra line without making a whole new package.

Open to suggestions for a better way to handle this. 👍 In my case I also added the "non-emoji" symbols that are basically emoji, but that is secondary to this number issue.


While researching, I also found: https://github.com/tonton-pixel/emoji-patterns
This package has each category split into its own pattern and provides two larger patterns that join the categories together.
If one needed a more nuanced take, they could try something like this, which may be useful in some cases.

Though I believe the real solution is for TC39 to accept something like this (currently at the proposal stage): https://mths.be/emoji
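For readers on newer runtimes: to my understanding, the idea behind that proposal later shipped as properties of strings under the `v` flag in ES2024, including `\p{RGI_Emoji}`. The sketch below assumes such a runtime and falls back to `null` otherwise; `makeRgiEmojiRegex` is a hypothetical helper name.

```javascript
// Build the regex at runtime so that engines without the `v` flag
// throw inside the try block instead of failing to parse the file.
function makeRgiEmojiRegex() {
  try {
    return new RegExp('^\\p{RGI_Emoji}$', 'v');
  } catch {
    return null; // `v` flag (or RGI_Emoji) not supported here
  }
}

const rgiEmoji = makeRgiEmojiRegex();
if (rgiEmoji !== null) {
  console.log(rgiEmoji.test('#'));             // false: a bare "#" is not RGI
  console.log(rgiEmoji.test('#\uFE0F\u20E3')); // true: keycap sequence
  console.log(rgiEmoji.test('\u{1F600}'));     // true
}
```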

mathiasbynens avatar mathiasbynens commented on May 22, 2024

Is there anything left to do to resolve this issue? I'm closing it for now. If anyone wants to suggest a README improvement that calls out some of the Unicode weirdness we've discussed, please send a PR!

say8425 avatar say8425 commented on May 22, 2024

import * as emojiPatterns from 'emoji-patterns';

// Remove "#", "*", 0-9 and the regional indicators from the combined pattern.
const emojiRegex = new RegExp(emojiPatterns['Emoji_All'].replace(/\\u0023\\u002A\\u0030-\\u0039|\\u{1F1E6}-\\u{1F1FF}/gi, ''), 'gu');
emojiRegex.test(value); // `value` is the string to check

In the end, I went with the emoji-patterns package.
