Comments (12)
Given the generally-fluid answer to "what exactly is an emoji?" (official answer: it depends), I think following the spec was the only sensible course of action for this project.
My frustration has been with tr51 defining the keycap base characters (0-9, * and #) as having the Emoji=Yes property. I understand why they did it, not least because it makes defining the formal grammar much easier and more consistent. That doesn't stop me being frustrated about it though, since even with the U+FE0F presentation selector, no system displays those characters as "colorful and perhaps whimsical shapes".
@mathiasbynens I wonder if it would be worth creating a separate "loose" regex for this sort of use case. I'm thinking of a version of the text regex which excludes any standalone characters with the property Emoji_Component=Yes. Specifically that would mean these characters (from the 11.0 emoji-data.txt):
0023 ; Emoji_Component # 1.1 [1] (#️) number sign
002A ; Emoji_Component # 1.1 [1] (*️) asterisk
0030..0039 ; Emoji_Component # 1.1 [10] (0️..9️) digit zero..digit nine
200D ; Emoji_Component # 1.1 [1] () zero width joiner
20E3 ; Emoji_Component # 3.0 [1] (⃣) combining enclosing keycap
FE0F ; Emoji_Component # 3.2 [1] () VARIATION SELECTOR-16
1F1E6..1F1FF ; Emoji_Component # 6.0 [26] (🇦..🇿) regional indicator symbol letter a..regional indicator symbol letter z
1F3FB..1F3FF ; Emoji_Component # 8.0 [5] (🏻..🏿) light skin tone..dark skin tone
1F9B0..1F9B3 ; Emoji_Component # 11.0 [4] (🦰..🦳) red-haired..white-haired
E0020..E007F ; Emoji_Component # 3.1 [96] (..) tag space..cancel tag
Those characters would still be correctly matched in their respective sequences.
Additionally, a "loose" regex could define the flag sequences as just [\u{1F1E6}-\u{1F1FF}]{2} (or even \p{Regional_Indicator}{2}), which would cut down the regex size at the cost of potentially matching invalid sequences.
I haven't actually tested this idea, mainly just thinking out loud.
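To make the trade-off of the loose flag pattern concrete, here is a minimal sketch. It assumes an engine with Unicode property escape support; the sample string and variable names are made up for illustration. The pattern accepts any pair of regional indicators, including pairs such as ZZ that do not correspond to a real flag.

```javascript
// Loose flag pattern: any two regional indicator symbols in a row.
const looseFlag = /\p{Regional_Indicator}{2}/gu;

// Hypothetical sample: "FR" is a real flag sequence, "ZZ" is not.
const sample = 'FR: \u{1F1EB}\u{1F1F7}, ZZ: \u{1F1FF}\u{1F1FF}';

// Both pairs are matched, which is the "invalid sequences" cost mentioned above.
const flagMatches = sample.match(looseFlag);
console.log(flagMatches.length);
```

A strict pattern would instead enumerate only the valid pairs, which is what makes it so much larger.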
(Edit: After looking at the proposed changes for the 11.0 spec, it seems that the new Extended_Pictographic=Yes property covers my use case rather neatly. "The Extended_Pictographic characters contain all the Emoji characters except for some Emoji_Components.")
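As a rough illustration of that difference, the sketch below compares the two properties on single characters. It assumes an engine that supports both \p{Emoji} and \p{Extended_Pictographic}; the helper names are hypothetical. It only checks individual code points, so it is not a replacement for a full emoji sequence regex.

```javascript
// Hypothetical helpers: test one character against each Unicode property.
const isEmoji = (ch) => /^\p{Emoji}$/u.test(ch);
const isPictographic = (ch) => /^\p{Extended_Pictographic}$/u.test(ch);

// Keycap base characters versus the dagger (U+1F5E1).
const rows = ['#', '*', '7', '\u{1F5E1}'].map((ch) => ({
  ch,
  emoji: isEmoji(ch),
  pictographic: isPictographic(ch),
}));

// #, * and 7 have Emoji=Yes but are not Extended_Pictographic;
// the dagger has both properties.
console.table(rows);
```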
from emoji-regex.
This is not a bug. # and 0-9 are Emoji characters with a text representation by default, per the Unicode Standard.
It also fails with special characters like #
@gilmoreorless Do you have any suggestions or ideas?
First, I want to thank you for the amazing library. We have been using it for some time and it helps us a lot. Keep up the good work.
I ask this without any negative feelings: don't you think the library should support the characters that humans, rather than a specification, consider to be emoji? For example, a person would say 🗡 is an emoji, but not # or numbers. In our case we have to hardcode some additional rules to fix this behavior, as we can't tell our users that # and 0-9 are emojis.
We will probably find a way to work around such issues by adding an extra layer on top of emoji-regex. However, I just wanted to share my thoughts in order to help the library become better and as well known as it deserves.
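One possible shape for such an extra layer is sketched below. Everything here is an assumption for illustration: findLooseEmoji is a hypothetical helper, and standIn is a deliberately simplified pattern used so the snippet runs without dependencies. In real use you would pass the regex returned by emoji-regex instead.

```javascript
// Simplified stand-in for the regex returned by emoji-regex (assumption:
// the real pattern is global and matches full sequences such as keycaps).
const standIn = () => /\p{Emoji}\uFE0F?\u20E3?/gu;

// Hypothetical helper: collect matches, then drop any match that is just
// a bare #, * or digit with no variation selector or keycap attached.
function findLooseEmoji(text, regex = standIn()) {
  const bareKeycapBase = /^[#*0-9]$/;
  return [...text.matchAll(regex)]
    .map((m) => m[0])
    .filter((m) => !bareKeycapBase.test(m));
}

// "#" and "4" are discarded; the dagger and the keycap sequence are kept.
const found = findLooseEmoji('Room #4 \u{1F5E1} 1\uFE0F\u20E3');
console.log(found.length);
```

Filtering after matching keeps the spec-compliant regex untouched, at the cost of a second pass over the matches.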
@astoilkov What people consider to be an emoji depends on their operating system and the fonts they have installed. It’s impossible to create a static regular expression that takes the user’s environment into consideration.
So this project does the next best thing: it uses the Unicode Standard as the single source of truth. Whenever implementations (e.g. emoji on macOS) deviate from the standard (e.g. #28), there will always be a mismatch between what is matched and what you’d expect based on the OS behavior. There is no way around this.
We could apply the hacky workaround from #28 (comment) to emoji-regex, and it would make such mismatches less common at the cost of being less technically correct — but it would still not fully solve the problem.
@mathiasbynens Thanks for the lengthy explanation. I now understand the problem in more detail. I think we can close this issue.
If I were in your position, I would probably create another file (like text.js) that captures such scenarios without following the specification, and then describe that in the readme. This way you could fine-tune it little by little.
Perhaps the readme could be updated to warn people about the counter-intuitive parts of this module? Just something like "watch out for these weird things about the unicode spec: ..."
In any case, here are all the symbols that emoji-regex/text misses (whether on purpose or not), in case there are any here which it is supposed to match:
⚲⚨⚮⚭⚥⚬⚢⚤⚯⚘⚦⚚⚩⚣⚐⚍⚎⚊⚌⚏⚋⚑⚇⚄♶♽♸☖♼⚉⚃⚆⚂♷♳♺⚈⚁♴⚀♹☗⚅♲♵☙♱♰☟☬♖✐♩☜♆☱☞♘☴♬☾☤♃☇☏☥♪♇☛☌☧★♚♞☒♯♜☚☋♄☶♧❦☼♗☽☍♁☡☷☰♫☲☭♙♭♕♔☐☓♛☨☳☻♅♤☵☩☊♡☈☫❧♮✎❥☉♢♝ ️⃣⃣⃣️⃣₿🕬🗔🗫🗮🗉🗠🖢🗀🗪🖈🕈🗬🖀🖗🛆🖟🗲🕫🖯🕇🕪🗅🖰🌢🗩🗴🔿🗰🕱🎘🗶🖽🗤🖿🖻🗕🕼🛨🎝🔾🖘🖠🖎🕩🖫🖬🗘🖸🛦🖡🖜🖷🛉🏲🛱🕨🗁🗈🗌🗢🖳🎕🕅🗗🗚🗱🗋🖞🕭🗙🗹🖵🗐🛪🖏🖙🗧🎔🌣🖉🖹🗦🖧🖛🖪🛧🖚🖮🖆🗸🖦🎜🗇🛈🗵🖃🖾🛇🖺🖓🛊🕻🏱🕄🕾🖄🖝🖒🕲🗆🏶🖅🗍🗟🗖🗛🖩🕽🖴🕿🖂🗥🖑📾🕮🖣🛲🖶🗎🖔🗊🕆🗷🗭🖭🗏🖁🕀🕂🕁🕃⛥⛢⛤⛦⛧⛻⛾⛚⛆⛙⛕⚿⛒⛉⛊⛫⛘⛛⛖⛮⛬⛨⚞⛿⛜⛗⛣⛋⛝⛟⛐⛯⛼⛌⛶⛍⛡⛠⛞⛇⛭⚟🀦🀜🀓⚴🀚⚶🀩🀝🀆🀐🀋🀨🀉🀀🀂🀖🀅🀗🀢⛂⛃🀊🀠🀤⚼🀛🀑🀈⚝🀔🀎⚻🀡⛁🀫⚹🀕🀘🀙⚸🀏🀣⛀⚷🀪🀍⚵🀒⚺🀞߷🀃⚳🀇🀥🀧🀌🀁🀟
I made a module that matches these and also incorporates @gilmoreorless's variation selector fix: https://github.com/josephrocca/emoji-and-symbol-regex
I had the same idea as you, so I made a non-regex module based on https://www.unicode.org/Public/emoji/13.0/emoji-test.txt:
https://github.com/ChurchTao/emoji-js
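import _emojiRegex from 'emoji-regex/es2015/text.js';
// Use .source (pattern text only); toString() would also include the surrounding slashes and flags.
const emojiRegex = () => new RegExp('(' + _emojiRegex().source.replace(/#\\\*0-9/gu, '') + '|\uFE0F\u20E3|\uFE0F|\u20E3)', 'gu');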
This is what I did. It doesn't count 0-9, # or *. The part at the end strips the enclosing boxes from actual keycap number emoji but keeps the numbers themselves, which is what I wanted in my case. Pretty sure '|\uFE0F\u20E3|\uFE0F' may be unneeded and just |\u20E3 would be sufficient. There are better ways to solve this, and I thought of more complex ones, but this is one extra line without making a whole new package.
Open to suggestions for a better way to handle this. 👍 I also added the 'non-emoji' symbols that are basically emoji in my case, but that is secondary to this number issue.
While researching, I also found: https://github.com/tonton-pixel/emoji-patterns
This package has each category split into its own pattern, and provides 2 larger patterns which join the categories together.
If one needed a more nuanced take, they could try something like this, which may be useful in some cases.
Though, I believe the real solution is for TC39 to accept something like this (currently at proposal): https://mths.be/emoji
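For reference, a sketch of what native support can look like. This assumes the linked proposal corresponds to the \p{RGI_Emoji} property of strings that later shipped with the RegExp v flag, and that the engine supports that flag; makeRgiEmojiRegex is a hypothetical helper name. The pattern is built from a string so that older engines fail at runtime, where the error can be caught, rather than at parse time.

```javascript
// Hypothetical helper: returns an anchored RGI_Emoji regex, or null when
// the engine does not support the `v` flag.
function makeRgiEmojiRegex() {
  try {
    return new RegExp('^\\p{RGI_Emoji}$', 'v');
  } catch {
    return null;
  }
}

const rgiEmoji = makeRgiEmojiRegex();

// Only evaluated when the flag is available.
const rgiResults = rgiEmoji && {
  bareHash: rgiEmoji.test('#'),               // bare keycap base character
  keycapHash: rgiEmoji.test('#\uFE0F\u20E3'), // full keycap sequence
  flag: rgiEmoji.test('\u{1F1EB}\u{1F1F7}'),  // FR flag sequence
};
console.log(rgiResults);
```

Because RGI_Emoji describes whole sequences rather than single code points, a bare # is not matched while the complete keycap sequence is.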
Is there anything left to do to resolve this issue? I'm closing it for now. If anyone wants to suggest a README improvement that calls out some of the Unicode weirdness we've discussed, please send a PR!
import * as emojiPatterns from 'emoji-patterns';
const emojiRegex = new RegExp (emojiPatterns['Emoji_All'].replace(/\\u0023\\u002A\\u0030-\\u0039|\\u{1F1E6}-\\u{1F1FF}/gi, ''), 'gu');
emojiRegex.test(value);
In the end, I used the emoji-patterns package.