Comments (14)
@timreichen In the state it was merged, that PR isn't a port of `npm:slugify`, because it doesn't include any char mapping. But more to the point, why is it more important to have parity with a specific NPM package rather than to have `slugify` be a general-purpose function for creating slugs, as implemented by sites like WordPress, Stack Overflow, Wikipedia, GitHub, Medium, and Tumblr? `npm:slugify` has multiple open issues about its lack of support for various languages.
And if `toKebabCase` already works reliably as a general-purpose slugify function (which I'm not sure about, haven't tested it), why expose a separate `slugify` that only works for a subset of use cases?
from deno_std.
I'm happy to defer to the consensus of others with this issue, but if we go down the route of having a character map, best it be a `Map<string, string>`.
from deno_std.
Here's a rundown of how various platforms handle non-ASCII text in slugs:
Site | Diacritics | Non-Latin | Notes |
---|---|---|---|
stackoverflow.com | cartão de credito → cartão-de-credito (Unchanged) | Python で特定の文字 → python-で特定の文字 (Unchanged) | |
wikipedia.org | Maria Angélica Beraldo → `Maria_Angélica_Beraldo` (Unchanged) | 佩通坦·钦那瓦 → 佩通坦·钦那瓦 (Unchanged) | Wikipedia uses `_` instead of `-` for spaces, but it's still a slug of sorts |
tumblr.com | wüst → wüst (Unchanged) | Word of the Day: 久违 (Chinese) → word-of-the-day-久违-chinese (Unchanged) | |
wordpress.org | Actualización de mantenimiento → actualizacion-de-mantenimiento (Stripped) | WordPress 6.6「Dorsey」发布 → wordpress-6-6dorsey发布 (Unchanged) | |
github.com | Introducción → introducción (Unchanged) | 7. 兩岸詞典 /c/ → 7-兩岸詞典-c (Unchanged) | Slugified section titles in README files in URL hash |
medium.com | Cómo salirse → como-salirse (Stripped) | 跟我学中文! → 跟我学中文 (Unchanged) | |
dev.to | 2 años como Front-End Developer → 2-anos-como-front-end-developer (Stripped) | データ・ストリーミング技術の概要 → detasutoriminguji-shu-nogai-yao (Transliterated) | This illustrates some of the problems with stripping diacritics and transliterating: 2 años means "2 years" whereas 2-anos means "2 anuses"; meanwhile, 技術 should be gijutsu in Japanese, not ji-shu |
The transliteration option has one big advantage, namely that the URL remains legible in any context: plaintext files, IM platforms with limited rich-text features, etc. It also typically leads to shorter URLs compared to the percent-encoded version. Still, it's strictly worse when viewed in a browser address bar, adds a massive amount of complexity, including mappings for thousands of CJK characters, and often still leads to suboptimal results (as seen in the `dev.to` examples). That's probably why only `dev.to` uses it out of the 7 platforms I looked at.
As for diacritics, 3 of the 7 platforms strip them from Latin-script text, while the other 4 keep them. As with transliteration, stripping leads to more plaintext-friendly URLs; however, diacritics can be semantically important, as the `dev.to` example also illustrates.
from deno_std.
The initial implementation was discussed as a port of `npm:slugify`, so `slugify('三人行,必有我师焉') === ""` is actually expected.
The behavior you describe is probably better handled with `@std/text/to-kebab-case`:
import { toKebabCase } from "@std/text/to-kebab-case";
console.log(toKebabCase("三人行-必有我师焉")); // "三人行-必有我师焉"
from deno_std.
> @timreichen In the state it was merged, that PR isn't a port of `npm:slugify`, because it doesn't include any char mapping. But more to the point, why is it more important to have parity with a specific NPM package rather than to have `slugify` be a general-purpose function for creating slugs, as implemented by sites like WordPress, Stack Overflow, Wikipedia, GitHub, Medium, and Tumblr? `npm:slugify` has multiple open issues about its lack of support for various languages.

We removed the char mapping because the list was random. The problem, as you pointed out, is that there is no standard, and slugify functionality varies depending on the implementation.

> And if `toKebabCase` already works reliably as a general-purpose slugify function (which I'm not sure about, haven't tested it), why expose a separate `slugify` that only works for a subset of use cases?

Every slugify function will only work on a subset of use cases. That is why, for example, `npm:slugify` has so many options and lets you add custom replacements.
I think if there is a clean way to support other languages, that should be added.
However, the slug must match `[a-zA-Z0-9-]*`.
from deno_std.
> However, the slug must match `[a-zA-Z0-9-]*`.

Why? Again, massive platforms like WordPress, Stack Overflow, Wikipedia, GitHub, Medium, and Tumblr don't obey that rule, and browsers and web APIs handle non-ASCII URL components perfectly fine. Allowing `""` as a slug is far more risky sanitization-wise than allowing arbitrary non-ASCII text, because the path `/a//b` normalizes to `/a/b` (and additionally, `/a/` is often normalized to `/a`). Meanwhile, non-ASCII text can never clash with reserved characters, which always fall within the printable ASCII range.
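To make the two claims above concrete, here is a minimal sketch using only the standard `URL` API. The host `example.com` and the `/questions/` route are placeholders, not anything from the thread. It shows that a non-ASCII slug is percent-encoded into plain ASCII and decodes back losslessly, while an empty slug silently produces a doubled slash when path segments are joined.

```typescript
// A non-ASCII slug placed in a path is percent-encoded (UTF-8) by the URL parser.
const slug = "python-で特定の文字";
const url = new URL(`https://example.com/questions/${slug}`); // placeholder host and route

// The serialized pathname contains only ASCII characters.
const pathIsAscii = /^[\x00-\x7F]*$/.test(url.pathname);

// Decoding recovers the original slug unchanged.
const roundTripped = decodeURIComponent(url.pathname);

// An empty slug, by contrast, yields an empty path segment when joined.
const pathWithEmptySlug = ["", "a", "", "b"].join("/"); // "/a//b"

console.log(pathIsAscii, roundTripped, pathWithEmptySlug);
```

Whether `/a//b` is then collapsed to `/a/b` depends on the server or router in front of the application, which is why an empty slug is the harder case to reason about.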
from deno_std.
Maybe we could just change the signature of `slugify` so users can provide their own strip regex?
function slugify(input: string, strip = /[^a-zA-Z0-9\s-]/g): string
This way it doesn't add much complexity while giving end users a bit more freedom (they know best which charset they'd like to support)?
slugify("déjà-vu", /[^a-zA-Z0-9\s-À-ÖØ-öø-ÿ]/g) // "déjà-vu"
from deno_std.
@lowlighter That seems to me like it's simultaneously too granular and not customizable enough. Too granular because I can't see any good reason why you'd want to allow some non-ASCII but not others; not customizable enough because it still doesn't provide any way of mapping.
Something like this could work:
// slugify.ts
export type SlugifyOptions = {
  /** @default {undefined} */
  charMap: Record<string, string> | undefined,
  /** @default {Boolean(options.charMap)} */
  stripUnknown: boolean,
  /** @default {Boolean(options.charMap || options.stripUnknown)} */
  stripDiacritics: boolean,
}
export function slugify(input: string, options?: Partial<SlugifyOptions>): string
// slugify_char_map.ts
// A comprehensive char mapping (transliteration) from some decently authoritative source
export const charMap = {
  // ...
  я: "ya",
  // ...
  鼎: "ding",
  // ...
}
If you really want to opt-in to the "nuke everything other than Basic Latin" option for some reason, you could still do that with `slugify(..., { stripUnknown: true })` or `slugify(..., { charMap: {} })`.
As for "decently authoritative source" for the char map, I'm not sure what that would be. https://unicode-org.github.io/icu/userguide/transforms/general/ provides some notes on transliteration, which suggest that a simple `charMap` isn't really sufficient, but it looks like implementing proper transliteration is pretty complicated, so a char map could end up being the least-worst option (other than the actual least-worst option, which is just relying on percent-encoding to do its thing 😜)
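As a rough illustration of the per-character mapping idea, the sketch below applies a char map one code point at a time. Everything here is hypothetical: `applyCharMap` and its `stripUnknown` parameter are illustrative names loosely modeled on the options proposed above, the map holds only the two entries shown in the thread, and a `Map<string, string>` is used as suggested earlier in the discussion.

```typescript
// Illustrative entries only; both mappings come from the example in the thread.
const charMap = new Map<string, string>([
  ["я", "ya"],
  ["鼎", "ding"],
]);

// Hypothetical helper: replaces mapped characters and, optionally,
// drops unmapped characters outside a small ASCII allow-list.
function applyCharMap(
  input: string,
  map: ReadonlyMap<string, string>,
  stripUnknown = false,
): string {
  let out = "";
  // for...of iterates by code point, so astral characters stay intact.
  for (const ch of input) {
    const mapped = map.get(ch);
    if (mapped !== undefined) {
      out += mapped;
    } else if (!stripUnknown || /^[a-zA-Z0-9\s-]$/.test(ch)) {
      out += ch;
    }
  }
  return out;
}

console.log(applyCharMap("я 鼎", charMap)); // "ya ding"
```

A one-to-one lookup like this cannot handle context-dependent readings, which is the limitation the ICU notes linked above point at.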
from deno_std.
Looks like the requisite ICU data for transliteration is here: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/translit/. With some truly disgusting regex-based """parsing""" of the data files, `Intl.Segmenter`-based word segmentation, and some gibberish strings of words in various languages, I can get decent-ish results (this uses ~500 KB of very un-optimized mapping data for the char map):
[vi]
Bằng khác byte phần bảng ký sun hợp của tự.
bang-khac-byte-phan-bang-ky-sun-hop-cua-tu
[zh]
解决始终这些统一,大部分,既成事实节字体编码的。
jiejue-shizhong-zhexie-tongyi-dabufen-chengshishi-jie-ziti-bianma-de
[de]
Ging loslassen Steuerzeichen in auf übersetzt, Phaistos Pau und Ugaritisch.
ging-loslassen-steuerzeichen-in-auf-ubersetzt-phaistos-pau-und-ugaritisch
[es]
Modificaciones equivalencia que esquemas alquímicos ha bits vez de una.
modificaciones-equivalencia-que-esquemas-alquimicos-ha-bits-vez-de-una
[ru]
Бит данных частично характерный текст, на в с клавиатуры из.
bit-dannyx-chastichno-xarakternyi-tekst-na-v-s-klaviatury-iz
[ar]
رمز والأسلوب العالم، متناسقة يتكون الراغبون، التفريق كترميزات في النصوص.
rmz-walaslwb-alalm-mtnasqh-ytkwn-alraghbwn-altfryq-ktrmyzat-fy-alnsws
[ja]
韓国利用運用文字のをれドキュメントマッピングと。
hanguo-liyong-yunyong-wenzi-no-o-re-tokyumentomahin-ku-to
[el]
Υπολογιστή ωστόσο αφήνει όλες βασίζονται προβλήματα στο κωδικοποίησης την για.
ypologiste-ostoso-apheni-oles-basizondai-problemata-sto-kodikopoieses-ten-gia
Limitations: Japanese will always give bad results for Kanji; Arabic lacks most vowels (I think that's due to the vowels not being indicated in the first place, so no way round that); the Greek is currently based on Ancient Greek transliteration, but I think that can be fixed by ignoring certain input files.
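For readers unfamiliar with the word-segmentation step mentioned above, here is a minimal sketch of `Intl.Segmenter` with word granularity. It is not the prototype's code, which isn't shown in the thread; `segmentWords` is a hypothetical helper name. For scripts written without spaces, such as Chinese or Thai, the resulting boundaries depend on the ICU data shipped with the runtime, so no specific output is claimed for those.

```typescript
// Hypothetical helper: split text into word-like segments for a given locale.
function segmentWords(text: string, locale: string): string[] {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words: string[] = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (isWordLike) words.push(segment);
  }
  return words;
}

// Space-delimited text is split on the obvious boundaries.
console.log(segmentWords("hello, world", "en").join("-")); // "hello-world"

// For Chinese, segmentation is dictionary-based and runtime-dependent.
console.log(segmentWords("解决始终这些统一", "zh"));
```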
from deno_std.
OK, switched to using a custom peggy grammar to parse the ICU mappings, and now getting what seem to be decent results for all the languages I've tested (as good as or better than results from other general-purpose transliteration libraries I've looked at).
Comparison:
Language | Example | @std/slugify@next (with charMap option) | npm:slugify | npm:transliterate |
---|---|---|---|---|
Amharic | በርበሬን ከላመ ከሞተ አግኝተሽው ዋጥ ስልቅጥ አድርገሽ ከምኔው ጨረሽው | beribereni-kelame-kemote-ginyiteshiwi-wati-silikiti-dirigeshi-keminewi-chereshiwi | | barebareene-kalaama-kamota-agenyetashewe-waathe-seleqethe-aderegashe-kameneewe-charashewe |
Arabic | الحركة الدولية للدفاع عن الأطفال الفلسطينين ضد بايدن | alhrkh-aldwlyh-lldfa-n-alatfal-alflstynyn-zd-baydn | alhrkh-aldwlyh-lldfaa-an-alatfal-alflstynyn-dhd-baydn | lhrk-ldwly-lldfaa-aan-l-tfl-lflstynyn-dd-bydn |
German | Leichtathletik-Weltmeisterschaften 2007/Teilnehmer (Liechtenstein) | leichtathletik-weltmeisterschaften-2007-teilnehmer-liechtenstein | Leichtathletik-Weltmeisterschaften-2007Teilnehmer-(Liechtenstein) | leichtathletik-weltmeisterschaften-2007-teilnehmer-liechtenstein |
Greek | Βραβείο Καλύτερου Διευθυντή Φωτογραφίας της Ένωσης Διαδικτυακών Κριτικών Κινηματογράφου | vravio-kaliterou-dhievthindi-fotografias-tis-enosis-dhiadhiktiakon-kritikon-kinimatografou | Brabeio-Kalyteroy-Diey8ynth-Fwtografias-ths-Enwshs-Diadiktyakwn-Kritikwn-Kinhmatografoy | vraveio-kalyteroy-dieythynti-fotografias-tis-enosis-diadiktyakon-kritikon-kinimatografoy |
Spanish | Temporada 2018 del Campeonato Brasileño de Motovelocidade | temporada-2018-del-campeonato-brasileno-de-motovelocidade | Temporada-2018-del-Campeonato-Brasileno-de-Motovelocidade | temporada-2018-del-campeonato-brasileno-de-motovelocidade |
Hindi | संयुक्त अरब अमीरात क्रिकेट टीम का स्कॉटलैंड दौरा 2016 | samyaukata-araba-amairaata-karaikaeta-taima-kaa-sakaotalaaimda-daauraa-2016 | 2016 | snyukt-arb-amiiraat-krikett-ttiim-kaa-skonttlaindd-dauraa-2016 |
Icelandic | Alfreð Clausen syngur lög eftir Jenna Jónsson | alfred-clausen-syngur-log-eftir-jenna-jonsson | Alfred-Clausen-syngur-log-eftir-Jenna-Jonsson | alfred-clausen-syngur-log-eftir-jenna-jonsson |
Japanese | コンティニュイング・ケア・リタイアメント・コミュニティ | konte-ni-ingu-kea-ritaiamento-komyunite | | konteiniyuingukearitaiamentokomiyunitei |
Russian | 500 величайших альбомов всех времён по версии журнала Rolling Stone | 500-velichayshix-albomov-vsex-vremyon-po-versii-jurnala-rolling-stone | 500-velichajshih-albomov-vseh-vremyon-po-versii-zhurnala-Rolling-Stone | 500-velichayshih-albomov-vseh-vremyon-po-versii-zhurnala-rolling-stone |
Thai | จังหวัดมุกดาหารในการเลือกตั้งสมาชิกสภาผู้แทนราษฎรไทยเป็นการทั่วไป พ.ศ. 2562 | canghwad-mukdahar-in-kar-eluxk-tang-smachik-spha-phu-aethn-radr-ithy-epnkar-thawip-ph-s-2562 | ..-2562 | cchanghwadmukdaahaarainkaareluue-ktangsmaachiksphaaphuuaethnraasdraithyepnkaarthawaip-ph.s.-2562 |
Vietnamese | Cục Phát thanh, truyền hình và thông tin điện tử (Việt Nam) | cuc-phat-thanh-truyen-hinh-va-thong-tin-dien-tu-viet-nam | Cuc-Phat-thanh-truyen-hinh-va-thong-tin-djien-tu-(Viet-Nam) | cuc-phat-thanh-truyen-hinh-va-thong-tin-dien-tu-viet-nam |
Chinese | 2020年夏季奧林匹克運動會輕艇女子500公尺單人愛斯基摩艇比賽 | 2020-nian-xiaji-aolinpike-yundonghui-qing-ting-nuzi-500-gongchi-danren-aisijimo-ting-bisai | 2020500 | 2020nian-xia-ji-ao-lin-pi-ke-yun-dong-hui-qing-ting-nu-zi-500gong-chi-dan-ren-ai-si-ji-mo-ting-bi-sai |
`npm:slugify` gives empty results for Amharic, Hindi, Thai, Chinese, and Japanese, and also makes some extremely questionable choices for Greek (θ → 8, ω → w, η → h). `npm:transliterate` gives more concise (possibly better?) results for Hindi but gives even less vowel-ey results for Arabic, lacks spacing for Japanese/Thai (I added ZWSPs to stop the table breaking the layout), and has suboptimal spacing for Chinese.
Char map is ~213KB (un-minified, un-gzipped).
IMO those are "good enough" results at this stage (given that the default will be not to transliterate), but it'd be good to get some input from speakers of a few more of these languages. You can also try it out with other languages here: https://dash.deno.com/playground/slugify
from deno_std.
Upon testing more languages (list taken from `npm:any-ascii`'s examples), we're still missing Braille (which I think is safe to omit, as web content written in Braille must be vanishingly rare; someone please correct me if I'm wrong) and at least 3 South-East Asian languages (Burmese/Myanmar, Khmer, Lao). Also, the Korean example looks a bit sus compared to the other versions. `npm:any-ascii` now seems to be best-in-class for JS transliteration, at least among the ones I've found.
from deno_std.
While I personally found this research interesting, I think it's difficult to do these transliterations in an unopinionated way. They also seem difficult to maintain, as the maintainers are not knowledgeable about many of these languages. I'd consider the handlings of non-latin alphabet languages are out of scope of this API.
from deno_std.
> I'd consider the handlings of non-latin alphabet languages are out of scope of this API.

@kt3k My main concern isn't that non-Latin script should have special handling, but that it should be passed through rather than removed (and especially that it shouldn't be removed as a default option). It's worth mentioning that in its current state, `slugify` doesn't even handle fully-Latin-alphabet text properly: for example, various alphabetic chars like `[ßĐæø]` are removed (Blöße becomes `bloe`, Trần Hưng Đạo becomes `tran-hung-ao`, Nærøy becomes `nry`).
I only started looking into transliteration, which IMO is a less-good option compared to pass-through (not to mention significantly less common in-the-wild), as an alternative.
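The reason those letters vanish can be reproduced in a few lines. The sketch below is an assumption about the general approach (NFD normalization followed by an ASCII-only filter, using the default regex from the signature proposed earlier in the thread), not a copy of the actual implementation; `asciiOnlySlug` is a hypothetical name. Letters such as ö decompose into a base letter plus a combining mark, so the base survives, whereas ß, Đ, æ, and ø have no canonical decomposition and are dropped whole.

```typescript
// Assumed ASCII-only strip pattern, taken from the default proposed earlier in the thread.
const ASCII_ONLY = /[^a-zA-Z0-9\s-]/g;

// Hypothetical helper mirroring an "NFD, then strip non-ASCII" approach.
function asciiOnlySlug(input: string): string {
  return input
    .trim()
    .normalize("NFD") // split base letters from combining marks
    .replace(ASCII_ONLY, "") // drop marks and any letter with no ASCII base
    .replace(/[\s-]+/g, "-") // whitespace and hyphen runs become one hyphen
    .toLowerCase();
}

// "ö" decomposes into "o" + U+0308, but "ß" stays a single code point.
console.log("\u00F6".normalize("NFD").length); // 2
console.log("\u00DF".normalize("NFD").length); // 1

console.log(asciiOnlySlug("Blöße")); // "bloe"
console.log(asciiOnlySlug("Trần Hưng Đạo")); // "tran-hung-ao"
console.log(asciiOnlySlug("Nærøy")); // "nry"
```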
With all that said... I'm inclined to think you're probably right. Further, @lowlighter's suggestion of a strip regex is a useful option after all, but with suggested regexes being exported from the package itself (roll-your-own is probably less useful).
With that option, you can easily implement pass-through (default), strip, strip-diacritics, or even strip-only-ascii-diacritics behavior. The regex would be run against the `NFD` form so it could easily deal with diacritics:
export const NON_WORD = /[^\p{L}\p{M}\p{N}\-]+/gu;
export const DIACRITICS = /[^\p{L}\p{N}\-]+/gu;
export const ASCII_DIACRITICS = /(?<=[a-zA-Z])\p{M}+|[^\p{L}\p{M}\p{N}\-]+/gu;
export const NON_ASCII = /[^0-9a-zA-Z\-]/g;
// NON_WORD
assertEquals(slugify("déjà-vu"), "déjà-vu");
assertEquals(slugify("Συστημάτων Γραφής"), "συστημάτων-γραφής");
assertEquals(slugify("déjà-vu", { strip: DIACRITICS }), "deja-vu");
assertEquals(slugify("Συστημάτων Γραφής", { strip: DIACRITICS }), "συστηματων-γραφης");
assertEquals(slugify("déjà-vu", { strip: ASCII_DIACRITICS }), "deja-vu");
assertEquals(slugify("Συστημάτων Γραφής", { strip: ASCII_DIACRITICS }), "συστημάτων-γραφής");
assertEquals(slugify("déjà-vu", { strip: NON_ASCII }), "deja-vu");
assertEquals(slugify("Συστημάτων Γραφής", { strip: NON_ASCII }), "-");
Further, you could easily use a third-party transliteration library along with `strip: NON_ASCII`:
import transliterate from 'npm:any-ascii'
assertEquals(slugify(transliterate("Συστημάτων Γραφής"), { strip: NON_ASCII }), "systimaton-grafis");
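One way the `strip` option could be wired up is sketched below. This is an assumption-laden illustration, not the implementation that was merged: `slugifySketch` and `SlugifySketchOptions` are hypothetical names, and the order of steps (NFD, lowercase, whitespace to hyphens, strip, collapse hyphens, NFC) is one plausible reading of the proposal. The four patterns are copied from the comment above.

```typescript
// Strip patterns copied from the proposal; they are applied to the NFD form.
const NON_WORD = /[^\p{L}\p{M}\p{N}\-]+/gu;
const DIACRITICS = /[^\p{L}\p{N}\-]+/gu;
const ASCII_DIACRITICS = /(?<=[a-zA-Z])\p{M}+|[^\p{L}\p{M}\p{N}\-]+/gu;
const NON_ASCII = /[^0-9a-zA-Z\-]/g;

// Hypothetical options shape; only `strip` is discussed in the thread.
type SlugifySketchOptions = { strip?: RegExp };

function slugifySketch(input: string, options: SlugifySketchOptions = {}): string {
  const strip = options.strip ?? NON_WORD; // pass-through is the default
  return input
    .trim()
    .normalize("NFD") // separate base letters from combining marks
    .toLowerCase()
    .replace(/\s+/g, "-") // whitespace runs become separators
    .replace(strip, "") // remove whatever the chosen pattern matches
    .replace(/-{2,}/g, "-") // collapse repeated separators
    .normalize("NFC"); // recompose any marks that survived
}

console.log(slugifySketch("Hello, World!")); // "hello-world"
console.log(slugifySketch("déjà-vu", { strip: DIACRITICS })); // "deja-vu"
console.log(slugifySketch("déjà-vu", { strip: ASCII_DIACRITICS })); // "deja-vu"
console.log(slugifySketch("Συστημάτων Γραφής", { strip: NON_ASCII })); // "-"
```

The last call shows that a caller who opts into `NON_ASCII` can still end up with a separator-only slug, so such results probably need to be rejected or replaced at the call site.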
from deno_std.
Ah ok. The `strip` option sounds good to me. Looks like a balanced solution between added complexity and practicality.
from deno_std.