Comments (6)
Thanks for filing this issue. This is a bit of an edge case, because characters like | and , are valid and even common in URLs, but they're also very common in plaintext.
The strategy I ended up with is to accept them as "middle" characters in a URL, but not as the last character. For example:
$ echo "http://foo| bar" | xurls
http://foo
$ echo "http://foo|bar" | xurls
http://foo|bar
$ echo "http://foo, bar" | xurls
http://foo
$ echo "http://foo,bar" | xurls
http://foo,bar
I could make the URL matching more conservative and always cut off commas and vertical bars. However, that would break perfectly valid URLs like https://en.wikipedia.org/wiki/Colma,_California, for example. As I'm writing this, I wonder if GitHub will auto-link that properly :)
You can also find real examples using vertical bars, like https://fonts.googleapis.com/css?family=Lato:400,700,400italic,700italic%7CRoboto+Slab:400,700%7CInconsolata:400,700. In both cases, note how modern browsers don't escape the character.
I think the current mechanism is an OK middle ground. If you can provide a real example, perhaps there's some tweak we could make. With the http://google.com|google.com example you gave above, I don't think there's anything we can do without breaking perfectly valid URLs.
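The "middle but not last" rule described above can be sketched with a simplified regular expression. This is not the pattern xurls actually uses; the name middleNotLast and the pattern itself are assumptions made up for illustration, and the pattern ignores everything else a real URL matcher has to handle.

```go
package main

import (
	"fmt"
	"regexp"
)

// middleNotLast is a hypothetical, heavily simplified pattern (not the one
// xurls generates). It accepts | and , inside a URL only when at least one
// more URL character follows them, so they can never end the match.
var middleNotLast = regexp.MustCompile(`https?://[^\s|,]+(?:[|,]+[^\s|,]+)*`)

func main() {
	for _, in := range []string{
		"http://foo| bar",
		"http://foo|bar",
		"http://foo, bar",
		"http://foo,bar",
	} {
		fmt.Printf("%q -> %q\n", in, middleNotLast.FindString(in))
	}
}
```

With this sketch, the two inputs containing a space yield http://foo, while the two without a space are matched in full, mirroring the xurls output shown earlier.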
from xurls.
I forgot to say - if you have plaintext that you know makes heavy use of certain characters like |, what you could do is use strings.Split first, then pass each part through xurls.
from xurls.
Ping @jlory - any thoughts?
from xurls.
So the story behind my findings with the "|" is I'm using the Slack API to parse some messages and I found out that they rewrite URL links in messages using | sometimes: https://api.slack.com/docs/message-formatting#linking_to_urls
In your example the pipe character is escaped (as %7C). Having read https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid, it's still unclear to me whether we should exclude it, but I've never seen a URL / URI with a literal | in it.
For now I'm splitting the string on | and discarding the rest.
from xurls.
Hmm, you're right - that's a good example of | being used to separate URLs. I haven't myself found a use case for vertical bars to be part of a URL, so I'll make this change.
If anyone runs into regressions because of it, they can file an issue, and we can consider reverting the commit at that point.
from xurls.
Ah, this was never intended to work like this. I simply added a | in the wrong place - between [ and ], effectively adding it to a character set by mistake.
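To illustrate the kind of mistake described here, the following sketch contrasts | inside a bracketed character set with | used as alternation. The patterns are toy examples invented for this purpose, not excerpts from the xurls source.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Inside [...] the | is a literal member of the set, not alternation.
	charSet = regexp.MustCompile(`^[a|b]+$`)
	// Outside brackets the | means "either the left or the right side".
	alternation = regexp.MustCompile(`^(?:a|b)+$`)
)

func main() {
	fmt.Println(charSet.MatchString("a|b"))     // true: | is accepted as a character
	fmt.Println(alternation.MatchString("a|b")) // false: | is not part of the input alphabet
}
```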
from xurls.