mvdan / xurls
Extract URLs from text
License: BSD 3-Clause "New" or "Revised" License
Hi mate,
Hope you are doing well!
I was playing with xurls on some tricky URLs and noticed that the postgresql scheme is missing, e.g.:
postgres://user:[email protected]:5432/path?k=v#f
Can you add it, please? :-)
Cheers,
Luc
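For what it's worth, xurls v2 exposes StrictMatchingScheme for building a matcher for a specific scheme. The sketch below approximates that idea with only the standard library, using a deliberately naive pattern (the real xurls regexps are far more thorough); the connection string here is a made-up example:

```go
package main

import (
	"fmt"
	"regexp"
)

// matchScheme builds a naive matcher for URLs with the given scheme.
// It is a simplified stand-in for xurls.StrictMatchingScheme, which
// takes a scheme expression such as "postgres://" and returns a much
// more complete regexp.
func matchScheme(scheme string) *regexp.Regexp {
	return regexp.MustCompile(regexp.QuoteMeta(scheme) + `[^\s]+`)
}

func main() {
	rx := matchScheme("postgres://")
	text := "connect via postgres://user:pass@host:5432/db?sslmode=disable please"
	fmt.Println(rx.FindString(text))
}
```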
The certificate for mvdan.cc has expired today, resulting in the following error:
github.com/golang/dep
The following errors occurred while deducing packages:
* "mvdan.cc/xurls": unable to deduce repository and source type for "mvdan.cc/xurls": unable to read metadata: unable to fetch raw metadata: failed HTTP request to URL "http://mvdan.cc/xurls?go-get=1": Get https://mvdan.cc/xurls?go-get=1: x509: certificate has expired or is not yet valid
Many of the IRC channels I frequent (one of my most common uses for this util) have bots which grab a webpage's title and print it (often while repeating the domain and TLD of the site). As a result, many of the links end up repeated quite a bit.
It would be really handy if there were an option to match only things that explicitly have a protocol, rather than anything that "looks like a URL." Such an option would also weed out false positives, which is helpful for any use case that depends more heavily on accuracy.
Input
Hello User you have been chosen to win journey to City,Country for 7 day(s)
##enjoy.it
Please Visit our website shopping.com/profile/joe
Actual Detection
enjoy.it
shopping.com/profile/joe
Expected
shopping.com/profile/joe
More examples
echo "##google.com" | xurls -r
echo "##enjoy.it" | xurls -r
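For context, this is essentially the strict/relaxed split xurls already draws: Strict requires an explicit scheme, while Relaxed also accepts bare hostnames. A stdlib-only sketch of the distinction, with heavily simplified stand-in patterns (the real xurls regexps also validate TLDs, ports, and much more):

```go
package main

import (
	"fmt"
	"regexp"
)

// Simplified stand-ins for xurls.Strict and xurls.Relaxed: "strict"
// requires an explicit scheme, "relaxed" also accepts bare hostnames.
var (
	strict  = regexp.MustCompile(`[a-zA-Z][a-zA-Z0-9+.-]*://[^\s#]+`)
	relaxed = regexp.MustCompile(`([a-zA-Z][a-zA-Z0-9+.-]*://)?[a-zA-Z0-9.-]+\.[a-z]{2,}(/[^\s#]*)?`)
)

func main() {
	text := "##enjoy.it and shopping.com/profile/joe and https://example.com/x"
	fmt.Println(strict.FindAllString(text, -1))  // only the scheme-ful URL
	fmt.Println(relaxed.FindAllString(text, -1)) // bare hostnames too
}
```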
Hi,
Having a list of valid TLD is nice to limit false positives in relaxed mode.
However, the default list contains only the Unicode version of each TLD, not the xn-- (Punycode) version (while the IANA list contains only the xn-- version).
$ idn 联通
xn--8y0a063a
$
$ echo "test.联通" | xurls -r
test.联通
$ echo "test.xn--8y0a063a" | xurls -r
$
I would suggest having the default list include both versions of IDN TLDs.
$ echo '[logged](https://freenode.logbot.info/foot-terminal) channel' | xurls -fix
[logged](https://freenode.logbot.info/foot-terminal/20210513/20210513
The expected output would be:
[logged](https://freenode.logbot.info/foot-terminal/20210513/20210513) channel
This is probably due to adding support for arbitrary protocols.
$ echo "systems.https://google.com" | xurls
systems.https://google.com
I am unsure, but is it actually ever valid for punctuation to exist in the scheme portion of a URL?
I know that xurls wasn't really focused on being a URL validator. But honestly, we aren't too far from accomplishing that, and it would be helpful to know that matches are valid (in terms of the specification).
How do I use the known standard schemes as an argument to xurls.StrictMatchingScheme? What is the proper way to do this?
Either in Relaxed or Strict mode, we could add a condition for when we do not want to extract "plain" DNS hostnames, e.g.
IgnoreHostnames:
- http://foo.com -> false
- http://foo.com/123 -> true
- http://foo.com/?bar=123 -> true
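Until such an option exists, a caller-side post-filter can approximate IgnoreHostnames by parsing each match with net/url and dropping ones that have no path, query, or fragment. A sketch, assuming the matches come from one of the xurls regexps:

```go
package main

import (
	"fmt"
	"net/url"
)

// ignoreHostnames drops matches that are "plain" hostnames: URLs whose
// path is empty or "/" and which carry no query or fragment.
func ignoreHostnames(matches []string) []string {
	var kept []string
	for _, m := range matches {
		u, err := url.Parse(m)
		if err != nil {
			continue
		}
		if (u.Path == "" || u.Path == "/") && u.RawQuery == "" && u.Fragment == "" {
			continue
		}
		kept = append(kept, m)
	}
	return kept
}

func main() {
	in := []string{"http://foo.com", "http://foo.com/123", "http://foo.com/?bar=123"}
	fmt.Println(ignoreHostnames(in))
}
```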
When leveraging rxRelaxed.FindAllString to find all URLs in SMS text, we found that when sentences are joined without a blank after a period, the regex can be misled into detecting the joined words as a legitimate URL.
For example: a SMS text like
ups has informed us that your equipment has been delivered.The ups tracking number is 1zraxxxxxxx to track the status of your delivery click here hxxps://abc.com/yz3e4a2p|70994
Reproduction steps
var test string = "ups has informed us that your equipment has been delivered.The ups tracking number is 1zraxxxxxxx to track the status of your delivery click here hxxps://abc.com/yz3e4a2p|70994"
rxRelaxed := xurls.Relaxed()
fmt.Println(rxRelaxed.FindAllString(test, -1)) // [delivered.th https://abc.com/yz3e4a2p]
rxStrict := xurls.Strict()
fmt.Println(rxStrict.FindAllString(test, -1)) // [https://abc.com/yz3e4a2p]
What did you expect to happen?
The relaxed method behaves like the strict one; only one result comes out
What actually happened?
[delivered.th https://abc.com/yz3e4a2p] for relax method
[hxxps://abc.com/yz3e4a2p] for strict method
Environment
d.lawrence -> d.law
Since lawrence is a full word, my expectation is a split at word boundaries, not within one.
This bug predates the arbitrary protocol string matching commit.
Currently, xurls's regexps match :: as a valid URL.
Hi!
It seems like there is a problem with Cyrillic TLDs. Here an example:
echo "test.xyz" | xurls -r
test.xyz
echo "test.xyz test" | xurls -r
test.xyz
echo "test.бел" | xurls -r
test.бел
echo "test.бел test" | xurls -r
<empty response>
If there are any symbols, even whitespace, after a Cyrillic domain, it no longer matches.
I tried to solve the issue and found that it may be related to this string:
webURL := hostName + port + `(/|/` + pathCont + `?|\b|(?m)$)`
specifically the \b part. I tried |\b|\B instead, but some tests failed.
Thanks!
Here I have two small edge-cases:
<[email protected]>
yields []string{"some.gu", "domain.com"}
[cid:programmer-thumb-shield-32x32.v2_fe0f1423-2d7d-484b-b624-6b7545ab4311.png]
yields []string{"fe0f1423-2d7d-484b-b624-6b7545ab4311.pn"}
I'm just wondering about the dropped character before the symbol. This is email, so I can cross-reference against the filenames of inline attachments and also double-check against a known list of TLDs, but dropping that last character makes this difficult.
Any ideas on why that last char is being dropped?
We only accept matching parentheses so that e.g. markdown links match http://foo.bar instead of http://foo.bar), taking the trailing closing parenthesis from the markdown syntax.
#10 added basic support for brackets, and we probably want to do the same so that [http://foo.bar] matches http://foo.bar instead of http://foo.bar].
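One way to implement this, sketched here as caller-side post-processing rather than inside the regexp itself: repeatedly trim a trailing closer whenever the match contains more closers than openers, so balanced pairs inside the URL survive:

```go
package main

import (
	"fmt"
	"strings"
)

// trimUnbalanced strips trailing ')' or ']' from a match when the
// corresponding opener does not appear earlier in the match, which is
// the markdown-link situation described above.
func trimUnbalanced(m string) string {
	for {
		switch {
		case strings.HasSuffix(m, ")") && strings.Count(m, "(") < strings.Count(m, ")"):
			m = m[:len(m)-1]
		case strings.HasSuffix(m, "]") && strings.Count(m, "[") < strings.Count(m, "]"):
			m = m[:len(m)-1]
		default:
			return m
		}
	}
}

func main() {
	fmt.Println(trimUnbalanced("http://foo.bar)"))
	fmt.Println(trimUnbalanced("http://foo.bar/x(1)")) // balanced, kept intact
}
```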
Hi, just wanted to suggest an edit for the Arch Linux PKGBUILDs.
Wouldn't it be better and cleaner to split the xurls package into xurls (using the precompiled go releases from the Releases tab) and xurls-git (for the latest upstream)?
That way other users won't have to install go just to use your pretty cool piece of software. :)
Thank you!
About a quarter of the times I run the program, I get a fatal error: concurrent map iteration and map write.
I presume the race has been there for a long time, but the older versions of Go I used in the past didn't notice that.
Not a hugely pressing matter, since this is just a code generator I use, and the output still seems to be stable. But I should still fix this at some point.
I think there's a bug where it identifies tel:654654 as a URL.
If you were worried that people were accidentally compiling the regex on every function invocation: I just hit some code that was doing this, and as expected it had pretty poor performance (on the order of hundreds of gigabytes of memory allocated for a simple program).
It might be worthwhile to switch up the examples to create one Relaxed() regexp and use it to match multiple URLs (or non-URLs, as the case may be).
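A minimal sketch of the compile-once pattern with a stdlib regexp; with xurls the equivalent is storing the result of xurls.Relaxed() in a package-level variable rather than calling it per invocation:

```go
package main

import (
	"fmt"
	"regexp"
)

// Compile once at package init; with xurls this would be
// `var rxRelaxed = xurls.Relaxed()` instead of a stdlib pattern.
var rx = regexp.MustCompile(`https?://[^\s]+`)

// extract reuses the shared regexp instead of recompiling per call,
// avoiding the large allocation cost described above.
func extract(text string) []string {
	return rx.FindAllString(text, -1)
}

func main() {
	for _, line := range []string{
		"see https://example.com/a",
		"and http://example.org/b too",
	} {
		fmt.Println(extract(line))
	}
}
```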
It would be great if there were the ability to fetch relative URLs.
<!-- http://foobar.com -->
<a href="foobar.html">The Wonderful World of Foobar!</a>
<a href="http://google.com/foobar.html">The Wonderful World of Foobar!</a>
This should return both:
http://google.com/foobar.html
http://foobar.com/foobar.html
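Since xurls works on plain text with no notion of a document base, resolution would likely stay in user code; the stdlib already handles the mechanics via (*url.URL).ResolveReference. A sketch, assuming the base URL is known from context such as the comment above:

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve turns href values (relative or absolute) into absolute URLs
// against the given base.
func resolve(base string, hrefs []string) ([]string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, h := range hrefs {
		u, err := url.Parse(h)
		if err != nil {
			continue // skip unparseable hrefs
		}
		out = append(out, b.ResolveReference(u).String())
	}
	return out, nil
}

func main() {
	urls, _ := resolve("http://foobar.com", []string{
		"foobar.html",
		"http://google.com/foobar.html",
	})
	fmt.Println(urls)
}
```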
It would be great if xurls could remove duplicate URLs if there are any.
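In the meantime this is a few lines of caller-side code; a sketch that preserves first-seen order:

```go
package main

import "fmt"

// dedupe removes duplicate matches while preserving first-seen order.
func dedupe(matches []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, m := range matches {
		if !seen[m] {
			seen[m] = true
			out = append(out, m)
		}
	}
	return out
}

func main() {
	fmt.Println(dedupe([]string{"http://a.com", "http://b.com", "http://a.com"}))
}
```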
I just have one example, which was shared by someone: https://ja.wikipedia.org/wiki/日本語 where the URL cuts off after 日.
Particularly IPv6 addresses, which must be bracketed, as in https://[2001:db8::1]/.
go get -u -v mvdan.cc/xurls/v2
get "mvdan.cc/xurls/v2": found meta tag get.metaImport{Prefix:"mvdan.cc/xurls", VCS:"git", RepoRoot:"https://github.com/mvdan/xurls"} at //mvdan.cc/xurls/v2?go-get=1
get "mvdan.cc/xurls/v2": verifying non-authoritative meta tag
mvdan.cc/xurls (download)
package mvdan.cc/xurls/v2: cannot find package "mvdan.cc/xurls/v2" in any of:
/Users/wzkun/.gvm/gos/go1.14beta1/src/mvdan.cc/xurls/v2 (from $GOROOT)
/Users/wzkun/.gvm/pkgsets/go1.14beta1/global/src/mvdan.cc/xurls/v2 (from $GOPATH)
This is not to say that this practice is a good one, but it is very common for URLs to include a section identifier or something similar following a #.
At the moment, xurls ignores these parts of the URL. So, if I pipe the text "https://google.com/#testingthings" through xurls, it only matches https://google.com/.
The correct solution is to recognize and respect these fragments.
I'm playing with the library and tried it with a simple example; I'm surprised by the result: https://play.golang.org/p/4BF3UXE4x87
Is it expected? Shouldn't "|" be treated as an invalid character?
I am using the xurls code to pull out possible URLs from a message body string. The URLs can be in either strict or relaxed format, so I need to use the relaxed method of xurls to find the possible URLs in the string. The issue is that email addresses can also be in the string, and the relaxed method of xurls pulls those out too.
For example my string might be:
"Hello from http://www.google.com, please check the www.test.com webpage for further information. If you have any questions please email [email protected] or [email protected]"
What I would like xurls to do is pull out just http://www.google.com and www.test.com.
Instead it pulls the 2 URLs plus John.Sm, test.com, test.com. Is there anything that can be done so that only URLs are pulled?
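A possible caller-side workaround: match with positions (FindAllStringIndex) and drop any match that directly touches an '@'. The pattern below is a simplified stand-in for xurls.Relaxed(), and the sample email address is made up:

```go
package main

import (
	"fmt"
	"regexp"
)

// relaxed is a simplified stand-in for xurls.Relaxed(): optional scheme
// plus a dotted hostname and optional path.
var relaxed = regexp.MustCompile(`(https?://)?[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}(/[^\s]*)?`)

// nonEmailMatches returns matches that are not part of an email
// address, i.e. not immediately preceded or followed by '@'.
func nonEmailMatches(text string) []string {
	var out []string
	for _, loc := range relaxed.FindAllStringIndex(text, -1) {
		start, end := loc[0], loc[1]
		if start > 0 && text[start-1] == '@' {
			continue // domain part of an email address
		}
		if end < len(text) && text[end] == '@' {
			continue // local part of an email address
		}
		out = append(out, text[start:end])
	}
	return out
}

func main() {
	text := "Visit http://www.google.com or www.test.com, email John.Smith@test.com"
	fmt.Println(nonEmailMatches(text))
}
```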
Hi,
I have recently been using this library and found that some links extracted from a string are useless in some cases, like image or js links that exist on sites.
If it's OK, I can make a pull request adding a feature that lets the user exclude these types of links.
Originally reported at keybase/client#22453 as a failure of the Keybase client to linkify https://en.wikipedia.org/wiki/Dunning–Kruger_effect .
The issue seems to stem from pathCont being too narrowly defined; it does not include the full range specified in RFC 3987:
ipchar = iunreserved / pct-encoded / sub-delims / ":"
/ "@"
…
iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
/ %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
/ %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
/ %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
/ %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
/ %xD0000-DFFFD / %xE1000-EFFFD
"https://en.wikipedia.org/wiki/Dunning–Kruger_effect" contains U+2013 EN DASH (–), which is in the %xA0-D7FF range but has a General_Category of Dash_Punctuation (Pd), and is erroneously not included in xurls.go's midChar/endChar/etc.
To prevent issues like #67 in the future.
Two changes should be made:
Use clearer filenames for generated files, so they stand out in file change summaries. For example, schemes_gen.go rather than schemes.go.
Split go generate into two phases: one to download the latest TLD and scheme lists from the internet and write them to files in the git repo (but outside the module zip), and another to take those files and generate the code. The default go generate would do both, but we would add a go generate -tags=noupdate mode to only do the second. CI would enforce that the latter produces an empty git diff.
Hi,
If a website body contains a JSON string, I get garbage URLs.
Specific case:
string := `{"props":{"pageProps":{"theme":{"key":"leaf","mode":"light","colors":{"body":"palette.slate13","linkText":"#fff","linkBackground":"#39e09b","linkShadow":"#000"},"components":{"ProfileBackground":{"backgroundColor":"#fff","backgroundStyle":"flat"},"LinkContainer":{"borderType":"squared","styleType":"fill"},"SocialLink":{"fill":"linkBackground"},"Banner":{"default":{"backgroundColor":"linkBackground","color":"linkText"}}}},"username":"adrianphoto_bcn","pageTitle":"@adrianphoto_bcn","metaTitle":"@adrianphoto_bcn","metaDescription":"Linktree. Make your link do more.","profilePictureUrl":"https://d15mvavv27jnvy.cloudfront.net/zdKaK/660bb5ffef7d46960c5c1be349944840.jpg","description":null,"links":[{"id":"11987649","url":"https://onlyfans.com/adrianphotobcn","animation":null,"amazonAffiliate":null,"thumbnail":null,"title":"Onlyfans","type":"CLASSIC","context":{}},{"id":"7730208","url":"http://Photoproducer.manyvids.com","animation":null,"amazonAffiliate":null,"thumbnail":null,"title":"ManyVids","type":"CLASSIC","context":{}},{"id":"11994192","url":"https://www.suicidegirls.com/members/adrianphoto_bcn/","animation":null,"amazonAffiliate":null,"thumbnail":null,"title":"Suicidegirls","type":"CLASSIC","context":{}},{"id":"7730413","url":"https://mobile.twitter.com/adrianphoto_bcn","animation":null,"amazonAffiliate":null,"thumbnail":null,"title":"Twitter","type":"CLASSIC","context":{}},{"id":"7730346","url":"https://www.instagram.com/adrianphotobcn","animation":null,"amazonAffiliate":null,"thumbnail":null,"title":"Instagram","type":"CLASSIC","context":{}},{"id":"16064948","url":"https://www.instagram.com/afoto.bcn","animation":null,"amazonAffiliate":null,"thumbnail":null,"title":"Instagram 
sec","type":"CLASSIC","context":{}}],"socialLinks":[],"integrations":[],"leapLink":null,"isOwner":false,"isLogoVisible":true,"isProfileVerified":true,"hasConsentedToView":true,"account":{"id":1848934,"username":"adrianphoto_bcn","isActive":true,"profilePictureUrl":"https://d15mvavv27jnvy.cloudfront.net/zdKaK/660bb5ffef7d46960c5c1be349944840.jpg","pageTitle":"@adrianphoto_bcn","googleAnalyticsId":null,"facebookPixelId":null,"donationsActive":false,"contentWarning":null,"description":null,"isLogoVisible":true,"owner":{"id":2054277,"isEmailVerified":true},"pageMeta":null,"integrations":[],"links":[{"id":11987649,"type":"CLASSIC","title":"Onlyfans","url":"https://onlyfans.com/adrianphotobcn","formattedUrl":"https://onlyfans.com/adrianphotobcn","thumbnailUrl":null,"animation":null,"isLeapLink":false,"isLeapLinkActive":false,"amazonAffiliate":null,"context":null},{"id":7730208,"type":"CLASSIC","title":"ManyVids","url":"Photoproducer.manyvids.com","formattedUrl":"http://Photoproducer.manyvids.com","thumbnailUrl":null,"animation":null,"isLeapLink":false,"isLeapLinkActive":false,"amazonAffiliate":null,"context":null},{"id":11994192,"type":"CLASSIC","title":"Suicidegirls","url":"https://www.suicidegirls.com/members/adrianphoto_bcn/","formattedUrl":"https://www.suicidegirls.com/members/adrianphoto_bcn/","thumbnailUrl":null,"animation":null,"isLeapLink":false,"isLeapLinkActive":false,"amazonAffiliate":null,"context":null},{"id":7730413,"type":"CLASSIC","title":"Twitter","url":"https://mobile.twitter.com/adrianphoto_bcn","formattedUrl":"https://mobile.twitter.com/adrianphoto_bcn","thumbnailUrl":null,"animation":null,"isLeapLink":false,"isLeapLinkActive":false,"amazonAffiliate":null,"context":null},{"id":7730346,"type":"CLASSIC","title":"Instagram","url":"https://www.instagram.com/adrianphotobcn","formattedUrl":"https://www.instagram.com/adrianphotobcn","thumbnailUrl":null,"animation":null,"isLeapLink":false,"isLeapLinkActive":false,"amazonAffiliate":null,"context":null},{"id":16
064948,"type":"CLASSIC","title":"Instagram sec","url":"https://www.instagram.com/afoto.bcn","formattedUrl":"https://www.instagram.com/afoto.bcn","thumbnailUrl":null,"animation":null,"isLeapLink":false,"isLeapLinkActive":false,"amazonAffiliate":null,"context":null}],"socialLinks":[],"theme":{"key":"leaf"}}},"__N_SSP":true},"page":"/[profile]","query":{"profile":"adrianphoto_bcn"}`
rxStrict := xurls.Strict()
urls := rxStrict.FindAllString(string, -1)
for _, u := range urls {
	fmt.Printf("%s\n", u)
}
thanks
Hi,
The recommended import path mvdan.cc/xurls does not connect.
✗ unable to deduce repository and source type for "mvdan.cc/xurls": unable to read metadata: unable to fetch raw metadata: failed HTTP request to URL "http://mvdan.cc/xurls?go-get=1": Get http://mvdan.cc/xurls?go-get=1: dial tcp 178.62.67.243:80: connect: connection refused
I have resolved this by using the link to this github repository.
Hi,
Thanks for providing us xurls.
I came across the following case:
$ echo "http://www.fakedomain.com/account/legitdomain.com" | bin/xurls -r
http://www.fakedomain.com/account/legitdomain.com
I wonder if there is an easy (yet still fast) way for xurls to identify that there are 2 "URLs" inside?
So this could possibly report something like:
$ echo "http://www.fakedomain.com/account/legitdomain.com/folder" | bin/xurls -r
http://www.fakedomain.com/account/legitdomain.com/folder
legitdomain.com/folder
$
Possibly by adding an additional option to support it on demand only.
If there is a space in the string, both are found (expected and fine):
echo "http://www.fakedomain.com/ account/legitdomain.com/folder" | bin/xurls -r
http://www.fakedomain.com/
legitdomain.com/folder
This is only a suggestion. If it impacts performance badly, it is probably better not to implement it.
Is it possible to go get this library to use in a project? What is the URL?
At the moment, xurls matches against a list of TLDs from the IANA; however, as that list is ever-changing, and since there are plenty of domains that may be valid but are not administered by the IANA, I would propose an option to match arbitrary TLDs.
This would dramatically simplify the matching code and make it far more flexible.
$ xurls <<<'important url: *http://foo.com/bar*'
http://foo.com/bar*
$ xurls <<<'important url: _http://foo.com/bar_'
http://foo.com/bar_
We can probably do better here.
xurls matches a lot of CJK characters, and I need to control the Unicode range.
Can you provide an option to match only ASCII?
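Short of a dedicated option, a caller-side filter can drop any match containing non-ASCII runes; a sketch:

```go
package main

import "fmt"

// asciiOnly keeps matches consisting purely of ASCII characters,
// dropping ones with CJK or other non-ASCII runes.
func asciiOnly(matches []string) []string {
	var out []string
next:
	for _, m := range matches {
		for _, r := range m {
			if r > 127 {
				continue next
			}
		}
		out = append(out, m)
	}
	return out
}

func main() {
	fmt.Println(asciiOnly([]string{"example.com", "例え.テスト", "test.бел"}))
}
```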
Useful little util.
It would be nice if, in addition to the stdin support, it could work with file(s) as arguments.
Currently this works:
cat myfile.txt | xurls
This just hangs there, waiting:
xurls myfile.txt
An example of a Go tool that works as expected for a *nix CLI tool is ccat.
Hi!
Not sure how to get around this issue with dep ensure. We're using Go 1.12 and dep ensure, trying to lock the repo down to 2.1.0 for 1.12 support. Running through a Docker image on linux-amd64.
grouped write of manifest, lock and vendor: error while writing out vendor tree: failed to write dep tree: failed to export mvdan.cc/xurls: remote repository at https://github.com/mvdan/xurls does not exist, or is inaccessible: : exit status 128
Thinking this line in our docker file is breaking things...
"RUN git config --global url.ssh://[email protected]/.insteadOf https://github.com/"
Any tips would be very helpful. Looked up solutions for dep and none of those solutions work.
Thank you,
-Laura
Gopkg.toml
[[constraint]]
name = "mvdan.cc/xurls"
version = "2.1.0"
Gopkg.lock:
digest = "1:bd1896d9d8de29f9656f936e2cc51b682f4ea0be9da662ec93571fec18d83f61"
name = "mvdan.cc/xurls"
packages = ["."]
pruneopts = ""
revision = "aca318f079078cc3677a81e7f7d89df859f4f4b2"
version = "v2.1.0"
What do you think? It could work like StrictMatchingScheme but accept a hostname (or second-level+ domains). This saves one from having to build an additional regexp to check URLs returned by xurls.FindAllString.
$ echo 'http://graphemica.com/🐼' | xurls
http://graphemica.com/
Any chance you could tag the current master?
Mainly for the email in relaxed commit.
Hi!
For me, the install fails:
user@tools:/tmp/tmp.JsYwNTDgZX$ cd $(mktemp -d); go mod init tmp; GO111MODULE=on go get mvdan.cc/xurls/v2/cmd/xurls
go: creating new go.mod: module tmp
user@tools:/tmp/tmp.MFwKzoKsW4$ xurls
xurls: command not found
user@tools:/tmp/tmp.MFwKzoKsW4$ echo "Do gophers live in http://golang.org?" | xurls
xurls: command not found
I'm running go version go1.13.5 linux/amd64, installed by following the official guide: https://golang.org/doc/install
Not sure what causes this issue.
Hi, thanks for providing xurls.
I came across the following error when the input file contains quite long lines:
$ printf 'tototutu%.0s' {1..9000} > /tmp/a
$ xurls -r /tmp/a
bufio.Scanner: token too long
$
$ printf 'tototutu%.0s' {1..5000} > /tmp/b
$ xurls -r /tmp/b
$
Just wanted to report that such a strange case with long lines could happen...
As I'm not a good Golang coder, it's better that I don't submit a PR.
As suggested by @bep, we could read and run the regex over files passed as arguments concurrently all at once, instead of one after the other. For regular files this doesn't make much sense in the general case, but it could make sense in files that cause blocking reads like named pipes or stuff that goes over the network.
The only downside I can see to this is that it's a bit overkill for the generic, simple case.
"Hi, this is my email [email protected]"
This extracts example.com, which isn't useful on its own. I would expect either the complete email address or for it to be skipped.
What can be done for email addresses?
Finding the links and then searching for them to replace them feels rather inefficient.
It would be great if the API would also allow for link replacements.
Or just return position information (start, stop/len) of the urls being found.
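Both requests are already possible in user code, since xurls returns a standard *regexp.Regexp: ReplaceAllStringFunc does replacement in one pass, and FindAllStringIndex yields start/end byte offsets. A sketch with a simplified stand-in pattern:

```go
package main

import (
	"fmt"
	"regexp"
)

// rx is a simplified stand-in for the regexp returned by xurls.Strict().
var rx = regexp.MustCompile(`https?://[^\s]+`)

// linkify wraps each URL in an HTML anchor in a single pass, with no
// separate search-then-replace step.
func linkify(text string) string {
	return rx.ReplaceAllStringFunc(text, func(m string) string {
		return `<a href="` + m + `">` + m + `</a>`
	})
}

func main() {
	fmt.Println(linkify("see https://example.com/a here"))
	// Position information (start/end byte offsets) is also available:
	fmt.Println(rx.FindAllStringIndex("see https://example.com/a here", -1))
}
```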
It looks like xurls only checks for https?, which grabs http and https, instead of allowing arbitrary protocols (e.g., file://, ftp://, steam://, etc.).
Any interest in adding support for arbitrary protocols?
example: https://www.periscope.tv/w/aLtI0DExNjg3MjA3fDFaa0t6REFMck1XSnZLBVQUNYqqaSLxjZicBht_sUEx73i8mL_S7Q8adtkqNw==
(the ending == is left out)