mvdan / xurls
Extract URLs from text
License: BSD 3-Clause "New" or "Revised" License
Hi mate,
Hope you are doing well!
I was playing with xurls on some tricky URLs and noticed that the postgresql scheme is missing, e.g.:
postgres://user:[email protected]:5432/path?k=v#f
Can you add it, please? :-)
Cheers,
Luc
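For what it's worth, xurls v2 exposes StrictMatchingScheme for building a matcher for a specific scheme. The sketch below approximates that idea with only the standard library, using a deliberately naive pattern (the real xurls regexps are far more thorough); the connection string here is a made-up example:

```go
package main

import (
	"fmt"
	"regexp"
)

// matchScheme builds a naive matcher for URLs with the given scheme.
// It is a simplified stand-in for xurls.StrictMatchingScheme, which
// takes a scheme expression such as "postgres://" and returns a much
// more complete regexp.
func matchScheme(scheme string) *regexp.Regexp {
	return regexp.MustCompile(regexp.QuoteMeta(scheme) + `[^\s]+`)
}

func main() {
	rx := matchScheme("postgres://")
	text := "connect via postgres://user:pass@host:5432/db?sslmode=disable please"
	fmt.Println(rx.FindString(text))
}
```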
The certificate for mvdan.cc has expired today, resulting in the following error:
github.com/golang/dep
The following errors occurred while deducing packages:
* "mvdan.cc/xurls": unable to deduce repository and source type for "mvdan.cc/xurls": unable to read metadata: unable to fetch raw metadata: failed HTTP request to URL "http://mvdan.cc/xurls?go-get=1": Get https://mvdan.cc/xurls?go-get=1: x509: certificate has expired or is not yet valid
Many of the IRC channels I frequent (one of my most common uses for this util) have bots which grab a webpage's title and print it (often while repeating the domain and TLD of the site). As a result, many of the links end up repeated quite a bit.
It would be really handy if there were an option to match only things that explicitly have a protocol, rather than anything that "looks like a URL." Such an option would also weed out false positives, which is helpful for any use case that depends more heavily on accuracy.
Input
Hello User you have been chosen to win journey to City,Country for 7 day(s)
##enjoy.it
Please Visit our website shopping.com/profile/joe
Actual Detection
enjoy.it
shopping.com/profile/joe
Expected
shopping.com/profile/joe
More examples
echo "##google.com" | xurls -r
echo "##enjoy.it" | xurls -r
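For context, this is essentially the strict/relaxed split xurls already draws: Strict requires an explicit scheme, while Relaxed also accepts bare hostnames. A stdlib-only sketch of the distinction, with heavily simplified stand-in patterns (the real xurls regexps also validate TLDs, ports, and much more):

```go
package main

import (
	"fmt"
	"regexp"
)

// Simplified stand-ins for xurls.Strict and xurls.Relaxed: "strict"
// requires an explicit scheme, "relaxed" also accepts bare hostnames.
var (
	strict  = regexp.MustCompile(`[a-zA-Z][a-zA-Z0-9+.-]*://[^\s#]+`)
	relaxed = regexp.MustCompile(`([a-zA-Z][a-zA-Z0-9+.-]*://)?[a-zA-Z0-9.-]+\.[a-z]{2,}(/[^\s#]*)?`)
)

func main() {
	text := "##enjoy.it and shopping.com/profile/joe and https://example.com/x"
	fmt.Println(strict.FindAllString(text, -1))  // only the scheme-ful URL
	fmt.Println(relaxed.FindAllString(text, -1)) // bare hostnames too
}
```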
Hi,
Having a list of valid TLD is nice to limit false positives in relaxed mode.
However, the default list contains only the Unicode version of each TLD, not the xn-- (Punycode) version (while the IANA list contains only the xn-- version).
$ idn 联通
xn--8y0a063a
$
$ echo "test.联通" | xurls -r
test.联通
$ echo "test.xn--8y0a063a" | xurls -r
$
I would suggest having the default list include both versions of IDN TLDs.
$ echo '[logged](https://freenode.logbot.info/foot-terminal) channel' | xurls -fix
[logged](https://freenode.logbot.info/foot-terminal/20210513/20210513
The expected output would be:
[logged](https://freenode.logbot.info/foot-terminal/20210513/20210513) channel
This is probably due to adding support for arbitrary protocols.
$ echo "systems.https://google.com" | xurls
systems.https://google.com
I am unsure, but is it actually ever valid for punctuation to exist in the scheme portion of a URL?
I know that xurls wasn't really focused on being a URL validator. But honestly, we aren't too far from accomplishing that, and it would be helpful to know that matches are valid (in terms of the specification).
How do I use the known standard schemes as an argument to xurls.StrictMatchingScheme? What is the proper way to do this?
Either in Relaxed or Strict mode, we could add a condition for when we do not want to extract "plain" DNS hostnames, e.g.
IgnoreHostnames:
- http://foo.com -> false
- http://foo.com/123 -> true
- http://foo.com/?bar=123 -> true
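Until such an option exists, a caller-side post-filter can approximate IgnoreHostnames by parsing each match with net/url and dropping ones that have no path, query, or fragment. A sketch, assuming the matches come from one of the xurls regexps:

```go
package main

import (
	"fmt"
	"net/url"
)

// ignoreHostnames drops matches that are "plain" hostnames: URLs whose
// path is empty or "/" and which carry no query or fragment.
func ignoreHostnames(matches []string) []string {
	var kept []string
	for _, m := range matches {
		u, err := url.Parse(m)
		if err != nil {
			continue
		}
		if (u.Path == "" || u.Path == "/") && u.RawQuery == "" && u.Fragment == "" {
			continue
		}
		kept = append(kept, m)
	}
	return kept
}

func main() {
	in := []string{"http://foo.com", "http://foo.com/123", "http://foo.com/?bar=123"}
	fmt.Println(ignoreHostnames(in))
}
```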
When leveraging rxRelaxed.FindAllString to find all URLs in SMS text, we found that when sentences are joined without a blank after a period, the regex can be misled into detecting the joined words as a legitimate URL.
For example: a SMS text like
ups has informed us that your equipment has been delivered.The ups tracking number is 1zraxxxxxxx to track the status of your delivery click here hxxps://abc.com/yz3e4a2p|70994
Reproduction steps
var test string = "ups has informed us that your equipment has been delivered.The ups tracking number is 1zraxxxxxxx to track the status of your delivery click here hxxps://abc.com/yz3e4a2p|70994"
rxRelaxed := xurls.Relaxed()
fmt.Println(rxRelaxed.FindAllString(test, -1)) // [delivered.th https://abc.com/yz3e4a2p]
rxStrict := xurls.Strict()
fmt.Println(rxStrict.FindAllString(test, -1)) // [https://abc.com/yz3e4a2p]
What did you expect to happen?
The relaxed method behaves like the strict one; only one result comes out
What actually happened?
[delivered.th https://abc.com/yz3e4a2p] for relax method
[hxxps://abc.com/yz3e4a2p] for strict method
Environment
d.lawrence -> d.law
Since lawrence is a full word, my expectation is a split at word boundaries, not within one.
This bug predates the arbitrary protocol string matching commit.
Currently, xurls's regexps match :: as a valid URL.
Hi!
It seems like there is a problem with Cyrillic TLDs. Here an example:
echo "test.xyz" | xurls -r
test.xyz
echo "test.xyz test" | xurls -r
test.xyz
echo "test.бел" | xurls -r
test.бел
echo "test.бел test" | xurls -r
<empty response>
If there are any symbols, even whitespace, after a Cyrillic domain, it no longer matches.
I tried to solve the issue and found that it may be related to this string:
webURL := hostName + port + `(/|/` + pathCont + `?|\b|(?m)$)`
specifically the \b part. I tried |\b|\B instead, but some tests failed.
Thanks!
Here I have two small edge-cases:
<[email protected]>
yields []string{"some.gu", "domain.com"}
[cid:programmer-thumb-shield-32x32.v2_fe0f1423-2d7d-484b-b624-6b7545ab4311.png]
yields []string{"fe0f1423-2d7d-484b-b624-6b7545ab4311.pn"}
I'm just wondering about the dropped character before the symbol. This is email, so I can cross-reference against the filenames of inline attachments and also double-check against a known list of TLDs, but dropping that last character makes this difficult.
Any ideas on why that last char is being dropped?
We only accept matching parentheses so that e.g. markdown links match http://foo.bar instead of http://foo.bar), taking the trailing closing parenthesis from the markdown syntax.
#10 added basic support for brackets, and we probably want to do the same so that [http://foo.bar] matches http://foo.bar instead of http://foo.bar].
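One way to implement this, sketched here as caller-side post-processing rather than inside the regexp itself: repeatedly trim a trailing closer whenever the match contains more closers than openers, so balanced pairs inside the URL survive:

```go
package main

import (
	"fmt"
	"strings"
)

// trimUnbalanced strips trailing ')' or ']' from a match when the
// corresponding opener does not appear earlier in the match, which is
// the markdown-link situation described above.
func trimUnbalanced(m string) string {
	for {
		switch {
		case strings.HasSuffix(m, ")") && strings.Count(m, "(") < strings.Count(m, ")"):
			m = m[:len(m)-1]
		case strings.HasSuffix(m, "]") && strings.Count(m, "[") < strings.Count(m, "]"):
			m = m[:len(m)-1]
		default:
			return m
		}
	}
}

func main() {
	fmt.Println(trimUnbalanced("http://foo.bar)"))
	fmt.Println(trimUnbalanced("http://foo.bar/x(1)")) // balanced, kept intact
}
```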
Hi, just wanted to suggest an edit for the Arch Linux PKGBUILDs.
Wouldn't it be better and cleaner to split the xurls package into xurls (using the precompiled go releases from the Releases tab) and xurls-git (for the latest upstream)?
That way other users won't have to install go just to use your pretty cool piece of software. :)
Thank you!
About a quarter of the times I run the program, I get a fatal error: concurrent map iteration and map write.
I presume the race has been there for a long time, but the older versions of Go I used in the past didn't notice that.
Not a hugely pressing matter, since this is just a code generator I use, and the output still seems to be stable. But I should still fix this at some point.
I think there's a bug where it identifies tel:654654 as a URL.
If you were worried that people were accidentally compiling the regex on every function invocation: I just hit some code that was doing this, and as expected it had pretty poor performance (on the order of hundreds of gigabytes of memory allocated for a simple program).
It might be worthwhile to switch up the examples to create one Relaxed() regexp and use it to match multiple URLs (or non-URLs, as the case may be).
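A minimal sketch of the compile-once pattern with a stdlib regexp; with xurls the equivalent is storing the result of xurls.Relaxed() in a package-level variable rather than calling it per invocation:

```go
package main

import (
	"fmt"
	"regexp"
)

// Compile once at package init; with xurls this would be
// `var rxRelaxed = xurls.Relaxed()` instead of a stdlib pattern.
var rx = regexp.MustCompile(`https?://[^\s]+`)

// extract reuses the shared regexp instead of recompiling per call,
// avoiding the large allocation cost described above.
func extract(text string) []string {
	return rx.FindAllString(text, -1)
}

func main() {
	for _, line := range []string{
		"see https://example.com/a",
		"and http://example.org/b too",
	} {
		fmt.Println(extract(line))
	}
}
```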
It would be great if there were the ability to fetch relative URLs.
<!-- http://foobar.com -->
<a href="foobar.html">The Wonderful World of Foobar!</a>
<a href="http://google.com/foobar.html">The Wonderful World of Foobar!</a>
This should return both:
http://google.com/foobar.html
http://foobar.com/foobar.html
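Since xurls works on plain text with no notion of a document base, resolution would likely stay in user code; the stdlib already handles the mechanics via (*url.URL).ResolveReference. A sketch, assuming the base URL is known from context such as the comment above:

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve turns href values (relative or absolute) into absolute URLs
// against the given base.
func resolve(base string, hrefs []string) ([]string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, h := range hrefs {
		u, err := url.Parse(h)
		if err != nil {
			continue // skip unparseable hrefs
		}
		out = append(out, b.ResolveReference(u).String())
	}
	return out, nil
}

func main() {
	urls, _ := resolve("http://foobar.com", []string{
		"foobar.html",
		"http://google.com/foobar.html",
	})
	fmt.Println(urls)
}
```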
It would be great if xurls could remove duplicate URLs if there are any.
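In the meantime this is a few lines of caller-side code; a sketch that preserves first-seen order:

```go
package main

import "fmt"

// dedupe removes duplicate matches while preserving first-seen order.
func dedupe(matches []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, m := range matches {
		if !seen[m] {
			seen[m] = true
			out = append(out, m)
		}
	}
	return out
}

func main() {
	fmt.Println(dedupe([]string{"http://a.com", "http://b.com", "http://a.com"}))
}
```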
I just have one example, which was shared by someone: https://ja.wikipedia.org/wiki/日本語 where the URL cuts off after 日.
Particularly IPv6 addresses, which must be bracketed, as in https://[2001:db8::1]/.
go get -u -v mvdan.cc/xurls/v2
get "mvdan.cc/xurls/v2": found meta tag get.metaImport{Prefix:"mvdan.cc/xurls", VCS:"git", RepoRoot:"https://github.com/mvdan/xurls"} at //mvdan.cc/xurls/v2?go-get=1
get "mvdan.cc/xurls/v2": verifying non-authoritative meta tag
mvdan.cc/xurls (download)
package mvdan.cc/xurls/v2: cannot find package "mvdan.cc/xurls/v2" in any of:
/Users/wzkun/.gvm/gos/go1.14beta1/src/mvdan.cc/xurls/v2 (from $GOROOT)
/Users/wzkun/.gvm/pkgsets/go1.14beta1/global/src/mvdan.cc/xurls/v2 (from $GOPATH)
This is not to say that this practice is a good one, but it is very common for URLs to include a section identifier or something similar following a #.
At the moment, xurls ignores these parts of the URL. So, if I pipe the text "https://google.com/#testingthings" through xurls, it only matches https://google.com/.
The correct solution is to recognize and respect these fragments.
I'm playing with the library and tried it with a simple example; I'm surprised by the result: https://play.golang.org/p/4BF3UXE4x87
Is it expected? Shouldn't "|" be treated as an invalid character?
I am using the xurls code to pull out possible URLs from a message body string. The URLs can be in either strict or relaxed format, so I need to use the relaxed method of xurls to find the possible URLs in the string. The issue is that email addresses can also be in the string, and the relaxed method of xurls pulls those out too.
For example my string might be:
"Hello from http://www.google.com, please check the www.test.com webpage for further information. If you have any questions please email [email protected] or [email protected]"
What I would like xurls to do is pull out just http://www.google.com and www.test.com.
Instead it pulls the 2 URLs plus John.Sm, test.com, test.com. Is there anything that can be done so that only URLs are pulled?
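A possible caller-side workaround: match with positions (FindAllStringIndex) and drop any match that directly touches an '@'. The pattern below is a simplified stand-in for xurls.Relaxed(), and the sample email address is made up:

```go
package main

import (
	"fmt"
	"regexp"
)

// relaxed is a simplified stand-in for xurls.Relaxed(): optional scheme
// plus a dotted hostname and optional path.
var relaxed = regexp.MustCompile(`(https?://)?[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}(/[^\s]*)?`)

// nonEmailMatches returns matches that are not part of an email
// address, i.e. not immediately preceded or followed by '@'.
func nonEmailMatches(text string) []string {
	var out []string
	for _, loc := range relaxed.FindAllStringIndex(text, -1) {
		start, end := loc[0], loc[1]
		if start > 0 && text[start-1] == '@' {
			continue // domain part of an email address
		}
		if end < len(text) && text[end] == '@' {
			continue // local part of an email address
		}
		out = append(out, text[start:end])
	}
	return out
}

func main() {
	text := "Visit http://www.google.com or www.test.com, email John.Smith@test.com"
	fmt.Println(nonEmailMatches(text))
}
```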
Hi,
I have recently been using this library and found that some links extracted from a string are useless in some cases, like image or js links that exist on sites.
If it's OK, I can make a pull request adding a feature that lets the user exclude these types of links.
Originally reported at keybase/client#22453 as a failure of the Keybase client to linkify https://en.wikipedia.org/wiki/Dunning–Kruger_effect .
The issue seems to stem from pathCont being too narrowly defined; it does not include the full range specified in RFC 3987:
ipchar = iunreserved / pct-encoded / sub-delims / ":"
/ "@"
…
iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
/ %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
/ %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
/ %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
/ %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
/ %xD0000-DFFFD / %xE1000-EFFFD
"https://en.wikipedia.org/wiki/Dunning–Kruger_effect" contains U+2013 EN DASH (–), which is in the %xA0-D7FF range but has a General_Category of Dash_Punctuation (Pd), and is erroneously not included in xurls.go's midChar/endChar/etc.
To prevent issues like #67 in the future.
Two changes should be made:
Use clearer filenames for generated files, so they stand out in file change summaries. For example, schemes_gen.go rather than schemes.go.
Split go generate into two phases: one to download the latest TLD and scheme lists from the internet and write them to files in the git repo (but outside the module zip), and another to take those files and generate the code. The default go generate would do both, but we would add a go generate -tags=noupdate mode to only do the second. CI would enforce that the latter produces an empty git diff.
Hi,
If a website body contains a JSON string, I get garbage URLs.
Specific case:
string := `{"props":{"pageProps":{"theme":{"key":"leaf","mode":"light","colors":{"body":"palette.slate13","linkText":"#fff","linkBackground":"#39e09b","linkShadow":"#000"},"components":{"ProfileBackground":{"backgroundColor":"#fff","backgroundStyle":"flat"},"LinkContainer":{"borderType":"squared","styleType":"fill"},"SocialLink":{"fill":"linkBackground"},"Banner":{"default":{"backgroundColor":"linkBackground","color":"linkText"}}}},"username":"adrianphoto_bcn","pageTitle":"@adrianphoto_bcn","metaTitle":"@adrianphoto_bcn","metaDescription":"Linktree. Make your link do more.","profilePictureUrl":"https://d15mvavv27jnvy.cloudfront.net/zdKaK/660bb5ffef7d46960c5c1be349944840.jpg","description":null,"links":[{"id":"11987649","url":"https://onlyfans.com/adrianphotobcn","animation":null,"amazonAffiliate":null,"thumbnail":null,"title":"Onlyfans","type":"CLASSIC","context":{}},{"id":"7730208","url":"http://Photoproducer.manyvids.com","animation":null,"amazonAffiliate":null,"thumbnail":null,"title":"ManyVids","type":"CLASSIC","context":{}},{"id":"11994192","url":"https://www.suicidegirls.com/members/adrianphoto_bcn/","animation":null,"amazonAffiliate":null,"thumbnail":null,"title":"Suicidegirls","type":"CLASSIC","context":{}},{"id":"7730413","url":"https://mobile.twitter.com/adrianphoto_bcn","animation":null,"amazonAffiliate":null,"thumbnail":null,"title":"Twitter","type":"CLASSIC","context":{}},{"id":"7730346","url":"https://www.instagram.com/adrianphotobcn","animation":null,"amazonAffiliate":null,"thumbnail":null,"title":"Instagram","type":"CLASSIC","context":{}},{"id":"16064948","url":"https://www.instagram.com/afoto.bcn","animation":null,"amazonAffiliate":null,"thumbnail":null,"title":"Instagram 
sec","type":"CLASSIC","context":{}}],"socialLinks":[],"integrations":[],"leapLink":null,"isOwner":false,"isLogoVisible":true,"isProfileVerified":true,"hasConsentedToView":true,"account":{"id":1848934,"username":"adrianphoto_bcn","isActive":true,"profilePictureUrl":"https://d15mvavv27jnvy.cloudfront.net/zdKaK/660bb5ffef7d46960c5c1be349944840.jpg","pageTitle":"@adrianphoto_bcn","googleAnalyticsId":null,"facebookPixelId":null,"donationsActive":false,"contentWarning":null,"description":null,"isLogoVisible":true,"owner":{"id":2054277,"isEmailVerified":true},"pageMeta":null,"integrations":[],"links":[{"id":11987649,"type":"CLASSIC","title":"Onlyfans","url":"https://onlyfans.com/adrianphotobcn","formattedUrl":"https://onlyfans.com/adrianphotobcn","thumbnailUrl":null,"animation":null,"isLeapLink":false,"isLeapLinkActive":false,"amazonAffiliate":null,"context":null},{"id":7730208,"type":"CLASSIC","title":"ManyVids","url":"Photoproducer.manyvids.com","formattedUrl":"http://Photoproducer.manyvids.com","thumbnailUrl":null,"animation":null,"isLeapLink":false,"isLeapLinkActive":false,"amazonAffiliate":null,"context":null},{"id":11994192,"type":"CLASSIC","title":"Suicidegirls","url":"https://www.suicidegirls.com/members/adrianphoto_bcn/","formattedUrl":"https://www.suicidegirls.com/members/adrianphoto_bcn/","thumbnailUrl":null,"animation":null,"isLeapLink":false,"isLeapLinkActive":false,"amazonAffiliate":null,"context":null},{"id":7730413,"type":"CLASSIC","title":"Twitter","url":"https://mobile.twitter.com/adrianphoto_bcn","formattedUrl":"https://mobile.twitter.com/adrianphoto_bcn","thumbnailUrl":null,"animation":null,"isLeapLink":false,"isLeapLinkActive":false,"amazonAffiliate":null,"context":null},{"id":7730346,"type":"CLASSIC","title":"Instagram","url":"https://www.instagram.com/adrianphotobcn","formattedUrl":"https://www.instagram.com/adrianphotobcn","thumbnailUrl":null,"animation":null,"isLeapLink":false,"isLeapLinkActive":false,"amazonAffiliate":null,"context":null},{"id":16
064948,"type":"CLASSIC","title":"Instagram sec","url":"https://www.instagram.com/afoto.bcn","formattedUrl":"https://www.instagram.com/afoto.bcn","thumbnailUrl":null,"animation":null,"isLeapLink":false,"isLeapLinkActive":false,"amazonAffiliate":null,"context":null}],"socialLinks":[],"theme":{"key":"leaf"}}},"__N_SSP":true},"page":"/[profile]","query":{"profile":"adrianphoto_bcn"}`
rxStrict := xurls.Strict()
urls := rxStrict.FindAllString(string, -1)
for _, u := range urls {
	fmt.Printf("%s\n", u)
}
thanks
Hi,
The recommended import path mvdan.cc/xurls does not connect.
✗ unable to deduce repository and source type for "mvdan.cc/xurls": unable to read metadata: unable to fetch raw metadata: failed HTTP request to URL "http://mvdan.cc/xurls?go-get=1": Get http://mvdan.cc/xurls?go-get=1: dial tcp 178.62.67.243:80: connect: connection refused
I have resolved this by using the link to this github repository.
Hi,
Thanks for providing us xurls.
I came across the following case:
$ echo "http://www.fakedomain.com/account/legitdomain.com" | bin/xurls -r
http://www.fakedomain.com/account/legitdomain.com
I wonder if there is an easy (yet still fast) way for xurls to identify that there are 2 "URLs" inside?
So this could possibly report something like:
$ echo "http://www.fakedomain.com/account/legitdomain.com/folder" | bin/xurls -r
http://www.fakedomain.com/account/legitdomain.com/folder
legitdomain.com/folder
$
Possibly by adding an additional option to support it on demand only.
If there is a space in the string, both are found (expected and fine):
echo "http://www.fakedomain.com/ account/legitdomain.com/folder" | bin/xurls -r
http://www.fakedomain.com/
legitdomain.com/folder
This is only a suggestion. If it impacts performance badly, it is probably better not to implement it.
Is it possible to go get this library to use in a project? What is the URL?
At the moment, xurls matches against a list of TLDs from the IANA; however, as that list is ever-changing, and since there are plenty of domains that may be valid but are not administered by the IANA, I would propose an option to match arbitrary TLDs.
This would dramatically simplify the matching code and make it far more flexible.
$ xurls <<<'important url: *http://foo.com/bar*'
http://foo.com/bar*
$ xurls <<<'important url: _http://foo.com/bar_'
http://foo.com/bar_
We can probably do better here.
xurls matches a lot of CJK characters, and I need to control the Unicode range.
Can you provide an option to match only ASCII?
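Short of a dedicated option, a caller-side filter can drop any match containing non-ASCII runes; a sketch:

```go
package main

import "fmt"

// asciiOnly keeps matches consisting purely of ASCII characters,
// dropping ones with CJK or other non-ASCII runes.
func asciiOnly(matches []string) []string {
	var out []string
next:
	for _, m := range matches {
		for _, r := range m {
			if r > 127 {
				continue next
			}
		}
		out = append(out, m)
	}
	return out
}

func main() {
	fmt.Println(asciiOnly([]string{"example.com", "例え.テスト", "test.бел"}))
}
```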
Useful little util.
It would be nice if, in addition to the stdin support, it could work with file(s) as arguments.
Currently this works:
cat myfile.txt | xurls
This just hangs there, waiting:
xurls myfile.txt
An example of a Go tool that works as expected for a *nix CLI tool is ccat.
Hi!
Not sure how to get around this issue with dep ensure. We're using Go 1.12 and dep ensure, trying to lock the repo down to 2.1.0 for 1.12 support. Running through a Docker image on linux-amd64.
grouped write of manifest, lock and vendor: error while writing out vendor tree: failed to write dep tree: failed to export mvdan.cc/xurls: remote repository at https://github.com/mvdan/xurls does not exist, or is inaccessible: : exit status 128
Thinking this line in our docker file is breaking things...
"RUN git config --global url.ssh://[email protected]/.insteadOf https://github.com/"
Any tips would be very helpful. Looked up solutions for dep and none of those solutions work.
Thank you,
-Laura
Gopkg.toml
[[constraint]]
name = "mvdan.cc/xurls"
version = "2.1.0"
Gopkg.lock:
digest = "1:bd1896d9d8de29f9656f936e2cc51b682f4ea0be9da662ec93571fec18d83f61"
name = "mvdan.cc/xurls"
packages = ["."]
pruneopts = ""
revision = "aca318f079078cc3677a81e7f7d89df859f4f4b2"
version = "v2.1.0"
What do you think? It could work like StrictMatchingScheme but accept a hostname (or second-level+ domains). This saves one from having to build an additional regexp to check URLs returned by xurls.FindAllString.
$ echo 'http://graphemica.com/🐼' | xurls
http://graphemica.com/
Any chance you could tag the current master?
Mainly for the email in relaxed commit.
Hi!
For me, the install fails:
user@tools:/tmp/tmp.JsYwNTDgZX$ cd $(mktemp -d); go mod init tmp; GO111MODULE=on go get mvdan.cc/xurls/v2/cmd/xurls
go: creating new go.mod: module tmp
user@tools:/tmp/tmp.MFwKzoKsW4$ xurls
xurls: command not found
user@tools:/tmp/tmp.MFwKzoKsW4$ echo "Do gophers live in http://golang.org?" | xurls
xurls: command not found
I'm running go version go1.13.5 linux/amd64, installed by following the official guide: https://golang.org/doc/install
Not sure what causes this issue.
Hi, thanks for providing xurls.
I came across the following error when the input file contains quite long lines:
$ printf 'tototutu%.0s' {1..9000} > /tmp/a
$ xurls -r /tmp/a
bufio.Scanner: token too long
$
$ printf 'tototutu%.0s' {1..5000} > /tmp/b
$ xurls -r /tmp/b
$
Just wanted to report that such a strange case with long lines could happen...
As I'm not a good Golang coder, it's better that I don't submit a PR.
As suggested by @bep, we could read and run the regex over files passed as arguments concurrently all at once, instead of one after the other. For regular files this doesn't make much sense in the general case, but it could make sense in files that cause blocking reads like named pipes or stuff that goes over the network.
The only downside I can see to this is that it's a bit overkill for the generic, simple case.
"Hi, this is my email [email protected]"
This extracts example.com, which isn't useful on its own. I would expect either the complete email address or for it to be skipped.
What can be done for email addresses?
Finding the links and then searching for them to replace them feels rather inefficient.
It would be great if the API would also allow for link replacements.
Or just return position information (start, stop/len) of the urls being found.
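Both requests are already possible in user code, since xurls returns a standard *regexp.Regexp: ReplaceAllStringFunc does replacement in one pass, and FindAllStringIndex yields start/end byte offsets. A sketch with a simplified stand-in pattern:

```go
package main

import (
	"fmt"
	"regexp"
)

// rx is a simplified stand-in for the regexp returned by xurls.Strict().
var rx = regexp.MustCompile(`https?://[^\s]+`)

// linkify wraps each URL in an HTML anchor in a single pass, with no
// separate search-then-replace step.
func linkify(text string) string {
	return rx.ReplaceAllStringFunc(text, func(m string) string {
		return `<a href="` + m + `">` + m + `</a>`
	})
}

func main() {
	fmt.Println(linkify("see https://example.com/a here"))
	// Position information (start/end byte offsets) is also available:
	fmt.Println(rx.FindAllStringIndex("see https://example.com/a here", -1))
}
```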
It looks like xurls only checks for https?, which grabs http and https, instead of allowing arbitrary protocols (e.g., file://, ftp://, steam://, etc.).
Any interest in adding support for arbitrary protocols?
example: https://www.periscope.tv/w/aLtI0DExNjg3MjA3fDFaa0t6REFMck1XSnZLBVQUNYqqaSLxjZicBht_sUEx73i8mL_S7Q8adtkqNw==
(the ending == is left out)