linkedin / url-detector

A Java library to detect and normalize URLs in text

Java 100.00%

url-detector's Introduction

Url Detector

The URL Detector is a library created by the LinkedIn Security Team to detect and extract URLs from a long piece of text.

It can detect URLs in forms such as:

HTML 5 Scheme - //www.linkedin.com
Usernames - user:[email protected]
Email - [email protected]
IPv4 Address - 192.168.1.1/hello.html
IPv4 Octets - 0x00.0x00.0x00.0x00
IPv4 Decimal - http://123123123123/
IPv6 Address - ftp://[::]/hello
IPv4-mapped IPv6 Address - http://[fe30:4:3:0:192.3.2.1]/

Note: For security purposes, it is better to overdetect URLs and check more candidates against blacklists than to miss a URL that was submitted. As a result, some detected strings are not URLs but merely look like them. Also, instead of complying with RFC 3986 (http://www.ietf.org/rfc/rfc3986.txt), the library detects based on browser behavior, optimizing for URLs that can be visited through the address bar of Chrome, Firefox, Internet Explorer, and Safari.
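As a rough illustration of the "overdetect, then check against a blacklist" approach, the sketch below tests a detected host against a set of blocked domains, also matching subdomains. The class HostBlocklist and the method isBlocked are hypothetical names for this example and are not part of the library.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of url-detector.
final class HostBlocklist {
    // Returns true if the host, or any parent domain of it, is in the blocked set.
    // Entries in the set are expected to be lowercase.
    static boolean isBlocked(String host, Set<String> blocked) {
        String candidate = host.toLowerCase(Locale.ROOT);
        while (true) {
            if (blocked.contains(candidate)) {
                return true;
            }
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                return false; // no parent domain left to try
            }
            candidate = candidate.substring(dot + 1); // strip the leftmost label
        }
    }
}
```

Because false positives from the detector simply fail the lookup, overdetection costs little in this arrangement.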

It can also identify the parts of each detected URL. For example, for the URL http://user@linkedin.com:39000/hello?boo=ff#frag:

Scheme - "http"
Username - "user"
Password - null
Host - "linkedin.com"
Port - 39000
Path - "/hello"
Query - "?boo=ff"
Fragment - "#frag"
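To make the breakdown above concrete without depending on the library, here is a small sketch that splits the same URL with the JDK's java.net.URI. It is only an analogy: java.net.URI follows RFC rules rather than browser behavior, and it returns the query and fragment without the leading "?" and "#". The class name UrlPartsSketch is an assumption for this example.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration using the JDK, not url-detector itself.
final class UrlPartsSketch {
    static Map<String, Object> split(String url) {
        URI uri = URI.create(url);

        // User info is "user" or "user:password"; null when absent.
        String userInfo = uri.getUserInfo();
        String username = null;
        String password = null;
        if (userInfo != null) {
            int colon = userInfo.indexOf(':');
            username = colon < 0 ? userInfo : userInfo.substring(0, colon);
            password = colon < 0 ? null : userInfo.substring(colon + 1);
        }

        Map<String, Object> parts = new LinkedHashMap<>();
        parts.put("scheme", uri.getScheme());
        parts.put("username", username);
        parts.put("password", password);
        parts.put("host", uri.getHost());
        parts.put("port", uri.getPort());         // -1 when no port is given
        parts.put("path", uri.getPath());
        parts.put("query", uri.getQuery());       // without the leading "?"
        parts.put("fragment", uri.getFragment()); // without the leading "#"
        return parts;
    }

    public static void main(String[] args) {
        split("http://user@linkedin.com:39000/hello?boo=ff#frag")
            .forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```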

How to Use:

Using the URL Detector library is simple: import the UrlDetector class, then give it the input text and some options. In response, you get a list of the URLs that were detected.

For example, the following code finds the URL linkedin.com:

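    import java.util.List;
    import com.linkedin.urls.Url;
    import com.linkedin.urls.detection.UrlDetector;
    import com.linkedin.urls.detection.UrlDetectorOptions;

    // Scan plain text with the default detection options.
    UrlDetector parser = new UrlDetector("hello this is a url Linkedin.com", UrlDetectorOptions.Default);
    List<Url> found = parser.detect();

    // Print the parts of every URL that was found.
    for (Url url : found) {
        System.out.println("Scheme: " + url.getScheme());
        System.out.println("Host: " + url.getHost());
        System.out.println("Path: " + url.getPath());
    }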

Quote Matching and HTML

Depending on your input string, you may want to handle certain characters in a special way. If you are parsing HTML, for instance, you probably want URLs to end at quotes and angle brackets. Suppose your input looks like this:

<a href="http://linkedin.com/abc">linkedin.com</a>

Here the surrounding quotes and brackets should not end up in the detected URLs. UrlDetectorOptions lets you adjust the sensitivity of detection to your expected input type, so that you detect linkedin.com instead of linkedin.com</a>.

In code this looks like:

    // Inner double quotes must be escaped inside the Java string literal.
    UrlDetector parser = new UrlDetector("<a href=\"linkedin.com/abc\">linkedin.com</a>", UrlDetectorOptions.HTML);
    List<Url> found = parser.detect();

About:

This library was written by the security team at LinkedIn when other options did not exist. Some of the primary authors are:

Vlad Shlosberg
Tzu-Han Jan
Yulia Astakhova

Third Party Dependencies

TestNG

http://testng.org/
Copyright © 2004-2014 Cédric Beust
License: Apache 2.0

Apache CommonsLang3: org.apache.commons:commons-lang3:3.1

http://commons.apache.org/proper/commons-lang/
Copyright © 2001-2014 The Apache Software Foundation
License: Apache 2.0

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the license at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.


url-detector's Issues

Any consideration for a https://uima.apache.org/ plugin?

It might help get visibility as Uima has a lot of other nice plugins.

Regex you are using is detecting a number as a valid URL

The regex you are using detects a number as a valid URL. I tested it with the text: this is a test 1
and the number 1 was detected as a valid URL.

On Android I can't build my project.

Error:PARSE ERROR:
Error:unsupported class file version 52.0
Error:...while parsing com/linkedin/urls/HostNormalizer.class
Error:1 error; aborting
Error:Execution failed for task ':app:transformClassesWithDexForDebug'.

com.android.build.api.transform.TransformException: com.android.ide.common.process.ProcessException: java.util.concurrent.ExecutionException: java.lang.UnsupportedOperationException

Update Artifact on Maven Central

Hey folks, this library is super useful. There is only one release available on Maven Central: https://search.maven.org/artifact/com.linkedin.urls/url-detector/0.1.17/jar

Any chance this could be updated?

Can we use \ ! $ etc - special character check

Default Scheme on url with no scheme

When parsing a URL like "linkedin.com", the url object will add a default scheme of 'http' if one is not detected: URL.getScheme()

I can understand why some defaults were included but it would be nice if this behavior could be configured. I need to know whether the original input text contained the scheme.

I can always do something like url.getOriginalUrl().startsWith(url.getScheme()) but I don't want to have to do that everywhere.
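The workaround above could be wrapped in one helper so it is not repeated everywhere. The sketch below works on plain strings (the values of url.getOriginalUrl() and url.getScheme() mentioned above) and also requires a ":" after the scheme, so that input such as httpbin.org is not mistaken for having an explicit http scheme. SchemeCheck and hasExplicitScheme are hypothetical names, not part of the library.

```java
// Hypothetical helper, not part of url-detector.
final class SchemeCheck {
    // True if originalUrl literally starts with "<scheme>:", ignoring case.
    static boolean hasExplicitScheme(String originalUrl, String scheme) {
        if (originalUrl == null || scheme == null || scheme.isEmpty()) {
            return false;
        }
        int n = scheme.length();
        return originalUrl.length() > n
            && originalUrl.regionMatches(true, 0, scheme, 0, n) // case-insensitive prefix match
            && originalUrl.charAt(n) == ':';
    }
}
```

A call would then look like SchemeCheck.hasExplicitScheme(url.getOriginalUrl(), url.getScheme()).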

Upload latest version to MavenCentral

Would it be possible to upload your latest version to MavenCentral? In particular, we would like to take advantage of ae214b7

Our project that includes URLDetector includes JUnit4 unit tests which seem to be causing problems because of testng.

Thanks

URL-Detector is abandoned?

There are several issues (including fixes in pull requests) that have gone unaddressed for a long time. Could this be handed over to other maintainers? @tzuhanjan can you comment?

Support for local Maven repository installation

In the interim, while issue #2 is being worked on, it would be ideal if it were possible to install the url-detector library in the local Maven repository (typically ~/.m2/repository/), so that other Maven-based build tools can consume the library.

Note that issue #2 is a vastly preferable solution to this problem, but allowing local installation (this issue) provides a short-term workaround.

String: 'http://user:[email protected] host.com' causes exception

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -2
at java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:908)
at java.lang.StringBuilder.substring(StringBuilder.java:76)
at java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:854)
at java.lang.StringBuilder.substring(StringBuilder.java:76)
at com.linkedin.urls.detection.UrlDetector.readDefault(UrlDetector.java:191)
at com.linkedin.urls.detection.UrlDetector.detect(UrlDetector.java:142)
at com.mycompany.url.UrlTest.main(UrlTest.java:26)

Error:(13, 32) error: package org.apache.commons.lang3 does not exist

The Apache Commons Lang dependency is not in build.gradle, really?

StringIndexOutOfBoundsException on particular string

This string (excluding the double quotes) triggers a StringIndexOutOfBoundsException:
"://VIVE MARINE LE PEN//:@."

java.lang.StringIndexOutOfBoundsException: String index out of range: -1
	at java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:908) ~[na:1.8.0_60]
	at java.lang.StringBuilder.substring(StringBuilder.java:76) ~[na:1.8.0_60]
	at java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:854) ~[na:1.8.0_60]
	at java.lang.StringBuilder.substring(StringBuilder.java:76) ~[na:1.8.0_60]
	at com.linkedin.urls.detection.UrlDetector.readDefault(UrlDetector.java:191) ~[url-detector-0.1.17.jar!/:na]
	at com.linkedin.urls.detection.UrlDetector.detect(UrlDetector.java:142) ~[url-detector-0.1.17.jar!/:na]

[email protected] parsed as 2 urls

[email protected] parsed as:

How to ignore only e-mails ?

Long run of periods causes detect() to throw NegativeArraySizeException "Backtracked max amount of characters. Endless loop detected."

    String text = ".............:::::::::::;;;;;;;;;;;;;;;::...............................................:::::::::::::::::::::::::::::....................";
    UrlDetector d = new UrlDetector(text, UrlDetectorOptions.Default);
    d.detect();

Running this will throw
Exception in thread "main" java.lang.NegativeArraySizeException: Backtracked max amount of characters. Endless loop detected. Bad Text: ':...............................................:::::::::::::::::::::::::::::....................'
at com.linkedin.urls.detection.InputTextReader.checkBacktrackLoop(InputTextReader.java:144)
at com.linkedin.urls.detection.InputTextReader.seek(InputTextReader.java:120)
at com.linkedin.urls.detection.UrlDetector.readUserPass(UrlDetector.java:511)
at com.linkedin.urls.detection.UrlDetector.readScheme(UrlDetector.java:458)
at com.linkedin.urls.detection.UrlDetector.processColon(UrlDetector.java:293)
at com.linkedin.urls.detection.UrlDetector.readDefault(UrlDetector.java:253)
at com.linkedin.urls.detection.UrlDetector.detect(UrlDetector.java:142)
at Main.main(Main.java:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
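Until such inputs are handled inside the library, a caller that processes untrusted text may want to guard the call. The sketch below is one possible defensive wrapper: it runs any detection step and falls back to an empty list when an unchecked exception (such as the NegativeArraySizeException above, or a StringIndexOutOfBoundsException) escapes. SafeDetect and orEmpty are hypothetical names, and swallowing the exception is a design choice that trades missed detections for availability.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical wrapper, not part of url-detector.
final class SafeDetect {
    // Runs the detection step; returns an empty list if it throws or returns null.
    static <T> List<T> orEmpty(Supplier<List<T>> detection) {
        try {
            List<T> found = detection.get();
            return found != null ? found : Collections.<T>emptyList();
        } catch (RuntimeException e) {
            // Covers unchecked exceptions such as NegativeArraySizeException
            // and StringIndexOutOfBoundsException; consider logging e here.
            return Collections.<T>emptyList();
        }
    }
}
```

With the library on the classpath, usage might look like SafeDetect.orEmpty(() -> new UrlDetector(text, UrlDetectorOptions.Default).detect()).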

Normalized Url Detected http://null/

Detecting the following url

www.foo1111111111aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.com

works fine with the normal UrlDetector,
but fails when calling NormalizedUrl.create, which returns it as "http://null/".

Using url-detector causes maven projects to bring in TestNG compile dependency

I see that this is fixed in a8dec18 (not in 0.1.17 release on maven central), just creating the issue here to document.

Using URL-Detector with Maven: jitpack stopgap

While #6 and #2 are still being resolved, you can use JitPack as a temporary workaround:

add:

...

<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

....

<dependency>
    <groupId>com.github.linkedin</groupId>
    <artifactId>URL-Detector</artifactId>
    <version>2a0fede05e</version>
</dependency>

to your pom.xml.

URL-DETECTOR fails to detect a valid URL

Executing Url.create("http://013.xxx/");
fails with the following error:

java.net.MalformedURLException: We couldn't find any urls in string: http://013.xxx/
	at com.linkedin.urls.Url.create(Url.java:69)

It looks as if the utility treats the xxx part as an invalid IP address instead of a valid suffix.
Expected result:
The Url should be created, and the host should be 013.xxx

Japanese Characters cause the entire string to be detected as a URL

If you run the detector on the text below, it treats the whole text as a URL.

我进入你的主页很卡顿，也许是你的关注人数或者其他数据太多了，其他人主页没有这么卡顿。来自amethyst客户端

Characters 。 and ， are single characters and are not considered spaces in this library.
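One possible caller-side workaround is to replace such punctuation with ASCII spaces before handing the text to the detector. The sketch below does a character-for-character replacement, so offsets into the original text stay valid. The class CjkPunctuation, the method toSpaces, and the chosen set of characters are assumptions for this example, and whether this fully resolves the detection depends on how the library splits on spaces.

```java
// Hypothetical pre-processing helper, not part of url-detector.
final class CjkPunctuation {
    // Ideographic full stop and comma, plus full-width , ; : ! ?
    private static final String DELIMITERS = "\u3002\u3001\uFF0C\uFF1B\uFF1A\uFF01\uFF1F";

    // Replaces each listed punctuation character with a single ASCII space.
    static String toSpaces(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out.append(DELIMITERS.indexOf(c) >= 0 ? ' ' : c);
        }
        return out.toString();
    }
}
```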

It would be really nice to have on Maven Central

Thanks for the logic! Any chance you're planning on deploying to Maven central?

Some valid schemes are ignored

Thanks for a very useful library.

I note that the list of valid schemes is fairly small, which means that a URL with a file: scheme is not parsed correctly and comes back with http as the default scheme. Could you add file: to the list of valid schemes, or perhaps create an option that returns anything that looks like a scheme, with the addition of something like boolean isKnownSchema()?

Cheers.

False alarm detecting URLs

If I have a text that contains 10.00hr, it is considered a URL.

runTest("10.00hr,", UrlDetectorOptions.Default);
It should return an empty result, but the result is [http://10.00hr]
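Since the library overdetects on purpose, callers who want fewer false positives of this kind could post-filter the detected hosts. The sketch below accepts a host only if it is a dotted-quad IPv4 address or ends in a label made purely of letters; it would reject 10.00hr and a bare 1 while keeping linkedin.com. HostFilter and isPlausibleHost are hypothetical names, and the rule is deliberately strict: it also rejects forms the library supports, such as hex or decimal IPv4 addresses and IPv6 literals, so treat it as a starting point only.

```java
import java.util.regex.Pattern;

// Hypothetical post-filter, not part of url-detector.
final class HostFilter {
    // Dotted-quad IPv4 with each octet in 0-255.
    private static final Pattern IPV4 = Pattern.compile(
        "((25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)\\.){3}(25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)");

    // Final label consisting of at least two ASCII letters.
    private static final Pattern ALPHA_LABEL = Pattern.compile("[A-Za-z]{2,}");

    static boolean isPlausibleHost(String host) {
        if (host == null || host.isEmpty()) {
            return false;
        }
        if (IPV4.matcher(host).matches()) {
            return true;
        }
        int lastDot = host.lastIndexOf('.');
        if (lastDot <= 0 || lastDot == host.length() - 1) {
            return false; // no dot, leading dot, or trailing dot
        }
        return ALPHA_LABEL.matcher(host.substring(lastDot + 1)).matches();
    }
}
```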