
url-detector's Introduction

Url Detector

The URL detector is a library created by the LinkedIn Security Team to detect and extract URLs in a long piece of text.

It is able to find and detect any urls such as:

Note: Keep in mind that for security purposes, it's better to overdetect URLs and check more against blacklists than to miss a URL that was submitted. As such, some things that we detect might not be URLs but merely look like URLs. Also, instead of complying with RFC 3986 (http://www.ietf.org/rfc/rfc3986.txt), we try to detect based on browser behavior, optimizing detection for URLs that are visitable through the address bar of Chrome, Firefox, Internet Explorer, and Safari.

It is also able to identify the parts of the identified URLs. For example, for the URL http://user@linkedin.com:39000/hello?boo=ff#frag:

  • Scheme - "http"
  • Username - "user"
  • Password - null
  • Host - "linkedin.com"
  • Port - 39000
  • Path - "/hello"
  • Query - "?boo=ff"
  • Fragment - "#frag"
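For comparison, here is a minimal sketch of the same decomposition using the JDK's own java.net.URI rather than this library. The class name UrlPartsSketch is made up for illustration. Note two differences from the listing above: java.net.URI returns the query and fragment without the leading "?" and "#", and it exposes the username and password together as a single "user info" string.

```java
import java.net.URI;

// Illustration only: uses the JDK's java.net.URI, not the url-detector API.
final class UrlPartsSketch {

    /** Splits a URL into its parts with java.net.URI and formats them on one line. */
    static String describe(String text) {
        URI uri = URI.create(text);
        return "scheme=" + uri.getScheme()
                + " userInfo=" + uri.getUserInfo()
                + " host=" + uri.getHost()
                + " port=" + uri.getPort()
                + " path=" + uri.getPath()
                + " query=" + uri.getQuery()
                + " fragment=" + uri.getFragment();
    }

    public static void main(String[] args) {
        // prints: scheme=http userInfo=user host=linkedin.com port=39000 path=/hello query=boo=ff fragment=frag
        System.out.println(describe("http://user@linkedin.com:39000/hello?boo=ff#frag"));
    }
}
```

Unlike this library, java.net.URI is strict about syntax, so it is not a substitute for detecting URLs inside free text.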

How to Use:

Using the URL detector library is simple: import the UrlDetector object and give it some options. In response, you will get a list of the URLs that were detected.

For example, the following code will find the URL linkedin.com:

    UrlDetector parser = new UrlDetector("hello this is a url Linkedin.com", UrlDetectorOptions.Default);
    List<Url> found = parser.detect();

    for(Url url : found) {
        System.out.println("Scheme: " + url.getScheme());
        System.out.println("Host: " + url.getHost());
        System.out.println("Path: " + url.getPath());
    }

Quote Matching and HTML

Depending on your input string, you may want to handle certain characters in a special way. If you are parsing HTML, for instance, you probably want to break out of things like quotes and brackets. Suppose your input looks like:

<a href="http://linkedin.com/abc">linkedin.com</a>

You probably want to make sure that the quotes and brackets are treated as delimiters rather than as part of the URL. For that reason, UrlDetectorOptions lets you change the sensitivity level of detection based on your expected input type. This way you can detect linkedin.com instead of linkedin.com</a>.

In code this looks like:

    UrlDetector parser = new UrlDetector("<a href=\"linkedin.com/abc\">linkedin.com</a>", UrlDetectorOptions.HTML);
    List<Url> found = parser.detect();

About:

This library was written by the security team at LinkedIn when other options did not exist. Some of the primary authors are:


Third Party Dependencies

  • TestNG
  • Apache Commons Lang 3: org.apache.commons:commons-lang3:3.1


License

Copyright 2015 LinkedIn Corp. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the license at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

url-detector's People

Contributors

jotomo, kanishkrastogi-lnkd, tzuhanjan

url-detector's Issues

On Android I can't build my project.

Error:PARSE ERROR:
Error:unsupported class file version 52.0
Error:...while parsing com/linkedin/urls/HostNormalizer.class
Error:1 error; aborting
Error:Execution failed for task ':app:transformClassesWithDexForDebug'.

com.android.build.api.transform.TransformException: com.android.ide.common.process.ProcessException: java.util.concurrent.ExecutionException: java.lang.UnsupportedOperationException
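For context, class file version 52.0 is the format produced when compiling for Java 8, so the error suggests the Android toolchain in use could not process Java 8 bytecode. Below is a minimal sketch for checking which version a .class file targets by reading its 8-byte header. The class and method names are made up for illustration and are not part of this library.

```java
import java.nio.ByteBuffer;

// Illustration only: reads the version fields from a class file header.
final class ClassFileVersionSketch {

    /** Returns the major version stored in bytes 6-7 of a class file header. */
    static int majorVersion(byte[] header) {
        ByteBuffer buf = ByteBuffer.wrap(header); // big-endian by default
        if (header.length < 8 || buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file header");
        }
        buf.getShort();                 // minor version, unused here
        return buf.getShort() & 0xFFFF; // major version as an unsigned value
    }

    /** Maps a major version to a Java release (valid for Java 1.2 and later). */
    static int javaRelease(int major) {
        return major - 44; // 52 -> Java 8, 51 -> Java 7
    }

    public static void main(String[] args) {
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52};
        // prints: Java 8
        System.out.println("Java " + javaRelease(majorVersion(header)));
    }
}
```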

Default Scheme on url with no scheme

When parsing a URL like "linkedin.com", the url object will add a default scheme of 'http' if one is not detected: URL.getScheme()

I can understand why some defaults were included but it would be nice if this behavior could be configured. I need to know whether the original input text contained the scheme.

I can always do something like url.getOriginalUrl().startsWith(url.getScheme()) but I don't want to have to do that everywhere.
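Until such an option exists, the check can at least be centralized in one helper. The sketch below is a hypothetical utility, not part of the library. It takes the two strings that url.getOriginalUrl() and url.getScheme() would return, and it also requires the ":" after the scheme, because a plain startsWith check would wrongly report an explicit scheme for input such as "httpbin.org".

```java
// Illustration only: a hypothetical helper, not part of the url-detector API.
final class SchemeCheckSketch {

    /**
     * Returns true when originalText itself starts with "scheme:"
     * (case-insensitive), i.e. the scheme was typed rather than defaulted.
     */
    static boolean hasExplicitScheme(String originalText, String scheme) {
        String prefix = scheme + ":";
        return originalText.regionMatches(true, 0, prefix, 0, prefix.length());
    }

    public static void main(String[] args) {
        System.out.println(hasExplicitScheme("linkedin.com", "http"));        // false
        System.out.println(hasExplicitScheme("http://linkedin.com", "http")); // true
        System.out.println(hasExplicitScheme("httpbin.org", "http"));         // false
    }
}
```

With the library, the call would presumably look like hasExplicitScheme(url.getOriginalUrl(), url.getScheme()).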

Upload latest version to MavenCentral

Would it be possible to upload your latest version to MavenCentral? In particular, we would like to take advantage of ae214b7

Our project that includes URLDetector includes JUnit4 unit tests which seem to be causing problems because of testng.

Thanks

URL-Detector is abandoned?

There are several issues (including fixes in pull requests) that have gone unaddressed for a long time. Could this be handed over to other maintainers? @tzuhanjan can you comment?

Support for local Maven repository installation

In the interim, while issue #2 is being worked on, it would be ideal if it were possible to install the url-detector library in the local Maven repository (typically ~/.m2/repository/), so that other Maven-based build tools can consume the library.

Note that issue #2 is a vastly preferable solution to this problem, but allowing local installation (this issue) provides a short-term workaround.

String: 'http://user:[email protected] host.com' causes exception

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -2
at java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:908)
at java.lang.StringBuilder.substring(StringBuilder.java:76)
at java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:854)
at java.lang.StringBuilder.substring(StringBuilder.java:76)
at com.linkedin.urls.detection.UrlDetector.readDefault(UrlDetector.java:191)
at com.linkedin.urls.detection.UrlDetector.detect(UrlDetector.java:142)
at com.mycompany.url.UrlTest.main(UrlTest.java:26)

StringIndexOutOfBoundsException on particular string

This string (excluding the double quotes) triggers a StringIndexOutOfBoundsException:
"://VIVE MARINE LE PEN//:@."

java.lang.StringIndexOutOfBoundsException: String index out of range: -1
	at java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:908) ~[na:1.8.0_60]
	at java.lang.StringBuilder.substring(StringBuilder.java:76) ~[na:1.8.0_60]
	at java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:854) ~[na:1.8.0_60]
	at java.lang.StringBuilder.substring(StringBuilder.java:76) ~[na:1.8.0_60]
	at com.linkedin.urls.detection.UrlDetector.readDefault(UrlDetector.java:191) ~[url-detector-0.1.17.jar!/:na]
	at com.linkedin.urls.detection.UrlDetector.detect(UrlDetector.java:142) ~[url-detector-0.1.17.jar!/:na]

Long run of periods causes detect() to throw NegativeArraySizeException "Backtracked max amount of characters. Endless loop detected."

    String text = ".............:::::::::::;;;;;;;;;;;;;;;::...............................................:::::::::::::::::::::::::::::....................";
    UrlDetector d = new UrlDetector(text, UrlDetectorOptions.Default);
    d.detect();

Running this will throw:

Exception in thread "main" java.lang.NegativeArraySizeException: Backtracked max amount of characters. Endless loop detected. Bad Text: ':...............................................:::::::::::::::::::::::::::::....................'
at com.linkedin.urls.detection.InputTextReader.checkBacktrackLoop(InputTextReader.java:144)
at com.linkedin.urls.detection.InputTextReader.seek(InputTextReader.java:120)
at com.linkedin.urls.detection.UrlDetector.readUserPass(UrlDetector.java:511)
at com.linkedin.urls.detection.UrlDetector.readScheme(UrlDetector.java:458)
at com.linkedin.urls.detection.UrlDetector.processColon(UrlDetector.java:293)
at com.linkedin.urls.detection.UrlDetector.readDefault(UrlDetector.java:253)
at com.linkedin.urls.detection.UrlDetector.detect(UrlDetector.java:142)
at Main.main(Main.java:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
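The crash reports above all surface as unchecked exceptions thrown from detect(). As a stopgap on the caller's side, detection can be wrapped so that one bad input does not abort processing. This is a minimal sketch with made-up names (SafeDetectSketch, detectOrEmpty); it is generic so that it does not depend on the library, and it does not fix the underlying bugs.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Illustration only: a hypothetical caller-side guard, not part of the url-detector API.
final class SafeDetectSketch {

    /**
     * Runs a detection call and returns an empty list if it throws an
     * unchecked exception such as StringIndexOutOfBoundsException or
     * NegativeArraySizeException.
     */
    static <T> List<T> detectOrEmpty(Supplier<List<T>> detection) {
        try {
            return detection.get();
        } catch (RuntimeException e) {
            // Assumed policy: log the failure and report no URLs for this input.
            System.err.println("URL detection failed: " + e);
            return Collections.emptyList();
        }
    }
}
```

With the library, usage would presumably look like detectOrEmpty(() -> new UrlDetector(text, UrlDetectorOptions.Default).detect()). In a security context, silently treating a failed input as containing no URLs may be the wrong default; flagging such inputs for review is a reasonable alternative.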

Normalized Url Detected http://null/

Detecting the following url

www.foo1111111111aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.com
  • works ok for normal UrlDetector
  • fails when calling NormalizedUrl.create, which returns it as "http://null/"

Using URL-Detector with Maven: jitpack stopgap

While #6 and #2 are still being resolved, you can use JitPack as a temporary workaround:

Add:

...

    <repositories>
        <repository>
            <id>jitpack.io</id>
            <url>https://jitpack.io</url>
        </repository>
    </repositories>

....

    <dependency>
        <groupId>com.github.linkedin</groupId>
        <artifactId>URL-Detector</artifactId>
        <version>2a0fede05e</version>
    </dependency>

to your pom.xml.

URL-DETECTOR fails to detect a valid URL

Executing the following code: Url.create("http://013.xxx/");
results in the following error:

java.net.MalformedURLException: We couldn't find any urls in string: http://013.xxx/
	at com.linkedin.urls.Url.create(Url.java:69)

It looks as if the utility treats the xxx part as an invalid IP address instead of a valid suffix.
Expected result:
The Url should be created, and the host should be 013.xxx

Japanese Characters cause the entire string to be detected as a URL

If you run the detector on the text below, it thinks the whole text is a URL.

我进入你的主页很卡顿,也许是你的关注人数或者其他数据太多了,其他人主页没有这么卡顿。来自amethyst客户端

Characters 。 and , are single characters and are not considered spaces in this library.
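One possible workaround, sketched below, is to replace full-width punctuation with ASCII spaces before handing the text to the detector. The class name and the list of characters are assumptions made for illustration, and whether this fully avoids the reported behavior has not been verified against the library. Because each character is replaced one-for-one, offsets into the original text stay valid.

```java
// Illustration only: a hypothetical pre-processing step, not part of the url-detector API.
final class CjkPunctuationSketch {

    // Assumed separator list: ideographic full stop, full-width comma, ideographic comma,
    // full-width semicolon, colon, exclamation mark, question mark, ideographic space.
    private static final String SEPARATORS =
            "\u3002\uFF0C\u3001\uFF1B\uFF1A\uFF01\uFF1F\u3000";

    /** Replaces each listed character with an ASCII space, keeping the length unchanged. */
    static String toAsciiSpaces(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out.append(SEPARATORS.indexOf(c) >= 0 ? ' ' : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: a b c
        System.out.println(toAsciiSpaces("a\u3002b\uFF0Cc"));
    }
}
```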

Some valid schemes are ignored

Thanks for a very useful library.

I note that the list of valid schemes is fairly small, which means that a URL with a file: scheme is not parsed correctly and gives back http as the default scheme. Could you add file: to the list of valid schemes, or perhaps create an option that allows anything that looks like a scheme to be returned, perhaps with the addition of something like boolean isKnownSchema()?

Cheers.
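The request above can be sketched outside the library as follows. The pattern follows the RFC 3986 scheme syntax (a letter followed by letters, digits, "+", "-" or "."), while the class name, method names and the set of known schemes are assumptions made for illustration; the library's actual list is not shown here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a hypothetical helper, not part of the url-detector API.
final class SchemeSketch {

    // RFC 3986 scheme syntax: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), then ":".
    private static final Pattern SCHEME = Pattern.compile("^([A-Za-z][A-Za-z0-9+.-]*):");

    // Assumed allow-list for the example.
    private static final Set<String> KNOWN =
            new HashSet<>(Arrays.asList("http", "https", "ftp", "file"));

    /** Returns the lower-cased scheme at the start of the text, if there is one. */
    static Optional<String> schemeOf(String text) {
        Matcher m = SCHEME.matcher(text);
        return m.find() ? Optional.of(m.group(1).toLowerCase(Locale.ROOT)) : Optional.empty();
    }

    /** Returns true when the text starts with a scheme from the allow-list. */
    static boolean isKnownScheme(String text) {
        return schemeOf(text).map(KNOWN::contains).orElse(false);
    }
}
```

One limitation to keep in mind: input such as "linkedin.com:8080" also matches the scheme syntax, so a real implementation would need extra context to tell a scheme from a host followed by a port.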

False alarm detecting URLs

If I have text that contains 10.00hr, it is considered a URL.

runTest("10.00hr,", UrlDetectorOptions.Default);
It should return an empty list, but the result is [http://10.00hr]
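Since the library deliberately overdetects, one way to handle cases like this is a post-filter on the detected host. The sketch below is a heuristic with made-up names, not library behavior: it accepts a host only if its last label is alphabetic (or a punycode "xn--" label), or if the host is a dotted IPv4 address. Under that rule 10.00hr is rejected while 013.xxx is still accepted.

```java
// Illustration only: a hypothetical post-filter, not part of the url-detector API.
final class HostFilterSketch {

    /** Heuristic check on a host string such as the one returned by url.getHost(). */
    static boolean looksLikeHost(String host) {
        if (host.matches("\\d{1,3}(\\.\\d{1,3}){3}")) {
            return true; // dotted-quad IPv4 (octet range not checked)
        }
        int dot = host.lastIndexOf('.');
        if (dot < 0 || dot == host.length() - 1) {
            return false; // no dot, or a trailing dot with an empty last label
        }
        String tld = host.substring(dot + 1);
        return tld.matches("[A-Za-z]{2,}") || tld.matches("(?i)xn--[a-z0-9-]+");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeHost("10.00hr"));      // false
        System.out.println(looksLikeHost("013.xxx"));      // true
        System.out.println(looksLikeHost("linkedin.com")); // true
    }
}
```

A filter like this trades some recall for precision, so it fits display or linkification use cases better than blacklist checking, where overdetection is the safer default.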
