smola / galimatias

galimatias is a URL parsing and normalization library written in Java.

Home Page: http://galimatias.mola.io

License: MIT License

Java 100.00%

url-parsing url url-parser validator

galimatias's Introduction

galimatias

Design goals

Parse URLs as browsers do, optionally enforcing compliance with old standards (i.e. RFC 3986, RFC 2396).
Stay as close as possible to WHATWG's URL Standard.
Convenient fluent API with immutable URL objects.
Interoperable with java.net.URL and java.net.URI.
Minimal dependencies.

Gotchas

galimatias is not a generic URI parser. It can parse any URI, but only schemes defined in the URL Standard (i.e. http, https, ftp, ws, wss, gopher, file) will be parsed as hierarchical URIs. For example, in git://github.com/smola/galimatias.git you'll be able to extract the scheme (i.e. git) and the scheme data (i.e. //github.com/smola/galimatias.git), but not the host (i.e. github.com). This is intended. We cannot guarantee that applying a set of generic rules won't break certain kinds of URIs, so we do not try to parse them. I will consider adding further support for other schemes if enough people provide solid use cases and testing. You can check this issue if you are interested.
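To make the scheme / scheme data split concrete, here is a minimal standalone sketch. It does not use galimatias and is not its implementation; the class SchemeSplitSketch and its split method are hypothetical names used only for illustration.

```java
import java.util.Locale;

// Hypothetical illustration: split an input into scheme and scheme data only.
final class SchemeSplitSketch {

    /** Returns {scheme, schemeData}, or null when the input has no valid scheme. */
    static String[] split(String input) {
        int colon = input.indexOf(':');
        if (colon <= 0 || !isScheme(input.substring(0, colon))) {
            return null;
        }
        return new String[] {
            input.substring(0, colon).toLowerCase(Locale.ROOT),
            input.substring(colon + 1)
        };
    }

    // A scheme starts with an ASCII letter, followed by letters, digits, '+', '-' or '.'.
    private static boolean isScheme(String s) {
        if (!isAsciiAlpha(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean ok = isAsciiAlpha(c) || (c >= '0' && c <= '9')
                    || c == '+' || c == '-' || c == '.';
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    private static boolean isAsciiAlpha(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    public static void main(String[] args) {
        String[] parts = split("git://github.com/smola/galimatias.git");
        System.out.println(parts[0]); // git
        System.out.println(parts[1]); // //github.com/smola/galimatias.git
    }
}
```

No host is extracted here: everything after the first colon is kept as opaque scheme data.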

But, why?

galimatias started out of frustration with java.net.URL and java.net.URI. Both of them are good for basic use cases, but severely broken for others:

java.net.URL.equals() is broken.
java.net.URI can parse only RFC 2396 URI syntax. java.net.URI will only parse a URI if it's strictly compliant with RFC 2396. Most URLs found in the wild do not comply with any syntax standard, and RFC 2396 is outdated anyway.
java.net.URI is not protocol-aware. http://example.com, http://example.com/ and http://example.com:80 are different entities.
Manipulation is a pain. I haven't seen any URL manipulation code using java.net.URL or java.net.URI that is simple and concise.
Not IDN ready. Java has IDN support with java.net.IDN, but this does not apply to java.net.URL or java.net.URI.
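Some of the points above can be checked with the JDK alone. The following sketch uses only java.net.URI and java.net.IDN; the class name JavaNetPitfalls and its helper methods are made up for this example. java.net.URL.equals() is left out on purpose, since it may trigger name resolution.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URISyntaxException;

final class JavaNetPitfalls {

    // java.net.URI compares components literally, with no knowledge of default ports or paths.
    static boolean sameUri(String a, String b) {
        return URI.create(a).equals(URI.create(b));
    }

    // java.net.URI rejects anything that is not strictly valid, such as a raw space in the path.
    static boolean parsesAsUri(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(sameUri("http://example.com", "http://example.com/"));   // false
        System.out.println(sameUri("http://example.com", "http://example.com:80")); // false
        System.out.println(parsesAsUri("http://example.com/a b"));                  // false
        // IDN support exists, but it has to be applied by hand to the host.
        System.out.println(IDN.toASCII("b\u00FCcher.example")); // xn--bcher-kva.example
    }
}
```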

Setup with Maven

galimatias is available at Maven Central. Just add the following to the <dependencies> section of your pom.xml:

<dependency>
  <groupId>io.mola.galimatias</groupId>
  <artifactId>galimatias</artifactId>
  <version>0.2.1</version>
</dependency>

Development snapshots are also available at Sonatype OSS Snapshots repository.

Getting started

Parse a URL

// Parse
String urlString = //...
URL url;
try {
  url = URL.parse(urlString);
} catch (GalimatiasParseException ex) {
  // Do something with non-recoverable parsing error
}

Convert to java.net.URL

URL url = //...
java.net.URL javaURL;
try {
  javaURL = url.toJavaURL();
} catch (MalformedURLException ex) {
  // This can happen if scheme is not http, https, ftp, file or jar.
}

Convert to java.net.URI

URL url = //...
java.net.URI javaURI;
try {
  javaURI = url.toJavaURI();
} catch (URISyntaxException ex) {
  // This will happen in rare cases such as "foo://"
}

Parse a URL with strict error handling

You can use a strict error handler that will throw an exception on any invalid URL, even if it's a recoverable error.

URLParsingSettings settings = URLParsingSettings.create()
  .withErrorHandler(StrictErrorHandler.getInstance());
URL url = URL.parse(settings, urlString);

Documentation

Check out the Javadoc.

Contribute

Did you find a bug? Report it on GitHub.

Did you write a patch? Send a pull request.

Something else? Email me at [email protected].

License

galimatias is released under the terms of the MIT License.

galimatias's Issues

Distinguish between different types of parse exceptions

I'm using your library to take URLs supplied by users, which may contain invalid syntax such as spaces, and convert them to valid URIs with URL.parse(it).toJavaURI().

One additional case that I'd like to cover is when the user leaves out the scheme. In this case, I would like to default it to http. Currently, I'm doing this by catching GalimatiasParseException and checking whether the exception's message is "Missing scheme".

This is working well for me, but checking the exception's message is very brittle. I'd like to suggest that there be a few subclasses of GalimatiasParseException, including something like MissingSchemeException, so that it can be captured in a safer way.

I'd be happy to submit a pull request if you think that this is a worthwhile enhancement.
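A rough sketch of what the suggestion could look like, using stand-in types rather than the real galimatias classes. ParseException, MissingSchemeException, parse and parseWithDefaultScheme are all hypothetical, and the stand-in parser simply treats any input without "://" as missing its scheme.

```java
// Hypothetical stand-ins: none of these types exist in galimatias under these names.
final class MissingSchemeSketch {

    static class ParseException extends Exception {
        ParseException(String message) {
            super(message);
        }
    }

    // A dedicated subclass lets callers catch this one case without inspecting messages.
    static class MissingSchemeException extends ParseException {
        MissingSchemeException(String input) {
            super("Missing scheme: " + input);
        }
    }

    // Stand-in parser, deliberately simplistic.
    static String parse(String input) throws ParseException {
        if (input.isEmpty()) {
            throw new ParseException("Empty input");
        }
        if (!input.contains("://")) {
            throw new MissingSchemeException(input);
        }
        return input;
    }

    // Retries once with a default scheme when, and only when, the scheme was missing.
    static String parseWithDefaultScheme(String input, String defaultScheme) {
        try {
            try {
                return parse(input);
            } catch (MissingSchemeException e) {
                return parse(defaultScheme + "://" + input);
            }
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseWithDefaultScheme("example.com/path", "http")); // http://example.com/path
        System.out.println(parseWithDefaultScheme("https://example.com", "http")); // https://example.com
    }
}
```

Catching the subclass keeps the fallback independent of the exact wording of the error message.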

Provide command line client

A command line client would be a useful tool for URL debugging, as well as a good showcase for galimatias.

Provide meaningful error messages

Right now a lot of them are just "PARSE ERROR".

Multiple encoding issues must be reviewed

Review all UTF-8 and alternative encodings stuff.

Optionally normalize empty path segments ("//" and trailing slash in path)

Multiple slashes

Multiple slashes together are ok with standards and have a different meaning than just one slash. That is: "/foo/bar" should be translated to path segments "foo" and "bar", while "/foo//bar" is "foo", "" and "bar".
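The segment semantics described above can be reproduced with a plain split. This is a standalone sketch with a hypothetical helper name, not galimatias code.

```java
import java.util.Arrays;
import java.util.List;

final class PathSegmentsSketch {

    // Splits an absolute path into segments, keeping empty segments.
    static List<String> segments(String path) {
        if (!path.startsWith("/")) {
            throw new IllegalArgumentException("expected an absolute path: " + path);
        }
        // The limit -1 keeps trailing empty strings, so "/foo/" yields "foo" and "".
        return Arrays.asList(path.substring(1).split("/", -1));
    }

    public static void main(String[] args) {
        System.out.println(segments("/foo/bar"));  // [foo, bar]
        System.out.println(segments("/foo//bar")); // [foo, , bar]
    }
}
```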

Some people use significant empty segments in their paths (see this). However, the most common case is that multiple slashes are not significant and are produced as an unintended consequence of bad serialization.

Trailing slash

It's generally accepted that a trailing slash can be added to a URL path if there is no "file extension" (e.g. /foo -> /foo/ but not /foo.html -> /foo.html/). However, that changes semantics according to RFC 3986 and might break well-formed URLs in lots of cases.

Further considerations

Both of these normalizations can break standard-compliant URLs. So they should be optional and the user should be warned. Also, when to perform this normalization (during parsing or after parsing) is important, since it can change the result of /../.
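To see why the ordering matters, the sketch below applies the same two steps to /a//../b in both orders. The dot-segment removal is a simplified version written for this example (it ignores the trailing-slash subtleties of RFC 3986), and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NormalizationOrderSketch {

    static List<String> split(String path) {
        return new ArrayList<>(Arrays.asList(path.substring(1).split("/", -1)));
    }

    // Simplified dot-segment removal: "." is dropped, ".." removes the previous segment.
    static List<String> removeDotSegments(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String s : in) {
            if (s.equals(".")) {
                continue;
            }
            if (s.equals("..")) {
                if (!out.isEmpty()) {
                    out.remove(out.size() - 1);
                }
                continue;
            }
            out.add(s);
        }
        return out;
    }

    // The optional normalization discussed above: drop empty segments.
    static List<String> dropEmpty(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String s : in) {
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    static String join(List<String> segments) {
        return "/" + String.join("/", segments);
    }

    static String dotsThenCollapse(String path) {
        return join(dropEmpty(removeDotSegments(split(path))));
    }

    static String collapseThenDots(String path) {
        return join(removeDotSegments(dropEmpty(split(path))));
    }

    public static void main(String[] args) {
        System.out.println(dotsThenCollapse("/a//../b")); // /a/b
        System.out.println(collapseThenDots("/a//../b")); // /b
    }
}
```

In the first order ".." consumes the empty segment; in the second it consumes "a", so the two results point at different resources.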

The proper way to handle these cases (as Google seems to do) is to normalize according to the result of fetching the URL, following redirects and honouring <link rel="canonical">.

Because of all of this, I still doubt that providing these normalizations in Galimatias is a sane choice.

Add encoding/decoding utils

Utils that we should provide:

encodePathSegment / decodePathSegment
encodePath / decodePath
encodeQuery / decodeQuery
encodeFragment / decodeFragment
encodeSchemeData / decodeSchemeData ?

Host encoding/decoding is already provided through the Host class and its subclasses.
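As a starting point for discussion, here is a minimal sketch of one such pair. The method names come from the list above, but the implementation is an assumption: it percent-encodes every UTF-8 byte outside the RFC 3986 unreserved set, which is stricter than the encode sets of the URL Standard, and it leaves malformed escapes untouched when decoding.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class PercentCodecSketch {

    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    private static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9') || c == '-' || c == '.' || c == '_' || c == '~';
    }

    private static int hexValue(byte b) {
        if (b >= '0' && b <= '9') return b - '0';
        if (b >= 'A' && b <= 'F') return b - 'A' + 10;
        if (b >= 'a' && b <= 'f') return b - 'a' + 10;
        return -1;
    }

    // Encodes each UTF-8 byte that is not an unreserved ASCII character.
    static String encodePathSegment(String segment) {
        StringBuilder sb = new StringBuilder();
        for (byte b : segment.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (isUnreserved(v)) {
                sb.append((char) v);
            } else {
                sb.append('%').append(HEX[v >> 4]).append(HEX[v & 0x0F]);
            }
        }
        return sb.toString();
    }

    // Decodes %XX escapes to bytes and interprets the result as UTF-8.
    static String decodePathSegment(String encoded) {
        byte[] in = encoded.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '%' && i + 2 < in.length
                    && hexValue(in[i + 1]) >= 0 && hexValue(in[i + 2]) >= 0) {
                out.write(hexValue(in[i + 1]) * 16 + hexValue(in[i + 2]));
                i += 2;
            } else {
                out.write(in[i]); // malformed or plain byte: copy as-is
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encodePathSegment("a b/c"));     // a%20b%2Fc
        System.out.println(decodePathSegment("a%20b%2Fc")); // a b/c
    }
}
```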

Create site and user manual

Regression in error reporting for bad fragments

For illegal characters in fragments, galimatias now unexpectedly reports "Illegal character in path segment" rather than "Illegal character in fragment" as expected.

v0.2.1

$ java -cp dependencies/galimatias-0.2.1.jar:dependencies/icu4j-53_1.jar io.mola.galimatias.cli.CLI http://foo/path#f#g
Base: http://example.org/foo/bar
Analyzing URL: http://foo/path#f#g
Parsing...
    Recoverable error found;
        Error: Illegal character in path segment: not a URL code point
        Position: 17

v0.1.0

$ java -cp dependencies/galimatias-0.1.0.jar:dependencies/icu4j-53_1.jar io.mola.galimatias.cli.CLI http://foo/path#f#g
Base: http://example.org/foo/bar
Analyzing URL: http://foo/path#f#g
Parsing...
    Recoverable error found;
        Error: Illegal character in fragment: not a URL code point
        Position: 17

The actual results of parsing for this case are the same with v0.2.1 as with v0.1.0; the only difference is the error message reporting "path segment" instead of "fragment".

Deploy releases to Maven Central

Percent-decode characters that do not require percent-encoding

In order to produce more normalized URLs, it might be a good idea to percent-decode characters that do not require percent-encoding.

See the corresponding issue in WHATWG URL's bug tracker: https://www.w3.org/Bugs/Public/show_bug.cgi?id=24164
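One possible shape for this normalization, sketched without galimatias: escapes of RFC 3986 unreserved characters are decoded, and every other escape is kept, with its hex digits upper-cased. The choice of the unreserved set and the helper name decodeUnreserved are assumptions made for this example.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnreservedDecodeSketch {

    private static final Pattern ESCAPE = Pattern.compile("%([0-9A-Fa-f]{2})");

    private static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9') || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // Decodes only escapes that never needed encoding; other escapes keep their meaning.
    static String decodeUnreserved(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int value = Integer.parseInt(m.group(1), 16);
            String replacement = isUnreserved(value)
                    ? String.valueOf((char) value)
                    : "%" + m.group(1).toUpperCase(Locale.ROOT);
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeUnreserved("/%7Euser/%41%2fb")); // /~user/A%2Fb
    }
}
```

Because each escape is handled in a single pass, an input such as %2541 stays %2541 and is not decoded twice.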

Add canonicalization utils

Check special characters to be allowed in hostnames (_, +, $)

Some real subdomains use "invalid" symbols such as underscore (_) and it actually works on most browsers.

More info:
https://bugzilla.mozilla.org/show_bug.cgi?id=479520#c49

Add URL.toHumanString method

Add a method to convert a URL to a human-understandable String. That is, domains converted to Unicode and printable characters percent-decoded.
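A rough idea of the intended output, built only on JDK classes. This is a sketch, not the proposed method: it handles just scheme, host and path, relies on java.net.URI.getPath() to percent-decode the path as UTF-8, and the name toHumanString is reused here purely for illustration.

```java
import java.net.IDN;
import java.net.URI;

final class HumanStringSketch {

    // Display-only form: Unicode host and percent-decoded path. Not safe to parse back.
    static String toHumanString(String url) {
        URI uri = URI.create(url);
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("expected an absolute URL with a host: " + url);
        }
        String path = uri.getPath() == null ? "" : uri.getPath();
        return uri.getScheme() + "://" + IDN.toUnicode(uri.getHost()) + path;
    }

    public static void main(String[] args) {
        // Prints the host as "bücher.example" and the path as "/café".
        System.out.println(toHumanString("http://xn--bcher-kva.example/caf%C3%A9"));
    }
}
```

Decoding everything, including escapes such as %2F, can change the meaning of the URL, so the result is only suitable for showing to people.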

Release v0.2.2

I would like to use the new URL.searchParameters method, but it's not in 0.2.1. Could you make a new release?

Refine Host API

Checking if a host is a domain or IP shouldn't require instanceof.

Process hashbangs

A hashbang (#!) should be converted to / by a crawler that needs to fetch the page without using JavaScript.

IP's are parsed (converted back to String) incorrectly (backwards).

Example: System.out.println(URL.parse("http://127.0.0.1").toString());

Output: http://1.0.0.127/

This affects all IPs: they are converted back to String backwards.

The toJavaURL() and toJavaURI() methods give the same result.

IPv4 hosts handling

Currently, a Host can be either a Domain or an IPv6Address. Determine behaviour for IPv4Address.

Look into Guava implementation of IDN

http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/net/InternetDomainName.html

Is it faster than ICU4J? Is it up to date with standards?

IPv6Address.toString() should wrap address with []

IPv6 addresses are wrapped in [] when serialized as part of a URL, but they are not wrapped when printed as standalone entities.
As pointed out by @rubys, this is a deviation from the standard.

Replace old RFC parsing by normalizers

Special-casing the parser in order to normalize URLs to old RFCs is overkill. Let's move the old RFC parsing to separate normalizers. Together with this, toJavaURI should be changed to not throw URISyntaxException (by changing the constructor).

Optionally validate IDN domains according to each TLD rules.

Optionally validate IDN domains according to each TLD rules. This is very low priority but it would be nice to have, eventually.

coveralls integration is broken

[ERROR] Failed to execute goal org.eluder.coveralls:coveralls-maven-plugin:2.1.0:cobertura (default-cli) on project galimatias: IO operation failed: /home/travis/build/smola/galimatias/target/site/cobertura/coverage.xml (No such file or directory) -> [Help 1]

It was broken here:
57f699e

Make percent-encoding to upper-case normalization optional

Expose host parsing error position to the URL parser

Currently, any host error message is exposed to the ErrorHandler of a URLParser as a GalimatiasException with its position set to the beginning of the host. In order to get the actual position of the error, one must get the wrapped GalimatiasErrorException and calculate it. It might be nice to fix this in the future.

0.1.0 at Maven?

I see the previous releases at http://central.maven.org/maven2/io/mola/galimatias/galimatias/ but not the 0.1.0 release yet.

Should we support the WHATWG URLUtils interface?

https://url.spec.whatwg.org/#urlutils-and-urlutilsreadonly-members

As pointed out by @rubys, URL does not implement the URLUtils interface. It might be interesting to support it, although it might clutter the URL interface.

Decide on the behaviour of default/empty URL fields

I've changed to match WHATWG and W3C specs here, but that's an important deviation from java.net.URI and java.net.URL (including Android implementation).

java.net.URI, since it performs no normalization operation at all, accepts both empty and null userInfo. In our case, userInfo is never null. The same happens with authority and other fields.

We probably want this to stay this way, but adding some warnings to the Javadoc of the corresponding methods would be nice.

For error-reporting parser, if URL contains whitespace char, report more specific “… contains space character.” message

Currently, when the error-reporting parser is turned on and a URL contains a space character (or tab or newline), a generic “… contains invalid character” message is emitted. It would be much more helpful if a specific “… contains space character” message (or “contains tab character” or “contains newline”) were emitted instead.

I seem to recall you saying that for the next release you were already planning on having the error messages emit the invalid character in the message. So maybe this is already on your radar. If not I’d be happy to provide a patch.

Add URL manipulation methods

Currently, there is only URL.withScheme(String) as a proof-of-concept. This needs to be extended to all URL attributes.

Parsing U+10000 or above in username produces unexpected result

Test case: http://💩:foo@example.com
(U+1F4A9 in username)

Results from galimatias:

$ java -cp dependencies/galimatias-0.2.1.jar:dependencies/icu4j-53_1.jar io.mola.galimatias.cli.CLI http://💩:foo@example.com
Base: http://example.org/foo/bar
Analyzing URL: http://💩:foo@example.com
Parsing...
    Recoverable error found;
        Error: Illegal character in user or password: not a URL code point
        Position: 13
    Result:
        URL: http://%F0%9F%92%A9%3F:fo@example.com/
        URL type: hierarchical
        Scheme: http
        Scheme data: 
        Username: %F0%9F%92%A9%3F
        Password: fo
        Host: example.com
        Port: 80
        Path: /
Canonicalizing with RFC 3986 rules...
    Result identical to WHATWG rules
Canonicalizing with RFC 2396 rules...
    Result identical to RFC 3986 rules

Notice that it shows %F0%9F%92%A9%3F as the result for the username part, while it should just be %F0%9F%92%A9, and it shows fo instead of foo as the result for the password part.

I get expected results with code points in username less than U+FFFF; e.g., ￮ (U+FFEE).
But with U+10000 and higher, e.g., 𐀠 (U+10020), I get the same unexpected behavior as above.

Use safer java.net.URI constructor

We can minimize URISyntaxException in URL.toJavaURI by using a safer constructor such as:

URI(String scheme, String userInfo, String host, int port, String path, String query, String fragment)
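The behaviour of that constructor can be checked with the JDK alone. In the sketch below, SaferUriConstructorSketch and build are hypothetical names; the constructor itself is the standard seven-argument java.net.URI constructor, which quotes illegal characters in each component instead of rejecting them.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class SaferUriConstructorSketch {

    // Components are passed in decoded form; the constructor quotes what is not legal.
    static URI build(String scheme, String host, String path, String query) {
        try {
            return new URI(scheme, null, host, -1, path, query, null);
        } catch (URISyntaxException e) {
            // Still possible, e.g. for a host that is not syntactically valid.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("http", "example.com", "/a b", "q=x y"));
        // http://example.com/a%20b?q=x%20y
    }
}
```

One caveat: this constructor always quotes the percent character, so components that are already percent-encoded would be encoded a second time. They need to be passed in decoded form.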

Add URL.relativize() method

This will create a String with a relative reference of a URL relative to another URL.
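For comparison, this is what the JDK already offers through java.net.URI.relativize. The sketch only shows existing JDK behaviour (the wrapper name is hypothetical); it says nothing about how a galimatias method would be specified.

```java
import java.net.URI;

final class RelativizeSketch {

    static String relativize(String base, String target) {
        return URI.create(base).relativize(URI.create(target)).toString();
    }

    public static void main(String[] args) {
        // The target is below the base path, so a relative reference is produced.
        System.out.println(relativize("http://example.com/a/", "http://example.com/a/b/c")); // b/c
        // The target is not below the base path: the JDK returns the target unchanged.
        System.out.println(relativize("http://example.com/a/", "http://example.com/x")); // http://example.com/x
    }
}
```

java.net.URI never produces references that climb with "../", which is one gap a dedicated relativize() could close.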

Add method to convert InetAddress to IPv{4,6}Address

There are still gaps in the API for IP addresses. Notably, there are no fromInetAddress methods.

Change URL path() and pathString() to path() and pathSegments()

The default URL.path() should return a String with the path, while pathSegments() should return the List of segments.

Migrate from IDNA2003 to UTS 46

https://bugzilla.mozilla.org/show_bug.cgi?id=479520
https://codereview.chromium.org/23642003
http://blogs.msdn.com/b/shawnste/archive/2013/09/09/how-does-ie-handle-the-idn2008-rfcs.aspx

Implement support for relative URLs for unknown schemes per current spec

See whatwg/url@b266a43

Also see updated tests at web-platform-tests/wpt@00c62be and web-platform-tests/wpt@288f7ae

Check for DNS length limits

DNS imposes a limit of 63 bytes on each label, a maximum of 127 labels, and a total of 253 characters. We should check for these limits.

This is where empty labels (except the root label) should throw an error.
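The checks above could be sketched as follows. The sketch assumes the domain is already in ASCII form (after IDNA processing), so that characters and bytes coincide, and it treats a single trailing dot as the root label. DnsLimitsSketch, check and isValid are hypothetical names.

```java
final class DnsLimitsSketch {

    static final int MAX_LABEL_LENGTH = 63;
    static final int MAX_LABELS = 127;
    static final int MAX_NAME_LENGTH = 253;

    // Throws IllegalArgumentException when an ASCII domain violates the limits.
    static void check(String asciiDomain) {
        // A single trailing dot denotes the root label and is allowed.
        String name = asciiDomain.endsWith(".")
                ? asciiDomain.substring(0, asciiDomain.length() - 1)
                : asciiDomain;
        if (name.isEmpty()) {
            throw new IllegalArgumentException("empty domain");
        }
        if (name.length() > MAX_NAME_LENGTH) {
            throw new IllegalArgumentException("domain longer than 253 characters");
        }
        String[] labels = name.split("\\.", -1);
        if (labels.length > MAX_LABELS) {
            throw new IllegalArgumentException("more than 127 labels");
        }
        for (String label : labels) {
            if (label.isEmpty()) {
                throw new IllegalArgumentException("empty label");
            }
            if (label.length() > MAX_LABEL_LENGTH) {
                throw new IllegalArgumentException("label longer than 63 bytes: " + label);
            }
        }
    }

    static boolean isValid(String asciiDomain) {
        try {
            check(asciiDomain);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("example.com"));  // true
        System.out.println(isValid("example.com.")); // true
        System.out.println(isValid("a..b"));         // false
    }
}
```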

Add a way to produce IRI

See http://tools.ietf.org/html/rfc3987

This is really low priority anyway. I'm waiting to see a real use case for this.

Specification sync: Fix IPv6/IPv4 parsing bugs

Fix IPv6/IPv4 parsing bugs.

https://www.w3.org/Bugs/Public/show_bug.cgi?id=26360
https://www.w3.org/Bugs/Public/show_bug.cgi?id=26361
https://www.w3.org/Bugs/Public/show_bug.cgi?id=26363

whatwg/url@72e5848
whatwg/url@1c22aa1
whatwg/url@a0a8e32

Percent-decode domain before parsing it

WHATWG specifies domain parsing as:

Let host be the result of running utf-8's decoder on the percent decoding of input.
Let domain be the result of splitting host on any domain label separators.
Return the result of running domain to ASCII on domain.

This behaviour does not seem consistent across browsers, though. At the moment, we'll just follow the spec here.
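The three steps quoted above can be approximated with JDK classes. This is only a sketch: java.net.IDN implements IDNA2003 and does its own label splitting, so it stands in for both the split and the "domain to ASCII" step, and the helper names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.net.IDN;
import java.nio.charset.StandardCharsets;

final class DomainParseSketch {

    private static int hexValue(byte b) {
        if (b >= '0' && b <= '9') return b - '0';
        if (b >= 'A' && b <= 'F') return b - 'A' + 10;
        if (b >= 'a' && b <= 'f') return b - 'a' + 10;
        return -1;
    }

    // Step 1: percent-decode to bytes, then run a UTF-8 decoder on the result.
    static String percentDecode(String input) {
        byte[] in = input.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '%' && i + 2 < in.length
                    && hexValue(in[i + 1]) >= 0 && hexValue(in[i + 2]) >= 0) {
                out.write(hexValue(in[i + 1]) * 16 + hexValue(in[i + 2]));
                i += 2;
            } else {
                out.write(in[i]);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Steps 2 and 3: IDN.toASCII splits on label separators and converts each label.
    static String parseDomain(String input) {
        return IDN.toASCII(percentDecode(input));
    }

    public static void main(String[] args) {
        System.out.println(parseDomain("ex%61mple.com"));         // example.com
        System.out.println(parseDomain("b%C3%BCcher.example"));   // xn--bcher-kva.example
    }
}
```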

URL parsing does not convert characters: [ and ] to correct presentation.

Example: System.out.println(URL.parse("http://test.com/path=[test]").toString());

Output: http://test.com/path=[test]

However, when toJavaURI() is used:
System.out.println(URL.parse("http://test.com/path=[test]").toJavaURI().toString());

Output: http://test.com/path=%5Btest%5D

As far as I understand, the first case should return the same result as the second, since it should be the same converted string.

There is a method "toHumanString()" which in fact should return (and returns) what is now incorrectly returned by "toString()".

Add API equivalent to URLSearchParams

We need an interface analogous to URLSearchParams:
https://url.spec.whatwg.org/#urlsearchparams

Deploy snapshots to Sonatype

https://issues.sonatype.org/browse/OSSRH-8262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Add API for application/x-www-form-urlencoded

https://url.spec.whatwg.org/#application/x-www-form-urlencoded

This also supersedes java.net.URLEncoder.
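For reference, a minimal parsing sketch built on java.net.URLDecoder. It is an approximation, not the algorithm from the URL Standard: URLDecoder throws on malformed escapes where the standard is lenient, and the decode(String, Charset) overload used here needs Java 10 or later. FormUrlencodedSketch and parse are hypothetical names.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class FormUrlencodedSketch {

    // Returns name/value pairs in order; repeated names are kept, as in URLSearchParams.
    static List<Map.Entry<String, String>> parse(String query) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue; // skip empty sequences such as "a=1&&b=2"
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            pairs.add(new AbstractMap.SimpleImmutableEntry<>(decode(name), decode(value)));
        }
        return pairs;
    }

    // URLDecoder turns '+' into a space and decodes %XX escapes as UTF-8.
    private static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(parse("a=1&b=x+y&c&a=2")); // [a=1, b=x y, c=, a=2]
    }
}
```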

Determine encoding behaviour of #fragment

java.net.URL, Android's URL, Gecko and WebKit accept "#foo bar" as a valid fragment. RFC 3986 does not, and java.net.URI doesn't either. "#foo%20bar" is left as-is for WebKit, while it is decoded to "#foo bar" in Gecko.

Let's give this a properly defined behaviour for the different parsing versions (WHATWG, RFC...).

Add builder API

A builder API for URLs might be really useful.
At the moment, I'm delaying this to a version beyond 0.0.1.

Need feedback: Describe your use cases for non-HTTP(S) URIs

Galimatias was born with the goal of parsing URLs that can be opened in a web browser. For my use cases, this included http, https and data. It soon became obvious that support for ftp, gopher, ws, wss and file would be sane and cheap to add since they are supported by most modern browsers and are specified in WHATWG's URL Standard.

Support for any other scheme is currently in place in a limited way. If there is a double slash ("//") after the scheme (e.g. "git://"), the URL is parsed as a hierarchical URI. Otherwise, it is parsed as an opaque URI. This is known to work with any URI except ed2k links (which are far from compliant with RFC 3986).

Before going on and overengineering anything, I would like to hear about your use cases with handling URIs other than http and https in Java. I will use this feedback to better define the scope of galimatias.

Thank you for your time!

Add a setting to use a default scheme for parsing

As discussed in #35, I'll try to add the possibility to use a default scheme via URLParsingSettings. This is useful for parsing URLs introduced by users where a full absolute URL is expected but the user misses the scheme (e.g. example.com, not http://example.com).

ICU4J dependency

Hi Santiago,

We have been using galimatias for quite some time now without any issue.

You recently introduced a dependency on ICU4J in this commit 5ce2cb9 and I was wondering if there was a way to fix the issue without adding this dependency.

You might have missed it, but ICU4J is a 10 MB jar, which is quite large.

Thanks for your work.