smola / galimatias
galimatias is a URL parsing and normalization library written in Java.
Home Page: http://galimatias.mola.io
License: MIT License
Optionally validate IDN domains according to each TLD's rules. This is very low priority, but it would be nice to have eventually.
A builder API for URLs might be really useful.
At the moment, I'm delaying this to a version beyond 0.0.1.
The default URL.path() should return a String with the path, while pathSegments() should return the List of segments.
Test case: http://💩:foo@example.com
(U+1F4A9 in username)
Results from galimatias:
$ java -cp dependencies/galimatias-0.2.1.jar:dependencies/icu4j-53_1.jar io.mola.galimatias.cli.CLI http://💩:foo@example.com
Base: http://example.org/foo/bar
Analyzing URL: http://💩:foo@example.com
Parsing...
Recoverable error found;
Error: Illegal character in user or password: not a URL code point
Position: 13
Result:
URL: http://%F0%9F%92%A9%3F:fo@example.com/
URL type: hierarchical
Scheme: http
Scheme data:
Username: %F0%9F%92%A9%3F
Password: fo
Host: example.com
Port: 80
Path: /
Canonicalizing with RFC 3986 rules...
Result identical to WHATWG rules
Canonicalizing with RFC 2396 rules...
Result identical to RFC 3986 rules
Notice that it shows %F0%9F%92%A9%3F as the result for the username part, while it should just be %F0%9F%92%A9, and it shows fo instead of foo as the result for the password part.
I get expected results with code points in username less than U+FFFF; e.g., ○ (U+FFEE).
But with U+10000 and higher, e.g., 𐀠 (U+10020), I get the same unexpected behavior as above.
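The symptoms reported above (a stray %3F in the username and a truncated password) would be consistent with a supplementary character being processed one UTF-16 char at a time instead of as a single code point, although the actual cause inside galimatias is not confirmed by this report. As a minimal sketch in plain Java, here is code-point-aware percent-encoding. The class and method names are hypothetical and not part of galimatias, and the encode set is deliberately simplified (everything except ASCII letters and digits is encoded), so it is not the WHATWG encode set.

```java
import java.nio.charset.StandardCharsets;

public class CodePointEncoding {

    /**
     * Percent-encodes every code point that is not an ASCII letter or digit,
     * using its UTF-8 bytes. Iterates by code point, so a surrogate pair is
     * always encoded as one 4-byte UTF-8 sequence.
     */
    public static String percentEncode(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            if (cp < 0x80 && Character.isLetterOrDigit(cp)) {
                out.appendCodePoint(cp);
            } else {
                byte[] bytes = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    out.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
            // Advance by 1 char for BMP code points, 2 for supplementary ones.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F4A9 written as its surrogate pair.
        System.out.println(percentEncode("\uD83D\uDCA9")); // %F0%9F%92%A9
        System.out.println(percentEncode("foo"));          // foo
    }
}
```

Advancing the index with Character.charCount is what keeps the characters after a supplementary code point aligned; stepping by one char would leave the low surrogate to be handled on its own.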
We can minimize URISyntaxException in URL.toJavaURI by using a safer constructor such as:
URI(String scheme, String userInfo, String host, int port, String path, String query, String fragment)
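To illustrate the difference, the sketch below uses only java.net.URI: the single-argument constructor rejects a raw space, while the seven-argument constructor quotes it. The wrapper class and method names are hypothetical, and this is not how toJavaURI is currently implemented.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class SaferUriConstruction {

    /**
     * Builds a URI string from separate components using the seven-argument
     * constructor, which quotes characters that are illegal in each component.
     */
    public static String toUriString(String scheme, String userInfo, String host,
                                     int port, String path, String query, String fragment) {
        try {
            return new URI(scheme, userInfo, host, port, path, query, fragment).toString();
        } catch (URISyntaxException e) {
            // Still possible, e.g. for a relative path combined with an authority.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        try {
            new URI("http://example.com/foo bar");
        } catch (URISyntaxException e) {
            System.out.println("single-argument constructor rejected the input");
        }
        // Port -1 means "no port"; null components are omitted.
        System.out.println(toUriString("http", null, "example.com", -1, "/foo bar", null, null));
    }
}
```

One caveat with this approach: the multi-argument constructors always quote the percent character, so the components passed in would need to be percent-decoded first to avoid double encoding.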
Add a method to convert a URL to a human-understandable String. That is, domains converted to Unicode and printable characters percent-decoded.
I would like to use the new URL.searchParameters method, but it's not in 0.2.1. Could you make a new release?
Currently, there is only URL.withScheme(String) as a proof-of-concept. This needs to be extended to all URL attributes.
Hi Santiago,
We have been using galimatias for quite some time now without any issue.
You recently introduced a dependency on ICU4J in commit 5ce2cb9, and I was wondering if there was a way to fix the issue without adding this dependency.
You might have missed it but ICU4J is a 10 MB jar which is quite huge.
Thanks for your work.
Currently, a Host can be either a Domain or an IPv6Address. Determine behaviour for IPv4Address.
java.net.URL, Android's URL, Gecko and WebKit accept "#foo bar" as a valid fragment. RFC 3986 does not, and java.net.URI doesn't either. "#foo%20bar" is left as-is for WebKit, while it is decoded to "#foo bar" in Gecko.
Let's give this a properly defined behaviour for the different parsing versions (WHATWG, RFC...).
Special-casing the parser in order to normalize URLs to old RFCs is overkill. Let's move the old RFC parsing to separate normalizers. Together with this, toJavaURI should be changed to not throw URISyntaxException (by changing the constructor).
Also see updated tests at web-platform-tests/wpt@00c62be and web-platform-tests/wpt@288f7ae
Currently, when the error-reporting parser is turned on, if a URL contains a space character (or tab or newline), a generic “… contains invalid character” message is emitted. It would be much more helpful if a specific “… contains space character” message (or “contains tab character” or “contains newline”) were emitted instead.
I seem to recall you saying that for the next release you were already planning on having the error messages emit the invalid character in the message. So maybe this is already on your radar. If not I’d be happy to provide a patch.
I've changed to match WHATWG and W3C specs here, but that's an important deviation from java.net.URI and java.net.URL (including Android implementation).
java.net.URI, since it performs no normalization operation at all, accepts both empty and null userInfo. In our case, userInfo is never null. The same happens with authority and other fields.
We probably want this to stay this way, but some warnings in the javadocs of the corresponding methods would be nice.
A command line client would be a useful tool for URL debugging, as well as a good showcase for galimatias.
[ERROR] Failed to execute goal org.eluder.coveralls:coveralls-maven-plugin:2.1.0:cobertura (default-cli) on project galimatias: IO operation failed: /home/travis/build/smola/galimatias/target/site/cobertura/coverage.xml (No such file or directory) -> [Help 1]
It was broken here:
57f699e
Galimatias was born with the goal of parsing URLs that can be opened in a web browser. For my use cases, this included http, https and data. It soon became obvious that support for ftp, gopher, ws, wss and file would be sane and cheap to add, since they are supported by most modern browsers and are specified in WHATWG's URL Standard.
Support for any other scheme is currently in place in a limited way. If there is a double slash ("//") after the scheme (e.g. "git://"), the URL is parsed as a hierarchical URI. Otherwise, it is parsed as an opaque URI. This is known to work with any URI except ed2k links (which are far from compliant with RFC 3986).
Before going on and overengineering anything, I would like to hear about your use cases for handling URIs other than http and https in Java. I will use this feedback to better define the scope of galimatias.
Thank you for your time!
https://url.spec.whatwg.org/#urlutils-and-urlutilsreadonly-members
As pointed out by @rubys, URL does not implement the URLUtils interface. It might be interesting to support it, although it might clutter the URL interface.
Right now a lot of them are just "PARSE ERROR".
https://url.spec.whatwg.org/#application/x-www-form-urlencoded
This also supersedes java.net.URLEncoder.
In order to produce more normalized URLs, it might be a good idea to percent-decode characters that do not require percent-encoding.
See the corresponding issue in WHATWG URL's bug tracker: https://www.w3.org/Bugs/Public/show_bug.cgi?id=24164
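As a rough sketch of what such a normalization step could look like, the code below decodes only percent-encoded octets that map to RFC 3986 "unreserved" characters (ASCII letters, digits, "-", ".", "_", "~") and leaves every other escape untouched. Choosing the RFC 3986 unreserved set is an assumption made for illustration; the set galimatias would actually use is not decided in this issue, and the class name is hypothetical.

```java
public class UnreservedDecoder {

    private static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
    }

    /** Returns the value of an ASCII hex digit, or -1 if c is not one. */
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Decodes %XX escapes only when they represent an unreserved character. */
    public static String decodeUnreserved(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '%' && i + 2 < input.length()) {
                int hi = hexValue(input.charAt(i + 1));
                int lo = hexValue(input.charAt(i + 2));
                if (hi >= 0 && lo >= 0 && isUnreserved((hi << 4) | lo)) {
                    out.append((char) ((hi << 4) | lo));
                    i += 3;
                    continue;
                }
            }
            // Not a decodable escape: copy the character unchanged.
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeUnreserved("/%7Efoo%2Fbar%41")); // /~foo%2FbarA
    }
}
```

Escapes such as %2F are kept because decoding them would change how the URL is split into components.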
http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/net/InternetDomainName.html
Is it faster than ICU4J? Is it up to date with standards?
Example: System.out.println(URL.parse("http://test.com/path=[test]").toString());
Output: http://test.com/path=[test]
However, if toJavaURI is used:
System.out.println(URL.parse("http://test.com/path=[test]").toJavaURI().toString());
Output: http://test.com/path=%5Btest%5D
As far as I understand, the first case should return the same result as the second, since it should be the same converted string.
There is a method "toHumanString()" which in fact should return (and does return) what "toString()" now incorrectly returns.
A hashbang (#!) should be converted to / by a crawler that needs to fetch the page without using JavaScript.
DNS imposes a limit of 63 bytes on each label, a maximum of 127 labels, and a maximum of 253 characters for the whole name. We should check for these limits.
This is where empty labels (except the root label) should throw an error.
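The checks described above could be sketched roughly as follows. This assumes the domain has already been converted to its ASCII form, that a single trailing dot denotes the root label, and that the 253-character limit is counted without that trailing dot. The class name, method name and error strings are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class DnsLimits {

    static final int MAX_LABEL_BYTES = 63;
    static final int MAX_LABELS = 127;
    static final int MAX_NAME_CHARS = 253;

    /** Returns null if the domain is within the limits, else a description of the problem. */
    public static String check(String domain) {
        // A single trailing dot is the root label and is allowed to be empty.
        String name = domain.endsWith(".")
                ? domain.substring(0, domain.length() - 1)
                : domain;
        if (name.isEmpty()) {
            return "empty domain";
        }
        if (name.length() > MAX_NAME_CHARS) {
            return "domain longer than 253 characters";
        }
        // Limit -1 keeps trailing empty strings so that "foo.." is detected.
        String[] labels = name.split("\\.", -1);
        if (labels.length > MAX_LABELS) {
            return "more than 127 labels";
        }
        for (String label : labels) {
            if (label.isEmpty()) {
                return "empty label";
            }
            if (label.getBytes(StandardCharsets.US_ASCII).length > MAX_LABEL_BYTES) {
                return "label longer than 63 bytes";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(check("example.com"));  // null
        System.out.println(check("foo..com"));     // empty label
    }
}
```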
Multiple slashes together are ok with standards and have a different meaning than just one slash. That is: "/foo/bar" should be translated to path segments "foo" and "bar", while "/foo//bar" is "foo", "" and "bar".
Some people use significant empty segments in their paths (see this). However, the most common case is that multiple slashes are not significant and are produced as an unintended consequence of bad serialization.
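The distinction between the two paths can be sketched in plain Java as follows. The class and method names are hypothetical; this is not the galimatias API, only an illustration of segment splitting and of what an optional "collapse slashes" step would discard.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PathSegments {

    /** Splits an absolute path into segments, preserving empty segments. */
    public static List<String> split(String path) {
        String rest = path.startsWith("/") ? path.substring(1) : path;
        // Limit -1 keeps empty strings, including a trailing one.
        return Arrays.asList(rest.split("/", -1));
    }

    /** Lossy normalization: drops every empty segment. */
    public static List<String> withoutEmptySegments(List<String> segments) {
        List<String> out = new ArrayList<>();
        for (String segment : segments) {
            if (!segment.isEmpty()) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("/foo/bar"));                        // [foo, bar]
        System.out.println(split("/foo//bar"));                       // [foo, , bar]
        System.out.println(withoutEmptySegments(split("/foo//bar"))); // [foo, bar]
    }
}
```

Note that dropping empty segments also removes a trailing empty segment, that is, a trailing slash, which is one more reason such a step would have to be optional.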
It's generally accepted that a trailing slash can be added to a URL path if there is no "file extension" (e.g. /foo -> /foo/ but not /foo.html -> /foo.html/). However, that changes semantics according to RFC 3986 and might break well-formed URLs in lots of cases.
Both of these normalizations can break standard-compliant URLs. So they should be optional and the user should be warned. Also, when to perform this normalization (during parsing or after parsing) is important, since it can change the result of /../.
Proper processing of these cases (as Google seems to be doing) is normalizing according to the result of fetching the URL and processing redirects and <link rel="canonical">.
Because of all of this, I still doubt that providing these normalizations in Galimatias is a sane choice.
Checking if a host is a domain or IP shouldn't require instanceof.
Review all UTF-8 and alternative encodings stuff.
See also:
IPv6 addresses are wrapped in [] when serialized as part of a URL, but they are not wrapped when printed as standalone entities.
As pointed out by @rubys, this is a deviation from the standard.
There are still gaps in the API for IP addresses. Notably, there are no fromInetAddress methods.
Example: System.out.println(URL.parse("http://127.0.0.1").toString());
Output: http://1.0.0.127/
Affects all IPs. They are converted back to a String with the octets in reverse order.
The same result for toJavaURL() and toJavaURI() methods.
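For reference, a round trip that preserves octet order might look like the sketch below. How galimatias stores IPv4 addresses internally is not stated in this report, so the 32-bit int representation (most significant octet first) is an assumption, and the class and method names are hypothetical.

```java
public class Ipv4Serialization {

    /** Parses a dotted-quad string into a 32-bit value, most significant octet first. */
    public static int parse(String dotted) {
        String[] parts = dotted.split("\\.", -1);
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected 4 octets: " + dotted);
        }
        int address = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + part);
            }
            address = (address << 8) | octet;
        }
        return address;
    }

    /** Serializes the 32-bit value back to dotted-quad, most significant octet first. */
    public static String serialize(int address) {
        return ((address >>> 24) & 0xFF) + "."
                + ((address >>> 16) & 0xFF) + "."
                + ((address >>> 8) & 0xFF) + "."
                + (address & 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(serialize(parse("127.0.0.1"))); // 127.0.0.1
    }
}
```

A regression test asserting that serialize(parse(x)) equals x for an asymmetric address such as 127.0.0.1 would catch the reversal described above.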
I'm using your library to take URLs supplied by users, which may contain invalid syntax such as spaces, and convert them to valid URIs with URL.parse(it).toJavaURI().
One additional case that I'd like to cover is when the user leaves out the scheme. In this case, I would like to default it to http. Currently, I'm doing this by catching GalimatiasParseException and checking to see if the exception's message is "Missing scheme".
This is working well for me, but checking the exception's message is very brittle. I'd like to suggest that there be a few subclasses of GalimatiasParseException, including something like MissingSchemeException, so that it can be captured in a safer way.
I'd be happy to submit a pull request if you think that this is a worthwhile enhancement.
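A rough sketch of the proposed shape, written without galimatias on the classpath: ParseException stands in for GalimatiasParseException, MissingSchemeException is the suggested subclass, and parse is a toy stand-in for URL.parse that only checks for a scheme. All of these names and the scheme check are hypothetical and for illustration only.

```java
public class DefaultSchemeExample {

    /** Hypothetical stand-in for GalimatiasParseException. */
    public static class ParseException extends Exception {
        public ParseException(String message) {
            super(message);
        }
    }

    /** Hypothetical subclass, as proposed above. */
    public static class MissingSchemeException extends ParseException {
        public MissingSchemeException() {
            super("Missing scheme");
        }
    }

    /** Toy stand-in for URL.parse: only verifies that a scheme is present. */
    public static String parse(String input) throws ParseException {
        if (!input.matches("^[A-Za-z][A-Za-z0-9+.-]*://.*")) {
            throw new MissingSchemeException();
        }
        return input;
    }

    /** Retries with "http://" when, and only when, the scheme is missing. */
    public static String parseWithDefaultScheme(String input) {
        try {
            try {
                return parse(input);
            } catch (MissingSchemeException e) {
                // Caught by type, so no comparison against the message text is needed.
                return parse("http://" + input);
            }
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseWithDefaultScheme("example.com"));         // http://example.com
        System.out.println(parseWithDefaultScheme("https://example.com")); // https://example.com
    }
}
```

Because the subclass still extends the general parse exception, existing callers that catch the base type would keep working.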
This will create a String with a relative reference of a URL relative to another URL.
I see the previous releases at http://central.maven.org/maven2/io/mola/galimatias/galimatias/ but not the 0.1.0 release yet.
See http://tools.ietf.org/html/rfc3987
This is really low priority anyway. I'm waiting to see a real use case for this.
Currently, any host error message is exposed to the ErrorHandler of a URLParser as a GalimatiasException with its position set to the beginning of the host. In order to get the actual position of the error, one must get the wrapped GalimatiasErrorException and calculate it. It might be nice to fix this in the future.
We need an interface analogous to URLSearchParams:
https://url.spec.whatwg.org/#urlsearchparams
For illegal characters in fragments, galimatias now unexpectedly reports "Illegal character in path segment" rather than "Illegal character in fragment" as expected.
v0.2.1
$ java -cp dependencies/galimatias-0.2.1.jar:dependencies/icu4j-53_1.jar io.mola.galimatias.cli.CLI http://foo/path#f#g
Base: http://example.org/foo/bar
Analyzing URL: http://foo/path#f#g
Parsing...
Recoverable error found;
Error: Illegal character in path segment: not a URL code point
Position: 17
v0.1.0
$ java -cp dependencies/galimatias-0.1.0.jar:dependencies/icu4j-53_1.jar io.mola.galimatias.cli.CLI http://foo/path#f#g
Base: http://example.org/foo/bar
Analyzing URL: http://foo/path#f#g
Parsing...
Recoverable error found;
Error: Illegal character in fragment: not a URL code point
Position: 17
The actual results of parsing for this case are the same with v0.2.1 as with v0.1.0; the only difference is the error message reporting "path segment" instead of "fragment".
Some real subdomains use "invalid" symbols such as underscore (_) and it actually works on most browsers.
More info:
https://bugzilla.mozilla.org/show_bug.cgi?id=479520#c49
As discussed in #35, I'll try to add the possibility to use a default scheme via URLParsingSettings. This is useful for parsing URLs introduced by users where a full absolute URL is expected but the user misses the scheme (e.g. example.com, not http://example.com).
WHATWG specifies domain parsing as:
Although this behaviour does not seem consistent across browsers. At the moment, we'll just follow the spec here.
Utils that we should provide:
Host encoding/decoding is already provided through the Host class and its subclasses.