galimatias's People

Contributors

blicksky, dependabot[bot], edwelker, fmela, josephw, ocadaruma, sideshowbarker, smola

galimatias's Issues

Add builder API

A builder API for URLs might be really useful.
At the moment, I'm delaying this to a version beyond 0.0.1.

Parsing U+10000 or above in username produces unexpected result

Test case: http://💩:foo@example.com
(U+1F4A9 in username)

Results from galimatias:

$ java -cp dependencies/galimatias-0.2.1.jar:dependencies/icu4j-53_1.jar io.mola.galimatias.cli.CLI http://💩:foo@example.com
Base: http://example.org/foo/bar
Analyzing URL: http://💩:foo@example.com
Parsing...
    Recoverable error found;
        Error: Illegal character in user or password: not a URL code point
        Position: 13
    Result:
        URL: http://%F0%9F%92%A9%3F:fo@example.com/
        URL type: hierarchical
        Scheme: http
        Scheme data: 
        Username: %F0%9F%92%A9%3F
        Password: fo
        Host: example.com
        Port: 80
        Path: /
Canonicalizing with RFC 3986 rules...
    Result identical to WHATWG rules
Canonicalizing with RFC 2396 rules...
    Result identical to RFC 3986 rules

Notice that it shows %F0%9F%92%A9%3F as the result for the username part, while it should just be %F0%9F%92%A9, and it shows fo instead of foo as the result for the password part.

I get expected results with code points in username less than U+FFFF; e.g., ○ (U+FFEE).
But with U+10000 and higher, e.g., 𐀠 (U+10020), I get the same unexpected behavior as above.
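The symptoms above (an extra %3F and a dropped character after the username) would be consistent with a parser that advances one Java char at a time instead of one code point at a time, although that is only a guess at the cause. The sketch below shows percent-encoding that walks the input by code point. CodePointPercentEncoder and encodeNonAscii are hypothetical names used for illustration and are not part of galimatias.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of galimatias.
final class CodePointPercentEncoder {

    // Percent-encodes every non-ASCII code point as its UTF-8 bytes.
    static String encodeNonAscii(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    out.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
            // Advance by 2 chars for code points at or above U+10000, 1 otherwise.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String username = new String(Character.toChars(0x1F4A9));
        System.out.println(encodeNonAscii(username + ":foo")); // %F0%9F%92%A9:foo
    }
}
```

Advancing by Character.charCount(cp) is what keeps the low surrogate from being treated as a separate character; a lone surrogate encoded to UTF-8 by String.getBytes becomes '?', which is 0x3F.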

Use safer java.net.URI constructor

We can minimize URISyntaxException in URL.toJavaURI by using a safer constructor such as:

URI(String scheme, String userInfo, String host, int port, String path, String query, String fragment)
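As a rough illustration of why this constructor is safer, the sketch below compares it with the single-argument constructor using only java.net.URI. The class and method names (SaferUriConstructorSketch, fromComponents, singleArgumentAccepts) are made up for this example and say nothing about how URL.toJavaURI is or will be implemented.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative only; class and method names are not part of galimatias.
final class SaferUriConstructorSketch {

    // Builds a URI from decoded components. The multi-argument constructor
    // quotes characters that are illegal in each component.
    static String fromComponents(String scheme, String userInfo, String host, int port,
                                 String path, String query, String fragment) {
        try {
            return new URI(scheme, userInfo, host, port, path, query, fragment).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Reports whether the single-argument constructor accepts the string as-is.
    static boolean singleArgumentAccepts(String uri) {
        try {
            new URI(uri);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(singleArgumentAccepts("http://example.com/a b")); // false
        // A port of -1 means "no port".
        System.out.println(fromComponents("http", null, "example.com", -1, "/a b", "q=x y", "frag"));
        // http://example.com/a%20b?q=x%20y#frag
    }
}
```

One caveat: the multi-argument constructors always quote the percent character, so components have to be passed in decoded form. A path that already contains %20 would come out as %2520.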

Add URL.toHumanString method

Add a method to convert a URL to a human-understandable String. That is, domains converted to Unicode and printable characters percent-decoded.
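For the domain half of this, a minimal sketch using the JDK's java.net.IDN is shown below. HumanReadableHost and toUnicodeHost are hypothetical names, java.net.IDN implements IDNA2003 and may differ from whatever galimatias uses internally, and percent-decoding of printable characters is left out of the sketch.

```java
import java.net.IDN;

// Hypothetical helper for illustration; not part of galimatias.
final class HumanReadableHost {

    // Converts Punycode ("xn--") labels of an ASCII domain back to Unicode.
    static String toUnicodeHost(String asciiHost) {
        return IDN.toUnicode(asciiHost);
    }

    public static void main(String[] args) {
        System.out.println(toUnicodeHost("xn--mnchen-3ya.de")); // münchen.de
    }
}
```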

Release v0.2.2

I would like to use the new URL.searchParameters method, but it's not in 0.2.1. Could you make a new release?

Add URL manipulation methods

Currently, there is only URL.withScheme(String) as a proof-of-concept. This needs to be extended to all URL attributes.

ICU4J dependency

Hi Santiago,

We have been using galimatias for quite some time now without any issue.

You recently introduced a dependency on ICU4J in this commit 5ce2cb9 and I was wondering if there was a way to fix the issue without adding this dependency.

You might have missed it, but ICU4J is a 10 MB jar, which is quite large.

Thanks for your work.

IPv4 hosts handling

Currently, a Host can be either a Domain or an IPv6Address. Determine behaviour for IPv4Address.

Determine encoding behaviour of #fragment

java.net.URL, Android's URL, Gecko and WebKit accept "#foo bar" as a valid fragment. RFC 3986 does not, and java.net.URI doesn't either. "#foo%20bar" is left as-is for WebKit, while it is decoded to "#foo bar" in Gecko.

Let's give this a properly defined behaviour for the different parsing versions (WHATWG, RFC...).

Replace old RFC parsing by normalizers

Special-casing the parser in order to normalize URLs to old RFCs is overkill. Let's move the old RFC parsing to separate normalizers. Together with this, toJavaURI should be changed to not throw URISyntaxException (by changing the constructor).

For error-reporting parser, if URL contains whitespace char, report more specific “… contains space character.” message

Currently when the error-reporting parser is turned on, if a URL contains a space character (or tab or newline), a generic “… contains invalid character” message is emitted. It would be much more helpful if a specific “… contains space character” message (or “contains tab character” or “contains newline”) were emitted instead.

I seem to recall you saying that for the next release you were already planning on having the error messages emit the invalid character in the message. So maybe this is already on your radar. If not I’d be happy to provide a patch.
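A minimal sketch of the message selection being asked for might look like the following. WhitespaceErrorMessages and describe are hypothetical names, the wording of the fallback message is invented, and treating carriage return as a newline is an assumption of this sketch rather than something stated above.

```java
// Hypothetical helper for illustration; not part of galimatias.
final class WhitespaceErrorMessages {

    // Returns a message suffix that names the offending whitespace character.
    static String describe(int codePoint) {
        switch (codePoint) {
            case ' ':
                return "contains space character";
            case '\t':
                return "contains tab character";
            case '\n':
            case '\r': // assumption: CR is reported as a newline too
                return "contains newline";
            default:
                // Fallback that includes the code point in the message.
                return String.format("contains invalid character U+%04X", codePoint);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(' '));  // contains space character
        System.out.println(describe('<'));  // contains invalid character U+003C
    }
}
```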

Decide on the behaviour of default/empty URL fields

I've changed to match WHATWG and W3C specs here, but that's an important deviation from java.net.URI and java.net.URL (including Android implementation).

java.net.URI, since it performs no normalization operation at all, accepts both empty and null userInfo. In our case, userInfo is never null. The same happens with authority and other fields.

We probably want this to stay this way, but adding some warnings to the javadocs of the corresponding methods would be nice.

Provide command line client

A command line client would be a useful tool for URL debugging, as well as a good showcase for galimatias.

coveralls integration is broken

[ERROR] Failed to execute goal org.eluder.coveralls:coveralls-maven-plugin:2.1.0:cobertura (default-cli) on project galimatias: IO operation failed: /home/travis/build/smola/galimatias/target/site/cobertura/coverage.xml (No such file or directory) -> [Help 1]

It was broken here:
57f699e

Need feedback: Describe your use cases for non-HTTP(S) URIs

Galimatias was born with the goal of parsing URLs that can be opened in a web browser. For my use cases, this included http, https and data. It soon became obvious that support for ftp, gopher, ws, wss and file would be sane and cheap to add since they are supported by most modern browsers and are specified in WHATWG's URL Standard.

Support for any other scheme is currently in place in a limited way. If there is a double slash ("//") after the scheme (e.g. "git://"), the URL is parsed as a hierarchical URI. Otherwise, it is parsed as an opaque URI. This is known to work with any URI except ed2k links (which are far from compliant with RFC 3986).

Before going on and overengineering anything, I would like to hear about your use cases with handling URIs other than http and https in Java. I will use this feedback to better define the scope of galimatias.

Thank you for your time!

URL parsing does not convert characters: [ and ] to correct presentation.

Example: System.out.println(URL.parse("http://test.com/path=[test]").toString());

Output: http://test.com/path=[test]

However, if toJavaURI is used:
System.out.println(URL.parse("http://test.com/path=[test]").toJavaURI().toString());

Output: http://test.com/path=%5Btest%5D

As far as I understand, the first case should return the same result as the second, since it should be the same converted string.

There is a method "toHumanString()" which in fact should return (and returns) what is now incorrectly returned by "toString()".

Process hashbangs

A hashbang (#!) should be converted to / by a crawler that needs to fetch the page without using JavaScript.

Check for DNS length limits

DNS imposes a limit of 63 bytes on each label, a maximum of 127 labels, and a maximum of 253 characters. We should check for these limits.

This is where empty labels (except the root label) should throw an error.
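A rough sketch of such a check is shown below, assuming the domain has already been converted to ASCII so that one character equals one byte. DnsLengthLimits, firstViolation and the message strings are hypothetical and only illustrate the three limits plus the empty-label rule.

```java
// Hypothetical helper for illustration; not part of galimatias.
final class DnsLengthLimits {
    static final int MAX_LABEL_LENGTH = 63;
    static final int MAX_NAME_LENGTH = 253;
    static final int MAX_LABELS = 127;

    // Returns a description of the first violated limit, or null if none is violated.
    // Assumes the input is an ASCII (already Punycode-encoded) domain.
    static String firstViolation(String asciiDomain) {
        // A single trailing dot denotes the root label and is not checked.
        String name = asciiDomain.endsWith(".")
                ? asciiDomain.substring(0, asciiDomain.length() - 1)
                : asciiDomain;
        if (name.isEmpty()) {
            return "empty domain";
        }
        if (name.length() > MAX_NAME_LENGTH) {
            return "domain longer than 253 characters";
        }
        // Limit -1 keeps trailing empty strings, so "a.." yields an empty label.
        String[] labels = name.split("\\.", -1);
        if (labels.length > MAX_LABELS) {
            return "more than 127 labels";
        }
        for (String label : labels) {
            if (label.isEmpty()) {
                return "empty label";
            }
            if (label.length() > MAX_LABEL_LENGTH) {
                return "label longer than 63 bytes";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstViolation("example.com"));  // null
        System.out.println(firstViolation("a..b"));         // empty label
    }
}
```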

Optionally normalize empty path segments ("//" and trailing slash in path)

Multiple slashes

Multiple slashes together are ok with standards and have a different meaning than just one slash. That is: "/foo/bar" should be translated to path segments "foo" and "bar", while "/foo//bar" is "foo", "" and "bar".

Some people use significant empty segments in their paths (see this). However, the most common case is that multiple slashes are not significant and are produced as an unintended consequence of bad serialization.
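To make the segment semantics concrete, here is a small sketch that splits a path into segments and, optionally, drops empty inner segments. PathSegments, split and collapseEmpty are hypothetical names, and keeping a final empty segment (the trailing slash) is a choice made for this sketch only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of galimatias.
final class PathSegments {

    // Splits a path into segments, preserving empty ones.
    static List<String> split(String path) {
        String rest = path.startsWith("/") ? path.substring(1) : path;
        // Limit -1 keeps trailing empty strings, so "/foo/" ends with "".
        return Arrays.asList(rest.split("/", -1));
    }

    // Optional, lossy normalization: drops empty segments except a final one.
    static List<String> collapseEmpty(List<String> segments) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < segments.size(); i++) {
            String segment = segments.get(i);
            boolean isLast = i == segments.size() - 1;
            if (!segment.isEmpty() || isLast) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("/foo/bar"));                 // [foo, bar]
        System.out.println(split("/foo//bar"));                // [foo, , bar]
        System.out.println(collapseEmpty(split("/foo//bar"))); // [foo, bar]
    }
}
```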

Trailing slash

It's generally accepted that a trailing slash can be added to a URL path if there is no "file extension" (e.g. /foo -> /foo/ but not /foo.html -> /foo.html/). However, that changes semantics according to RFC 3986 and might break well-formed URLs in lots of cases.

Further considerations

Both of these normalizations can break standard-compliant URLs. So they should be optional and the user should be warned. Also, when to perform this normalization (during parsing or after parsing) is important, since it can change the result of /../.

Proper processing of these cases (as Google seems to be doing) is normalizing according to the result of fetching the URL and processing redirects and <link rel="canonical">.

Because of all of this, I still doubt that providing these normalizations in Galimatias is a sane choice.

Refine Host API

Checking if a host is a domain or IP shouldn't require instanceof.

Distinguish between different types of parse exceptions

I'm using your library to take URLs supplied by users, which may contain invalid syntax such as spaces, and convert them to valid URIs with URL.parse(it).toJavaURI().

One additional case that I'd like to cover is when the user leaves out the scheme. In this case, I would like to default it to http. Currently, I'm doing this by catching GalimatiasParseException, and checking to see if the exception's message is "Missing scheme".

This is working well for me, but checking the exception's message is very brittle. I'd like to suggest that there be a few subclasses of GalimatiasParseException, including something like MissingSchemeException, so that it can be captured in a safer way.

I'd be happy to submit a pull request if you think that this is a worthwhile enhancement.
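The suggestion could look roughly like the sketch below. Everything in it is hypothetical: UrlParseException stands in for GalimatiasParseException, MissingSchemeException is the proposed subclass, and parse is a toy stand-in for a real parser that only checks whether the input starts with something shaped like a scheme.

```java
// Self-contained illustration; none of these types are galimatias classes.
final class SchemeDefaultingSketch {

    // Stand-in for the library's general parse exception.
    static class UrlParseException extends Exception {
        UrlParseException(String message) {
            super(message);
        }
    }

    // The proposed, more specific subclass.
    static final class MissingSchemeException extends UrlParseException {
        MissingSchemeException() {
            super("Missing scheme");
        }
    }

    // Toy parser: accepts the input if it starts with "scheme:".
    // Note that "host:8080" style input would be misread as having a scheme.
    static String parse(String input) throws UrlParseException {
        if (!input.matches("[A-Za-z][A-Za-z0-9+.\\-]*:.*")) {
            throw new MissingSchemeException();
        }
        return input;
    }

    // Catches the specific type instead of comparing exception messages.
    static String parseWithDefaultScheme(String input) throws UrlParseException {
        try {
            return parse(input);
        } catch (MissingSchemeException e) {
            return parse("http://" + input);
        }
    }

    public static void main(String[] args) throws UrlParseException {
        System.out.println(parseWithDefaultScheme("example.com"));         // http://example.com
        System.out.println(parseWithDefaultScheme("https://example.com")); // https://example.com
    }
}
```

Catching the subclass keeps other parse failures propagating as before, which a message comparison cannot guarantee if the message text ever changes.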

Expose host parsing error position to the URL parser

Currently, any host error message is exposed to the ErrorHandler of a URLParser as a GalimatiasException with position to the beginning of the host. In order to get the actual position of the error, one must get the wrapped GalimatiasErrorException and calculate it. It might be nice to fix this in the future.

Regression in error reporting for bad fragments

For illegal characters in fragments, galimatias now unexpectedly reports Illegal character in path segment rather than Illegal character in fragment as expected.

v0.2.1

$ java -cp dependencies/galimatias-0.2.1.jar:dependencies/icu4j-53_1.jar io.mola.galimatias.cli.CLI http://foo/path#f#g
Base: http://example.org/foo/bar
Analyzing URL: http://foo/path#f#g
Parsing...
    Recoverable error found;
        Error: Illegal character in path segment: not a URL code point
        Position: 17

v0.1.0

$ java -cp dependencies/galimatias-0.1.0.jar:dependencies/icu4j-53_1.jar io.mola.galimatias.cli.CLI http://foo/path#f#g
Base: http://example.org/foo/bar
Analyzing URL: http://foo/path#f#g
Parsing...
    Recoverable error found;
        Error: Illegal character in fragment: not a URL code point
        Position: 17

The actual results of parsing for this case are the same with v0.2.1 as with v0.1.0; the only difference is the error message reporting "path segment" instead of "fragment".

Add a setting to use a default scheme for parsing

As discussed in #35, I'll try to add the possibility to use a default scheme via URLParsingSettings. This is useful for parsing URLs introduced by users where a full absolute URL is expected but the user misses the scheme (e.g. example.com, not http://example.com).

Percent-decode domain before parsing it

WHATWG specifies domain parsing as:

  • Let host be the result of running utf-8's decoder on the percent decoding of input.
  • Let domain be the result of splitting host on any domain label separators.
  • Return the result of running domain to ASCII on domain.

However, this behaviour does not seem consistent across browsers. At the moment, we'll just follow the spec here.
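The three spec steps quoted above can be sketched with JDK classes only, as below. DomainParsingSketch and its methods are hypothetical names. java.net.IDN is used as a stand-in for "domain to ASCII": it implements IDNA2003 and handles the label separators itself, so the explicit splitting step is folded into that call, and the results may differ from what the spec or galimatias produce for edge cases.

```java
import java.io.ByteArrayOutputStream;
import java.net.IDN;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of galimatias.
final class DomainParsingSketch {

    private static int hex(byte b) {
        if (b >= '0' && b <= '9') return b - '0';
        if (b >= 'a' && b <= 'f') return b - 'a' + 10;
        if (b >= 'A' && b <= 'F') return b - 'A' + 10;
        return -1;
    }

    // Step 1: percent-decode to bytes, then run a UTF-8 decoder on them.
    static String percentDecode(String input) {
        byte[] in = input.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '%' && i + 2 < in.length
                    && hex(in[i + 1]) >= 0 && hex(in[i + 2]) >= 0) {
                out.write((hex(in[i + 1]) << 4) | hex(in[i + 2]));
                i += 2;
            } else {
                // Malformed or absent escapes are copied through unchanged.
                out.write(in[i]);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Steps 2 and 3: label splitting and ToASCII, delegated to java.net.IDN.
    static String parseDomain(String input) {
        return IDN.toASCII(percentDecode(input));
    }

    public static void main(String[] args) {
        System.out.println(parseDomain("ex%61mple.com"));   // example.com
        System.out.println(parseDomain("m%C3%BCnchen.de")); // xn--mnchen-3ya.de
    }
}
```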

Add encoding/decoding utils

Utils that we should provide:

  • encodePathSegment / decodePathSegment
  • encodePath / decodePath
  • encodeQuery / decodeQuery
  • encodeFragment / decodeFragment
  • encodeSchemeData / decodeSchemeData ?

Host encoding/decoding is already provided through the Host class and its subclasses.
