ooxi / jdatauri

Simple and well tested one file Java data URI parser

Home Page: https://ooxi.github.io/jdatauri/

License: Other

Java 99.06% Shell 0.94%

jdatauri's Introduction

jDataUri

jDataUri is a simple, well-tested Java implementation for parsing data URIs. It lives in a single, nearly self-contained file (only a Base64 decoder is needed in addition).

jDataUri is licensed under the zlib/libpng License

'data' URI Syntax

"data:" + valueList + "," + data

"data:" is case-insensitive.
valueList is a ";"-separated list of values. The first value is a percent-encoded value representing the mime type. Although "/" and "+" in the mime type can be percent-encoded as "%2F" and "%2B" respectively, they are not required to be.
For the rest of the values in valueList, if the value does not contain a "=", then it is a content-encoding value (like base64). If the value does contain a "=", then the value is a name=value pair where name and value are percent-encoded representations of a name and value.
Valid names in a name=value pair are "charset", "filename" and "content-disposition".
The mime type of the URI is the mime type value after percent-decoding it, trimming leading and trailing white-space from it and converting it to lowercase. If the value is then empty, the mime type for the URI is "text/plain".
For content-encoding values, before determining the content encoding value, it must be percent-decoded, stripped of leading and trailing white-space and converted to lowercase. Further, the value is not used if its length is 0. When there are multiple content-encodings in the valueList, the first non-empty one (if any) that has a supported value is used and the rest are ignored. Supported values may differ between UAs. "base64" is usually the only supported value at the moment.
If a value is determined to be a name=value pair, the value must be split by the first "=". The value on the left-side of the "=" is the name and the value on the right-side of the first "=" is the value. Then, the name and value must each be percent-decoded, stripped of leading and trailing white-space and converted to lowercase. Further, if more than one value in the valueList contains a name=value pair with the same name, the first one with a non-empty value (if any) is used and the rest of the duplicates are ignored.
If a charset value is not found, then it defaults to US-ASCII.
data is a percent-encoded value representing the file's data. To get the data, it must be percent-decoded.
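
The mime type rule above (percent-decode, trim, lowercase, fall back to "text/plain") can be sketched in Java roughly as follows. This is only an illustration and not jDataUri's actual code: the class MimeTypeSketch and its method names are hypothetical, and the decoder assumes UTF-8, leaves "+" untouched and copies malformed escapes literally.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class, not part of jDataUri.
final class MimeTypeSketch {

    /** Percent-decodes s as UTF-8; '+' is kept and malformed escapes are copied as-is. */
    static String percentDecode(String s) {
        byte[] in = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        for (int i = 0; i < in.length; i++) {
            boolean hasTwoMore = i + 2 < in.length;
            int hi = hasTwoMore ? hex(in[i + 1]) : -1;
            int lo = hasTwoMore ? hex(in[i + 2]) : -1;
            if (in[i] == '%' && hi >= 0 && lo >= 0) {
                out.write((hi << 4) | lo); // "%2F" becomes the single byte 0x2F
                i += 2;
            } else {
                out.write(in[i]);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Returns the value of an ASCII hex digit, or -1 if b is not one. */
    private static int hex(byte b) {
        if (b >= '0' && b <= '9') return b - '0';
        if (b >= 'a' && b <= 'f') return b - 'a' + 10;
        if (b >= 'A' && b <= 'F') return b - 'A' + 10;
        return -1;
    }

    /** Applies the rule: percent-decode, trim, lowercase, default to text/plain. */
    static String mimeTypeOf(String firstValue) {
        String decoded = percentDecode(firstValue).trim().toLowerCase(Locale.ROOT);
        return decoded.isEmpty() ? "text/plain" : decoded;
    }
}
```

With this sketch, an empty first value yields "text/plain", and a value such as " Image%2FSVG%2Bxml " is normalized to "image/svg+xml".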

'data' URI Parsing Rules

This document was originally hosted at shadow2531.com

Let URI be the string representing the data URI.

If URI does not start with a case-insensitive "data:":
    Throw a MALFORMED_URI exception.

If URI does not contain a ",":
    Throw a MALFORMED_URI exception.
Let supportedContentEncodings be an array of strings representing the supported content encodings. (["base64"] for example)
Let mimeType be a string with the value "text/plain".
Let contentEncoding be an empty string.
Let contentEncodingAlreadySet be a boolean with a value of false.
Let supportedValues be a map of string:string pairs where the first string in each pair represents the name of the supported value and the second string in each pair represents an empty string or default string value. (Example: {"charset" : "", "filename" : "", "content-disposition" : ""})
Let supportedValueSetBits be a map of string:bool pairs representing each of the names in supportedValues with each name set to false.
Let comma be the position of the first "," found in URI.
Let temp be the substring of URI from, and including, position 5 to, and excluding, the comma position. (between "data:" and first ",")
Let headers be an array of strings returned by splitting temp by ";".

For each string s in headers:
    Let s equal the lowercase version of s
    Let eq be the position result of searching for "=" in s.
    Let name and value be empty strings.

    If eq is not a valid position in s:
        Let name equal the result of percent-decoding s.
        Let name equal the result of trimming leading and trailing white-space from name.

    Else:
        Let name equal the substring of s from position 0 to, but not including, position eq.
        Let name equal the result of percent-decoding name.
        Let name equal the result of trimming leading and trailing white-space from name.
        Let value equal the substring of s from position eq + 1 to the end of s.
        Let value equal the result of percent-decoding value.
        Let value equal the result of trimming leading and trailing white-space from value.

    If s is the first element in headers and eq is not a valid position in s and the length of name is greater than 0:
        Let mimeType equal name.

    Else:

        If eq is not a valid position in s:

            If name is found case-insensitively in supportedContentEncodings:

                If contentEncodingAlreadySet is false:
                    Let contentEncoding equal name.
                    Let contentEncodingAlreadySet equal true.

        Else:

            If the length of value is greater than 0 and name is found case-insensitively in supportedValues:

                If the corresponding value for name found (case-insensitively) in supportedValueSetBits is false:
                    Let the corresponding value for name found (case-insensitively) in supportedValues equal value.
                    Let the corresponding value for name found (case-insensitively) in supportedValueSetBits equal true.
Let data be the substring of URI from position comma + 1 to the end of URI.
Let data be the result of percent-decoding data.
Let dataURIObject be an object consisting of the mimeType, contentEncoding, data and supportedValues objects.
return dataURIObject.
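
The parsing rules above can be followed step by step in Java. The sketch below is an illustration under stated assumptions, not jDataUri's implementation: the class DataUriRulesSketch is hypothetical, MALFORMED_URI is represented by an IllegalArgumentException, the data is percent-decoded into a UTF-8 string, and "first one wins" is tracked by checking whether a value is still empty instead of keeping a separate bit map.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the rules above, not part of jDataUri.
final class DataUriRulesSketch {
    static final List<String> SUPPORTED_ENCODINGS = Arrays.asList("base64");
    static final List<String> SUPPORTED_NAMES =
            Arrays.asList("charset", "filename", "content-disposition");

    final String mimeType;
    final String contentEncoding;
    final String data;
    final Map<String, String> values;

    private DataUriRulesSketch(String mimeType, String contentEncoding,
                               String data, Map<String, String> values) {
        this.mimeType = mimeType;
        this.contentEncoding = contentEncoding;
        this.data = data;
        this.values = values;
    }

    static DataUriRulesSketch parse(String uri) {
        if (!uri.regionMatches(true, 0, "data:", 0, 5)) {
            throw new IllegalArgumentException("MALFORMED_URI: missing \"data:\" prefix");
        }
        int comma = uri.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("MALFORMED_URI: missing \",\"");
        }
        String mimeType = "text/plain";
        String contentEncoding = "";
        Map<String, String> values = new LinkedHashMap<>();
        for (String name : SUPPORTED_NAMES) {
            values.put(name, ""); // an empty string means "not set yet"
        }
        String[] headers = uri.substring(5, comma).split(";", -1);
        for (int i = 0; i < headers.length; i++) {
            String s = headers[i].toLowerCase(Locale.ROOT);
            int eq = s.indexOf('=');
            if (eq < 0) {
                String name = percentDecode(s).trim();
                if (i == 0 && !name.isEmpty()) {
                    mimeType = name;
                } else if (contentEncoding.isEmpty() && SUPPORTED_ENCODINGS.contains(name)) {
                    contentEncoding = name; // first supported encoding wins
                }
            } else {
                String name = percentDecode(s.substring(0, eq)).trim();
                String value = percentDecode(s.substring(eq + 1)).trim();
                // Only supported names map to ""; unknown names return null here.
                if (!value.isEmpty() && "".equals(values.get(name))) {
                    values.put(name, value);
                }
            }
        }
        String data = percentDecode(uri.substring(comma + 1));
        return new DataUriRulesSketch(mimeType, contentEncoding, data, values);
    }

    /** Percent-decodes s as UTF-8; '+' is kept and malformed escapes are copied as-is. */
    static String percentDecode(String s) {
        byte[] in = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        for (int i = 0; i < in.length; i++) {
            boolean hasTwoMore = i + 2 < in.length;
            int hi = hasTwoMore ? hex(in[i + 1]) : -1;
            int lo = hasTwoMore ? hex(in[i + 2]) : -1;
            if (in[i] == '%' && hi >= 0 && lo >= 0) {
                out.write((hi << 4) | lo);
                i += 2;
            } else {
                out.write(in[i]);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    private static int hex(byte b) {
        if (b >= '0' && b <= '9') return b - '0';
        if (b >= 'a' && b <= 'f') return b - 'a' + 10;
        if (b >= 'A' && b <= 'F') return b - 'A' + 10;
        return -1;
    }
}
```

Note that, as in the rules, each header is lowercased before it is percent-decoded, so characters produced by decoding keep their case, and values such as a filename are lowercased along with everything else.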

jdatauri's People

germantech silentmatt tstevens 1connect robinkanters avodonosov pramoth jayv skissane xuehuiniaoyu ccamel ruihang

jdatauri's Issues

Use JDK provided base64 encoder
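
Since Java 8 the JDK ships java.util.Base64, so an external decoder may no longer be necessary. A minimal sketch of decoding a base64 payload with it is shown here; the class JdkBase64Sketch is hypothetical, and the choice of the strict basic decoder (which rejects whitespace and line breaks) is an assumption.

```java
import java.util.Base64;

// Hypothetical helper, not part of jDataUri.
final class JdkBase64Sketch {

    /** Decodes the part of a data URI after the "," using the JDK's basic Base64 decoder. */
    static byte[] decodePayload(String base64Payload) {
        // Throws IllegalArgumentException if the payload is not valid Base64.
        return Base64.getDecoder().decode(base64Payload);
    }
}
```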

Reenable doclint

In order to build on JDK 8+ doclint has been disabled in 1cd2edd.

We should fix those warnings and enable doclint again.

Javadoc Warnings

Building jdatauri results in numerous warnings

Javadoc Warnings
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:43: warning - Tag @see:illegal character: "58" in "https://tools.ietf.org/html/rfc2397"
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:43: warning - Tag @see:illegal character: "47" in "https://tools.ietf.org/html/rfc2397"
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:43: warning - Tag @see:illegal character: "47" in "https://tools.ietf.org/html/rfc2397"
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:43: warning - Tag @see:illegal character: "47" in "https://tools.ietf.org/html/rfc2397"
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:43: warning - Tag @see:illegal character: "47" in "https://tools.ietf.org/html/rfc2397"
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:43: warning - Tag @see:illegal character: "58" in "http://shadow2531.com/opera/testcases/datauri/data_uri_rules.html"
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:43: warning - Tag @see:illegal character: "47" in "http://shadow2531.com/opera/testcases/datauri/data_uri_rules.html"
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:43: warning - Tag @see:illegal character: "47" in "http://shadow2531.com/opera/testcases/datauri/data_uri_rules.html"
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:43: warning - Tag @see:illegal character: "47" in "http://shadow2531.com/opera/testcases/datauri/data_uri_rules.html"
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:43: warning - Tag @see:illegal character: "47" in "http://shadow2531.com/opera/testcases/datauri/data_uri_rules.html"
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:43: warning - Tag @see:illegal character: "47" in "http://shadow2531.com/opera/testcases/datauri/data_uri_rules.html"
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:43: warning - Tag @see:illegal character: "47" in "http://shadow2531.com/opera/testcases/datauri/data_uri_rules.html"
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:43: warning - Tag @see:illegal character: "58" in "https://en.wikipedia.org/wiki/Data_URI_scheme"
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:43: warning - Tag @see:illegal character: "47" in "https://en.wikipedia.org/wiki/Data_URI_scheme"
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:43: warning - Tag @see:illegal character: "47" in "https://en.wikipedia.org/wiki/Data_URI_scheme"
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:43: warning - Tag @see:illegal character: "47" in "https://en.wikipedia.org/wiki/Data_URI_scheme"
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:43: warning - Tag @see:illegal character: "47" in "https://en.wikipedia.org/wiki/Data_URI_scheme"
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:43: warning - Tag @see: reference not found: https://tools.ietf.org/html/rfc2397
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:43: warning - Tag @see: reference not found: http://shadow2531.com/opera/testcases/datauri/data_uri_rules.html
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:43: warning - Tag @see: reference not found: https://en.wikipedia.org/wiki/Data_URI_scheme
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:85: warning - @warning is an unknown tag.
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:92: warning - @warning is an unknown tag.
/jdatauri/src/main/java/com/github/ooxi/jdatauri/DataUri.java:99: warning - @warning is an unknown tag.

Build fails on Travis CI

Since the build configuration signs artefacts by default, the build will fail.

Method to encode a Data URI (probably `toString()`)

Although encoding a data URI is trivial, it would be nice if this were a convenience method of DataUri (probably toString()).
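
One possible shape for such a method is sketched below. It is not an existing jDataUri API: the class DataUriEncodeSketch and the method toDataUri are hypothetical, the output always uses base64, and the mime type is assumed to need no percent-encoding.

```java
import java.nio.charset.Charset;
import java.util.Base64;

// Hypothetical encoder sketch, not part of jDataUri.
final class DataUriEncodeSketch {

    /** Builds "data:<mime>[;charset=<name>];base64,<payload>" from the given parts. */
    static String toDataUri(String mimeType, Charset charset, byte[] data) {
        StringBuilder sb = new StringBuilder("data:").append(mimeType);
        if (charset != null) {
            sb.append(";charset=").append(charset.name());
        }
        sb.append(";base64,").append(Base64.getEncoder().encodeToString(data));
        return sb.toString();
    }
}
```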

Publish to Maven Central repository

This would help Maven / Gradle users a lot :)

Unit test should cover equals and hashCode

spaces are incorrectly replaced by +

Steps:

Here is the data URI for the phrase "Hello, how do you do?" :

data:text/plain;charset=utf-8,Hello%2C%20how%20do%20you%20do%3F

It was generated at https://dopiaza.org/tools/datauri/index.php, without base64 encoding.

    DataUri dataUri = DataUri.parse("data:text/plain;charset=utf-8,Hello%2C%20how%20do%20you%20do%3F", UTF_8);
    System.out.println(new String(dataUri.getData(), dataUri.getCharset()));

=> Hello,+how+do+you+do?

Expected: Hello, how do you do?

The fix:

Below are some doubts about mime types in data URIs, but pull request #11 provides a fix consistent with the current approach. This fix just updates the percentDecode's workaround for '+' for the case when data by itself has space characters in contrast to space characters created from pluses by URLDecoder.

Obviously, this problem is caused by this commit: 4f5b6da.

But it's not clear what motivated this commit; unfortunately, no unit test was added. URLDecoder is the right approach to decode values from a URI, so why should we undo the '+' to ' ' decoding it performs? If something in a data: URI needs to have a + in it, the + should be URL-encoded as %2B. That's how I read RFC 2397.

Is the problem with mime types containing '+'? For example, https://dopiaza.org/tools/datauri/index.php generates the following data URL for content "1+1=2" and mime type application/atom+xml without base64: data:application/atom+xml;charset=utf-8,%3Ca%3E1%2B1%3D2%3C%2Fa%3E. Here the + in the data is URL-encoded as %2B, while the + in the mime type is left unencoded. In my understanding of the RFC, that is wrong. Anyway, even if leaving the + unencoded in the mime type is correct, we should only undo the + to space decoding when parsing the mime type, not when parsing the data.

Why I think the RFC requires URL-encoding + in the mime type: Section 3 "Syntax" of the RFC says:

"type", "subtype", "attribute" and "value" are the corresponding tokens from [RFC2045], represented using URL escaped encoding of [RFC2396] as necessary.

Note, "URL escaped encoding of [RFC2396] as necessary"
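
The behaviour under discussion can be reproduced with the JDK alone. The sketch below is not the fix from pull request #11 and not jDataUri's percentDecode; the class PlusDecodingSketch and its methods are hypothetical. It shows that URLDecoder turns a literal "+" into a space, and one common way to keep literal "+" characters while still decoding %20 to a space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demonstration class, not part of jDataUri.
final class PlusDecodingSketch {

    /** Form-style decoding as done by URLDecoder: "%20" and "+" both become ' '. */
    static String formDecode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Escapes literal '+' as "%2B" first, so only percent escapes are decoded. */
    static String percentDecodeKeepingPlus(String s) {
        return formDecode(s.replace("+", "%2B"));
    }
}
```

Because the "+" characters are escaped before decoding rather than restored afterwards, spaces that come from %20 are never confused with spaces that URLDecoder produced from "+".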

Is filename and content-disposition part of the datauri standard?

I know charset is recognized by most browser parsers (and Wikipedia). It seems like datauri is a dead standard, where most people only keep to rfc3986.

However, I have never seen "filename" and "content-disposition" in use in the wild.

I'm trying to extend datauri in a manner that is in keeping with how it would be extended if used in a QR code context (i.e. split across multiple QR codes), for a pet project, TagDrop. So if there is an existing extended standard that I should be aware of to avoid conflicts, I would like to know about it (e.g. a recommended key name for sequence number, CRC, compression type, MD5 hash, etc.).

Reenable code coverage

JUnit code coverage report had to be disabled in 3be9d34 in order to be compatible with Java 17.