Git Product home page Git Product logo

Regression: inconsistent results between LicenseCompareHelper.isTextStandardLicense(), LicenseCompareHelper.isStandardLicenseWithinText(), and LicenseCompareHelper.matchingStandardLicenseIdsWithinText() about spdx-java-library HOT 22 CLOSED

pmonks avatar pmonks commented on July 22, 2024
Regression: inconsistent results between LicenseCompareHelper.isTextStandardLicense(), LicenseCompareHelper.isStandardLicenseWithinText(), and LicenseCompareHelper.matchingStandardLicenseIdsWithinText()

from spdx-java-library.

Comments (22)

goneall avatar goneall commented on July 22, 2024 1

@pmonks Thanks for tracking this down - agree it is an issue.

Just FYI - I'm going to be working on upgrading the library to the 3.0 spec over the next couple weeks, so it may be a while before I can work on this issue.

from spdx-java-library.

goneall avatar goneall commented on July 22, 2024 1

@pmonks - thanks for the analysis.

DOTALL and CASE_INSENSITIVE are added in the Java pattern. Since we are getting patterns passed in from the template and most (if not all) of the regexes for the alt element for the license list XML's don't specify matching new lines, this is overridden in the Java code itself.

There is some logic to change from greedy to non-greedy regexes - I think that's where the error may be. I'll do a bit more research.

from spdx-java-library.

pmonks avatar pmonks commented on July 22, 2024

Update: after re-testing with v1.1.5 and v1.1.6 this appears to be caused by a change to the license list XML for BSD-2-Clause in v3.21 of the license list, probably this commit. That said, it still appears to be a bug in Spdx-Java-library, as my understanding is that these methods should always behave consistently with one another when given the same (singleton) license text, regardless of how the license list XML changes.

[edit] this may be a furphy, given that the same inconsistency seems to be happening with other licenses (see next comment).

from spdx-java-library.

pmonks avatar pmonks commented on July 22, 2024

I'm also seeing the same inconsistencies between the outputs of these methods when testing with the official WTFPL license text:

  • !isTextStandardLicense().isDifferenceFound() returns true
  • isStandardLicenseWithinText() returns false
  • matchingStandardLicenseIdsWithinText() returns an empty collection ❌
  • matchingStandardLicenseIds() returns ["WTFPL"]

from spdx-java-library.

pmonks avatar pmonks commented on July 22, 2024

Yep no worries - no hurry on my end. Just reporting things as I see them!

from spdx-java-library.

pmonks avatar pmonks commented on July 22, 2024

Here's an MIT license text that's also triggering this exact same pattern of behaviour (some methods work as expected, others don't).

from spdx-java-library.

goneall avatar goneall commented on July 22, 2024

I've done some analysis and isolated the issue to the findTemplateWithinText(String text, String template) method.

Below is the Regex generated from the license template:

.?{0,5000}\QRedistribution\E\s*\Qand\E\s*\Quse\E\s*\Qin\E\s*\Qsource\E\s*\Qand\E\s*\Qbinary\E\s*\Qforms,\E\s*\Qwith\E\s*\Qor\E\s*\Qwithout\E\s*\Qmodification,\E\s*\Qare\E\s*\Qpermitted\E\s*\Qprovided\E\s*\Qthat\E\s*\Qthe\E\s*\Qfollowing\E\s*\Qconditions\E\s*\Qare\E\s*\Qmet:\E\s*.{0,20}\QRedistributions\E\s*\Qof\E\s*\Qsource\E\s*\Qcode\E\s*\Qmust\E\s*\Qretain\E\s*\Qthe\E\s*\Qabove\E\s*\Q-c-\E\s*\Qnotice,\E\s*\Qthis\E\s*\Qlist\E\s*\Qof\E\s*\Qconditions\E\s*\Qand\E\s*\Qthe\E\s*\Qfollowing\E\s*\Qdisclaimer.\E\s*.{0,20}\QRedistributions\E\s*\Qin\E\s*\Qbinary\E\s*\Qform\E\s*\Qmust\E\s*\Qreproduce\E\s*\Qthe\E\s*\Qabove\E\s*\Q-c-\E\s*\Qnotice,\E\s*\Qthis\E\s*\Qlist\E\s*\Qof\E\s*\Qconditions\E\s*\Qand\E\s*\Qthe\E\s*\Qfollowing\E\s*\Qdisclaimer\E\s*\Qin\E\s*\Qthe\E\s*\Qdocumentation\E\s*\Qand/or\E\s*\Qother\E\s*\Qmaterials\E\s*\Qprovided\E\s*\Qwith\E\s*\Qthe\E\s*\Qdistribution.\E\s*\QTHIS\E\s*.{0,5}\QIS\E\s*\QPROVIDED\E\s*\QBY\E\s*.{1,800}\Q"AS\E\s*\QIS"\E\s*\QAND\E\s*\QANY\E\s*EXPRESS(ED)?.{0,36000}\QPARTICULAR\E\s*\QPURPOSE\E\s*\QARE\E\s*\QDISCLAIMED.\E\s*\QIN\E\s*\QNO\E\s*\QEVENT\E\s*\QSHALL\E\s*.+\QBE\E\s*\QLIABLE\E\s*\QFOR\E\s*\QANY\E\s*\QDIRECT,\E\s*\QINDIRECT,\E\s*\QINCIDENTAL,\E\s*\QSPECIAL,\E\s*\QEXEMPLARY,\E\s*\QOR\E\s*\QCONSEQUENTIAL\E\s*\QDAMAGES\E\s*\Q(INCLUDING,\E\s*\QBUT\E\s*\QNOT\E\s*\QLIMITED\E\s*\QTO,\E\s*\QPROCUREMENT\E\s*\QOF\E\s*\QSUBSTITUTE\E\s*\QGOODS\E\s*\QOR\E\s*\QSERVICES;\E\s*\QLOSS\E\s*\QOF\E\s*\QUSE,\E\s*\QDATA,\E\s*\QOR\E\s*\QPROFITS;\E\s*\QOR\E\s*\QBUSINESS\E\s*\QINTERRUPTION)\E\s*\QHOWEVER\E\s*\QCAUSED\E\s*\QAND\E\s*\QON\E\s*\QANY\E\s*\QTHEORY\E\s*\QOF\E\s*\QLIABILITY,\E\s*\QWHETHER\E\s*\QIN\E\s*\QCONTRACT,\E\s*\QSTRICT\E\s*\QLIABILITY,\E\s*\QOR\E\s*\QTORT\E\s*\Q(INCLUDING\E\s*\QNEGLIGENCE\E\s*\QOR\E\s*\QOTHERWISE)\E\s*\QARISING\E\s*\QIN\E\s*\QANY\E\s*\QWAY\E\s*\QOUT\E\s*\QOF\E\s*\QTHE\E\s*\QUSE\E\s*\QOF\E\s*\QTHIS\E\s*\Q,\E\s*\QEVEN\E\s*\QIF\E\s*\QADVISED\E\s*\QOF\E\s*\QTHE\E\s*\QPOSSIBILITY\E\s*\QOF\E\s*\QSUCH\E\s*\QDAMAGE.\E\s*

And below is the normalized text that should match the above regex:

-c- -c- 2015, Atlassian Pty Ltd
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

 Redistributions of source code must retain the above -c- notice, this
  list of conditions and the following disclaimer.

 Redistributions in binary form must reproduce the above -c- notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE -c- HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF merchantability AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE -c- HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Next step would be to analyze why the regex doesn't match.

from spdx-java-library.

pmonks avatar pmonks commented on July 22, 2024

@goneall thanks for pulling those out! Couple of things I found:

  • The .?{0,5000} regex fragment at the start doesn't look correct - it should only have either ? (match zero or one character) or {0,5000} (match zero to 5000 characters), but not both. I believe .{0,5000} is what we need here.
  • . in a Java regex won't match newlines by default, so .{0,5000} still won't match the multi-line Atlassian copyright header unless DOTALL mode is enabled. IME doing this using inline directives is better than doing it programmatically (it makes regexes easier to troubleshoot in isolation), so the fix here would be to start the regex with (?s).{0,5000}
  • <optional> tags in the template are handled incorrectly / inconsistently:
    • the first one in the template is turned into the regex fragment .{0,5} (around index 909 of the regex posted above), which isn't long enough for the word that appears there by default in the license text (SOFTWARE, length of 8).
    • the second one in the template doesn't appear in the regex at all - the default text is removed but there's no regex fragment in its place (around index 1891 of the regex)

I can't see any other obvious problems, though manually fixing the regex for each of these issues still didn't match the Atlassian BSD-2-Clause text, so I'm not certain that I've found everything. My suggestion would be to fix these identified issues, then start troubleshooting again with the improved regex if it's still not matching.

from spdx-java-library.

goneall avatar goneall commented on July 22, 2024

I did a bit more research - found where we're adding the .? when it should be just a ?. Unfortunately, it is still hitting the same issue.

This is probably an issue whenever we have optional or variable text at the beginning of the license template.

from spdx-java-library.

goneall avatar goneall commented on July 22, 2024

Narrowed down the issue.

The template THIS<<beginOptional>> SOFTWARE<<endOptional>> is being encoded in the regex as \s*\QTHIS\E\s*.{0,5}\QIS\E\s*. Note that there is no regex to accept the SOFTWARE token even though it is optional.

It looks like the code only allowing optional text to be 5 characters long.

What we should do is capture the optional text, tokenize it and make it a character group as optional.

An easier solution would be to use the actual length of the optional text rather than the hard coded length of 5.

Both of these are a bit of a design change.

A short term fix would be to just increase the length of the optional text to something more reasonable.

from spdx-java-library.

goneall avatar goneall commented on July 22, 2024

After changing the character limit to 50, I ran into a second part of the match which failed:

\QANY\E\s*EXPRESS(ED)?.{0,36000}\QPARTICULAR\E\s*

isn't matching:

ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF merchantability AND FITNESS FOR A PARTICULAR

which has me a bit baffled.

from spdx-java-library.

goneall avatar goneall commented on July 22, 2024

Figured it out - needed to capture the newlines - changing .{0,36000} to (.|\n){0,36000} works.

from spdx-java-library.

goneall avatar goneall commented on July 22, 2024

The above change caused a stack overflow in the regex engine when comparing the entire text - so it isn't a solution. Since we're using DOTALL, I'm not sure why we need the (.|\n) pattern.

Not quite sure where to go from here - other than replacing the regex with a hand build parser.

from spdx-java-library.

sdheh avatar sdheh commented on July 22, 2024

What if you only used the regex to find only the beginning of the non-optional license text and ran LicenseCompareHelper.isTextStandardLicense on the rest of the text starting from there? If there is no difference or DifferenceDescription.getDifferenceMessage() starts with "Additional text found after the end of the expected license text", then there is a match for the license.

from spdx-java-library.

goneall avatar goneall commented on July 22, 2024

What if you only used the regex to find only the beginning of the non-optional license text and ran LicenseCompareHelper.isTextStandardLicense on the rest of the text starting from there? If there is no difference or DifferenceDescription.getDifferenceMessage() starts with "Additional text found after the end of the expected license text", then there is a match for the license.

Thanks @sdheh - Sounds like a good idea. I'll try it and see if it works.

from spdx-java-library.

goneall avatar goneall commented on July 22, 2024

The checking for the difference message didn't work - I was getting inconsistent message.

However, I figured out a different approach that did work.

I created 2 different patterns - one for the start and one for the end and just ran the 2 different regexes to get the beginning and end of the license text.

I'll work on a pull request after testing out other scenarios.

from spdx-java-library.

sdheh avatar sdheh commented on July 22, 2024

The checking for the difference message didn't work - I was getting inconsistent message.

Would you please tell me of some cases where it didn't work? I tried doing this myself and didn't run into any problems.

from spdx-java-library.

goneall avatar goneall commented on July 22, 2024

Would you please tell me of some cases where it didn't work? I tried doing this myself and didn't run into any problems.

I didn't keep track of all the details, but it was one of the unit tests I wrote for the recent matching issues. The error message that came back wasn't not the expected "Additional text found ..." but a different message. Possibly related to a variable or optional.

Since the second approach worked, I didn't investigate further.

from spdx-java-library.

pmonks avatar pmonks commented on July 22, 2024

@goneall will this approach (having two separate "before" and "after" regexes) work if a template has many <<optional>> elements? From memory there are quite a few license templates in this category.

BTW I'm also stumped as to why .{0,36000} would not match when (.|\n){0,36000} does, provided DOTALL mode is enabled. In fact when I test this using inline directives it does work, which leads me to suspect that DOTALL mode is not being enabled programmatically in some cases (another argument in favour of inline directives).

Here's how I tested (note: Clojure code, but it boils down to JVM byte codes / Java regexes under the covers):

user=> (def re #"(?i)(?s)\QANY\E\s*EXPRESS(ED)?.{0,36000}\QPARTICULAR\E\s*")
#'user/re
user=> (def s "ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF merchantability AND FITNESS FOR A PARTICULAR")
#'user/s
user=> (re-matches re s)
["ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF merchantability AND FITNESS FOR A PARTICULAR" nil]
user=> (re-matches re "This does not match, just to show what that would look like.")
nil

from spdx-java-library.

goneall avatar goneall commented on July 22, 2024

will this approach (having two separate "before" and "after" regexes) work if a template has many <> elements?

The optional text is replaced by a match all regex .{0,xxx} where xxx is a limit. There is a separate issue where xxx is being set too small causing some licenses not to match. I'm working on this.

If you have a license that may not work, I can add it to the unit tests just to be sure.

BTW I'm too stumped as to why .{0,36000} would not match when (.|\n){0,36000} does, provided DOTALL mode is enabled.

Me too. here's the code that sets the DOTALL:
https://github.com/spdx/Spdx-Java-Library/blob/df468223821d6ee879288094a986f6df7bf49be9/src/main/java/org/spdx/utility/compare/LicenseCompareHelper.java#L796C3-L796C3

It doesn't look like there is any other code path that would not set DOTALL.

Let me know if you see anything wrong.

What is also strange is that the DOTALL works for other patterns that span lines.

from spdx-java-library.

pmonks avatar pmonks commented on July 22, 2024

Me too. here's the code that sets the DOTALL:
df46822/src/main/java/org/spdx/utility/compare/LicenseCompareHelper.java#L796C3-L796C3

Bizarre. I did a search for regex compilation throughout the code, just to see if there might be an alternative code path that's being used instead of that one, but I don't see anything obvious from that (regexes are compiled elsewhere, but they don't seem to be used in template matching).

from spdx-java-library.

goneall avatar goneall commented on July 22, 2024

This is now fixed with PR #221

from spdx-java-library.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.