Comments (22)
@pmonks Thanks for tracking this down - agree it is an issue.
Just FYI - I'm going to be working on upgrading the library to the 3.0 spec over the next couple weeks, so it may be a while before I can work on this issue.
from spdx-java-library.
@pmonks - thanks for the analysis.
DOTALL
and CASE_INSENSITIVE
are added in the Java pattern. Since we are getting patterns passed in from the template and most (if not all) of the regexes for the alt
element for the license list XML's don't specify matching new lines, this is overridden in the Java code itself.
There is some logic to change from greedy to non-greedy regexes - I think that's where the error may be. I'll do a bit more research.
from spdx-java-library.
Update: after re-testing with v1.1.5 and v1.1.6 this appears to be caused by a change to the license list XML for BSD-2-Clause
in v3.21 of the license list, probably this commit. That said, it still appears to be a bug in Spdx-Java-library, as my understanding is that these methods should always behave consistently with one another when given the same (singleton) license text, regardless of how the license list XML changes.
[edit] this may be a furphy, given that the same inconsistency seems to be happening with other licenses (see next comment).
from spdx-java-library.
I'm also seeing the same inconsistencies between the outputs of these methods when testing with the official WTFPL
license text:
!isTextStandardLicense().isDifferenceFound()
returnstrue
✅isStandardLicenseWithinText()
returnsfalse
❌matchingStandardLicenseIdsWithinText()
returns an empty collection ❌matchingStandardLicenseIds()
returns["WTFPL"]
✅
from spdx-java-library.
Yep no worries - no hurry on my end. Just reporting things as I see them!
from spdx-java-library.
Here's an MIT
license text that's also triggering this exact same pattern of behaviour (some methods work as expected, others don't).
from spdx-java-library.
I've done some analysis and isolated the issue to the findTemplateWithinText(String text, String template)
method.
Below is the Regex generated from the license template:
.?{0,5000}\QRedistribution\E\s*\Qand\E\s*\Quse\E\s*\Qin\E\s*\Qsource\E\s*\Qand\E\s*\Qbinary\E\s*\Qforms,\E\s*\Qwith\E\s*\Qor\E\s*\Qwithout\E\s*\Qmodification,\E\s*\Qare\E\s*\Qpermitted\E\s*\Qprovided\E\s*\Qthat\E\s*\Qthe\E\s*\Qfollowing\E\s*\Qconditions\E\s*\Qare\E\s*\Qmet:\E\s*.{0,20}\QRedistributions\E\s*\Qof\E\s*\Qsource\E\s*\Qcode\E\s*\Qmust\E\s*\Qretain\E\s*\Qthe\E\s*\Qabove\E\s*\Q-c-\E\s*\Qnotice,\E\s*\Qthis\E\s*\Qlist\E\s*\Qof\E\s*\Qconditions\E\s*\Qand\E\s*\Qthe\E\s*\Qfollowing\E\s*\Qdisclaimer.\E\s*.{0,20}\QRedistributions\E\s*\Qin\E\s*\Qbinary\E\s*\Qform\E\s*\Qmust\E\s*\Qreproduce\E\s*\Qthe\E\s*\Qabove\E\s*\Q-c-\E\s*\Qnotice,\E\s*\Qthis\E\s*\Qlist\E\s*\Qof\E\s*\Qconditions\E\s*\Qand\E\s*\Qthe\E\s*\Qfollowing\E\s*\Qdisclaimer\E\s*\Qin\E\s*\Qthe\E\s*\Qdocumentation\E\s*\Qand/or\E\s*\Qother\E\s*\Qmaterials\E\s*\Qprovided\E\s*\Qwith\E\s*\Qthe\E\s*\Qdistribution.\E\s*\QTHIS\E\s*.{0,5}\QIS\E\s*\QPROVIDED\E\s*\QBY\E\s*.{1,800}\Q"AS\E\s*\QIS"\E\s*\QAND\E\s*\QANY\E\s*EXPRESS(ED)?.{0,36000}\QPARTICULAR\E\s*\QPURPOSE\E\s*\QARE\E\s*\QDISCLAIMED.\E\s*\QIN\E\s*\QNO\E\s*\QEVENT\E\s*\QSHALL\E\s*.+\QBE\E\s*\QLIABLE\E\s*\QFOR\E\s*\QANY\E\s*\QDIRECT,\E\s*\QINDIRECT,\E\s*\QINCIDENTAL,\E\s*\QSPECIAL,\E\s*\QEXEMPLARY,\E\s*\QOR\E\s*\QCONSEQUENTIAL\E\s*\QDAMAGES\E\s*\Q(INCLUDING,\E\s*\QBUT\E\s*\QNOT\E\s*\QLIMITED\E\s*\QTO,\E\s*\QPROCUREMENT\E\s*\QOF\E\s*\QSUBSTITUTE\E\s*\QGOODS\E\s*\QOR\E\s*\QSERVICES;\E\s*\QLOSS\E\s*\QOF\E\s*\QUSE,\E\s*\QDATA,\E\s*\QOR\E\s*\QPROFITS;\E\s*\QOR\E\s*\QBUSINESS\E\s*\QINTERRUPTION)\E\s*\QHOWEVER\E\s*\QCAUSED\E\s*\QAND\E\s*\QON\E\s*\QANY\E\s*\QTHEORY\E\s*\QOF\E\s*\QLIABILITY,\E\s*\QWHETHER\E\s*\QIN\E\s*\QCONTRACT,\E\s*\QSTRICT\E\s*\QLIABILITY,\E\s*\QOR\E\s*\QTORT\E\s*\Q(INCLUDING\E\s*\QNEGLIGENCE\E\s*\QOR\E\s*\QOTHERWISE)\E\s*\QARISING\E\s*\QIN\E\s*\QANY\E\s*\QWAY\E\s*\QOUT\E\s*\QOF\E\s*\QTHE\E\s*\QUSE\E\s*\QOF\E\s*\QTHIS\E\s*\Q,\E\s*\QEVEN\E\s*\QIF\E\s*\QADVISED\E\s*\QOF\E\s*\QTHE\E\s*\QPOSSIBILITY\E\s*\QOF\E\s*\QSUCH\E\s*\QDAMAGE.\E\s*
And below is the normalized text that should match the above regex:
-c- -c- 2015, Atlassian Pty Ltd
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above -c- notice, this
list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above -c- notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE -c- HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF merchantability AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE -c- HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Next step would be to analyze why the regex doesn't match.
from spdx-java-library.
@goneall thanks for pulling those out! Couple of things I found:
- The
.?{0,5000}
regex fragment at the start doesn't look correct - it should only have either?
(match zero or one character) or{0,5000}
(match zero to 5000 characters), but not both. I believe.{0,5000}
is what we need here. .
in a Java regex won't match newlines by default, so.{0,5000}
still won't match the multi-line Atlassian copyright header unlessDOTALL
mode is enabled. IME doing this using inline directives is better than doing it programmatically (it makes regexes easier to troubleshoot in isolation), so the fix here would be to start the regex with(?s).{0,5000}
<optional>
tags in the template are handled incorrectly / inconsistently:- the first one in the template is turned into the regex fragment
.{0,5}
(around index 909 of the regex posted above), which isn't long enough for the word that appears there by default in the license text (SOFTWARE
, length of 8). - the second one in the template doesn't appear in the regex at all - the default text is removed but there's no regex fragment in its place (around index 1891 of the regex)
- the first one in the template is turned into the regex fragment
I can't see any other obvious problems, though manually fixing the regex for each of these issues still didn't match the Atlassian BSD-2-Clause text, so I'm not certain that I've found everything. My suggestion would be to fix these identified issues, then start troubleshooting again with the improved regex if it's still not matching.
from spdx-java-library.
I did a bit more research - found where we're adding the .?
when it should be just a ?
. Unfortunately, it is still hitting the same issue.
This is probably an issue whenever we have optional or variable text at the beginning of the license template.
from spdx-java-library.
Narrowed down the issue.
The template THIS<<beginOptional>> SOFTWARE<<endOptional>>
is being encoded in the regex as \s*\QTHIS\E\s*.{0,5}\QIS\E\s*
. Note that there is no regex to accept the SOFTWARE
token even though it is optional.
It looks like the code only allowing optional text to be 5 characters long.
What we should do is capture the optional text, tokenize it and make it a character group as optional.
An easier solution would be to use the actual length of the optional text rather than the hard coded length of 5.
Both of these are a bit of a design change.
A short term fix would be to just increase the length of the optional text to something more reasonable.
from spdx-java-library.
After changing the character limit to 50, I ran into a second part of the match which failed:
\QANY\E\s*EXPRESS(ED)?.{0,36000}\QPARTICULAR\E\s*
isn't matching:
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF merchantability AND FITNESS FOR A PARTICULAR
which has me a bit baffled.
from spdx-java-library.
Figured it out - needed to capture the newlines - changing .{0,36000}
to (.|\n){0,36000}
works.
from spdx-java-library.
The above change caused a stack overflow in the regex engine when comparing the entire text - so it isn't a solution. Since we're using DOTALL
, I'm not sure why we need the (.|\n)
pattern.
Not quite sure where to go from here - other than replacing the regex with a hand build parser.
from spdx-java-library.
What if you only used the regex to find only the beginning of the non-optional license text and ran LicenseCompareHelper.isTextStandardLicense on the rest of the text starting from there? If there is no difference or DifferenceDescription.getDifferenceMessage() starts with "Additional text found after the end of the expected license text", then there is a match for the license.
from spdx-java-library.
What if you only used the regex to find only the beginning of the non-optional license text and ran LicenseCompareHelper.isTextStandardLicense on the rest of the text starting from there? If there is no difference or DifferenceDescription.getDifferenceMessage() starts with "Additional text found after the end of the expected license text", then there is a match for the license.
Thanks @sdheh - Sounds like a good idea. I'll try it and see if it works.
from spdx-java-library.
The checking for the difference message didn't work - I was getting inconsistent message.
However, I figured out a different approach that did work.
I created 2 different patterns - one for the start and one for the end and just ran the 2 different regexes to get the beginning and end of the license text.
I'll work on a pull request after testing out other scenarios.
from spdx-java-library.
The checking for the difference message didn't work - I was getting inconsistent message.
Would you please tell me of some cases where it didn't work? I tried doing this myself and didn't run into any problems.
from spdx-java-library.
Would you please tell me of some cases where it didn't work? I tried doing this myself and didn't run into any problems.
I didn't keep track of all the details, but it was one of the unit tests I wrote for the recent matching issues. The error message that came back wasn't not the expected "Additional text found ..." but a different message. Possibly related to a variable or optional.
Since the second approach worked, I didn't investigate further.
from spdx-java-library.
@goneall will this approach (having two separate "before" and "after" regexes) work if a template has many <<optional>>
elements? From memory there are quite a few license templates in this category.
BTW I'm also stumped as to why .{0,36000}
would not match when (.|\n){0,36000}
does, provided DOTALL
mode is enabled. In fact when I test this using inline directives it does work, which leads me to suspect that DOTALL
mode is not being enabled programmatically in some cases (another argument in favour of inline directives).
Here's how I tested (note: Clojure code, but it boils down to JVM byte codes / Java regexes under the covers):
user=> (def re #"(?i)(?s)\QANY\E\s*EXPRESS(ED)?.{0,36000}\QPARTICULAR\E\s*")
#'user/re
user=> (def s "ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF merchantability AND FITNESS FOR A PARTICULAR")
#'user/s
user=> (re-matches re s)
["ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF merchantability AND FITNESS FOR A PARTICULAR" nil]
user=> (re-matches re "This does not match, just to show what that would look like.")
nil
from spdx-java-library.
will this approach (having two separate "before" and "after" regexes) work if a template has many <> elements?
The optional text is replaced by a match all regex .{0,xxx}
where xxx is a limit. There is a separate issue where xxx is being set too small causing some licenses not to match. I'm working on this.
If you have a license that may not work, I can add it to the unit tests just to be sure.
BTW I'm too stumped as to why .{0,36000} would not match when (.|\n){0,36000} does, provided DOTALL mode is enabled.
Me too. here's the code that sets the DOTALL
:
https://github.com/spdx/Spdx-Java-Library/blob/df468223821d6ee879288094a986f6df7bf49be9/src/main/java/org/spdx/utility/compare/LicenseCompareHelper.java#L796C3-L796C3
It doesn't look like there is any other code path that would not set DOTALL
.
Let me know if you see anything wrong.
What is also strange is that the DOTALL
works for other patterns that span lines.
from spdx-java-library.
Me too. here's the code that sets the
DOTALL
:
df46822
/src/main/java/org/spdx/utility/compare/LicenseCompareHelper.java#L796C3-L796C3
Bizarre. I did a search for regex compilation throughout the code, just to see if there might be an alternative code path that's being used instead of that one, but I don't see anything obvious from that (regexes are compiled elsewhere, but they don't seem to be used in template matching).
from spdx-java-library.
This is now fixed with PR #221
from spdx-java-library.
Related Issues (20)
- Update CI to flag any new quality issues
- LicenseCompareHelper: EPL-2.0 license text published by Eclipse Foundation not recognised HOT 3
- LicenseCompareHelper: BSD-2-Clause license text not recognised HOT 2
- In need of a canonical way to add license texts for licenseRefs / Maybe bugged? HOT 4
- LicenseCompareHelper: GPL-2.0 license text not recognised HOT 2
- Question HOT 2
- Refactor nonOptionalTextToPatterns HOT 2
- Verify error when license exceptions are used in an expression in LicenseConcluded but not in LicenseInfoInFile HOT 2
- Official Apache-1.1 license text is not being matched correctly by LicenseCompareHelper.matchingStandardLicenseIdsWithinText() HOT 1
- Official GPL-1.0 license text is not being matched correctly by LicenseCompareHelper.isTextStandardLicense() HOT 1
- Potential enhancement: LicenseCompareHelper.isTextStandardLicense().getDifferenceMessage() should sort by line then column
- Official CC-BY-4.0 license text is not being matched correctly by LicenseCompareHelper.isTextStandardLicense() HOT 3
- Inconsistent behaviour between LicenseCompareHelper methods when called with official MIT license text HOT 1
- How to normalize the text for regex matching in src/main/java/org/spdx/utility/compare/TemplateRegexMatcher.java
- Add API to return licenses as a map in addition to list HOT 1
- Method `LicenseInfoFactory::getListedLicenseById` never returns `null` HOT 1
- Official MIT license text is not recognized as MIT by LicenseCompareHelper.isStandardLicenseWithinText HOT 4
- StringIndexOutOfBoundsException exception from org.spdx.utility.compare.TemplateRegexMatcher.findTemplateWithinText() HOT 1
- Column in DifferenceDescription returned by LicenseCompareHelper.isTextMatchingTemplate is not correct HOT 1
- TemplateRegexMatcher.getStartRegex sometimes returns a regex that matches with an index before the license start HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from spdx-java-library.