Git Product home page Git Product logo

Comments (27)

goneall avatar goneall commented on September 27, 2024 1

Am I right in thinking that this enhancement might be possible solely via internal implementation changes within the org.spdx.storage.listedlicense.SpdxListedLicenseWebStore.getUrlInputStream() method?

I think that should work. I do want to make this optional with the default being uncached - there are some scenerios where having a cache could introduce security vulnerabilities.

from spdx-java-library.

goneall avatar goneall commented on September 27, 2024 1

@pmonks - I think your implementation approach should work, so feel free to create a PR.

We should also add documentation to the README on the caching and how to enable / disable it.

from spdx-java-library.

goneall avatar goneall commented on September 27, 2024 1

@pmonks - sorry for the late reply - really busy day yesterday.

I'll take a look at the code later today - but I do want to point you to some existing code to handle properties you can leverage:

listedLicenseModificationLock.writeLock().lock();

It solves a similar problem where we have a property to force using a local copy of the licenses - it was requested by someone who wanted to use the library totally offline (air gap security).

I forgot about that implementation until you raised the questions above.

from spdx-java-library.

goneall avatar goneall commented on September 27, 2024 1

BTW - I looked into the previously mentioned properties file implementation for the offline license model store. It turns out that it does accept system property SPDXParser.OnlyUseLocalLicenses which takes precedence over the property file. Here's the code that implements the system property: https://github.com/spdx/Spdx-Java-Library/blob/ae14e246866472e7a356ba934f28efcda61b5226/src/main/java/org/spdx/library/model/license/ListedLicenses.java#L65C3-L65C3

It's been a while, but I think the reason we used both approach is to allow someone to make a change in the local properties file and not have to always specify a system property when running.

I should probably document this in the README similar to how you documented the caching properties.

from spdx-java-library.

goneall avatar goneall commented on September 27, 2024 1

Completely agree with refactoring out the config into a separate class. Your proposed naming sounds good.

from spdx-java-library.

goneall avatar goneall commented on September 27, 2024 1

All good points - and I agree. If you can create a PR, I'll take one more review before merging.

from spdx-java-library.

goneall avatar goneall commented on September 27, 2024

Thanks @pmonks for the suggested solution. I've also run into this performance problem and would welcome a solution.

If we could make the opt-in version a superclass of the existing webstore and just add caching it would be a pretty straight-forward implementation - we would need to investigate a bit more to see if this approach would work.

Once implemented, we could add an option or environment variable for tools-java to use the cache - it would speed up the command line utility quite a bit.

from spdx-java-library.

pmonks avatar pmonks commented on September 27, 2024

we would need to investigate a bit more to see if this approach would work.

FWIW I implemented exactly this kind of caching before I discovered Spdx-Java-Library, and it did result in a pretty significant speed up. And that was despite employing some optimised download tricks (like using svn to get all of the license template files from GitHub - that's faster than using one-at-a-time HTTP requests). That said, I was actually caching downloaded-and-parsed JSON files, not just the raw downloaded JSON files themselves (so saving both download cost and parse cost for each cache hit) - in my testing parse cost was low compared to download cost, but then I was using completely different libraries to Spdx-Java-Library for parsing so I'm not sure if there's much to be learned from that.

Am I right in thinking that this enhancement might be possible solely via internal implementation changes within the org.spdx.storage.listedlicense.SpdxListedLicenseWebStore.getUrlInputStream() method? It seems like all of the various file downloads ultimately get funneled through that method, so that might be the right place to add this kind of functionality.

from spdx-java-library.

pmonks avatar pmonks commented on September 27, 2024

Oh yeah I agree with making it optional - was more just trying to understand the effort involved in implementing this. If it's just internal implementation of a single method I can probably give it a go, despite my very rusty Java.

from spdx-java-library.

pmonks avatar pmonks commented on September 27, 2024

So before I launch into this, I wanted to discuss some micro-design questions with you @goneall:

  1. Are you ok if a Java property is used to enable/disable this cache? That's easier for downstream code to control than, say, an OS-level environment variable, while still being easy to adjust by a human operator at JVM startup time (via the -D command line argument provided by the java executable).
  2. Any preferences on a name for that Java property? Perhaps something like org.spdx.storage.listedlicense.SpdxListedLicenseWebStore.enableCache that's a boolean value (and defaults to false if unset)?
  3. In order to preserve the signature of org.spdx.storage.listedlicense.SpdxListedLicenseWebStore.getUrlInputStream(), cached objects will need to be stored with a "key" that is a 1:1 mapping of the URL parameter that's passed in. It can't be the URL itself however, as that may contain characters that are not allowed in filenames. String hashcodes are also insufficient for this (Java's implementation of String hashing is quite "collisiony"), so my thought would be a BASE64 representation of the URL. For example, a request to https://spdx.org/licenses/licenses.json would create a cached file on disk called aHR0cHM6Ly9zcGR4Lm9yZy9saWNlbnNlcy9saWNlbnNlcy5qc29u. Thoughts?
  4. To support ETag requests, there'll also need to be a separate "metadata" file alongside the content file itself. Are you ok if this file is in JSON format, and has the same name as the cached content file, but with the additional suffix .metadata.json? So for example when https://spdx.org/licenses/licenses.json is cached, there would also be a aHR0cHM6Ly9zcGR4Lm9yZy9saWNlbnNlcy9saWNlbnNlcy5qc29u.metadata.json file in the same directory, containing the ETag value, the date/time of the download, and the source URI (these latter two data elements are optional, but in my experience are very useful in troubleshooting cache bugs).
  5. Any thoughts on caching the results of parsing these downloaded files too, not just the raw downloaded content? I have no idea if the object graph created out of the downloaded JSON files is Serializable or not, but if it is, that might be an additional performance win. We could also punt this to a later PR and see how much improvement we get just from caching downloads alone (I expect it will be quite substantial). I suspect this is a more intrusive change than just caching downloads, from my rudimentary knowledge of the code paths, so I would probably lean towards punting.

from spdx-java-library.

pmonks avatar pmonks commented on September 27, 2024

@goneall I went ahead with a first cut at this, based on my thoughts on each of the above 5 points - you can find it here. Would love any/all comments you might like to share!

You can test this branch out with the command:

$ mvn -Dorg.spdx.storage.listedlicense.SpdxListedLicenseWebStore.enableCache=true test -Ptest

from spdx-java-library.

pmonks avatar pmonks commented on September 27, 2024

sorry for the late reply - really busy day yesterday.

@goneall absolutely no problem at all! None of this work is urgent on my end, so please don't feel compelled to even reply until you have an appropriate time to look at this!

It solves a similar problem where we have a property to force using a local copy of the licenses - it was requested by someone who wanted to use the library totally offline (air gap security).

I might be misunderstanding, but that sounds like a different use case to mine. I do want to use the online versions of the various files (rather than the ones baked into the library's JAR file), but I want them cached locally on disk so that the download cost is only incurred once per SPDX license list version (rather than once per JVM invocation, as is currently the case).

Separately (and unrelated to this branch), I've noticed that some URLs are downloaded repeatedly during the unit tests. Does that suggest there's a bug in the in-memory storage of this downloaded content (i.e. something doesn't realise that that content is already downloaded, and therefore downloads it again)? It seems to mostly (only?) be the JSON files for individual licenses and exception (e.g. https://spdx.org/licenses/Apache-2.0.json is downloaded several times, redundantly). While this new caching logic mitigates that bug, I'm pretty certain there would be a benefit to fixing it in the non-caching code paths too. Perhaps I should raise this in a separate issue?

from spdx-java-library.

goneall avatar goneall commented on September 27, 2024

I might be misunderstanding, but that sounds like a different use case to mine.

Agree - different use case, but I think you can use the same infrastructure for properties since they both involve how we store and access the license information - e.g., add an additional property to the property file.

Separately (and unrelated to this branch), I've noticed that some URLs are downloaded repeatedly during the unit tests.

I don't think the individual JSON files are cached, so it may re-fetch every time. I'm not sure if I could call it a bug - but it certainly could be optimized.

from spdx-java-library.

pmonks avatar pmonks commented on September 27, 2024

Ah ok - I'll check that code out. That said, property files are difficult to manipulate from downstream code that wants to programmatically set these behaviours (rather than relying on a human operator to do so), so that rings little alarm bells for me.

A more typical way to do this is to use Java properties, or (better yet) provide "init" type methods (or even constructor args) that accept these options - those are also a lot more amenable for use in DI frameworks (such as Spring).

from spdx-java-library.

pmonks avatar pmonks commented on September 27, 2024

Latest update adds a "Configuration options" section to the readme, and also enables control over how often each cache entry gets rechecked for staleness (via an ETag request).

from spdx-java-library.

goneall avatar goneall commented on September 27, 2024

@pmonks Thanks for pulling together the caching implementation.

I took a look at the proposed changes and the overall approach looks good.

I have a couple of suggestions:

  • If the XDG_CACHE property isn't there, have the default directory be more specific to SPDX to avoid possible conflicts with other files (e.g. change System.getProperty("user.home") + "/.cache" to System.getProperty("user.home") + "/.spdx-license-cache"
  • If there is any failure in setting up the caches, log a warning and revert to the non-cached - e.g. in the exception handler for creating the directories, set cacheEnabled to false and log the warning)
  • We can shorten the property name from org.spdx.storage.listedlicense.SpdxListedLicenseWebStore.enableCache to org.spdx.storage.listedlicense.enableCache

I'm not sure if we need to add any locks around the caching - the class should support multi-threading and we are accessing some share metadata files, but I can't think of any scenarios where we would have a damaging race condition.

If we want to be safe and add locking, we can use the readLock and writeLock from the superclass SpdxListedLicenseModelStore.

from spdx-java-library.

pmonks avatar pmonks commented on September 27, 2024

If the XDG_CACHE property isn't there, have the default directory be more specific to SPDX to avoid possible conflicts with other files (e.g. change System.getProperty("user.home") + "/.cache" to System.getProperty("user.home") + "/.spdx-license-cache"

The behaviour I've implemented is faithful to the XDG Base Directory Specification, which says that if the XDG_CACHE_HOME environment variable is not set, applications should default to ${HOME}/.cache/<appropriate subdirectory name specific to that app>. I personally would rather we adhered to that specification (all of it), rather than surprise users by doing something bespoke. Wdyt?

If there is any failure in setting up the caches, log a warning and revert to the non-cached - e.g. in the exception handler for creating the directories, set cacheEnabled to false and log the warning)

✅ - I'll make that change now.

We can shorten the property name from org.spdx.storage.listedlicense.SpdxListedLicenseWebStore.enableCache to org.spdx.storage.listedlicense.enableCache

✅ - I'll make that change now.

Also, interesting trivia from my testing this morning: enabling the cache avoids downloading 613 files totaling ~20MB on each use of the library, which (on my laptop) takes the initialisation time from ~4 minutes down to ~1 second.

from spdx-java-library.

pmonks avatar pmonks commented on September 27, 2024

@goneall the second and third changes are ready for review.

from spdx-java-library.

pmonks avatar pmonks commented on September 27, 2024

BTW - I looked into the previously mentioned properties file implementation for the offline license model store. It turns out that it does accept system property SPDXParser.OnlyUseLocalLicenses which takes precedence over the property file.

Ok cool - I'll go look into that now and try to replicate it. As long as downstream code can programmatically modify the behaviour, I'm good with whatever approach you decide we should go with.

It's been a while, but I think the reason we used both approach is to allow someone to make a change in the local properties file and not have to always specify a system property when running.

Yeah makes sense. Putting this in constructor args / setters and then punting the specifics of how those methods get called right out of the library is probably ideal going forward - that's the most DI-friendly approach, and lets DI users leverage all of the weird and wonderful mechanisms DI frameworks support for setting these kinds of configuration options (which go well beyond just properties files and Java system properties). That's probably a longer term / bigger refactor type of activity though, so I'd probably prefer not to tackle it as part of this cache work.

I should probably document this in the README similar to how you documented the caching properties.

I was wondering whether the GitHub project might benefit from a wiki for this kind of thing? While the README is great, it might become unwieldy as we start adding this kind of information, and a wiki lets us start to structure the information into separate categories / pages (e.g. "usage guide", "configuration guide", etc.).

from spdx-java-library.

goneall avatar goneall commented on September 27, 2024

The behaviour I've implemented is faithful to the XDG Base Directory Specification, which says that if the XDG_CACHE_HOME environment variable is not set, applications should default to ${HOME}/.cache/. I personally would rather we adhered to that specification (all of it), rather than surprise users by doing something bespoke. Wdyt?

Actually - the code there is fine. I missed the enclosing parenthesis around the conditional - I originally thought it was just defaulting to ~/.cache but it is actually defaulting to ~/.cache/Spdx-Java-Library.

I did notice one new issue when looking at this.

Your should change the hard coded file separators to system dependent separators - you can use Java Path or replace the / with File.separator.

from spdx-java-library.

pmonks avatar pmonks commented on September 27, 2024

Your should change the hard coded file separators to system dependent separators - you can use Java Path or replace the / with File.separator.

Good catch, and now fixed! It's clearly been too long since I had to use Windows. ;-)

from spdx-java-library.

pmonks avatar pmonks commented on September 27, 2024

@goneall I've taken a look at how the existing properties configuration logic functions, and think it will become pretty messy to extend it in place to support the WebStore. Instead, I think we should create a separate, dedicated configuration management class that any other classes (org.spdx.library.model.license.ListedLicenses, org.spdx.storage.listedlicense.SpdxListedLicenseWebStore, and anything else that might need configuration in the future) can interrogate to get their configuration. I'd propose org.spdx.Configuration.java for this - thoughts?

from spdx-java-library.

pmonks avatar pmonks commented on September 27, 2024

Ok I think that change is now done too. Let me know if you have any feedback on it! I also merged in the latest changes to master, to make this easier to merge once I create a PR for it.

from spdx-java-library.

goneall avatar goneall commented on September 27, 2024

I'm thinking there may be a cleaner way to handle the confusing different prefixes on the system properties.

In the properties file, we could just use a simple unqualified string (e.g. OnlyUseLocalLicenses) which would be used in the properties file and add a parameter prefix which would be prepended just for the system property only - that way the properties file would be more concise and we would just use different prefixes for the two different calls.

Let me know what you think.

BTW - the old prefix SPDXParser was the name of the java package years ago - we just kept it in the property name for compatibility. It would be nice to change it to match the current code, but I'm afraid it may break some implementations.

from spdx-java-library.

pmonks avatar pmonks commented on September 27, 2024

My sense is that having "smart" logic like that may actually be confusing for an implementer who's looking at someone's configuration i.e. it wouldn't be obvious that OnlyUseLocalLicenses in the properties file is the same as SPDXParser.OnlyUseLocalLicenses (or ideally eventually org.spdx.storage.listedlicense.OnlyUseLocalLicenses) specified on the command line.

Then there's also the possibility of possible conflicts in the file itself. If some new class elsewhere in the library also has a configuration property called OnlyUseLocalLicenses, what would we do then? At least requiring them to be fully qualified in the properties file provides a way to namespace them.

And then there's the bigger question of potentially ditching all of this anyway in favour of a more DI-compatible approach (which wouldn't use system properties or property files at all - that gets externalised out of the library and becomes the sole responsibility of the DI framework). IOW how much do we want to invest in code that might eventually go away anyway?

tl;dr - while I'm ok with keeping that one property special cased (to preserve backwards compatibility), I think going forward we should discourage that kind of thing

from spdx-java-library.

pmonks avatar pmonks commented on September 27, 2024

@goneall ok PR created, after some more refactoring and a lot of polishing of JavaDocs. 😉

Probably the biggest change since you last looked was that I took the liberty of refactoring out the entire DownloadCache into its own class (a bit like the Configuration class) - this makes both the cache logic and the WebStore logic a lot more self-explanatory, and also opens up the possibility of using the DownloadCache for any other classes elsewhere in the library that might retrieve content from URLs (though there aren't any other examples of that at this point, as best I can tell).

from spdx-java-library.

pmonks avatar pmonks commented on September 27, 2024

This was fixed in PRs #205 and #206.

from spdx-java-library.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.