
cavil's Introduction


Cavil is a legal review and Software Bill of Materials (SBOM) system for the Open Build Service. It is used in the development of openSUSE Tumbleweed, openSUSE Leap, as well as SUSE Linux Enterprise.

Features

  • Source code legal review system for RPMs, Tarballs, Kiwi images, Docker images, and Helm charts
  • High performance source code scanner with support for recursively decompressing almost any archive format
  • 25,000 curated patterns for 1000 licenses with 500 distinct SPDX expressions
  • Software Bill of Materials (SBOM) support with SPDX 2.2 reports
  • Legal risk assessments by lawyers for every pattern match
  • Human reviews with approval/rejection workflow, and optional automatic approvals based on risk
  • Optional support for machine learning models to classify pattern matches
  • REST API for integration into existing source code management systems
  • Open Build Service integration via bots
  • OpenID Connect (OAuth 2.0) authentication

Important: Note that most of the data used by Cavil has been curated by lawyers, but the generated reports do not count as legal advice and no guarantees are made for their correctness!


Components

This distribution contains the two main components of the system: a Mojolicious web application that lawyers can use to efficiently review package contents, and Minion background jobs that process and index packages to create easy-to-digest license reports.

Additionally, this distribution includes a large curated set of license patterns created by the SUSE lawyers. Currently this set consists of over 20000 patterns for all known Open Source licenses.

The easiest way to connect OBS to Cavil is the legal-auto.py bot from the openSUSE Release Tools repository. But you can also upload tarballs directly for analysis.

Getting Started

The easiest way to get started with Cavil is the included staging scripts for setting up a quick development environment. All you need is an empty PostgreSQL database (with the pgcrypto extension activated) and the following dependencies:

$ sudo zypper in -C postgresql-server postgresql-contrib 'rubygem(sass)'
$ sudo zypper in -C perl-Mojolicious perl-Mojolicious-Plugin-Webpack \
  perl-Mojo-Pg perl-Minion perl-File-Unpack perl-Cpanel-JSON-XS \
  perl-Spooky-Patterns-XS perl-Mojolicious-Plugin-OAuth2 perl-Mojo-JWT \
  perl-BSD-Resource perl-Term-ProgressBar perl-Text-Glob
$ npm i
$ npm run build

Then use these commands to set up and tear down a development environment:

$ perl staging/start.pl postgresql://tester:testing@/test
...
$ CAVIL_CONF=staging/do_not_commit/cavil.conf morbo script/cavil
...
$ CAVIL_CONF=staging/do_not_commit/cavil.conf script/cavil minion worker
...
$ perl staging/stop.pl
...

The morbo development web server will make the web application available under http://127.0.0.1:3000. And script/cavil minion worker will start the job queue for processing background jobs.

Ongoing Maintenance

To keep your reports and checksums fresh even after new license patterns have been added or updated, we recommend running script/cavil rindex at regular intervals (we do it every weekend). To free up space, you can also run script/cavil cleanup at regular intervals. Reports that are organized into products are excluded from cleanup.


cavil's Issues

Optimize daily cleanup

Our daily cleanup background jobs take much longer than they should, wasting a lot of resources. There are probably many ways to improve that significantly.

tree navigation of parsed licenses

The cavil view should have a cascaded view of dependencies instead of just a flat-view. So instead of just having a flat list of files, I should be able to navigate the tree and have a subview of the licenses in that tree only.

Position dropdown menu for managing patterns dynamically

Currently the dropdown menu for managing patterns in the review ui always appears right below the button. This can be problematic if the pattern is at the end of the report. So we should probably position the menu dynamically above or below the button depending on where we have the most space.
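The placement rule described above can be sketched in a few lines. This is only an illustration of the decision logic, written in Python for brevity; the function name and the pixel-based parameters are hypothetical and not taken from the Cavil UI code.

```python
def menu_placement(button_top: float, button_bottom: float,
                   viewport_height: float) -> str:
    """Pick the side of the button that has more free space.

    All values are pixel offsets measured from the top of the viewport
    (hypothetical inputs, e.g. taken from a bounding rectangle).
    """
    space_above = button_top
    space_below = viewport_height - button_bottom
    # Prefer the current behaviour (below) when both sides are equal.
    return "below" if space_below >= space_above else "above"
```

A button near the end of the report would then open its menu upwards, while one near the top keeps the current behaviour.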


Zstandard compression support

Cavil can't currently unpack .zst files because File::Unpack lacks support for the format. We have started to see this format being used in OBS though, for packages like trivy.

trademark logo scan

Is Cavil able to scan for trademarked logos and flag them for manual review? Trademark usage is difficult because fair use could apply, but I think we are currently not highlighting those at all.

Inconsistent risk assessments

Some named licenses have multiple conflicting risk assessments for various patterns:

Apache-1.1: 3, 4
Apache-2.0: 2, 1, 3
Apache-2.0 AND CC-BY-SA-4.0: 3, 2
Apache-2.0 OR Artistic-2.0: 3, 2
Apache-2.0 OR BSD-3-Clause: 2, 1
Apache-2.0 OR GPL-2.0: 3, 2
Apache-2.0 OR GPL-2.0+: 3, 2
Apache-2.0 OR MIT: 3, 1
Apache-2.0 WITH LLVM-exception: 3, 2
...

This needs to be cleaned up once we have gotten a normalised list back from the lawyers. And perhaps it would be a good idea to dedicate a new cli command to license pattern maintenance.

Be aware: Cases like Any Proprietary: 5, 3, 1, 4 need to have patterns with different risk assessments, since they don't represent one specific named license.
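A check for such conflicts could look roughly like the following sketch. It assumes the patterns are available as (license name, risk) pairs; the sample rows and the exemption set are hypothetical stand-ins for data that would really come from the pattern database.

```python
from collections import defaultdict

# Hypothetical sample rows of (license name, risk).
PATTERNS = [
    ("Apache-2.0", 2),
    ("Apache-2.0", 1),
    ("Apache-2.0", 3),
    ("MIT", 1),
    ("MIT", 1),
    ("Any Proprietary", 5),
    ("Any Proprietary", 3),
]

# Catch-all names that legitimately carry several risk levels.
EXEMPT = {"Any Proprietary"}


def conflicting_risks(patterns, exempt=EXEMPT):
    """Return {license: sorted risks} for licenses with more than one risk."""
    risks = defaultdict(set)
    for name, risk in patterns:
        risks[name].add(risk)
    return {
        name: sorted(values)
        for name, values in risks.items()
        if len(values) > 1 and name not in exempt
    }


if __name__ == "__main__":
    for name, values in sorted(conflicting_risks(PATTERNS).items()):
        print(f"{name}: {', '.join(map(str, values))}")
```

A dedicated maintenance command could print this list so that every entry can be checked against the normalised list from the lawyers.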

Add UI for removing globs again

Currently we only have a UI for adding globs, but not one to remove them again.
There should probably be a simple table view with delete button for admin users.

Map licenses to SPDX identifiers

Much of our license pattern data predates the existence of SPDX, so we rely on mostly arbitrarily chosen identifiers. Recently there has been growing interest in reports that also include SPDX identifiers. This has many advantages, such as the ability to exchange reports in standard formats with tools like Fossology, which in turn would also allow us to cooperate more with open source projects like OSSelot (see #64).

Review correction ui

Currently it is very hard to audit and possibly correct any already finished reviews. We could probably expand the file viewer to show all pattern matches for the whole file, including ignores and use a pull down menu with correction options, such as the removal of an ignore pattern.

UI for reviewing ML classification

Currently we have to look at PostgreSQL directly to review ML classification results. This has become rather tedious with an increasing number of classification failures. We are going to need a proper UI for reviewing results that can also be used to create new training data to reduce the number of failures again.

What I'd like to see is a long list of license snippets with ML assessment results and two buttons for a human to press, green and red.

LegalDB report should use license definitions acceptable by obs-service-format_spec_file

Hello

Below is an example copy-paste where, rather than GPL-2.0+, we should use GPL-2.0-or-later and similar.

It has happened to me a few times that we accepted changes to a devel project, but they then failed to build in Factory, where we have strict rpmlint checking. I can't recall what the license was, but the mistake was that I copy-pasted the license text from Cavil and didn't cross-check it against https://github.com/openSUSE/obs-service-format_spec_file/blob/master/licenses_changes.txt, which I now do since this issue occurred.
So, could we only use licenses and exceptions that are acceptable to/listed by obs-service-format_spec_file?

GPL-2.0+ OR MIT: [1 files] ...
GPL-2.0+ WITH Autoconf-exception-3.0: ...
GPL-2.0+ WITH Libtool-exception: ...
GPL-3.0+ WITH Autoconf-Exception-3.0 ...

I understand that this might be challenging, as I've seen a report that referenced an older version of a license than the one we had in obs-service-format_spec_file. Perhaps such exceptions could be colorized, or similar, to warn the reviewer.
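As a rough illustration of the requested rewrite, the sketch below converts the deprecated "+" suffix on GNU license identifiers into the "-or-later" form. It is a minimal example under a narrow assumption: only the GPL, LGPL and AGPL families are handled, and the authoritative mapping would still be licenses_changes.txt. The function name is hypothetical.

```python
import re

# Matches GPL-2.0+, LGPL-2.1+, AGPL-3.0+ and similar; other licenses are left alone.
_GNU_PLUS = re.compile(r"\b((?:A|L)?GPL-\d+(?:\.\d+)?)\+")


def modernize(expression: str) -> str:
    """Rewrite 'X+' GNU identifiers in a license expression to 'X-or-later'."""
    return _GNU_PLUS.sub(r"\1-or-later", expression)


if __name__ == "__main__":
    for line in ("GPL-2.0+ OR MIT", "GPL-2.0+ WITH Libtool-exception"):
        print(modernize(line))
```

Exception names and identifiers without the "+" suffix pass through unchanged, so anything the sketch does not recognise would still need the warning colour suggested above.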

Show the original string if SPDX parsing fails

For spec files not coming from openSUSE, Cavil makes it hard to review the RPM license because it enforces SPDX. If SPDX parsing fails, it should show the original RPM license string together with a note that the SPDX mapping failed.

Ignore snippet everywhere does not work


It appears the "Ignore snippet everywhere" feature does not currently work, and results in a 400 response. Ignore snippet for package seems fine though, which is probably why it has not been noticed earlier.

Support license incompatibilities

Not all Open Source licenses are compatible with each other. It would be nice if Cavil could highlight known incompatibilities in the report. Perhaps with a UI for incompatibility management.

Inconsistent patterns without license

We have 461 patterns without a license, and only 48 of them are keywords with risk 9. 248 have a risk assessment of 0, suggesting they were used as a hack-ish version of ignore patterns before the real feature existed. An unknown but not insignificant number also contain actual license text and seem to have accidentally not been assigned a license name.

We should find out which of these patterns are not in use for current Factory packages and remove all that have become obsolete.

Support LicenseRef- prefix in specfiles

The nmap package in Factory has started using the license LicenseRef-NPSL-0.93 and Cavil currently thinks that is not valid SPDX. But it is spec compliant, while our SUSE-* prefix is not. So we should at least support LicenseRef-* in addition to SUSE-*.
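A validity check that accepts both prefixes might look like this sketch. The character set after LicenseRef- follows the SPDX idstring rule (letters, digits, "." and "-"); applying the same rule to the SUSE- prefix is an assumption made here for illustration, and the function name is hypothetical.

```python
import re

# SPDX user-defined license reference: LicenseRef- followed by an idstring.
LICENSE_REF = re.compile(r"LicenseRef-[A-Za-z0-9.-]+")
# Assumption: the legacy SUSE- prefix uses the same character set.
SUSE_PREFIX = re.compile(r"SUSE-[A-Za-z0-9.-]+")


def is_custom_identifier(identifier: str) -> bool:
    """Return True for LicenseRef-* and (legacy) SUSE-* identifiers."""
    return bool(
        LICENSE_REF.fullmatch(identifier) or SUSE_PREFIX.fullmatch(identifier)
    )


if __name__ == "__main__":
    print(is_custom_identifier("LicenseRef-NPSL-0.93"))
```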

Don't scan the code itself

Things like "throw IllegalArgumentException();" in code trigger Risk 9 because the string "legal" is part of the exception name. Comments that state what is "legal" or "illegal" as an input for a function do the same. Would there be a way to make the scanning smarter and avoid a load of false positives?

OBS import race condition

The recent Factory change to accept every request after 2 hours (and to obsolete the legal review in progress) means that we now have many repeated imports of the same sources from OBS (first factory review, then product sync). That seems to result in source checkouts getting lost completely sometimes, causing an empty legal report.

The problem probably existed for a long time, but the recent change made it a much more frequent occurrence.

Port the UI to Vue.js

Performance issues are becoming more common with our current UI. Especially the AJAX driven tables are a big problem once the data sets reach a certain size. We've learned a lot about how to make better performing UIs with Vue.js for the QEM Dashboard. Those lessons should be applied to Cavil as well.

One click UI for creating new patterns

From the report it should be easier to create new license patterns. For many keyword matches Cavil already has a good estimation for what the license pattern metadata will look like. Here it should be much easier to create the pattern without leaving the report UI. Perhaps we could even do something like GitHub reviews, where multiple patterns can be created from the report UI and submitted together as a batch.

Race condition around unpack job locks

This seems to be pretty rare, but recently we've seen a job history like this where a lock was active that should not have been, preventing the package from being unpacked.

    id    |    task     |  state   |                   result                    |            created            
----------+-------------+----------+---------------------------------------------+-------------------------------
 91476425 | obs_import  | finished | null                                        | 2021-12-09 00:09:02.835544+01
 91476428 | unpack      | finished | "Package 281044 is already being processed" | 2021-12-09 00:09:21.143415+01
 91487879 | index_later | finished | null                                        | 2021-12-10 20:00:02.032591+01

Error-0:Yv6G - pocl empty on checkout works on reimport

Context: Leap has a backlog of roughly 60 or more requests that have not been reviewed for over 8 days.

Sebastian Riedl identified this particular issue: there were two requests for pocl in the backlog, and one has an Error-0:Yv6G, which means it was empty when checked out from OBS.

The legal report should not be empty, judging by:

$ curl https://api.opensuse.org/public/source/science/pocl?rev=05f1e68e1f6817c7e6c5391f8eac871e
<directory name="pocl" rev="05f1e68e1f6817c7e6c5391f8eac871e" srcmd5="05f1e68e1f6817c7e6c5391f8eac871e">
  <linkinfo project="openSUSE:Factory" package="pocl" srcmd5="c566cacae98e515d0e0c93647593f951" baserev="c566cacae98e515d0e0c93647593f951" lsrcmd5="c492a9be0daa12e8a8e737d7959356f9"/>
  <entry name="link_against_libclang-cpp_so.patch" md5="fb3145931e75c3a11f764f22e68425cf" size="553" mtime="1608907150"/>
  <entry name="pocl-3.0.tar.gz" md5="bd79db59fa31e38759296849291210a3" size="1722809" mtime="1662482194"/>
  <entry name="pocl-rpmlintrc" md5="a8031c13cb3a4cb232bed0fd7f42dd4e" size="45" mtime="1662487601"/>
  <entry name="pocl.changes" md5="0c2660588be38db939d7f04cf1c5ec7b" size="16917" mtime="1667384010"/>
  <entry name="pocl.spec" md5="67906134b152d59a355044292367b6ce" size="4377" mtime="1667388077"/>
</directory>

Works fine on manual reimport

Make priorities more visible for open reviews

Currently we only list priorities as part of the link, which is easily overlooked. It should probably be a separate table column with some kind of colour highlighting for high priority reviews.

Problems with File::Unpack

ldig@legaldb:~/cavil> ./script/cavil minion job -f 28673440
T: 445 files ...
Deep recursion on subroutine "File::Unpack::unpack" at /usr/lib/perl5/vendor_perl/5.18.2/File/Unpack.pm line 1170.
unpack('/data/auto-co/legal-bot/gcc46/5c4638b8b35ffd2d07223f6844e9f64e/.unpacked/gcc-4.6.2-20111212/libgo/go/archive/zip/testdata/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r','/data/auto-co/legal-bot/gcc46/5c4638b8b35ffd2d07223f6844e9f64e/.unpacked/gcc-4.6.2-20111212/libgo/go/archive/zip/testdata/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r/r'): recursion limit 200 at /home/ldig/cavil/script/../lib/Cavil/Checkout.pm line 154.
[2019-07-04 06:52:47.58874] [8318] [info] [27411] Unpacked /data/auto-co/legal-bot/gcc46/5c4638b8b35ffd2d07223f6844e9f64e

Allow filtering open reviews by minimum priority

Our production review backlog tends to be rather large because of many automatically imported priority 1 (low) openSUSE:Factory packages. It would help to be able to filter those out with a minimum priority setting in the ui, probably defaulting to priority 2 and above.

RFE: Speeding up license correction

Hello team!

I'm writing from the position of a person fast-tracking Leap legal reviews in my spare time.
As part of my reviews, if I see a package where the list of licenses doesn't match the spec file (which is basically every second review of community packages), I typically want to submit the correct license right away. You typically want to send the SR to the development project where the package is developed, and then get the change to Factory and finally Leap.

I typically start with osc bco $some:Devel:project $package ... followed by commit && sr once the license in spec is tweaked.

For that, I typically click on the SR (in my case the Leap SR), which takes me to the OBS Leap submission. From there I click on the source (typically openSUSE:Factory), where I can see which devel project the package is developed in.

It would help me speed up such corrections if I could see "developed in (link)$project(/link)" right in the Cavil web UI.
The ideal would be "To checkout package osc bco $some:Devel:project $package". Perhaps that's too much detail for applications outside of OBS (not our case), but it would be quite nice.

It might sound like not a big deal, but we have a queue of 100 plus packages, and a good 50 will have incorrect/partial license tags in the spec file. And that is a lot of clicking.

Bring back ordering for ui tables

With the switch to server-side pagination we've lost support for ordering in all ui tables. It's a bit tricky to reimplement, but we should bring it back at least in places where users have requested it.

  • Order packages by state for /products/*

Flagging changes authored by AI

Hello team,

SUSE is currently running a pilot of Github Copilot https://mysuse.sharepoint.com/sites/github-copilot-pilot/SitePages/Introduction.aspx

So far it is a pilot aware of "AI Pair programmer" https://opensource.suse.com/legal/policy and none of the code will make it to the SUSE product.

However, since SUSE Legal doesn't review any of the openSUSE legal reviews, I'd like to make sure that we do not automatically fast-forward requests containing such changes without somebody actually looking into them.

Keeping this open as a high level tracker.

handle OCI container license labelling

OCI containers can have a license declaration as part of the container metadata:

https://github.com/opencontainers/image-spec/blob/master/annotations.md#pre-defined-annotation-keys

Kiwi and podman/docker support setting those labels at build time. An example for Kiwi:

     <containerconfig
        name="my-container"
        tag="latest"
        additionaltags="1.0.0.%RELEASE%">
         <labels>
          <!-- See https://en.opensuse.org/Building_derived_containers#Labels -->
          <suse_label_helper:add_prefix prefix="org.example.container">
  ...
            <!-- Select a correct license from https://github.com/openSUSE/spec-cleaner#spdx-licenses -->
            <label name="org.opencontainers.image.licenses" value="SUSE-Permissive"/>
          </suse_label_helper:add_prefix>
        </labels>
        <history author="Fabian Vogt &lt;[email protected]&gt;">Derive the image</history>
       </containerconfig>

It would be good if the legal-auto bot checked that this label is set and that it is accurate (the latter is harder).

RFE: Sharing and Re-using OSS Compliance information

Hello

This is just a quick thought from today's OpenChain webinar by Caren Kresse about OSSelot: The Open Source Curation Database

Project site: See https://osselot.org/
Videos: https://www.osselot.org/index.php?s=videos

Could we extend or reuse existing analysed data as part of our legal review process?
https://github.com/Open-Source-Compliance

Seems like the process utilizes Fossology for the scan.

Data:
https://github.com/Open-Source-Compliance/package-analysis/tree/main/analysed-packages

The DB grows every day, and it seems to be a way to get an extra curator (Oliver reviews PRs).

Encoding error when generating SPDX reports

Dec 04 13:13:02 legaldb cavil[1636]: [1636] [e] Non-existing path in SPDX report 329096: /data/auto-co/legal-bot/java-11-openjdk/2a9e351679e9f9d5f078110b24744813/.unpacked/openjdk/test/jdk/sun/misc/URLClassPath/testclasses/+ª-ë-ï+Ñ-å-î.class
Dec 04 13:13:05 legaldb cavil[1636]: [1636] [e] Non-existing path in SPDX report 329096: /data/auto-co/legal-bot/java-11-openjdk/2a9e351679e9f9d5f078110b24744813/.unpacked/openjdk/test/jdk/sun/security/tools/jarsigner/JarSigning_RU/New/ðñð©ÐêðÁÐÇ/English
Dec 04 13:13:05 legaldb cavil[1636]: [1636] [e] Non-existing path in SPDX report 329096: /data/auto-co/legal-bot/java-11-openjdk/2a9e351679e9f9d5f078110b24744813/.unpacked/openjdk/test/jdk/sun/security/tools/jarsigner/JarSigning_RU/New/ðñð©ÐêðÁÐÇ/ðáÐâÐüÐüð¦ð©ð¦
Dec 04 13:13:05 legaldb cavil[1636]: [1636] [e] Non-existing path in SPDX report 329096: /data/auto-co/legal-bot/java-11-openjdk/2a9e351679e9f9d5f078110b24744813/.unpacked/openjdk/test/jdk/tools/launcher/UnicodeTest/ClassAϺ+äϦϦϿ+èÏ®õ©¡µûçõ©¡µûçÓñ¦Óñ+ÓñéÓñªÓÑÇÎóÎæοÎÖάµùѵ£¼Þ¬×Ýò£ÛÁ¡ýû¦espa+¦olÓ¦äÓ©ùÓ©ó.class

It seems we have a file system path encoding problem somewhere between File::Unpack2 and Cavil. For now I've added a workaround so that such files no longer prevent SPDX report generation. But of course this will need to be fixed.

Handle arbitrary tarballs

There's currently a hack that pretends uploaded tarballs are actually RPM packages. It kinda works for testing stuff, but we should at some point make that a real feature and fully support arbitrary tarballs.

handle dockerfile and helm Chart.yaml package containers

When a package only consists of a Dockerfile or a Chart.yaml (which refers to a Helm chart), Cavil fails the review with an error message.

It would be good if we continued to do source tarball evaluation instead (e.g. scan all the tar files in that package), so that we get something that is a valid report for legal.

Inconsistent license capitalisation

We have some duplicate license names with different capitalisation. Like Any permissive and Any Permissive, which are considered different licenses. That should probably not be the case.
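Finding such duplicates is a matter of grouping names by their case-folded form. The following is a minimal sketch, assuming the license names are available as a plain list; the function name is hypothetical.

```python
from collections import defaultdict


def case_duplicates(names):
    """Group license names that differ only in capitalisation."""
    groups = defaultdict(set)
    for name in names:
        groups[name.casefold()].add(name)
    # Only keep groups with more than one spelling.
    return [sorted(variants) for variants in groups.values() if len(variants) > 1]


if __name__ == "__main__":
    print(case_duplicates(["Any permissive", "Any Permissive", "MIT"]))
```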

Difficult to navigate to sources from report with nested archives

A report for an .obscpio archive that contains other archives currently looks like this:

MPL-Unspecified: 3 files

node_modules.obscpio._/package._1281/index.js
node_modules.obscpio._/package._1282/index.js
node_modules.obscpio._/package._943/node_modules/spdx-correct/index.js

It would be a lot more helpful if the output included the names of the inner archives in the filenames. Even if the names were filtered down to a limited character set such as [0-9a-zA-Z_+\-\.] (think XSS), it would be a lot more helpful than the current format.

MPL-Unspecified: 3 files

node_modules.obscpio._/package._1281.some_program_5.4.tgz/index.js
node_modules.obscpio._/package._1282.another_program_1.4.tgz/index.js
node_modules.obscpio._/package._943.magics_23.tgz/node_modules/spdx-correct/index.js
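The proposed naming can be sketched as follows, using the character set suggested above. The function and parameter names are hypothetical, and how the unpacker actually names its directories is not something this example claims to know.

```python
import re

# Anything outside the allowed set [0-9a-zA-Z_+\-.] is dropped.
_UNSAFE = re.compile(r"[^0-9a-zA-Z_+\-.]")


def labelled_dir(unpack_dir: str, archive_name: str) -> str:
    """Append a sanitized copy of the inner archive name to the unpack directory."""
    safe = _UNSAFE.sub("", archive_name)
    return f"{unpack_dir}.{safe}" if safe else unpack_dir


if __name__ == "__main__":
    print(labelled_dir("package._1281", "some_program_5.4.tgz"))
```

Dropping unsafe characters instead of escaping them keeps the result short; the numeric suffix already guarantees uniqueness, so collisions between sanitized names do not matter.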

Prevent obs_import race condition

Reported by @coolo. If the same package is requested multiple times in quick succession it might be possible to create the same obs_import job multiple times, resulting in a race condition.
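One common way to avoid this kind of race is to make job creation idempotent per source revision. The sketch below shows the idea with an in-memory queue in Python; it is not Cavil or Minion code, and the class, the key fields (project, package, srcmd5) and the method names are all hypothetical.

```python
import threading


class ImportQueue:
    """Hands out one job id per (project, package, srcmd5) while it is pending."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}
        self._next_id = 1

    def enqueue(self, project: str, package: str, srcmd5: str) -> int:
        key = (project, package, srcmd5)
        # The check and the insert happen under one lock, so two callers
        # racing on the same key cannot both create a job.
        with self._lock:
            if key in self._pending:
                return self._pending[key]
            job_id = self._next_id
            self._next_id += 1
            self._pending[key] = job_id
            return job_id

    def finish(self, project: str, package: str, srcmd5: str) -> None:
        with self._lock:
            self._pending.pop((project, package, srcmd5), None)
```

In a real deployment the same check-and-insert would have to be atomic in the shared job backend rather than in process memory.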

Extend the import mechanism with git support

Currently we are very focused on importing packages from OBS. To support new ALP workflows and to make Cavil easier to use for the community, we should implement native git (GitHub) support.

Full test coverage for the main review process

#76 has shown that our test coverage is still not great. We should make sure that at least everything needed for the normal review process is covered. A good first step would probably be the addition of coverage metrics.

Carwos project "API"

As discussed over an email.

Background: Our CI legal pipeline that produces license report for
our customer depends on retrieving JSON from legaldb.suse.de and it has
been changing recently.

We rely on the following interfaces (accessed with a header containing the "carwos" token and "Accept: application/json"):

url = f"http://legaldb.suse.de/package/{report_id}"
url = f"http://legaldb.suse.de/reviews/calc_report/{report_id}"
url = f"http://legaldb.suse.de/reviews/fetch_source/{file_id}"
url = "http://legaldb.suse.de/packages?" + urlencode(info)
(the report_id e.g. 232522, file_id is e.g. 7115234143 etc.)

We can fix the query at our end (e.g. to url = f"http://legaldb.suse.de/reviews/calc_report/{report_id}.json") but we would like to ensure that an interface defined this way does not change too often or disappear completely.
