google / licenseclassifier

A License Classifier

License: Apache License 2.0

Go 100.00%

license-management classifier google

licenseclassifier's Introduction

License Classifier

Introduction

The license classifier is a library and set of tools that can analyze text to determine what type of license it contains. It searches for license texts in a file and compares them to an archive of known licenses. These files could be, e.g., LICENSE files containing one or more licenses, or source code files with the license text in a comment.

A "confidence level" is associated with each result indicating how close the match was. A confidence level of 1.0 indicates an exact match, while a confidence level of 0.0 indicates that no license was able to match the text.

Adding a new license

Adding a new license is straightforward:

1. Create a file in licenses/.
   - The filename should be the name of the license or its abbreviation. If the license is an Open Source license, use the appropriate identifier specified at https://spdx.org/licenses/.
   - If the license is the "header" version of the license, append the suffix ".header" to it. See licenses/README.md for more details.
2. Add the license name to the list in license_type.go.
3. Regenerate the licenses.db file by running the license serializer:
```
$ license_serializer -output licenseclassifier/licenses
```
4. Create and run appropriate tests to verify that the license is indeed present.

Tools

Identify license

identify_license is a command line tool that can identify the license(s) within a file.

$ identify_license LICENSE
LICENSE: GPL-2.0 (confidence: 1, offset: 0, extent: 14794)
LICENSE: LGPL-2.1 (confidence: 1, offset: 18366, extent: 23829)
LICENSE: MIT (confidence: 1, offset: 17255, extent: 1059)
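For scripts that consume this output, the line format shown above can be parsed with a regular expression. The following is a minimal sketch that assumes the format stays exactly as in the sample; the `Match` struct and `parseLine` function are illustrative names, not part of the library.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Match holds the fields printed on one identify_license output line.
// The struct and its field names are illustrative, not part of the library.
type Match struct {
	File       string
	License    string
	Confidence float64
	Offset     int
	Extent     int
}

// lineRE mirrors the line format shown in the sample output above.
var lineRE = regexp.MustCompile(`^(.+): (\S+) \(confidence: ([0-9.]+), offset: (\d+), extent: (\d+)\)$`)

// parseLine converts one output line into a Match.
func parseLine(line string) (Match, error) {
	m := lineRE.FindStringSubmatch(line)
	if m == nil {
		return Match{}, fmt.Errorf("unrecognized line: %q", line)
	}
	conf, err := strconv.ParseFloat(m[3], 64)
	if err != nil {
		return Match{}, err
	}
	offset, err := strconv.Atoi(m[4])
	if err != nil {
		return Match{}, err
	}
	extent, err := strconv.Atoi(m[5])
	if err != nil {
		return Match{}, err
	}
	return Match{File: m[1], License: m[2], Confidence: conf, Offset: offset, Extent: extent}, nil
}

func main() {
	m, err := parseLine("LICENSE: MIT (confidence: 1, offset: 17255, extent: 1059)")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", m)
}
```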

License serializer

The license_serializer tool regenerates the licenses.db archive. The archive contains preprocessed license texts for quicker comparisons against unknown texts.

$ license_serializer -output licenseclassifier/licenses

This is not an official Google product (experimental or otherwise), it is just code that happens to be owned by Google.

licenseclassifier's Issues

match offset and extent out of range

Thank you for this amazing library! I'm amazed by its speed and accuracy, this is truly great work!

One problem I noticed:

When I use MultipleMatch and parse the offset and extent, I find them to be inaccurate.
This is understandable if the algorithm is not perfect. However, I found cases where the offset and extent are invalid for the input full text: offset + extent is greater than the length of the full text.

My guess is that the offset and extent do not take the text normalization step into account (licenseclassifier/classifier.go, line 162 in bb04aff):

norm := normalizeText(contents)

So they are in fact the offset and extent within the normalized text, which do not match the original full text.
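One way to illustrate the mismatch, and a possible way around it, is to record an index map while normalizing and use it to translate offsets back. This is only a sketch: `normalizeWithMap` is a simplified stand-in (lowercasing and whitespace collapsing) for the library's real normalization, and both function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalizeWithMap is a simplified stand-in for the library's normalization
// (an assumption; the real step does more). It lowercases the text, collapses
// each whitespace run to one space, and records for every output byte the
// input byte offset it came from. A final sentinel maps to len(s).
func normalizeWithMap(s string) (string, []int) {
	var b strings.Builder
	var idx []int
	inSpace := false
	for i, r := range s {
		if unicode.IsSpace(r) {
			if !inSpace {
				b.WriteByte(' ')
				idx = append(idx, i)
				inSpace = true
			}
			continue
		}
		inSpace = false
		before := b.Len()
		b.WriteRune(unicode.ToLower(r))
		for j := before; j < b.Len(); j++ {
			idx = append(idx, i)
		}
	}
	idx = append(idx, len(s))
	return b.String(), idx
}

// toOriginal converts an (offset, extent) pair measured in the normalized
// text into a pair measured in the original text.
func toOriginal(idx []int, offset, extent int) (int, int) {
	start := idx[offset]
	end := idx[offset+extent]
	return start, end - start
}

func main() {
	original := "MIT   License\n\nPermission  is hereby granted"
	norm, idx := normalizeWithMap(original)

	needle := "permission is"
	offset := strings.Index(norm, needle)
	start, extent := toOriginal(idx, offset, len(needle))

	fmt.Printf("normalized: offset=%d extent=%d\n", offset, len(needle))
	fmt.Printf("original:   offset=%d extent=%d text=%q\n", start, extent, original[start:start+extent])
}
```

In this example the normalized offsets differ from the original ones because whitespace was collapsed before the match, which is the kind of drift the issue describes.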

Can you give more details of how to run the application?

Hi. Can you give more details on how to run the application? README.md seems too sparse.
I tried go run license_serializer.go. It works fine.
But identify_license can't find licenses.db.
I get the error 2023/03/29 20:33:49 cannot create license classifier: cannot register licenses from archive: open licenses.db: file does not exist.
Where does licenses.db need to be located?

Another question: how does it scan the files?
I have all my license files (*.txt) in one folder. Do I need to specify the path to it?
Does it also work on a folder recursively?

Please add more details to README.md. Thank you.

ClassifyLicensesWithContext doesn't actually close when context does

ClassifyLicensesWithContext doesn't actually close when context does...

If you look here (licenseclassifier/tools/identify_license/backend/backend.go, line 107 in df6aa8a):

func (b *ClassifierBackend) ClassifyLicensesWithContext(ctx context.Context, filenames []string, headers bool) (errors []error) {

You'll see:

	done := make(chan bool)
	go func() {
		errors = b.ClassifyLicenses(filenames, headers)
		done <- true
	}()
	select {
	case <-ctx.Done():
		err := ctx.Err()
		errors = append(errors, err)
		return errors
	case <-done:
		return errors
	}

The goal when you provide a context (ctx) is to have a mechanism for signalling that you want the work to shut down. This is usually a timeout, but it can also be a ^C signal, for example. The problem with the above code is that it doesn't actually work: it causes the function to return, but that doesn't stop the goroutine doing the actual work from continuing.

If you happen to shut down the entire program by exiting main, then any running goroutines will stop, so this might appear to do what you expect, but it's actually a context bug.
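One common way to honor cancellation is to check the context between units of work, so the loop itself stops when the context is done. The sketch below illustrates that pattern only; `classifyOne`, `classifyAll`, and `runWithTimeout` are hypothetical names, and the real backend may need a different structure.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// classifyOne stands in for the per-file classification work (hypothetical).
func classifyOne(name string) error {
	time.Sleep(time.Millisecond)
	return nil
}

// classifyAll processes the files one at a time and checks ctx before each
// file, so cancellation stops the work itself rather than just the caller.
// It returns the number of files processed and any errors collected.
func classifyAll(ctx context.Context, filenames []string) (int, []error) {
	var errs []error
	processed := 0
	for _, name := range filenames {
		select {
		case <-ctx.Done():
			return processed, append(errs, ctx.Err())
		default:
		}
		if err := classifyOne(name); err != nil {
			errs = append(errs, err)
		}
		processed++
	}
	return processed, errs
}

// runWithTimeout wraps classifyAll in a context that expires after d.
func runWithTimeout(d time.Duration, filenames []string) (int, []error) {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()
	return classifyAll(ctx, filenames)
}

func main() {
	files := []string{"a.txt", "b.txt", "c.txt"}
	n, errs := runWithTimeout(time.Second, files)
	fmt.Println("processed:", n, "errors:", errs)
}
```

With this shape, no goroutine is left running after the function returns, because the work happens in the caller's goroutine and stops at the next check.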

If you'd like help fixing this, then let me know and I'm happy to advise. If you're accepting patches, please let me know, and if I have time I can look into writing one in the future.

Thank you!

bug: Index out of range when splitting path while loading assets on windows

When the assets are loaded for v2 on Windows, the path of each embedded license file is split on os.PathSeparator, which is incorrect on Windows.

Splitting on a backslash yields a single element, so indexing the result causes an index out of range panic:

panic: runtime error: index out of range [1] with length 1 [recovered]
	panic: runtime error: index out of range [1] with length 1

As per the embed documentation (https://pkg.go.dev/embed), the separator for an embedded filesystem is a forward slash, even on Windows:

The //go:embed directive accepts multiple space-separated patterns for brevity, but it can also be repeated, to avoid very long lines when there are many patterns. The patterns are interpreted relative to the package directory containing the source file. The path separator is a forward slash, even on Windows systems. Patterns may not contain ‘.’ or ‘..’ or empty path elements, nor may they begin or end with a slash. To match everything in the current directory, use ‘*’ instead of ‘.’. To allow for naming files with spaces in their names, patterns can be written as Go double-quoted or back-quoted string literals.
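The difference can be shown without any embedded files. In this minimal sketch the path string is a made-up example, and `splitEmbedPath` is a hypothetical helper that hard-codes the forward slash instead of relying on os.PathSeparator.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// splitEmbedPath splits a path that came from an embedded filesystem.
// Such paths always use forward slashes, so the separator is hard-coded
// rather than taken from os.PathSeparator.
func splitEmbedPath(p string) []string {
	return strings.Split(p, "/")
}

func main() {
	p := "assets/License/MIT/license.txt" // made-up embedded path

	// Splitting on a backslash, as happens on Windows with
	// os.PathSeparator, leaves the path in one piece, so parts[1]
	// would be out of range.
	fmt.Println(len(strings.Split(p, `\`)))

	parts := splitEmbedPath(p)
	fmt.Println(len(parts), parts[1])

	// The path package (not path/filepath) also assumes forward slashes.
	fmt.Println(path.Base(p))
}
```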

Lucent Public License Version 1.02 not picked up correctly

The LICENSE here is LPL-1.02, but it is picked up as UNKNOWN:

https://pkg.go.dev/9fans.net/[email protected]/draw?tab=overview

Release another v2 version?

The latest release, https://github.com/google/licenseclassifier/releases/tag/v2.0.0-alpha.1, was a while back.
There have been quite a few good improvements since then.

Shall we tag another version for v2? When people use go get, it defaults to tagged versions first.

Bug in the computeQ function - v2 classifier

Describe the issue
In computeQ, when the threshold is set to 1.0 the granularity is calculated as 10, but when the threshold is set to 0.95, 0.99, or 0.999 the granularity is calculated as 19, 99, and 999, respectively. The granularity grows rapidly as the threshold approaches 1.0, and it exceeds the granularity of 10 used at the maximum threshold (1.0).

Is this intentional?

As a result, when the threshold is set to 0.95 or greater, many licenses go undetected that are easily detected at a threshold of 0.9.

I ran the program on around 17,300 license files. Around 2,950 BSD-3-Clause, 850 BSD-2-Clause, and some other licenses were not detected at all, although they were detected at a granularity of 10, because at that threshold the granularity is greater than 20 and nearly reaches 100.

A possible solution would be to set the granularity to 10 for any threshold greater than 0.9, which would also handle the divide-by-zero case.
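The numbers reported above are consistent with a granularity of roughly threshold / (1 - threshold). The sketch below assumes that formula (it is inferred from the issue, not copied from the library) and shows the proposed cap next to it. Both function names are hypothetical, and floating-point truncation may yield a value one lower than the figures quoted in the issue.

```go
package main

import "fmt"

// uncappedQ models the behavior described in the issue: a fixed value of 10
// at threshold 1.0, and otherwise threshold/(1-threshold) truncated to an
// integer, with a minimum of 1. The formula is an assumption inferred from
// the reported numbers.
func uncappedQ(threshold float64) int {
	if threshold >= 1.0 {
		return 10 // avoid dividing by zero
	}
	q := int(threshold / (1.0 - threshold))
	if q < 1 {
		return 1
	}
	return q
}

// cappedQ applies the fix proposed in the issue: any threshold above 0.9
// uses a granularity of 10, which also covers the divide-by-zero case.
func cappedQ(threshold float64) int {
	if threshold > 0.9 {
		return 10
	}
	return uncappedQ(threshold)
}

func main() {
	for _, t := range []float64{0.75, 0.9, 0.95, 0.99, 0.999, 1.0} {
		fmt.Printf("threshold=%.3f uncapped=%d capped=%d\n", t, uncappedQ(t), cappedQ(t))
	}
}
```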

Add blessing license to unencumbered licenses?

I'm using SQLite3, which is under the blessing license, and I'd like to add that license to the unencumbered licenses in this library.
Can I do that?

https://spdx.org/licenses/blessing.html

Need Help using the tool

Please tell me how to use this tool.
I am a Go newbie and am having difficulty running it.
I ran go get github.com/google/licenseclassifier/v2
then
go install github.com/google/licenseclassifier/v2@latest
but I am getting this:
go install: package github.com/google/licenseclassifier/v2 is not a main package
So I imported it in my main package, but I am not sure how to use it there.

I am sorry if this looks silly.
Please help me.

How to install this tool

Can you please add the instructions on how to install this package using go install?

Typo error inside header function documentation

I think I found a small spelling mistake while going through the code. The word end on this line has been misspelled as endd. If it's intentional please close the issue.

Publish binary releases for common platforms

We're considering adding the identify_license tool as what we call a "scanner" to ORT; see here for some context. As such, it would be nice if we could easily bootstrap identify_license by simply downloading a binary for the respective platform.

So, would it be possible to cut a new release (also see #37) and attach binary assets to it?

Trying to vendor go-licenses, which uses this, fails because the license DB is not vendored.

I see how this is trying to be very clever and find the license DB in its own source code path.

This totally breaks down in the face of Go's vendoring, which elides directories that do not have linked Go code in them.

So....ideas? Wouldn't it be better to embed the DB into the binary and make it self contained?

Add main package file in v2

Perhaps it would be good to have a main package file in v2 of licenseclassifier, as in v1. Alongside license detection capabilities, it could also support copyright detection, JSON results, and scanning an entire directory for faster and more efficient results.

Thoughts?

Add a way to specify the db file

In the case where I distribute the binary and the db file separately, I want to be able to specify where licenseclassifier should look for the db. This is related to go-licenses (https://github.com/google/go-licenses/blob/master/licenses/classifier.go).

As of today, you need the source of the go-licenses binary in the right path (GOPATH, …) to be able to use it.

Another alternative would be to embed the .db in the binary 👼
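A lookup order along these lines could cover the request: an explicit flag first, then an environment variable, then the current default. This is only a sketch of the idea; the flag name `-license_db`, the variable `LICENSE_DB`, and `resolveDBPath` are all hypothetical and do not exist in the library.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// resolveDBPath picks the license database location. An explicit flag value
// wins, then an environment variable, then the built-in default.
func resolveDBPath(flagValue, envValue, fallback string) string {
	switch {
	case flagValue != "":
		return flagValue
	case envValue != "":
		return envValue
	default:
		return fallback
	}
}

func main() {
	// -license_db and LICENSE_DB are hypothetical names used only here.
	dbFlag := flag.String("license_db", "", "path to the licenses.db archive")
	flag.Parse()

	dbPath := resolveDBPath(*dbFlag, os.Getenv("LICENSE_DB"), "licenses.db")
	fmt.Println("using license database:", dbPath)
}
```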

Switch to go:embed

These two .db files are looked up directly at runtime. Does anyone have any objection to switching this to go:embed?
And is there a maintainer around who would want to merge such a fix?
This would need Go 1.16 as a minimum, but that's pretty reasonable.

https://github.com/google/licenseclassifier/blob/main/licenses/forbidden_licenses.db

Cheers!

Where does list of licenses come from?

Hi,

I was playing with this code and, to give one example, I found this license: https://github.com/google/licenseclassifier/blob/main/licenses/Facebook-2-Clause.txt
Your source code (licenseclassifier/license_type.go, line 26 in 148b633) says:

The names come from the https://spdx.org/licenses website, and are also the filenames of the licenses in licenseclassifier/licenses.

I can't actually find this license and certain others on the SPDX website. Could you elaborate on how you maintain the list and how you found this one on the SPDX site?

I think it's great that you have more licenses in here, but I just want to make sure there isn't a collision since the returned identifiers are meant to be unique. If these aren't actually in SPDX, are you willing to rename them or accept a patch to do so which adds the SPDX LicenseRef- prefixes? (These prefixes are there to specify non-SPDX licenses.)
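The renaming suggested here could be as simple as a lookup against the SPDX list. In the sketch below, `spdxIDs` is a tiny illustrative subset rather than the real list, and `identifier` is a hypothetical helper.

```go
package main

import "fmt"

// spdxIDs is a tiny illustrative subset of the SPDX license list; a real
// implementation would load the full list.
var spdxIDs = map[string]bool{
	"MIT":          true,
	"Apache-2.0":   true,
	"BSD-3-Clause": true,
}

// identifier returns name unchanged when it is a known SPDX ID and otherwise
// prefixes it with "LicenseRef-" to mark it as a non-SPDX license.
func identifier(name string) string {
	if spdxIDs[name] {
		return name
	}
	return "LicenseRef-" + name
}

func main() {
	for _, name := range []string{"MIT", "Facebook-2-Clause"} {
		fmt.Printf("%s -> %s\n", name, identifier(name))
	}
}
```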

Thanks!

Found a small typo.

In licenseclassifier/v2/classifier_test.go, line 127 in bb04aff:

name: "overlap at end",

I think line 127 should read name: "overlap at start".

Compiled default classifier

DefaultClassifier is a bit expensive because it tokenizes and normalizes the licenses, and we don't want to do that on every run. Is there any way to pass in already-parsed docs, given that docs is a private field? What if we added NewClassifierWithDocs or something like that?
https://github.com/aquasecurity/licenseclassifier/blob/c913e304a1534c4580fa70c2c3af5cd85d99fc9c/v2/classifier.go#L194
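The general idea of tokenizing once and reloading the result can be sketched with encoding/gob. Everything here is hypothetical: `tokenizedDoc` is a stand-in for the classifier's private document type, and the tokenizer is a trivial placeholder rather than the library's.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"strings"
)

// tokenizedDoc is a hypothetical stand-in for the classifier's private
// parsed-document type.
type tokenizedDoc struct {
	Name   string
	Tokens []string
}

// tokenize is a placeholder tokenizer: lowercase, then split on whitespace.
func tokenize(name, text string) tokenizedDoc {
	return tokenizedDoc{Name: name, Tokens: strings.Fields(strings.ToLower(text))}
}

// encodeDocs serializes already-tokenized documents so that a later run can
// skip the tokenization step.
func encodeDocs(docs []tokenizedDoc) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(docs); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeDocs restores documents produced by encodeDocs.
func decodeDocs(data []byte) ([]tokenizedDoc, error) {
	var docs []tokenizedDoc
	err := gob.NewDecoder(bytes.NewReader(data)).Decode(&docs)
	return docs, err
}

func main() {
	docs := []tokenizedDoc{tokenize("MIT", "Permission is hereby granted")}
	data, err := encodeDocs(docs)
	if err != nil {
		panic(err)
	}
	restored, err := decodeDocs(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(restored[0].Name, restored[0].Tokens)
}
```

A constructor such as the NewClassifierWithDocs proposed above could then accept the decoded documents directly, though that API does not exist today.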