License Classifier

Introduction

The license classifier is a library and set of tools that can analyze text to determine what type of license it contains. It searches for license texts in a file and compares them to an archive of known licenses. These files could be, e.g., LICENSE files containing one or more licenses, or source code files with the license text in a comment.

A "confidence level" is associated with each result indicating how close the match was. A confidence level of 1.0 indicates an exact match, while a confidence level of 0.0 indicates that no license was able to match the text.

Adding a new license

Adding a new license is straightforward:

  1. Create a file in licenses/.

    • The filename should be the name of the license or its abbreviation. If the license is an Open Source license, use the appropriate identifier specified at https://spdx.org/licenses/.
    • If the license is the "header" version of the license, append the suffix ".header" to it. See licenses/README.md for more details.
  2. Add the license name to the list in license_type.go.

  3. Regenerate the licenses.db file by running the license serializer:

    $ license_serializer -output licenseclassifier/licenses
  4. Create and run appropriate tests to verify that the license is indeed present.

Tools

Identify license

identify_license is a command line tool that can identify the license(s) within a file.

$ identify_license LICENSE
LICENSE: GPL-2.0 (confidence: 1, offset: 0, extent: 14794)
LICENSE: LGPL-2.1 (confidence: 1, offset: 18366, extent: 23829)
LICENSE: MIT (confidence: 1, offset: 17255, extent: 1059)

License serializer

The license_serializer tool regenerates the licenses.db archive. The archive contains preprocessed license texts for quicker comparisons against unknown texts.

$ license_serializer -output licenseclassifier/licenses

This is not an official Google product (experimental or otherwise), it is just code that happens to be owned by Google.

licenseclassifier's Issues

Publish binary releases for common platforms

We're considering adding the identify_license tool as what we call a "scanner" to ORT, see here for some context. As such, it would be nice if we could easily bootstrap identify_license by simply downloading a binary for the respective platform.

So, would it be possible to cut a new release (also see #37) and attach binary assets to it?

bug: Index out of range when splitting path while loading assets on windows

When the assets are loaded for v2 on Windows, the path of the embedded license files is split on os.PathSeparator, but this is incorrect on Windows.

Splitting on a backslash causes an index out of range panic:

panic: runtime error: index out of range [1] with length 1 [recovered]
	panic: runtime error: index out of range [1] with length 1

As per the embed documentation (https://pkg.go.dev/embed), the separator for an embedded filesystem is a forward slash, even on Windows:

The //go:embed directive accepts multiple space-separated patterns for brevity, but it can also be repeated, to avoid very long lines when there are many patterns. The patterns are interpreted relative to the package directory containing the source file. The path separator is a forward slash, even on Windows systems. Patterns may not contain ‘.’ or ‘..’ or empty path elements, nor may they begin or end with a slash. To match everything in the current directory, use ‘*’ instead of ‘.’. To allow for naming files with spaces in their names, patterns can be written as Go double-quoted or back-quoted string literals.
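A minimal sketch of the point made in this issue: paths coming from an embedded filesystem can be split on "/" regardless of the host OS. The helper names and the root/category/name/file layout below are hypothetical and are not taken from the library; the length check is what avoids the index out of range panic.

```go
package main

import (
	"fmt"
	"strings"
)

// splitEmbedPath splits a path taken from an embedded filesystem.
// Embedded paths use "/" on every platform, so os.PathSeparator
// (a backslash on Windows) is deliberately not used here.
func splitEmbedPath(p string) []string {
	return strings.Split(p, "/")
}

// categoryAndName returns the second and third elements of a path laid
// out as root/category/name/file. This layout is an assumption made for
// illustration. ok is false when the path has too few elements, which
// replaces the panic with a value the caller can check.
func categoryAndName(p string) (category, name string, ok bool) {
	parts := splitEmbedPath(p)
	if len(parts) < 3 {
		return "", "", false
	}
	return parts[1], parts[2], true
}

func main() {
	category, name, ok := categoryAndName("assets/License/MIT/license.txt")
	fmt.Println(category, name, ok)
}
```

The standard library's path package (as opposed to path/filepath) follows the same forward-slash convention and could be used instead of strings.Split.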

Need Help using the tool

Please tell me how to use this tool.
I am a Go newbie and am having difficulty running it.
I ran go get github.com/google/licenseclassifier/v2
then
go install github.com/google/licenseclassifier/v2@latest
but I am getting this:
go install: package github.com/google/licenseclassifier/v2 is not a main package
So I imported it in main, but I am not sure how to use it there.
I am sorry if this looks silly.
Please help me.

Bug in the computeQ function - v2 classifier

Describe the issue
In computeQ, when the threshold is set to 1.0, the granularity is calculated as 10. If we set the threshold to 0.95, 0.99, or 0.999, the granularity is calculated as 19, 99, and 999, respectively. The growth is rapid, and each of these values is greater than the granularity used at the maximum threshold (1.0), which is 10.

Is this intentional?

As a result, when we set the threshold to 0.95 or greater, many licenses that are easily detected at a threshold of 0.9 are no longer detected.

I ran the program on around 17,300 license files. Around 2950 BSD-3-Clause, 850 BSD-2-Clause, and some other licenses were not detected at all, although they were detected at a granularity of 10, because at that threshold the granularity is greater than 20 and nearly reaches 100.

A possible solution would be to set the granularity to 10 for any threshold greater than 0.9, which would also handle the divide-by-zero case.
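The figures quoted above (19, 99, 999) are consistent with a granularity of roughly threshold / (1 - threshold). The sketch below assumes that formula purely for illustration; it is not copied from computeQ, and the function names are hypothetical. It also shows the cap proposed in this issue.

```go
package main

import (
	"fmt"
	"math"
)

// granularity reproduces the numbers quoted in the issue under the
// ASSUMED formula threshold/(1-threshold), with 1.0 special-cased.
// The library's actual computeQ may differ.
func granularity(threshold float64) int {
	if threshold >= 1.0 {
		return 10 // special case: avoids dividing by zero
	}
	// Round instead of truncating: in float64 the quotient for 0.95
	// can land slightly below 19.
	return int(math.Round(threshold / (1.0 - threshold)))
}

// cappedGranularity applies the fix proposed in the issue: any
// threshold above 0.9 uses a granularity of 10.
func cappedGranularity(threshold float64) int {
	if threshold > 0.9 {
		return 10
	}
	return granularity(threshold)
}

func main() {
	for _, t := range []float64{0.8, 0.9, 0.95, 0.99, 0.999, 1.0} {
		fmt.Printf("threshold=%.3f uncapped=%d capped=%d\n",
			t, granularity(t), cappedGranularity(t))
	}
}
```

With the cap in place, the granularity no longer jumps from 9 at a threshold of 0.9 to 19 and beyond just above it. Whether 10 is the right ceiling is a design choice the issue leaves open.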

Where does list of licenses come from?

Hi,

I was playing with this code and to give one example, I found this license: https://github.com/google/licenseclassifier/blob/main/licenses/Facebook-2-Clause.txt
It says in your source code that:

The names come from the https://spdx.org/licenses website, and are also the filenames of the licenses in licenseclassifier/licenses.

I can't actually find this and certain other licenses on the SPDX website. Could you elaborate on how you maintain the list and how you found this on the SPDX site?

I think it's great that you have more licenses in here, but I just want to make sure there isn't a collision since the returned identifiers are meant to be unique. If these aren't actually in SPDX, are you willing to rename them or accept a patch to do so which adds the SPDX LicenseRef- prefixes? (These prefixes are there to specify non-SPDX licenses.)

Thanks!

Add main package file in v2

Perhaps it would be good to have a main package file in v2 of licenseclassifier, as in v1. Alongside license detection capabilities, it could also support copyright detection, JSON results, and scanning an entire directory for faster and more efficient results.

Thoughts?

ClassifyLicensesWithContext doesn't actually close when context does

ClassifyLicensesWithContext doesn't actually close when context does...

If you look here:

func (b *ClassifierBackend) ClassifyLicensesWithContext(ctx context.Context, filenames []string, headers bool) (errors []error) {

You'll see:

	done := make(chan bool)
	go func() {
		errors = b.ClassifyLicenses(filenames, headers)
		done <- true
	}()
	select {
	case <-ctx.Done():
		err := ctx.Err()
		errors = append(errors, err)
		return errors
	case <-done:
		return errors
	}

The goal when you provide a context (ctx) is to have a mechanism to signal something that you want it to shut down. This is usually a timeout, but it can also be a ^C signal, for example. The problem with the above code is that it doesn't actually work: it will cause the function to return, but that doesn't stop the goroutine doing the actual work from continuing...

If you happen to shut down the entire program by exiting main, then any running goroutines will stop, so this might appear to do what you expect, but it's actually a context bug.

If you'd like help fixing this, then let me know and I'm happy to advise. If you're accepting patches, please let me know, and if I have time I can look into writing one in the future.

Thank you!

match offset and extent out of range

Thank you for this amazing library! I'm amazed by its speed and accuracy, this is truly great work!

One problem I noticed:

When I use MultipleMatch and parse the offset and extent, I found them to be inaccurate.
This is understandable if the algorithm is not perfect. However, I found cases where the offset and extent are invalid for the full input text: offset + extent is greater than the length of the full text.

My guess is that the offset and extent do not take the text normalization step (norm := normalizeText(contents)) into consideration. So they are in fact the offset and extent within the normalized text, which doesn't match the original full text.

Can you give more details of how to run the application?

Hi. Can you give more details on how to run the application? README.md is too sparse on information.
I tried go run license_serializer.go. It works fine.
But identify_license can't find licenses.db.
I get the error: 2023/03/29 20:33:49 cannot create license classifier: cannot register licenses from archive: open licenses.db: file does not exist.
Where does licenses.db need to be located?

Another question: how does it scan the files?
I have all license files (*.txt) in one folder. Do I need to specify the path to this folder?
Does it also work on a folder recursively?

Please add more details in README.md. Thank you.
