jelmer / upstream-ontologist


discover information about upstream projects

License: GNU General Public License v2.0

Python 5.46% Makefile 0.08% Rust 90.33% HTML 4.14%

upstream ontology

upstream-ontologist's Introduction

Upstream Ontologist

The upstream ontologist provides a common interface for finding metadata about upstream software projects.

It gathers information from any available sources, prioritizes the data in which it has higher confidence, and reports the confidence for each piece of metadata.
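As a rough illustration of confidence-based prioritization, the sketch below keeps the most certain value per field. The `Datum` class, the certainty level names and the example URLs are assumptions made for this example; they are not taken from the ontologist's actual API.

```python
from dataclasses import dataclass

# Hypothetical certainty levels, ordered from weakest to strongest.
CERTAINTY_ORDER = ["possible", "likely", "confident", "certain"]


@dataclass(frozen=True)
class Datum:
    name: str       # metadata field, e.g. "Homepage"
    value: str
    certainty: str  # one of CERTAINTY_ORDER
    origin: str     # where the value was found, e.g. "setup.py"


def rank(datum):
    """Map a datum's certainty to a sortable integer."""
    return CERTAINTY_ORDER.index(datum.certainty)


def merge(data):
    """Keep, for each field, the datum with the highest certainty.

    On a tie, the datum seen first wins.
    """
    best = {}
    for datum in data:
        current = best.get(datum.name)
        if current is None or rank(datum) > rank(current):
            best[datum.name] = datum
    return best


found = [
    Datum("Homepage", "https://example.com/from-readme", "possible", "README"),
    Datum("Homepage", "https://example.com/from-setup", "certain", "setup.py"),
]
for datum in merge(found).values():
    print(f"{datum.name}: {datum.value} ({datum.certainty}, from {datum.origin})")
```

Keeping the origin next to each value makes it possible to report not only how confident a guess is but also where it came from.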

The ontologist originated in Debian and the currently reported metadata fields are loosely based on DEP-12, but it is meant to be distribution-agnostic.

Provided Fields

Standard fields:

Homepage: homepage URL
Name: human name of the upstream project
Contact: contact address of some sort of the upstream (e-mail, mailing list URL)
Repository: VCS URL
Repository-Browse: Web URL for viewing the VCS
Bug-Database: Bug database URL (for web viewing, generally)
Bug-Submit: URL to use to submit new bugs (either on the web or an e-mail address)
Screenshots: List of URLs with screenshots
Archive: Archive used - e.g. SourceForge
Security-Contact: e-mail or URL with instructions for reporting security issues
Documentation: Link to documentation on the web

Extensions for upstream-ontologist, not defined in DEP-12:

SourceForge-Project: sourceforge project name
Wiki: Wiki URL
Summary: one-line description of the project
Description: longer description of the project
License: Single-line license (e.g. "GPL 2.0")
Copyright: List of copyright holders
Version: Current upstream version
Security-MD: URL to markdown file with security policy
Author: List of people who contributed to the project
Maintainer: The maintainer of the project
Funding: URL to more information about funding

Supported Data Sources

At the moment, the ontologist can read metadata from the following upstream data sources:

Python package metadata (PKG-INFO, setup.py, setup.cfg, pyproject.toml)
package.json
composer.json
package.xml
Perl package metadata (dist.ini, META.json, META.yml, Makefile.PL)
Perl POD files
GNU configure files
R DESCRIPTION files
Rust Cargo.toml
Maven pom.xml
metainfo.xml
.git/config
SECURITY.md
DOAP
Haskell cabal files
go.mod
Ruby gemspec files
nuspec files
OPAM files
Debian packaging metadata (debian/watch, debian/control, debian/rules, debian/get-orig-source.sh, debian/copyright, debian/patches)
Dart's pubspec.yaml
meson.build

It will also scan README and INSTALL for possible upstream repository URLs (and will attempt to verify that those match the local repository).

In addition to local files, it can also consult external directories using their APIs:

Example Usage

The easiest way to use the upstream ontologist is by invoking the guess-upstream-metadata command in a software project:

$ guess-upstream-metadata ~/src/dulwich
Security-MD: https://github.com/dulwich/dulwich/tree/HEAD/SECURITY.md
Name: dulwich
Version: 0.20.15
Bug-Database: https://github.com/dulwich/dulwich/issues
Repository: https://www.dulwich.io/code/
Summary: Python Git Library
Bug-Submit: https://github.com/dulwich/dulwich/issues/new

Alternatively, there is a Python API. There are also autocodemeta and autodoap commands that can generate output in the codemeta and DOAP formats, respectively.

upstream-ontologist's Issues

support using github credentials

The GitHub API has a low rate-limit per IP for unauthenticated users, which we regularly hit on the scruffy.

The upstream ontologist should support having credentials passed in, or reading them from the user's home directory.

For simplicity, we should support the GITHUB_TOKEN environment variable, in addition to potentially reading credentials from the user's home directory.

Use GITHUB_TOKEN for API operations
Inject into Breezy git operations
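A minimal sketch of the first item, assuming the token is sent as a bearer token in the Authorization header of GitHub API requests. The helper names `github_headers` and `github_request` are hypothetical and not part of the ontologist.

```python
import os
import urllib.request


def github_headers(environ=os.environ):
    """Build request headers, adding credentials when GITHUB_TOKEN is set."""
    headers = {"Accept": "application/vnd.github+json"}
    token = environ.get("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


def github_request(url, environ=os.environ):
    """Return an unsent urllib Request for a GitHub API URL."""
    return urllib.request.Request(url, headers=github_headers(environ))


# Without a token the request stays anonymous and is subject to the
# lower per-IP rate limit.
request = github_request(
    "https://api.github.com/repos/jelmer/upstream-ontologist",
    environ={"GITHUB_TOKEN": "example-token"},
)
```

Passing the environment in as a parameter keeps the helper easy to test and leaves room for a later fallback that reads credentials from the user's home directory.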

python3-upstream-ontologist: should depend on python3-ruamel.yaml and python3-breezy

Hi, since there isn't a discussion tab, I had to open an issue. I found https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1029750 on the bug tracking system and would like to work on it, but I could not set up my development environment, as I'm using a Windows machine with WSL enabled. Could you guide me through setting up the development environment and mentor me in resolving this bug? I'm fairly new to Debian. Here is the list of things that I've done so far:

subscribed to the relevant mailing lists
understood the bug-tracking system
asked #debian-mentors on IRC how to set up the dev environment; they told me to use sbuild for building the package and Docker for testing, but I don't know how to reproduce the error on WSL (Ubuntu)

So as of now, I need help setting up my dev environment and reproducing the error.
Thanks, and apologies if this isn't the right way to ask for help.

restructure API

The current API in Rust is basically a 1:1 copy of what existed in Python. Instead, we should probably have a more rustic API.

Should UpstreamDatum be an enum (like it is now), or a trait for example?

consult anitya (https://release-monitoring.org/)

See https://release-monitoring.org/

follow redirects in Repository URLs

See edenhill/kcat#372

upstream-ontologist currently verifies that URLs are valid, but it should update URLs when it receives 30X responses.
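The behaviour asked for here could look roughly like the sketch below. It assumes a hypothetical `probe` callable that performs one request without following redirects and returns the status code and the Location header; the function name, the set of status codes and the hop limit are illustrative choices, not the ontologist's implementation.

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def resolve_redirects(url, probe, max_hops=5):
    """Follow redirects starting at `url` and return the final URL.

    `probe` takes a URL and returns (status_code, location_or_None)
    without following redirects itself.
    """
    seen = {url}
    for _ in range(max_hops):
        status, location = probe(url)
        if status not in REDIRECT_CODES or not location:
            return url
        url = urljoin(url, location)  # Location may be relative
        if url in seen:
            raise ValueError(f"redirect loop at {url}")
        seen.add(url)
    raise ValueError("too many redirects")


# A fake probe standing in for real HTTP requests.
responses = {
    "https://old.example/repo": (301, "https://new.example/repo"),
    "https://new.example/repo": (200, None),
}
print(resolve_redirects("https://old.example/repo", responses.__getitem__))
```

Whether temporary redirects (302, 307) should also rewrite the stored Repository URL is a design question; restricting the update to permanent redirects (301, 308) would be the more conservative choice.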

merge gemfile parsing code with gemfileparser

upstream-ontologist currently has its own gemfile parsing logic. It would be great to use a standard solution like https://github.com/gemfileparser/gemfileparser instead.

(that currently doesn't support parsing non-dependency metadata, but that should be fixable)

suggests invalid bug URLs for GitLab instances

In https://salsa.debian.org/debian/decopy/-/merge_requests/3, lintian-brush suggests a Bug Database URL that is broken.

I can reproduce this with guess-upstream-metadata:

Name: decopy
Repository: https://salsa.debian.org/debian/decopy.git
Homepage: https://salsa.debian.org/debian/decopy
X-Version: 0.2.4.7
X-Summary: Automatic debian/copyright Generator
X-Description: |2-

      Decopy automates writing and updating the debian/copyright file.

      It reads all files in the source tree, analyzes the licenses and copyright
      messages included and generates the corresponding debian/copyright file.
      When the file already exists, decopy parses it to generate a more complete
      output.

X-License: ISC
X-Author:
- !Person
  name: Maximiliano Curia
  email: [email protected]
  url:
Repository-Browse: https://salsa.debian.org/debian/decopy
Bug-Database: https://salsa.debian.org/debian/decopy/issues
Bug-Submit: https://salsa.debian.org/debian/decopy/issues/new

However, while the project exists, it does not have issues enabled.

support multiple maintainers

Some projects have multiple maintainers, and e.g. DOAP files will list all of them.

It would be good to change the X-Maintainer field into List[Person] rather than just a single string.
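One possible shape for this, as a sketch only: a `Person` record with the name, email and url attributes that appear in the X-Author output, and a helper that turns a list of "Name <email>" strings into a list of persons. The parsing rule, the helper names and the example people are assumptions.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Matches "Name <email>", a bare name, or a bare "<email>".
_PERSON_RE = re.compile(r"^\s*(?P<name>[^<]*?)\s*(?:<(?P<email>[^>]+)>)?\s*$")


@dataclass
class Person:
    name: Optional[str]
    email: Optional[str] = None
    url: Optional[str] = None

    @classmethod
    def from_string(cls, text):
        """Parse 'Name <email>'; either part may be missing."""
        match = _PERSON_RE.match(text)
        if match is None:
            # Unparseable input: keep the raw text as the name.
            return cls(name=text.strip() or None)
        return cls(name=match.group("name") or None, email=match.group("email"))


def parse_maintainers(values) -> List[Person]:
    """Turn every maintainer entry into a Person, preserving order."""
    return [Person.from_string(value) for value in values]


maintainers = parse_maintainers(["Jane Doe <jane@example.com>", "John Roe"])
```

A project with a single maintainer would then simply yield a one-element list, so consumers only need to handle one type.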

analyze README/homepage with AI?

It would be good to fall back to AI models using the transformers library, perhaps with low certainty. We could use these on the README file and homepage contents.

See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1038326 for the Debian ITP

https://huggingface.co/tiiuae/falcon-40b for a model that could potentially be shipped somewhere in Debian?

parse README.md with markdown to extract long description

Some approximation of a long description can probably be done by parsing README.md and:

skipping over the initial header for the project
taking the paragraphs until the next header
filtering out anything clearly irrelevant ("See INSTALL for ... ")

If there are too many paragraphs, perhaps just take the first paragraph. Otherwise, take all paragraphs.
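The steps above can be sketched with the standard library alone. This is a simplified approximation, not a full markdown parser: it only recognizes `#`-style headers, and the paragraph limit and the "irrelevant" pattern are assumptions chosen for illustration.

```python
import re

HEADER_RE = re.compile(r"^#{1,6}\s")
# Hypothetical filter for paragraphs that only point at other files.
IRRELEVANT_RE = re.compile(r"^See (INSTALL|COPYING|LICENSE)\b", re.IGNORECASE)


def description_from_readme(text, max_paragraphs=3):
    """Approximate a long description from README.md contents."""
    # Split on blank lines into blocks.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    # Skip the initial project header, keeping any text directly below it.
    if blocks and HEADER_RE.match(blocks[0]):
        _, _, rest = blocks[0].partition("\n")
        blocks = ([rest] if rest.strip() else []) + blocks[1:]
    paragraphs = []
    for block in blocks:
        if HEADER_RE.match(block):
            break  # stop at the next header
        if IRRELEVANT_RE.match(block):
            continue
        paragraphs.append(" ".join(block.split()))  # unwrap hard line breaks
    # Too many paragraphs: fall back to just the first one.
    if len(paragraphs) > max_paragraphs:
        paragraphs = paragraphs[:1]
    return "\n\n".join(paragraphs)


readme = """# Project

First paragraph
wrapped over two lines.

See INSTALL for build steps.

Second paragraph.

## Usage

Run it.
"""
print(description_from_readme(readme))
```

Using a real markdown parser, as the issue title suggests, would additionally handle underlined headers, badges and inline markup, which this sketch leaves untouched.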

@isomer