wheelodex / wheelodex

An index of wheels

Home Page: https://www.wheelodex.org

License: MIT License

Python 82.18% CSS 1.29% HTML 15.33% Mako 0.33% Shell 0.40% Jinja 0.47%

python pypi wheel packages pep427 website

wheelodex's Introduction

Site | GitHub | Issues | Changelog

Packaged projects for the Python programming language are distributed in two main formats: sdists (archives of code and other files that require processing before they can be installed) and wheels (zipfiles of code ready for immediate installation). A project's wheel contains the complete information about what modules, files, & commands the project installs, along with information about what other projects the project depends on, but the Python Package Index (PyPI) (where wheels are distributed) doesn't expose any of this information! This is the problem that Wheelodex is here to solve.

Wheelodex scans PyPI for wheel files, analyzes them, and stores & displays the results. The site allows users to view the complete metadata inside wheels, search for wheels containing a given Python module or file, browse or search for wheels that define a given command or other entry point, and even find out projects' reverse dependencies.

Note that, in order to save disk space, Wheelodex only records data on wheels from the latest version of each PyPI project; wheels from older versions are periodically purged from the database. Projects' long descriptions aren't even recorded at all.

Suggestions and pull requests are welcome.

wheelodex's Issues

Try to speed up queries with Yield Per

SQLAlchemy documentation: https://docs.sqlalchemy.org/en/20/orm/queryguide/api.html#orm-queryguide-yield-per

Possible places where this could be used:

Fetching wheels for the dump command (currently paginated via Flask-SQLAlchemy)
Assuming it's safe to use Yield Per while modifying the results:
- Project query in purge_old_versions()
- Wheel.to_process()
- process_orphan_wheels()
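
For anyone unfamiliar with the idea, here is a minimal sketch of batched fetching. It deliberately uses the standard library's sqlite3 and `cursor.fetchmany()` rather than SQLAlchemy, so it only illustrates the principle (hold at most one batch of rows in memory at a time); in SQLAlchemy 2.0 the corresponding request is the `yield_per` execution option described in the linked documentation. The table, its columns, and the `iter_in_batches()` helper are made up for this example.

```python
import sqlite3


def iter_in_batches(cursor, size):
    """Yield lists of at most `size` rows until the cursor is exhausted."""
    while True:
        rows = cursor.fetchmany(size)
        if not rows:
            return
        yield rows


# Hypothetical stand-in table; not Wheelodex's real schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wheels (id INTEGER PRIMARY KEY, filename TEXT)")
conn.executemany(
    "INSERT INTO wheels (filename) VALUES (?)",
    [(f"pkg{i}-1.0-py3-none-any.whl",) for i in range(250)],
)

cursor = conn.execute("SELECT id, filename FROM wheels ORDER BY id")
# Only one batch of rows is materialized at any moment
batch_sizes = [len(batch) for batch in iter_in_batches(cursor, 100)]
```

Whether it is safe to modify or delete rows while such a batched read is still in progress is exactly the open question above; this sketch only reads.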

Ability to sort reverse dependencies by a proxy for usage

Thanks for wheelodex! I find the "reverse dependencies" option really useful, but it would be nice to be able to sort the list by some proxy for usage (e.g. download count, or possibly the number of their reverse dependencies).
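
To make the second proxy concrete, here is a rough sketch that orders a project's reverse dependencies by how many reverse dependencies each of them has in turn. The function names and the toy graph are hypothetical and the data is held in memory; the real site would presumably do this in a database query instead.

```python
from collections import defaultdict


def reverse_dependencies(depends_on):
    """Invert a {project: set of dependencies} mapping."""
    rdeps = defaultdict(set)
    for project, deps in depends_on.items():
        for dep in deps:
            rdeps[dep].add(project)
    return rdeps


def sorted_rdeps(project, depends_on):
    """List `project`'s reverse dependencies, most depended-on first."""
    rdeps = reverse_dependencies(depends_on)
    # Descending count of each dependent's own reverse dependencies,
    # with ties broken alphabetically so the order is deterministic
    return sorted(
        rdeps.get(project, set()),
        key=lambda p: (-len(rdeps.get(p, ())), p),
    )


# Hypothetical toy graph, not real PyPI data
graph = {
    "app-a": {"lib-x", "core"},
    "app-b": {"lib-x"},
    "lib-x": {"core"},
    "lib-y": {"core"},
}
```

With this graph, `sorted_rdeps("core", graph)` puts `lib-x` first because two other projects depend on it, while `app-a` and `lib-y` have no dependents of their own.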

Add a command for dumping information about wheel errors from the database

Organize the output by processing errors vs. wheel-inspect validation errors and their respective types.

Base the command on the following scripts I've been using for ad-hoc error dumping:

errors2dir.py

from pathlib import Path
import sqlalchemy as sa
from sqlalchemy.orm import Session
from wheelodex.models import Wheel

suberrors = {
    "no unique *.dist-info/ directory in wheel": "no-dist-info",
    "headerparser.errors.": "header-error",
    "Invalid name or filename": "invalid-name",
    "Invalid wheel filename": "invalid-name",
}

DIR = Path(__file__).with_name("errors")
DIR.mkdir(exist_ok=True)
for k, v in suberrors.items():
    d = DIR / v
    d.mkdir(exist_ok=True)
    suberrors[k] = d

engine = sa.create_engine("---URL REDACTED---")
session = Session(engine)

for whl in session.scalars(sa.select(Wheel).filter(Wheel.errors.any())):
    for e in whl.errors:
        for k, d in suberrors.items():
            if k in e.errmsg:
                edir = d
                break
        else:
            edir = DIR
        with (edir / f"whl{whl.id}-{e.id}.txt").open("w") as fp:
            print("Filename:", whl.filename, file=fp)
            print("URL:", whl.url, file=fp)
            print("Uploaded:", whl.uploaded, file=fp)
            print("Wheelodex-Version:", e.wheelodex_version, file=fp)
            print(file=fp)
            print(e.errmsg, file=fp)

invalids.py

from pathlib import Path
import sqlalchemy as sa
from sqlalchemy.orm import Session
from wheelodex.models import WheelData

DIR = Path(__file__).with_name("invalid")
DIR.mkdir(exist_ok=True)

engine = sa.create_engine("---URL REDACTED---")
session = Session(engine)

for data in session.scalars(sa.select(WheelData).filter(WheelData.valid == False)):
    whl = data.wheel
    vdir = DIR / data.raw_data["validation_error"]["type"]
    vdir.mkdir(exist_ok=True)
    with open(str(vdir / f"whl{whl.id}.txt"), "w") as fp:
        print("Filename:", whl.filename, file=fp)
        print("URL:", whl.url, file=fp)
        print("Uploaded:", whl.uploaded, file=fp)
        print("Processed:", data.processed, file=fp)
        print("Size:", whl.size, file=fp)
        print("Wheel-Inspect-Version:", data.wheel_inspect_version, file=fp)
        print("Error-Type:", data.raw_data["validation_error"]["type"], file=fp)
        print("Message:", data.raw_data["validation_error"]["str"], file=fp)

Support sorting entry point lists by entry point name

Add a command for analyzing specific projects, versions, or wheels

Include an option for forcing reanalysis of already-analyzed wheels.
Include an option for fetching details about the given resources from PyPI if they're not already in the database.
- Include an option for refetching PyPI data?

Set up automatic deployment via GitHub Actions

See https://docs.github.com/en/actions/deployment/about-deployments/deploying-with-github-actions.

Deployment should be triggered whenever a GitHub release is created (which means I'll have to start creating releases for tags).

Store the Ansible Vault password and a private SSH key for [email protected] in GitHub secrets.

Support searching/filtering entry points for a group by entry point name

Add ansible-lint as a pre-commit task and/or CI job

See https://ansible.readthedocs.io/projects/lint/configuring/#pre-commit-setup on using ansible-lint with pre-commit.

Blocked by: ansible/ansible-lint#3846

Add "wheel" icon/logo

I'm thinking a blue tire-like wheel viewed from the side and tilted up slightly.

The icon should also be used as the picture for the wheelodex GitHub organization.

Add more tests

Tests for each command (process-orphan-wheels in particular)
More thorough tests for views

Restart nginx and/or uwsgi whenever PostgreSQL is updated

Currently, the deployment pins the PostgreSQL package to prevent unattended security updates, as updating PostgreSQL causes it to restart, breaking Wheelodex's database connection. This is obviously sub-optimal.

Possible resolution: Configure systemd to restart nginx and/or uwsgi whenever PostgreSQL is restarted. In addition, configure unattended upgrades to not run while Wheelodex jobs are running ~~(Use systemd's Conflicts field?)~~.

It seems the only way to prevent two systemd timer services from running at the same time without causing one of them to fail is to use the flock command.
- Problem: unattended-upgrades is run as root, and the wheelodex jobs are run as the wheelodex user, so there will likely be permission errors if they both have the same lockfile.

Look into other possible resolutions, as well.

Give `load` an option for overwriting WheelData already in the database

This would require first fixing Wheel.set_data(); see the comment in its source.

Add a "View Random Project" link to the main page

Don't select projects that don't have wheels.

Fill in descriptions for entry point groups

Wheelodex has a list of every entry point group defined by a wheel, each one linking to a list of the entry points defined for it. In order to make things less bland and to give people more information on what they're looking at, individual groups can have summaries displayed next to them in the groups list and descriptions displayed at the tops of their entry points list (example). However, this requires someone to write out summaries & descriptions for the entry point groups first, and that's where I could use some help.

If there's an entry point group you're familiar with that's lacking a description, you can add a description to it by creating a pull request modifying the wheelodex/data/entry_points.ini file. Add a section with the same name as the group (keeping the sections in lexicographic order), and give it summary and description fields whose values are CommonMark Markdown describing the group. The description should include what projects consume the entry point group, a brief idea of what defining an entry point in the group accomplishes, and (if it exists) a link to the consuming project's documentation on using the entry point group. See entry_points.ini for examples.
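
As a rough illustration of the expected shape, the sketch below parses one made-up section with Python's configparser and checks the lexicographic-order requirement. The `example.plugins` group, its wording, and the `load_group_docs()` helper are hypothetical, and disabling interpolation is an assumption about how the file is meant to be read; see the real entry_points.ini for authoritative examples.

```python
from configparser import ConfigParser

# Hypothetical entry; the group name and text are illustrative only
SAMPLE = """\
[example.plugins]
summary = Plugins for the hypothetical `example` tool
description =
    Entry points in this group register plugins for `example`.
    Each entry point should refer to a callable.
"""


def load_group_docs(text):
    """Map each group name to its summary and description."""
    # Interpolation is turned off so that "%" in Markdown is left alone
    cfg = ConfigParser(interpolation=None)
    cfg.read_string(text)
    names = cfg.sections()
    if names != sorted(names):
        raise ValueError("sections are not in lexicographic order")
    return {
        name: {
            "summary": cfg[name].get("summary", "").strip(),
            "description": cfg[name].get("description", "").strip(),
        }
        for name in names
    }


docs = load_group_docs(SAMPLE)
```

Indented continuation lines, as in the `description` field here, are how configparser represents multi-line values.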

Is the full dependency dataset available?

I am looking to analyse the dependency graph between Python packages. Does Wheelodex make this data available as a single download?

Missing Documentation on how to build & execute

Really like what you built. Amazing idea.
Though it's really hard to be able to help if you don't provide a simple way to start running the service on a local machine / server.

If I may, a docker compose recipe would probably be a good start :)

Support dumping & loading error information

Give dump an option for including wheel processing error details in the dumped structures, and make load support loading this information.

Add an admin interface for viewing & deleting errors

Honor packaging yanking

Naïvely, if a release or asset is yanked, it should be deleted from the database — but what if the latest release (or all its wheels) is yanked, the previous release has already been purged, and the project doesn't make another release for some time? (This is similar to #32.)

Log/notify when web server encounters an error

Possible setup: See "Sending errors via email" at https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-vii-unit-testing-legacy.

Update `wheel_sort_key()` for more modern wheel tags

The wheels provided by pydantic-core should be a good source of modern tags.

Incomplete list of new tag elements to sort:

Platforms:
- macosx_11_0_{arch}
- manylinux_2_5_{arch}
- musllinux*
Architectures:
- aarch64
- arm64
- ppc64le
- s390x
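
One possible shape for the platform part of the key is sketched below. It is not the existing `wheel_sort_key()`; the `platform_sort_key()` name, the regular expression, and especially the preference orders for families and architectures are assumptions made up for illustration. The main point is that version components are compared as integers, so `manylinux_2_5` sorts before `manylinux_2_17`.

```python
import re

# Assumed preference orders, for illustration only
FAMILY_ORDER = ["manylinux", "musllinux", "macosx"]
ARCH_ORDER = ["x86_64", "i686", "aarch64", "arm64", "ppc64le", "s390x"]

# Matches tags of the form {family}_{major}_{minor}_{arch}
PLATFORM_RGX = re.compile(r"(manylinux|musllinux|macosx)_(\d+)_(\d+)_(\w+)")


def platform_sort_key(platform):
    """Sort key for a single platform tag."""
    m = PLATFORM_RGX.fullmatch(platform)
    if m is None:
        # Unrecognized tags (e.g. legacy or Windows ones) sort last,
        # alphabetically among themselves
        return (len(FAMILY_ORDER), 0, 0, len(ARCH_ORDER), platform)
    family, major, minor, arch = m.groups()
    if arch in ARCH_ORDER:
        arch_rank = ARCH_ORDER.index(arch)
    else:
        arch_rank = len(ARCH_ORDER)
    return (FAMILY_ORDER.index(family), int(major), int(minor), arch_rank, arch)


tags = [
    "musllinux_1_1_aarch64",
    "macosx_11_0_arm64",
    "manylinux_2_17_x86_64",
    "manylinux_2_5_x86_64",
    "win_amd64",
]
ordered = sorted(tags, key=platform_sort_key)
```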

Give the Ansible playbook an option for skipping the database backup

For use when repeatedly running the playbook again & again due to some stupid mistake that ansible-lint couldn't find.

`load-entry-points`: Support deleting entries from the database

Give load-entry-points an option for deleting the summaries & descriptions of entry point groups not listed in the input file

Distinguish extra-only project dependencies

Project dependencies that are only required for extras should be distinguished from non-extra dependencies somehow.

Ideas:

Make "Most Depended-On Projects" display two leaderboards, one taking extras into account and one not.
In addition to projects' "Reverse Dependencies" counts & lists, add a "Reverse Dependencies (No Extras)" (Working title) count and list
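
A rough sketch of how the distinction could be computed from Requires-Dist values follows. It uses a simple regular expression to spot `extra == "..."` in the environment marker instead of a real marker parser, so it will misjudge unusual markers and direct-URL requirements, and it does not normalize project names. The function names are hypothetical.

```python
import re

# Finds `extra == "name"` (or single-quoted) inside an environment marker
EXTRA_RGX = re.compile(r"""\bextra\s*==\s*['"]([^'"]+)['"]""")
NAME_RGX = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def split_requirement(req):
    """Return (project name, list of extras that trigger the requirement)."""
    spec, _, marker = req.partition(";")
    name = NAME_RGX.match(spec.strip()).group(0)
    return name, EXTRA_RGX.findall(marker)


def partition_dependencies(requires_dist):
    """Split dependency names into (always required, extra-only)."""
    required, extra_only = set(), set()
    for req in requires_dist:
        name, extras = split_requirement(req)
        (extra_only if extras else required).add(name)
    # A project needed both with and without extras is not extra-only
    return required, extra_only - required


required, extra_only = partition_dependencies(
    [
        "requests>=2.0",
        'pytest; extra == "test"',
        "requests[socks]; extra == 'socks'",
        'colorama; sys_platform == "win32"',
    ]
)
```

Here `requests` counts as a regular dependency even though it also appears under an extra, and only `pytest` ends up extra-only.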

What data do people want to see?

Wheels have a bunch of data in them, but Wheelodex isn't quite taking advantage of all of it — yet. What information do people want to be able to search, browse, or click on?

Browsing by keywords and/or licenses, similar to PyDigger?
Browsing by classifiers? (or does PyPI already cover that well enough?)
Browsing Platform fields?
Project-URL labels?
Statistics on wheel Generator fields?
Searching by namespace packages?
Listing which projects have the most reverse dependencies?
Other metadata?

I'll probably get around to implementing most of these eventually, but I'd like some feedback on what people want first.

Make Wheel's `md5` and `sha256` fields no longer nullable

It seems that PyPI's JSON API now always fills in the relevant digests, and there are no wheels in the database with missing digests. The nullability thus no longer serves a purpose.

Be sure to change all md5 and sha256 fields & arguments in the code to non-None.

Also remove the display of null digests as "[Unknown]" in the wheel_data template.

Better handling for deletion of a project's latest version

The deletion of a version from PyPI may leave Wheelodex with no wheels registered for the project even though there may be earlier versions on PyPI with wheels. Try to keep this from happening.

Update project display names more often

Should a project's display name be updated whenever it gets a new release or wheel?

Increase number of trailing log lines sent in service error e-mails

100 lines should do it.

Try to speed up file searches with an index

Try to speed up file search queries with:

CREATE EXTENSION pg_trgm;
    -- ^^ Must be run inside the database by a superuser
CREATE INDEX files_path_idx ON files USING GIN (path gin_trgm_ops);

Do likewise for other columns queried with LIKE/ILIKE?

Note: The SQLAlchemy equivalent of the CREATE INDEX statement appears to be:

sa.Index(
    "files_path_idx",
    File.path,
    postgresql_using="gin",
    postgresql_ops={"path": "gin_trgm_ops"},
)

Prevent the `register-wheels` and `process-wheels` services from running at the same time

Use [Unit]Conflicts=?

Set up scheduled database backups

And store them somewhere other than the wheelodex.org server.

Monitor slow PostgreSQL queries

PostgreSQL is currently configured to log queries that take more than one second, but I'm not paying attention to the logs. The logs should be shipped to an ELK stack, Papertrail.com, Datadog, or whatever hip sysadmins are using these days.

Support browsing classifiers

In the METADATA displays, make each classifier into a hyperlink to a page listing all projects with that classifier.
Add a page listing all classifiers (linking to pages listing matching projects), alongside matching project counts.

Make scan-pypi smarter about projects whose latest versions don't have wheels

If the latest version of a project doesn't have any wheels, should scan_pypi() instead register the latest version that does?
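
A minimal sketch of the fallback being considered is shown below. It is not the actual scan_pypi() code: the `version_to_register()` name and the input format (a mapping from version string to that release's filenames, assumed to be already ordered from oldest to newest) are made up for illustration.

```python
def version_to_register(releases):
    """
    Return the newest version that has at least one wheel, or None.

    `releases` maps version strings to lists of filenames and is assumed
    to be ordered from oldest to newest already.
    """
    for version in reversed(list(releases)):
        if any(fname.endswith(".whl") for fname in releases[version]):
            return version
    return None


# Hypothetical project whose latest release is sdist-only
releases = {
    "1.0": ["foo-1.0-py3-none-any.whl", "foo-1.0.tar.gz"],
    "1.1": ["foo-1.1.tar.gz"],
}
chosen = version_to_register(releases)
```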

Add footer to pages

Show the current Wheelodex version (and a link to Wheelodex's GitHub repo?) at the bottom of every page.

Support searching/filtering entry point groups

The "Browse Entry Point Groups" page should gain a search box for filtering down to just groups whose names match a given glob pattern. If only one group matches the given pattern, redirect to that group's page.
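
The intended behavior can be sketched with the standard library's fnmatch module. The function names and the ("redirect", ...) / ("list", ...) return values are placeholders for whatever the view would really do, and a real implementation would more likely translate the glob into a database query than filter in Python.

```python
from fnmatch import fnmatchcase


def search_groups(pattern, groups):
    """Return the group names matching a glob pattern, sorted."""
    return sorted(g for g in groups if fnmatchcase(g, pattern))


def search_result(pattern, groups):
    """Decide between redirecting to one group and listing the matches."""
    matches = search_groups(pattern, groups)
    if len(matches) == 1:
        # Exactly one match: go straight to that group's page
        return ("redirect", matches[0])
    return ("list", matches)


groups = ["console_scripts", "gui_scripts", "pytest11", "flake8.extension"]
```

For example, `search_result("*_scripts", groups)` lists two groups, while `search_result("pytest*", groups)` redirects to `pytest11`.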

Use a systemd slice to limit the resource usage of Wheelodex jobs

See https://medium.com/horrible-hacks/using-systemd-as-a-better-cron-a4023eea996d.

To look into: What would happen if a job reached its resource limit? Would this ever cause it to fail or be killed?

Make things look good

Wheelodex is, at time of launch, severely lacking in the aesthetics department. Some sort of styling by someone with a vague grasp of web design and UX could make quite the difference.

Some specific areas in need of beautification include:

The list of wheels for a project can fill almost the entire page (example); some sort of collapsible display would be nice.
Header field names in METADATA and WHEEL files often get wrapped at hyphens; might want to keep that from happening
File paths in RECORD files often get wrapped at slashes; might want to keep that from happening
The search boxes on the main page should really be aligned. Not sure how to do that ...
Putting all the page content in a 500pt-wide box is probably a stupid thing to do
The "recently analyzed wheels" table gets stretched out too much by wheels with long names (e.g., just about anything for Mac OS X)

Show (more) processing error details in the web and/or JSON APIs

Make it easier to view reverse dependencies of projects without wheels

Because projects without wheels are excluded from search and the "Browse Projects" list, it can be difficult to reach their pages in order to view their reverse dependencies (which is the only thing their pages offer other than a link to PyPI). At the moment, the only way to get to such a page is either by manipulating the URL or by clicking a link on the page of a project that depends on them or in the reverse dependencies list of a project that they depend on.

Idea: Add a checkbox next to the project search input for also searching projects without wheels.

Support browsing keywords

In the METADATA displays, make each keyword into a hyperlink to a page listing all projects with that keyword.
Add a "Browse Keywords" page listing keywords and their counts, with a search box for filtering by a given glob pattern.
See https://stackoverflow.com/q/18228994/744178 for how to keep a table of keywords and their counts up to date.
Should the database normalize all keywords to lower case?
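
To show what lower-case normalization would mean for the counts, here is a small in-memory sketch. The helper names are hypothetical, and splitting on commas when one is present and on whitespace otherwise is an assumption about how Keywords fields look in practice, not a rule taken from Wheelodex.

```python
from collections import Counter


def split_keywords(field):
    """Split a Keywords field and normalize each keyword to lower case."""
    # Assumption: comma-separated if any comma is present, else whitespace
    sep = "," if "," in field else None
    return [kw.strip().lower() for kw in field.split(sep) if kw.strip()]


def keyword_counts(fields):
    """Count how many projects use each normalized keyword."""
    counts = Counter()
    for field in fields:
        # A set, so a keyword repeated within one project counts once
        counts.update(set(split_keywords(field)))
    return counts


counts = keyword_counts(["Wheel, Packaging", "wheel pypi", "wheel,wheel"])
```

With normalization, "Wheel" and "wheel" land on the same keyword page; without it, they would be listed and counted separately.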

Set up monitoring of website health & performance

Monitor whether the site is up and not returning errors.
Monitor page load times.
Possible resources:
- Sentry.io
- Statuscake
- https://github.com/lebinh/ngxtop
- https://github.com/muatik/flask-profiler
- https://news.ycombinator.com/item?id=18293434