
admin-portal's Introduction

Green Web Foundation API

In this repo you can find the source code for the API and the checking code that the Green Web Foundation servers use to check what kind of power a domain uses.


Overview

Following Simon Brown's C4 model, this repo includes the API server code, along with the green check worker code in packages/greencheck.


Apps - API Server at api.thegreenwebfoundation.org

This repository contains the code served to you when you visit http://api.thegreenwebfoundation.org.

When requests come in, Symfony accepts and validates the request, and creates a job for enqueue to service with a worker.

API

The greenweb API application runs on https://api.thegreenwebfoundation.org

It provides a backend for the browser extensions and the website at https://www.thegreenwebfoundation.org

This needs:

  • an enqueue adapter, like fs for development or amqp for production
  • PHP 7.3
  • nginx
  • Redis, for the greencheck library
  • Ansible and SSH access to the server, for deploys

Currently runs on Symfony 5.x

To start development:

  • Clone the monorepo: git clone git@github.com:thegreenwebfoundation/thegreenwebfoundation.git
  • Configure .env.local (copy from .env) for a local MySQL database
  • composer install
  • bin/console server:run
  • Check the fixtures in packages/greencheck/src/TGWF/Fixtures to set up a fixture database

To deploy:

  • bin/deploy

To test locally:

Packages - Greencheck

In packages/greencheck is the library used for carrying out checks against the Green Web Foundation database. Workers take jobs from a RabbitMQ queue and call the greencheck code to return a result quickly, passing the result back, RPC-style, to the original calling code in the Symfony API server.


Packages - public suffix

In packages/publicsuffix is a library that provides helpers for retrieving the public suffix of a domain name, based on the Mozilla Public Suffix List. It is used by the API server.

admin-portal's People

Contributors

arendjantetteroo, br0ken-, denning, dependabot[bot], eharris128, fershad, hanopcan, jonathan-s, mrchrisadams, rgieseke, roald-teunissen, tortila


admin-portal's Issues

Switch from the homecooked Ipaddressfield to GenericIPAddressField

It'd be nice to switch from our homecooked field to the generic IP address field that Django has built in. The difference between the two is that, to stay compatible with the old code, the data currently needs to be stored as a Decimal field in MySQL.

The generic IP address field stores it as a 39-character CharField.
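A minimal sketch of what the switch might look like, assuming a model along the lines of the green IP table; the model and field names here are illustrative:

from django.db import models

class GreencheckIp(models.Model):
    # The homecooked field stores each address as a DECIMAL column in
    # MySQL, for compatibility with the old PHP code. Django's built-in
    # replacement stores it as a 39-character string instead - long
    # enough for a fully expanded IPv6 address.
    ip_start = models.GenericIPAddressField()
    ip_end = models.GenericIPAddressField()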

Switch to new admin checklist

  • Double check that permissions are correct for each group. Hostingproviders should only see certain things.
  • Check that mail works correctly.
  • Once the domain has been changed, send out an email saying they should reset their password.
  • Anything else?

Make the import from CSV feature resolve full paths

When I used the import_from_csv management command to update a hosting provider's set of IP ranges, I had to provide the full path, rather than a relative one.

Here's an abridged error trace:

    with open(path, "r+") as csvfile:
FileNotFoundError: [Errno 2] No such file or directory: '../shared/hosting-provider-ips.csv'

It would be better to have this resolve the full path; even as the author of the code, I expected to be able to pass in a relative path.
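A minimal sketch of the fix, assuming the command receives the path as a string; resolve_csv_path is a hypothetical helper:

from pathlib import Path

def resolve_csv_path(path: str) -> Path:
    # Expand "~" and resolve relative segments against the current
    # working directory, so relative and absolute paths both work.
    resolved = Path(path).expanduser().resolve()
    if not resolved.exists():
        raise FileNotFoundError(f"No CSV file found at {resolved}")
    return resolved

with open(resolve_csv_path("../shared/hosting-provider-ips.csv")) as csvfile:
    ...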

Extend tooling to support correspondence with hosters

  • Extending the email form (on the same page - we'd likely have a separate screen)
  • allow user-generated templates
  • show a preview before sending
  • decide how we capture key info from correspondence (if we can keep this out of the app, it means we don't need to spend time building tools for managing people's rights under the GDPR)

The current full list of urls for exports has many invalid hostnames

After working on updating the greenweb data export, I've realised that we have a lot of invalid hostnames.

This is bad for a few reasons, not least for our own releases of data, like richer versions of the url2green dataset.

Ideally we'd have a decent whitelist of urls that are meaningful.

I think we could do this by doing something like:

  • make a list of ALL the urls
  • run this list against a checker so that only valid hostnames are returned (see the sketch below)
  • maintain this list
  • add the most recent info for each url in this subset
  • add it to the export management task that Jonathan created before
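A minimal sketch of such a checker, assuming the hostnames arrive as plain strings; the regex follows the usual RFC 1123 label rules:

import re

# Each label: 1-63 chars, letters/digits/hyphens, no leading or trailing hyphen.
LABEL = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$", re.IGNORECASE)

def is_valid_hostname(hostname: str) -> bool:
    if len(hostname) > 253:  # maximum length of a full hostname
        return False
    labels = hostname.rstrip(".").split(".")
    return len(labels) > 1 and all(LABEL.match(label) for label in labels)

urls = ["example.com", "-bad-.example", "not a hostname"]
print([url for url in urls if is_valid_hostname(url)])  # ['example.com']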

Remove username_canonical and email_canonical

Once we've completely left the old admin, these two fields can be removed. Django handles this by storing the email and username fields in lowercase from the beginning, so these are no longer needed.

Add support for cachebusting option for greencheck API

We currently have four layers of caching:

  • we have multi tier CDN caching with Cloudflare
  • then caching with nginx
  • then caching with Redis
  • and then MySQL’s internal caching mechanism

This has helped us handle the uptick in traffic we've seen, but it also means that we can end up serving stale values, which is confusing for users, as well as being straight-up inaccurate.

We have a greencheck API inside admin-portal now, so this adds a place we can update the caching behaviour.

If we add a param like nocache=true (the name is already used with nginx to skip the cache there), we could trigger a purge for a specific cache key through all the layers we have direct control over.

This would mean that all subsequent lookups would serve the fresh value, over any API we use to serve a lookup.
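A minimal sketch of how the param could behave, assuming a Django view backed by a Redis cache; the view and the run_greencheck function are illustrative, not the actual implementation:

from django.core.cache import cache
from django.http import JsonResponse

def greencheck(request, domain):
    cache_key = f"greencheck:{domain}"

    # nocache=true skips the read AND purges the stale entry, so all
    # subsequent lookups through the layers we control see a fresh value.
    if request.GET.get("nocache") == "true":
        cache.delete(cache_key)
        result = None
    else:
        result = cache.get(cache_key)

    if result is None:
        result = run_greencheck(domain)  # hypothetical lookup function
        cache.set(cache_key, result, timeout=3600)

    return JsonResponse(result)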

Richer info than just Green/Grey - sitespeed coaching info

note: I'm not sure where this should fit yet. It might be better in a separate tool.

But if you have a HAR file, then you can run coach against it, without needing to run a full-fat browser.

This is good for examining a page, and getting some useful ratings and metrics from it.

If we're providing better tooling for checking a page anyway, this would be très cool.


More in the coach docs:
https://www.sitespeed.io/documentation/coach/how-to/

How it might work

If we have a given url, we can also check against the HTTP Archive data and download their helpful set of HAR files, instead of using browsers.
https://discuss.httparchive.org/t/how-to-download-the-http-archive-data/679

We can then fetch them from the buckets they reside in using queries like so:
https://cloud.google.com/bigquery/docs/exporting-data#exportingmultiple

Then run coach against the HAR files.

Why not just hit the url we are given?

We can do that, but because the data is already collected, we can do a bulk job where we prepopulate a dataset with it.

Updates for the hosting pages

I'm adding the notes here from the screen-sharing session with Rene and Shreya. These seem like the most useful things to make it easier to update the hosting data.

  • list the user that belongs to this hosting organisation on the same page, and link to them
  • allow comments on the hosting provider page - so we know when they've been updated, or need someone to verify the data
  • list when requests to add IP ranges or ASNs were made, and approved
  • think about how to show data about the ASN on the hosting page (perhaps PeeringDB) - see #74
  • add an editable preview for contacting the hosting provider

Creating a new IP gives little feedback for the user

When creating a new IP you'll see:

The hostingprovider "your host" was changed successfully. You may edit it again below.

But when you check the IP address table you'll see no new IP addresses; you'll have to click IP approvals, which is hidden, to see any change. IP approvals should probably not be hidden, and it could be a good idea to add a message saying that your IP now awaits approval.

Add richer data for tryout

Right now when you do a check in the internal admin system you get back a basic yes/no answer, like so:

(screenshot)

This isn't much use for troubleshooting, as it only shows what the API returns.

Ideally we'd show a more detailed view that also skips any caching we have, the same way that the previous admin did:

(screenshot: Greencheck Admin area - tryout - The Green Web Foundation Administration)

How to add a better tryout

There is a useful library, ipwhois, that provides this kind of info with a nice API.

Check this usage example:

>>> from ipwhois import IPWhois
>>> from pprint import pprint

>>> obj = IPWhois('74.125.225.229')
>>> results = obj.lookup_rdap(depth=1)
>>> pprint(results)

If we rendered this info in a template, along with the current lookup when checking the database, we'd have a much more useful system.

https://pypi.org/project/ipwhois/

Having both status and action columns is redundant

Having both the action and status columns is redundant; they track the same thing. For a later change, my suggestion would be to keep only the status column.

For example: a new IP address comes in and the status is set to new. Someone approves it, and the status is set to approved. Alternatively, the status could be set to rejected.

Writing tests for the project

It would be ideal to write tests for the project, for the following reasons:

  • It makes it easier to see any assumptions behind the code.
  • If changes are made you can easily check that the tested code works on prior assumptions.
  • When a new contributor contributes code, the reviewer gets increased confidence that the change won't break anything, because of the tests.

What we need to do:

  • Set up Travis so that whenever a PR is opened, the tests run automatically.
  • Write tests for all important paths currently exercised in admin and other views.
  • Include coverage, so that we can clearly see which parts of the code are being tested. The aim should be to cover ~80% of the code.
  • Start using https://codecov.io/ so any new PR doesn't reduce coverage.
  • Add testing factories for the models that we're using and use these as fixtures.

We estimate this would take at most 3 days of work.
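As a sketch of the factories point, assuming factory_boy and the Hostingprovider model from apps.accounts; the field names are illustrative:

import factory

from apps.accounts.models import Hostingprovider  # assumed import path

class HostingproviderFactory(factory.django.DjangoModelFactory):
    class Meta:
        model = Hostingprovider

    name = factory.Sequence(lambda n: f"Green Host {n}")
    website = "https://example.com"

def test_hostingprovider_name():
    # build() creates an unsaved instance, so this test needs no database.
    provider = HostingproviderFactory.build()
    assert provider.name.startswith("Green Host")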

Add check for updated datasets to be available with public-read permissions

We had a request asking about the datasets at the green_urls endpoint:

https://admin.thegreenwebfoundation.org/green-urls

At present the uploaded objects don't have public read permissions - oops!

How to fix it:

This is the code where we upload to an AWS S3-compatible bucket:

class GreenDomainExporter:
    """
    Exports a snapshot of the green domains, to use for bulk exports
    or making available for analysis and browsing via Datasette.
    """

    # snip  
    def upload_file(self, file_path: str, file_destination: str):
        subprocess.run(["aws", "s3", "cp", file_path, file_destination])

We need to explicitly mark this by adding --acl public-read to the CLI command, or do it programmatically with boto3.

https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#object
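A minimal sketch of the CLI-flag version of the fix, keeping the existing subprocess approach and just adding the ACL:

import subprocess

class GreenDomainExporter:
    # snip
    def upload_file(self, file_path: str, file_destination: str):
        # --acl public-read marks the uploaded object as world-readable,
        # so the files linked from the green-urls page can be downloaded.
        subprocess.run(
            ["aws", "s3", "cp", file_path, file_destination, "--acl", "public-read"],
            check=True,  # fail loudly if the upload doesn't succeed
        )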

Styleguide for writing python

In the last few pull requests, I've seen a number of ways of formatting strings, from f-strings, to str.format, to the older way of concatenating strings like 'somestring' + some_var.

Can we discuss agreeing a consistent style for writing, so we only use one approach?

I like Trey Hunner's one as an example:

https://github.com/TruthfulTechnology/style-guide

Also, for Django, I think Octopus Energy has a good one.

https://github.com/octoenergy/conventions

The value here is largely having some explicit things to check against, to make reviewing PRs easier.

@jonathan-s can you add any you think would also be good candidates please?

Parameterise the domain backfill procedure for our green domains list

After a call with @jonathan-s, we agreed that being able to pass in a date for the current backfill procedure would make it easier to test, and use in production.

So rather than simply having us call:

CALL backfill()

We'd be able to do:

CALL backfill("2019-12-01")

And have this only backfill domains added to the greencheck table since that date.

This lets us make snapshots more quickly, and would bring the expected time for a backfill run down from 3.5 days for the last ten years to a few minutes, if we only care about the last month or so.
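A minimal sketch of calling the parameterised procedure from Django, for example in a management command; run_backfill is a hypothetical wrapper:

from django.db import connection

def run_backfill(since_date=None):
    with connection.cursor() as cursor:
        if since_date:
            # Only backfill domains added since this date, e.g. "2019-12-01".
            cursor.callproc("backfill", [since_date])
        else:
            cursor.callproc("backfill")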

New users should be added to groups automatically

We've got two groups that currently give the same permissions: hostingproviders and datacenter. When a new user signs up, they should automatically be added to the hostingprovider group. If they are not added, they won't be able to make any changes at all.

Allow only 1 globally unique AS number to be active

Our queries expect that only one AS number is returned if you do a lookup on an AS number. This morning we saw that we had two active entries for the same AS number.

We should either fix the query to allow multiple results and pick the first one, or have the admin system enforce a uniqueness constraint on the AS number/active flag combination (see the sketch below).
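A minimal sketch of the constraint route, assuming a model with asn and active fields. Note that conditional unique constraints need database support - Postgres has it, while on MySQL this would have to be enforced in validation instead:

from django.db import models
from django.db.models import Q

class GreencheckASN(models.Model):  # illustrative model name
    asn = models.IntegerField()
    active = models.BooleanField(default=False)

    class Meta:
        constraints = [
            # Only one *active* row per AS number; inactive duplicates stay legal.
            models.UniqueConstraint(
                fields=["asn"],
                condition=Q(active=True),
                name="unique_active_asn",
            )
        ]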

Allow filtering by dates, and show them in the admin UI

Right now, you can't see when a hoster org was created, and we don't have an easy way of filtering by this.

Django admin has support for this, and adding it will let users focus on a subset of data to verify when checking. A sketch follows the list below.

  • allow filtering by year of creation
  • show the hoster creation date in the listing if we have it
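A minimal sketch, assuming the Hostingprovider model has a created datetime field; the admin class name is illustrative:

from django.contrib import admin

from apps.accounts.models import Hostingprovider  # assumed import path

@admin.register(Hostingprovider)
class HostingproviderAdmin(admin.ModelAdmin):
    list_display = ("name", "created")  # show the creation date in the listing
    list_filter = ("created",)          # sidebar filter, including by year
    date_hierarchy = "created"          # drill down by year, month, then day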

Partner: None

This shows up for a user that registers a new hostingprovider. Ideally it should display something different here, like "not a partner", or perhaps even a link to "How to become a partner" or something similar.

Dump table that contains all green urls & upload to object storage

As #33 is now merged and we've got the procedures in place, we can now create the SQLite database from the green_presenting table. We can use https://github.com/simonw/db-to-sqlite, which creates a SQLite file from a table without too much fuss (see the sketch below).

Once we've got the SQLite file, we make sure that this file is uploaded to the object storage.

The last step is to present all the uploaded objects to the user. We can do that by creating a page in admin that is accessible without login. This page would query the object storage and display the 10 most recently uploaded SQLite files.
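A minimal sketch of the first two steps, assuming db-to-sqlite is installed with its MySQL extra; the connection string is a placeholder:

import subprocess

db_url = "mysql://user:password@localhost/greencheck"  # placeholder credentials

# Copy just the green_presenting table into a standalone SQLite file.
subprocess.run(
    ["db-to-sqlite", db_url, "green_urls.db", "--table=green_presenting"],
    check=True,
)

# The result can then be pushed to object storage with the existing
# exporter upload step.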

Add datacentre from the hosting provider edit form, not the other way around

With the old system, when you sign in you're immediately sent to the hosting provider that represents your org.

(screenshot)

Once you're there, you then have the option of listing a datacentre, or adding an existing one. If you add one, it looks like the screenshots below, and the link from your hosting org to the datacentre is implicit - you don't need to say which hosting org it's linked to yourself.

Adding an existing DC:

(screenshot)

Adding a new DC:
(screenshot)

This is different to the current setup in admin-portal using Django admin, as the widget currently makes it easy to add other hosting providers, not just your own. We've seen a few users get confused and add the wrong hosting orgs, meaning we then need to fix it, so we'll need a different way to do this.

One approach might be adjusting Django admin to follow the previous workflow closely, but I really think it's worth trying to do this without the Django admin, as making changes to the Django admin will get harder over time.

Sanity check on migrating tables referring to IP addresses, that are used by the API

Hey @Arend-Jan Tetteroo, I was running through the deployment steps to deploy the new admin, and I realised there are some changes to the database where I don't know if they might cause problems for the Symfony app.

Can you please look over the SQL statements below here and give a heads up if you see anything that might cause the production symfony app to choke?

https://github.com/thegreenwebfoundation/greenwebfoundation-admin/blob/master/apps/accounts/management/commands/sql_migrations.py

I'd like to run the migration, but some of the migrations touching ipaddress columns look like they might cause problems.

Add a way to run the following queries on the database:

We need to be able to make these queries and make them accessible via an API, re-run daily on a cronjob (a sketch of the first two follows the list):

  • (1) number of checks past 24 hours
  • (2) number of those checks labeled as green
  • (3) number of checks per TLD
  • (4) number of checks green per TLD and
  • (5) number of checks for each listed green hoster (link to ID)
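A minimal sketch of queries (1) and (2) with the Django ORM, assuming a Greencheck model with date_checked and green fields; the names are illustrative:

from datetime import timedelta

from django.utils import timezone

from apps.greencheck.models import Greencheck  # assumed import path

def daily_check_stats():
    since = timezone.now() - timedelta(hours=24)
    checks = Greencheck.objects.filter(date_checked__gte=since)
    return {
        "checks_past_24_hours": checks.count(),              # (1)
        "green_checks": checks.filter(green="yes").count(),  # (2)
    }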

Add release tracking in sentry

I think that, generally speaking, only unexpected exceptions should make it to Sentry. If we know an exception is occurring and we can't fix it right away, we really should have some graceful recovery and account for it in our code, by capturing the correct kind of Exception, so we can identify it better.

Sentry has some useful tools to help us see if releases introduce or resolve these, and when they do, what percentage of users are affected.

More here:

https://docs.sentry.io/workflow/releases/
https://docs.sentry.io/workflow/releases/health/#getting-started

Show the size of an IP range when submitting or approving, to avoid accidentally approving millions of IPs

We currently have an interface which allows people to add the start and end point of an IP range.

One problem with this approach is that it's easy to accidentally submit an IP range that's huge.

Another problem is that it's also easy to accidentally approve an IP range request, because there is no indication of how large it is.

Here's the UI right now:
(screenshot)

Python's batteries include a nice library that will show this. There are a bajillion npm modules of varying quality that might do the same client-side too.
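That library is the stdlib ipaddress module; a minimal sketch of showing the size of a submitted range:

import ipaddress

def range_size(ip_start: str, ip_end: str) -> int:
    # ip_address objects convert cleanly to integers, so the size of the
    # range is simple arithmetic.
    return int(ipaddress.ip_address(ip_end)) - int(ipaddress.ip_address(ip_start)) + 1

print(range_size("104.21.2.0", "104.21.2.255"))    # 256
print(range_size("104.0.0.0", "104.255.255.255"))  # 16777216 - worth a warning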

Check a given hoster's provided ASN against PeeringDB on the admin pages

When checking the quality of data given by a hoster, it's useful to see if there is existing information for the ASN that is associated with them.

We do this manually a lot when checking, so being able to see how that information matches up against PeeringDB would be helpful. Ideally, if we see there's enough overlap, we might find ways to integrate both ways.

We'd ideally show this in the admin page, on a hoster page, to supplement the info we have.

How we can do this

If we only have an IP address

If we have an IP address, we can use the same logic as in the greencheck-api to work out the likely[1] ASN for it.

The logic is outlined on the Cymru IP-to-ASN docs page, but the TL;DR version is that if we have an IP address like 85.17.184.227 (our server), we can look up the ASN with a query like the one below, where we have reversed the order of the IP address octets:

dig +short 227.184.17.85.origin.asn.cymru.com TXT 

That DNS lookup will return something like the following info:

60781 | 85.17.0.0/16 | NL | ripencc | 2005-03-11
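A minimal sketch of the same lookup from Python, assuming the dnspython package; it only handles IPv4 addresses:

import dns.resolver

def asn_for_ip(ip: str) -> str:
    # Reverse the octets, as the Cymru service expects.
    reversed_octets = ".".join(reversed(ip.split(".")))
    answer = dns.resolver.resolve(f"{reversed_octets}.origin.asn.cymru.com", "TXT")
    # e.g. '60781 | 85.17.0.0/16 | NL | ripencc | 2005-03-11'
    return answer[0].to_text().strip('"')

print(asn_for_ip("85.17.184.227").split("|")[0].strip())  # 60781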

If we have the ASN

If we have the ASN already recorded, we can show info stored in PeeringDB by querying their API. This will return information about the org associated with that ASN, a bit like our notion of a hoster.

The API call for returning the org looks a bit like this:

https://peeringdb.com/api/org?asn=ASN_NUMBER.

So, for us we'd do:
https://peeringdb.com/api/org?asn=60781

And we'll get back data that looks like so:

{
  "data": [
    {
      "id": 14270,
      "name": "LeaseWeb Netherlands B.V.",
      "website": "",
      "notes": "",
      "address1": "",
      "address2": "",
      "city": "",
      "country": "",
      "state": "",
      "zipcode": "",
      "created": "2016-05-27T20:08:40Z",
      "updated": "2016-05-27T20:08:40Z",
      "status": "ok"
    }
  ],
  "meta": {}
}

If we look up an org with that ID, we'll end up with a page listing all the ASNs associated with the org, and sometimes info about the physical addresses too. The link below shows an example of the info already present in PeeringDB - sometimes it's good, sometimes it isn't.

https://www.peeringdb.com/org/14270

On those pages, you can see other ASNs associated with a given org. This can help sanity-check an ASN, if it's been a while since it was last updated.

https://peeringdb.com/api/net?org_id=14270
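A minimal sketch of pulling the org record for a hoster's ASN from the endpoints above, assuming the requests package:

import requests

def peeringdb_org_for_asn(asn: int) -> dict:
    response = requests.get("https://peeringdb.com/api/org", params={"asn": asn})
    response.raise_for_status()
    data = response.json()["data"]
    return data[0] if data else {}

org = peeringdb_org_for_asn(60781)
print(org.get("name"))  # "LeaseWeb Netherlands B.V."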

The deck linked below gives a bunch of useful information about the PeeringDB, and how we might query it to get info in future:

https://docs.peeringdb.com/presentation/20200220-1-2-GUI-API-APRICOT2020-Arnold-Nipper.pdf

More on PeeringDB:

The full list of API endpoints, with sample responses: https://www.peeringdb.com/apidocs/
Their website: https://www.peeringdb.com/
API docs site - https://docs.peeringdb.com/

[1] - likely, because if a domain is behind Cloudflare or a similar CDN, they'll increasingly resort to using something like anycast to make the same IP address point to different infrastructure, depending on where the initial device was. KeyCDN's info about anycast is quite good, along with the linked background from Serverius too.

Remove meerjahren plan from db

This field is only specific to the Netherlands and doesn't serve much use any longer, so it can be removed from the database.

500 triggered when I try to update my hosting provider

On production, when I try to update hosting provider details for an existing account, I get a 500.

(screenshot)

I've shared the stack trace, but it should also be visible in Sentry:

https://sentry.io/share/issue/f9d309a6e4804c258805ae3d0cfab221/

I think I can reproduce this by creating a new hosting provider for a website I own.

Try signing in and creating an IP entry for your own website. I think it's triggered when accessing the change view for a given hosting provider.

Steps to reproduce

  • sign in as user
  • hit create for hosting provider
  • fill in details
  • hit submit
  • 😞

Password not setting on new registrations

Oh jeez, since the migration to MariaDB, it appears the handling of nulls is different too.

I'm making notes here as I investigate.

Signing up with a new user triggers this exception:

Field 'password' doesn't have a default value

This is new behaviour since the migration, and it seems to be related to the earlier work to allow logins using both the old PHP admin app AND the new Django admin one.

My current theory for what's happening:

In our user model, we have this line here:

# password already provided in abstract base user.
password = models.CharField('password', max_length=128, db_column='django_password')

It's writing to the django_password column, and that column needs to be NOT NULL.

Looking at the database on working MySQL and Postgres installs confirms that NOT NULL is correct for me, as does the underlying abstract class which our user model inherits from.

My guess right now is that perhaps the registration form we use is not using the correct user model, or we're saving one password column but not the other.

[Discussion] Compress datasets with bzip2

Original size

$ du -sh green_urls_2021-02-19.db
121M	green_urls_2021-02-19.db

gzip

$ gzip --force --best green_urls_2021-02-19.db
$ du -sh green_urls_2021-02-19.db.gz
37M	green_urls_2021-02-19.db.gz

bzip2

$ bzip2 --force --compress green_urls_2021-02-19.db
$ du -sh green_urls_2021-02-19.db.bz2
27M	green_urls_2021-02-19.db.bz2

Result

The resulting archive is ~10 MB smaller.

Thoughts

I'm concerned about the existing solutions (like my own) built around the currently available gzip archives. Their logic is the following:

  • visit https://admin.thegreenwebfoundation.org/green-urls and parse dataset URLs using /href="(.+?\.db\.gz)"/;
  • download the latest available dataset if no local file exists, or check the local dataset's release date to determine if there is a newer version available.

Considering the above, suddenly replacing the gzip archives with bz2 would break such integrations.

I don't think publishing bz2 in addition to gzip would make sense, because it almost doubles the storage used in S3 for a single dataset version.

New users see permissions and important dates when they shouldn't

If you create a new user, when you sign up, you see more permissions than you should.

Specifically, you see:

  • user and group permissions
  • staff / superuser
  • important dates

These should be visible to super users, but not end users.

Where this seems to be occurring

I think this is down to the admin page being overridden here, where there are explicit listings of admin widgets.

https://github.com/thegreenwebfoundation/greenwebfoundation-admin/blob/master/apps/accounts/admin.py#L58-L71

I say this because there don't seem to be permissions associated with the hosting group that would otherwise list these widgets.

Add alert for when a hoster submits a new IP range for approval

We had this in the old admin system, and it makes approving IP ranges much faster if we have it.

Expected behaviour:

When a user responsible for updating a hosting provider adds a new IP range or ASN, an email is sent to administrators to review, so they can either reject it, or approve it.

When a request has been approved, we should send a notification to the hosting provider thanking them for submitting the information.
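A minimal sketch of the alert, assuming a post_save signal on an approval-request model; the model name and email addresses are illustrative:

from django.core.mail import send_mail
from django.db.models.signals import post_save
from django.dispatch import receiver

from apps.accounts.models import GreencheckIpApprove  # assumed model and path

@receiver(post_save, sender=GreencheckIpApprove)
def notify_admins_of_new_request(sender, instance, created, **kwargs):
    if not created:
        return  # only alert on brand new requests, not edits
    send_mail(
        subject=f"New IP range submitted for approval: {instance}",
        message="A hoster has submitted a new IP range. Please review it in the admin.",
        from_email="noreply@thegreenwebfoundation.org",        # placeholder
        recipient_list=["support@thegreenwebfoundation.org"],  # placeholder
    )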

Add CSV importer for hosting orgs who provide structured data

We have some hosting orgs with lots of IPs that send us CSV files full of IP ranges to update on a quarterly or monthly basis.

We should implement an IP range ingester that can read a CSV file and create the corresponding list of green IPs to check against in our whitelist; a sketch of the importer follows the TODO list below.

TODO

  • create importer class
  • add importer for ipv4
  • add management command to trigger this
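A minimal sketch of the importer's core, assuming a CSV with ip_start and ip_end columns and a GreencheckIp-style model; the names are illustrative:

import csv
import ipaddress

from apps.greencheck.models import GreencheckIp  # assumed import path

def import_ip_ranges(csv_path, hosting_provider):
    with open(csv_path) as csvfile:
        for row in csv.DictReader(csvfile):
            # Parse before saving, so malformed rows fail loudly.
            ip_start = ipaddress.ip_address(row["ip_start"])
            ip_end = ipaddress.ip_address(row["ip_end"])
            GreencheckIp.objects.create(
                hostingprovider=hosting_provider,
                ip_start=str(ip_start),
                ip_end=str(ip_end),
                active=False,  # leave inactive until an admin approves it
            )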

Limited user should not see approvals in menu

A limited user needs to have the permission to view approvals, but we should hide the approvals menu; they don't need to be able to view other hosts' approvals, only the approvals for their own host.

Remove bad data from hoster list

We have a few hosters with nonsensical names from years ago, when hackers tried to submit bad data to a much earlier version of the site.

They look like XSS attempts from years ago.

Update deployment script and instructions for pushing changes

This is as good a time as any to set up a more automated workflow for deployments.

Limiting concurrency for deploys

GitHub Actions is great, but sadly, multiple PRs being merged can result in multiple overlapping deploy steps if you have a deploy-on-push/merge style setup.

Moreover, there seems to be no officially supported way to limit GH Actions to only one job at a time:
https://github.community/t5/GitHub-Actions/How-to-limit-concurrent-workflow-runs/td-p/37786/page/2

The closest thing there seems to be is turnstyle - an action that implements a kind of locking:

https://github.com/softprops/turnstyle

Adding this would mean we can use much of the same Ansible deploy scripts, but have them run only one at a time.
