eida/eida-statistics
Aggregated statistics of EIDA nodes
License: GNU General Public License v3.0
In the branch fix_openap3_proto, the deployment uses pyramid_openapi3 delivered from the vpet GitHub repository.
Now we need to force the protocol to https; I don't remember how to do that in the code. @vpet98, can you help?
It should be configured with an environment variable EIDASTATS_API_PROTO
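A minimal sketch of how this could work (the helper name and the fallback are assumptions; only the EIDASTATS_API_PROTO variable name comes from the discussion above):

```python
import os

# Hypothetical helper: take the protocol used in generated URLs from the
# EIDASTATS_API_PROTO environment variable, defaulting to https.
def api_protocol(default="https"):
    proto = os.environ.get("EIDASTATS_API_PROTO", default).lower()
    if proto not in ("http", "https"):
        raise ValueError(f"Unsupported protocol: {proto}")
    return proto
```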
Let's prepare an OpenAPI specification of the EIDA Statistics API.
Some references:
There is a first specification available at https://github.com/EIDA/eida-statistics/blob/main/ingestor_specs.md
@ALL would you please comment ?
It's very basic, and should be straightforward to implement (at least the ingestion part). Thank you.
We should reply error 500 in such cases and rollback the transaction.
2023-04-06 14:30:23,068 INFO [ws_eidastats.helper_functions:134][MainThread] Registering 3557 statistics.
2023-04-06 14:30:23,094 ERROR [ws_eidastats.helper_functions:142][MainThread] Postgresql error 42501 registering statistic
2023-04-06 14:30:23,094 ERROR [ws_eidastats.helper_functions:143][MainThread] ERROR: permission denied for table dataselect_stats
2023-04-06 14:30:23,094 INFO [ws_eidastats.helper_functions:144][MainThread] Statistics successfully registered
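The log above shows the batch being reported as successful even after a permission error. A minimal sketch of the intended behaviour, illustrated with the stdlib sqlite3 module (function name and schema are illustrative, not the service's actual code): if any insert fails, roll the whole batch back so nothing partial is kept, then re-raise so the view layer can answer HTTP 500.

```python
import sqlite3

def register_statistics(conn, rows):
    try:
        conn.executemany(
            "INSERT INTO dataselect_stats (node, bytes) VALUES (?, ?)", rows
        )
        conn.commit()
    except Exception:
        conn.rollback()  # discard the partial batch
        raise            # let the view layer translate this into HTTP 500
```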
Hello,
Thanks for this very nice webservice.
Playing with the example links for humans, I noted a question about the CSV content.
The nb_reqs column always appears as None. Shouldn't it be at least the same number as the nb_successful_reqs column?
Also, the country column always shows *. Maybe this feature is not yet implemented?
To be more consistent with other FDSN webservices and to reduce the default amount of responses, make starttime mandatory; endtime can remain optional.
The aggregator should compress the data before sending it.
The aggregator should be able to send the aggregation to the central webservice directly
I would suggest the ability to bzip2 the log file on the fly with a python module like:
https://www.tutorialspoint.com/python-support-for-bzip2-compression-bz2
so the node maintainer could run:
eida_stats_aggregator --bzip2 --output-directory aggregates fdsnws-requests.log.2020-11-02 fdsnws-requests.log.2020-11-03.bz2
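A sketch of what the suggested --bzip2 option could look like internally (function and flag names are invented for illustration): stream the aggregate through the stdlib bz2 module so the compressed file is written on the fly, without a temporary plain file. bz2.open also transparently reads the .bz2 input logs shown in the command line above.

```python
import bz2

def write_aggregate(path, payload, use_bzip2=False):
    if use_bzip2:
        # write through a bz2 stream; the file is compressed on the fly
        with bz2.open(path + ".bz2", "wt", encoding="utf-8") as out:
            out.write(payload)
        return path + ".bz2"
    with open(path, "w", encoding="utf-8") as out:
        out.write(payload)
    return path
```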
New branch with latest version of pyramid-openapi3 dependency at https://github.com/EIDA/eida-statistics/tree/openapi_dependency.
Tested locally and works, hope it works in production as well.
Tell me when to merge in main.
Make a GitHub Action to run tests and test coverage, generate a badge, and put it in the README.
Using the behave framework (https://behave.readthedocs.io/en/latest/), we should write some test scenarios and test them.
Wherever there is logic in the code, we should test that it does what it should.
For instance the restriction function with those use cases:
It should be enough to identify them with the creation time.
When giving statistics to a user that is not authorized to see stats
AND
When there is more than one level in the result
Show all the restricted statistics summed up in an "Other" network item.
If there is only one restricted network in the result, reply 403 Forbidden.
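A hedged sketch of how the two rules above could read as behave scenarios (feature, network codes, and step wording are invented for illustration):

```gherkin
Feature: Restricted statistics

  Scenario: Several networks, some restricted
    Given a result containing a public network "FR" and a restricted network "XX"
    When an unauthorized user requests the statistics
    Then the restricted statistics are summed up in an "Other" network item

  Scenario: Only one restricted network
    Given a result containing only the restricted network "XX"
    When an unauthorized user requests the statistics
    Then the service replies 403
```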
We are missing 2 public endpoints.
The endpoint _nodes could be deleted.
The aggregator identifies networks only with 2 letters. This is wrong.
Commit 86a36cf fixes this.
In order to avoid query flooding and to provide faster replies, implement caching of the requests.
See https://docs.python.org/3/library/functools.html and https://realpython.com/lru-cache-python/
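An illustration of the functools.lru_cache approach from those references (the function name and parameters are hypothetical). Two caveats apply: all parameters must be hashable, and the cache should be cleared when a new payload is ingested, e.g. with cached_statistics.cache_clear().

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def cached_statistics(start, end, level, fmt):
    # the expensive database query would run here; identical parameter
    # combinations are then served from memory until evicted
    return f"stats {start}..{end} level={level} format={fmt}"
```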
Something strange happens with network FR.
FR seems to be distributed through RESIF, ETH and ICGC.
It might be that the ETH logging for FR stops at the beginning of 2022, so this might be a temporary problem, but it would be nice to understand what is happening and whether something needs to be fixed.
A clear bug is that the number of users per year only shows ETH.
See result of this query: https://ws.resif.fr/eidaws/statistics/1/dataselect/public?network=FR&start=2021-01&end=2023-12&level=node&format=json
Based on Sentry issues and https://docs.sqlalchemy.org/en/20/errors.html#error-3o7r, I think we need to increase the size of the QueuePool SQLAlchemy uses for connections.
I'll commit to the development branch first, though this fix can be tested more efficiently once it goes into production.
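With create_engine the relevant knobs are pool_size and max_overflow (defaults 5 and 10) plus pool_timeout. The sketch below builds the same QueuePool directly against an in-memory SQLite creator only so it runs without a Postgres server; the sizes shown are an assumption, not a tested tuning.

```python
import sqlite3
from sqlalchemy.pool import QueuePool

# The same pool class create_engine() uses behind the scenes; with
# create_engine you would pass pool_size=/max_overflow=/pool_timeout=.
pool = QueuePool(
    lambda: sqlite3.connect(":memory:"),
    pool_size=20,     # persistent connections kept open (default 5)
    max_overflow=10,  # extra burst connections on top (default 10)
    timeout=30,       # seconds to wait before the 3o7r TimeoutError
)
```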
FdsnNetExtender.extend(self, net, date_string) has lru_cache(maxsize=1000), but since date_string is different most of the time, caching seems to be inefficient. In any case, I can observe URLs like http://www.fdsn.org/ws/networks/1/query?fdsn_code=3E being downloaded hundreds of times. Sometimes this causes an exception, which seems to be the reason for incomplete statistics at GFZ.
Maybe date_string should be reduced to the year (two different temporary networks with the same code never exist in the same year?). Alternatively, I would suggest caching the result of urlopen(request).
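A sketch of the year-based cache key suggested above (CALLS is instrumentation for the example only; the real lookup would be the fdsn.org request):

```python
from functools import lru_cache

CALLS = []  # instrumentation for this sketch only

@lru_cache(maxsize=1000)
def _extend_cached(net, year):
    # in the real FdsnNetExtender this is where the single
    # http://www.fdsn.org/ws/networks/1/query lookup would happen
    CALLS.append((net, year))
    return f"{net}{year}"

def extend(net, date_string):
    # key the cache on the year only, under the assumption that two
    # temporary networks never share a code within the same year
    return _extend_cached(net, int(date_string[:4]))
```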
First task for this is to build an API in the OpenAPI 3 standard, for instance using the Swagger online tools.
In order to imagine a suitable API, you can look at the matrix document. First 2 rows define the questions and the granularity level.
The code attached to this project needs a better documentation, I'm on it (see issue #11)
The datamodel is specified in the code: https://github.com/EIDA/eida-statistics/tree/main/backend_database
You can use this project to bring up your own empty database if needed.
You can create a directory for the webservice specification and implementation at the root of this project.
Thanks for publishing this interface. When retrieving yearly network statistics for each node I get results for all nodes except GFZ:
returns an empty result. The same happens with unknown data center names. It would be better to return an error if the data center name is invalid.
I also tried "../submit/.." instead of "../dataselect/..". This doesn't work at all.
We said that the central operator should manage authorizations for networks.
The cli eida_statsman should help us do that.
eida_statsman network set group ABCD
CSV output should be sorted by date when details=month or year
Example:
curl -X 'GET' 'https://ws.resif.fr/eidaws/statistics/1/dataselect/public?start=2022-01&end=2022-12&details=month&format=csv'
# version: 1.0.0
# request_parameters: start=2022-01&end=2022-12&details=month&format=csv
date,node,network,station,location,channel,country,bytes,nb_reqs,nb_successful_reqs,clients
2022-09,*,*,*,*,*,*,49249517419520,93309158,61742567,3752
2022-04,*,*,*,*,*,*,52075391539200,70253741,56097249,5135
2022-03,*,*,*,*,*,*,35866232961024,76959640,62862467,6096
2022-07,*,*,*,*,*,*,47809205437440,100682394,86495962,4220
2022-08,*,*,*,*,*,*,41827452808448,199812690,111005715,3361
2022-10,*,*,*,*,*,*,34598181185536,84436994,64883858,4267
2022-06,*,*,*,*,*,*,54756623463168,92399681,75015880,4025
2022-12,*,*,*,*,*,*,75743023855104,115305619,82503762,4524
2022-02,*,*,*,*,*,*,49705000816128,92485574,76534546,4626
2022-05,*,*,*,*,*,*,70791218339072,69100676,53027093,4143
2022-11,*,*,*,*,*,*,31853315838464,122664892,65935181,4714
2022-01,*,*,*,*,*,*,47874364480512,70079038,57161798,3733
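A sketch of the requested ordering as a pure function (the function name is invented): keep the leading '#' comment lines and the header, and sort the data rows on the first column; "YYYY-MM" dates sort correctly as plain strings.

```python
def sort_csv_by_date(payload):
    lines = [l for l in payload.strip().splitlines() if l]
    comments = [l for l in lines if l.startswith("#")]   # metadata stays on top
    rest = [l for l in lines if not l.startswith("#")]
    header, rows = rest[0], rest[1:]
    rows.sort(key=lambda row: row.split(",", 1)[0])      # lexical == chronological
    return "\n".join(comments + [header] + rows)
```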
On the /public and /restricted methods, change aggregate_on to:
It will show the details of the query.
Possible values are:
Multiple values are allowed. If both month and year are specified, reply 400 with a helpful detail message.
Hello @ALL
I released a new version for the dataselect statistics aggregator.
This release adds identification of temporary networks by their extended identifier, which is important because otherwise we mix up statistics from different networks sharing the same short network code.
Could all nodes please upgrade? Depending on your installation method, this should not be much more work than:
pip3 install --upgrade eida-statistics-aggregator
Please note, the minimal Python version is 3.6, but it can run in its own isolated environment without problem. It has been tested up to Python 3.10.
Please report in this issue when you're done:
Currently, the webservices /statistics/1/* and /dataselectstats are written to be executed as separate Flask applications.
I would like to serve both in one single application:
POST /dataselectstats => statistics ingestion
GET /dataselectstats => statistics query
GET /query
GET /health
GET / => documentation
Besides, do not declare the statistics/1/ part in the routes, as it will be set on the deployment side.
You can reorganize the project to split the routes and the methods as you see fit.
Instead of opening one connection to the SQL backend on each request, use the SQLAlchemy native mechanisms to interact with the database.
This is usually done with a singleton object managing the database connection; all the other functions build the SQL statement and pass it to this object.
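A sketch of that singleton pattern (names and the in-memory SQLite DSN are illustrative; the real DSN would come from configuration): one engine, and therefore one connection pool, created lazily for the whole process, while request handlers only build statements and hand them to the session.

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

_engine = None
_Session = None

def get_session():
    """Return the process-wide session, creating the engine on first use."""
    global _engine, _Session
    if _engine is None:
        _engine = create_engine("sqlite://")  # real DSN comes from config
        _Session = scoped_session(sessionmaker(bind=_engine))
    return _Session()
```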
The query /query?start=2022-06-01 should be accepted. Currently it gives:
BAD REQUEST: invalid value of parameter 'end'
We should make some arguments mandatory and not allow dumping the whole database by issuing /query without parameters.
Maybe make one of the start/end parameters mandatory.
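A sketch of the proposed rule as a pure validation function (the function name and the current-month default are assumptions): start becomes mandatory so a bare /query cannot dump the whole database, while end stays optional.

```python
from datetime import date

def validate_period(params):
    # 'start' is mandatory; a missing value maps to an HTTP 400 reply
    if "start" not in params:
        raise ValueError("BAD REQUEST: parameter 'start' is mandatory")
    # 'end' stays optional and defaults to the current month
    end = params.get("end", date.today().strftime("%Y-%m"))
    return params["start"], end
```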
Could you remove the DSN from the code and read it from an environment variable?
Also, please look at how to set up the environment (dev, staging, production) so that Sentry can tell them apart.
https://docs.sentry.io/platforms/go/guides/martini/configuration/environments/
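A configuration sketch of both requests together (the SENTRY_DSN and SENTRY_ENVIRONMENT variable names are assumptions, not established project conventions):

```python
import os
import sentry_sdk

# The DSN is no longer hard-coded, and the environment name lets Sentry
# separate dev/staging/production events in its UI.
sentry_sdk.init(
    dsn=os.environ["SENTRY_DSN"],
    environment=os.environ.get("SENTRY_ENVIRONMENT", "development"),
)
```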
(Reported by @vpet98)
I noticed some inconsistency, to an extent that I don't know whether it should be ignored, in the number of clients and HLL objects in the results that the webservice returns.
Try this: https://ws.resif.fr/eidaws/statistics/1/dataselect/public?start=2023-01&country=GR&details=country&format=json
And then the same in node level: https://ws.resif.fr/eidaws/statistics/1/dataselect/public?start=2023-01&country=GR&level=node&details=country&format=json
You would expect the sum of the clients over the results of the second query to be approximately equal to the clients in the first query. But the difference is quite noticeable (first query: 78 clients; second query: 103 clients in total).
And it is even worse for countries with more clients (in another example I had 2232 vs 3115 clients).
My SQL query includes this in the select clause: hll_union_agg(dataselect_stats.clients), which should be correct.
Then I use this library: https://github.com/AdRoll/python-hll.
And as the library indicates in its README, I print the cardinality like this: HLL.from_bytes(NumberUtil.from_hex(row.clients[2:], 0, len(row.clients[2:]))).cardinality(), for each row that the SQL query returns.
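Part of this gap may be expected rather than a bug: hll_union_agg deduplicates clients seen at several nodes, while adding up the per-node cardinalities counts such a client once per node, so the node-level sum should come out higher, on top of the few percent of estimation error inherent to HLL. An exact-set analogy (client ids invented):

```python
# A client that requested data from two nodes is counted once in the
# country-level union but twice in the sum of the per-node figures.
resif_clients = {"c1", "c2", "c3"}
eth_clients = {"c3", "c4"}         # "c3" also appears at RESIF

union_count = len(resif_clients | eth_clients)        # country-level: 4
per_node_sum = len(resif_clients) + len(eth_clients)  # node totals: 5
```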
Could you have a quick look at it if there is time?
/nodes/id would show the information about a node, which is basically its default restriction policy.
More info could be added in the future, for instance, latest payload submitted ?
In order to insert dynamic content (like the full URL) in the documentation, it would be interesting to use a templating system (jinja2 is a popular option).
The URL prefix can be taken from an environment variable like
EIDASTATS_API_HOST=server.exemple.gr
EIDASTATS_API_PATH=/eidaws/statistics/1
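A sketch of the jinja2 idea with those two variables (the template string and fallbacks are invented for illustration):

```python
import os
from jinja2 import Template

# Render the documentation through jinja2, taking the deployment prefix
# from the environment variables proposed above.
template = Template("Base URL: https://{{ host }}{{ path }}/query")
rendered = template.render(
    host=os.environ.get("EIDASTATS_API_HOST", "localhost"),
    path=os.environ.get("EIDASTATS_API_PATH", "/eidaws/statistics/1"),
)
```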
Just aesthetically speaking, it would be nicer to rename it.
For now, it says:
Content-Type: text/html; charset=utf-8
I very much like the extra lines with comments you included at the top of the CSV (#40).
Could you please consider including an extra piece of information?
For instance: rejected or malformed parameters?
# request_parameters: start=2022-01&end=2022-12&details=month&format=csv
# rejected_parameters: groupby=day
Change all psycopg2 calls to switch to a pure sqlalchemy implementation
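A before/after sketch of that migration (the query and the in-memory SQLite DSN are illustrative; the real engine would point at the Postgres DSN). A raw cursor.execute(...) call, e.g. cursor.execute("SELECT ... WHERE date >= %s", (start,)), becomes a text() statement executed through the engine, which then owns connection management and pooling:

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # illustrative DSN
with engine.connect() as conn:
    # placeholder query standing in for the real statistics SELECT
    rows = conn.execute(text("SELECT 1 AS one")).fetchall()
```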