eida/eida-statistics
Aggregated statistics of EIDA nodes
License: GNU General Public License v3.0
In the branch fix_openap3_proto, the deployment uses pyramid_openapi3 delivered from the vpet GitHub repository.
Now we need to force the protocol to https; I don't remember how to do that in the code. @vpet98, can you help?
It should be configured with an environment variable EIDASTATS_API_PROTO
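A minimal sketch of how this could work (the helper name and the fallback are assumptions; only the EIDASTATS_API_PROTO variable name comes from the discussion above):

```python
import os

# Hypothetical helper: take the protocol used in generated URLs from the
# EIDASTATS_API_PROTO environment variable, defaulting to https.
def api_protocol(default="https"):
    proto = os.environ.get("EIDASTATS_API_PROTO", default).lower()
    if proto not in ("http", "https"):
        raise ValueError(f"Unsupported protocol: {proto}")
    return proto
```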
Let's prepare an OpenAPI specification of the EIDA Statistics API.
Some references:
There is a first specification available at https://github.com/EIDA/eida-statistics/blob/main/ingestor_specs.md
@ALL would you please comment ?
It's very basic, and should be straightforward to implement (at least the ingestion part). Thank you.
We should reply error 500 in such cases and rollback the transaction.
2023-04-06 14:30:23,068 INFO [ws_eidastats.helper_functions:134][MainThread] Registering 3557 statistics.
2023-04-06 14:30:23,094 ERROR [ws_eidastats.helper_functions:142][MainThread] Postgresql error 42501 registering statistic
2023-04-06 14:30:23,094 ERROR [ws_eidastats.helper_functions:143][MainThread] ERROR: permission denied for table dataselect_stats
2023-04-06 14:30:23,094 INFO [ws_eidastats.helper_functions:144][MainThread] Statistics successfully registered
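The log above shows the batch being reported as successful even after a permission error. A minimal sketch of the intended behaviour, illustrated with the stdlib sqlite3 module (function name and schema are illustrative, not the service's actual code): if any insert fails, roll the whole batch back so nothing partial is kept, then re-raise so the view layer can answer HTTP 500.

```python
import sqlite3

def register_statistics(conn, rows):
    try:
        conn.executemany(
            "INSERT INTO dataselect_stats (node, bytes) VALUES (?, ?)", rows
        )
        conn.commit()
    except Exception:
        conn.rollback()  # discard the partial batch
        raise            # let the view layer translate this into HTTP 500
```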
Hello,
Thanks for this very nice webservice.
Playing with the example links for humans, I noted a question about the CSV content.
The nb_reqs column always appears as None. Shouldn't it be at least the same number as the nb_successful_reqs column?
Also, the country column always shows *. Maybe this feature is not yet implemented?
To be more consistent with other FDSN webservices and to reduce the default amount of responses, make starttime mandatory; endtime can remain optional.
The aggregator should compress the data before sending it.
The aggregator should be able to send the aggregation to the central webservice directly
I would suggest the ability to bzip2 the log file on the fly with a python module like:
https://www.tutorialspoint.com/python-support-for-bzip2-compression-bz2
so the node maintainer could run:
eida_stats_aggregator --bzip2 --output-directory aggregates fdsnws-requests.log.2020-11-02 fdsnws-requests.log.2020-11-03.bz2
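A sketch of what the suggested --bzip2 option could look like internally (function and flag names are invented for illustration): stream the aggregate through the stdlib bz2 module so the compressed file is written on the fly, without a temporary plain file. bz2.open also transparently reads the .bz2 input logs shown in the command line above.

```python
import bz2

def write_aggregate(path, payload, use_bzip2=False):
    if use_bzip2:
        # write through a bz2 stream; the file is compressed on the fly
        with bz2.open(path + ".bz2", "wt", encoding="utf-8") as out:
            out.write(payload)
        return path + ".bz2"
    with open(path, "w", encoding="utf-8") as out:
        out.write(payload)
    return path
```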
New branch with latest version of pyramid-openapi3 dependency at https://github.com/EIDA/eida-statistics/tree/openapi_dependency.
Tested locally and works, hope it works in production as well.
Tell me when to merge in main.
Make a GitHub Action to run tests and test coverage, generate a badge, and put it in the README.
Using the behave framework (https://behave.readthedocs.io/en/latest/), we should write some test scenarios and test them.
Wherever there is logic in the code, we should test that it does what it should.
For instance the restriction function with those use cases:
It should be enough to identify them with the creation time.
When giving statistics to a user that is not authorized to see stats
AND
When there is more than one level in the result
Show all the restricted statistics summed up in an "Other" network item.
If there is only one restricted network in the result, reply 403 Forbidden.
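A hedged sketch of how the two rules above could read as behave scenarios (feature, network codes, and step wording are invented for illustration):

```gherkin
Feature: Restricted statistics

  Scenario: Several networks, some restricted
    Given a result containing a public network "FR" and a restricted network "XX"
    When an unauthorized user requests the statistics
    Then the restricted statistics are summed up in an "Other" network item

  Scenario: Only one restricted network
    Given a result containing only the restricted network "XX"
    When an unauthorized user requests the statistics
    Then the service replies 403
```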
We are missing 2 public endpoints.
The endpoint _nodes could be deleted.
The aggregator identifies networks only with 2 letters. This is wrong.
Commit 86a36cf fixes this.
In order to avoid query flooding and to provide faster replies, implement caching of the requests.
See https://docs.python.org/3/library/functools.html and https://realpython.com/lru-cache-python/
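An illustration of the functools.lru_cache approach from those references (the function name and parameters are hypothetical). Two caveats apply: all parameters must be hashable, and the cache should be cleared when a new payload is ingested, e.g. with cached_statistics.cache_clear().

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def cached_statistics(start, end, level, fmt):
    # the expensive database query would run here; identical parameter
    # combinations are then served from memory until evicted
    return f"stats {start}..{end} level={level} format={fmt}"
```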
Something strange happens with network FR.
FR seems to be distributed through RESIF, ETH and ICGC.
It might be that the ETH logging for FR stops at the beginning of 2022, so this might be a temporary problem, but it would be nice to understand what is happening and whether something needs to be fixed.
A clear bug is that the number of users per year only shows ETH.
See result of this query: https://ws.resif.fr/eidaws/statistics/1/dataselect/public?network=FR&start=2021-01&end=2023-12&level=node&format=json
Based on Sentry issues and https://docs.sqlalchemy.org/en/20/errors.html#error-3o7r, I think we need to increase the size of the QueuePool SQLAlchemy uses for connections.
I'll commit to the development branch first, though this fix can be tested more efficiently once it goes into production.
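With create_engine the relevant knobs are pool_size and max_overflow (defaults 5 and 10) plus pool_timeout. The sketch below builds the same QueuePool directly against an in-memory SQLite creator only so it runs without a Postgres server; the sizes shown are an assumption, not a tested tuning.

```python
import sqlite3
from sqlalchemy.pool import QueuePool

# The same pool class create_engine() uses behind the scenes; with
# create_engine you would pass pool_size=/max_overflow=/pool_timeout=.
pool = QueuePool(
    lambda: sqlite3.connect(":memory:"),
    pool_size=20,     # persistent connections kept open (default 5)
    max_overflow=10,  # extra burst connections on top (default 10)
    timeout=30,       # seconds to wait before the 3o7r TimeoutError
)
```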
FdsnNetExtender.extend(self, net, date_string) has lru_cache(maxsize=1000), but since date_string is different most of the time, caching seems to be inefficient. In any case, I can observe URLs like http://www.fdsn.org/ws/networks/1/query?fdsn_code=3E being downloaded hundreds of times. Sometimes this causes an exception, which seems to be the reason for incomplete statistics at GFZ.
Maybe date_string should be reduced to the year (two different temporary networks with the same code never exist in the same year?). Alternatively, I would suggest caching the result of urlopen(request).
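A sketch of the year-based cache key suggested above (CALLS is instrumentation for the example only; the real lookup would be the fdsn.org request):

```python
from functools import lru_cache

CALLS = []  # instrumentation for this sketch only

@lru_cache(maxsize=1000)
def _extend_cached(net, year):
    # in the real FdsnNetExtender this is where the single
    # http://www.fdsn.org/ws/networks/1/query lookup would happen
    CALLS.append((net, year))
    return f"{net}{year}"

def extend(net, date_string):
    # key the cache on the year only, under the assumption that two
    # temporary networks never share a code within the same year
    return _extend_cached(net, int(date_string[:4]))
```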
First task for this is to build an API in the OpenAPI 3 standard, for instance using the Swagger online tools.
In order to imagine a suitable API, you can look at the matrix document. First 2 rows define the questions and the granularity level.
The code attached to this project needs a better documentation, I'm on it (see issue #11)
The datamodel is specified in the code: https://github.com/EIDA/eida-statistics/tree/main/backend_database
You can use this project to bring up your own empty database if needed.
You can create a directory for the webservice specification and implementation at the root of this project.
Thanks for publishing this interface. When retrieving yearly network statistics for each node I get results for all nodes except GFZ:
returns an empty result. The same happens with unknown data center names. It would be better to return an error if the data center name is invalid.
I also tried "../submit/.." instead of "../dataselect/..". This doesn't work at all.
We said that the central operator should manage authorizations for networks.
The cli eida_statsman should help us do that.
eida_statsman network set group ABCD
CSV output should be sorted by date when details=month or year
Example:
curl -X 'GET' 'https://ws.resif.fr/eidaws/statistics/1/dataselect/public?start=2022-01&end=2022-12&details=month&format=csv'
# version: 1.0.0
# request_parameters: start=2022-01&end=2022-12&details=month&format=csv
date,node,network,station,location,channel,country,bytes,nb_reqs,nb_successful_reqs,clients
2022-09,*,*,*,*,*,*,49249517419520,93309158,61742567,3752
2022-04,*,*,*,*,*,*,52075391539200,70253741,56097249,5135
2022-03,*,*,*,*,*,*,35866232961024,76959640,62862467,6096
2022-07,*,*,*,*,*,*,47809205437440,100682394,86495962,4220
2022-08,*,*,*,*,*,*,41827452808448,199812690,111005715,3361
2022-10,*,*,*,*,*,*,34598181185536,84436994,64883858,4267
2022-06,*,*,*,*,*,*,54756623463168,92399681,75015880,4025
2022-12,*,*,*,*,*,*,75743023855104,115305619,82503762,4524
2022-02,*,*,*,*,*,*,49705000816128,92485574,76534546,4626
2022-05,*,*,*,*,*,*,70791218339072,69100676,53027093,4143
2022-11,*,*,*,*,*,*,31853315838464,122664892,65935181,4714
2022-01,*,*,*,*,*,*,47874364480512,70079038,57161798,3733
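A sketch of the requested ordering as a pure function (the function name is invented): keep the leading '#' comment lines and the header, and sort the data rows on the first column; "YYYY-MM" dates sort correctly as plain strings.

```python
def sort_csv_by_date(payload):
    lines = [l for l in payload.strip().splitlines() if l]
    comments = [l for l in lines if l.startswith("#")]   # metadata stays on top
    rest = [l for l in lines if not l.startswith("#")]
    header, rows = rest[0], rest[1:]
    rows.sort(key=lambda row: row.split(",", 1)[0])      # lexical == chronological
    return "\n".join(comments + [header] + rows)
```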
On the /public and /restricted methods, change aggregate_on to:
It will show the details of the query.
Possible values are:
Multiple values are allowed. If both month and year are specified, reply 400 with a helpful detail message.
Hello @ALL
I released a new version for the dataselect statistics aggregator.
This release adds identification of temporary networks by their extended identifier, which is important because otherwise we mix up statistics from different networks sharing the same short network code.
Could all nodes please upgrade? Depending on your installation method, this should not be much more work than:
pip3 install --upgrade eida-statistics-aggregator
Please note, the minimal Python version is 3.6, but it can run in its own isolated environment without problem. It has been tested up to Python 3.10.
Please report in this issue when you're done:
Currently, the webservices /statistics/1/* and /dataselectstats are written to be executed as separate Flask applications.
I would like to serve both in one single application:
POST /dataselectstats => statistics ingestion
GET /dataselectstats => statistics query
GET /query
GET /health
GET / => documentation
Besides, do not declare the statistics/1/ part in the routes, as it will be set on the deployment side.
You can reorganize the project to split the routes and the methods as you see fit.
Instead of opening one connection to the SQL backend on each request, use the SQLAlchemy native mechanisms to interact with the database.
This is usually done with a singleton object managing the database connection; all the other functions build the SQL statement and pass it to this object.
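A sketch of that singleton pattern (names and the in-memory SQLite DSN are illustrative; the real DSN would come from configuration): one engine, and therefore one connection pool, created lazily for the whole process, while request handlers only build statements and hand them to the session.

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

_engine = None
_Session = None

def get_session():
    """Return the process-wide session, creating the engine on first use."""
    global _engine, _Session
    if _engine is None:
        _engine = create_engine("sqlite://")  # real DSN comes from config
        _Session = scoped_session(sessionmaker(bind=_engine))
    return _Session()
```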
The query /query?start=2022-06-01 should be accepted. Currently it gives:
BAD REQUEST: invalid value of parameter 'end'
We should make some arguments mandatory and not allow dumping the whole database by issuing /query without parameters.
Maybe make one of the start/end parameters mandatory.
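A sketch of the proposed rule as a pure validation function (the function name and the current-month default are assumptions): start becomes mandatory so a bare /query cannot dump the whole database, while end stays optional.

```python
from datetime import date

def validate_period(params):
    # 'start' is mandatory; a missing value maps to an HTTP 400 reply
    if "start" not in params:
        raise ValueError("BAD REQUEST: parameter 'start' is mandatory")
    # 'end' stays optional and defaults to the current month
    end = params.get("end", date.today().strftime("%Y-%m"))
    return params["start"], end
```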
Could you remove the DSN from the code and read it from an environment variable?
Also, please look at how to set up the environment (dev, staging, production) so that Sentry can tell them apart.
https://docs.sentry.io/platforms/go/guides/martini/configuration/environments/
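A configuration sketch of both requests together (the SENTRY_DSN and SENTRY_ENVIRONMENT variable names are assumptions, not established project conventions):

```python
import os
import sentry_sdk

# The DSN is no longer hard-coded, and the environment name lets Sentry
# separate dev/staging/production events in its UI.
sentry_sdk.init(
    dsn=os.environ["SENTRY_DSN"],
    environment=os.environ.get("SENTRY_ENVIRONMENT", "development"),
)
```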
(Reported by @vpet98)
I noticed some inconsistency, to an extent that I don't know whether it should be ignored, in the number of clients and HLL objects in the results that the webservice returns.
Try this: https://ws.resif.fr/eidaws/statistics/1/dataselect/public?start=2023-01&country=GR&details=country&format=json
And then the same in node level: https://ws.resif.fr/eidaws/statistics/1/dataselect/public?start=2023-01&country=GR&level=node&details=country&format=json
You would expect the sum of the clients over the results of the second query to be approximately equal to the clients in the first query. But the difference is quite noticeable (first query: 78 clients; second query: 103 clients in total).
And it is even worse for countries with more clients (in another example I had 2232 vs 3115 clients).
My SQL query includes this in the select clause: hll_union_agg(dataselect_stats.clients), which should be correct.
Then I use this library: https://github.com/AdRoll/python-hll.
And as the library indicates in its README, I print the cardinality like this: HLL.from_bytes(NumberUtil.from_hex(row.clients[2:], 0, len(row.clients[2:]))).cardinality(), for each row that the SQL query returns.
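Part of this gap may be expected rather than a bug: hll_union_agg deduplicates clients seen at several nodes, while adding up the per-node cardinalities counts such a client once per node, so the node-level sum should come out higher, on top of the few percent of estimation error inherent to HLL. An exact-set analogy (client ids invented):

```python
# A client that requested data from two nodes is counted once in the
# country-level union but twice in the sum of the per-node figures.
resif_clients = {"c1", "c2", "c3"}
eth_clients = {"c3", "c4"}         # "c3" also appears at RESIF

union_count = len(resif_clients | eth_clients)        # country-level: 4
per_node_sum = len(resif_clients) + len(eth_clients)  # node totals: 5
```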
Could you have a quick look at it if there is time?
/nodes/id would show the information about a node, which is basically its default restriction policy.
More info could be added in the future, for instance, latest payload submitted ?
In order to insert dynamic content (like the full URL) in the documentation, it would be interesting to use a templating system (jinja2 is a popular option).
The URL prefix can be taken from an environment variable like
EIDASTATS_API_HOST=server.exemple.gr
EIDASTATS_API_PATH=/eidaws/statistics/1
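A sketch of the jinja2 idea with those two variables (the template string and fallbacks are invented for illustration):

```python
import os
from jinja2 import Template

# Render the documentation through jinja2, taking the deployment prefix
# from the environment variables proposed above.
template = Template("Base URL: https://{{ host }}{{ path }}/query")
rendered = template.render(
    host=os.environ.get("EIDASTATS_API_HOST", "localhost"),
    path=os.environ.get("EIDASTATS_API_PATH", "/eidaws/statistics/1"),
)
```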
Just aesthetically speaking, it would be nicer to rename it.
For now, it says:
Content-Type: text/html; charset=utf-8
I very much like the extra lines with comments you included at the top of the CSV (#40).
Could you please consider including an extra piece of information?
For instance: rejected or malformed parameters?
# request_parameters: start=2022-01&end=2022-12&details=month&format=csv
# rejected_parameters: groupby=day
Change all psycopg2 calls to switch to a pure sqlalchemy implementation
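A before/after sketch of that migration (the query and the in-memory SQLite DSN are illustrative; the real engine would point at the Postgres DSN). A raw cursor.execute(...) call, e.g. cursor.execute("SELECT ... WHERE date >= %s", (start,)), becomes a text() statement executed through the engine, which then owns connection management and pooling:

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # illustrative DSN
with engine.connect() as conn:
    # placeholder query standing in for the real statistics SELECT
    rows = conn.execute(text("SELECT 1 AS one")).fetchall()
```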