consider to create a caching mechanism for queries against external services in order

please consider also this <a href="https://blog.r-hub.io/2021/07/30/cache/" rel="nofol

Dear <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Be polite with external services about relter HOT 12 OPEN

ptagliolato commented on July 3, 2024

Be polite with external services

from relter.

Comments (12)

oggioniale commented on July 3, 2024

please consider also this https://blog.r-hub.io/2021/07/30/cache/

from relter.

ptagliolato commented on July 3, 2024

Two points on this issue.
The caching mechanism could be better implemented in the get_id function. This requires some work for managing internal and system variable and exactly define the flow. I stopped myself on this way after first steps in a branch on my fork. The point is that this sort of local cache could be useful to speed up one's work, but it does not fix the current issue,

e.g. during workshops.

A different approach would be that of implementing a reverse-proxy with caching capabilities. I installed on CNR-IREA internal swarm server a version (with some customization) of the docker image from https://github.com/djmaze/docker-caching-proxy.

I made some benchmark by setting the ReLTER::set_deims_base_url("") and invoking ReLTER::get_ilter_observedProperties(sitesNum = 100) and first results seem promising (50% time of the first -not cached- request).
I'm not an nginx expert and I'm sure it should be fine tuned, but it seems to me a possible strategy to ameliorate the workshop scenario.

from relter.

micha-silver commented on July 3, 2024

Hi Paolo:
Thanks for revisiting this. Maybe I am not following, but what is the use case for caching queries? Do we expect multiple queries from one user for the same data? This might happen in a workshop framework, as you mention, but that's a special case, no?

In a broader view, the audience for the ReLTER package would be researchers working on a particular site. They will generally get the site metadata from DEIMS once, then go on to search for real data, elsewhere. Is that your view also?
If that is indeed the intention, then why do we need, i.e. get_ilter_generalinfo() to return a full list of all sites on DEIMS? (> 1000 sites). Why should we allow that query to run at all without at least a country_name or a site_name ??

And more importantly, following Christoph's request, we need to always use his API to get site data and never request the whole list of sites. Unless I am missing something, we have not yet implemented that. If I am wrong, please correct me!

from relter.

stopopol commented on July 3, 2024

Hi,

I agree with Micha as I struggling to see the use case here. I think that if we optimise the way the API is queried as well as add additional filter options for the API, then we should have covered most (potential) performance issues.

Setting up a dedicated caching layers seems to be a little too much. Most users are probably not interested in querying all site records and their contents at once.

We can, of course, discuss this during the upcoming eLTER Software workshop.

Best,
Christoph

from relter.

oggioniale commented on July 3, 2024

Dear @ptagliolato, @micha-silver and @stopopol.
The caching mechanism is not only for the large query but for improve the performances of all the queries from those dedicated to a single site (get_site_info()) to those dedicated to a network (get_network_observedProperties(), get_network_sites()). I thinking, for example, to have 100 simultaneous calls to obtain information from the list of sites of the LTER-Italia network.

The function get_ilter_generalinfo() was designed to get an overall idea of the global network and to answer network management questions (e.g. what is the altimetric distribution of sites in the ILTER network, what is the distribution of sites with respect to biogeographical regions, etc.). These queries will be made with low temporal frequency and during the training could be made simultaneously by a high number (20?) of people. In both cases, caching could be a good solution.

If we feel that the functions dedicated to ILTER are not important, we can also delete them, but before the workshop we cannot touch them.

from relter.

stopopol commented on July 3, 2024

I've just had a look at the get_ilter_generalinfo function, but the description is not entirely correct (https://docs.ropensci.org/ReLTER/reference/get_ilter_generalinfo.html). There are only ~750 ILTER sites. If the function returns much more than that, it's actually querying all of DEIMS. Looking at the source code (https://github.com/ropensci/ReLTER/blob/main/R/get_ilter_generalinfo.R) the function is actually doing that. The query parameter should be something like this: https://deims.org/api/sites?network=1aa7ccb2-a14b-43d6-90ac-5e0a6bc1d65b&verified=true

EDIT: This also applies to the other ILTER functions, e.g. https://github.com/ropensci/ReLTER/blob/main/R/get_ilter_observedProperties.R

The functions are querying all of DEIMS instead of only querying the ILTER sites. Please update these functions.

EDIT 2: From what I can see, the functions also don't capitalise on the filter functionality of the API. The API can be filtered by site name (https://deims.org/api/sites?name=lago) and country (https://deims.org/api/sites?country=it).

from relter.

ptagliolato commented on July 3, 2024

@stopopol Last commits (see #83) exploits the DEIMS API filter functionality you suggested (get_ilter_generalinfo). Moreover: it is now mandatory to set at least one of the two filters; the function now avoids any other request per-site (with the exception of cases where the first response from /sites/? returns only 3 sites, this is for temporary compatibility of old tests). I hope this solves the issue on the first function. Other get_ilter* functions must be modified or discarded (their status "questioning" accounts for this).

from relter.

micha-silver commented on July 3, 2024

@ptagliolato : mentioning here so I won't forget :-)
I think we need a function similar to get_ilter_generalinfo() to query for networks. IIUC, There is, so far, no way to get the DEIMS ID of a network. Maybe the same function with an additional parameter?

from relter.

ptagliolato commented on July 3, 2024

In order not to do the same mistake in communication, I would involve @stopopol and ask him for an appropriate API to retrieve network-only-related info (networkId, name, related networks for example). We can further discuss this online.

from relter.

micha-silver commented on July 3, 2024

@ptagliolato Good idea. But it will be after Lyon in any case.

from relter.

stopopol commented on July 3, 2024

We can discuss feature requests in detail in Lyon or other dedicated (virtual) meetings. In principle, if you need additional filter and query functionality for the rest-api I'll be happy to implement but it depends on the field/entity type. networks should be more straightforward to do but there are different ways to realise it.

from relter.

micha-silver commented on July 3, 2024

@stopopol Looking at the workshop agenda, the best window for a meeting among us will be Wednesday.

from relter.

Be polite with external services about relter HOT 12 OPEN

Comments (12)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent