Git Product home page Git Product logo

invenio-records-resources's Introduction

invenio-records-resources's People

Contributors

0einstein0 avatar alejandromumo avatar anikachurilova avatar audrium avatar chriz-uniba avatar dfdan avatar fenekku avatar floriancassayre avatar github-actions[bot] avatar glignos avatar jrcastro2 avatar kpsherva avatar lnielsen avatar max-moser avatar mb-wali avatar ntarocco avatar origliante avatar pineirin avatar rekt-hard avatar slint avatar tlgino avatar tzubaki avatar utnapischtim avatar wgresshoff avatar yashlamba avatar zzacharo avatar

Stargazers

 avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

invenio-records-resources's Issues

Make maximum number of results configurable

(maybe this doesn't need to be configurable if ES caps at 10000)

This could be configured in invenio-records-rest, so maybe we want to be able to configure this still.

(honestly I am not a particularly big proponent of this)

RFC clarification and separation of concerns about the different objects/abstractions

@fenekku:

What is the difference between the
a) api.py Record,
b) the controller.py's RecordController and
c) the resources/record.py's RecordResource ?
I mean Record abstracts DB+ES record and adds a PID to it and RecordResource plays the role of the Flask-RESTful equivalent resource... but then what does the controller do? Record already knows about DB, ES... Like in the frontend, duplication of exact names is an indicator for me that we haven't settled on the responsibilities and their location.

investigate how to normalise link building

We should normalize the way we generate links across the codebase. The main rule is: The business logic layer should not be tied to the presentation layer aka views.

Here is a list of places that break the above rule:

  • The create_pid_redirected_error_handler is trying to build a redirect link based on a view URL i.e using url_for.

Search refactoring

Use cases

As a developer, I would like to easily extend the search engine with new features, so that I don't have to write a lot of code.

For instance, I want to add spatial querying.

  • Add a new querystring field for the rest API: ``/records/?pin=,
  • Apply the parsed values to the search engine query
  • Format the response: include the pin in the search next, self, prev links as a minimum.

The developer should not have to overwrite the service nor the resource methods. Just provide options in the configs.

Design thoughts

URL parsing is already done in Flask-Resources. The search engine should have a query evaluator/interpreter that is passed all arguments (q, sort, page, pin, size, communities, ...). It will then "interpret/evaluate" the parameters by creating the Elasticsearch DSL object.

review response making functions

problem

It is not clear how the reponse building should treat links. The problem arises from the fact the needed attributes are not available in the places where it is needed (controller?)

context

@fenekku

I think these methods' responsibility is to wrap things up in Flask responses... but they also seem to be serializing with that links and prettyprint is that normal?

@ppanero prettyprint is part of the flask reponse. It specifies the tabbing.

The links, I assumed it would be part of the response no matter the type of serialization. Do we think that sort of answer (no links) would be used/needed?

Once this question is answered, we get into the separation of concerns. There we need to ask ourselves: Are links part of the record that goes in the response, or part of the response per se. IMHO, the second one, since is part of the REST paradigm.

python compatibility issues with dates coming from ES

See api.py#81

metadata = record_hit["_source"]
revision_id = record_hit["_version"]
# TODO: Py <3.7 cannot handle : in the offset.
# Cannot use %Y-%m-%dT%H:%M:%S.%f%z
updated = parser.parse(metadata["_updated"])
created = parser.parse(metadata["_created"])

responses: support for total and aggregations

old code:

        return json.dumps(
            dict(
                hits=dict(hits=serialized_records_list, total=total),
                # TODO FIX global links
                links="",
                aggregations=aggregations,
            ),
            **self._format_args(response_ctx.get("prettyprint", False))
        )

tests: fix search tests

Current
Tests are not being cleaned correctly: indexes are not created/deleted properly. A fake index must be setup and base_app must be used rather than app in the fixtures.

Previous
The get (search) endpoint is failing because the fetcher can't retrieve 'recid' key from returned data:

invenio_resources/service/service.py:158: in search
    pid = cls.fetcher()(data=record, record_uuid=None)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

record_uuid = None
data = {'_access': {'metadata_restricted': False, 'files_restricted': False}, '_owners': [1], '_created_by': 1, 'access_right...d title', 'type': 'Other', 'lang': 'eng'}], 'publication_date': '2020-07-08', '_publication_date_search': '2020-07-08'}

    def recid_fetcher(record_uuid, data):
        """Legacy way to fetch a record's identifiers.
    
        :param record_uuid: The record UUID.
        :param data: The record metadata.
        :returns: A :data:`invenio_pidstore.fetchers.FetchedPID` instance.
        """
        pid_field = current_app.config['PIDSTORE_RECID_FIELD']
        return FetchedPID(
            provider=RecordIdProvider,
            pid_type=RecordIdProvider.pid_type,
>           pid_value=str(data[pid_field]),
        )
E       KeyError: 'recid'

.venv/lib/python3.6/site-packages/invenio_pidstore/fetchers.py:67: KeyError

Needs investigation.

Per instance configuration of Resource and Service

As a developer installing InvenioRDM, I should be able to configure certain aspects of the Resource and Service without jumping through hoops. For instance, it should be possible and documented how to change the sorting or faceting of bibliographic records (this module should provide the capacity to do so).

Describe the solution you'd like

In an instance's invenio.cfg, it should be possible to do:

class MyBibliographicRecordResourceConfig(BibliographicRecordResource):
    # my overrides

RDM_RECORDS_BIBLIOGRAPHIC_RECORD_CONFIG = MyBibliographicRecordResourceConfig

Additional context

This is related to #92 .

Should be able to configure permission back as well.

Search serialization returns None for "updated" and "created" fields

Package version (if known): 0.4.X

Describe the bug

When performing a search on /rdm-records the returned hits have null values for updated and created fields.

Steps to Reproduce

  1. /rdm-records
  2. Look at json results
  "created": null,
  "updated": null,

Expected behavior

These should have the UTC string of the respective updated nad created fields.

Additional context

This breaks the search because there is string manipulation of the dates but they are javascript null at this point. The use of lodash's _.get() was not enough to prevent this problem because lodash tests on undefined (not a present field of value null).

Fix new links generation

Package version (if known): master (0.6.x)

Describe the bug

New links generation

  • doesn't account for aggregation filters,
  • doesn't insert the "/api" prefix,
  • self_html
  • publish
  • should fix pagination to not generate a next link if no next
    (and I suspect doesn't urlencode urlencodes differently)

Steps to Reproduce

  1. Un skip the endpoint link tests in test_pagination/test_resource_pagination or test_faceting

Expected behavior

Generated links always have: page, size, sort. Optionally have aggregations if passed in the first place

evaluate black

Add black to the run-tests.sh so it is checked or remove black linting.

  • Does black substitute PEP-8?
  • What are the main 0 config changes?
  • How do I configure my IDE to use black?

Allow for per instance configuration of ResourceConfig / ServiceConfig

Our approach so far makes it very easy for a module like invenio-rdm-records to define its own BibliographicResourceConfig or BibliographicServiceConfig that (eventually) inherit from the config defined here in records-resources. However this approach doesn't take into account the use case where an instance might want to further customize these configuration objects.

For instance, changing the API urls is not possible per instance.

This is tricky to accomplish because .as_blueprint() is called at the top-level in views.py outside of an application context...

Potential solutions:

  • make sure nothing in configs and services' constructors is relying on the current_app (might sometimes be tricky)
  • place blueprint registration in ext.py of invenio-rdm-records (like invenio-records-rest). To avoid automatic endpoint registration, we may want to have a whitelist only policy that is enabled via entry_points... (doesn't feel like we are improving much on records-rest here, but maybe that's the way to go with our constraints)

EDIT

  • I think providing a callable blueprint is actually the way to go, this way a ResourceConfig can be defined in the config.py and fetched with an application context. (I think).

Pass UI routes to Service

The service layer needs to know about the UI routes in order to provide links to HTML page of records.

Right now it assumes the UI route is the API route without the api/ part which is obviously not always the case.

Related to #45

Data flow design: Service -> Resource -> Serializer -> Response

Currently invenio_records_resources.resources.record:RecordResource has two functions:

    def _make_item_body(self, item_result):
        """Make the body content."""
        # Resolve links
        item_result.links.resolve(config=self.config.links_config)
        # Create the result
        res = item_result.record
        # Add errors if present
        if item_result.errors:
            res['errors'] = item_result.errors
        return res

    def _make_list_body(self, list_result):
        """Make the body content for a list item."""
        return {
            "hits": {
                "hits": [
                    self._make_item_body(record)
                    for record in list_result.records
                ],
                "total": list_result.total
            },
            "links": list_result.links,
            "aggregations": list_result.aggregations
        }

Which are used for instance like this:

    def read(self):
        """Read an item."""
        item = self.service.read(
            id_=resource_requestctx.route["pid_value"],
            identity=g.identity
        )
        return self._make_item_body(item), 200

Key questions arising from this:

  • Question 1: Why is the read() resource method not returning a response?
  • Question 2: Why cannot the service result be returned almost directly (i.e. why do the _make_item/list_body need to transform the data again?

Resources return values (question 1)

Currently we don't return a response in read() because we want to avoid repeating two lines that do the actual serialization in flask_resources.responses.Response.

Perhaps it makes sense that in-fact a resource has more control over the serialization?

Service results (question 2)

If the service results (RecordItem, RecordList) looked differently, they could perhaps be returned directly instead? Only the link resolving needs to be run by the resource (because the service doesn't know about the URLs).

RecordItem
A record item could likely be reduced to simply:

  • record project
  • links
  • errors (if needed)

RecordList

Would have to mimick the structure that _make_list_body creates.

permissions: search

How to filter fields according to permissions.

A user might have access to restricted files or extension metadata, but other might not. however, the second might still have access to metadata. How do we filter that without leaking info (e.g. checksums, etc.).

serializers: support refs

old code:

    def process_metadata(self, record, response_ctx, *args, **kwargs):
        return (
            copy.deepcopy(record.replace_refs())
            if self.replace_refs
            else record.dumps()
        )

Dynamic resource/service config loading without app context

The current ConfigLoaderMixin does not allow for running outside the Flask application context because it is accessing the Flask application configuration.

This means you have to do this:

r = Resource(config=ResourceConfig, service=Service(config=ServiceConfig))
app.register_blueprint(r.as_blueprint('records'))

but the config is already defined as default config:

class Resource:
     default_config = ResourceConfing
class Service:
     default_config = ServiceConfing

Instead, you should just do like this:

r = Resource(service=Service())
app.register_blueprint(r.as_blueprint('records'))

search: support filters (faceting)

Zenodo supports aggregations (done), filters, post_filters (done). Filters are left to support.

Question:

What is the use case for filtering? It seems to be used for communities, but it is unclear to me what the general problem it is trying to solve.

service: filtering of record metadata according to permissions

Following user-stories are basically about filtering record metadata according to what a given identity can access.

User stories

As an uploader/admin, I want to see the full metadata including files of a closed access record
As a visitor, I want to see the full metadata excluding files of a closed access record

Example

When getting a record:

GET /record/12345-12345 HTTP/1.1

The service layer will do the following actions:

  1. Resolve ID to a record
  2. Check permission to record
  3. Create a record state object

The presentation layer will then afterwards:

  1. Serialize the record state out to the user.

Considerations

  • The presentation layer (i.e. the serializer in this case), should not know anything about permissions. The sole job should be to seralize a record.
  • The service layer (either in the a) service or b) record state / resource unit) needs to filter the metadata.
  • This also has to be implemented for the search engine.
  • Until we have a proper query parser, the indexed records need to have all restricted fields removed.

Examples of filtering metadata

  • Remove _files from the metadata (closed access record - meaning a record that's publicly visible, but where the files are restricted).
  • Remove internal notes.

services: syntactic sugar

Syntactic sugar: isn't it nicer to have minter, data_validator as properties in the RecordService class so you can write e.g self.data_validator.validate(data)?

validation and marshmallow

@fenekku:

Serialization / Deserialization / Marshmallow

It dawns on me that marshmallow is both a record model (the fields + validation), a record serializer (json.py!), a loader (expects this shape)... There is a murkiness here, we might want to clear up by decree. We can say: here is what marshmallow will do for us and the things it won't.

ext: Register application error handler for HTTPExceptions

When we get rid of the dependency to invenio-rest we should apply an application errorhandler for the HTTPException. E.g

from flask_resources.errors import handle_http_exception
from werkzeug.exceptions import HTTPException

def init_app():
    ...
    app.register_error_handler(HTTPException, handle_http_exception)
    ...

Add back sorting to search

The search URL should accept sort as a URL args and sort results appropriately.

Example: https://localhost/api/records/?sort=mostrecent&size=10&page=2

[Discussion] Meaning of "sort" and "order" in search URLs

Is your feature request related to a problem? Please describe.

The search URLs with sorting are now:

Browser: https://127.0.0.1:5000/search?q=Richard%20smith&sort=bestmatch&order=asc&p=1&s=10&l=list
API: https://127.0.0.1:5000/api/rdm-records?q=Richard%20smith&sort=bestmatch&page=1&size=10

Note how this is different from what it used to be:

Browser: https://127.0.0.1:5000/search?q=Richard%20smith&sort=bestmatch&order=desc&p=1&s=10&l=list
API: https://127.0.0.1:5000/api/rdm-records?q=Richard%20smith&sort=-bestmatch&page=1&size=10

In other words: the meaning of sort=bestmatch on the API changed from meaning "lowest best match to highest best match" to mean "highest best match to lowest best match" just like sort=mostrecent means "highest updated date to lowest updated date".

Do we keep this change, or try to reproduce what Invenio-records-rest had?

I made those changes (target me ๐ŸŽฏ) because ascending / descending logic didn't seem to make sense for both mostrecent and bestmatch. My take is that bestmatch and mostrecent imply an ordering, so "order" should be used to reverse that order (or keep it the same)... The goal was also to get something working for the release quickly. But I don't mind trying to keep same logic as invenio-records-rest (some more work in records-resources needed to achieve that though).

re-enable search preference

Package version (if known): v0.6.2

Describe the bug

Search preference is not enabled and it could potentially lead to bouncing effect.

Steps to Reproduce

  1. see services/records/service.py::search()

Additional context

It was disabled due to the lack of user-agent and ip related information. This info should be extracted and set in the resource_requestctx, then the resource should pass it to the service.

The old search object uses the request object https://github.com/inveniosoftware/invenio-search/blob/81f4d23a1f24fe0d1c2fb7b1a9c64fd7870daed3/invenio_search/api.py#L150 it should be migrated in a way that is not accessed and just passed https://github.com/inveniosoftware/invenio-search/blob/81f4d23a1f24fe0d1c2fb7b1a9c64fd7870daed3/invenio_search/api.py#L270

Values should be accessed at the resource level and passed to the service search? (maybe is the service who calculates the preference (hash) based on what the resource passes.

Why:

Implement stubs for REST API endpoints

In Invenio-Records-Resources and Invenio-Drafts-Resources.

  • /records: GET, POST
  • /records/:id: GET, DELETE
  • /records/:id/versions GET, POST
  • /records/:id/files: GET
  • /records/:id/files/:key GET
  • /records/:id/files/:key/download GET
  • /user/records GET
  • /records/:id/draft: GET, POST, PUT, DELETE
  • /records/:id/draft/actions/:action: POST
  • /records/:id/draft/files GET, POST, PUT, DELETE
  • /records/:id/draft/files/:key: GET, PUT, DELETE
  • /records/:id/draft/files/:key/download GET

service: indexing enrichments

Currently, with an index receiver, you're able to enrich/filter a record prior to having it indexed in Elasticsearch.

Currently, this is being used to e.g.:

  • Remove restricted fields (see #41)
  • Enrich the record with versioning relationships
  • Enrich the record with usage statistics.
  • Compute fields like file count/size to make them searchable.
  • Massage geolocations points from one format into the mapping format.

Extract link generation and pagination from serializer logic as much as possible

Generating the data that will be serialized in the case of pagination or hypermedia API links (link generation) is business logic. How this data is serialized is serializer logic. This was apparent by XML serializer wanting to reuse business logic used by JSON serializer.

Where to place this business logic is not clear cut. Either in the resource unit or in the service...

service+resource: Elasticsearch completion suggester

Migrate the existing Elasticsearch completion suggester from Invenio-Records-REST:

class SuggestResource(MethodView):
    """Resource for records suggests."""

    view_name = '{0}_suggest'

    def __init__(self, suggesters, search_class=None, **kwargs):
        """Constructor."""
        self.suggesters = suggesters
        self.search_class = search_class

    def get(self, **kwargs):
        """Get suggestions."""
        completions = []
        size = request.values.get('size', type=int)

        for k in self.suggesters.keys():
            val = request.values.get(k)
            if val:
                # Get completion suggestions
                opts = copy.deepcopy(self.suggesters[k])
                # Context suggester compatibility adjustment
                if 'context' in opts.get('completion', {}):
                    context_key = 'context'
                elif 'contexts' in opts.get('completion', {}):
                    context_key = 'contexts'
                else:
                    context_key = None

                if context_key:
                    ctx_field = opts['completion'][context_key]
                    ctx_val = request.values.get(ctx_field)
                    if not ctx_val:
                        raise SuggestMissingContextRESTError
                    opts['completion'][context_key] = {
                        ctx_field: ctx_val
                    }

                if size:
                    opts['completion']['size'] = size

                completions.append((k, val, opts))

        if not completions:
            raise SuggestNoCompletionsRESTError(
                ', '.join(sorted(self.suggesters.keys())))

        # Add completions
        s = self.search_class()
        for field, val, opts in completions:
            source = opts.pop('_source', None)
            if source is not None and ES_VERSION[0] >= 5:
                s = s.source(source).suggest(field, val, **opts)
            else:
                s = s.suggest(field, val, **opts)

        if ES_VERSION[0] == 2:
            # Execute search
            response = s.execute_suggest().to_dict()
            for field, _, _ in completions:
                for resp in response[field]:
                    for op in resp['options']:
                        if 'payload' in op:
                            op['_source'] = copy.deepcopy(op['payload'])
        elif ES_VERSION[0] >= 5:
            response = s.execute().to_dict()['suggest']

        result = dict()
        for field, val, opts in completions:
            result[field] = response[field]

        return make_response(jsonify(result))

Service API

Suggestion for service API could look like this:

grant_service.suggest('<value>', 'suggest-conf-name', context={'funder': '...'})

revision_id value is incorrect from creation

Package version (if known): latest master - v0.6.0

Describe the bug

When creating a record, the record.commit operation sets the record.model.version_id to 2, consequently the record.revision_id's value is 1. The increment happens when merging the session (https://github.com/inveniosoftware/invenio-records/blob/master/invenio_records/api.py#L408). There is another operation performing potential operations over the record.

Steps to Reproduce

  1. Run the tests/services/test_service.py::test_simple_flow. Adding assert item.revision_id == 0 after the creation line 38. Can be seen in the travis build of #112

Expected behavior

revision_id should be 0.

Additional context

Tried deactivating components. For example the AccessComponent performas a record.update operation. My thinking was: This might flag as modified but not commit, then when merging there are two operations that are flushed. It was not the case.

responses: support prettyprint

It is part of json.dumps but the serializer should not care about the response formatting
old code:

    @staticmethod
    def _format_args(prettyprint=False):
        """Get JSON dump indentation and separates."""
        if prettyprint:
            return dict(indent=2, separators=(", ", ": "),)
        else:
            return dict(indent=None, separators=(",", ":"),)

serializers: common interface to serialize drafts, records, tombstone pages, etc.

Example of implementation in flask-resources, it was removed from there because no default class needed to implement it.

class SerializableMixin:
    """Serializable Interface.

    Objects that will be serialized must implement it.
    """

    @property
    def object():
        """Returns the object itself."""
        raise NotImplementedError()

    @property
    def id():
        """Returns the object id."""
        raise NotImplementedError()

    @property
    def version():
        """Returns the object version or revision."""
        raise NotImplementedError()

    @property
    def last_modified():
        """Returns the date of the last modification."""
        raise NotImplementedError()

global: validate all kind of config classes

Is your feature request related to a problem? Please describe.

In the new implementation of resources,services, etc. there are a lot of config classes passed that they require specific attributes to be set. It would be useful to validate these classes and throw appropriate errors to users and avoid having implicit errors that can distract us from finding the source of the problem.

Describe the solution you'd like

A solution could be a basic metaclass that these config classes are inheriting from. That metaclass has access to all attributes of the subclasses, thus it could run validation on them

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.