
curies's Introduction

curies


Idiomatic conversion between URIs and compact URIs (CURIEs).

>>> import curies

>>> converter = curies.load_prefix_map({
...     "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
...     # ... and so on
... })

>>> converter.compress("http://purl.obolibrary.org/obo/CHEBI_1")
'CHEBI:1'

>>> converter.expand("CHEBI:1")
'http://purl.obolibrary.org/obo/CHEBI_1'

Full documentation is available at curies.readthedocs.io.

CLI Usage

This package comes with a built-in CLI for running a resolver web application or an IRI mapper web application:

# Run a resolver
python -m curies resolver --host 0.0.0.0 --port 8764 bioregistry

# Run a mapper
python -m curies mapper --host 0.0.0.0 --port 8764 bioregistry

The positional argument can be one of the following:

  1. A pre-defined prefix map to get from the web (bioregistry, go, obo, monarch, prefixcommons)
  2. A local file path or URL to a prefix map, extended prefix map, or one of several other formats (requires specifying --format)

The framework can be swapped to use Flask (default) or FastAPI with --framework. The server can be swapped to use Werkzeug (default) or Uvicorn with --server. These functionalities are also available programmatically, see the docs for more information.

πŸ§‘β€πŸ€β€πŸ§‘ Related

Other packages that convert between CURIEs and URIs:

πŸš€ Installation

The most recent release can be installed from PyPI with:

$ pip install curies

This package currently supports both Pydantic v1 and v2. See the Pydantic migration guide for updating your code.

πŸ‘ Contributing

Contributions, whether filing an issue, making a pull request, or forking, are appreciated. See CONTRIBUTING.md for more information on getting involved.

πŸ‘‹ Attribution

πŸ™ Acknowledgements

This package heavily builds on the trie data structure implemented in pytrie.

βš–οΈ License

The code in this package is licensed under the MIT License.

πŸͺ Cookiecutter

This package was created with @audreyfeldroy's cookiecutter package using @cthoyt's cookiecutter-snekpack template.

πŸ› οΈ For Developers

See developer instructions

This final section of the README is for those who want to get involved by making a code contribution.

Development Installation

To install in development mode, use the following:

$ git clone https://github.com/cthoyt/curies.git
$ cd curies
$ pip install -e .

πŸ₯Ό Testing

After cloning the repository and installing tox with pip install tox, the unit tests in the tests/ folder can be run reproducibly with:

$ tox

Additionally, these tests are automatically re-run with each commit in a GitHub Action.

πŸ“– Building the Documentation

The documentation can be built locally using the following:

$ git clone https://github.com/cthoyt/curies.git
$ cd curies
$ tox -e docs
$ open docs/build/html/index.html

The documentation build automatically installs the package as well as the docs extra specified in setup.cfg. Sphinx plugins like texext can be added there. Additionally, they need to be added to the extensions list in docs/source/conf.py.

πŸ“¦ Making a Release

After installing the package in development mode and installing tox with pip install tox, the commands for making a new release are contained within the finish environment in tox.ini. Run the following from the shell:

$ tox -e finish

This script does the following:

  1. Uses Bump2Version to switch the version number in the setup.cfg, src/curies/version.py, and docs/source/conf.py to not have the -dev suffix
  2. Packages the code in both a tar archive and a wheel using build
  3. Uploads to PyPI using twine. Be sure to have a .pypirc file configured to avoid the need for manual input at this step
  4. Pushes to GitHub. You'll need to make a release on GitHub going with the commit where the version was bumped.
  5. Bumps the version to the next patch. If you made big changes and want to bump the version by minor, you can use tox -e bumpversion minor afterwards.

curies's People

Contributors

cthoyt, cmungall, hrshdhgd, sneakers-the-rat, matentzn, vemonet


curies's Issues

`Converter.prefixmap` should be a bimap

Right now, the prefix map in the converter object (converter.prefix_map) is not a bijection: a record's preferred prefix and each of its prefix synonyms all map to the same URI prefix, which makes round-tripping ambiguous (or rather, for those who like splitting hairs, order-dependent). In my opinion, this test should pass (but it does not):

def test_bimap(self):
    epm = [
        {
            "prefix": "Orphanet",
            "prefix_synonyms": ["orphanet.ordo"],
            "uri_prefix": "http://www.orpha.net/ORDO/Orphanet_",
        }
    ]
    converter = Converter.from_extended_prefix_map(epm)
    self.assertIn("Orphanet", converter.prefix_map)
    self.assertNotIn("orphanet.ordo", converter.prefix_map)

This is important because otherwise I cannot control, as a user, which prefix (not URI prefix) should be used in SSSOM. Right now, both are included in the exported CURIE map, e.g.

#   Orphanet: http://www.orpha.net/ORDO/Orphanet_
#   orphanet.ordo: http://www.orpha.net/ORDO/Orphanet_

but only the second one, the one I do not want, determines which prefix is used during compression. So there are two issues here:

  1. I want a prefix map that is a bimap to ship with my data asset (i.e., the SSSOM file)
  2. I want to be certain that the "prefix", not the "prefix_synonyms", gets to dictate the prefix during compression.

Is this an implementation issue with the prefix map, or do we need a special extension, converter.bimap, to cover this?
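The requested bijection can be sketched in plain Python by keeping only each record's preferred prefix and ignoring all synonyms. This is a stdlib-only illustration of the desired semantics, not the library's implementation:

```python
# Hypothetical sketch (not the library's API): derive a bijective
# prefix map from an extended prefix map by keeping only each
# record's preferred prefix and dropping all prefix synonyms.
epm = [
    {
        "prefix": "Orphanet",
        "prefix_synonyms": ["orphanet.ordo"],
        "uri_prefix": "http://www.orpha.net/ORDO/Orphanet_",
    }
]

bimap = {record["prefix"]: record["uri_prefix"] for record in epm}
```

Under this construction, only "Orphanet" appears as a key, so the exported map is unambiguous in both directions.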

See mapping-commons/sssom-py#469

curies converter returns None instead of ignoring when it does not match anything

Hi, not sure if this is a bug or a feature, but it is something that I thought I should write here.

I am running bioregistry_converter.pd_compress and it produces None when it does not match anything. That is probably not ideal (unless it is exactly what you intended); I would expect it to skip the cell and leave the URL in place when nothing matches.

Something similar happens with bioregistry_converter.pd_expand.
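The requested behavior can be sketched as a passthrough wrapper in plain Python; the names here are illustrative, not the library's API:

```python
# Hedged sketch: leave unmatched URIs unchanged instead of returning
# None. `compress` stands in for any converter.compress-style callable.
def compress_or_passthrough(compress, uri):
    curie = compress(uri)
    return curie if curie is not None else uri


# Toy compressor over a one-entry prefix map, for illustration only:
prefix_map = {"CHEBI": "http://purl.obolibrary.org/obo/CHEBI_"}


def toy_compress(uri):
    for prefix, uri_prefix in prefix_map.items():
        if uri.startswith(uri_prefix):
            return f"{prefix}:{uri[len(uri_prefix):]}"
    return None
```

With this wrapper, matched URIs become CURIEs and everything else passes through untouched.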

Updating `monarch_context`

Hey @cthoyt, thanks for putting all of this together. I think for now I will pass prefix_map manually. I tried out the monarch_context, but as expected, some of it is out of date.

All I really checked for was to see if the OMIMPS (OMIM Phenotypic Series) prefix appeared, but it doesn't; I didn't really check for anything else.

I could place the onus on @matentzn to help with this at some point.

Prefix reconciliation with transitive mappings

I don't think we've yet solved the simultaneous mapping issue, which applies when we need to simultaneously remap geo to become ncbi.geo and geogeo to become geo:

    def test_simultaneous(self):
        """Test simultaneous remapping"""
        records = [
            Record(prefix="geo", uri_prefix="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc="),
            Record(prefix="geogeo", uri_prefix="http://purl.obolibrary.org/obo/GEO_"),
        ]
        converter = Converter(records)
        curie_remapping = {"geo": "ncbi.geo", "geogeo": "geo"}
        converter = remap_curie_prefixes(converter, curie_remapping)
        self.assertEqual(
            [
                Record(
                    prefix="ncbi.geo",
                    uri_prefix="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=",
                ),
                Record(
                    prefix="geo",
                    prefix_synonyms=["geogeo"],
                    uri_prefix="http://purl.obolibrary.org/obo/GEO_",
                ),
            ],
            converter.records,
        )
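The core of "simultaneous" semantics can be sketched in a few lines: every old-to-new pair is resolved against the original state in a single pass, so a chain like geo β†’ ncbi.geo, geogeo β†’ geo does not interfere with itself. This is a simplified stdlib sketch over plain prefix maps, not the library's record-based implementation:

```python
# Sketch: apply all remappings against the *original* prefix map at
# once, so that "geo" -> "ncbi.geo" and "geogeo" -> "geo" don't clash.
def remap_simultaneously(prefix_map, remapping):
    return {
        remapping.get(prefix, prefix): uri_prefix
        for prefix, uri_prefix in prefix_map.items()
    }
```

A sequential implementation would first free the name "geo" and then reassign it; the one-pass dict comprehension gets the same result without intermediate states.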

What to do when my preferred capitalization is different from the data and my prefix map has no synonyms?

From #63

I have a similar issue. Here's my code:

import pyobo
from curies import get_obo_converter

obo_converter = get_obo_converter()
df = pyobo.get_sssom_df("umls", names=False)

# Remove bananas
df["object_id"] = df["object_id"].apply(lambda x: ":".join(x.split(":")[-2:]) if str(x).count(":") > 1 else x)

obo_converter.pd_standardize_curie(df, column="object_id")
print(df)

This turns the object_id column into all Nones.

Questions:

  1. I commented out the code for banana removal and ran the code above. It yielded the same results. (I understand the banana removal process may not be accurate, but I am trying to get this working for now.)
  2. I know all rows in the object_id column are CURIEs.
  3. In a situation where some, but not all, elements in the column of interest are legitimate CURIEs, would this truncate all of them to Nones?

Any suggestions would be super valuable!

  • I'd like to add: bioregistry_converter.pd_standardize_curie(df, column="object_id") does yield results, but I specifically need OBO format for prefixes.
  • get_monarch_converter() returns Nones as well.
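The capitalization mismatch described above can be worked around with a case-insensitive lookup table; this is a hypothetical stdlib sketch, not the converter's actual standardization logic:

```python
# Hedged sketch: rescue CURIEs whose prefix capitalization differs
# from the preferred form by mapping casefolded prefixes back to the
# preferred spelling (names here are illustrative).
prefix_map = {"GO": "http://purl.obolibrary.org/obo/GO_"}
casefolded = {prefix.casefold(): prefix for prefix in prefix_map}


def standardize_curie(curie):
    prefix, _, identifier = curie.partition(":")
    preferred = casefolded.get(prefix.casefold())
    return f"{preferred}:{identifier}" if preferred else None
```

This only helps when the mismatch is purely one of case; prefixes absent from the map still come back as None.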

Semantic prefix requirements for OAK and other libraries

  1. add semweb prefixes (owl, skos, dcterms, oio, etc)
  2. distribute prefixes as alongside python and use these by default (makes behavior deterministic for any given version)
  3. when prefixes are retrieved by network use PURLs rather than raw github URLs
  4. remove dependency on repos in prefixcommons org (the br/curies stack should be able to replace them when these requirements are satisfied)
  5. ensure preferred prefixes are used even for non-OBO namespaces (e.g. FlyBase)
  6. ensure preferred expansions are used for any semantic context
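Point 1 amounts to bundling a small, version-pinned map of Semantic Web prefixes. The exact constant below is an assumption for illustration, not something the library ships, although the URI prefixes themselves are the standard ones:

```python
# Sketch of a default Semantic Web prefix map that could be bundled
# with the package and merged into any converter (hypothetical name).
SEMWEB_PREFIXES = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "dcterms": "http://purl.org/dc/terms/",
}
```

Distributing such a map alongside the Python code (point 2) makes behavior deterministic for any given release, with no network fetch involved.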

Reconciliation based on synonym doesn't work as expected

r1 = Record(
    prefix="geo",
    prefix_synonyms=["GEO"],
    uri_prefix="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=",
    pattern="^G(PL|SM|SE|DS)\\d+$",
)
r2 = Record(
    prefix="geogeo", uri_prefix="http://purl.obolibrary.org/obo/GEO_", pattern="^\\d{9}$"
)
c1 = Converter([r1, r2])
remapping = {"GEO": "ncbi.geo", "geogeo": "GEO"}
c2 = remap_curie_prefixes(c1, remapping)

r3 = Record(
    prefix="ncbi.geo",
    uri_prefix="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=",
    pattern="^G(PL|SM|SE|DS)\\d+$",
)
r4 = Record(
    prefix="GEO",
    prefix_synonyms=["geo", "geogeo"],
    uri_prefix="http://purl.obolibrary.org/obo/GEO_",
    pattern="^\\d{9}$",
)
self.assertEqual([r4, r3], c2.records)

does not retain geo as a synonym in r4

How does `curies` handle URI prefix orderings?

I tried to understand it from the code but I was not sure.

'GO': 'http://purl.obolibrary.org/obo/GO_'
'obo': 'http://purl.obolibrary.org/obo/'

How can I parse http://purl.obolibrary.org/obo/GO_123 to:

GO:123

vs

obo:GO_123?

I am not saying I want to; I just need to know how this is handled. Thanks :)

(My preference would be sorting the prefix map so that the longest match always gets precedence, i.e., obo:GO_123 never happens.)
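Longest-match-wins compression can be illustrated with a naive stdlib sketch over a plain prefix map (the library itself uses a trie, which achieves the same result more efficiently):

```python
# Sketch: compress a URI by choosing the *longest* matching URI
# prefix, so "GO:123" is preferred over "obo:GO_123".
prefix_map = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "obo": "http://purl.obolibrary.org/obo/",
}


def compress_longest(uri):
    best = None
    for prefix, uri_prefix in prefix_map.items():
        if uri.startswith(uri_prefix):
            if best is None or len(uri_prefix) > len(best[1]):
                best = (prefix, uri_prefix)
    if best is None:
        return None
    prefix, uri_prefix = best
    return f"{prefix}:{uri[len(uri_prefix):]}"
```

URIs under obo/ that don't match a more specific prefix still fall back to the shorter obo expansion.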

Feature requests for `curies.chain()`

  1. priority is preserved
  2. chain should be case-insensitive when composing (so no "go" prefix leaks in when "GO" is present)
  3. add strictness modes (some prefixes can be expected to chain with no clashes, in which case we want to throw an error if there is one, otherwise priority order is preserved)
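Points 1 and 2 can be sketched together: a first-wins, case-insensitive merge over plain prefix maps. This is a simplified illustration of the requested semantics; curies.chain itself operates on Converter objects, not raw dicts:

```python
# Sketch: chain prefix maps in priority order; the first map to
# claim a prefix (compared case-insensitively) wins, so a lowercase
# "go" in a later map cannot leak in alongside "GO".
def chain_prefix_maps(prefix_maps):
    result = {}
    seen = set()
    for prefix_map in prefix_maps:
        for prefix, uri_prefix in prefix_map.items():
            if prefix.casefold() in seen:
                continue  # a higher-priority map already claimed this prefix
            seen.add(prefix.casefold())
            result[prefix] = uri_prefix
    return result
```

A strict mode (point 3) would raise on the `continue` branch instead of silently skipping.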

Hashing an EPM

It might be worth having a way to hash an EPM so they can be quickly indexed / compared for equality

import hashlib
import itertools as itt
from operator import attrgetter


def get_hash(self) -> str:
    """Return a deterministic hash of this extended prefix map."""
    h = hashlib.md5()
    for record in sorted(self.records, key=attrgetter("prefix")):
        h.update(record.prefix.encode("utf8"))
        h.update(record.uri_prefix.encode("utf8"))
        for s in itt.chain(sorted(record.prefix_synonyms), sorted(record.uri_prefix_synonyms)):
            h.update(s.encode("utf8"))
    return h.hexdigest()

Exporters to support Monarch use cases

There exist some ecosystem-defined formats for prefix maps. It would be nice to have curies exporters for those.

Context: This came up originally in my work in mondo-ingest: monarch-initiative/mondolib#8

CURIEs that include square brackets don't conform to W3C specs and are incompatible with semantic web tools

There are some cases of bioregistry "CURIEs" allowing square brackets in the local id. This is questionable if we follow the (IMO frustratingly opaque) W3C specs.

Here are some examples of what is permitted in bioregistry

(it is of course a stretch to call these IDs (biopragmatics/bioregistry#460))

These work perfectly well in the context of bioregistry; clicking on this will resolve to a nice picture of a molecule, which is what most bioregistry users want.

https://bioregistry.io/reference/smiles:CC(=O)NC([H])(C)C(=O)O

Let's see what happens when we try and use this with tooling that actually supports W3C specs:

{
  "@context": {
    "@base": "http://example.org",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "smiles": "https://bioregistry.io/smiles:"
  },
  "@id": "smiles:CC(=O)NC([H])(C)C(=O)O",
  "@type": "Molecule",
  "rdfs:label": "Acetaminophen"
}

using Jena:

riot --strict smiles.jsonld
16:33:06 WARN  riot            :: Bad IRI: <https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
16:33:06 WARN  riot            :: Bad IRI: <https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
<https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Molecule> .
<https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> <http://www.w3.org/2000/01/rdf-schema#label> "Acetaminophen" .

Not pretty, but it does process it, even in strict mode.

However, it refuses to validate it:

riot --validate smiles.jsonld || echo fail
16:38:10 WARN  riot            :: Bad IRI: <https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
16:38:10 WARN  riot            :: Bad IRI: <https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
fail

In contrast, https://json-ld.org/playground/ does not complain

I suspect the rust toolchains are stricter

Removing or escaping the []s allows it to validate (note that ()s are frequently URL encoded but they are still valid)

What are our options?

  1. Make curies always strict. Forbid [] or encodings thereof. These are poor choices for bona-fide IDs. Don't try and overload the CURIE concept for languages like HGVS, UCUM, SMILES, InChi, etc
  2. Go your own way. Explicitly document that curies isn't for CURIEs as defined by W3C specs; it's just prefixed IDs that expand to URLs that work in browsers, with no commitments to any specifications outside those in this repo.
  3. Make curies conform to W3C specs, and force []s to be encoded (as the UOM people are doing for UCUM, biopragmatics/bioregistry#648). This could retroactively break things, and confuse people who want to use curies in its intended YOLO fashion
  4. Attempt some formalization where we have loose CURIEs and strict CURIEs and a formal mapping between them (basically URL encoding []s, probably spaces while we are at it)

I think these are all horrible but then I've always said the decision to couple identifiers to networking protocols was a terrible one.

I think 4 is likely the most practical, but this will take some careful planning. There will essentially be the following transforms:

 looseCURIE <-> strictCURIE
     ^       \ /      ^
     |        X       |
     v       / \      v
 looseURI   <-> strictURI

(likely implemented with flags on existing expand/contract, with new methods for like-to-like)

What is annoying is that there is, AFAICT, no way to get JSON-LD contexts to specify the diagonal conversion.
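The looseCURIE β†’ strictCURIE leg of option 4 can be sketched with stdlib percent-encoding; this is a hypothetical transform, not a proposed API:

```python
from urllib.parse import quote

# Hedged sketch: percent-encode characters like [ and ] that violate
# IRI grammar, while leaving parentheses and = alone (parens are
# valid in URIs, even if they are frequently encoded anyway).
def loose_to_strict(curie):
    prefix, _, identifier = curie.partition(":")
    return f"{prefix}:{quote(identifier, safe='()=')}"
```

The reverse (strict β†’ loose) would be a plain unquote, and the "diagonal" conversions compose these with ordinary expand/contract.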

Tests for remote federated services

I've excerpted this from #49; this code tests that public SPARQL interfaces are able to use the Bioregistry. I'm not sure this belongs in this package, though.

"""Tests for remote federated SPARQL."""

from textwrap import dedent

from curies.mapping_service import _handle_header
from tests.test_federated_sparql import FederationMixin

BIOREGISTRY_SPARQL_ENDPOINT = "http://bioregistry.io/sparql"


class TestPublicFederatedSPARQL(FederationMixin):
    """Test the identifier mapping service."""

    def setUp(self) -> None:
        """Set up the public federated SPARQL test case."""
        self.sparql = dedent(
            f"""\
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        SELECT DISTINCT ?o WHERE {{
            SERVICE <{BIOREGISTRY_SPARQL_ENDPOINT}> {{
                <http://purl.obolibrary.org/obo/CHEBI_24867> owl:sameAs ?o
            }}
        }}
        """.rstrip()
        )

    def query_endpoint(self, endpoint: str):
        """Query an endpoint."""
        self.assert_service_works(endpoint)

        accept = "application/sparql-results+json"
        resp = self.get(endpoint, self.sparql, accept=accept)
        self.assertEqual(
            200,
            resp.status_code,
            msg=f"SPARQL query failed at {endpoint}:\n\n{self.sparql}\n\nResponse:\n{resp.text}",
        )
        response_content_type = _handle_header(resp.headers["content-type"])
        self.assertEqual(accept, response_content_type, msg="Server sent incorrect content type")

        try:
            res = resp.json()
        except Exception:
            self.fail(msg=f"\n\nError running the federated query to {endpoint}:\n{resp.text}")
        self.assertGreater(
            len(res["results"]["bindings"]),
            0,
            msg=f"Federated query to {endpoint} gives no results",
        )
        self.assertIn(
            "https://bioregistry.io/chebi:24867",
            {binding["o"]["value"] for binding in res["results"]["bindings"]},
        )

    def test_public_federated_virtuoso(self):
        """Test sending a federated query to a public mapping service from Virtuoso."""
        self.query_endpoint("https://bio2rdf.org/sparql")

    def test_public_federated_blazegraph(self):
        """Test sending a federated query to a public mapping service from Blazegraph."""
        self.query_endpoint("http://kg-hub-rdf.berkeleybop.io/blazegraph/sparql")

    def test_public_federated_graphdb(self):
        """Test sending a federated query to a public mapping service from GraphDB."""
        self.query_endpoint("https://graphdb.dumontierlab.com/repositories/test")

Add model for list of records

Using Custom Root Types, Pydantic allows for defining models based on a derived Python type. As an extended prefix map is just a list of records, it can be defined with the following:

class Records(BaseModel):
    __root__: List[Record]

Ideally, additional metadata is added so that better JSON schemas / FastAPI endpoints can be generated using this.

Add is_curie() and is_uri() methods as helper functions

Compression and expansion are not the only use cases for curies; we sometimes just want to know whether a string is a CURIE or not. I keep writing wrappers around curies.parse... methods, and it would be nice if this were somehow natively supported.
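What such helpers could check can be sketched against a plain prefix map; this is a standalone, hypothetical illustration, while a real version would be methods on the Converter class:

```python
# Hypothetical sketch of is_uri()/is_curie() predicates over a
# converter-style prefix map (names are illustrative).
prefix_map = {"CHEBI": "http://purl.obolibrary.org/obo/CHEBI_"}


def is_uri(s: str) -> bool:
    return any(s.startswith(uri_prefix) for uri_prefix in prefix_map.values())


def is_curie(s: str) -> bool:
    prefix, _, identifier = s.partition(":")
    return bool(identifier) and prefix in prefix_map
```

Note that a URI like http://purl.obolibrary.org/obo/CHEBI_1 partitions to the prefix "http", which is not in the map, so the two predicates stay disjoint for this map.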

Identify places that could be using `curies`

URI/CURIE disambiguation

This package provides tools to handle CURIEs. But if I'm right, it starts from the assumption that what it's being provided is a CURIE and nothing else.

In cases where both URIs and CURIEs are accepted, some ambiguities might appear between URIs and CURIEs (sorry, the original description got this link wrong), unless so-called SafeCURIEs are used. This means that your package might get called to work on something that was wrongly supposed to be a CURIE but is actually a URI.

Now the question is whether you plan to provide some validation functionality that might "ring some bells" for users when what is being provided might be a URI wrongly supposed to be a CURIE.

See follow-up PRs in linkml-runtime:

Public API to build a converter incrementally

Hello,

As discussed in linkml/linkml-runtime#244, there are use cases where it makes sense to build a converter incrementally (e.g., to have an API where not all prefixes are known at creation time): for instance, by creating an empty converter and then adding the prefixes and schema with an add_prefix function.

Thanks
Frank
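The requested shape of such an API can be sketched in plain Python; class and method names here are illustrative, not the library's:

```python
# Hypothetical sketch: start from an empty converter and register
# prefixes one at a time, rejecting duplicates.
class IncrementalConverter:
    def __init__(self):
        self.prefix_map = {}

    def add_prefix(self, prefix: str, uri_prefix: str) -> None:
        if prefix in self.prefix_map:
            raise ValueError(f"duplicate prefix: {prefix}")
        self.prefix_map[prefix] = uri_prefix

    def expand(self, curie: str):
        prefix, _, identifier = curie.partition(":")
        uri_prefix = self.prefix_map.get(prefix)
        return uri_prefix + identifier if uri_prefix else None
```

The duplicate check is the design question an official add_prefix would have to answer: raise, overwrite, or merge as a synonym.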
