altmetric / identifiers

Collection of utilities related to the extraction, validation and normalization of various scholarly identifiers.

Home Page: https://rubygems.org/gems/identifiers

License: MIT License

Ruby 99.77% Shell 0.23%

identifiers's Issues

ISBN with hyphens

For an ISBN [1], can the extractor return a value that retains the hyphens?

Given the variable length of ISBN parts, it's not possible to reconstruct the hyphens or spaces once they are removed. It's desirable to validate and extract an ISBN while retaining the spaces or hyphens in it.

Identifiers::ISBN.extract '978-92-95055-02-5'
=> ["9789295055025"]

[1] https://www.isbn-international.org/content/what-isbn
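One way to get the requested behaviour is to validate the digits while returning the matched text untouched. The following is a minimal sketch under stated assumptions, not the gem's implementation: both method names are hypothetical, only ISBN-13 is handled, and hyphens and spaces are the only separators considered.

```ruby
# Hypothetical helpers for illustration; not part of the identifiers gem.

# True when +digits+ is 13 digits whose weighted sum (weights 1,3,1,3,...)
# is a multiple of 10.
def valid_isbn13?(digits)
  return false unless digits.match?(/\A\d{13}\z/)

  sum = digits.each_char.with_index.sum { |char, index| char.to_i * (index.even? ? 1 : 3) }
  (sum % 10).zero?
end

# Returns each valid ISBN-13 in +text+ exactly as written, separators included.
def extract_isbn13_keeping_separators(text)
  # 978 or 979, then ten more digits, each optionally preceded by one hyphen or space.
  candidates = text.scan(/(?<!\d)97[89](?:[- ]?\d){10}(?!\d)/)
  candidates.select { |candidate| valid_isbn13?(candidate.delete('^0-9')) }
end

extract_isbn13_keeping_separators('978-92-95055-02-5') # => ["978-92-95055-02-5"]
```

The separators are stripped only for the checksum, so the caller still receives the ISBN in its original grouping.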

Cannot parse a PMID from a URI

> uri
=> "https://www.ncbi.nlm.nih.gov/pubmed/10002407"
> extractor
=> Identifiers::PubmedId
> extractor.extract(uri)
=> []

# It doesn't matter if it's HTTP or HTTPS
> uri = "http://www.ncbi.nlm.nih.gov/pubmed/10002407"
=> "http://www.ncbi.nlm.nih.gov/pubmed/10002407"
> extractor.extract(uri)
=> []

# The value in the URI works fine, alone
> value = '10002407'
=> "10002407"
> extractor.extract(value)
=> ["10002407"]
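Until the extractor handles URIs, the PMID could be pulled out of the path before it is passed on. This is only a sketch: the method name is hypothetical, and the host and the /pubmed/<digits> path shape are assumptions taken from the example above rather than a full list of PubMed URL forms.

```ruby
require 'uri'

# Hypothetical helper for illustration; not part of the identifiers gem.
# Returns the PMID string from a PubMed URI, or nil when the input does not
# look like one.
def pmid_from_pubmed_uri(string)
  uri = URI.parse(string)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
  return nil unless uri.is_a?(URI::HTTP)
  return nil unless uri.host == 'www.ncbi.nlm.nih.gov'

  match = uri.path.match(%r{\A/pubmed/(\d+)/?\z})
  match && match[1]
rescue URI::InvalidURIError
  nil
end

pmid_from_pubmed_uri('https://www.ncbi.nlm.nih.gov/pubmed/10002407') # => "10002407"
```

The returned string is the bare value that the session above shows the extractor already accepts.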

Annoying, yet valid, WorldBank DOI

The World Bank has a DOI that ends in a dot:

https://doi.org/10.1596/1020-797X-10_2_7.

Unfortunately, DOI.extract doesn't appear to support that: it drops the trailing dot.

[1] pry(main)> Identifiers::DOI.extract('https://doi.org/10.1596/1020-797X-10_2_7.')
=> ["10.1596/1020-797x-10_2_7"]
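A trailing dot is ambiguous: in running text it is usually sentence punctuation, but here it belongs to the DOI. One possible approach, sketched below, is to return both readings and let the caller decide (for example by checking which one resolves). The method name is hypothetical and the pattern is a deliberately simplified assumption about DOI syntax, not the gem's pattern.

```ruby
# Hypothetical helper for illustration; not part of the identifiers gem.
# Returns the DOI as written and, if it ends in punctuation, the stripped form.
def doi_candidates(string)
  # Simplified assumption: "10.", a 4-9 digit registrant code, "/", then any
  # run of non-whitespace characters.
  doi = string[%r{10\.\d{4,9}/\S+}]
  return [] if doi.nil?

  stripped = doi.sub(/[.,;:]+\z/, '')
  [doi, stripped].uniq
end

doi_candidates('https://doi.org/10.1596/1020-797X-10_2_7.')
# => ["10.1596/1020-797X-10_2_7.", "10.1596/1020-797X-10_2_7"]
```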

Support different hyphens for the same ISBN

The current regular expression identifies the character used as the hyphen and then allows it to be repeated within the string:

https://github.com/altmetric/identifiers/blob/master/lib/identifiers/isbn.rb#L7-L10

(\p{Pd} matches any kind of hyphen or dash, and \2? matches the same text as most recently matched by the 2nd capturing group.)

However, in some rare instances we found ISBNs that use different dash characters within the same ISBN, e.g. 978–3−200–01908–9.

The characters involved:
-: U+002D, UTF-8: 2D (hyphen-minus)
–: U+2013, UTF-8: E2 80 93 (en dash)
−: U+2212, UTF-8: E2 88 92 (minus sign; note: this is not recognised by \p{Pd})

The current behaviour is the result of an improvement introduced last year (2e8a138)
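One possible workaround is to normalise every dash-like character to an ASCII hyphen before matching, so the same-separator rule sees a single character. The sketch below assumes that approach; the constant and method names are hypothetical, and U+2212 is listed explicitly because it sits in the Sm (math symbol) category rather than Pd.

```ruby
# Hypothetical helper for illustration; not part of the identifiers gem.
# \p{Pd} covers dash punctuation (including U+002D and U+2013); U+2212 MINUS SIGN
# is category Sm, so it has to be added to the character class by hand.
DASH_LIKE = /[\p{Pd}\u2212]/

def normalize_dashes(text)
  text.gsub(DASH_LIKE, '-')
end

normalize_dashes("978\u20133\u2212200\u201301908\u20139") # => "978-3-200-01908-9"
```

Normalising first keeps the extraction pattern itself unchanged, at the cost of also rewriting dashes that are not part of an ISBN.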

Stop extracting multiple ISBNs from a string of digits separated by dashes

Currently, the gem extracts multiple ISBNs from a string of digits separated by '-'.
Here is an example:

pry(main)> Identifiers::ISBN.extract('0-1884-0-3140-0-4396-0-5652-0-4396-0-2826').uniq
=> ["9780188403145", "9780439605656"]

This should not happen, as valid ISBNs should be separate entities.
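One way to express "separate entities" is to reject any candidate that touches further digits or hyphens on either side. The sketch below assumes that rule and covers ISBN-10 only; the constant and method names are hypothetical and this is not the pattern the gem uses.

```ruby
# Hypothetical pattern for illustration; not the one used by the identifiers gem.
# The lookbehind and lookahead reject a candidate that is directly preceded or
# followed by another digit or hyphen.
STANDALONE_ISBN10 = /(?<![\d-])\d(?:-?\d){8}-?[\dX](?![\d-])/

# True when +chars+ is a well-formed ISBN-10 whose weighted sum
# (weights 10 down to 1, X counting as 10) is a multiple of 11.
def valid_isbn10?(chars)
  return false unless chars.match?(/\A\d{9}[\dX]\z/)

  sum = chars.each_char.with_index.sum do |char, index|
    (char == 'X' ? 10 : char.to_i) * (10 - index)
  end
  (sum % 11).zero?
end

def extract_standalone_isbn10(text)
  text.scan(STANDALONE_ISBN10).select { |candidate| valid_isbn10?(candidate.delete('-')) }
end

extract_standalone_isbn10('0-1884-0-3140-0-4396-0-5652-0-4396-0-2826') # => []
extract_standalone_isbn10('ISBN 0-306-40615-2')                        # => ["0-306-40615-2"]
```

The trade-off is that an ISBN immediately followed by a hyphen in prose would also be rejected.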

Isbn::extract doesn't work with space-separated ISBNs

I tried to use the library like so:

Isbn::extract("9780854045877 9781847552518")

and got [] back.

If I use it on

Isbn::extract("9780854045877\n9781847552518")

though it gives me both ISBNs.
Is there a reason why it doesn't extract anything in the first scenario?

Extract ISBNs using same separators

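Always extract using the same separator throughout an ISBN (space, hyphen or none). The following needs fixing: the third result is spurious, because it is the ISBN-10 1801999783 read across the space between the two ISBNs (18019-9 978-3) and converted to ISBN-13.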

Identifiers::ISBN.extract("978-3-319-18019-9 978-3-319-18020-5")
=> ["9783319180199", "9783319180205", "9781801999786"]
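The same-separator rule can be sketched with a backreference: capture the first separator and require every later one to repeat it. This is an illustration under stated assumptions, not the gem's pattern: the names are hypothetical, only ISBN-13 is handled, and the separator is a hyphen, a space, or nothing.

```ruby
# Hypothetical pattern for illustration; not the one used by the identifiers gem.
# Group 1 captures the first separator (hyphen, space, or nothing) and every
# later \1 must repeat it, so one match cannot mix hyphens and spaces.
SAME_SEPARATOR_ISBN13 = /(?<![\d-])97[89]([- ]?)\d+\1\d+\1\d+\1\d(?![\d-])/

# True when +digits+ is 13 digits whose weighted sum (weights 1,3,1,3,...)
# is a multiple of 10.
def valid_isbn13?(digits)
  return false unless digits.match?(/\A\d{13}\z/)

  sum = digits.each_char.with_index.sum { |char, index| char.to_i * (index.even? ? 1 : 3) }
  (sum % 10).zero?
end

def extract_isbn13_same_separator(text)
  matches = []
  # scan yields capture groups, so read the whole match from Regexp.last_match.
  text.scan(SAME_SEPARATOR_ISBN13) { matches << Regexp.last_match(0) }
  matches.select { |candidate| valid_isbn13?(candidate.delete('^0-9')) }
end

extract_isbn13_same_separator('978-3-319-18019-9 978-3-319-18020-5')
# => ["978-3-319-18019-9", "978-3-319-18020-5"]
extract_isbn13_same_separator('9780854045877 9781847552518')
# => ["9780854045877", "9781847552518"]
```

Because a space can never continue a hyphenated match, the space between two ISBNs acts as a boundary instead of being absorbed into a third candidate.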
