
pdf2doi's Introduction

Allan Lab Website

This is the website of our academic research group at Leiden University.

This website is powered by Jekyll and some Bootstrap, Bootwatch. We tried to make it simple yet adaptable, so that it is easy for you to use it as a template. Please feel free to copy and modify it for your own purposes. You don't have to link to us or mention us (but of course we appreciate it).

Go to aboutwebsite.md to learn how to copy and modify this page for your purpose.

Copyright Allan Lab. Code released under the MIT License.

pdf2doi's People

Contributors

djrhails, duzabf, michelecotrufo

pdf2doi's Issues

Proxy?

Is it possible to add Proxy functionality? I am blocked from using the web requests without it. The code itself looks great.
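
A possible workaround until proxy support exists, sketched under the assumption that pdf2doi's web requests go through an HTTP library that honors the standard proxy environment variables (as `urllib` and `requests` do by default). The helper `use_proxy` and the proxy address are hypothetical and not part of pdf2doi.

```python
import os
import urllib.request


def use_proxy(proxy_url):
    """Point the standard proxy environment variables at proxy_url."""
    # Lowercase names are honored by both urllib and requests.
    os.environ["http_proxy"] = proxy_url
    os.environ["https_proxy"] = proxy_url
    # Return what the standard library now sees, as a quick sanity check.
    return urllib.request.getproxies_environment()


# Placeholder address; replace it with the proxy you are required to use.
proxies = use_proxy("http://proxy.example.org:3128")
print(proxies.get("https"))
```

The variables must be set before the first web request is made, for example at the top of the script that calls pdf2doi.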

Possible optimization for main.py

Currently, lines 425 to 428 in main read:

    for result in results:
        if result['identifier']:
            print('{:<15s} {:<40s} {:<10s}\n'.format(result['identifier_type'], result['identifier'],result['path']) ) 

    return

I suspect this could be improved/updated with an f-string:

    for result in results:
        if result['identifier']:
            print(f"{result['identifier_type']}, {result['identifier']}, {result['path']} \n")

    return
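
If the column alignment of the `str.format` version should be kept, the format specifiers carry over to an f-string unchanged. The sketch below assumes results are dictionaries with the three keys used in the loop; `format_row` and the sample data are illustrative only. The outer string uses double quotes because reusing the same quote character inside an f-string is a syntax error before Python 3.12.

```python
def format_row(result):
    """Format one result with the same column widths as the str.format version."""
    return (
        f"{result['identifier_type']:<15s} "
        f"{result['identifier']:<40s} "
        f"{result['path']:<10s}\n"
    )


# Illustrative data only; the keys mirror those used in the loop.
results = [
    {"identifier_type": "DOI", "identifier": "10.1145/3341227", "path": "./paper.pdf"},
    {"identifier_type": None, "identifier": None, "path": "./unknown.pdf"},
]
for result in results:
    if result["identifier"]:
        print(format_row(result))
```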

Program returns error on encrypted files, would prefer if it skipped them.

A file in the list wasn't decrypted, and so it returned this error. Ideally, it should log a warning that it's encrypted, and then skip over it.

[pdf2doi]: Trying to retrieve a DOI/identifier for the file: ...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
Traceback (most recent call last):
  File "/usr/local/bin/pdf2doi", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/pdf2doi/main.py", line 410, in main
    results = pdf2doi(target=target,
  File "/usr/local/lib/python3.9/site-packages/pdf2doi/main.py", line 112, in pdf2doi
    result = pdf2doi( target=file, verbose=verbose, websearch=websearch, webvalidation=webvalidation,
  File "/usr/local/lib/python3.9/site-packages/pdf2doi/main.py", line 147, in pdf2doi
    result = pdf2doi_singlefile(filename)
  File "/usr/local/lib/python3.9/site-packages/pdf2doi/main.py", line 190, in pdf2doi_singlefile
    result = finders.find_identifier(filename,method="document_infos",keysToCheckFirst=['/doi','/identfier'])
  File "/usr/local/lib/python3.9/site-packages/pdf2doi/finders.py", line 487, in find_identifier
    identifier, desc, info = finder_methodsmethod
  File "/usr/local/lib/python3.9/site-packages/pdf2doi/finders.py", line 587, in find_identifier_in_pdf_info
    pdfinfo = get_pdf_info(path)
  File "/usr/local/lib/python3.9/site-packages/pdf2doi/finders.py", line 275, in get_pdf_info
    info = pdf.getDocumentInfo()
  File "/usr/local/lib/python3.9/site-packages/PyPDF2/pdf.py", line 1101, in getDocumentInfo
    obj = self.trailer['/Info']
  File "/usr/local/lib/python3.9/site-packages/PyPDF2/generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "/usr/local/lib/python3.9/site-packages/PyPDF2/generic.py", line 178, in getObject
    return self.pdf.getObject(self).getObject()
  File "/usr/local/lib/python3.9/site-packages/PyPDF2/pdf.py", line 1617, in getObject
    raise utils.PdfReadError("file has not been decrypted")
PyPDF2.utils.PdfReadError: file has not been decrypted
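
The requested behaviour (warn and move on) can be sketched with a small wrapper around the per-file lookup. This is not pdf2doi's code: `find_all`, `fake_finder` and the placeholder identifier are hypothetical, and the wrapper catches any exception rather than only the `PdfReadError` seen in the traceback.

```python
import logging

logger = logging.getLogger("pdf2doi_sketch")


def find_all(paths, find_one):
    """Run find_one on every path; log a warning and skip files that raise."""
    results = []
    for path in paths:
        try:
            results.append(find_one(path))
        except Exception as exc:  # e.g. PyPDF2's PdfReadError for encrypted files
            logger.warning("Skipping %s: %s", path, exc)
            results.append({"path": path, "identifier": None, "identifier_type": None})
    return results


def fake_finder(path):
    """Stand-in for the real per-file lookup; fails on one 'encrypted' file."""
    if path.endswith("locked.pdf"):
        raise RuntimeError("file has not been decrypted")
    return {"path": path, "identifier": "10.1000/placeholder", "identifier_type": "DOI"}


results = find_all(["a.pdf", "locked.pdf", "b.pdf"], fake_finder)
print([r["identifier"] for r in results])
```

Recording a placeholder entry for the skipped file keeps the output list aligned with the input list, which makes batch runs easier to audit.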

Running this on Mac Big Sur in VSC, v 0.5 returns this error. v 0.4 does not.

Traceback (most recent call last):
  File "/Users/johnfallot/venv/210706_PDN_ScienceAssistant_v16.py", line 3, in <module>
    from pdf2doi.finders import validate
  File "/usr/local/lib/python3.9/site-packages/pdf2doi/__init__.py", line 13, in <module>
    from .main import pdf2doi
  File "/usr/local/lib/python3.9/site-packages/pdf2doi/main.py", line 6, in <module>
    import pdf2doi.utils_registry as utils_registry
  File "/usr/local/lib/python3.9/site-packages/pdf2doi/utils_registry.py", line 5, in <module>
    import winreg
ModuleNotFoundError: No module named 'winreg'
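
`winreg` is a Windows-only standard-library module, so an unconditional import fails on macOS and Linux. A common pattern is to guard the import by platform; the sketch below is generic and `registry_available` is a hypothetical helper, not a function from pdf2doi.

```python
import sys

if sys.platform == "win32":
    import winreg  # Windows-only standard-library module
else:
    winreg = None  # registry features are unavailable on this platform


def registry_available():
    """Report whether Windows registry helpers can be used."""
    return winreg is not None


print(registry_available())
```

Code that touches the registry can then check `registry_available()` first and skip that feature elsewhere.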

Not a bug, just results have 'false-positives'? (Test scenario)

Hello,
this is a very nifty tool, but I get the following results on a test case I set up containing a bundle of random PDFs from my library.

[pdf2doi]: ................
DOI             10.1109/MS.2018.2141038                  ./[email protected]

DOI             10.1145/3341227                          ./10.1145@3341227 MUST and MUST NOT.pdf

DOI             10.1145/38807.38824                      ./120158- Use Case Template-20160821_0954877.pdf

DOI             10.1016/j.jss.2016.02.047                ./120216- Software Requirements Specification Template-20160821_0951179.pdf

DOI             10.1007/978-3-319-09816-6                ./2014_Book_Autonomy Requirements Engineering for Space Missions NASA Springer.pdf

DOI             10.1007/978-1-4614-5377-2                ./293233main_62651main_1_pmchallenge_hraster.pdf

The first result is pretty cool: it was extracted from the filename. The 1st, 2nd and 5th results are correct; the rest are wrong. The last one in particular is close to the target, but I have yet to understand how it was found. The file is a presentation that does not mention the extracted DOI but has similar content.

Sincerely

Clash with other PDF extraction libraries

I use a bunch of other PDF extraction tools such as tabula, camelot and layoutparser, and it seems that pdf2doi pins an older version of pdfminer-six, which causes problems when it coexists with these libraries. When installing with pip in the same environment in which I use layoutparser and camelot, I get this error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pdfplumber 0.6.0 requires pdfminer.six==20211012, but you have pdfminer-six 20181108 which is incompatible.
google-api-core 1.31.5 requires six>=1.13.0, but you have six 1.12.0 which is incompatible.
camelot-py 0.10.1 requires pdfminer.six>=20200726, but you have pdfminer-six 20181108 which is incompatible.

Is there a workaround to this problem?

[Suggestion] Look into PDF text-annotations for valid DOIs

First of all, thanks for the awesome tool! It saved me lots of time during my bibliography/SOTA runs, or by batch-renaming 100s of PDF files for easier indexing.

Now, to the point:

a) Some background: I disabled Google searching (Methods #4 and #5) as they rarely worked on old/no-DOI papers in my field (I am an electromagnetics engineer, working with journals from IEEE, OSA/Optica, AIP, APS, etc.). It's faster for me to open the PDF file with Chrome, select the title, right-click it to Google-search and get the DOI. Now, to pass this DOI to PDF2DOI, I presently rename the file using the DOI as the name string (replacing slashes with dashes), and then right-click it with PDF_renamer, done. So, it works with Method #2.

b) The Suggestion: I sometimes also copy the DOI (as URL or plain DOI, with slashes etc) into the top of the first page, for easier reference, as a text-annotation ("typewriter tool") or inside a bubble/note/comment annotation. Could PDF2DOI be made to look into these first-page annotations for the DOI, e.g., during Method#3? It would be really handy (for me)...

Thanks for your time!

TypeError: 'NoneType' object is not iterable

There appears to be a type error in "finders.py" that only emerges on certain PDF files. This one, for example:
paper12.2009_unknown_040916_440842.pdf

A minimal code snippet for reproducing this error:

from pathlib import Path
import pdf2doi

pdf2doi.config.set("verbose", False)
PDF_name = "paper12.2009_unknown_040916_440842.pdf"
results = pdf2doi.pdf2doi(str(Path("examples", PDF_name)))

Where the PDF is placed in the example folder.

Here is the error message:

Traceback (most recent call last):
  File "/Users/donyin/Desktop/pdf2doi-master/main.py", line 15, in <module>
    results = pdf2doi.pdf2doi(str(Path("examples", i)))
  File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/main.py", line 90, in pdf2doi
    result = pdf2doi_singlefile(filename)
  File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/main.py", line 134, in pdf2doi_singlefile
    result = finders.find_identifier(file,method="document_infos",keysToCheckFirst=['/doi','/identfier'])
  File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/finders.py", line 548, in find_identifier
    identifier, desc, info = finder_methods[method](file,func_validate,**kwargs)
  File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/finders.py", line 586, in find_identifier_in_pdf_info
    identifier,desc,info = find_identifier_in_text(pdfinfo[key],func_validate)
  File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/finders.py", line 286, in find_identifier_in_text
    for identifier in identifiers:
TypeError: 'NoneType' object is not iterable

I thought I fixed this error by adding:

if identifiers is None:
    identifiers = []

at line 286 of your "finders.py", so that it becomes:

        #First we look for DOI
        for v in range(len(doi_regexp)):
            identifiers = extract_doi_from_text(text,version=v)
            if identifiers is None: # <- here
                identifiers = [] # <- here
            for identifier in identifiers:
                validation = func_validate(identifier,'doi')
                if validation: 
                    return identifier, 'DOI', validation

But this was a bit hacky and not the proper solution. You'd undoubtedly know more about what's going on, so I thought I'd let you know about this.
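
The same guard can be written more compactly with `or []`, which turns a `None` result into an empty loop. This is only a sketch of the idiom: `first_valid_identifier` is a hypothetical helper, and whether the extraction function is meant to return `None` at all is a question for the maintainer.

```python
def first_valid_identifier(identifiers, func_validate):
    """Return the first identifier that validates, tolerating identifiers=None."""
    # `identifiers or []` makes the loop a no-op when nothing was extracted.
    for identifier in identifiers or []:
        validation = func_validate(identifier, "doi")
        if validation:
            return identifier, "DOI", validation
    return None, None, None


# A validator stub that accepts everything, for illustration only.
print(first_valid_identifier(None, lambda identifier, kind: True))
```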

And by the way, there are some deprecated calls that you might want to address:

UserWarning: Page.extractText is deprecated and will be removed in PyPDF2 2.0.0. Use Page.extract_text instead. [_page.py:1003]
UserWarning: addMetadata is deprecated and will be removed in PyPDF2 2.0.0. Use add_metadata instead. [_writer.py:793]
UserWarning: Page.extractText is deprecated and will be removed in PyPDF2 2.0.0. Use Page.extract_text instead. [_page.py:1003]

cheers,
Don

doi2bib function

Hi, Michele,

Nice work! I am trying to extract the BibTeX strings into a .txt file, but I noticed that you removed the bibTex_makers module in v1.1. Could you suggest how I can achieve this with the current version?

Thanks!
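
One route that does not depend on pdf2doi at all is DOI content negotiation: requesting `https://doi.org/<DOI>` with the header `Accept: application/x-bibtex` returns a BibTeX entry for DOIs registered with agencies that support it, such as Crossref and DataCite. The sketch below uses only the standard library; `bibtex_request` and `fetch_bibtex` are hypothetical helper names.

```python
import urllib.parse
import urllib.request


def bibtex_request(doi):
    """Build a request asking doi.org for a BibTeX rendering of the DOI."""
    url = "https://doi.org/" + urllib.parse.quote(doi)
    return urllib.request.Request(url, headers={"Accept": "application/x-bibtex"})


def fetch_bibtex(doi, timeout=10):
    """Perform the request (needs network access) and return the BibTeX text."""
    with urllib.request.urlopen(bibtex_request(doi), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_bibtex` on each DOI that pdf2doi returns and appending the text to a file would give the desired .txt output.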

Pdf reading from file object rather than from path

Hello,

Amazing tool, I love it. Is there a way to feed pdf2doi a file object rather than an absolute path? I am asking because I am trying to modify an app deployed on Google Cloud to incorporate pdf2doi, but I can't find a way that doesn't involve downloading the files to a local machine, which would be mildly inconvenient. The PDF files are stored in Google Cloud, and it would be more elegant to open them as file objects and manipulate them than to download them, run pdf2doi locally and re-upload the info.

Thank you very much for your work!
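
Until file objects are supported, a thin wrapper can hide the local write: the bytes are placed in a temporary directory, the path-based function is called, and the directory is removed afterwards. This is a sketch of a workaround, not a pdf2doi feature; `run_on_bytes` is a hypothetical helper, and in practice `path_based_func` would be the pdf2doi call that takes a path.

```python
import tempfile
from pathlib import Path


def run_on_bytes(pdf_bytes, path_based_func, name="document.pdf"):
    """Write bytes to a temporary file, call path_based_func on its path, clean up."""
    with tempfile.TemporaryDirectory() as tmpdir:
        tmp_path = Path(tmpdir) / name
        tmp_path.write_bytes(pdf_bytes)
        # The temporary directory is deleted as soon as this block exits.
        return path_based_func(str(tmp_path))


# Stand-in function that just reports the file size, for illustration only.
print(run_on_bytes(b"%PDF-1.4 placeholder", lambda path: Path(path).stat().st_size))
```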

File not closed

Hi,

The function pdf2doi_singlefile does not close the opened PDF file. The statement that closes the file is never reached when an identifier is found, because the function returns early.

pdf2doi/pdf2doi/main.py

Lines 161 to 165 in d8e7117

    if result['identifier']:
        return result
    if flag_closefile:
        file.close()

This causes the issue with pdf-rename on Windows. The renaming attempt fails with an access error because the file is still held open by the script itself.

I'll open a PR shortly to fix this.
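
One general way to guarantee the close on every exit path is `try`/`finally` (or a `with` block). The sketch below shows the pattern only; `find_identifier_in_file` and the finder callables are hypothetical and do not reproduce the actual fix.

```python
def find_identifier_in_file(path, finders):
    """Try each finder on the open file and always close it afterwards."""
    file = open(path, "rb")
    try:
        for finder in finders:
            result = finder(file)
            if result:
                return result  # the finally block still runs on this early return
        return None
    finally:
        file.close()
```

Because `finally` runs even when the function returns from inside the loop, the file handle is released before any later rename is attempted.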

All arXiv articles now have DOIs

The arXiv blog has announced that, as of Feb 2022, all arXiv articles have DOIs.

Furthermore, the DOIs share a unified prefix and use the arXiv ID as a suffix:

"An author can determine their article’s DOI by using the DOI prefix https://doi.org/10.48550/ followed by the arXiv ID (replacing the colon with a period). For example, the arXiv ID arXiv:2202.01037 will translate to the DOI link https://doi.org/10.48550/arXiv.2202.01037"

Perhaps this could be grounds for a 2.0 release that returns a DOI for arXiv articles.
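
The rule quoted above is a plain string transformation. A minimal sketch, assuming the input is an arXiv ID with or without the `arXiv:` prefix; `arxiv_id_to_doi` is a hypothetical helper, not an existing pdf2doi function.

```python
def arxiv_id_to_doi(arxiv_id):
    """Map an arXiv ID such as 'arXiv:2202.01037' to its DOI."""
    arxiv_id = arxiv_id.strip()
    # Drop the 'arXiv:' prefix if present; the DOI uses 'arXiv.' instead.
    if arxiv_id.lower().startswith("arxiv:"):
        arxiv_id = arxiv_id[len("arxiv:"):]
    return f"10.48550/arXiv.{arxiv_id}"


print("https://doi.org/" + arxiv_id_to_doi("arXiv:2202.01037"))
```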

Option for disabling document text method

First, thanks for this very helpful library!

For many of the papers I read, your algorithm works fine and finds the correct DOI.
But as you already mention in the README, for some papers the document_text method returns a wrong DOI because the DOIs of other papers appear first.
Unfortunately, this is very often the case for papers from certain conferences I read often, as they contain arXiv IDs in the references and do not contain their own DOI anywhere else in the text. At the same time, when I comment out the document_text method, I get pretty good results with the fourth method.
I am wondering if one of the following features might help to reduce this type of error:

  • only using the first pages to look for doi in text
  • having an option to disable certain steps in the search process
  • being able to customize the order of the search methods

Do you think one of these options (or something else) is something the library would benefit from and that can be implemented with reasonable effort? If so, I can see if I find the time to turn my current "comment-out workaround" into a mergeable feature.
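
The second and third options could share one mechanism: keep the search methods in a name-to-function mapping and let the caller pass an order and a set of disabled names. The sketch below is a design illustration only; `find_identifier`, its parameters, the "websearch" name and the stub identifiers are hypothetical.

```python
def find_identifier(path, methods, order=None, disabled=()):
    """Try the named methods in order, skipping disabled ones."""
    names = order if order is not None else list(methods)
    for name in names:
        if name in disabled:
            continue
        identifier = methods[name](path)
        if identifier:
            return name, identifier
    return None, None


# Stub methods standing in for the real finders; identifiers are placeholders.
methods = {
    "document_infos": lambda path: None,
    "document_text": lambda path: "10.0000/from-references",
    "websearch": lambda path: "10.0000/from-title-search",
}
print(find_identifier("paper.pdf", methods, disabled={"document_text"}))
```

With `document_text` disabled, the lookup falls through to the next method, which matches the effect of the comment-out workaround without editing the library.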

Export/Save to CSV? Import From CSV?

May be a bit involved, but my hunch is that a factory pattern could be used to allow for either importing info from a directory OR a row in a CSV.

Similarly, it'd be great if this exported to a CSV/not just the console. This I've done in my own program. As I've mentioned to you already, however, it's had some problems (namely Pandas isn't happy with some of the keys I've given it).
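
For the export side, the standard-library `csv` module may be enough and avoids the pandas key problems. The sketch assumes each result is a dictionary with the `identifier_type`, `identifier` and `path` keys printed by the command-line tool; `write_results_csv` is a hypothetical helper, and `extrasaction="ignore"` drops any other keys instead of raising.

```python
import csv
import io

FIELDS = ["identifier_type", "identifier", "path"]


def write_results_csv(results, stream):
    """Write results that have an identifier as CSV rows to a text stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for result in results:
        if result.get("identifier"):
            writer.writerow(result)


# Illustrative data; the second entry has no identifier and is skipped.
sample = [
    {"identifier_type": "DOI", "identifier": "10.1145/3341227", "path": "./a.pdf", "extra": 1},
    {"identifier_type": None, "identifier": None, "path": "./b.pdf"},
]
buffer = io.StringIO()
write_results_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk, pass a file opened with `open(filename, "w", newline="")` instead of the `StringIO` buffer.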

Add file DOI check to URL paths

Often a search can surface DOI descriptors in the URL path alone, for instance:

[pdf2doi]: Performing google search with key "The Experimental Generation of Interpersonal Closeness: A Procedure and Some Preliminary Findings"
[pdf2doi]: and looking at the first 6 results...
[pdf2doi]: Looking for a valid identifier in the search result #1 : https://journals.sagepub.com/doi/pdf/10.1177/0146167297234003
[pdf2doi]: Looking for a valid identifier in the search result #2 : https://journals.sagepub.com/doi/abs/10.1177/0146167297234003

Supporting this would not only give quicker identifications but also cover cases such as this one, where the DOI can't be extracted from the actual page.

https://doi.org/10.1177/0146167297234003
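
A rough sketch of such a check: parse the URL, then search its path for something shaped like a DOI. The regular expression is a simplified pattern chosen for this example, not the one pdf2doi uses, and `doi_from_url` is a hypothetical helper; any candidate it returns should still be validated, since trailing path segments can be captured as well.

```python
import re
from urllib.parse import unquote, urlparse

# Simplified DOI shape: '10.', a 4-9 digit registrant code, '/', then a suffix.
DOI_IN_PATH = re.compile(r"10\.\d{4,9}/[^\s?#]+")


def doi_from_url(url):
    """Return a DOI-like candidate found in the URL path, or None."""
    path = unquote(urlparse(url).path)
    match = DOI_IN_PATH.search(path)
    return match.group(0) if match else None


print(doi_from_url("https://journals.sagepub.com/doi/pdf/10.1177/0146167297234003"))
```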

Not importing for Python 3.10

This is likely known/to be expected, but upon upgrading python to 3.10, pdf2doi no longer imports for VSCode on Mac.
