michelecotrufo / pdf2doi

A python library/command-line tool to extract the DOI or other identifiers of a scientific paper from a pdf file.

Python 98.01% Batchfile 1.99%

doi python pdf bibtex arxiv identifiers arxiv-identifiers bibtex-entry extract-doi extract

pdf2doi's Introduction

Allan Lab Website

This is the website of our academic research group at Leiden University.

This website is powered by Jekyll and some Bootstrap, Bootwatch. We tried to make it simple yet adaptable, so that it is easy for you to use it as a template. Please feel free to copy and modify for your own purposes. You don't have to link to us or mention us (but of course we appreciate it).

Go to aboutwebsite.md to learn how to copy and modify this page for your purpose.

pdf2doi's People

alexmaehon johny-leo duzabf djrhails mana-bio neherdata pathos315 abrefael yebe-abe sabeehsaeed hectormz m0dd0

pdf2doi's Issues

Proxy?

Is it possible to add Proxy functionality? I am blocked from using the web requests without it. The code itself looks great.

Possible optimization for main.py

Currently, lines 425 to 428 in main read:

    for result in results:
        if result['identifier']:
            print('{:<15s} {:<40s} {:<10s}\n'.format(result['identifier_type'], result['identifier'],result['path']) ) 

    return

I suspect this could be improved/updated with an f-string:

    for result in results:
        if result['identifier']:
            print(f"{result['identifier_type']}, {result['identifier']}, {result['path']} \n")

    return
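An f-string can also keep the column alignment of the original `str.format` call by reusing the same format specifications inside the braces. A minimal sketch, assuming each result is a dict with the three keys shown above; `format_result` and the sample values are hypothetical names used only for illustration:

```python
def format_result(result):
    """Format one result row with the same column widths as the str.format version."""
    # Double quotes delimit the f-string so the single-quoted dict keys inside
    # the braces are valid on every Python version that supports f-strings.
    return f"{result['identifier_type']:<15s} {result['identifier']:<40s} {result['path']:<10s}\n"


# Hypothetical sample row, for illustration only.
sample = {"identifier_type": "DOI", "identifier": "10.1000/example", "path": "paper.pdf"}
print(format_result(sample), end="")
```

Whether this counts as an optimization is debatable; the gain is mostly readability rather than speed.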

Prints 'None' instead of the number of google results (only when called from command prompt)

"Doing a google search, looking at the first None results..."

Program returns error on encrypted files, would prefer if it skipped them.

A file in the list wasn't decrypted, and so it returned this error. Ideally, it should log a warning that it's encrypted, and then skip over it.

    [pdf2doi]: Trying to retrieve a DOI/identifier for the file: ...
    [pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
    Traceback (most recent call last):
      File "/usr/local/bin/pdf2doi", line 8, in <module>
        sys.exit(main())
      File "/usr/local/lib/python3.9/site-packages/pdf2doi/main.py", line 410, in main
        results = pdf2doi(target=target,
      File "/usr/local/lib/python3.9/site-packages/pdf2doi/main.py", line 112, in pdf2doi
        result = pdf2doi( target=file, verbose=verbose, websearch=websearch, webvalidation=webvalidation,
      File "/usr/local/lib/python3.9/site-packages/pdf2doi/main.py", line 147, in pdf2doi
        result = pdf2doi_singlefile(filename)
      File "/usr/local/lib/python3.9/site-packages/pdf2doi/main.py", line 190, in pdf2doi_singlefile
        result = finders.find_identifier(filename,method="document_infos",keysToCheckFirst=['/doi','/identfier'])
      File "/usr/local/lib/python3.9/site-packages/pdf2doi/finders.py", line 487, in find_identifier
        identifier, desc, info = finder_methodsmethod
      File "/usr/local/lib/python3.9/site-packages/pdf2doi/finders.py", line 587, in find_identifier_in_pdf_info
        pdfinfo = get_pdf_info(path)
      File "/usr/local/lib/python3.9/site-packages/pdf2doi/finders.py", line 275, in get_pdf_info
        info = pdf.getDocumentInfo()
      File "/usr/local/lib/python3.9/site-packages/PyPDF2/pdf.py", line 1101, in getDocumentInfo
        obj = self.trailer['/Info']
      File "/usr/local/lib/python3.9/site-packages/PyPDF2/generic.py", line 516, in __getitem__
        return dict.__getitem__(self, key).getObject()
      File "/usr/local/lib/python3.9/site-packages/PyPDF2/generic.py", line 178, in getObject
        return self.pdf.getObject(self).getObject()
      File "/usr/local/lib/python3.9/site-packages/PyPDF2/pdf.py", line 1617, in getObject
        raise utils.PdfReadError("file has not been decrypted")
    PyPDF2.utils.PdfReadError: file has not been decrypted
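The requested behaviour (log a warning, then skip the file) could look roughly like the sketch below. It is not the library's code: `find_identifiers`, `find_one` and `EncryptedPdfError` are hypothetical names, and the exception class is only a stand-in for the reader error shown in the traceback.

```python
import logging

logger = logging.getLogger("pdf2doi_sketch")


class EncryptedPdfError(Exception):
    """Hypothetical stand-in for the reader's 'file has not been decrypted' error."""


def find_identifiers(paths, find_one, skip_errors=(EncryptedPdfError,)):
    """Run find_one on every path, logging and skipping files that raise skip_errors."""
    results = []
    for path in paths:
        try:
            results.append(find_one(path))
        except skip_errors as exc:
            # Log a warning and record an empty result instead of aborting the batch.
            logger.warning("Skipping %s: %s", path, exc)
            results.append({"path": path, "identifier": None, "identifier_type": None})
    return results
```

With the real library, `skip_errors` would presumably be the `PyPDF2.utils.PdfReadError` class named in the traceback.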

Got the wrong DOI when querying the pre-published version of a PDF

It is an amazing tool, but when I searched for a new paper (1.pdf), I got the wrong DOI (10.1002/2016jc012583). Does anyone know the reason?

In fact, its DOI is on the first page. Wouldn't it be faster to query the first page directly?

1.pdf

Bibtex entries: the url fields contain %2F instead of /

The url field of each bibtex entry has some of the "/" escaped to %2F
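Until this is fixed, the escaped field can be decoded after the fact with the standard library. A minimal sketch; `unescape_url_field` and the sample URL are hypothetical and used only for illustration:

```python
from urllib.parse import unquote


def unescape_url_field(url):
    """Turn percent-escapes such as %2F back into the characters they encode."""
    return unquote(url)


# Hypothetical escaped url field, for illustration only.
escaped = "https://doi.org/10.1000%2Fexample"
print(unescape_url_field(escaped))
```

Note that `unquote` decodes every percent-escape in the string, not only `%2F`.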

Running this on Mac Big Sur in VSC, v 0.5 returns this error. v 0.4 does not.

    Traceback (most recent call last):
      File "/Users/johnfallot/venv/210706_PDN_ScienceAssistant_v16.py", line 3, in <module>
        from pdf2doi.finders import validate
      File "/usr/local/lib/python3.9/site-packages/pdf2doi/__init__.py", line 13, in <module>
        from .main import pdf2doi
      File "/usr/local/lib/python3.9/site-packages/pdf2doi/main.py", line 6, in <module>
        import pdf2doi.utils_registry as utils_registry
      File "/usr/local/lib/python3.9/site-packages/pdf2doi/utils_registry.py", line 5, in <module>
        import winreg
    ModuleNotFoundError: No module named 'winreg'
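`winreg` ships with Python on Windows only, so an unconditional import fails on macOS and Linux. One common way to guard such an import is sketched below; `registry_available` is a hypothetical helper, not part of pdf2doi:

```python
import sys

if sys.platform == "win32":
    import winreg  # part of the standard library on Windows only
else:
    winreg = None  # registry features are unavailable on macOS and Linux


def registry_available():
    """Return True only when the Windows registry module could be imported."""
    return winreg is not None
```

Code that touches the registry would then check `registry_available()` before using `winreg`.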

Not a bug, just results have 'false-positives'? (Test scenario)

Hello,
this is a very nifty tool, but I get the following results on a test case I set up containing a bundle of random PDFs from my library.

[pdf2doi]: ................
DOI             10.1109/MS.2018.2141038                  ./[email protected]

DOI             10.1145/3341227                          ./10.1145@3341227 MUST and MUST NOT.pdf

DOI             10.1145/38807.38824                      ./120158- Use Case Template-20160821_0954877.pdf

DOI             10.1016/j.jss.2016.02.047                ./120216- Software Requirements Specification Template-20160821_0951179.pdf

DOI             10.1007/978-3-319-09816-6                ./2014_Book_Autonomy Requirements Engineering for Space Missions NASA Springer.pdf

DOI             10.1007/978-1-4614-5377-2                ./293233main_62651main_1_pmchallenge_hraster.pdf

The first answer is pretty cool, extracted from the filename. The 1st, 2nd and 5th are correct. The rest are false. Specifically, the last one is close to the target, but I have yet to understand how. The file is a presentation that does not mention the extracted DOI but has similar contents.

Sincerely

Clash with other PDF extraction libraries

I use a bunch of other PDF extraction tools like tabula, camelot and layout parser, and it seems that pdf2doi is using an older version of pdfminer-six, which causes problems when coexisting with these libraries. When installing with pip in the same env in which I use layoutparser and camelot, I get this error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pdfplumber 0.6.0 requires pdfminer.six==20211012, but you have pdfminer-six 20181108 which is incompatible.
google-api-core 1.31.5 requires six>=1.13.0, but you have six 1.12.0 which is incompatible.
camelot-py 0.10.1 requires pdfminer.six>=20200726, but you have pdfminer-six 20181108 which is incompatible.

Is there a workaround to this problem?

[Suggestion] Look into PDF text-annotations for valid DOIs

First of all, thanks for the awesome tool! It saved me lots of time during my bibliography/SOTA runs, or by batch-renaming 100s of PDF files for easier indexing.

Now, to the point:

a) Some background: I disabled Google-searching (Methods #4 and #5) as they rarely worked on old/no-DOI papers in my field (I am an electromagnetics engineer, working with journals from IEEE, OSA/Optica, AIP, APS, etc.). It's faster for me to open the PDF file w/ Chrome, select the title, R-click it to google-search and get the DOI. Now, to pass this DOI to PDF2DOI, I presently rename the file using the DOI as a name-string (replacing slashes with dashes), and then R-click it with PDF_renamer, done. So, it works with Method#2.

b) The Suggestion: I sometimes also copy the DOI (as URL or plain DOI, with slashes etc) into the top of the first page, for easier reference, as a text-annotation ("typewriter tool") or inside a bubble/note/comment annotation. Could PDF2DOI be made to look into these first-page annotations for the DOI, e.g., during Method#3? It would be really handy (for me)...

Thanks for your time!

TypeError: 'NoneType' object is not iterable

There appears to be a type error in "finders.py" that only emerges on certain PDF files. This one, for example:
paper12.2009_unknown_040916_440842.pdf

A minimal code snippet for reproducing this error:

from pathlib import Path
import pdf2doi

pdf2doi.config.set("verbose", False)
PDF_name = "paper12.2009_unknown_040916_440842.pdf"
results = pdf2doi.pdf2doi(str(Path("examples", PDF_name)))

Where the PDF is placed in the examples folder.

Here is the error message:

Traceback (most recent call last):
  File "/Users/donyin/Desktop/pdf2doi-master/main.py", line 15, in <module>
    results = pdf2doi.pdf2doi(str(Path("examples", i)))
  File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/main.py", line 90, in pdf2doi
    result = pdf2doi_singlefile(filename)
  File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/main.py", line 134, in pdf2doi_singlefile
    result = finders.find_identifier(file,method="document_infos",keysToCheckFirst=['/doi','/identfier'])
  File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/finders.py", line 548, in find_identifier
    identifier, desc, info = finder_methods[method](file,func_validate,**kwargs)
  File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/finders.py", line 586, in find_identifier_in_pdf_info
    identifier,desc,info = find_identifier_in_text(pdfinfo[key],func_validate)
  File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/finders.py", line 286, in find_identifier_in_text
    for identifier in identifiers:
TypeError: 'NoneType' object is not iterable

I thought I fixed this error by adding:

    if identifiers is None:
        identifiers = []

at line 286 of your "finders.py", so that it becomes:

        #First we look for DOI
        for v in range(len(doi_regexp)):
            identifiers = extract_doi_from_text(text,version=v)
            if identifiers is None: # <- here
                identifiers = [] # <- here
            for identifier in identifiers:
                validation = func_validate(identifier,'doi')
                if validation: 
                    return identifier, 'DOI', validation

But this was a bit hacky and not the proper solution. You'd undoubtedly know more about what's going on, so I thought I'd let you know about this.

And by the way, there are some deprecated calls that you might want to address:

UserWarning: Page.extractText is deprecated and will be removed in PyPDF2 2.0.0. Use Page.extract_text instead. [_page.py:1003]
UserWarning: addMetadata is deprecated and will be removed in PyPDF2 2.0.0. Use add_metadata instead. [_writer.py:793]
UserWarning: Page.extractText is deprecated and will be removed in PyPDF2 2.0.0. Use Page.extract_text instead. [_page.py:1003]

cheers,
Don

doi2bib function

Hi, Michele,

Nice work! I am trying to extract the BibTeX strings into a .txt file but noticed that you have removed the bibTex_makers module from v1.1. Could you suggest how I can achieve this with the current version?

Thanks!

Pdf reading from file object rather than from path

Hello,

Amazing tool, I love it. Is there a way to feed pdf2doi a file object rather than an absolute path? I am asking because I am trying to modify an app deployed on Google Cloud services to incorporate pdf2doi, but I can't find a way that doesn't involve downloading the files to a local machine, which would be mildly inconvenient. The PDF files are stored on Google Cloud, and it would be more elegant to open them as file objects and manipulate them than to download them locally, run pdf2doi and re-upload the info.

Thank you very much for your work!
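As long as only paths are accepted, one workaround is to spill the file object into a temporary file and hand over that path. This still touches the local disk, so it is a stopgap rather than the requested feature. A sketch under those assumptions; `run_on_file_object` and `process_path` are hypothetical names:

```python
import io
import os
import tempfile


def run_on_file_object(file_obj, process_path):
    """Copy a binary file-like object to a temporary .pdf and pass its path to process_path."""
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(file_obj.read())
        tmp_path = tmp.name
    try:
        return process_path(tmp_path)
    finally:
        os.remove(tmp_path)  # clean up even if process_path raises


# Illustration with an in-memory stream and a stand-in callable that reports the file size.
print(run_on_file_object(io.BytesIO(b"%PDF-1.4"), os.path.getsize))
```

Here `process_path` could be `pdf2doi.pdf2doi`, which the snippets in these issues call with a path string.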

File not closed

Hi,

The function pdf2doi_singlefile does not close the opened pdf file. The close file statement is not executed due to return statements on successful identifier finding.

pdf2doi/pdf2doi/main.py

Lines 161 to 165 in d8e7117

    if result['identifier']:
        return result

    if flag_closefile:
        file.close()

This causes the issue with pdf-rename on Windows. The renaming attempt fails with an access error because the file is still held open by the script itself.

I'll make a PR shortly to fix this.
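A `try`/`finally` block is one way to make sure the file is closed on every exit path, including the early return described above. This is only a sketch of the pattern, not the fix from the pull request; `first_identifier` and the `finders` argument are hypothetical:

```python
import io


def first_identifier(file, finders, close_file=True):
    """Try each finder in turn and always close the file before returning."""
    try:
        for finder in finders:
            result = finder(file)
            if result["identifier"]:
                return result  # the finally clause still runs on this early return
        return {"identifier": None}
    finally:
        if close_file:
            file.close()


# Illustration with an in-memory stream and a stand-in finder.
stream = io.BytesIO(b"%PDF-1.4")
found = first_identifier(stream, [lambda fh: {"identifier": "10.1000/example"}])
print(found["identifier"], stream.closed)
```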

All arXiv articles now have DOIs

Apparently the arXiv blog has announced that as of Feb 2022 all arxiv articles have DOIs.

Furthermore, the DOIs share a unified prefix and use the arXiv ID as a suffix:

"An author can determine their article’s DOI by using the DOI prefix https://doi.org/10.48550/ followed by the arXiv ID (replacing the colon with a period). For example, the arXiv ID arXiv:2202.01037 will translate to the DOI link https://doi.org/10.48550/arXiv.2202.01037"

Perhaps this could be grounds for a 2.0 release that returns DOIs for arXiv articles.
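The rule quoted above is simple enough to sketch directly. `arxiv_id_to_doi` is a hypothetical helper, not an existing pdf2doi function:

```python
def arxiv_id_to_doi(arxiv_id):
    """Build the DOI that arXiv assigns to an identifier such as 'arXiv:2202.01037'."""
    bare = arxiv_id.strip()
    # Drop an optional 'arXiv:' prefix, case-insensitively.
    if bare.lower().startswith("arxiv:"):
        bare = bare[len("arxiv:"):]
    return f"10.48550/arXiv.{bare}"


print(arxiv_id_to_doi("arXiv:2202.01037"))  # 10.48550/arXiv.2202.01037
```

Version suffixes such as `v2` are passed through unchanged; whether they should be stripped is not addressed here.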

Option for disabling document text method

First, thanks for this very helpful library!

For many of the papers I read, your algorithm works fine and finds the correct DOI.
But as you already mention in the README, for some papers the document_text method returns a wrong DOI because the DOIs of other papers appear first.
Unfortunately, this is very often the case for papers from certain conferences I read often, as they contain arXiv IDs in the references and do not contain their own DOI anywhere else in the text. At the same time, when I comment out the document_text method, I get pretty good results with the fourth method.
I am wondering if one of the following features might help to reduce this type of error:

- only using the first pages to look for a DOI in the text
- having an option to disable certain steps in the search process
- being able to customize the order of the search methods

Do you think one of these options (or something else) is something the library would benefit from and that can be implemented with reasonable effort? If so, I can see if I find the time to turn my current "comment-out workaround" into a mergeable feature.
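The second and third options could share one mechanism: keep the search methods in a name-to-function mapping and let the caller pass an order and a set of disabled names. A rough sketch of that idea, not of the library's internals; `find_identifier` here is a hypothetical standalone function, and the finder callables are assumed to return a falsy value when they find nothing:

```python
def find_identifier(path, finders, order=None, disabled=()):
    """Try finder callables by name in the requested order, skipping disabled ones."""
    names = list(order) if order is not None else list(finders)
    for name in names:
        if name in disabled:
            continue
        result = finders[name](path)
        if result:
            return name, result
    return None, None


# Stand-in finders for illustration; "fourth_method" is a placeholder name.
finders = {
    "document_infos": lambda path: None,
    "document_text": lambda path: "10.1000/wrong",
    "fourth_method": lambda path: "10.1000/right",
}
print(find_identifier("paper.pdf", finders, disabled={"document_text"}))
```

Disabling `document_text` in this toy setup lets the later method win, which mirrors the comment-out workaround described above.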

bibtex entries: the current tag is [author]_[year], should be changed to [author][year][firstwordtitle]
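A key in the proposed format could be built as sketched below. `make_bibtex_key` is hypothetical, and the lowercasing and stripping of non-alphanumeric characters are assumptions that the issue does not specify:

```python
import re


def make_bibtex_key(author_lastname, year, title):
    """Build a citation key of the form [author][year][firstwordtitle]."""
    words = re.findall(r"[A-Za-z0-9]+", title)
    first_word = words[0].lower() if words else ""
    author = re.sub(r"[^A-Za-z0-9]", "", author_lastname).lower()
    return f"{author}{year}{first_word}"


# Hypothetical entry, for illustration only.
print(make_bibtex_key("Doe", 2020, "Extracting identifiers from PDFs"))  # doe2020extracting
```

Titles that start with an article such as "The" would produce keys ending in "the"; skipping such words is left open here.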

Export/Save to CSV? Import From CSV?

May be a bit involved, but my hunch is that a factory pattern could be used to allow for either importing info from a directory OR a row in a CSV.

Similarly, it'd be great if this exported to a CSV/not just the console. This I've done in my own program. As I've mentioned to you already, however, it's had some problems (namely Pandas isn't happy with some of the keys I've given it).
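For the export side, the standard `csv` module may be enough and avoids the Pandas key problems mentioned above. A minimal sketch, assuming each result is a dict with the `identifier_type`, `identifier` and `path` keys printed by the command-line tool; `results_to_csv` is a hypothetical helper:

```python
import csv
import io


def results_to_csv(results, stream, fields=("identifier_type", "identifier", "path")):
    """Write one CSV row per result dict, ignoring any keys not listed in fields."""
    writer = csv.DictWriter(stream, fieldnames=list(fields), extrasaction="ignore")
    writer.writeheader()
    for result in results:
        writer.writerow(result)


# Illustration with an in-memory buffer and a hypothetical result row.
buffer = io.StringIO()
results_to_csv(
    [{"identifier_type": "DOI", "identifier": "10.1000/example", "path": "paper.pdf", "extra": 1}],
    buffer,
)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(name, "w", newline="")` as `stream`.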

bibtex entries: {\"u} becomes Ã¼
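The characters Ã¼ are what appears when the UTF-8 bytes of "ü" are decoded as Latin-1. That is one plausible cause of this symptom, though the issue does not confirm where the wrong decoding happens. The sketch below reproduces and reverses the effect; both function names are hypothetical:

```python
def show_mojibake(text):
    """Encode text as UTF-8 and wrongly decode it as Latin-1, reproducing the garbling."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(garbled):
    """Reverse the mistake: re-encode as Latin-1, then decode as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


print(show_mojibake("ü"))
print(repair_mojibake(show_mojibake("ü")))
```

If this is the cause, decoding the downloaded BibTeX bytes explicitly as UTF-8 would avoid the problem at the source.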

Add file DOI check to URL paths

Often a search can surface DOI descriptors in the URL path alone, for instance:

[pdf2doi]: Performing google search with key "The Experimental Generation of Interpersonal Closeness: A Procedure and Some Preliminary Findings"
[pdf2doi]: and looking at the first 6 results...
[pdf2doi]: Looking for a valid identifier in the search result #1 : https://journals.sagepub.com/doi/pdf/10.1177/0146167297234003
[pdf2doi]: Looking for a valid identifier in the search result #2 : https://journals.sagepub.com/doi/abs/10.1177/0146167297234003

Supporting this would give quicker identifications, but also allow for occasions, such as this, where the DOI can't be extracted from the actual page.

https://doi.org/10.1177/0146167297234003
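Checking the URL itself before fetching the page could be sketched as follows. The regular expression is a loose, illustrative DOI pattern and not the one pdf2doi uses; `doi_from_url` is a hypothetical helper:

```python
import re
from urllib.parse import unquote, urlparse

# Loose DOI pattern; an assumption for illustration, not the library's own regular expression.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s?#]+")


def doi_from_url(url):
    """Return the first DOI-like string found in the URL path, or None."""
    path = unquote(urlparse(url).path)
    match = DOI_PATTERN.search(path)
    return match.group(0) if match else None


print(doi_from_url("https://journals.sagepub.com/doi/pdf/10.1177/0146167297234003"))
```

A candidate found this way would still need to be validated like any other, since URL paths can contain trailing segments that are not part of the DOI.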

AttributeError: 'function' object has no attribute 'config'

in google colab

Not importing for Python 3.10

This is likely known/to be expected, but upon upgrading python to 3.10, pdf2doi no longer imports for VSCode on Mac.
