michelecotrufo / pdf2doi
A python library/command-line tool to extract the DOI or other identifiers of a scientific paper from a pdf file.
I use a number of other PDF extraction tools, such as tabula, camelot, and layoutparser, and it seems that pdf2doi depends on an older version of pdfminer.six, which causes problems when it coexists with these libraries. When installing with pip in the same environment in which I use layoutparser and camelot, I get this error:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pdfplumber 0.6.0 requires pdfminer.six==20211012, but you have pdfminer-six 20181108 which is incompatible.
google-api-core 1.31.5 requires six>=1.13.0, but you have six 1.12.0 which is incompatible.
camelot-py 0.10.1 requires pdfminer.six>=20200726, but you have pdfminer-six 20181108 which is incompatible.
Is there a workaround to this problem?
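One common workaround is to install pdf2doi into its own virtual environment so its pdfminer.six pin cannot clash with layoutparser/camelot. To first confirm which versions actually ended up in a shared environment, the standard library can report the installed versions of the conflicting packages; a diagnostic sketch, nothing pdf2doi-specific:

```python
from importlib.metadata import version, PackageNotFoundError

# Print the installed version of each package involved in the conflict,
# or note that it is missing from this environment.
for pkg in ("pdfminer.six", "pdfplumber", "camelot-py", "pdf2doi"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```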
Hello,
this is a very nifty tool, but I get the following results on a test case I set up containing a bundle of random PDFs from my library.
[pdf2doi]: ................
DOI 10.1109/MS.2018.2141038 ./[email protected]
DOI 10.1145/3341227 ./10.1145@3341227 MUST and MUST NOT.pdf
DOI 10.1145/38807.38824 ./120158- Use Case Template-20160821_0954877.pdf
DOI 10.1016/j.jss.2016.02.047 ./120216- Software Requirements Specification Template-20160821_0951179.pdf
DOI 10.1007/978-3-319-09816-6 ./2014_Book_Autonomy Requirements Engineering for Space Missions NASA Springer.pdf
DOI 10.1007/978-1-4614-5377-2 ./293233main_62651main_1_pmchallenge_hraster.pdf
The first result is pretty cool: the DOI was extracted from the filename. The 1st, 2nd, and 5th are correct; the rest are false. The last one, specifically, is close to the target, but I have yet to understand how: the file is a presentation that does not mention the extracted DOI, but it has similar contents.
Sincerely
There appears to be a TypeError in "finders.py" that only emerges on certain PDF files. This one, for example:
paper12.2009_unknown_040916_440842.pdf
A minimum code snippet for reproducing this error:
from pathlib import Path
import pdf2doi
pdf2doi.config.set("verbose", False)
PDF_name = "paper12.2009_unknown_040916_440842.pdf"
results = pdf2doi.pdf2doi(str(Path("examples", PDF_name)))
Here, the PDF is placed in the "examples" folder.
Here is the error message:
Traceback (most recent call last):
File "/Users/donyin/Desktop/pdf2doi-master/main.py", line 15, in <module>
results = pdf2doi.pdf2doi(str(Path("examples", i)))
File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/main.py", line 90, in pdf2doi
result = pdf2doi_singlefile(filename)
File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/main.py", line 134, in pdf2doi_singlefile
result = finders.find_identifier(file,method="document_infos",keysToCheckFirst=['/doi','/identfier'])
File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/finders.py", line 548, in find_identifier
identifier, desc, info = finder_methods[method](file,func_validate,**kwargs)
File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/finders.py", line 586, in find_identifier_in_pdf_info
identifier,desc,info = find_identifier_in_text(pdfinfo[key],func_validate)
File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/finders.py", line 286, in find_identifier_in_text
for identifier in identifiers:
TypeError: 'NoneType' object is not iterable
I thought I fixed this error by adding:
if identifiers is None:
identifiers = []
at line 286 of your "finders.py", so that it becomes:
#First we look for DOI
for v in range(len(doi_regexp)):
    identifiers = extract_doi_from_text(text, version=v)
    if identifiers is None:  # <- here
        identifiers = []     # <- here
    for identifier in identifiers:
        validation = func_validate(identifier, 'doi')
        if validation:
            return identifier, 'DOI', validation
But this was a bit hacky and not the proper solution. You'd undoubtedly know more about what's going on, so I thought I'd let you know about this.
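For what it's worth, an equivalent but slightly more compact guard is to coalesce the None with `or []`, assuming the function only ever returns a list or None. A self-contained sketch, with a hypothetical stand-in for `extract_doi_from_text`:

```python
def extract_doi_from_text(text, version=0):
    # hypothetical stand-in for the library's function, which can return None
    return None

text = "some page text with no identifier"
# `or []` turns a None result into an empty list, so the loop below is a no-op
identifiers = extract_doi_from_text(text) or []
for identifier in identifiers:
    print(identifier)
```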
And by the way, there are some deprecated calls that you might want to address:
UserWarning: Page.extractText is deprecated and will be removed in PyPDF2 2.0.0. Use Page.extract_text instead. [_page.py:1003]
UserWarning: addMetadata is deprecated and will be removed in PyPDF2 2.0.0. Use add_metadata instead. [_writer.py:793]
UserWarning: Page.extractText is deprecated and will be removed in PyPDF2 2.0.0. Use Page.extract_text instead. [_page.py:1003]
cheers,
Don
Is it possible to add Proxy functionality? I am blocked from using the web requests without it. The code itself looks great.
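Until such a feature exists, the standard-library pattern it would need looks like this: route outgoing HTTP through a proxy with urllib's ProxyHandler. A minimal sketch; the proxy address is a placeholder, not a real endpoint:

```python
import urllib.request

# Placeholder proxy address: replace with your actual proxy.
proxies = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}
proxy_handler = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(proxy_handler)
# opener.open("https://example.com")  # this request would now go through the proxy
```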
First of all, thanks for the awesome tool! It has saved me lots of time during my bibliography/SOTA runs and when batch-renaming hundreds of PDF files for easier indexing.
Now, to the point:
a) Some background: I disabled Google searching (Methods #4 and #5), as they rarely worked on the old/no-DOI papers in my field (I am an electromagnetics engineer, working with journals from IEEE, OSA/Optica, AIP, APS, etc.). It's faster for me to open the PDF file with Chrome, select the title, right-click it to Google-search it, and get the DOI. To pass this DOI to pdf2doi, I currently rename the file using the DOI as the name string (replacing slashes with dashes), and then right-click it with PDF_renamer; done. So, it works with Method #2.
b) The suggestion: I sometimes also copy the DOI (as a URL or as a plain DOI, slashes and all) to the top of the first page for easier reference, either as a text annotation ("typewriter tool") or inside a bubble/note/comment annotation. Could pdf2doi be made to look for the DOI in these first-page annotations, e.g., during Method #3? It would be really handy (for me)...
Thanks for your time!
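The scanning half of that suggestion is cheap once the PDF library has pulled the /Contents strings out of the first-page annotations. A sketch of the lookup step, assuming the annotation texts are already extracted into a plain list; the regex is a deliberately loose DOI pattern, not the library's own:

```python
import re

# Loose DOI pattern; pdf2doi's own regexes are more thorough.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"']+")

def doi_from_annotations(annotation_texts):
    """Return the first DOI-looking string found in a list of
    annotation contents (text boxes, notes, comments), else None."""
    for text in annotation_texts:
        match = DOI_RE.search(text)
        if match:
            return match.group(0)
    return None
```

For example, `doi_from_annotations(["see https://doi.org/10.1177/0146167297234003"])` picks the DOI out of the pasted URL.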
This is likely known/to be expected, but after upgrading Python to 3.10, pdf2doi no longer imports in VS Code on Mac.
First, thanks for this very helpful library!
For many of the papers I read, your algorithm works fine and finds the correct DOI. But, as you already mention in the README, for some papers the document_text method yields a wrong DOI, because the DOIs of other papers appear first in the text.
Unfortunately, this is very often the case for papers from certain conferences I read frequently, as they contain arXiv IDs in their references and do not contain their own DOI anywhere else in the text. At the same time, when I comment out the document_text method, I get pretty good results with the fourth method.
I am wondering whether one of the following features might help to reduce this type of error:
Do you think one of these options (or something else) would benefit the library and could be implemented with reasonable effort? If so, I can see whether I find the time to turn my current "comment-out" workaround into a mergeable feature.
Traceback (most recent call last):
File "/Users/johnfallot/venv/210706_PDN_ScienceAssistant_v16.py", line 3, in <module>
from pdf2doi.finders import validate
File "/usr/local/lib/python3.9/site-packages/pdf2doi/__init__.py", line 13, in <module>
from .main import pdf2doi
File "/usr/local/lib/python3.9/site-packages/pdf2doi/main.py", line 6, in <module>
import pdf2doi.utils_registry as utils_registry
File "/usr/local/lib/python3.9/site-packages/pdf2doi/utils_registry.py", line 5, in <module>
import winreg
ModuleNotFoundError: No module named 'winreg'
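The usual fix for a Windows-only module like winreg is to guard the import by platform, so non-Windows systems simply skip the registry logic. A minimal sketch of the pattern; `read_registry_value` is a hypothetical name, not part of pdf2doi:

```python
import sys

# winreg only exists on Windows; skip the import everywhere else.
if sys.platform == "win32":
    import winreg
else:
    winreg = None

def read_registry_value():
    """Hypothetical helper: returns None where the registry is unavailable."""
    if winreg is None:
        return None
    # actual winreg lookups would go here (Windows only)
    return None
```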
Currently, lines 425 to 428 in main read:
for result in results:
    if result['identifier']:
        print('{:<15s} {:<40s} {:<10s}\n'.format(result['identifier_type'], result['identifier'], result['path']))
return
I suspect this could be improved/updated with an f-string:
for result in results:
    if result['identifier']:
        print(f"{result['identifier_type']:<15s} {result['identifier']:<40s} {result['path']:<10s}\n")
return
(Double quotes avoid clashing with the single-quoted keys, and the alignment specifiers carry over unchanged.)
It may be a bit involved, but my hunch is that a factory pattern could be used to allow importing info either from a directory or from a row of a CSV.
Similarly, it would be great if this exported to a CSV rather than just the console. I've done this in my own program; as I've mentioned to you already, however, it has had some problems (namely, pandas isn't happy with some of the keys I've given it).
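On the CSV-export side, the result dictionaries already map cleanly onto csv.DictWriter; a minimal sketch, writing to an in-memory buffer here (a file path works the same way), with made-up example results in the shape pdf2doi prints:

```python
import csv
import io

# Made-up example results with the keys used in main's print loop.
results = [
    {"identifier_type": "DOI", "identifier": "10.1000/demo", "path": "a.pdf"},
    {"identifier_type": "arxiv ID", "identifier": "2202.01037", "path": "b.pdf"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["identifier_type", "identifier", "path"])
writer.writeheader()
writer.writerows(results)
print(buffer.getvalue())
```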
AttributeError: 'function' object has no attribute 'config'
in Google Colab
"Doing a google search, looking at the first None results..."
Hi, Michele,
Nice work! I am trying to extract the bibTex strings into a .txt file, but I noticed that you have removed the bibTex_makers module in v1.1. Could you suggest how I can achieve this with the current version?
Thanks!
Hello,
Amazing tool, I love it. Is there a way to feed pdf2doi a file object rather than an absolute path? I'm asking because I'm trying to modify an app deployed on Google Cloud to incorporate pdf2doi, but I can't find a way that doesn't involve downloading the files to the local machine, which would be mildly inconvenient. The PDF files are stored on Google Cloud, and it would be more elegant to open them as file objects and manipulate them directly, rather than download each one, run pdf2doi, and re-upload the info.
Thank you very much for your work!
Hi,
The function pdf2doi_singlefile does not close the opened PDF file: the close statement is never reached, because of the return statements that fire when an identifier is found.
Lines 161 to 165 in d8e7117
This causes an issue with pdf-rename on Windows: the renaming attempt results in an access error, as the file is still held open by the script itself.
I'll make a PR shortly to fix this.
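For reference, the standard shape of such a fix is to wrap the open in a `with` block, which closes the file on every exit path, including early returns. A self-contained sketch of the pattern (the finder lambdas and the function name are placeholders, not pdf2doi's actual code):

```python
def find_identifier_and_close(path):
    """Open the file, try finder methods in order, and return on the
    first hit; the `with` block closes the file even on early return."""
    finders = [lambda f: None, lambda f: "10.1000/demo"]  # placeholder methods
    with open(path, "rb") as file:
        for finder in finders:
            identifier = finder(file)
            if identifier:
                return identifier  # file is still closed here
    return None
```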
Often a search can surface DOI descriptors in the URL path alone, for instance:
[pdf2doi]: Performing google search with key "The Experimental Generation of Interpersonal Closeness: A Procedure and Some Preliminary Findings"
[pdf2doi]: and looking at the first 6 results...
[pdf2doi]: Looking for a valid identifier in the search result #1 : https://journals.sagepub.com/doi/pdf/10.1177/0146167297234003
[pdf2doi]: Looking for a valid identifier in the search result #2 : https://journals.sagepub.com/doi/abs/10.1177/0146167297234003
Supporting this would give quicker identification, and it would also cover occasions, such as this one, where the DOI can't be extracted from the actual page.
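The URLs above contain the DOI verbatim, so checking each search-result URL before fetching the page would be a cheap pre-step. A sketch of that check; the pattern is a simplified version of the commonly used Crossref-style DOI regex:

```python
import re

# Simplified DOI pattern, matching the identifier inside a URL path.
DOI_IN_URL = re.compile(r"10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+")

def doi_from_url(url):
    """Return a DOI-looking substring of the URL, or None."""
    match = DOI_IN_URL.search(url)
    return match.group(0) if match else None
```

For example, `doi_from_url("https://journals.sagepub.com/doi/pdf/10.1177/0146167297234003")` returns the DOI without fetching the page at all.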
Apparently the arXiv blog has announced that, as of February 2022, all arXiv articles have DOIs. Furthermore, the DOIs share a unified prefix, with the arXiv ID as the suffix:
"An author can determine their article’s DOI by using the DOI prefix https://doi.org/10.48550/ followed by the arXiv ID (replacing the colon with a period). For example, the arXiv ID arXiv:2202.01037 will translate to the DOI link https://doi.org/10.48550/arXiv.2202.01037"
Perhaps this could be grounds for a 2.0 release that returns DOIs for arXiv articles.
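Following the rule quoted above, the translation is a one-liner; a sketch (the function name is mine):

```python
ARXIV_DOI_PREFIX = "10.48550/"

def arxiv_id_to_doi(arxiv_id):
    """Translate e.g. 'arXiv:2202.01037' to '10.48550/arXiv.2202.01037',
    per the quoted rule: prepend the prefix, replace the colon with a period."""
    return ARXIV_DOI_PREFIX + arxiv_id.replace(":", ".")

print(arxiv_id_to_doi("arXiv:2202.01037"))  # 10.48550/arXiv.2202.01037
```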
The url field of each bibtex entry has some of the "/" escaped to %2F
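If the %2F comes from percent-encoding in the source URL, it can be undone with urllib.parse.unquote before writing the bibtex entry; a sketch with a made-up DOI:

```python
from urllib.parse import unquote

url = "https://doi.org/10.1000%2Fdemo.123"  # "/" escaped as %2F
print(unquote(url))  # https://doi.org/10.1000/demo.123
```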
It is an amazing tool, but when I search a new paper (1.pdf), I get the wrong DOI (10.1002/2016jc012583). Does anyone know the reason?
In fact, its DOI is on the first page; wouldn't it be faster to directly query the first page?
A file in the list wasn't decrypted, and so pdf2doi returned this error. Ideally, it should log a warning that the file is encrypted and then skip over it.
[pdf2doi]: Trying to retrieve a DOI/identifier for the file: ...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
Traceback (most recent call last):
File "/usr/local/bin/pdf2doi", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.9/site-packages/pdf2doi/main.py", line 410, in main
results = pdf2doi(target=target,
File "/usr/local/lib/python3.9/site-packages/pdf2doi/main.py", line 112, in pdf2doi
result = pdf2doi( target=file, verbose=verbose, websearch=websearch, webvalidation=webvalidation,
File "/usr/local/lib/python3.9/site-packages/pdf2doi/main.py", line 147, in pdf2doi
result = pdf2doi_singlefile(filename)
File "/usr/local/lib/python3.9/site-packages/pdf2doi/main.py", line 190, in pdf2doi_singlefile
result = finders.find_identifier(filename,method="document_infos",keysToCheckFirst=['/doi','/identfier'])
File "/usr/local/lib/python3.9/site-packages/pdf2doi/finders.py", line 487, in find_identifier
identifier, desc, info = finder_methods[method](file,func_validate,**kwargs)
File "/usr/local/lib/python3.9/site-packages/pdf2doi/finders.py", line 587, in find_identifier_in_pdf_info
pdfinfo = get_pdf_info(path)
File "/usr/local/lib/python3.9/site-packages/pdf2doi/finders.py", line 275, in get_pdf_info
info = pdf.getDocumentInfo()
File "/usr/local/lib/python3.9/site-packages/PyPDF2/pdf.py", line 1101, in getDocumentInfo
obj = self.trailer['/Info']
File "/usr/local/lib/python3.9/site-packages/PyPDF2/generic.py", line 516, in __getitem__
return dict.__getitem__(self, key).getObject()
File "/usr/local/lib/python3.9/site-packages/PyPDF2/generic.py", line 178, in getObject
return self.pdf.getObject(self).getObject()
File "/usr/local/lib/python3.9/site-packages/PyPDF2/pdf.py", line 1617, in getObject
raise utils.PdfReadError("file has not been decrypted")
PyPDF2.utils.PdfReadError: file has not been decrypted
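The warn-and-skip behavior could be sketched like this, assuming a PyPDF2 1.x-style reader object that exposes an `isEncrypted` attribute (as the API in the traceback does); `get_info_or_skip` is a hypothetical name, not pdf2doi's:

```python
import logging

logger = logging.getLogger("pdf2doi")

def get_info_or_skip(pdf, filename):
    """Return the document info, or None (with a warning) when the
    file is encrypted and was not decrypted first."""
    if getattr(pdf, "isEncrypted", False):
        logger.warning("%s is encrypted and cannot be read; skipping.", filename)
        return None
    return pdf.getDocumentInfo()
```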