
unarXive's Introduction

unarXive

Access

Documentation

Data

unarXive schema

unarXive contains

  • 1.9 M structured paper full-texts, containing
    • 63 M references (28 M linked to OpenAlex)
    • 134 M in-text citation markers (65 M linked)
    • 9 M figure captions
    • 2 M table captions
    • 742 M pieces of mathematical notation preserved as LaTeX

Comprehensive documentation of the data format can be found here.

You can find a data sample here.

Usage

Hugging Face Datasets

If you want to use unarXive for citation recommendation or IMRaD classification, you can simply use our Hugging Face datasets:

For example, in the case of citation recommendation:

from datasets import load_dataset

citrec_data = load_dataset('saier/unarxive_citrec')
citrec_data = citrec_data.class_encode_column('label')  # assign target label column
citrec_data = citrec_data.remove_columns('_id')         # remove sample ID column
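
And analogously for IMRaD classification (the dataset ID below is an assumption following the same naming scheme; check the Hugging Face hub for the exact name):

from datasets import load_dataset

# assumed dataset ID, analogous to the citation recommendation example above
imrad_data = load_dataset('saier/unarxive_imrad_clf')
imrad_data = imrad_data.class_encode_column('label')  # assign target label column
imrad_data = imrad_data.remove_columns('_id')         # remove sample ID column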

Development

For instructions on how to re-create or extend unarXive, see src/.

Versions

Development Status

See issues.

Cite as

Current version

@inproceedings{Saier2023unarXive,
  author        = {Saier, Tarek and Krause, Johan and F\"{a}rber, Michael},
  title         = {{unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network}},
  booktitle     = {2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL)},
  year          = {2023},
  pages         = {66--70},
  month         = jun,
  doi           = {10.1109/JCDL57899.2023.00020},
  publisher     = {IEEE Computer Society},
  address       = {Los Alamitos, CA, USA},
}

Initial publication

@article{Saier2020unarXive,
  author        = {Saier, Tarek and F{\"{a}}rber, Michael},
  title         = {{unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata}},
  journal       = {Scientometrics},
  year          = {2020},
  volume        = {125},
  number        = {3},
  pages         = {3085--3108},
  month         = dec,
  issn          = {1588-2861},
  doi           = {10.1007/s11192-020-03382-z}
}


unarXive's Issues

Accessing actual figure image files

Hello,

Is there a way to access the actual figure image files using, for instance, their ID?

{'c575cbb5-2504-4327-aa59-d1e6c97c0a53': {'caption': 'Quantum trajectories for harmonic oscillators. In each case, the oscillation period \\tau = 2\\pi/\\omega = 888.57 au. In cases A & D, m = 2000 au, while in case B, m = 200 au. Case D is a set of classical trajectories (Q = 0) for this system.',
  'type': 'figure'}}

Thanks !
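
(For context, a minimal sketch of looking up such captions by their ID, assuming each paper's JSON keeps them in a ref_entries mapping as in the excerpt above; 'paper.json' is a hypothetical file name:)

import json

# minimal sketch: collect figure captions by ID from one paper's JSON
# (assumes captions live in a 'ref_entries' mapping as in the excerpt above)
with open('paper.json') as f:
    paper = json.load(f)

figures = {ref_id: entry['caption']
           for ref_id, entry in paper.get('ref_entries', {}).items()
           if entry.get('type') == 'figure'}
print(figures.get('c575cbb5-2504-4327-aa59-d1e6c97c0a53'))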

Full dataset approximate size

Hello, I didn't see a discussion tab, so I will open an "issue". What's the approximate size in GB or TB of the full processed dataset? Thank you!

How to separate the context sentences and the main citation sentence?

Hi,

This is an issue about the structure of the context.csv:

It seems that context.csv puts the context sentences and the main citation sentence together without any delimiter, but I want to run some experiments that need to separate and encode the sentences individually.

By the way, do all the context strings include three sentences? What if the main citation sentence is the first or last sentence?
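
(A minimal workaround sketch, assuming the in-text citation marker string, e.g. {{cite:<uuid>}}, can be located in the context; the sentence splitting here is deliberately naive:)

import re

# naive sentence splitter; a trained splitter (e.g. NLTK's punkt) would be more robust
def split_sentences(text):
    return re.split(r'(?<=[.!?])\s+', text)

# returns (sentences before, sentence containing the marker, sentences after);
# if the marker sentence is first or last, the respective side is an empty list
def isolate_citation_sentence(context, marker):
    sentences = split_sentences(context)
    for i, sent in enumerate(sentences):
        if marker in sent:
            return sentences[:i], sent, sentences[i + 1:]
    return [], context, []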

Error in paper structure

This dataset is very helpful for NLP research in the scientific domain.

When I checked the parsed paper structure, I found some errors in the section structure.
For the paper "2212.00253" in this dataset, the subsection "Deep Reinforcement Learning" is actually in section 2.
However, the parsed result shows that the subsection "Deep Reinforcement Learning" is in section 1.

[screenshot of the parsed section structure]

The section information in the PDF file: [screenshot]

The reason might be that the section 2 heading text "BACKGROUND" has no paragraph of its own, so it is lost during the TeX file processing.

About citation matching

Dear developers,
I am now using your unarXive dataset for my project. However, I have found it hard to match a paper with its citing papers. To be more specific, many papers' bib_entries don't contain much information related to the cited papers, and most of them only have the 'bib_entry_raw' field. I first constructed a list containing all the papers' titles. Then I looped through all the papers' bib_entries. In each loop, I scanned the list to see whether a certain paper's title appears in the string of the cited paper's bib_entry_raw. However, some bib_entry_raw strings don't contain the cited papers' titles but have other information, such as venue or year of publication, making it difficult to match papers.
Could you please shed some light on how to match a paper with its citing papers? Your reply is highly appreciated!
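
(For reference, a minimal sketch of the matching loop described above; the bib_entries and bib_entry_raw field names follow the data format, while paper_id and the titles list are illustrative:)

# minimal sketch of the title-in-raw-string matching described above
# titles: list of (paper_id, title) pairs collected beforehand (illustrative)
def match_bib_entries(papers, titles):
    matches = {}
    for paper in papers:
        for bib_id, entry in paper['bib_entries'].items():
            raw = entry.get('bib_entry_raw', '').lower()
            for cited_id, title in titles:
                if title.lower() in raw:
                    matches[(paper['paper_id'], bib_id)] = cited_id
                    break
    return matches

Note that bib entries which already carry an ids field can be resolved directly through their OpenAlex link, without any string search.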

Is the data open source?

Hi, it is great work to create such a comprehensive, LaTeX-based paper dataset! May I ask whether the whole dataset is open source, and where I can download it if so? Or is only the extraction code open source?

Thanks!

For some papers, references are only matched up to part of the bib_entries list

An example is the first line in arXiv_src_2105_034.jsonl (paper 2105.05883), for which only the first 14 out of 20 entries in bib_entries were extended with an ids field (i.e. processed by the matching script).

Notes:

  • The reason is most likely an overly coarse-grained try/catch block in the matching script.
  • Should be retroactively salvageable by running the matching selectively for “skipped” references (a detection sketch follows below).
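
(A minimal sketch for finding such partially matched papers; the bib_entries and ids field names follow the data format, paper_id is assumed:)

import json

# minimal sketch: report papers whose bib_entries were only partly matched,
# i.e. some entries carry an 'ids' field and some do not
with open('arXiv_src_2105_034.jsonl') as f:
    for line in f:
        paper = json.loads(line)
        entries = list(paper['bib_entries'].values())
        matched = sum(1 for e in entries if 'ids' in e)
        if 0 < matched < len(entries):
            print(paper['paper_id'], f'{matched}/{len(entries)} matched')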

Handling of footnotes

Putting this here as an open question: how should footnotes best be handled?

  • Current state:

    • figures, tables, citation markers, LaTeX math mode: replaced by a marker with an ID; content/caption saved separately from the text
    • references (sec/fig/tab/...): replaced by a REF token
    • footnotes: treated as text when converting a paragraph node in the intermediate XML to plain text (this leads to unintuitive results like <normal text><footnote text><normal text> without any indication that the footnote text was in a footnote)
  • Possible ways to treat footnotes

    • introduce a new marker {{footnote:<uuid>}} and save the footnote text separately (see the sketch below)
    • keep it in the text but put it in brackets
    • ...
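
(A minimal sketch of the first option, mirroring the existing marker scheme; all names are illustrative, not the actual implementation:)

import uuid

# sketch of option 1: replace a footnote's text with a {{footnote:<uuid>}} marker
# and store the text separately, analogous to figures/tables/formulas
def extract_footnote(paragraph_text, footnote_text, footnotes):
    fn_id = str(uuid.uuid4())
    footnotes[fn_id] = {'text': footnote_text, 'type': 'footnote'}
    return paragraph_text.replace(footnote_text, '{{footnote:%s}}' % fn_id, 1)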

Dataset sample

Hi all,

thank you very much for making this dataset public!

Could you please provide a small sample of the data (max. 100MB)? That would make it much easier to get started with the dataset.

Best,
malte

FORMULAS

Hi, congrats on the very nice work!

I saw that the formulas are not downloaded, and I think many would be interested in that part of the articles. If you don't have access to the original LaTeX, you could consider using this tool

https://mathpix.com/

to extract the formulas from any format, e.g. from the PDF of the article.

Replicate for recent data from arXiv and OpenAlex

Hi!
Thanks for open-sourcing the code!
I would like to replicate the data for a small dataset of recently published work.
I have a list of recent arXiv IDs. How do I align them with the OpenAlex IDs? I would also like to extract the citation graph and align it with the arXiv IDs to get the parsed full content.

Is there an easy way to do this without downloading huge amounts of data (6 TB arXiv and 300 GB OpenAlex) for papers published, say, in August 2023?
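
(One lightweight possibility, sketched below: arXiv papers carry DataCite DOIs of the form 10.48550/arXiv.<id>, and the OpenAlex REST API can resolve a work directly from a DOI URL, so a small list of IDs can be aligned without downloading any dump:)

import requests

# minimal sketch: resolve an arXiv ID to its OpenAlex work via the REST API
def openalex_work_for_arxiv_id(arxiv_id):
    url = ('https://api.openalex.org/works/'
           f'https://doi.org/10.48550/arXiv.{arxiv_id}')
    resp = requests.get(url)
    return resp.json() if resp.status_code == 200 else None

work = openalex_work_for_arxiv_id('2105.05883')
if work:
    # 'id' is the OpenAlex ID; 'referenced_works' gives citation graph edges
    print(work['id'], len(work.get('referenced_works', [])))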

PDF version not specified

The PDF version is not specified in the metadata.
When multiple versions exist, it is hard to determine the correct PDF.

How can I get OpenAlex dump files?

I am trying to reproduce the dataset. Following the instructions in /src, I have finished step 1.
In step 2, which calls generate_openalex_db.py, line 87 reads:
input_dir_openalex_works_files = r'/opt/unarXive_2022/openalex/openalex-works-2022-11-28/*'
Where can I get those .gz dumps? Did you miss some steps between 1 and 2?
Thanks!
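
(For reference, the OpenAlex snapshot is publicly hosted on S3; below is a minimal sketch of listing the works dump files anonymously, assuming the bucket layout s3://openalex/data/works/:)

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# minimal sketch: list the OpenAlex works dump files from the public S3 bucket
# (assumes the layout s3://openalex/data/works/updated_date=*/part_*.gz)
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='openalex', Prefix='data/works/'):
    for obj in page.get('Contents', []):
        print(obj['Key'])  # download with s3.download_file('openalex', obj['Key'], ...)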

dataset

Does the full data set on Zenodo include the year 2023?
