
unarXive's Introduction

unarXive

Access

Documentation

Data

unarXive schema

unarXive contains

  • 1.9 M structured paper full-texts, containing
    • 63 M references (28 M linked to OpenAlex)
    • 134 M in-text citation markers (65 M linked)
    • 9 M figure captions
    • 2 M table captions
    • 742 M pieces of mathematical notation preserved as LaTeX

Comprehensive documentation of the data format can be found here.

You can find a data sample here.

Usage

Hugging Face Datasets

If you want to use unarXive for citation recommendation or IMRaD classification, you can simply use our Hugging Face datasets:

For example, in the case of citation recommendation:

from datasets import load_dataset

citrec_data = load_dataset('saier/unarxive_citrec')
citrec_data = citrec_data.class_encode_column('label')  # assign target label column
citrec_data = citrec_data.remove_columns('_id')         # remove sample ID column
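
And analogously for IMRaD classification (the dataset ID below is an assumption following the same naming scheme; check the Hugging Face hub for the exact name):

from datasets import load_dataset

# assumed dataset ID, analogous to the citation recommendation example above
imrad_data = load_dataset('saier/unarxive_imrad_clf')
imrad_data = imrad_data.class_encode_column('label')  # assign target label column
imrad_data = imrad_data.remove_columns('_id')         # remove sample ID column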

Development

For instructions on how to re-create or extend unarXive, see src/.

Versions

Development Status

See issues.

Cite as

Current version

@inproceedings{Saier2023unarXive,
  author        = {Saier, Tarek and Krause, Johan and F\"{a}rber, Michael},
  title         = {{unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network}},
  booktitle     = {2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL)},
  year          = {2023},
  pages         = {66--70},
  month         = jun,
  doi           = {10.1109/JCDL57899.2023.00020},
  publisher     = {IEEE Computer Society},
  address       = {Los Alamitos, CA, USA},
}

Initial publication

@article{Saier2020unarXive,
  author        = {Saier, Tarek and F{\"{a}}rber, Michael},
  title         = {{unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata}},
  journal       = {Scientometrics},
  year          = {2020},
  volume        = {125},
  number        = {3},
  pages         = {3085--3108},
  month         = dec,
  issn          = {1588-2861},
  doi           = {10.1007/s11192-020-03382-z}
}


unarXive's Issues

Accessing actual figure image files

Hello,

Is there a way to access the actual figure image files using, for instance, their ID?

{'c575cbb5-2504-4327-aa59-d1e6c97c0a53': {'caption': 'Quantum trajectories for harmonic oscillators. In each case, the oscillation period \\tau = 2\\pi/\\omega = 888.57 au. In cases A & D, m = 2000 au, while in case B, m = 200 au. Case D is a set of classical trajectories (Q = 0) for this system.',
  'type': 'figure'}}

Thanks !
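
(For context, a minimal sketch of looking up such captions by their ID, assuming each paper's JSON keeps them in a ref_entries mapping as in the excerpt above; 'paper.json' is a hypothetical file name:)

import json

# minimal sketch: collect figure captions by ID from one paper's JSON
# (assumes captions live in a 'ref_entries' mapping as in the excerpt above)
with open('paper.json') as f:
    paper = json.load(f)

figures = {ref_id: entry['caption']
           for ref_id, entry in paper.get('ref_entries', {}).items()
           if entry.get('type') == 'figure'}
print(figures.get('c575cbb5-2504-4327-aa59-d1e6c97c0a53'))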

Full dataset approximate size

Hello, I didn't see a discussion tab, so I will open an "issue". What's the approximate size in GB or TB of the full processed dataset? Thank you!

How to separate the context sentences and the main citation sentence?

Hi,

This is an issue about the structure of the context.csv:

It seems that context.csv puts the context sentences and the main citation sentence together without any delimiter, but I want to run some experiments that need to separate and encode the sentences individually.

By the way, do all the context strings include three sentences? What if the main citation sentence is the first or last sentence?
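
(A minimal workaround sketch, assuming the in-text citation marker string, e.g. {{cite:<uuid>}}, can be located in the context; the sentence splitting here is deliberately naive:)

import re

# naive sentence splitter; a trained splitter (e.g. NLTK's punkt) would be more robust
def split_sentences(text):
    return re.split(r'(?<=[.!?])\s+', text)

# returns (sentences before, sentence containing the marker, sentences after);
# if the marker sentence is first or last, the respective side is an empty list
def isolate_citation_sentence(context, marker):
    sentences = split_sentences(context)
    for i, sent in enumerate(sentences):
        if marker in sent:
            return sentences[:i], sent, sentences[i + 1:]
    return [], context, []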

Error in paper structure

This dataset is very helpful for NLP research in the scientific domain.

When I checked the parsed paper structure, I found some errors in the section structure.
For the paper "2212.00253" in this dataset, the subsection "Deep Reinforcement Learning" is actually in section 2.
However, the parsed result shows that the subsection "Deep Reinforcement Learning" is in section 1.

[screenshot of the parsed section structure]

The section information in the PDF file: [screenshot]

The reason might be that the section 2 heading text "BACKGROUND" has no paragraph of its own, so it is lost during the TeX file processing.

About citation matching

Dear developers,
I am now using your unarXive dataset for my project. However, I have found it hard to match a paper with its citing papers. To be more specific, many papers' bib_entries don't contain much information related to the cited papers, and most of them only have the 'bib_entry_raw' field. I first constructed a list containing all the papers' titles. Then I looped through all the papers' bib_entries. In each loop, I scanned the list to see whether a certain paper's title appears in the string of the cited paper's bib_entry_raw. However, some bib_entry_raw strings don't contain the cited papers' titles but have other information, such as venue or year of publication, making it difficult to match papers.
Could you please shed some light on how to match a paper with its citing papers? Your reply is highly appreciated!
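
(For reference, a minimal sketch of the matching loop described above; the bib_entries and bib_entry_raw field names follow the data format, while paper_id and the titles list are illustrative:)

# minimal sketch of the title-in-raw-string matching described above
# titles: list of (paper_id, title) pairs collected beforehand (illustrative)
def match_bib_entries(papers, titles):
    matches = {}
    for paper in papers:
        for bib_id, entry in paper['bib_entries'].items():
            raw = entry.get('bib_entry_raw', '').lower()
            for cited_id, title in titles:
                if title.lower() in raw:
                    matches[(paper['paper_id'], bib_id)] = cited_id
                    break
    return matches

Note that bib entries which already carry an ids field can be resolved directly through their OpenAlex link, without any string search.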

Is the data open source?

Hi, it is great work to create such a comprehensive, LaTeX-based paper dataset! May I ask whether the whole dataset is open source, and where I can download it if so? Or is only the extraction code open source?

Thanks!

For some papers, references are only matched up to part of the bib_entries list

An example is the first line in arXiv_src_2105_034.jsonl (paper 2105.05883), for which only the first 14 out of 20 entries in bib_entries were extended with an ids field (i.e. processed by the matching script).

Notes:

  • The reason is most likely an overly coarse-grained try/catch block in the matching script.
  • Should be retroactively salvageable by running the matching selectively for “skipped” references (a detection sketch follows below).
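
(A minimal sketch for finding such partially matched papers; the bib_entries and ids field names follow the data format, paper_id is assumed:)

import json

# minimal sketch: report papers whose bib_entries were only partly matched,
# i.e. some entries carry an 'ids' field and some do not
with open('arXiv_src_2105_034.jsonl') as f:
    for line in f:
        paper = json.loads(line)
        entries = list(paper['bib_entries'].values())
        matched = sum(1 for e in entries if 'ids' in e)
        if 0 < matched < len(entries):
            print(paper['paper_id'], f'{matched}/{len(entries)} matched')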

Handling of footnotes

Putting this here as an open question: how should footnotes best be handled?

  • Current state:

    • figures, tables, citation markers, LaTeX math mode: replaced by a marker with an ID; content/caption saved separately from the text
    • references (sec/fig/tab/...): replaced by a REF token
    • footnotes: treated as text when converting a paragraph node in the intermediate XML to plain text (this leads to unintuitive results like <normal text><footnote text><normal text> without any indication that the footnote text was in a footnote)
  • Possible ways to treat footnotes

    • introduce a new marker {{footnote:<uuid>}} and save the footnote text separately (see the sketch below)
    • keep it in the text but put it in brackets
    • ...
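
(A minimal sketch of the first option, mirroring the existing marker scheme; all names are illustrative, not the actual implementation:)

import uuid

# sketch of option 1: replace a footnote's text with a {{footnote:<uuid>}} marker
# and store the text separately, analogous to figures/tables/formulas
def extract_footnote(paragraph_text, footnote_text, footnotes):
    fn_id = str(uuid.uuid4())
    footnotes[fn_id] = {'text': footnote_text, 'type': 'footnote'}
    return paragraph_text.replace(footnote_text, '{{footnote:%s}}' % fn_id, 1)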

Dataset sample

Hi all,

thank you very much for making this dataset public!

Could you please provide a small sample of the data (max. 100MB)? That would make it much easier to get started with the dataset.

Best,
malte

FORMULAS

Hi, congrats on the very nice work!

I saw that the formulas are not downloaded, and I think many would be interested in that part of the articles. If you don't have access to the original LaTeX, you could consider using this tool

https://mathpix.com/

to extract the formulas from any format, e.g. from the PDF of the article.

Replicate for recent data from arXiv and OpenAlex

Hi!
Thanks for open-sourcing the code!
I would like to replicate the data for a small dataset of recently published work.
I have a list of recent arXiv IDs. How do I align them with the OpenAlex IDs? I would also like to extract the citation graph and align it with the arXiv IDs to get the parsed full content.

Is there an easy way to do this without downloading huge amounts of data (6 TB arXiv and 300 GB OpenAlex) for papers published, say, in August 2023?
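
(One lightweight possibility, sketched below: arXiv papers carry DataCite DOIs of the form 10.48550/arXiv.<id>, and the OpenAlex REST API can resolve a work directly from a DOI URL, so a small list of IDs can be aligned without downloading any dump:)

import requests

# minimal sketch: resolve an arXiv ID to its OpenAlex work via the REST API
def openalex_work_for_arxiv_id(arxiv_id):
    url = ('https://api.openalex.org/works/'
           f'https://doi.org/10.48550/arXiv.{arxiv_id}')
    resp = requests.get(url)
    return resp.json() if resp.status_code == 200 else None

work = openalex_work_for_arxiv_id('2105.05883')
if work:
    # 'id' is the OpenAlex ID; 'referenced_works' gives citation graph edges
    print(work['id'], len(work.get('referenced_works', [])))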

PDF version not specified

The PDF version is not specified in the metadata.
When multiple versions exist, it is hard to determine the correct PDF.

How can I get OpenAlex dump files?

I am trying to reproduce the dataset. Following the instructions in /src, I have finished step 1.
In step 2, which calls generate_openalex_db.py, line 87 reads:
input_dir_openalex_works_files = r'/opt/unarXive_2022/openalex/openalex-works-2022-11-28/*'
Where can I get those .gz dumps? Did you miss some steps between 1 and 2?
Thanks!
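
(For reference, the OpenAlex snapshot is publicly hosted on S3; below is a minimal sketch of listing the works dump files anonymously, assuming the bucket layout s3://openalex/data/works/:)

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# minimal sketch: list the OpenAlex works dump files from the public S3 bucket
# (assumes the layout s3://openalex/data/works/updated_date=*/part_*.gz)
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='openalex', Prefix='data/works/'):
    for obj in page.get('Contents', []):
        print(obj['Key'])  # download with s3.download_file('openalex', obj['Key'], ...)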

dataset

Does the full data set on Zenodo include the year 2023?
