
Comments (24)

kyleclo avatar kyleclo commented on June 27, 2024

@shizhediao To clarify, the pretraining corpus is formatted like:

Sent1 from Paper1
Sent2 from Paper1
...
Sent500 from Paper1

Sent1 from Paper2
Sent2 from Paper2
...

with a newline separating individual documents.

Sorry I wasn't being clear about paragraphs. Please ignore that comment; it's a very minor detail pertaining to how S2ORC is distributed w/ paragraphs, and in hindsight more confusing than following the format I've pasted above.
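A minimal sketch of producing the format above, assuming `papers` is already a list of sentencized documents (file and variable names here are hypothetical, not from the repo):

```python
# Write the pretraining corpus format described above:
# one sentence per line, a blank line between documents.
papers = [
    ["Sent1 from Paper1", "Sent2 from Paper1"],
    ["Sent1 from Paper2", "Sent2 from Paper2"],
]

with open("pretrain_corpus.txt", "w") as f:
    for sentences in papers:
        for sentence in sentences:
            f.write(sentence + "\n")
        f.write("\n")  # blank line separates individual documents
```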

from dont-stop-pretraining.

stevezheng23 avatar stevezheng23 commented on June 27, 2024

@kernelmachine / @amarasovic , I tried to filter the docs with 'grobid_parse' and 'latex_parse' fields based on 'mag_fos' in metadata, but got more than 2.68M docs for the 'BioMed' domain and only ~600K docs for the 'CS' domain, which is less than the 2.22M mentioned in the paper. Is there any additional information I'm missing, or do you have other tricks for generating the unlabeled corpus?


kernelmachine avatar kernelmachine commented on June 27, 2024

thanks for submitting this! @kyleclo, can you help?


kyleclo avatar kyleclo commented on June 27, 2024

Hey @stevezheng23, sorry about the lack of clarity here. The current public release of S2ORC only contains papers that could be released while adhering to strict copyright regulations. Unfortunately, we had finished LM pretraining experiments with an earlier version of S2ORC that contained substantially more papers before learning about this. Many of those papers are unfortunately not available in S2ORC currently -- it'll take more negotiation, and we can't promise when/if it'll happen. Thanks for catching this -- we'll update the paper to make this clearer.


stevezheng23 avatar stevezheng23 commented on June 27, 2024

@kyleclo thanks a lot for the explanation!


shizhediao avatar shizhediao commented on June 27, 2024

@kyleclo thanks a lot for the explanation!

Hi, would you be able to share the dataset after preprocessing?
Thanks!


kyleclo avatar kyleclo commented on June 27, 2024

@shizhediao It looks like you already requested download access to S2ORC. Are you looking for the script for converting that into the format for pretraining?


shizhediao avatar shizhediao commented on June 27, 2024

@shizhediao It looks like you already requested download access to S2ORC. Are you looking for the script for converting that into the format for pretraining?

Hi, what does "the script" here refer to? Do you mean the example provided at https://github.com/allenai/s2orc?

```python
import json
.....
if paper_id in paper_id_to_pdf_parse:
    # (1) get the full pdf parse from the previously computed lookup dict
    pdf_parse = paper_id_to_pdf_parse[paper_id]

    # (2) pull out fields we need from the pdf parse, including bibliography & text
    bib_entries = pdf_parse['bib_entries']
    paragraphs = pdf_parse['abstract'] + pdf_parse['body_text']

    # (3) loop over paragraphs, grabbing citation contexts
    for paragraph in paragraphs:

        # (4) loop over each inline citation in this paragraph
        for cite_span in paragraph['cite_spans']:

            # (5) each inline citation can be resolved to a bib entry
            cited_bib_entry = bib_entries[cite_span['ref_id']]

            # (6) that bib entry *may* be linked to a S2ORC paper.  if so, grab paragraph
            linked_paper_id = cited_bib_entry['link']
            if linked_paper_id:
                citation_contexts.append({
                    'citing_paper_id': paper_id,
                    'cited_paper_id': linked_paper_id,
                    'context': paragraph['text'],
                    'citation_mention_start': cite_span['start'],
                    'citation_mention_end': cite_span['end'],
                })
```


shizhediao avatar shizhediao commented on June 27, 2024

@shizhediao It looks like you already requested download access to S2ORC. Are you looking for the script for converting that into the format for pretraining?

If so: actually, I have checked this example, and I tried to filter the dataset into a pretraining corpus by adding these conditions:

```python
if not ("Medicine" in metadata_dict['mag_field_of_study'] and "Biology" in metadata_dict['mag_field_of_study']):
    continue
if metadata_dict["has_pdf_parse"] == False:
    continue
if metadata_dict["has_pdf_parse"] == True and metadata_dict["has_pdf_parsed_body_text"] == False:
    continue
```

Am I correct?


kyleclo avatar kyleclo commented on June 27, 2024

That's right. And if you wanted the computer science subset, you can switch the mag_field_of_study tag to "Computer Science".

As for the has_pdf_parsed_body_text tag, I think it's fine to include papers that don't have body text as long as they also have abstracts.
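Putting these two comments together, a hedged sketch of the resulting filter. The field names `mag_field_of_study`, `has_pdf_parse`, and `has_pdf_parsed_body_text` are the ones used in this thread; the `has_pdf_parsed_abstract` flag for the abstract-only case is my assumption about the metadata schema, so check it against your S2ORC release:

```python
def keep_paper(metadata_dict, fields=("Biology", "Medicine")):
    """Filter sketch: keep papers in the target domains that have a PDF
    parse with at least a body text or an abstract."""
    fos = metadata_dict.get("mag_field_of_study") or []
    if not any(f in fos for f in fields):
        return False
    if not metadata_dict.get("has_pdf_parse"):
        return False
    # per the comment above: abstract-only papers are fine to keep
    return bool(metadata_dict.get("has_pdf_parsed_body_text")
                or metadata_dict.get("has_pdf_parsed_abstract"))
```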


shizhediao avatar shizhediao commented on June 27, 2024

That's right. And if you wanted the computer science subset, you can switch the mag_field_of_study tag to "Computer Science".

As for the has_pdf_parsed_body_text tag, I think it's fine to include papers that don't have body text as long as they also have abstracts.

OK, thanks!
One last question: does 'BioMed' mean that both 'Biology' and 'Medicine' are in the mag_field_of_study at the same time?
I think it needs 'bio and med' instead of 'bio or med'?
Just want to make sure.


kyleclo avatar kyleclo commented on June 27, 2024

The mag_field_of_study field is a list of strings that looks like: "mag_field_of_study": ["Biology", "Medicine"]

A paper can be Bio or Medicine or both.

For the pretraining experiment, we allowed papers that were Bio-only or Medicine-only or both
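The selection rule described above (Biology-only, Medicine-only, or both) amounts to a set-overlap check rather than an `and`; a minimal sketch:

```python
def is_biomed(mag_field_of_study):
    # keep the paper if its fields overlap {"Biology", "Medicine"} at all,
    # i.e. Biology or Medicine or both
    return bool(set(mag_field_of_study or []) & {"Biology", "Medicine"})
```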


shizhediao avatar shizhediao commented on June 27, 2024

Got it.
Thanks!


shizhediao avatar shizhediao commented on June 27, 2024

The mag_field_of_study field is a list of strings that looks like: "mag_field_of_study": ["Biology", "Medicine"]

A paper can be Bio or Medicine or both.

For the pretraining experiment, we allowed papers that were Bio-only or Medicine-only or both

Hi Kyle,
May I ask whether there were any extra data cleaning steps performed?
Or did you just use the raw text from S2ORC?


kyleclo avatar kyleclo commented on June 27, 2024

We pulled text from the abstract and body_text fields, when available, and preserved paragraph breaks. That is, we didn't allow sequences to be built up from sentences in separate paragraphs. The abstract is its own paragraph, and the body_text should already be divided into paragraphs. Since some paragraphs are longer than the max sequence length for RoBERTa, we used ScispaCy to pre-sentencize everything so there was a single sentence per line. This made it possible to use the --line-by-line flag for Huggingface.

Didn't do any other processing besides this. RoBERTa pretraining is pretty robust to even really poorly-formatted text parsed from paper PDFs.
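The steps Kyle describes can be sketched as follows. This uses a naive regex splitter as a stand-in for the ScispaCy sentencizer the authors actually used; the point is only the output shape: one sentence per line, with a blank line preserving each paragraph break:

```python
import re

def sentencize(paragraph):
    # naive stand-in for ScispaCy sentence segmentation:
    # split after ., !, or ? followed by whitespace
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', paragraph) if s.strip()]

def paper_to_lines(abstract_paragraphs, body_paragraphs):
    # one sentence per line; a blank line marks each paragraph break,
    # so no sequence is built from sentences of separate paragraphs
    lines = []
    for paragraph in abstract_paragraphs + body_paragraphs:
        lines.extend(sentencize(paragraph))
        lines.append("")
    return lines
```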


shizhediao avatar shizhediao commented on June 27, 2024

OK, Thanks!
Have you encountered the RAM MemoryError problem?
Because there is no lazy data loader in the Huggingface library right now, my 128 GB of RAM could not load all 48 GB of data.
Could you provide any hints about how to deal with a large dataset?
Thanks!
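One generic workaround (not something from this thread's authors, just a common pattern) is to stream the corpus lazily with a generator instead of reading it all into memory at once; the demo file name below is hypothetical:

```python
def iter_lines(path):
    # stream the corpus one line at a time instead of holding it all in RAM
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")

# tiny demo file standing in for the 48 GB corpus
with open("demo_corpus.txt", "w") as f:
    f.write("Sent1\nSent2\n\n")

lines_read = list(iter_lines("demo_corpus.txt"))
```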


shizhediao avatar shizhediao commented on June 27, 2024

Since there are paragraphs longer than the max sequence length for RoBERTa, we used ScispaCy to pre-sentencize everything so it was a single sentence per line

By the way, I don't quite understand what that means. I have checked the ScispaCy repo but could not find anything related to this operation.
Could you provide an example and point out which function in ScispaCy performs this operation?

For example, if the paragraph is

"Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR). SBMA can be caused by this easily."

What output do you want to get via ScispaCy and how?


kyleclo avatar kyleclo commented on June 27, 2024

Sentencization means the output will look like:

```
# first sentence
"Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR)."

# second sentence
"SBMA can be caused by this easily."
```

The code looks something like:

```python
import spacy

nlp = spacy.load('en_core_sci_sm')  # ScispaCy model
text = "..."
for sentence in nlp(text).sents:
    ...  # do something with sentence.text
```

As for debugging your memory issues with Huggingface code, that might be best handled by opening an Issue on their library's GitHub repo.


shizhediao avatar shizhediao commented on June 27, 2024

Thanks!
So you treat a sentence as the unit, right?
After sentencization, every sentence will be one line. Are there blank lines between sentences from different paragraphs?


shizhediao avatar shizhediao commented on June 27, 2024

when available and preserved paragraph breaks. That is, we didn't allow sequences to be built up consisting of sentences from separate paragraphs

Hi Kyle,
I didn't get why you preserve paragraph breaks, or how to do that.
Why: in my understanding, RoBERTa does not employ the NSP task, so I think we do not need to preserve paragraph breaks.
How: from your comments, I understand that you split a paragraph into lines, each line being a sentence from the paragraph. I was wondering whether there is a blank line between sentences from different paragraphs? In my understanding, if we want to preserve the breaks, we need to add a blank line?

Thanks!


shizhediao avatar shizhediao commented on June 27, 2024

OK, Got it!
Thanks for your time and patience.
I agree with you that it's a very minor detail. Now I fully understand it.
Thanks again!


shizhediao avatar shizhediao commented on June 27, 2024

Hi @kyleclo
I was wondering whether you will use the unlabeled valid/test sets in TAPT?


kyleclo avatar kyleclo commented on June 27, 2024

@shizhediao can you create a new issue for this? It makes it easier for others to search for answered questions. thanks


shizhediao avatar shizhediao commented on June 27, 2024

Yes, sure. Thanks for pointing that out; here is the link:
#17

