
Comments (24)

kyleclo avatar kyleclo commented on June 27, 2024

@shizhediao To clarify, the pretraining corpus is formatted like:

Sent1 from Paper1
Sent2 from Paper1
...
Sent500 from Paper1

Sent1 from Paper2
Sent2 from Paper2
...

with a newline separating individual documents.

Sorry I wasn't being clear about paragraphs. Please ignore that comment; it's a very minor detail pertaining to how S2ORC is distributed w/ paragraphs, and in hindsight more confusing than following the format I've pasted above.
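A minimal sketch of producing the format above, assuming `papers` is already a list of sentencized documents (file and variable names here are hypothetical, not from the repo):

```python
# Write the pretraining corpus format described above:
# one sentence per line, a blank line between documents.
papers = [
    ["Sent1 from Paper1", "Sent2 from Paper1"],
    ["Sent1 from Paper2", "Sent2 from Paper2"],
]

with open("pretrain_corpus.txt", "w") as f:
    for sentences in papers:
        for sentence in sentences:
            f.write(sentence + "\n")
        f.write("\n")  # blank line separates individual documents
```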

from dont-stop-pretraining.

stevezheng23 avatar stevezheng23 commented on June 27, 2024

@kernelmachine / @amarasovic , I tried to filter the docs with 'grobid_parse' and 'latex_parse' fields based on 'mag_fos' in metadata, but got more than 2.68M docs for the 'BioMed' domain and only ~600K docs for the 'CS' domain, which is less than the 2.22M mentioned in the paper. Is there any additional information I'm missing, or do you have other tricks for generating the unlabeled corpus?


kernelmachine avatar kernelmachine commented on June 27, 2024

thanks for submitting this! @kyleclo, can you help?


kyleclo avatar kyleclo commented on June 27, 2024

Hey @stevezheng23, sorry about the lack of clarity here. The current public release of S2ORC only contains papers that could be released while adhering to strict copyright regulations. Unfortunately, we had finished LM pretraining experiments with an earlier version of S2ORC that contained substantially more papers before learning about this. Many of those papers are unfortunately not available in S2ORC currently -- it'll take more negotiation, and we can't promise when/if it'll happen. Thanks for catching this -- we'll update the paper to make this clearer.


stevezheng23 avatar stevezheng23 commented on June 27, 2024

@kyleclo thanks a lot for the explanation!


shizhediao avatar shizhediao commented on June 27, 2024

@kyleclo thanks a lot for the explanation!

Hi, would you be able to share the dataset after preprocessing?
Thanks!


kyleclo avatar kyleclo commented on June 27, 2024

@shizhediao It looks like you already requested download access to S2ORC. Are you looking for the script for converting that into the format for pretraining?


shizhediao avatar shizhediao commented on June 27, 2024

@shizhediao It looks like you already requested download access to S2ORC. Are you looking for the script for converting that into the format for pretraining?

Hi, what does "the script" here refer to? Do you mean the example provided at https://github.com/allenai/s2orc?

```python
import json
.....
if paper_id in paper_id_to_pdf_parse:
    # (1) get the full pdf parse from the previously computed lookup dict
    pdf_parse = paper_id_to_pdf_parse[paper_id]

    # (2) pull out fields we need from the pdf parse, including bibliography & text
    bib_entries = pdf_parse['bib_entries']
    paragraphs = pdf_parse['abstract'] + pdf_parse['body_text']

    # (3) loop over paragraphs, grabbing citation contexts
    for paragraph in paragraphs:

        # (4) loop over each inline citation in this paragraph
        for cite_span in paragraph['cite_spans']:

            # (5) each inline citation can be resolved to a bib entry
            cited_bib_entry = bib_entries[cite_span['ref_id']]

            # (6) that bib entry *may* be linked to a S2ORC paper.  if so, grab paragraph
            linked_paper_id = cited_bib_entry['link']
            if linked_paper_id:
                citation_contexts.append({
                    'citing_paper_id': paper_id,
                    'cited_paper_id': linked_paper_id,
                    'context': paragraph['text'],
                    'citation_mention_start': cite_span['start'],
                    'citation_mention_end': cite_span['end'],
                })
```


shizhediao avatar shizhediao commented on June 27, 2024

@shizhediao It looks like you already requested download access to S2ORC. Are you looking for the script for converting that into the format for pretraining?

If so: actually, I have checked this example, and I tried to filter the dataset into a pretraining corpus by adding these conditions:

```python
if not ("Medicine" in metadata_dict['mag_field_of_study'] and "Biology" in metadata_dict['mag_field_of_study']):
    continue
if metadata_dict["has_pdf_parse"] == False:
    continue
if metadata_dict["has_pdf_parse"] == True and metadata_dict["has_pdf_parsed_body_text"] == False:
    continue
```

Am I correct?


kyleclo avatar kyleclo commented on June 27, 2024

That's right. And if you wanted the computer science subset, you can switch the mag_field_of_study tag to "Computer Science".

As for the has_pdf_parsed_body_text tag, I think it's fine to include papers that don't have body text as long as they also have abstracts.
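Putting these two comments together, a hedged sketch of the resulting filter. The field names `mag_field_of_study`, `has_pdf_parse`, and `has_pdf_parsed_body_text` are the ones used in this thread; the `has_pdf_parsed_abstract` flag for the abstract-only case is my assumption about the metadata schema, so check it against your S2ORC release:

```python
def keep_paper(metadata_dict, fields=("Biology", "Medicine")):
    """Filter sketch: keep papers in the target domains that have a PDF
    parse with at least a body text or an abstract."""
    fos = metadata_dict.get("mag_field_of_study") or []
    if not any(f in fos for f in fields):
        return False
    if not metadata_dict.get("has_pdf_parse"):
        return False
    # per the comment above: abstract-only papers are fine to keep
    return bool(metadata_dict.get("has_pdf_parsed_body_text")
                or metadata_dict.get("has_pdf_parsed_abstract"))
```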


shizhediao avatar shizhediao commented on June 27, 2024

That's right. And if you wanted the computer science subset, you can switch the mag_field_of_study tag to "Computer Science".

As for the has_pdf_parsed_body_text tag, I think it's fine to include papers that don't have body text as long as they also have abstracts.

OK, thanks!
One last question: does 'BioMed' mean that both 'Biology' and 'Medicine' are in the mag_field_of_study at the same time?
I think it needs 'bio and med' instead of 'bio or med'?
Just want to make sure.


kyleclo avatar kyleclo commented on June 27, 2024

The mag_field_of_study field is a list of strings that looks like: "mag_field_of_study": ["Biology", "Medicine"]

A paper can be Bio or Medicine or both.

For the pretraining experiment, we allowed papers that were Bio-only or Medicine-only or both
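The selection rule described above (Biology-only, Medicine-only, or both) amounts to a set-overlap check rather than an `and`; a minimal sketch:

```python
def is_biomed(mag_field_of_study):
    # keep the paper if its fields overlap {"Biology", "Medicine"} at all,
    # i.e. Biology or Medicine or both
    return bool(set(mag_field_of_study or []) & {"Biology", "Medicine"})
```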


shizhediao avatar shizhediao commented on June 27, 2024

Got it.
Thanks!


shizhediao avatar shizhediao commented on June 27, 2024

The mag_field_of_study field is a list of strings that looks like: "mag_field_of_study": ["Biology", "Medicine"]

A paper can be Bio or Medicine or both.

For the pretraining experiment, we allowed papers that were Bio-only or Medicine-only or both

Hi Kyle,
May I ask whether there were any extra data cleaning steps performed?
Or did you just use the raw text from S2ORC?


kyleclo avatar kyleclo commented on June 27, 2024

We pulled text from the abstract and body_text fields, when available, and preserved paragraph breaks. That is, we didn't allow sequences to be built up from sentences in separate paragraphs. The abstract is its own paragraph, and the body_text should already be divided into paragraphs. Since some paragraphs are longer than the max sequence length for RoBERTa, we used ScispaCy to pre-sentencize everything so there was a single sentence per line. This made it possible to use the --line-by-line flag for Huggingface.

Didn't do any other processing besides this. RoBERTa pretraining is pretty robust to even really poorly-formatted text parsed from paper PDFs.
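The steps Kyle describes can be sketched as follows. This uses a naive regex splitter as a stand-in for the ScispaCy sentencizer the authors actually used; the point is only the output shape: one sentence per line, with a blank line preserving each paragraph break:

```python
import re

def sentencize(paragraph):
    # naive stand-in for ScispaCy sentence segmentation:
    # split after ., !, or ? followed by whitespace
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', paragraph) if s.strip()]

def paper_to_lines(abstract_paragraphs, body_paragraphs):
    # one sentence per line; a blank line marks each paragraph break,
    # so no sequence is built from sentences of separate paragraphs
    lines = []
    for paragraph in abstract_paragraphs + body_paragraphs:
        lines.extend(sentencize(paragraph))
        lines.append("")
    return lines
```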


shizhediao avatar shizhediao commented on June 27, 2024

OK, Thanks!
Have you encountered the RAM MemoryError problem?
Because there is no lazy data loader in the Huggingface library right now, my 128 GB of RAM could not load all 48 GB of data.
Could you provide any hints about how to deal with a large dataset?
Thanks!
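One generic workaround (not something from this thread's authors, just a common pattern) is to stream the corpus lazily with a generator instead of reading it all into memory at once; the demo file name below is hypothetical:

```python
def iter_lines(path):
    # stream the corpus one line at a time instead of holding it all in RAM
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")

# tiny demo file standing in for the 48 GB corpus
with open("demo_corpus.txt", "w") as f:
    f.write("Sent1\nSent2\n\n")

lines_read = list(iter_lines("demo_corpus.txt"))
```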


shizhediao avatar shizhediao commented on June 27, 2024

Since there are paragraphs longer than the max sequence length for RoBERTa, we used ScispaCy to pre-sentencize everything so it was a single sentence per line

By the way, I don't quite understand what that means. I have checked the ScispaCy repo but could not find anything related to this operation.
Could you provide an example and point out which function in ScispaCy performs this operation?

For example, if the paragraph is

"Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR). SBMA can be caused by this easily."

What output do you want to get via ScispaCy and how?


kyleclo avatar kyleclo commented on June 27, 2024

Sentencization means the output will look like:

```
# first sentence
"Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR)."

# second sentence
"SBMA can be caused by this easily."
```

The code looks something like:

```python
import spacy

nlp = spacy.load('en_core_sci_sm')  # ScispaCy model
text = "..."
for sentence in nlp(text).sents:
    ...  # do something with sentence.text
```

As for debugging your memory issues with Huggingface code, that might be best handled by opening an Issue on their library's GitHub repo.


shizhediao avatar shizhediao commented on June 27, 2024

Thanks!
So you treat a sentence as the unit, right?
After sentencization, every sentence will be one line. Are there blank lines between sentences from different paragraphs?


shizhediao avatar shizhediao commented on June 27, 2024

when available and preserved paragraph breaks. That is, we didn't allow sequences to be built up consisting of sentences from separate paragraphs

Hi Kyle,
I didn't get why you preserve paragraph breaks, or how to do that.
Why: in my understanding, RoBERTa does not employ the NSP task, so I think we do not need to preserve paragraph breaks.
How: from your comments, I understand that you split a paragraph into lines, each line being a sentence from the paragraph. I was wondering whether there is a blank line between sentences from different paragraphs? In my understanding, if we want to preserve the breaks, we need to add a blank line?

Thanks!


shizhediao avatar shizhediao commented on June 27, 2024

OK, Got it!
Thanks for your time and patience.
I agree with you that it's a very minor detail. Now I fully understand it.
Thanks again!


shizhediao avatar shizhediao commented on June 27, 2024

Hi @kyleclo
I was wondering whether you will use the unlabeled valid/test sets in TAPT?


kyleclo avatar kyleclo commented on June 27, 2024

@shizhediao can you create a new issue for this? It makes it easier for others to search for answered questions. thanks


shizhediao avatar shizhediao commented on June 27, 2024

Yes, sure. Thanks for pointing that out; here is the link:
#17

