Comments (24)
@shizhediao To clarify, the pretraining corpus is formatted like:
```
Sent1 from Paper1
Sent2 from Paper1
...
Sent500 from Paper1

Sent1 from Paper2
Sent2 from Paper2
...
```
with a newline (a blank line) separating individual documents.
Sorry I wasn't being clear about paragraphs. Please ignore that comment; it's a very minor detail pertaining to how S2ORC is distributed w/ paragraphs, and in hindsight more confusing than following the format I've pasted above.
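For concreteness, here is a minimal sketch (not the authors' actual script) of writing that format, assuming each paper has already been split into a list of sentences; the `papers` data and the output filename are purely illustrative:

```python
# Hypothetical example data: each inner list holds the sentences of one paper.
papers = [
    ["Sent1 from Paper1", "Sent2 from Paper1"],
    ["Sent1 from Paper2", "Sent2 from Paper2"],
]

# Illustrative output path; one sentence per line, blank line between documents.
with open("pretraining_corpus.txt", "w") as f:
    for sentences in papers:
        for sentence in sentences:
            f.write(sentence + "\n")
        f.write("\n")  # the extra newline separates individual documents
```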
@kernelmachine / @amarasovic, I tried to filter the docs with 'grobid_parse' and 'latex_parse' fields based on 'mag_fos' in the metadata, but got more than 2.68M docs for the 'BioMed' domain and only ~600K docs for the 'CS' domain, which is fewer than the 2.22M mentioned in the paper. Is there any additional information I'm missing, or do you have other tricks for generating the unlabeled corpus?
Thanks for submitting this! @kyleclo, can you help?
Hey @stevezheng23, sorry about the lack of clarity here. The current public release of S2ORC only contains papers that could be released while adhering to strict copyright regulations. Unfortunately, we had finished the LM pretraining experiments with an earlier version of S2ORC that contained substantially more papers before learning about this. Many of those papers are not available in S2ORC currently -- it'll take more negotiation, and we can't promise when/if it'll happen. Thanks for catching this -- we'll update the paper to make this clearer.
@kyleclo thanks a lot for the explanation!
Hi, would you like to share the dataset after preprocessing?
Thanks!
@shizhediao It looks like you already requested download access to S2ORC. Are you looking for the script for converting that into the format for pretraining?
Hi, which script do you mean here? Is it the example provided at https://github.com/allenai/s2orc?
```python
import json
...
if paper_id in paper_id_to_pdf_parse:
    # (1) get the full pdf parse from the previously computed lookup dict
    pdf_parse = paper_id_to_pdf_parse[paper_id]
    # (2) pull out fields we need from the pdf parse, including bibliography & text
    bib_entries = pdf_parse['bib_entries']
    paragraphs = pdf_parse['abstract'] + pdf_parse['body_text']
    # (3) loop over paragraphs, grabbing citation contexts
    for paragraph in paragraphs:
        # (4) loop over each inline citation in this paragraph
        for cite_span in paragraph['cite_spans']:
            # (5) each inline citation can be resolved to a bib entry
            cited_bib_entry = bib_entries[cite_span['ref_id']]
            # (6) that bib entry *may* be linked to a S2ORC paper. if so, grab paragraph
            linked_paper_id = cited_bib_entry['link']
            if linked_paper_id:
                citation_contexts.append({
                    'citing_paper_id': paper_id,
                    'cited_paper_id': linked_paper_id,
                    'context': paragraph['text'],
                    'citation_mention_start': cite_span['start'],
                    'citation_mention_end': cite_span['end'],
                })
```
> @shizhediao It looks like you already requested download access to S2ORC. Are you looking for the script for converting that into the format for pretraining?
If so: I have actually already checked this example, and I tried to filter the dataset down to a pretraining corpus by adding these conditions:
if not ("Medicine" in metadata_dict['mag_field_of_study'] and "Biology" in metadata_dict['mag_field_of_study']):
continue
if metadata_dict["has_pdf_parse"] == False:
continue
if metadata_dict["has_pdf_parse"] == True and metadata_dict["has_pdf_parsed_body_text"] ==False:
continue
Am I correct?
That's right. And if you wanted the computer science subset, you can switch the `mag_field_of_study` tag to `"Computer Science"`.
As for the `has_pdf_parsed_body_text` tag, I think it's fine to include papers that don't have body text as long as they also have abstracts.
OK, thanks!
One last question: does BioMed mean that "Biology" and "Medicine" are both in the `mag_field_of_study` at the same time?
That is, does it need to be "bio and med" rather than "bio or med"?
Just want to make sure.
The `mag_field_of_study` field is a list of strings that looks like: `"mag_field_of_study": ["Biology", "Medicine"]`
A paper can be Bio or Medicine or both.
For the pretraining experiment, we allowed papers that were Bio-only, Medicine-only, or both.
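Putting this together with the earlier note about abstracts, a hedged sketch of the metadata filter might look like the following (`keep_for_biomed` is just an illustrative helper name; the exact selection script used for the paper isn't shown in this thread):

```python
def keep_for_biomed(metadata_dict):
    """Keep papers that are Biology-only, Medicine-only, or both."""
    fields = metadata_dict.get("mag_field_of_study") or []
    if "Biology" not in fields and "Medicine" not in fields:
        return False
    if not metadata_dict.get("has_pdf_parse", False):
        return False
    # Papers without parsed body text are still kept here, since abstracts alone
    # are fine per the comment above; the abstract check happens when the pdf
    # parse itself is read.
    return True
```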
Got it.
Thanks!
Hi Kyle,
May I ask whether any extra data cleaning steps are performed?
Or do you just use the raw text from S2ORC?
We pulled text from the `abstract` and `body_text` fields, when available, and preserved paragraph breaks. That is, we didn't allow sequences to be built up consisting of sentences from separate paragraphs. The `abstract` is its own paragraph, and the `body_text` should be divided into paragraphs. Since there are paragraphs longer than the max sequence length for RoBERTa, we used ScispaCy to pre-sentencize everything so it was a single sentence per line. This made it possible to use the `--line_by_line` flag for Huggingface.
We didn't do any other processing besides this. RoBERTa pretraining is pretty robust to even really poorly-formatted text parsed from paper PDFs.
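A hedged sketch of that sentencization step, assuming `pdf_parse` is an S2ORC PDF-parse dict like the one quoted earlier in this thread; `paper_to_lines` is just an illustrative name, not the script actually used for the paper:

```python
import spacy

# ScispaCy model; requires installing scispacy plus the en_core_sci_sm model.
nlp = spacy.load("en_core_sci_sm")

def paper_to_lines(pdf_parse):
    """Split abstract + body_text paragraphs so each output line is one sentence."""
    lines = []
    paragraphs = pdf_parse.get("abstract", []) + pdf_parse.get("body_text", [])
    for paragraph in paragraphs:
        for sentence in nlp(paragraph["text"]).sents:
            lines.append(sentence.text.strip())
    return lines
```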
OK, thanks!
Have you run into RAM MemoryError problems?
Because there is no lazy data loader in the Huggingface library right now, my 128 GB of RAM cannot hold all 48 GB of data.
Could you provide any hints on how to deal with such a large dataset?
Thanks!
> Since there are paragraphs longer than the max sequence length for RoBERTa, we used ScispaCy to pre-sentencize everything so it was a single sentence per line
By the way, I don't quite understand what that means. I have checked the ScispaCy repo but couldn't find anything related to this operation.
Could you provide an example and point out which ScispaCy function performs this operation?
For example, if the paragraph is
"Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR). SBMA can be caused by this easily."
what output do you want to get via ScispaCy, and how?
Sentencization means the output will look like:

```
# first sentence
"Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR)."
# second sentence
"SBMA can be caused by this easily."
```

The code looks something like:

```python
import spacy

nlp = spacy.load('en_core_sci_sm')
text = "..."
for sentence in nlp(text).sents:
    # do something with sentence.text
    print(sentence.text)
```
As for debugging your memory issues with Huggingface code, that might be best handled by opening an Issue on their library's GitHub repo.
Thanks!
So you treat a sentence as the unit, right?
After sentencization, every sentence will be on its own line. Are there blank lines between sentences from different paragraphs?
> ...when available, and preserved paragraph breaks. That is, we didn't allow sequences to be built up consisting of sentences from separate paragraphs.
Hi Kyle,
I didn't get why you preserve paragraph breaks, or how you do it.
Why: in my understanding, RoBERTa does not use the NSP task, so I think we do not need to preserve paragraph breaks.
How: from your comments, I understand that you split a paragraph into lines, with each line being one sentence from that paragraph. I was wondering whether there is a blank line between sentences from different paragraphs? In my understanding, if we want to preserve the breaks, we need to add a blank line?
Thanks!
OK, got it!
Thanks for your time and patience.
I agree with you that it's a very minor detail; now I fully understand it.
Thanks again!
Hi @kyleclo,
I was wondering, will you use the unlabelled validation/test sets in TAPT?
@shizhediao, can you create a new issue for this? It makes it easier for others to search for answered questions. Thanks!
Yes, sure. Thanks for pointing that out; here is the link:
#17