Comments (12)

pvcastro commented on June 5, 2024

Ok, thanks!
I'll try testing something like this and will report back.

epwalsh commented on June 5, 2024

Hey @pvcastro, a couple questions:

  1. In all experiments (BERT-AllenNLP, RoBERTa-AllenNLP, BERT-transformers, RoBERTa-transformers) were you using the same optimizer?
  2. When you used transformers directly (for BERT-transformers and RoBERTa-transformers) was that a CRF model as well, or was that just using the (Ro|B)ertaForSequenceClassification models?

pvcastro commented on June 5, 2024

Hi @epwalsh , thanks for the feedback!

  1. Yes, I was using the huggingface_adamw optimizer.
  2. No, it wasn't an adaptation with CRF; I used the stock run_ner script from the HF examples. But I believe a CRF layer would only improve results, as it usually does with BERT models.

epwalsh commented on June 5, 2024

Gotcha! Oh yes, I meant BertForTokenClassification, not BertForSequenceClassification 🀦

So I think the most likely source for a bug would be in the PretrainedTransformerMismatched(Embedder|TokenIndexer). And any differences between BERT and RoBERTa would probably have to do with tokenization. See, for example:

def _estimate_character_indices(
    self, text: str, token_ids: List[int]
) -> List[Optional[Tuple[int, int]]]:
    """
    The huggingface tokenizers produce tokens that may or may not be slices from the
    original text. Differences arise from lowercasing, Unicode normalization, and other
    kinds of normalization, as well as special characters that are included to denote
    various situations, such as "##" in BERT for word pieces from the middle of a word, or
    "Δ " in RoBERTa for the beginning of words not at the start of a sentence.
    This code attempts to calculate character offsets while being tolerant to these
    differences. It scans through the text and the tokens in parallel, trying to match up
    positions in both. If it gets out of sync, it backs off to not adding any token
    indices, and attempts to catch back up afterwards. This procedure is approximate.
    Don't rely on precise results, especially in non-English languages that are far more
    affected by Unicode normalization.
    """

pvcastro commented on June 5, 2024

I was assuming that just running some unit tests from the AllenNLP repository, to confirm that these embedders/tokenizers produce the special tokens the RoBERTa architecture expects, would be enough to rule these out. I ran some tests using RoBERTa and confirmed that it's not relying on CLS. Was this too superficial to reach any conclusions?
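
For reference, a minimal sketch (my own, not the actual AllenNLP unit tests) of that kind of check, using the PretrainedTransformerTokenizer that the mismatched indexer wraps; the sentence is an arbitrary example:

from allennlp.data.tokenizers import PretrainedTransformerTokenizer

for model_name in ["bert-base-cased", "roberta-base"]:
    tokenizer = PretrainedTransformerTokenizer(model_name=model_name)
    tokens = tokenizer.tokenize("John lives in Berlin .")
    # BERT sentences should be wrapped in [CLS] ... [SEP],
    # RoBERTa sentences in <s> ... </s>.
    print(model_name, [t.text for t in tokens])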

epwalsh commented on June 5, 2024

I'm not sure. I mean, I thought we did have pretty good test coverage there, but I know for a fact that's one of the most brittle pieces of code in the whole library. It would break all of the time with new releases of transformers. So that's my best guess.

pvcastro commented on June 5, 2024

Do you think it makes sense for me to run additional tests for the embedder, comparing embeddings produced by a raw RobertaModel with those from the actual PretrainedTransformerMismatchedEmbedder, to see if they are somehow getting "corrupted" in the framework?
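
That comparison might look roughly like the sketch below. It is only a sketch under a few assumptions: the sentence is arbitrary, the mismatched embedder is left at its default of averaging each word's word pieces, and add_prefix_space=True is passed on the transformers side because RoBERTa's fast tokenizer requires it for pre-tokenized input.

import torch
from transformers import AutoModel, AutoTokenizer

from allennlp.data.batch import Batch
from allennlp.data.fields import TextField
from allennlp.data.instance import Instance
from allennlp.data.token_indexers import PretrainedTransformerMismatchedIndexer
from allennlp.data.tokenizers import Token
from allennlp.data.vocabulary import Vocabulary
from allennlp.modules.token_embedders import PretrainedTransformerMismatchedEmbedder

model_name = "roberta-base"
words = ["Maria", "lives", "in", "Rio", "de", "Janeiro"]

# AllenNLP side: index the pre-tokenized words and run the mismatched embedder,
# which pools (by default, averages) the word pieces of each word.
indexer = PretrainedTransformerMismatchedIndexer(model_name=model_name)
field = TextField([Token(w) for w in words], {"tokens": indexer})
batch = Batch([Instance({"tokens": field})])
batch.index_instances(Vocabulary())
tensors = batch.as_tensor_dict()["tokens"]["tokens"]

embedder = PretrainedTransformerMismatchedEmbedder(model_name=model_name)
with torch.no_grad():
    per_word = embedder(**tensors)  # shape: (1, len(words), hidden_size)

# Raw side: run the plain RoBERTa model on the same pre-tokenized words.
hf_tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
hf_model = AutoModel.from_pretrained(model_name)
encoded = hf_tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    per_piece = hf_model(**encoded).last_hidden_state  # (1, num_pieces, hidden_size)

print(per_word.shape, per_piece.shape)
# To compare values directly, average per_piece over each word's pieces and
# check the result against per_word (e.g. with torch.allclose). If the two
# runs don't even agree on the word pieces, that in itself is a clue about
# where the RoBERTa pipeline diverges.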

epwalsh commented on June 5, 2024

I guess I would start by looking very closely at the exact tokens that are being used for each word by the PretrainedTransformerMismatchedEmbedder. Maybe pick out a couple test instances to check where the performance gap between the BERT and RoBERTa predictions is largest.
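
A rough sketch of that kind of inspection (my own; it assumes intra_word_tokenize, the method the mismatched indexer uses to split pre-tokenized words into word pieces, and an arbitrary example sentence):

from allennlp.data.tokenizers import PretrainedTransformerTokenizer

sentence = ["Brasilia", "is", "the", "capital", "of", "Brazil"]

for model_name in ["bert-base-cased", "roberta-base"]:
    tokenizer = PretrainedTransformerTokenizer(model_name=model_name)
    # intra_word_tokenize returns the full word-piece sequence (including the
    # special tokens) plus, for each input word, an inclusive (start, end)
    # span into that sequence, or None if the word produced no pieces.
    wordpieces, offsets = tokenizer.intra_word_tokenize(sentence)
    print(model_name)
    for word, span in zip(sentence, offsets):
        if span is None:
            continue
        start, end = span
        print(f"  {word!r} -> {[t.text for t in wordpieces[start:end + 1]]}")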

github-actions commented on June 5, 2024

This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread πŸ‘‡

github-actions commented on June 5, 2024

This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread πŸ‘‡

pvcastro commented on June 5, 2024

Sorry, I'll try to get back to this next week, haven't had the time yet 😞

epwalsh commented on June 5, 2024

No rush, I thought adding the "question" label would stop @github-actions bot from closing this, but I guess not.
