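Hello. This is my first time writing a GitHub issue. Thanks to you, I've been making good progress studying natural language processing. The reason I'm writing is that in Chapter 7 ber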

Chapter 7 bert_finetune_NER: getting an error I can't resolve :( (tensorflow-ml-nlp-tf2, CLOSED)

nlp-kr commented on August 22, 2024

Chapter 7 bert_finetune_NER: getting an error I can't resolve :(

from tensorflow-ml-nlp-tf2.

Comments (5)

Eom-taeseon commented on August 22, 2024

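I'm not sure whether anyone will answer, but.....
I'm posting again, clutching at straws.

First, I wondered whether np.array was failing because of the type of input_ids, attention_masks, and token_type_ids (hereafter "input"), so I checked whether they were lists.

All 80,000 samples were indeed of type list.

So, figuring that calling np.array was still the right approach, I called it directly inside the for loop, in the form inputs.append(np.array(input, dtype=int)).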

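def create_inputs_targets(df):
    # Assumption: the post does not show where these lists are created,
    # so they are initialized here to let the function run on its own.
    input_ids, attention_masks, token_type_ids, label_list = [], [], [], []

    for i, data in enumerate(df[['sentence', 'label']].values):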
        sentence, labels = data
        words = sentence.split()
        labels = labels.split()
        labels_idx = []
        
        for label in labels:
            labels_idx.append(ner_labels.index(label) if label in ner_labels else ner_labels.index("UNK"))

        assert len(words) == len(labels_idx)

        input_id, attention_mask, token_type_id = bert_tokenizer(sentence, MAX_LEN)
        convert_label_id = convert_label(words, labels_idx, ner_begin_label, MAX_LEN)

        label_list.append(convert_label_id)        

        # Run np.array directly inside the for loop.
        # Originally: tokenize inside the loop -> append to the input lists -> run np.array outside the loop, but that raised an error, so I tried it this way.
        input_ids.append(np.array(input_id, dtype=int)) 
        attention_masks.append(np.array(attention_mask, dtype=int))
        token_type_ids.append(np.array(token_type_id, dtype=int))
        
        if i <= 2: # I wanted to inspect the inputs directly for the first three samples (i = 0, 1, 2).
            print(i)
            print("input_id\n", input_id, "\n")
            print("attention_mask\n", attention_mask, "\n")
            print("token_type_id\n", token_type_id, "\n")
    
    '''
    train_input_ids = np.array(input_ids, dtype=int)
    train_attention_masks = np.array(attention_masks, dtype=int)
    train_token_type_ids = np.array(token_type_ids, dtype=int)
    '''
    train_label_list = np.asarray(label_list, dtype=int) # list of tokenized labels
    inputs = (input_ids, attention_masks, token_type_ids)
    
    print("input_ids\n", input_ids[0], "\n")
    print("attention_ids\n", attention_masks[0],"\n")
    print("token_type_ids\n", token_type_ids[0], "\n")
        
    return inputs, train_label_list

train_inputs, train_labels = create_inputs_targets(train_ner_df)
test_inputs, test_labels = create_inputs_targets(test_ner_df)

But the result is surprising.....

First, here are the BERT input values for three samples, fresh out of the tokenizer.

0
input_id
[101, 8928, 40958, 118617, 119196, 30085, 37712, 117, 8848, 12945, 15001, 35115, 48345, 119, 102]
attention_mask
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
token_type_id
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

1
input_id
[101, 9638, 12310, 108056, 9954, 118802, 9722, 10622, 9266, 11664, 10150, 37712, 10003, 119244, 9993, 17730, 83200, 9266, 11018, 27487, 9685, 85634, 16139, 119, 102]
attention_mask
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
token_type_id
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

2
input_id
[101, 128, 118, 129, 19855, 8881, 16605, 89326, 8935, 41693, 76036, 9294, 12605, 46150, 22695, 113, 9417, 34951, 26444, 25503, 28188, 114, 8843, 9735, 10892, 9341, 20479, 10622, 9032, 81220, 68495, 70122, 17196, 18471, 15891, 25347, 14423, 14863, 9768, 45465, 57952, 9668, 42815, 12490, 119, 102]
attention_mask
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
token_type_id
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

There was no padding at all.

I'm not certain, but I suspect this is what was causing the errors all along......
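
The symptom described above can be reproduced in isolation. The following is a minimal sketch, not the notebook's code: MAX_LEN = 8, the sample ids, and the pad_to_max_len helper are illustrative assumptions, and the pad id 0 is assumed to match the tokenizer's [PAD] token. It shows that np.array with dtype=int rejects rows of unequal length and accepts them once every row has the same length.

```python
import numpy as np

MAX_LEN = 8  # illustrative value; the thread does not state the real MAX_LEN
PAD_ID = 0   # assumed id of the [PAD] token


def pad_to_max_len(ids, max_len=MAX_LEN, pad_value=PAD_ID):
    """Truncate or right-pad a list of ints to exactly max_len items.

    Naive truncation can cut off the trailing [SEP]; a real tokenizer
    handles that case, this helper only illustrates the shape issue.
    """
    ids = list(ids)[:max_len]
    return ids + [pad_value] * (max_len - len(ids))


# Two unpadded rows of different length (shortened, illustrative ids).
ragged = [[101, 8928, 40958, 102], [101, 9638, 12310, 108056, 9954, 102]]

try:
    np.array(ragged, dtype=int)
    ragged_ok = True
except ValueError:
    # NumPy cannot build a rectangular int array from unequal rows.
    ragged_ok = False

padded = np.array([pad_to_max_len(ids) for ids in ragged], dtype=int)

print("ragged rows accepted:", ragged_ok)
print("padded shape:", padded.shape)
```

In the thread itself the padding is expected to come from the tokenizer, so a helper like this is mainly useful for confirming the diagnosis.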

For reference, with the inputs left unpadded like this, everything still runs through,
but the final training step also failed with a related error.

ValueError: Data cardinality is ambiguous:
x sizes: 15, 25, 46, ...

Because the x sizes all differ, ner_model.fit could not train.
And those sizes match the number of tokens in each BERT input.

So I suspect the problem comes from the BERT tokenizer function.....
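
The error reports sizes 15, 25, 46, which equal the token counts of the first three samples. One likely reading, offered as an assumption rather than something confirmed in this thread, is that Keras treated each unpadded array in the Python list as a separate input. A small check such as the following sketch (the helper names are hypothetical) can surface unequal lengths before calling ner_model.fit:

```python
from collections import Counter


def length_report(sequences):
    """Return a Counter mapping sequence length -> number of sequences."""
    return Counter(len(seq) for seq in sequences)


def assert_rectangular(name, sequences, expected_len):
    """Raise ValueError if any sequence is not exactly expected_len long."""
    unexpected = sorted(n for n in length_report(sequences) if n != expected_len)
    if unexpected:
        raise ValueError(
            f"{name}: expected every row to have length {expected_len}, "
            f"but also found lengths {unexpected}"
        )


# Rows with the same lengths as the three samples printed in the post.
sample_rows = [[1] * 15, [1] * 25, [1] * 46]
print(length_report(sample_rows))
```

Calling assert_rectangular("input_ids", input_ids, MAX_LEN) right after tokenizing would fail immediately instead of at training time.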

rainmaker712 commented on August 22, 2024

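Hello. Before looking into the issue you raised, I'd like to ask whether you set up your environment by installing from requirements.txt, as described in the package version section of the README.

There seem to have been related issues caused by version updates to the BERT tokenizer function.

If the same issue still occurs after setting up the environment, please leave a comment.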

Eom-taeseon commented on August 22, 2024

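Thank you for the reply!!
I've been a bit busy, so I was late checking :(

In the end, the compile problem has been resolved.

It doesn't seem to have been a tokenizer version update.
I'm not sure why, but when I used the bert_tokenizer function from the natural language text classification example, it ran without problems.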

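Actually, there was an instruction to change the padding setting inside encoded_plus in the bert_tokenizer() function to something like pad=True...? I think the error appeared after I made that change and ran it.

I haven't checked the code yet, but after re-running that part and the create_inputs_targets() function, it was resolved.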
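
For context, the Hugging Face method is named encode_plus, and padding is requested through its keyword arguments. The following is a hedged sketch, not the book's bert_tokenizer: the helper name bert_tokenizer_padded is hypothetical, and it assumes a tokenizer from transformers 3.0 or later, where padding="max_length" and truncation=True are accepted (older releases used pad_to_max_length=True instead). It also checks the returned lengths, on the assumption that an unrecognized padding argument might be ignored without an error.

```python
def bert_tokenizer_padded(tokenizer, sentence, max_len):
    """Encode one sentence to fixed-length BERT inputs (hypothetical helper).

    `tokenizer` is assumed to expose a Hugging Face style `encode_plus`.
    """
    encoded = tokenizer.encode_plus(
        sentence,
        add_special_tokens=True,     # add [CLS] and [SEP]
        max_length=max_len,
        padding="max_length",        # older releases: pad_to_max_length=True
        truncation=True,
        return_attention_mask=True,
        return_token_type_ids=True,
    )
    input_id = encoded["input_ids"]
    attention_mask = encoded["attention_mask"]
    token_type_id = encoded["token_type_ids"]

    # Fail early if the padding request was not honored.
    for name, seq in (
        ("input_ids", input_id),
        ("attention_mask", attention_mask),
        ("token_type_ids", token_type_id),
    ):
        if len(seq) != max_len:
            raise ValueError(f"{name} has length {len(seq)}, expected {max_len}")

    return input_id, attention_mask, token_type_id


# Usage with a real tokenizer (requires the transformers package;
# the model name is an assumption, not taken from this thread):
# from transformers import BertTokenizer
# tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
# ids, mask, types = bert_tokenizer_padded(tokenizer, "an example sentence", 64)
```

With a check like this, a tokenizer call that returns 15, 25, or 46 tokens instead of max_len would raise right at tokenization rather than later in ner_model.fit.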

For now I'll finish training the example, study what actually happened, and then comment again.
Thank you. ^^

rainmaker712 commented on August 22, 2024

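Glad to hear it's resolved! Errors caused by version differences in Hugging Face's transformers and the related tokenizers kept coming up, so we had tested with requirements.txt and Docker before giving that guidance.

If you have any further questions or issues, I'll be happy to answer!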

Taekyoon commented on August 22, 2024

This issue appears to be resolved, so I'll close it.
If it needs further discussion, please reopen this issue and continue here.
