Predicting Emotion from Conversation Text: Competition Practice (1)

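I made a little progress on the emotion recognition competition I posted about last time.

I still have a long way to go, but even a little progress counts for something...

The baseline code provided by the organizers (class statements are genuinely hard)

was difficult for a Python beginner like me to follow,

so I put together a new baseline by googling.

goorm (구름) happened to provide Google Colab Pro+ for the team project,

so I worked in Google Colab.

(Most of the packages come preinstalled, too.)

I installed the transformers library and the other packages

that Colab does not ship with.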

from google.colab import drive
drive.mount('/content/mydrive')

!pip install transformers

import pandas as pd
import numpy as np
import random
import time
import datetime

import torch
from torch.optim import AdamW  # transformers' own AdamW is deprecated in favor of the PyTorch one
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

from transformers import BertTokenizer
from transformers import BertForSequenceClassification, BertConfig
from transformers import get_linear_schedule_with_warmup
from sklearn.model_selection import train_test_split

try:
    from keras.utils import pad_sequences  # newer Keras versions
except ImportError:
    from keras.preprocessing.sequence import pad_sequences  # older Keras versions
 
● Loading the data

# The test data has no labels, so it cannot be used to measure model performance and is not loaded.

tr = pd.read_csv('/content/mydrive/MyDrive/competitions/kerc/train_data.tsv',delimiter='\t')
label1 = pd.read_csv('/content/mydrive/MyDrive/competitions/kerc/train_labels.csv')
#test = pd.read_csv('/content/mydrive/MyDrive/competitions/kerc/public_test_data.tsv',delimiter='\t')
 
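# Merge the train data with the labels
tr = pd.concat([tr, label1], axis=1)
# Drop the column at index 5
tr.drop([tr.columns[5]], axis=1, inplace=True)
# Encode the emotion labels as integers (only the label column is touched)
tr['label'] = tr['label'].map({'dysphoria': 0, 'neutral': 1, 'euphoria': 2})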
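
The mapping above only goes from label name to integer. When the model later returns integer predictions, an inverse mapping is handy for reading them. A minimal sketch, where `LABEL2ID`, `ID2LABEL`, `encode_labels` and `decode_labels` are hypothetical helper names that do not appear in the original notebook:

```python
# The three label names come from the competition data; the helper names are illustrative.
LABEL2ID = {'dysphoria': 0, 'neutral': 1, 'euphoria': 2}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}

def encode_labels(names):
    """Map label strings to integer class ids."""
    return [LABEL2ID[name] for name in names]

def decode_labels(ids):
    """Map integer class ids back to label strings."""
    return [ID2LABEL[int(i)] for i in ids]

encoded = encode_labels(['neutral', 'euphoria', 'dysphoria'])
decoded = decode_labels(encoded)
print(encoded, decoded)
```

Keeping both dictionaries in one place means the training code and any later submission code cannot drift apart.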

# To measure performance, randomly split the provided train data 7:3 into train and test sets.

train = tr.sample(frac=0.70, random_state=2022)
test = tr.drop(train.index)
 
● Preprocessing and tokenization

sentence_bert = ["[CLS] " + str(s) + " [SEP]" for s in train.sentence]
sentence_bert[:10]

['[CLS] 어떡해? 올에이야. 장학생. [SEP]',
 '[CLS] 엄마! 가지마. 엄마! [SEP]',
 '[CLS] 그래도 내가 형님이지. [SEP]',
 '[CLS] 너넨 예의도 없니? 내가 그렇게 신경쓰는줄 알면 조심하는 시늉이라도 해줘야지. 어쩜 이렇게... 사람을.. 지옥으로 몰고가니? [SEP]',
 '[CLS] 수고하십시요. [SEP]',
 '[CLS] 그러게요. 전화도 안받아요. [SEP]',
 '[CLS] 터미널에 간보러 갔는데 이놈이 그냥 날 보자마자 내빼길래 바로 잡아왔습니다. [SEP]',
 '[CLS] 그럼 여기 동서후보 또있어? 말귀 못알아 듣기는. 대학 나온거 맞어? [SEP]',
 '[CLS] 여자 손커봤자 살림만 말아먹는다면서요. [SEP]',
 '[CLS] 입만 까졌어. [SEP]']
 

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=False)
tokenized_texts = [tokenizer.tokenize(s) for s in sentence_bert]
print(tokenized_texts[0])

['[CLS]', '어', '##떡', '##해', '?', '올', '##에', '##이', '##야', '.', '장', '##학', '##생', '.', '[SEP]']
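
The `##` prefixes in the output above mark pieces that continue a word. BERT's WordPiece tokenizer splits each word greedily, always taking the longest piece found in its vocabulary. Here is a rough pure-Python sketch of that idea; `wordpiece` and `toy_vocab` are hypothetical stand-ins, not the real tokenizer or its vocabulary:

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces get a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter candidate
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Tiny made-up vocabulary, just enough to reproduce two words from the output above.
toy_vocab = {'어', '##떡', '##해', '장', '##학', '##생'}
print(wordpiece('어떡해', toy_vocab))
print(wordpiece('장학생', toy_vocab))
print(wordpiece('엄마', toy_vocab))
```

The multilingual vocabulary contains few whole Korean words, which is presumably why the sentences above break down almost to single syllables.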
 
● Padding and attention masks

MAX_LEN = 128
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype='long', truncating='post', padding='post')
input_ids[0]

array([   101,   9546, 118834,  14523,    136,   9583,  10530,  10739,
        21711,    119,   9657,  23321,  24017,    119,    102,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0])

attention_masks = []
 
for seq in input_ids:
    seq_mask = [float(i>0) for i in seq]
    attention_masks.append(seq_mask)
    
print(attention_masks[0])

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
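
To make the two steps above concrete, here is a dependency-free sketch of post-truncation, post-padding and mask building for a single sequence. `pad_and_mask` is a hypothetical helper, not part of Keras or transformers, and it assumes the padding id is 0 as in the output above:

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Post-truncate and post-pad one sequence, and build its attention mask."""
    ids = list(token_ids)[:max_len]   # same effect as truncating='post'
    mask = [1.0] * len(ids)           # 1.0 marks a real token
    n_pad = max_len - len(ids)
    ids = ids + [pad_id] * n_pad      # same effect as padding='post'
    mask = mask + [0.0] * n_pad       # 0.0 marks padding
    return ids, mask

ids, mask = pad_and_mask([101, 9546, 119, 102], max_len=6)
print(ids)
print(mask)
```

Building the mask from the sequence length rather than from `i > 0` gives the same result here, and it would keep working with a tokenizer whose padding id is not 0.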
 

● Splitting into train and validation sets
 
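# The test set has already been split off, so only train/validation is split here
train_inputs, validation_inputs, train_labels, validation_labels = \
    train_test_split(input_ids, train['label'].values, random_state=42, test_size=0.1)

# Same random_state and test_size as above, so the masks stay aligned with the inputs
train_masks, validation_masks, _, _ = train_test_split(attention_masks,
                                                       input_ids,
                                                       random_state=42,
                                                       test_size=0.1)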
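
The two calls above line up only because they share `random_state` and `test_size`. One alternative is to shuffle the row indices once and reuse them for every array. A minimal sketch in plain Python, where `split_indices` is a hypothetical helper (it does not reproduce scikit-learn's shuffling) and the data are toy stand-ins:

```python
import random

def split_indices(n_samples, test_size=0.1, seed=42):
    """Shuffle the indices once and cut them into train / validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(round(n_samples * test_size)))
    return indices[n_val:], indices[:n_val]

# Toy stand-ins for input_ids, attention_masks and the labels.
inputs = [[101, 5, 102], [101, 6, 102], [101, 7, 102], [101, 8, 102], [101, 9, 102]]
masks = [[1.0, 1.0, 1.0] for _ in inputs]
labels = [0, 1, 2, 1, 0]

train_idx, val_idx = split_indices(len(inputs), test_size=0.2)
toy_train = [(inputs[i], masks[i], labels[i]) for i in train_idx]
toy_val = [(inputs[i], masks[i], labels[i]) for i in val_idx]
print(len(toy_train), len(toy_val))
```

Because each row's input, mask and label are picked with the same index, they cannot get out of step even if the split settings change later.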

● Converting to tensors and setting up batches

# Convert to PyTorch tensors
train_inputs = torch.tensor(train_inputs)
train_labels = torch.tensor(train_labels)
train_masks = torch.tensor(train_masks)
validation_inputs = torch.tensor(validation_inputs)
validation_labels = torch.tensor(validation_labels)
validation_masks = torch.tensor(validation_masks)

# Pick a batch size that fits in GPU memory
BATCH_SIZE = 32

# Training batches are drawn in random order
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)

# Validation batches are read in order
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=BATCH_SIZE)

● Preprocessing the test set

# Apply the same preprocessing to the held-out test split
sentences = test['sentence']
sentences = ["[CLS] " + str(sentence) + " [SEP]" for sentence in sentences]
labels = test['label'].values

tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]

input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

# Attention mask: 1.0 for real tokens, 0.0 for padding
attention_masks = []
for seq in input_ids:
    seq_mask = [float(i>0) for i in seq]
    attention_masks.append(seq_mask)

test_inputs = torch.tensor(input_ids)
test_labels = torch.tensor(labels)
test_masks = torch.tensor(attention_masks)

test_data = TensorDataset(test_inputs, test_masks, test_labels)
# Shuffling is not needed for evaluation, so read the test set in order
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=BATCH_SIZE)

● Creating the BERT model and setting hyperparameters

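# num_labels is the number of classes (using BertForSequenceClassification)
model = BertForSequenceClassification.from_pretrained("bert-base-multilingual-cased", num_labels=3)
# Use the GPU when available; `device` is reused by the training and evaluation loops
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)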

# Set up the optimizer
optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # learning rate
                  eps = 1e-8 # epsilon that guards against division by zero
                )

epochs = 4

# One optimizer step per batch, for every epoch
total_steps = len(train_dataloader) * epochs

# Scheduler that lowers the learning rate a little at every step
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0,
                                            num_training_steps = total_steps)
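
To see what this schedule does to the learning rate, the same rule can be written out by hand: ramp up linearly during warmup, then decay linearly to zero at the last step. The sketch below is a plain-Python approximation. `total_training_steps` and `linear_lr` are hypothetical helpers, and the sample count of 1000 is made up for illustration:

```python
import math

def total_training_steps(n_samples, batch_size, epochs):
    """Number of optimizer steps: one per batch, for every epoch."""
    return math.ceil(n_samples / batch_size) * epochs

def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step` under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Hypothetical dataset size; batch size, epochs and base_lr follow the settings above.
steps = total_training_steps(n_samples=1000, batch_size=32, epochs=4)
for s in (0, steps // 2, steps):
    print(s, linear_lr(s, base_lr=2e-5, warmup_steps=0, total_steps=steps))
```

With `num_warmup_steps = 0` the rate starts at the full 2e-5 and simply falls in a straight line, so the last epoch trains with a much smaller step than the first.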

● Training

# Accuracy function
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Time formatting function
def format_time(elapsed):
    # Round to the nearest second
    elapsed_rounded = int(round((elapsed)))
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

# Fix the random seeds for reproducibility
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Reset the gradients
model.zero_grad()
 
# Repeat for the number of epochs
for epoch_i in range(0, epochs):

    # ========================================
    #               Training
    # ========================================

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Record the start time
    t0 = time.time()

    # Reset the loss
    total_loss = 0

    # Switch to training mode
    model.train()

    # Fetch one batch at a time from the dataloader
    for step, batch in enumerate(train_dataloader):
        # Show progress
        if step % 500 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Move the batch to the GPU
        batch = tuple(t.to(device) for t in batch)

        # Unpack the batch
        b_input_ids, b_input_mask, b_labels = batch

        # Forward pass
        outputs = model(b_input_ids,
                        token_type_ids=None,
                        attention_mask=b_input_mask,
                        labels=b_labels)

        # Get the loss
        loss = outputs[0]

        # Accumulate the total loss
        total_loss += loss.item()

        # Backward pass to compute the gradients
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update the weights using the gradients
        optimizer.step()

        # Decay the learning rate with the scheduler
        scheduler.step()

        # Reset the gradients
        model.zero_grad()

    # Compute the average loss
    avg_train_loss = total_loss / len(train_dataloader)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(format_time(time.time() - t0)))
        
    # ========================================
    #               Validation
    # ========================================

    print("")
    print("Running Validation...")

    # Record the start time
    t0 = time.time()

    # Switch to evaluation mode
    model.eval()

    # Reset the counters
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    # Fetch one batch at a time from the dataloader
    for batch in validation_dataloader:
        # Move the batch to the GPU
        batch = tuple(t.to(device) for t in batch)

        # Unpack the batch
        b_input_ids, b_input_mask, b_labels = batch

        # No gradient computation
        with torch.no_grad():
            # Forward pass
            outputs = model(b_input_ids,
                            token_type_ids=None,
                            attention_mask=b_input_mask)

        # Get the logits (no labels were passed, so outputs[0] holds the logits)
        logits = outputs[0]

        # Move the data to the CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Compare the output logits with the labels to compute accuracy
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1

    print("  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
    print("  Validation took: {:}".format(format_time(time.time() - t0)))

print("")
print("Training complete!")

● Evaluating on the test set

# Record the start time
t0 = time.time()

# Switch to evaluation mode
model.eval()

# Reset the counters
eval_loss, eval_accuracy = 0, 0
nb_eval_steps, nb_eval_examples = 0, 0

# Fetch one batch at a time from the dataloader
for step, batch in enumerate(test_dataloader):
    # Show progress
    if step % 100 == 0 and not step == 0:
        elapsed = format_time(time.time() - t0)
        print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(test_dataloader), elapsed))

    # Move the batch to the GPU
    batch = tuple(t.to(device) for t in batch)

    # Unpack the batch
    b_input_ids, b_input_mask, b_labels = batch

    # No gradient computation
    with torch.no_grad():
        # Forward pass
        outputs = model(b_input_ids,
                        token_type_ids=None,
                        attention_mask=b_input_mask)

    # Get the logits (no labels were passed, so outputs[0] holds the logits)
    logits = outputs[0]

    # Move the data to the CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    # Compare the output logits with the labels to compute accuracy
    tmp_eval_accuracy = flat_accuracy(logits, label_ids)
    eval_accuracy += tmp_eval_accuracy
    nb_eval_steps += 1

print("")
print("Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
print("Test took: {:}".format(format_time(time.time() - t0)))
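
One detail worth knowing: the loops above average the accuracy of each batch, so a smaller final batch counts as much as a full one. Counting correct predictions over all examples avoids that. A small plain-Python sketch with made-up predictions, where both function names are hypothetical:

```python
def mean_of_batch_accuracies(batches):
    """Average the per-batch accuracies, as the evaluation loops above do."""
    accs = [sum(p == y for p, y in zip(preds, labels)) / len(labels)
            for preds, labels in batches]
    return sum(accs) / len(accs)

def overall_accuracy(batches):
    """Count correct predictions over all examples, then divide once."""
    correct = sum(p == y for preds, labels in batches for p, y in zip(preds, labels))
    total = sum(len(labels) for _, labels in batches)
    return correct / total

# Toy (predictions, labels) pairs: a full batch of 4 with 3 correct,
# then a final batch of 1 with 0 correct.
batches = [([0, 1, 2, 1], [0, 1, 2, 0]), ([2], [1])]
print(mean_of_batch_accuracies(batches))
print(overall_accuracy(batches))
```

With thousands of test sentences and a batch size of 32 the gap is usually tiny, so this probably would not change the reported score by much.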

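Evaluation results

The accuracy came out to 0.64.

This is a bare-bones baseline: the preprocessing does nothing beyond adding the special separator tokens,

and I did not get to tune any hyperparameters, such as the number of epochs or the batch size.

Trying other models such as KoBERT could also lead to better performance.

That's for later!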
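
As a note for that later round, one simple way to organize the tuning is to list the candidate settings as a grid and train once per combination. A minimal sketch: every value below is a hypothetical candidate that was not tried in this post, and `iter_configs` is an illustrative helper:

```python
from itertools import product

# Hypothetical candidate values; none of these were tried in this post.
search_space = {
    'epochs': [3, 4],
    'batch_size': [16, 32],
    'learning_rate': [2e-5, 3e-5, 5e-5],
}

def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(search_space))
print(len(configs), configs[0])
```

Each config would then be passed to a training run and compared on the validation accuracy, keeping the held-out test split for the final check only.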

Reference :

bert_naver_movie colab ipynb

yonghee.io
