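[PyTorch] Various Learning Rate Schedulers

Deep learning models are trained by reducing the loss with gradient descent, and the learning rate has to be set as part of this process. If the learning rate is too large, training may fail to converge to the global minimum; if it is too small, convergence is slow and training can get stuck in a local minimum. For this reason, methods that vary the learning rate during training, rather than keeping it fixed, are an active research topic.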

0. Library

import torch
import torch.nn as nn
from transformers import get_cosine_with_hard_restarts_schedule_with_warmup
import matplotlib.pyplot as plt

def get_lr(optimizer):
    # Return the current learning rate of the first parameter group.
    for param_group in optimizer.param_groups:
        return param_group['lr']

1. StepLR

Multiplies the learning rate by gamma every fixed number of steps (step_size).

model = nn.Linear(2, 1)
optimizer = torch.optim.AdamW(model.parameters())
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)
lr = []

for i in range(1000):
    optimizer.step()
    lr.append(get_lr(optimizer))
    scheduler.step()

plt.plot(range(1000),lr)
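
The staircase that the plot above should show can also be written in closed form. The following is a minimal pure-Python sketch, not PyTorch's implementation; the helper name step_lr is hypothetical, and base_lr=1e-3 is assumed because that is the default learning rate of torch.optim.AdamW.

```python
def step_lr(base_lr, step, step_size, gamma):
    """Closed form of a StepLR-style schedule after `step` scheduler steps."""
    return base_lr * gamma ** (step // step_size)


# Assumption: base_lr=1e-3, the default learning rate of torch.optim.AdamW.
lrs = [step_lr(1e-3, i, step_size=200, gamma=0.5) for i in range(1000)]

# The rate is halved each time the step count crosses a multiple of 200.
for i in (0, 199, 200, 999):
    print(i, lrs[i])
```

With step_size=200 and gamma=0.5, 1000 steps contain four halvings, so the last value is one sixteenth of the initial rate.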

2. ExponentialLR

Decays the learning rate exponentially, multiplying it by gamma at every step.

scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)
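
To get a feel for how fast gamma=0.99 decays, the schedule can be written as base_lr * gamma ** step. The sketch below is pure Python and only illustrates the formula; the helper names exponential_lr and steps_to_halve are hypothetical, and base_lr=1e-3 is an assumed starting value.

```python
import math


def exponential_lr(base_lr, step, gamma):
    """Closed form of an exponentially decaying learning rate."""
    return base_lr * gamma ** step


def steps_to_halve(gamma):
    """Number of steps (possibly fractional) until the rate is halved."""
    return math.log(0.5) / math.log(gamma)


half_life = steps_to_halve(0.99)
print(f"gamma=0.99 halves the rate roughly every {half_life:.1f} steps")
print(exponential_lr(1e-3, 100, 0.99))
```

A gamma that looks close to 1 still shrinks the rate quickly when the scheduler is stepped every iteration, so it helps to pick gamma based on how often scheduler.step() is called.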

3. ReduceLROnPlateau

Reduces the learning rate when the validation metric stops improving.

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min')
for epoch in range(10):
    train(...)
    val_loss = validate(...)
    # Note that step should be called after validate()
    scheduler.step(val_loss)
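
The plateau logic can be illustrated without a model. The class below is a simplified pure-Python sketch of the idea for mode 'min', not PyTorch's implementation: it ignores cooldown, min_lr and the other threshold modes, and the class name PlateauSketch and the loss values are made up for illustration. The defaults factor=0.1, patience=10 and threshold=1e-4 are chosen to match those of ReduceLROnPlateau.

```python
class PlateauSketch:
    """Simplified reduce-on-plateau rule for a metric that should decrease."""

    def __init__(self, lr, factor=0.1, patience=10, threshold=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.threshold = threshold
        self.best = float("inf")
        self.num_bad_epochs = 0

    def step(self, metric):
        # An epoch counts as an improvement only if it beats the best
        # value by a relative margin of `threshold`.
        if metric < self.best * (1 - self.threshold):
            self.best = metric
            self.num_bad_epochs = 0
        else:
            self.num_bad_epochs += 1
        # Reduce the rate once more than `patience` epochs were bad.
        if self.num_bad_epochs > self.patience:
            self.lr *= self.factor
            self.num_bad_epochs = 0
        return self.lr


# Hypothetical validation losses: improve for three epochs, then stall.
val_losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7]
sched = PlateauSketch(lr=1e-3, patience=2)
history = [sched.step(loss) for loss in val_losses]
print(history)
```

With patience=2, the rate is cut on the third consecutive epoch without improvement, which is why the step has to come after validation.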

4. CosineAnnealingLR

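Varies the learning rate along a cosine curve.

- T_max: half the period of the cosine (the number of steps from the maximum to the minimum)

- eta_min: the minimum value of the learning rate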

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=0.0001)
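
The roles of T_max and eta_min are easiest to see in the closed-form cosine expression. This is a pure-Python sketch of that formula rather than PyTorch's code; the helper name cosine_annealing_lr is hypothetical, and base_lr=1e-3 is assumed (the AdamW default).

```python
import math


def cosine_annealing_lr(base_lr, step, t_max, eta_min):
    """Closed-form cosine annealing between base_lr and eta_min."""
    cos_term = 1 + math.cos(math.pi * step / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * cos_term


# Assumption: base_lr=1e-3; T_max and eta_min follow the call above.
for step in (0, 50, 100, 200):
    print(step, cosine_annealing_lr(1e-3, step, t_max=100, eta_min=1e-4))
```

In this closed form the rate reaches eta_min at step T_max and climbs back to the initial value at step 2 * T_max, which is why T_max is described as half a period.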

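For transformers in particular, warm-up is commonly combined with cosine annealing, as described in the GPT papers (the original GPT paper increases the learning rate linearly and then anneals it with a cosine schedule).

Warm-up is said to help the freshly initialized weights settle stably in the early phase of training.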
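
A warm-up followed by cosine decay can be expressed as a multiplier on the base learning rate. The sketch below is pure Python and rests on assumptions: the function name warmup_cosine_multiplier and the step counts are hypothetical, and the linear warm-up shape is one common choice rather than the only one. A function of this form could be passed as lr_lambda to torch.optim.lr_scheduler.LambdaLR.

```python
import math


def warmup_cosine_multiplier(step, warmup_steps, total_steps):
    """Multiplier in [0, 1]: linear warm-up, then a single cosine decay."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1 + math.cos(math.pi * progress)))


# Hypothetical setup: 10 warm-up steps out of 100 training steps.
for step in (0, 5, 10, 55, 100):
    print(step, round(warmup_cosine_multiplier(step, 10, 100), 4))
```

The multiplier starts at zero, so the first updates are small while the weights are still near their random initialization.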

5. CosineAnnealingWarmRestarts

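After warm-up, the learning rate decreases along a cosine curve, and at the end of each cycle a hard restart resets it to the initial value.

PyTorch does not provide a scheduler that combines warm-up with restarts, so the one from transformers is used here.

Warm-up is applied only once, at the very beginning.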

from transformers import get_cosine_with_hard_restarts_schedule_with_warmup

scheduler = get_cosine_with_hard_restarts_schedule_with_warmup(optimizer, num_warmup_steps=10, num_training_steps=100, num_cycles=3)
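
To see what num_warmup_steps=10, num_training_steps=100 and num_cycles=3 produce, the shape of the schedule can be sketched as a multiplier on the base learning rate. This is a pure-Python sketch written under the assumption that the schedule is a linear warm-up followed by num_cycles cosine decays of equal length; it is not the transformers source, and the function name hard_restart_multiplier is hypothetical.

```python
import math


def hard_restart_multiplier(step, warmup_steps, total_steps, num_cycles):
    """Linear warm-up once, then `num_cycles` cosine decays with restarts."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Position inside the current cycle, in [0, 1).
    cycle_pos = (num_cycles * progress) % 1.0
    return max(0.0, 0.5 * (1 + math.cos(math.pi * cycle_pos)))


mult = [hard_restart_multiplier(s, 10, 100, 3) for s in range(101)]

# A restart shows up as a sudden jump back towards 1 after the warm-up.
restarts = [s for s in range(11, 101) if mult[s] - mult[s - 1] > 0.5]
print(restarts)
```

The 90 post-warm-up steps are split into three cycles of 30 steps, so two restarts occur and no further warm-up is applied after the first one.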

6. SGDR, Stochastic Gradient Descent with Warm Restarts

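SGDR was proposed by Loshchilov and Hutter, and cosine learning-rate decay with warm-up is also discussed in Bag of Tricks for Image Classification with Convolutional Neural Networks. The scheduler below adds the ability to decay the maximum learning rate after each cycle.

It is available at https://github.com/katsura-jp/pytorch-cosine-annealing-with-warmup.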

CosineAnnealingWarmupRestarts(optimizer, first_cycle_steps=200, cycle_mult=1.0, max_lr=0.1, min_lr=0.001, warmup_steps=50, gamma=0.5)
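
The effect of gamma on the peak learning rate can be sketched in pure Python. The function below is an illustration written under assumptions, not the code from the repository above: it assumes cycle_mult=1.0 (fixed cycle length), a linear warm-up from min_lr at the start of every cycle, and a peak that is multiplied by gamma after each cycle. The function name decaying_restart_lr is hypothetical.

```python
import math


def decaying_restart_lr(step, cycle_steps, max_lr, min_lr, warmup_steps, gamma):
    """Cosine cycles with per-cycle warm-up and a peak that decays by gamma."""
    cycle = step // cycle_steps          # index of the current cycle
    pos = step % cycle_steps             # position inside the cycle
    peak = max_lr * gamma ** cycle       # assumed: peak shrinks every cycle
    if pos < warmup_steps:
        return min_lr + (peak - min_lr) * pos / warmup_steps
    decay_pos = (pos - warmup_steps) / (cycle_steps - warmup_steps)
    return min_lr + 0.5 * (peak - min_lr) * (1 + math.cos(math.pi * decay_pos))


# Same numbers as the call above: 200-step cycles, 50 warm-up steps.
for step in (0, 50, 200, 250, 450):
    print(step, decaying_restart_lr(step, 200, 0.1, 0.001, 50, 0.5))
```

Under these assumptions the peak is max_lr in the first cycle, half of that in the second, and a quarter in the third, while every cycle starts again from min_lr.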

※ Personally, when training transformer-family models, I find that CosineAnnealingWarmRestarts tends to give reasonably good performance.
