How to calculate the memory requirement of BERT?

Asked on Stack Overflow by Lei Hao on December 30, 2021

I am curious about the memory usage of transformers.BertModel. I would like to use the pretrained model to transform text and save the output of the [CLS] token. No training, only inference.

My input to BERT is 511 tokens. With a batch size of 16, my code runs out of memory. The GPU has 32 GB of memory. My question is how to estimate the memory usage of BERT.
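
For reference, here is my rough attempt at an estimate for bert-base-uncased in fp32 (a back-of-the-envelope sketch, assuming the standard BERT-base sizes and that, under torch.no_grad(), roughly one layer's activations are live at a time; the real peak also depends on CUDA caching and fragmentation):

hidden, heads, interm = 768, 12, 3072    # standard BERT-base sizes
n_params = 110e6                         # ~110M parameters in bert-base-uncased
bytes_per_float = 4                      # fp32

def estimate_mb(batch, seq_len):
    weights = n_params * bytes_per_float
    # Working activations for roughly one layer during no_grad inference:
    # hidden states, FFN intermediate states, and the attention score matrices.
    hidden_act = batch * seq_len * hidden
    ffn_act = batch * seq_len * interm
    attn_scores = batch * heads * seq_len * seq_len
    working = (hidden_act + ffn_act + attn_scores) * bytes_per_float
    return (weights + working) / 1024**2

print(f"batch 16, seq 511: ~{estimate_mb(16, 511):.0f} MB")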

Strangely, another job with batch size 32 finished successfully with the same setup. My code is listed below.

import torch
from torch.utils.data import ConcatDataset, DataLoader, RandomSampler
from transformers import BertModel

# Create dataloader
bs = 16
train_comb = ConcatDataset([train_data, valid_data])
train_dl = DataLoader(train_comb, sampler=RandomSampler(train_comb), batch_size=bs)  # sample over the concatenated dataset

model = BertModel.from_pretrained('/my_dir/bert_base_uncased/',
                                  output_attentions=False,
                                  output_hidden_states=False)
model.cuda()
out_list = []
model.eval()
with torch.no_grad():
    for d in train_dl:
        d = [i.cuda() for i in d]       # d = [input_ids, attention_mask, token_type_ids, labels]
        inputs, labels = d[:3], d[3]    # input_ids has shape 16 x 511
        output = model(*inputs)[0][:, 0, :]
        out_list.append(output)

outputs = torch.cat(out_list)

Later I changed the for loop to the following:

with torch.no_grad():
    for d in train_dl:
        d = [i.cuda() for i in d[:3]]          # don't care about the labels
        out_list.append(model(*d)[0][:, 0, :]) # remove the intermediary variables
    del d

To summarize, my questions are:

  1. How can I estimate the memory usage of BERT? I want to use it to choose the batch size.
  2. My second job, with batch size 32, finished successfully. Is it because it has more padding?
  3. Are there any suggestions for improving the memory efficiency of my code?

One Answer

After some searching, it turns out the error was caused by appending the outputs, still on the GPU, to the list. With the following code, the error is gone.

with torch.no_grad():
    for d in train_dl:
        d = [i.cuda() for i in d[:3]]          
        out_list.append(model(*d)[0][:, 0, :].cpu())  # copy the [CLS] output to CPU so the GPU tensor can be freed
    del d

Without .cpu(), the memory keeps increasing:

Tensor size: torch.Size([4, 511]), Memory allocated: 418.7685546875MB
Tensor size: torch.Size([4, 768]), Memory allocated: 424.7568359375MB


Tensor size: torch.Size([4, 511]), Memory allocated: 424.7568359375MB
Tensor size: torch.Size([4, 768]), Memory allocated: 430.7451171875MB


Tensor size: torch.Size([4, 511]), Memory allocated: 430.7451171875MB
Tensor size: torch.Size([4, 768]), Memory allocated: 436.7333984375MB

With .cpu(), the memory doesn't change:

Tensor size: torch.Size([128, 511]), Memory allocated: 420.21875MB
Tensor size: torch.Size([128, 768]), Memory allocated: 420.21875MB


Tensor size: torch.Size([128, 511]), Memory allocated: 420.21875MB
Tensor size: torch.Size([128, 768]), Memory allocated: 420.21875MB


Tensor size: torch.Size([128, 511]), Memory allocated: 420.21875MB
Tensor size: torch.Size([128, 768]), Memory allocated: 420.21875MB

Answered by Lei Hao on December 30, 2021
