This snippet will cause memory usage to rise indefinitely:


Memory leak for large strings (tokenizers, open)

noamgai21 commented on July 20, 2024


Comments (8)

tomaarsen commented on July 20, 2024

Hello!

I am also experiencing a memory leak with these tokenizers when processing long sequences without any spaces. This has been reported as a memory leak in Sentence Transformers, and affects some of my users: UKPLab/sentence-transformers#1795

Reproduction

import random
import string
import time

import psutil
from tokenizers import Tokenizer

# Tokenizer used in this reproduction.
tokenizer = Tokenizer.from_pretrained('xlm-roberta-base')

def random_string(length: int) -> str:
    # Uppercase letters and digits only, so the string contains no spaces.
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

for iteration in range(99999999):
    start_t = time.time()
    # Encode a batch of 200 fresh random strings, 12345 characters each.
    tokenizer.encode_batch([random_string(12345) for _ in range(200)])
    # Resident set size of the current process, converted from bytes to MiB.
    memory_usage_in_MiB = psutil.Process().memory_info().rss / (1024 * 1024)
    delta_t = time.time() - start_t
    print(f"{iteration:02d}: {memory_usage_in_MiB:.2f}MB, {delta_t:.2f}s")

Outputs

00: 353.12MB, 0.35s
01: 421.64MB, 0.51s
02: 492.77MB, 0.68s
03: 571.88MB, 0.93s
04: 623.66MB, 1.02s
05: 710.28MB, 1.35s
06: 803.41MB, 1.31s
07: 859.77MB, 1.43s
08: 912.55MB, 1.69s
09: 1014.13MB, 1.78s
10: 1081.04MB, 1.95s
11: 1133.04MB, 2.29s
12: 1208.43MB, 2.56s
13: 1413.81MB, 2.65s
14: 1495.07MB, 2.83s
15: 1575.66MB, 3.00s
16: 1646.78MB, 3.19s
17: 1720.24MB, 3.57s
18: 1793.95MB, 3.82s
19: 1862.75MB, 4.02s
20: 1939.91MB, 4.21s
21: 2008.09MB, 4.71s
22: 2084.01MB, 5.04s
23: 2157.63MB, 5.26s
24: 2228.05MB, 5.56s
25: 2304.84MB, 6.13s
26: 2374.40MB, 6.50s
27: 2445.36MB, 6.68s
28: 2517.31MB, 7.38s
29: 2590.93MB, 7.91s
30: 2432.09MB, 8.19s
31: 2645.64MB, 8.56s
32: 2720.85MB, 8.81s
33: 2801.12MB, 9.73s
34: 2874.08MB, 10.14s
35: 2949.19MB, 11.18s
36: 3017.41MB, 11.28s
37: 3094.99MB, 12.76s
38: 3164.58MB, 14.09s
39: 3232.37MB, 13.26s
40: 3309.48MB, 15.10s

This is rather severe: memory usage grows from about 353 MB to over 3.3 GB across these 41 iterations, and tokenization also becomes much slower, going from 0.35s to roughly 15s per batch.
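To put rough numbers on that trend, here is a minimal sketch that parses log lines in the format printed by the reproduction script and computes the average memory growth per iteration and the slowdown factor. The helper names `parse_log` and `summarize` are made up for this illustration and are not part of tokenizers or psutil; the sample is just the first two lines and the last line of the output above.

```python
import re

# Matches lines in the format printed by the reproduction script,
# e.g. "00: 353.12MB, 0.35s".
LOG_LINE = re.compile(r"^(\d+): ([\d.]+)MB, ([\d.]+)s$")


def parse_log(text: str) -> list[tuple[int, float, float]]:
    """Return (iteration, memory_mb, seconds) for every matching line."""
    rows = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            rows.append((int(match[1]), float(match[2]), float(match[3])))
    return rows


def summarize(rows: list[tuple[int, float, float]]) -> dict[str, float]:
    """Compare the first and last logged rows (hypothetical helper)."""
    first_i, first_mem, first_t = rows[0]
    last_i, last_mem, last_t = rows[-1]
    steps = last_i - first_i
    return {
        "growth_mb_per_iteration": (last_mem - first_mem) / steps,
        "slowdown_factor": last_t / first_t,
    }


# Excerpt of the output above: first two iterations and the last one.
sample = """\
00: 353.12MB, 0.35s
01: 421.64MB, 0.51s
40: 3309.48MB, 15.10s
"""

summary = summarize(parse_log(sample))
print(f"{summary['growth_mb_per_iteration']:.1f} MB per iteration")  # 73.9 MB per iteration
print(f"{summary['slowdown_factor']:.1f}x slower per batch")         # 43.1x slower per batch
```

On the numbers reported here, that works out to roughly 74 MB of additional resident memory per batch of 200 strings, with the last batch taking about 43 times as long as the first.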