Git Product home page Git Product logo

Comments (8)

tomaarsen avatar tomaarsen commented on July 20, 2024 2

Hello!

I am also experiencing a memory leak with these tokenizers when processing long sequences without any spaces. This has been reported as a memory leak in Sentence Transformers, and affects some of my users: UKPLab/sentence-transformers#1795

Reproduction

import random
import string
import time
import psutil
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('xlm-roberta-base')

def random_string(length: int) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

for iteration in range(99999999):
    start_t = time.time()
    tokenizer.encode_batch([random_string(12345) for _ in range(200)])
    memory_usage_in_MiB = psutil.Process().memory_info().rss / (1024 * 1024)
    delta_t = time.time() - start_t
    print(f"{iteration:02d}: {memory_usage_in_MiB:.2f}MB, {delta_t:.2f}s")

Outputs

00: 353.12MB, 0.35s
01: 421.64MB, 0.51s
02: 492.77MB, 0.68s
03: 571.88MB, 0.93s
04: 623.66MB, 1.02s
05: 710.28MB, 1.35s
06: 803.41MB, 1.31s
07: 859.77MB, 1.43s
08: 912.55MB, 1.69s
09: 1014.13MB, 1.78s
10: 1081.04MB, 1.95s
11: 1133.04MB, 2.29s
12: 1208.43MB, 2.56s
13: 1413.81MB, 2.65s
14: 1495.07MB, 2.83s
15: 1575.66MB, 3.00s
16: 1646.78MB, 3.19s
17: 1720.24MB, 3.57s
18: 1793.95MB, 3.82s
19: 1862.75MB, 4.02s
20: 1939.91MB, 4.21s
21: 2008.09MB, 4.71s
22: 2084.01MB, 5.04s
23: 2157.63MB, 5.26s
24: 2228.05MB, 5.56s
25: 2304.84MB, 6.13s
26: 2374.40MB, 6.50s
27: 2445.36MB, 6.68s
28: 2517.31MB, 7.38s
29: 2590.93MB, 7.91s
30: 2432.09MB, 8.19s
31: 2645.64MB, 8.56s
32: 2720.85MB, 8.81s
33: 2801.12MB, 9.73s
34: 2874.08MB, 10.14s
35: 2949.19MB, 11.18s
36: 3017.41MB, 11.28s
37: 3094.99MB, 12.76s
38: 3164.58MB, 14.09s
39: 3232.37MB, 13.26s
40: 3309.48MB, 15.10s

This is rather severe, not just a massive growth in memory usage, but the tokenization speed is also much, much lower.

Notes

The memory usage is much more reasonable if the strings:

  1. are not arbitrary, e.g. repeated "abc"
  2. contain spaces, e.g. by adding + " " to the list of choices.

@n1t0 @Narsil @ArthurZucker

  • Tom Aarsen

from tokenizers.

noamgai21 avatar noamgai21 commented on July 20, 2024

Related to #1495

from tokenizers.

ArthurZucker avatar ArthurZucker commented on July 20, 2024

I will check this might be related to FFI (Foreign Function Interface) and the way string are passed to rust in the background.

from tokenizers.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.