Git Product home page Git Product logo

text-splitter's Introduction

text-splitter

Docs Licence Crates.io codecov

Large language models (LLMs) can be used for many tasks, but often have a limited context size that can be smaller than documents you might want to use. To use documents of larger length, you often have to split your text into chunks to fit within this context size.

This crate provides methods for splitting longer pieces of text into smaller chunks, aiming to maximize a desired chunk size, but still splitting at semantically sensible boundaries whenever possible.

Get Started

By Number of Characters

use text_splitter::{Characters, TextSplitter};

// Maximum number of characters in a chunk
let max_characters = 1000;
// Default implementation uses character count for chunk size
let splitter = TextSplitter::default()
    // Optionally can also have the splitter trim whitespace for you
    .with_trim_chunks(true);

let chunks = splitter.chunks("your document text", max_characters);
println!("{}", chunks.count())

With Huggingface Tokenizer

Requires the tokenizers feature to be activated and adding tokenizers to dependencies. The example below, using from_pretrained(), also requires tokenizers http feature to be enabled.

Click to show Cargo.toml.
[dependencies]
text-splitter = { version = "0.6", features = ["tokenizers"] }
tokenizers = { version = "0.15", features = ["http"] }
use text_splitter::TextSplitter;
// Can also use anything else that implements the ChunkSizer
// trait from the text_splitter crate.
use tokenizers::Tokenizer;

let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None).unwrap();
let max_tokens = 1000;
let splitter = TextSplitter::new(tokenizer)
    // Optionally can also have the splitter trim whitespace for you
    .with_trim_chunks(true);

let chunks = splitter.chunks("your document text", max_tokens);
println!("{}", chunks.count())

With Tiktoken Tokenizer

Requires the tiktoken-rs feature to be activated and adding tiktoken-rs to dependencies.

Click to show Cargo.toml.
text-splitter = { version = "0.6", features = ["tiktoken-rs"] }
tiktoken-rs = "0.5"
use text_splitter::TextSplitter;
// Can also use anything else that implements the ChunkSizer
// trait from the text_splitter crate.
use tiktoken_rs::cl100k_base;

let tokenizer = cl100k_base().unwrap();
let max_tokens = 1000;
let splitter = TextSplitter::new(tokenizer)
    // Optionally can also have the splitter trim whitespace for you
    .with_trim_chunks(true);

let chunks = splitter.chunks("your document text", max_tokens);
println!("{}", chunks.count())

Using a Range for Chunk Capacity

You also have the option of specifying your chunk capacity as a range.

Once a chunk has reached a length that falls within the range it will be returned.

It is always possible that a chunk may be returned that is less than the start value, as adding the next piece of text may have made it larger than the end capacity.

use text_splitter::{Characters, TextSplitter};

// Maximum number of characters in a chunk. Will fill up the
// chunk until it is somewhere in this range.
let max_characters = 500..2000;
// Default implementation uses character count for chunk size
let splitter = TextSplitter::default().with_trim_chunks(true);

let chunks = splitter.chunks("your document text", max_characters);
println!("{}", chunks.count())

Method

To preserve as much semantic meaning within a chunk as possible, a recursive approach is used, starting at larger semantic units and, if that is too large, breaking it up into the next largest unit. Here is an example of the steps used:

  1. Split the text by a given level
  2. For each section, does it fit within the chunk size?
    • Yes. Merge as many of these neighboring sections into a chunk as possible to maximize chunk length.
    • No. Split by the next level and repeat.

The boundaries used to split the text if using the top-level chunks method, in descending length:

  1. Descending sequence length of newlines. (Newline is \r\n, \n, or \r) Each unique length of consecutive newline sequences is treated as its own semantic level.
  2. Unicode Sentence Boundaries
  3. Unicode Word Boundaries
  4. Unicode Grapheme Cluster Boundaries
  5. Characters

Splitting doesn't occur below the character level, otherwise you could get partial bytes of a char, which may not be a valid unicode str.

Note on sentences: There are lots of methods of determining sentence breaks, all to varying degrees of accuracy, and many requiring ML models to do so. Rather than trying to find the perfect sentence breaks, we rely on unicode method of sentence boundaries, which in most cases is good enough for finding a decent semantic breaking point if a paragraph is too large, and avoids the performance penalties of many other methods.

Inspiration

This crate was inspired by LangChain's TextSplitter. But, looking into the implementation, there was potential for better performance as well as better semantic chunking.

A big thank you to the unicode-rs team for their unicode-segmentation crate that manages a lot of the complexity of matching the Unicode rules for words and sentences.

text-splitter's People

Contributors

benbrandt avatar dependabot[bot] avatar fullmetalmeowchemist avatar simoncw avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.