
Bling Fire

Introduction

Hi, we are a team at Microsoft called Bling (Beyond Language Understanding); we help Bing be smarter. Here we want to share with all of you our FInite State machine and REgular expression manipulation library (FIRE). We use Fire for many linguistic operations inside Bing, such as tokenization, multi-word expression matching, unknown word guessing, and stemming/lemmatization, to mention just a few.

Bling Fire Tokenizer Overview

Bling Fire Tokenizer provides state-of-the-art performance for Natural Language text tokenization. Bling Fire supports the following tokenization algorithms:

  1. Pattern-based tokenization
  2. WordPiece tokenization
  3. SentencePiece Unigram LM
  4. SentencePiece BPE
  5. Induced/learned syllabification patterns (identifies possible hyphenation points within a token)

Bling Fire provides a uniform interface for working with all of these algorithms, so from the client's point of view there is no difference between using a tokenizer for XLNET, BERT, or your own custom model.

Model files describe the algorithm they are built for and are loaded on demand from an external file. There are also two default models, for NLTK-style tokenization and sentence breaking, which do not need to be loaded explicitly. The default tokenization model follows the logic of NLTK, except that hyphenated words are split and a few "errors" are fixed.

Normalization can be added to each model, but is optional.

Differences between the algorithms are summarized here.

The Bling Fire Tokenizer high-level API is designed to require minimal or no configuration, initialization, or additional files, and it is friendly for use from languages like Python, Ruby, Rust, C#, JavaScript (via WASM), etc.

We have precompiled some popular models and listed them below with source code references:

| File Name | Models it should be used for | Algorithm | Source Code |
| --- | --- | --- | --- |
| wbd.bin | Default tokenization model | Pattern-based | src |
| sbd.bin | Default model for sentence breaking | Pattern-based | src |
| bert_base_tok.bin | BERT Base/Large | WordPiece | src |
| bert_base_cased_tok.bin | BERT Base/Large Cased | WordPiece | src |
| bert_chinese.bin | BERT Chinese | WordPiece | src |
| bert_multi_cased.bin | BERT Multilingual Cased | WordPiece | src |
| xlnet.bin | XLNET tokenization model | Unigram LM | src |
| xlnet_nonorm.bin | XLNET tokenization model w/o normalization | Unigram LM | src |
| bpe_example.bin | A model to test BPE tokenization | BPE | src |
| xlm_roberta_base.bin | XLM Roberta tokenization | Unigram LM | src |
| laser(100k\|250k\|500k).bin | Trained on a language-balanced WikiMatrix corpus of 80+ languages | Unigram LM | src |
| uri(100k\|250k\|500k).bin | URL tokenization model trained on a large set of random URLs from the web | Unigram LM | src |
| gpt2.bin | Byte-BPE tokenization model for GPT-2 | Byte BPE | src |
| roberta.bin | Byte-BPE tokenization model for RoBERTa | Byte BPE | src |
| syllab.bin | Multilingual model to identify allowed hyphenation points inside a word | W2H | src |
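
Because the interface is uniform, swapping one model for another is just a matter of loading a different file. Here is a minimal Python sketch of that idea; the paths are illustrative and assume the .bin files from the table above have been downloaded next to the script:

import blingfire

# Same three calls for a WordPiece model and a Unigram LM model; only the
# model file differs.
for model_file in ["./bert_base_tok.bin", "./xlnet.bin"]:
    h = blingfire.load_model(model_file)
    ids = blingfire.text_to_ids(h, "I saw a girl with a telescope.", 128, 100)
    print(model_file, ids[:10])
    blingfire.free_model(h)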

Oh yes, it is also the fastest! We compared Bling Fire with tokenizers from Hugging Face: Bling Fire runs 4-5 times faster than Hugging Face Tokenizers; see also the Bing Blog Post. We compared the Bling Fire Unigram LM and BPE implementations to the same ones in the SentencePiece library, and our implementation is ~2x faster; see the XLNET benchmark and the BPE benchmark. Not to mention our default models are 10x faster than the same functionality from spaCy; see the benchmark wiki and this Bing Blog Post.

So if low-latency inference is what you need, you have to try Bling Fire!
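
The numbers above come from the linked benchmarks; if you want a rough throughput figure on your own data, a minimal sketch like the following is enough to compare against other tokenizers (sample.txt is a placeholder for your corpus, one document per line):

import time
from blingfire import text_to_words

# sample.txt is a placeholder path: one UTF-8 document per line
with open("sample.txt", encoding="utf-8") as f:
    lines = f.readlines()

start = time.perf_counter()
total_tokens = sum(len(text_to_words(line).split()) for line in lines)
elapsed = time.perf_counter() - start
print(f"{total_tokens} tokens in {elapsed:.3f}s ({total_tokens / elapsed:,.0f} tokens/s)")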

Python API Description

If you simply want to use it in Python, you can install the latest release using pip:

pip install -U blingfire

Examples

1. Python example, using default pattern-based tokenizer:

from blingfire import *

text = 'After reading this post, you will know: What "natural language" is and how it is different from other types of data. What makes working with natural language so challenging. [1]'

print(text_to_sentences(text))
print(text_to_words(text))

Expected output:

After reading this post, you will know: What "natural language" is and how it is different from other types of data.
What makes working with natural language so challenging. [1]
After reading this post , you will know : What " natural language " is and how it is different from other types of data . What makes working with natural language so challenging . [ 1 ]
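
Since text_to_sentences returns a single string with sentences separated by '\n', a common follow-up (a minimal sketch using only the calls shown above) is to split it into a Python list:

from blingfire import text_to_sentences

text = 'After reading this post, you will know: What "natural language" is and how it is different from other types of data. What makes working with natural language so challenging. [1]'

# text_to_sentences returns one string with '\n' between sentences;
# split on it to get a list of sentence strings
sentences = text_to_sentences(text).split('\n')
print(sentences)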

2. Python example, load a custom model for a pattern-based tokenizer:

from blingfire import *

# load a custom model from file
h = load_model("./wbd_chuni.bin")

text = 'This is the Bling-Fire tokenizer. 2007年9月日历表_2007年9月农历阳历一览表-万年历'

# custom model output
print(text_to_words_with_model(h, text))

# default model output
print(text_to_words(text))

free_model(h)

Expected output:

This is the Bling - Fire tokenizer . 2007 年 9 月 日 历 表 _2007 年 9 月 农 历 阳 历 一 览 表 - 万 年 历
This is the Bling - Fire tokenizer . 2007年9月日历表_2007年9月农历阳历一览表 - 万年历

3. Python example, calling BERT BASE tokenizer

On one thread, it works 14x faster than the original BERT tokenizer written in Python. Since this code is written in C++, it can be called from multiple threads without blocking on the global interpreter lock, thus achieving higher speed-ups in batch mode.

import os
import blingfire

s = "Эpple pie. How do I renew my virtual smart card?: /Microsoft IT/ 'virtual' smart card certificates for DirectAccess are valid for one year. In order to get to microsoft.com we need to type [email protected]."

# one time load the model (we are using the one that comes with the package)
h = blingfire.load_model(os.path.join(os.path.dirname(blingfire.__file__), "bert_base_tok.bin"))
print("Model Handle: %s" % h)

# use the model from one or more threads
print(s)
ids = blingfire.text_to_ids(h, s, 128, 100)  # sequence length: 128, oov id: 100
print(ids)                                   # returns a numpy array of length 128 (padded or trimmed)

# free the model at the end
blingfire.free_model(h)
print("Model Freed")

Expected output:

Model Handle: 2854016629088
Эpple pie. How do I renew my virtual smart card?: /Microsoft IT/ 'virtual' smart card certificates for DirectAccess are valid for one year. In order to get to microsoft.com we need to type [email protected].
[ 1208  9397  2571 11345  1012  2129  2079  1045 20687  2026  7484  6047
  4003  1029  1024  1013  7513  2009  1013  1005  7484  1005  6047  4003
 17987  2005  3622  6305  9623  2015  2024  9398  2005  2028  2095  1012
  1999  2344  2000  2131  2000  7513  1012  4012  2057  2342  2000  2828
 14255  1030  1015  1012  1016  1012  1015  1012  1016  1012     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0]
Model Freed
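
Because the loaded model is stateless, the same handle can be shared across threads, as noted above. A minimal sketch of batch tokenization, re-using bert_base_tok.bin from the example above (the texts list is a placeholder for your batch):

import os
from concurrent.futures import ThreadPoolExecutor
import blingfire

# one shared, read-only model handle for all worker threads
h = blingfire.load_model(os.path.join(os.path.dirname(blingfire.__file__), "bert_base_tok.bin"))

texts = ["first document", "second document", "third document"] * 100

# ctypes releases the GIL while the C++ call runs, so worker threads overlap
with ThreadPoolExecutor(max_workers=4) as pool:
    all_ids = list(pool.map(lambda s: blingfire.text_to_ids(h, s, 128, 100), texts))

print(len(all_ids), all_ids[0][:8])
blingfire.free_model(h)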

4. Python example, doing tokenization and hyphenation of a text

Since the hyphenation APIs take one word at a time, with a limit of 300 Unicode characters, we need to break the text into words first and then run hyphenation for each token.

import os
import blingfire

# load a model provided with the package
h = blingfire.load_model(os.path.join(os.path.dirname(blingfire.__file__), "syllab.bin"))

# get a text
text = "Like Curiosity, the Perseverance rover was built by engineers and scientists at NASA's Jet Propulsion Laboratory in Pasadena, California. Roughly 85% of Perseverance's mass is based on Curiosity \"heritage hardware,\" saving NASA time and money and reducing risk considerably, agency officials have said.  Как и Curiosity, марсоход Perseverance был построен инженерами и учеными из Лаборатории реактивного движения НАСА в Пасадене, Калифорния. По словам официальных лиц агентства, примерно 85% массы Perseverance основано на «традиционном оборудовании» Curiosity, что экономит время и деньги NASA и значительно снижает риски."

# break text into words with default model and hyphenate each word
output = " ".join([blingfire.word_hyphenation_with_model(h, w) for w in blingfire.text_to_words(text).split(' ')])
print(output)

# free the model after we are all done
blingfire.free_model(h)

The output should be something like this:

Li-ke Cu-rios-i-ty , the Per-se-ve-rance ro-ver was built by en-gi-neers and sci-en-tists at NASA 's Jet Pro-pul-sion La-bo-ra-to-ry in Pa-sa-dena , Cali-for-nia . Roughly 85 % of Per-se-ve-rance 's mass is ba-se-d on Cu-rios-i-ty " he-r-i-tage hard-ware , " sa-ving NASA time and money and re-du-c-ing risk con-si-de-r-ably , agen-cy of-fi-cials ha-ve said . Ка-к и Cu-rios-i-ty , мар-со-ход Per-se-ve-rance бы-л построен ин-же-не-рами и у-че-ны-ми из Ла-бора-то-рии ре-актив-ного дви-же-ния НАСА в Па-са-дене , Ка-ли-фор-ния . По сло-вам офи-ци-аль-ных ли-ц агент-ства , при-мерно 85 % мас-сы Per-se-ve-rance осно-вано на « тра-ди-ци-он-ном обо-ру-до-ва-нии » Cu-rios-i-ty , что эко-но-мит вре-мя и деньги NASA и зна-чи-те-льно сни-жа-ет риски .

Note: you can specify any other Unicode character as the hyphen that the API inserts into the output string.
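
As a sketch of that option — assuming the Python wrapper exposes the hyphen codepoint as an optional argument (named uHy here; check the wrapper's actual signature) — you could produce middle-dot-separated syllables like this:

import os
import blingfire

h = blingfire.load_model(os.path.join(os.path.dirname(blingfire.__file__), "syllab.bin"))

# uHy is assumed to be the wrapper's optional hyphen-codepoint argument;
# 0x00B7 is MIDDLE DOT, e.g. "Laboratory" -> "La·bo·ra·to·ry"
print(blingfire.word_hyphenation_with_model(h, "Laboratory", uHy=0x00B7))

blingfire.free_model(h)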

5. C# example, calling XLM Roberta tokenizer and getting ids and offsets

Note: everything that is supported in Python is supported by the C# API as well. C# can also take advantage of parallel computation: since all models and functions are stateless, you can share the same model across threads without locks. Let's load the XLM Roberta model and tokenize a string; for each token, let's get its ID and offsets in the original text.

using System;
using BlingFire;

namespace BlingUtilsTest
{
    class Program
    {
        static void Main(string[] args)
        {
            // load XLM Roberta tokenization model
            var h = BlingFireUtils.LoadModel("./xlm_roberta_base.bin");


            // input string
            string input = "Autophobia, also called monophobia, isolophobia, or eremophobia, is the specific phobia of isolation. I saw a girl with a telescope. Я увидел девушку с телескопом.";
            // get its UTF8 representation
            byte[] inBytes = System.Text.Encoding.UTF8.GetBytes(input);


            // allocate space for ids and offsets
            int[] Ids =  new int[128];
            int[] Starts =  new int[128];
            int[] Ends =  new int[128];

            // tokenize with loaded XLM Roberta tokenization and output ids and start and end offsets
            int outputCount = BlingFireUtils.TextToIdsWithOffsets(h, inBytes, inBytes.Length, Ids, Starts, Ends, Ids.Length, 0);
            Console.WriteLine(String.Format("return length: {0}", outputCount));
            if (outputCount >= 0)
            {
                Console.Write("tokens from offsets: [");
                for(int i = 0; i < outputCount; ++i)
                {
                    int startOffset = Starts[i];
                    int surfaceLen = Ends[i] - Starts[i] + 1;

                    string token = System.Text.Encoding.UTF8.GetString(new ArraySegment<byte>(inBytes, startOffset, surfaceLen));
                    Console.Write(String.Format("'{0}'/{1} ", token, Ids[i]));
                }
                Console.WriteLine("]");
            }

            // free loaded models
            BlingFireUtils.FreeModel(h);
        }
    }
}

This code will print the following output:

return length: 49
tokens from offsets: ['Auto'/4396 'pho'/22014 'bia'/9166 ','/4 ' also'/2843 ' called'/35839 ' mono'/22460 'pho'/22014 'bia'/9166 ','/4 ' is'/83 'olo'/7537 'pho'/22014 'bia'/9166 ','/4 ' or'/707 ' '/6 'eremo'/102835 'pho'/22014 'bia'/9166 ','/4 ' is'/83 ' the'/70 ' specific'/29458 ' pho'/53073 'bia'/9166 ' of'/111 ' '/6 'isolation'/219488 '.'/5 ' I'/87 ' saw'/24124 ' a'/10 ' girl'/23040 ' with'/678 ' a'/10 ' tele'/5501 'scope'/70820 '.'/5 ' Я'/1509 ' увидел'/79132 ' дев'/29513 'у'/105 'шку'/46009 ' с'/135 ' теле'/18293 'скоп'/41333 'ом'/419 '.'/5 ]

See this project for more C# examples: https://github.com/microsoft/BlingFire/tree/master/nuget/test .

6. JavaScript example, fetching and loading model file, using the model to compute ids

The goal of the JavaScript integration is the ability to run the code in a browser together with ML frameworks like TensorFlow.js and FastText web assembly.

Note: this work is still in progress, we are likely to make some changes/improvements there.

import { GetVersion, TextToWords, TextToSentences, LoadModel, FreeModel, TextToIds } from './blingfire_wrapper.js';

$(document).ready(function() {

  var text = "I saw a girl with a telescope. Я видел девушку с телескопом.";

  var modelHandle1 = null;

  $("#btn4").click(function () {
    if(modelHandle1 == null) {
      (async function () {
        modelHandle1 = await LoadModel("./bert_base_tok.bin");
        console.log("Model handle: " + modelHandle1);
      })();
    }
  });

  $("#btn5").click(function () {
    if(modelHandle1 != null) {
      FreeModel(modelHandle1);
      modelHandle1 = null;
      console.log("Model Freed!");
    }
  });

  $("#btn6").click(function () {
    if(modelHandle1 != null) {
      console.log(TextToIds(modelHandle1, text, 128));
    } else {
      console.log("Load the model first!");
    }
  });

});

Full example code can be found here. Details of the API are described in the wasm folder.

7. Example of the difference the Bling Fire default tokenizer makes in a classification task

This notebook demonstrates how Bling Fire tokenizer helps in Stack Overflow posts classification problem.

8. Example of reaching 99% accuracy for language detection

This document describes how to improve a FastText language detection model with Bling Fire and achieve 99% accuracy in a language detection task over 365 languages.

How to create your own models

If you want to create your own tokenization model, or any other finite-state model, you need to compile the C++ tools first, then use these tools to compile the linguistic resources from a human-readable format into binary finite-state machines.

  1. Set up your environment. This step needs to be done only once; it compiles the retail version of the tools and adds the build directory to the PATH.
  2. Adding a BERT-like tokenization model describes how to add a new tokenization model similar to BERT.
  3. How to add a new Unigram LM model.
  4. How to add a new BPE model.

Note: please read the documents above in order before creating your own model. If you have any questions, please open an issue on GitHub.

Support for other programming languages

  1. Rust wrapper
  2. Ruby wrapper

Supported Platforms

Bling Fire is supported on Windows, Linux, and Mac (thanks to Andrew Kane!)

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Working Branch

To contribute directly to code base, you should create a personal fork and create feature branches there when you need them. This keeps the main repository clean and your personal workflow out of sight.

Pull Request

Before we can accept a pull request from you, you'll need to sign a Contributor License Agreement (CLA). It is an automated process and you only need to do it once.

However, you don't have to do this up-front. You can simply clone, fork, and submit your pull-request as usual. When your pull-request is created, it is classified by a CLA bot. If the change is trivial (i.e. you just fixed a typo) then the PR is labelled with cla-not-required. Otherwise, it's classified as cla-required. In that case, the system will also tell you how you can sign the CLA. Once you have signed a CLA, the current and all future pull-requests will be labelled as cla-signed.

To enable us to quickly review and accept your pull requests, always create one pull request per issue and link the issue in the pull request if possible. Never merge multiple requests in one unless they have the same root cause. Besides, keep code changes as small as possible and avoid pure formatting changes to code that has not been modified otherwise.

Feedback

Reporting Security Issues

Security issues and bugs should be reported privately, via email, to the Microsoft Security Response Center (MSRC) at [email protected]. You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Further information, including the MSRC PGP key, can be found in the Security TechCenter.

License

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


blingfire's Issues

BlingFire Nuget Package does not work on CentOS7

It seems that libblingfiretokdll.so is compiled on Ubuntu 16.04, and this does not work on CentOS 7, where there is an older version of glibc.

We were able to work around this issue by recompiling the library for CentOS 7.

Since NuGet can ship separate binaries for different Linux distros, would it be possible to add a CentOS 7 target to the official blingfire build?

[Bug] Input/Output Error compiling linguistic sources into automata on Mac

Hello experts,
I am trying to compile the blingfire wbd tokenizer (latest release) on Mac, following https://github.com/Microsoft/BlingFire/wiki/How-to-change-linguistic-resources
I am encountering an "Input/Output error" when trying to execute
make -f Makefile.gnu lang=wbd all

fa_build_conf \
	 --in=wbd_chuni/ldb.conf.small \
	 --out=wbd_chuni/tmp/ldb.mmap.small.txt
fa_fsm2fsm_pack --type=mmap \
	 --in=wbd_chuni/tmp/ldb.mmap.small.txt \
	 --out=wbd_chuni/tmp/ldb.conf.small.dump \
	 --auto-test
fa_build_lex --dict-root=. --full-unicode --in=wbd_chuni/wbd.lex.utf8 \
	 --tagset=wbd_chuni/wbd.tagset.txt --out-fsa=wbd_chuni/tmp/wbd.rules.fsa.txt \
	 --out-fsa-iwmap=wbd_chuni/tmp/wbd.rules.fsa.iwmap.txt \
	 --out-map=wbd_chuni/tmp/wbd.rules.map.txt
Assigning non-zero to $[ is no longer possible at fa_preproc_mLJyysye line 2.
Assigning non-zero to $[ is no longer possible at fa_preproc_phTqt8Iv line 2.
Assigning non-zero to $[ is no longer possible at fa_preproc_Ou_JVTIV line 2.
ERROR: Input/Output error. in /Users/Gowtham.Raman/Desktop/document-vectorization/BlingFire/blingfirecompile.library/src/FAAutIOTools.cpp at line 47 in program fa_fsm2fsm_iwec
ERROR: Input/Output error. in /Users/Gowtham.Raman/Desktop/document-vectorization/BlingFire/blingfirecompile.library/src/FAAutIOTools.cpp at line 313 in program fa_nfa2dfa
ERROR: Input/Output error. in /Users/Gowtham.Raman/Desktop/document-vectorization/BlingFire/blingfirecompile.library/src/FAAutIOTools.cpp at line 84 in program fa_fsm2fsm
fa_fsm2fsm_pack --alg=triv --type=moore-dfa --remap-iws --use-iwia --in=wbd_chuni/tmp/wbd.rules.fsa.txt --iw-map=wbd_chuni/tmp/wbd.rules.fsa.iwmap.txt --out=wbd_chuni/tmp/wbd.fsa.small.dump
ERROR: Input/Output error. in /Users/Gowtham.Raman/Desktop/document-vectorization/BlingFire/blingfirecompile.library/src/FAAutIOTools.cpp at line 449 in program fa_fsm2fsm_pack
make: *** [wbd_chuni/tmp/wbd.fsa.small.dump] Error 2

Can someone help me understand what's happening? :)

East Asian support

The readme says, "Currently released model supports most of the languages except East Asian (Chinese Simplified, Traditional, Japanese, Korean, Thai)."

Is East Asian support on the roadmap?

This could be useful for ML.NET, where only a whitespace tokenizer has been open sourced. One potential issue is that ML.NET has a preference for pure C# code.

The whitespace tokenization (or breaking on any specific character) is not suitable for East Asian languages: dotnet/machinelearning#325

HuggingFace export

For .NET developers attempting to convert HuggingFace transformer models to ONNX, we lack .NET versions of the tokenizers provided by the HuggingFace transformers framework.

To promote ONNX adoption among .NET developers, please provide .NET versions of the various tokenizers used in HuggingFace transformers.

e.g. would it be possible to have a BertTokenizer class with functions and semantics similar to the HuggingFace transformers BertTokenizerFast shown below?

[screenshot]

.NET wrapper

Is there an official .NET wrapper available? Couldn't find one at nuget.org...

How to add sentence splitting rules?

Hi all,

I am trying to use BlingFire for sentence splitting in Greek. As it makes many errors, I want to improve the "rules" it uses. How can I do this?

I want to try to port some rules from the Ellogon language engineering platform, which has a sentence splitter written in GNU Flex. It is a combination of a few lexicons and some regular expression patterns.

Error loading files with paths with non-ANSI characters

We're using BlingFire from C#, using BlingFireNuget version 0.1.8. Our product is running on Windows, on an NTFS file system. We cannot load model files that contain special non-ANSI characters in the path (such as ä in the user's name), for example: C:\Users\UserNäme\AppData\Local\Microsoft\VisualStudio\17.0_d1920296\IntelliCodeModels\all_line-completions2_ExtractedData\bpe.bin will fail to load even though it would succeed if the path didn't contain the ä character.

BlingFireUtils.LoadModel(string modelName) converts the string to a UTF-8 encoded byte array to be passed to the extern "C" implementation void* LoadModel(const char * pszLdbFileName). Somewhere inside it's failing and in the .NET side we get an error that reads:

System.Runtime.InteropServices.SEHException: External component has thrown an exception.
      at BlingFire.BlingFireUtils.LoadModel(Byte[] modelName)
      ...

We've tried encoding the model file name using Unicode (UTF-16) and ANSI, both fail the same as the default UTF-8 encoding, all with the same result. Note that NTFS file systems encode paths in a format close to UTF-16.

It looks like the call to fopen_s to read the model file is not going to allow for files with ä in the file path and could be replaced by a call to _wfopen_s instead. This change would also require BlingFireUtils.LoadModel to switch from UTF-8 encoding to UTF-16 encoding, but I might be wrong.

Running in .NET within Unity crashes

If anyone is interested in using this in Unity: it does not work... A single LoadModel call crashes the editor.
Not much useful info either, as the error itself is very uninformative.

using System.IO;
using BlingFire;
using UnityEngine;

public class BlingTest : MonoBehaviour {
    // GetPath is the author's helper that resolves the model file path
    string tokeniserI2WPath = GetPath("gpt2.i2w");

    void Start() {
        if (File.Exists(tokeniserI2WPath)) {
            Debug.Log("Path Found: " + tokeniserI2WPath);
            var handle = BlingFireUtils.LoadModel(tokeniserI2WPath);
        }
    }
}

Update: code clarification (File.Exists check).
[screenshot]

Feature Request: ids_to_text method

Hi, it'd be nice if there were a method to convert ids to text. This would allow Bling Fire to be used with text generation models. Here's similar code from Hugging Face's Transformers:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.decode([1188, 1110, 170, 2774])) # prints "This is a test"

Release schedule?

Out of curiosity, when are you planning to do the next release on pip?

What git commit will be in that release?

Word Tokenization - Unexpected Output

Is this expected?

text = '''Mr. G. B. Shaw, known at his insistence simply as Bernard Shaw, was an Irish playwright.'''
print(blingfire.text_to_words(text).split())
print(list(nlp(text))) ##spacy

['Mr', '.', 'G', '.', 'B', '.', 'Shaw', ',', 'known', 'at', 'his', 'insistence', 'simply', 'as', 'Bernard', 'Shaw', ',', 'was', 'an', 'Irish', 'playwright', '.']
[Mr., G., B., Shaw, ,, known, at, his, insistence, simply, as, Bernard, Shaw, ,, was, an, Irish, playwright, .]

The dot (.) in "Mr." and "G." should not be treated as a distinct token; each abbreviation should be a single token.

Ruby Library

Hi, thanks for this awesome library! I wanted to let you know there are now Ruby bindings for it. A few notes:

  • When calling LoadModel with a path that doesn't exist, an uncaught exception is raised that crashes the program (libc++abi.dylib: terminating with uncaught exception). It'd be great to catch this and have a method to retrieve the error.
  • It includes a shared library for Mac. I was able to get it to compile with a few changes: master...ankane:mac (#9)
  • It'd be great if there were tagged releases on GitHub. It currently compiles shared libraries from master (with this repo).

If you have any feedback, let me know or feel free to create an issue on the project. Thanks!

Cannot find "no_padding" option in C# ?

Hey @SergeiAlonichau and @ankane, I'm trying to get parity with HuggingFace tokeniser from the BlingFire C# bindings:

    public int[] TestTokenise(string input_str)
    {
        string tokeniserModelPath = "D:/Models/tokenizers/gpt2.bin";
        tokenizerHandle = BlingFireUtils.LoadModel(tokeniserModelPath);
        BlingFireUtils.SetNoDummyPrefix(tokenizerHandle, false);
        Debug.Log($"About to tokenize {input_str}");
        byte[] inBytes = System.Text.Encoding.UTF8.GetBytes(input_str);
        int[] ids = new int[128];
        int outputCount = BlingFireUtils.TextToIds(tokenizerHandle, inBytes, inBytes.Length, ids, ids.Length, 0);
        Debug.Log($"Found {outputCount} tokens [{string.Join(",",ids)}]");
        return ids.Take(outputCount).ToArray();
    }

I'm getting different tokens than what @ankane had earlier:
[screenshot]

From your discussion, I think I'd need to set no_padding to true, but I do not find this option in the C# interface. Any clue where I should look?

Originally posted by @stephane-lallee in #82 (comment)

Why personal title abbreviation split

When using BlingFire, sentences are not separated correctly if a personal title precedes a name.

For Example >>

(1) Original Text : On July 26, 2013, Michael G. Spinozzi, President of Sally Beauty Supply, notified the Company of his retirement from the Company with an anticipated effective date of November 8, 2013. Mr. Spinozzi has served as President of Sally Beauty Supply since 2006 and the Company is grateful for Mr. Spinozzis leadership and commitment to the success of the Company during his tenure as an officer of the Company.

Result:

:: On July 26, 2013, Michael G. Spinozzi, President of Sally Beauty Supply, notified the Company of his retirement from the Company with an anticipated effective date of November 8, 2013.

:: Mr.

:: Spinozzi has served as President of Sally Beauty Supply since 2006 and the Company is grateful for Mr. Spinozzis leadership and commitment to the success of the Company during his tenure as an officer of the Company.

(2) Original Text : The Company has promoted Claudia S. San Pedro, age 45, to Senior Vice President, Chief Financial Officer and Treasurer. Ms. Pedro served as Vice President of Investor Relations and Communications of the Company since January 2013 and as Vice President of Investor Relations from July 2010 until January 2013.

Result:

:: The Company has promoted Claudia S. San Pedro, age 45, to Senior Vice President, Chief Financial Officer and Treasurer.

:: Ms.

:: Pedro served as Vice President of Investor Relations and Communications of the Company since January 2013 and as Vice President of Investor Relations from July 2010 until January 2013.

My Code >>

fn = lambda x: blingfire.text_to_sentences(x).split('\n')
y = fn('Original Text : The Company has promoted Claudia S. San Pedro, age 45, to Senior Vice President, Chief Financial Officer and Treasurer. Ms. Pedro served as Vice President of Investor Relations and Communications of the Company since January 2013 and as Vice President of Investor Relations from July 2010 until January 2013.')

Is there any particular problem with this?

thanks

Improving sentences' tokenization

Hi guys :)

I'm playing around with BlingFire and I've spotted a potential opportunity for improvement.
Let's say that I have this text:

"blabla lba!asdasdasd solely on creating ads to reach your potential customers or to re-engage your existing ones. some text!"

Here's the code:
`doc = "blabla lba!asdasdasd solely on creating ads to reach y"
output = text_to_sentences(doc).split('\n')

for sent in output:
print("Sentence text: ", sent)
`
This won't split the first two sentences, since there is no space between them, only a punctuation character (! here, but it could be . or ? too). Now I know that it might be hard to do this properly (for example, decimal numbers fall into this category and would incorrectly be split into 2 different sentences).

However, in practice, people tend to make this kind of mistake (starting a new sentence without a space in between), so I was curious whether the function can be improved.

Best,
Emma

Invalid parameters error FAMultiMapPack_fixed.cpp

I am new to BlingFire and I try to make the tokenizer of this model: https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment
I followed the following tutorial: https://github.com/Microsoft/BlingFire/wiki/How-to-add-a-new-BERT-tokenizer-model
The compile process took so long that I decided to follow the advice in this issue: #92
I ended up with this script:

# Preparing a new tokenizer

#	Initial Steps

mkdir bert_multi_uncased
cp bert_base_tok/* bert_multi_uncased
cd bert_multi_uncased
chmod 777 ./options.small
sed -i "5s#.*#OUTPUT = bert_multi_uncased.bin#" "./options.small"
sed -i "9s#.*#opt_build_wbd = --dict-root=. --full-unicode --no-min#" "./options.small"

#	Enable Normalization

python gen_charmap.py > charmap.utf8

#	Add a New vocab.txt

wget https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment/resolve/main/vocab.txt -O ./vocab.txt
python3 vocab_to_fa_lex.py

chmod 777 ./wbd.lex.utf8
sed -i "21s#.*#_include bert_multi_uncased/vocab.falex#" "./wbd.lex.utf8"

#	Compile Your New Model

cd ..
make -f Makefile.gnu lang=bert_multi_uncased all

But after 5 days of compilation I have the following error:

ERROR: Invalid parameters. in /home/.../BlingFire/blingfirecompile.library/src/FAMultiMapPack_fixed.cpp at line 131 in program fa_fsm2fsm_pack

What did I miss?

Does it support tokens to id?

I have lists of lists, where each list contains word tokens and is of a different length. Is there a function token_to_ids, like text_to_ids?
The BERT tokenizer has a similar function, "convert_tokens_to_ids".

[WinError 193] %1 is not a valid Win32 application

Getting the below error after running pip install blingfire (install works fine). Any ideas?

from blingfire import *
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users...\AppData\Local\Programs\Python\Python37-32\lib\site-packages\blingfire\__init__.py", line 16, in <module>
    blingfire = cdll.LoadLibrary(os.path.join(path, "blingfiretokdll.dll"))
  File "C:\Users...\AppData\Local\Programs\Python\Python37-32\lib\ctypes\__init__.py", line 434, in LoadLibrary
    return self._dlltype(name)
  File "C:\Users...\AppData\Local\Programs\Python\Python37-32\lib\ctypes\__init__.py", line 356, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: [WinError 193] %1 is not a valid Win32 application

Trouble installing for custom model creation

Hey,
So I've been struggling to prepare the system for integrating my custom model into BlingFire.
I'm following these steps.
It works until the last step, "make". I tried via the VS Code terminal (essentially PowerShell) and the regular cmd prompt. I've got Make installed, but it does not seem to find any makefiles anywhere. I tried in Release and in the project's root folder.
Incidentally, the fa_nfa2dfa --help command isn't recognized, as apparently some more compiling needs to be done...
OS: Windows 10

Please point me in the right direction...

A list of feature requests for BlingFire

Through a recent evaluation of the feasibility of using BlingFire to tokenize GPT2 for .NET, it seems there is a practical need for interoperability between BlingFire and tensor text manipulation through a .NET library.

This issue aims to gather feedback, as there are potential new .NET users here who are interested in deep NLP and may consider using dotnet/TorchSharp for interoperability with BlingFire, in the same spirit as the use cases in PyTorch.

For these .NET users, one tentative idea is to look at the NLP features provided by PyTorch/Text and evaluate how many of the PyTorch/Text NLP functionalities are already provided by BlingFire, perhaps with better performance.

We need feedback: by looking through the functionalities provided by PyTorch/Text, we can make these PyTorch NLP features (through BlingFire) available in TorchSharp.

==> Likewise, the unmet .NET NLP features found in PyTorch/Text could provide ideas/inspiration for what else to develop to improve BlingFire

Requests

Could BlingFire address all the tokenization needs listed here by Onnxruntime.Extension?

[screenshot]

macOS support, and Python dependencies

Hi all, these are two related issues:

  1. After running pip3 install blingfire on macOS, which appears to work fine, it's impossible to use the library since the .dylib isn't there. And the error message is rather meh: AttributeError: 'NoneType' object has no attribute 'TextToWords'

  2. It appears to require NumPy, but doesn't specify it as a dependency, so running from a venv without numpy in it also fails.

Both can be fixed by tuning the setup.py metadata, but wrt macOS, can we please actually support it? I can't imagine it being very different from the Linux build, if at all.

Thanks for open sourcing this!

Roberta tokenizer - first word in sentence doesn't match huggingface tokenizer

In the original roberta tokenizer, words are treated differently if they appear at the beginning of a sentence, i.e. they don't have a space before them:

For example the following code:

tok_hugging_face = RobertaTokenizer.from_pretrained('roberta-base')
tok_blingfire = blingfire.load_model(os.path.join(os.path.dirname(blingfire.__file__), "roberta.bin"))

sentence = "test"
print(f'Sentence - {sentence}')  
print(f'Hugging Face - {tok_hugging_face(sentence)["input_ids"]}')  
print(f'BlingFire - {blingfire.text_to_ids(tok_blingfire, sentence, 1, 100)}')  
print()
sentence = "something test"
print(f'Sentence - {sentence}')
print(f'Hugging Face - {tok_hugging_face(sentence)["input_ids"]}')
print(f'BlingFire - {blingfire.text_to_ids(tok_blingfire, sentence, 2, 100)}')

Produces the following output:

Sentence - test
Hugging Face - [0, 21959, 2]
BlingFire - [1296]

Sentence - something test
Hugging Face - [0, 18891, 1296, 2]
BlingFire - [ 402 1296]

In Hugging Face, 0 and 2 are start and end tokens, so they can be ignored. As you can see, the word "test" received the same ID in both cases in BlingFire, whereas in HuggingFace it's different.

Error during compile new BERT tokenizer model on Windows

Error while compiling a new BERT model following the wiki: https://github.com/Microsoft/BlingFire/wiki/How-to-add-a-new-BERT-tokenizer-model

run cmd "make -f Makefile.gnu lang=bert_frde all" and get error message:

fa_build_conf
--in=bert_frde/ldb.conf.small
--out=bert_frde/tmp/ldb.mmap.small.txt
fa_fsm2fsm_pack --type=mmap
--in=bert_frde/tmp/ldb.mmap.small.txt
--out=bert_frde/tmp/ldb.conf.small.dump
--auto-test
fa_build_lex --dict-root=. --full-unicode --in=bert_frde/wbd.lex.utf8
--tagset=bert_frde/wbd.tagset.txt --out-fsa=bert_frde/tmp/wbd.rules.fsa.txt
--out-fsa-iwmap=bert_frde/tmp/wbd.rules.fsa.iwmap.txt
--out-map=bert_frde/tmp/wbd.rules.map.txt
Assigning non-zero to $[ is no longer possible at fa_preproc_Yd8j5mwj line 2.
Assigning non-zero to $[ is no longer possible at fa_preproc_An2GPBI4 line 2.
Assigning non-zero to $[ is no longer possible at fa_preproc_I_6dCrnx line 2.
ERROR: Unknown error in program fa_nfalist2nfa
ERROR: Input/Output error. in D:\GitHub\BlingFire\blingfirecompile.library\src\FAAutIOTools.cpp at line 47 in program fa_fsm2fsm_iwec
ERROR: Input/Output error. in D:\GitHub\BlingFire\blingfirecompile.library\src\FAAutIOTools.cpp at line 313 in program fa_nfa2dfa
ERROR: Unknown error in program fa_dfa2mindfa
ERROR: Unknown error in program fa_fsm_renum
ERROR: Input/Output error. in D:\GitHub\BlingFire\blingfirecompile.library\src\FAAutIOTools.cpp at line 84 in program fa_fsm2fsm
fa_fsm2fsm_pack --alg=triv --type=moore-dfa --remap-iws --use-iwia --in=bert_frde/tmp/wbd.rules.fsa.txt --iw-map=bert_frde/tmp/wbd.rules.fsa.iwmap.txt --out=bert_frde/tmp/wbd.fsa.small.dump
ERROR: Input/Output error. in D:\GitHub\BlingFire\blingfirecompile.library\src\FAAutIOTools.cpp at line 449 in program fa_fsm2fsm_pack
make: *** [bert_frde/tmp/wbd.fsa.small.dump] Error 2

"bert_chinese.bin" model gives wrong results

When tokenizing with the BlingFire model,

import os
import blingfire
h = blingfire.load_model(os.path.join(os.path.dirname(blingfire.__file__),'bert_chinese.bin'))
print(blingfire.text_to_words_with_model(h, '搭船前往惠阳,在进入该区域后,他们随即被守军拦下并关押于当地军营。'))

The above code prints:

搭 搭 船 船 前 前 往 往 惠 惠 阳 阳 , , 在 在 进 进 入 入 该 该 区 区 域 域 后 后 , , 他 他 们

While the Huggingface tokenizer gives the correct result:

from transformers import BertTokenizer
tknz = BertTokenizer.from_pretrained('bert_vocab_chinese.txt')
print(' '.join(tknz.tokenize('搭船前往惠阳,在进入该区域后,他们随即被守军拦下并关押于当地军营。')))
搭 船 前 往 惠 阳 , 在 进 入 该 区 域 后 , 他 们 随 即 被 守 军 拦 下 并 关 押 于 当 地 军 营 。

Calling LoadModel from .net core WebAPI throws SEHException

exception:
at BlingFire.BlingFireUtils.LoadModel(Byte[] modelName) at BlingFire.BlingFireUtils.LoadModel(String modelName) in C:\Users\vanac\.nuget\packages\blingfirenuget\0.1.4\contentFiles\cs\any\BlingFireUtils.cs:line 23

The same code was tested with unit tests and works perfectly. It only throws the exception when I send a request to a web API using the same component. (This behaviour is the same when running the API through IIS as well as when running it directly.)

EDIT: code causing this issue:
public TextTokenizer() { _EncoderModel = BlingFireUtils.LoadModel("./bert_base_tok.bin"); }

[Bug] Broken master build on mac

Followed https://github.com/Microsoft/BlingFire/wiki/How-to-change-linguistic-resources to set up the environment; failed to execute make with the error:
ld: unknown option: --gc-sections
clang: error: linker command failed with exit code 1 (use -v to see invocation)

    > git clone Bling-Fire-Git-Path
    > cd BlingFire
    > mkdir Release
    > cd Release
    > cmake -DCMAKE_BUILD_TYPE=Release ..
-- The C compiler identification is AppleClang 12.0.5.12050022
-- The CXX compiler identification is AppleClang 12.0.5.12050022
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /Library/Developer/CommandLineTools/usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /Library/Developer/CommandLineTools/usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
    > make
 ....
 2 warnings generated.
[ 57%] Linking CXX executable test_fsm
ld: unknown option: --gc-sections
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [test_fsm] Error 1
make[1]: *** [CMakeFiles/test_fsm.dir/all] Error 2
make: *** [all] Error 2

Byte offsets for original input bytes to allow non-destructive tokenization

Hi all,

Is there a way to use blingfire to get byte offsets of tokens or sentence boundaries in the original input bytes rather than the constructed output byte array? I'm interested in non-destructively storing my text content and my token boundaries so I can do things like pre-process sentence breaks and tokens offline and then slice my content at runtime without having to store both the original content (which I may want to do further processing on) and the output of blingfire.

-- Eric

Blingfire in longrunning process

Hi all,

I have a long-running process (a web service) that I'd like to use blingfire in. In all of the code examples I have seen, the model is freed after it's called, and that makes me a little nervous that there might be some particular reason the model is freed, since that's pretty unusual to do in Python. Usually, once something passes out of scope and its refcount drops to zero, it's freed by the garbage collector, so freeing an object manually is a little odd (though I don't really work with ctypes much, so maybe that's just how it works when you load a DLL).

Is it ok for me to just never free the model? Will I get memory leaks?
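
For what it's worth, a minimal pattern (a sketch, not official guidance) is to load the model once at startup, share the handle for the life of the process, and register the free at interpreter shutdown:

import atexit
import os
import blingfire

# load once at startup; the handle is stateless and can be shared freely
_h = blingfire.load_model(os.path.join(os.path.dirname(blingfire.__file__), "bert_base_tok.bin"))

# free_model releases the allocation made by load_model; registering it
# with atexit keeps things tidy in a long-running service
atexit.register(blingfire.free_model, _h)

def tokenize(text):
    return blingfire.text_to_ids(_h, text, 128, 100)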

Support for GPT-2, RoBERTa, and general Hugging Face models

Hi, thanks for this great library! I'm trying to add GPT-2 but am having trouble. The model compiles, but only returns a single unknown token when testing it.

Model code: master...ankane:gpt2
Model type: BPE

Test code

import blingfire

h = blingfire.load_model("gpt2.bin")
print(blingfire.text_to_ids(h, "This is a test", 128, 100))
# outputs [100   0   0  ...]

If you have time, any advice would be great (I'm having trouble understanding how it works).

Edit: fwiw, I was able to build a model for T5, but had a SentencePiece model to start from.

BlingFire installation: fa_build_lex missing!?

I am installing BlingFire on my Mac (Mojave) so I can create my own tokenizers.

I followed the steps described here: https://github.com/Microsoft/BlingFire/wiki/How-to-change-linguistic-resources

During the chapter "Edit linguistic sources and compile them into automata" I have the following output:

fa_build_conf \
 --in=wbd/ldb.conf.small \
 --out=wbd/tmp/ldb.mmap.small.txt

fa_fsm2fsm_pack --type=mmap \
 --in=wbd/tmp/ldb.mmap.small.txt \
 --out=wbd/tmp/ldb.conf.small.dump \
 --auto-test

fa_build_lex --dict-root=. --full-unicode --in=wbd/wbd.lex.utf8 \
 --tagset=wbd/wbd.tagset.txt --out-fsa=wbd/tmp/wbd.rules.fsa.txt \
 --out-fsa-iwmap=wbd/tmp/wbd.rules.fsa.iwmap.txt \
 --out-map=wbd/tmp/wbd.rules.map.txt

make: fa_build_lex: No such file or directory

make: *** [wbd/tmp/wbd.rules.fsa.txt] Error 1

fa_build_lex was not created in my Release folder during the chapter "Make the tools ready (needed to be done one time only)". During this step I have no errors but the warning:

clang: warning: -Wl,-dead_strip: 'linker' input unused [-Wunused-command-line-argument]

What did I do wrong?

PS: This topic is a duplicate of one post from Stack Overflow... sorry. I will also duplicate the good answer and mention my savior when I have my answer.

Sentence boundary for other languages

I am trying to figure out if there is any support for sentence boundary recognition for other languages.
I could not find documentation on how to replace wbd.bin / sbd.bin for new languages.
It is mentioned that these are NLTK-style tokenizers and sentence breakers. So, is there a way to export language-specific NLTK models like PunktSentenceTokenizer to BlingFire?

Fail to import package after pip install

Hello there,

Just pip installed Blingfire on Linux Ubuntu 14.04.

Tried to import the package and I'm getting this traceback.

Any ideas on what goes wrong?

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/---/Programs/pycharm-2017.1/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/home/---/envs/py36env/lib/python3.6/site-packages/blingfire/__init__.py", line 16, in <module>
    blingfiretokdll = cdll.LoadLibrary(os.path.join(path, "libblingfiretokdll.so"))
  File "/usr/lib/python3.6/ctypes/__init__.py", line 426, in LoadLibrary
    return self._dlltype(name)
  File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/---/envs/py36env/lib/python3.6/site-packages/blingfire/libblingfiretokdll.so: symbol _ZTTNSt7__cxx1119basic_ostringstreamIcSt11char_traitsIcESaIcEEE, version GLIBCXX_3.4.21 not defined in file libstdc++.so.6 with link time reference

libblingfiretokdll.dylib arm64e support

Hi all, I am using an M1 Mac environment, and I got this. Can anyone help?

/Users/yinnnyou/miniconda3/envs/s2search/lib/python3.8/site-packages/blingfire
Traceback (most recent call last):
  File "s2search_example.py", line 3, in <module>
    from s2search.rank import S2Ranker
  File "/Users/yinnnyou/workspace/s2search/s2search/__init__.py", line 1, in <module>
    from s2search.features import *
  File "/Users/yinnnyou/workspace/s2search/s2search/features.py", line 5, in <module>
    from s2search.text import find_query_ngrams_in_text, fix_text, STOPWORDS
  File "/Users/yinnnyou/workspace/s2search/s2search/text.py", line 5, in <module>
    from blingfire import text_to_words
  File "/Users/yinnnyou/miniconda3/envs/s2search/lib/python3.8/site-packages/blingfire/__init__.py", line 20, in <module>
    blingfire = cdll.LoadLibrary(os.path.join(path, "libblingfiretokdll.dylib"))
  File "/Users/yinnnyou/miniconda3/envs/s2search/lib/python3.8/ctypes/__init__.py", line 451, in LoadLibrary
    return self._dlltype(name)
  File "/Users/yinnnyou/miniconda3/envs/s2search/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: dlopen(/Users/yinnnyou/miniconda3/envs/s2search/lib/python3.8/site-packages/blingfire/libblingfiretokdll.dylib, 0x0006): tried: 
'/Users/yinnnyou/miniconda3/envs/s2search/lib/python3.8/site-packages/blingfire/libblingfiretokdll.dylib' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64e')), 
'/usr/local/lib/libblingfiretokdll.dylib' (no such file), '/usr/lib/libblingfiretokdll.dylib' (no such file)

Support for ARM64 architecture

Hi,

We'd like to use BlingFire in ARM64 processes natively. I'm creating this ticket to express our interest in official support for ARM64 for all three supported operating systems (Windows, Linux, and macOS), as well as keeping existing support for x64.

Thanks!

Size of models?

Hello
I am wondering about the difference compared to the models listed in the PyTorch repo (they are roughly 500MB). Are the models here the same, and if so, how come the BlingFire models are so small in comparison?

Add support for XLnet

You just merged support for BERT, it seems, which is nice; but XLNet is the new state of the art and has big momentum, so it would be nice to include it as well.

Just out of curiosity, will you use BERT internally to improve Bing search results? (If you cannot disclose this information, that's OK!)

Include benchmarks in Readme

BlingFire seems really interesting, thank you for open sourcing it!!!
But I (and many others) who use NLP tools have high accuracy requirements.

I think you would attract more users if you published more benchmarks on the NLP tasks BlingFire supports.
That would also help to catch regressions!

I would particularly be interested in English lemmatization performance.

If you reach state-of-the-art performance on some tasks, I would be pleased to add them to NLP-progress.

I see that you have a benchmark wiki, but not for enough tasks.
Btw, it would be nice to refresh your benchmarks now that you merged BERT.
