Comments (3)
Is YouTokenToMe the right tool for this?
https://github.com/VKCOM/YouTokenToMe
If so, do I need to run the command below to train a tokenizer model for each language I use?
yttm bpe --data data/train.en --model bpe8k_yttm.model --vocab_size 8200
from nemo.
Yes, please refer to the YouTokenToMe tokenizer's docs on how to do it.
The command you posted looks good to me. Do you have any issues with it?
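If you need one model per language, the same command can be repeated for each training file. Below is a minimal sketch that only builds (and, optionally, runs) the `yttm bpe` command with the three flags from your post. The `data/train.<lang>` layout, the `bpe8k_yttm.<lang>.model` output name, the `de` language code, and the helper names are assumptions for illustration, not something NeMo or YouTokenToMe prescribes.

```python
import shutil
import subprocess


def build_yttm_command(lang, vocab_size=8200, data_dir="data", model_prefix="bpe8k_yttm"):
    """Return the argv list for training one BPE model with the yttm CLI.

    The file naming scheme here is hypothetical; adjust it to your own layout.
    """
    return [
        "yttm", "bpe",
        "--data", f"{data_dir}/train.{lang}",
        "--model", f"{model_prefix}.{lang}.model",
        "--vocab_size", str(vocab_size),
    ]


def train_tokenizers(langs, dry_run=True):
    """Build one training command per language; run them only if dry_run is False."""
    commands = [build_yttm_command(lang) for lang in langs]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            if shutil.which("yttm") is None:
                raise RuntimeError("yttm CLI not found on PATH")
            # check=True raises CalledProcessError if training fails
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: prints the commands without training anything
    train_tokenizers(["en", "de"])
```

Calling `train_tokenizers(["en", "de"], dry_run=False)` would actually invoke the CLI, assuming `yttm` is installed and the training files exist.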
Dear Oleksii,
No, the issue was solved!
So I am closing this issue.
I will raise my further questions about using a new language with NeMo's NLP as separate issues.