Comments (9)
I know NLTK has a Moses tokenizer, but I'm unsure whether it's a good port; I've always used tokenize.pl anyway.
For reference, for large corpora (e.g. MT) I've found it faster to pretokenize your data and feed it into TorchText with the str.split tokenizer than to tokenize on every run.
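A minimal sketch of the pretokenize-once idea, under stated assumptions: `simple_tokenize`, `pretokenize`, and `load_pretokenized` are hypothetical helpers written for illustration, and the regex tokenizer is only a toy stand-in for a real (and slower) tokenizer such as Moses.

```python
import re
import tempfile
from pathlib import Path


def simple_tokenize(line):
    # Toy stand-in for an expensive tokenizer: separate words from punctuation.
    return re.findall(r"\w+|[^\w\s]", line)


def pretokenize(lines, out_path):
    # One-off step: store each line as space-separated tokens.
    with open(out_path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(" ".join(simple_tokenize(line)) + "\n")


def load_pretokenized(path):
    # Every later run only needs the cheap str.split.
    with open(path, encoding="utf-8") as f:
        return [str.split(line) for line in f]


corpus = ["Hello, world!", "Tokenize once; reuse often."]
out_file = Path(tempfile.mkdtemp()) / "corpus.tok"
pretokenize(corpus, out_file)
tokens = load_pretokenized(out_file)
```

The expensive tokenization happens once when the file is written; afterwards the data loader can be handed `str.split` as its tokenizer.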
@nelson-liu Is there an option to include a bash script to tokenize your data as part of the library?
This issue mentions that Moses in NLTK was fixed: nltk/nltk#1214
Hmm, not directly. Ostensibly you could write a function that takes an input string, runs the bash script on it (e.g. with subprocess), and parses the output into tokens, but that feels like a lot of overhead.
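For what it's worth, a rough sketch of that subprocess wrapper might look like the following. The function name `tokenize_with_script` is hypothetical, and the demo pipes text through a Python one-liner that just echoes stdin, as a stand-in for a real tokenizer script; the exact command for a Perl tokenizer would depend on where the script lives and which flags it takes.

```python
import subprocess
import sys


def tokenize_with_script(text, command):
    # Pipe `text` to an external command and split its stdout on whitespace.
    result = subprocess.run(
        command,
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script fails
    )
    return result.stdout.split()


# Stand-in for a real tokenizer script: echoes stdin back unchanged.
# In practice `command` would be something like ["perl", "<path to script>"].
echo_command = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]
demo_tokens = tokenize_with_script("Hello , world !", echo_command)
```

Each call starts a new process, which is the overhead mentioned above; it adds up quickly when the function is invoked once per example.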
Edited my comment above. Looks like NLTK contributors fixed Moses in this merged pull request: nltk/nltk#1553
I think including the NLTK version of the Moses tokenizer is a good idea, though it shouldn’t be difficult to use the existing API to call it for now.
+1. Slightly unrelated, but I wonder if it's worth including the other NLTK tokenizers (e.g. punkt or PTB) or having some sort of public-facing API for using the spaCy tokenizers in other languages. This seems like a slippery slope API design-wise, though...
Yeah, I think I’d rather show in the docs how easy it is to call them yourself. But lots of people (including me until a few minutes ago!) don’t know that NLTK now has a Moses-compatible tokenizer, so it might be worth including that one so more people know they can move away from the Perl script preprocessing approach.
PR at #58