Comments (3)
I just looked at the data and you can use the following script to process any pair using https://github.com/google/sentencepiece:
#! /usr/bin/env bash
# Dependencies:
# - https://github.com/google/sentencepiece
CORPUS_DIR=$(pwd)
SOURCE_LANG="en"
TARGET_LANG="de"
VOCAB_SIZE=32000
# Learn a joint BPE model across both corpora
spm_train \
  --input="${CORPUS_DIR}/corpus.tc.${SOURCE_LANG},${CORPUS_DIR}/corpus.tc.${TARGET_LANG}" \
  --model_prefix="${CORPUS_DIR}/bpe" \
  --vocab_size="${VOCAB_SIZE}" \
  --model_type=bpe
# Apply BPE to the training corpus
spm_encode --model="${CORPUS_DIR}/bpe.model" --output_format=piece \
  < "${CORPUS_DIR}/corpus.tc.${SOURCE_LANG}" \
  > "${CORPUS_DIR}/corpus.tc.bpe.${SOURCE_LANG}"
spm_encode --model="${CORPUS_DIR}/bpe.model" --output_format=piece \
  < "${CORPUS_DIR}/corpus.tc.${TARGET_LANG}" \
  > "${CORPUS_DIR}/corpus.tc.bpe.${TARGET_LANG}"
# Apply BPE to all dev data
for lang in "${SOURCE_LANG}" "${TARGET_LANG}"; do
  # Match only files ending in .tc.<lang>; paths with spaces are handled safely
  find "${CORPUS_DIR}/dev" -type f -name "*.tc.${lang}" -print0 |
    while IFS= read -r -d '' infile; do
      echo "$infile"
      outfile="${infile%.*}.bpe.${lang}"
      spm_encode --model="${CORPUS_DIR}/bpe.model" --output_format=piece < "$infile" > "$outfile"
      echo "$outfile"
    done
done
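The output names in the dev loop come from bash suffix stripping: `${infile%.*}` drops the trailing language extension, and `.bpe.<lang>` is appended afterwards. Here is a minimal sketch of just that step. The helper name `bpe_outfile` and the file name `newstest2016` are hypothetical, used only for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the script above): derive the BPE output
# path for a truecased input file by stripping the final ".<lang>" suffix
# and re-appending it after ".bpe".
bpe_outfile() {
  local infile="$1" lang="$2"
  printf '%s\n' "${infile%.*}.bpe.${lang}"
}

# Illustrative file names, assumed rather than taken from the real dev set.
bpe_outfile "dev/newstest2016.tc.en" "en"   # prints dev/newstest2016.tc.bpe.en
bpe_outfile "dev/newstest2016.tc.de" "de"   # prints dev/newstest2016.tc.bpe.de
```

Because the encoded files end in `.tc.bpe.<lang>` rather than `.tc.<lang>`, a rerun that selects inputs by the `.tc.<lang>` ending should not pick them up again.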
from seq2seq.
The WMT17 organisers have already published a preprocessed version of the data: link to data. The scripts used for preprocessing are included.
However, they have tried to keep the preprocessing fairly 'light touch' (only the standard Moses preprocessing).
It could be a good unified starting point for us here.
from seq2seq.
This is great, yes. It will make preparing the datasets a lot easier.
from seq2seq.