Comments (2)
Hi @roudimit,
Thank you for raising this issue and so sorry for the late reply!
I fixed this by adding ...
Does this happen to the other monolingual models as well?
Did you use the language tag in the reference for evaluation in the multilingual setting?
Yes, and later on I found out it is NOT the common practice. So, we are going to update the results in a newer version of the MuAViC paper.
Is there a way to force output text in a certain language?
I never tried this before. However, theoretically we can use bos_index to do so, with the multilingual dictionary (dict.x.txt) telling us the index of the bos symbol. For example, if <ar> is the first word in the dictionary, then its index is going to be 4, since the first four indices are <s>, <pad>, </s>, and <unk>, in that order.
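As a rough illustration of the index arithmetic above, here is a minimal sketch that looks up a language tag's index without loading the model. It assumes the dictionary file has one "symbol count" entry per line and that the four special symbols occupy indices 0 to 3; the helper name symbol_index and the sample entries are hypothetical, not part of the MuAViC code.

```python
# Special symbols assumed to occupy indices 0..3, in this order.
SPECIAL_SYMBOLS = ["<s>", "<pad>", "</s>", "<unk>"]


def symbol_index(dict_lines, symbol):
    """Return the index of `symbol`, assuming special symbols come first.

    `dict_lines` is an iterable of "symbol count" lines, e.g. the lines
    of a dict.x.txt file (format assumed, not verified against MuAViC).
    """
    vocab = list(SPECIAL_SYMBOLS)
    for line in dict_lines:
        line = line.strip()
        if not line:
            continue
        # Drop the trailing count and keep only the symbol.
        vocab.append(line.rsplit(" ", 1)[0])
    return vocab.index(symbol)


# Hypothetical dictionary contents, for illustration only.
example_dict = ["<ar> 1", "<de> 1", "▁the 1000"]
print(symbol_index(example_dict, "<ar>"))  # 4
```

With a real dictionary, the lines could be read with open(path, encoding="utf-8") and passed in directly; the resulting index is what would be supplied as the bos index.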
Should I add the language tag to the beginning of all sentences?
That's how I did it in the paper. However, looking back I don't think it was necessary.
How do you balance samples from different languages?
In the paper, we used random sampling, which doesn't balance samples from different languages. However, you can balance the dataset sampling by following these steps:
- First, create different TSV files, one for every language.
- Then, set the dataset_train_subset to these files as a comma-separated list, e.g. dataset.train_subset=train_en,train_ar,train_el, etc.
- Then, change the load_dataset to be similar to this.
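The linked load_dataset change is not reproduced here. As a separate, hedged sketch of what balancing across per-language subsets can look like, the snippet below uses temperature-based sampling, a common technique that is not necessarily what the author's code does. The function names and the example sizes are hypothetical.

```python
import random


def language_sampling_probs(sizes, temperature=1.0):
    """Map each language to a sampling probability.

    temperature=1.0 keeps probabilities proportional to dataset size;
    larger values flatten them toward a uniform distribution.
    """
    total = sum(sizes.values())
    scaled = {
        lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()
    }
    norm = sum(scaled.values())
    return {lang: w / norm for lang, w in scaled.items()}


def sample_languages(sizes, num_samples, temperature=1.0, seed=0):
    """Draw the language of each training sample according to the probabilities."""
    probs = language_sampling_probs(sizes, temperature)
    rng = random.Random(seed)
    langs = list(probs)
    weights = [probs[lang] for lang in langs]
    return rng.choices(langs, weights=weights, k=num_samples)


# Hypothetical subset sizes (number of utterances per language).
sizes = {"train_en": 900, "train_ar": 100}
print(language_sampling_probs(sizes, temperature=1.0))
print(language_sampling_probs(sizes, temperature=5.0))
```

Once a language is drawn for a sample, an example would be picked from that language's TSV subset; how this hooks into the actual training task depends on the dataset class in use.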
from muavic.
Hi @Anwarvic, thanks for the clarifications! I'll keep this issue open for now, since the multilingual model's WERs are affected by the language tag at the beginning and you are planning to update the results.