tsurara's Introduction

tsurara

pip install -r requirements.txt
python -m unidic download (downloads a 500MB dictionary file)
python -m tsurara review -i subtitle_file.srt -o output_file.csv

This will:

parse the subtitle file
tokenize the Japanese text into words
filter out words that you probably don't care about (particles, proper nouns, suffixes, etc.)
sort the words in descending order of frequency of appearance in an anime dataset
drop you into the main loop that iterates over words

While in the loop, tsurara incrementally saves data to a file called .tsurara.json in your home directory (which you don't need to touch) and your specified output csv. Even if you exit at any point, your progress until that point is saved.

In the main loop, you have the following options:

known: mark a word as known forever. you will never see it again
reveal meaning: show meaning of the word and redisplay the menu. if it revealed the meaning by default it would be hard to tell if you knew the word
add: add the word to your output file (there are a few submenus to choose the reading and definition). this also marks the word as known
skip: do nothing. go to next word
ignore: mark a word as ignored forever. this is does more or less the same thing as known but is separate because I'm going to add an option to clear ignored words later
quit: exit immediately. your progress excluding the current word is saved.

The first letter of each option corresponds to its keyboard shortcut.

The advantage of this tool is that as you process more files, your known words list will grow. Since words are sorted by descending frequency, loading a new file will show you the most common words you don't know. So even if you don't go through all the words in a file you're spending your time optimally.

Recommend Projects

sidmani / tsurara Goto Github PK

tsurara's Introduction

tsurara

tsurara's People

Contributors

Stargazers

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent