Comments (5)

p16i commented on June 1, 2024

@cnlinxi sorry for the inconvenience. I thought using floydhub would be sustainable, but it turned out to be very costly in the long run. So I've decided to cancel my subscription, which means the datasets hosted there are gone.

I'll get back to you regarding the corpus. Would you mind sharing a bit about what you plan to do with the code?

from attacut.

cnlinxi commented on June 1, 2024

@heytitle Sorry for the late reply. I hope to use this model to segment Thai words, and to improve it. My goal is to provide a good Thai text regularization method.

p16i commented on June 1, 2024

@cnlinxi sorry again for the delayed response. You can find the data at https://codeforthailand.s3-ap-southeast-1.amazonaws.com/attacut-related/data.zip

Please unzip the archive and make sure its root directory is at ./data. The archive contains:
[image: listing of the archive contents]

Only the first two are relevant for training; sampling-0 means the whole dataset, while sampling-10 means only 10 files are used. You can use sampling-10 for quick training.
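
The unpacking step can be scripted. The following is only a minimal sketch: the helper name `unpack_dataset` is hypothetical, and it assumes the archive already has a top-level data/ folder, which the thread does not confirm.

```python
import zipfile
from pathlib import Path


def unpack_dataset(zip_path, project_root="."):
    """Extract the archive and return the dataset folders found under ./data.

    Assumption: the archive has a top-level ``data/`` folder. If it does not,
    move the extracted folders so that the dataset root ends up at ./data.
    """
    root = Path(project_root)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(root)

    data_dir = root / "data"
    if not data_dir.is_dir():
        raise FileNotFoundError(
            f"expected {data_dir} after extraction; move the extracted "
            "folders so that the dataset root is ./data"
        )
    # Each subdirectory is one dataset variant (for example a sampling-0 folder).
    return sorted(p.name for p in data_dir.iterdir() if p.is_dir())
```

Calling `unpack_dataset("data.zip")` from the repository root returns the folder names, which makes it easy to see which sampling variants are present.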

Before running the training command below, make sure that the ./artifacts directory exists.

python ./scripts/train.py --model-name seq_sy_ch_conv_concat \
 --model-params "embc:8|embs:8|conv:8|l1:6|do:0.1" \
 --data-dir ./data/best-syllable-crf-and-character-seq-feature-sampling-0  \
 --output-dir ./artifacts/model-xx  \
 --epoch 2 \
 --batch-size 1024 \
 --lr 0.001 \
 --lr-schedule "step:5|gamma:0.5"
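
If you prefer to launch training from Python, the pre-flight check and the command above can be wrapped as below. This is only a sketch: `build_train_command` is a hypothetical helper, not part of attacut, and the flag values are copied from the command above.

```python
from pathlib import Path


def build_train_command(data_dir, output_dir, epochs=2):
    """Check the dataset path, create the output parent, return the argv list."""
    data_dir, output_dir = Path(data_dir), Path(output_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"dataset directory not found: {data_dir}")
    # The thread says ./artifacts must exist before training, so create the
    # parent of the output directory up front.
    output_dir.parent.mkdir(parents=True, exist_ok=True)
    return [
        "python", "./scripts/train.py",
        "--model-name", "seq_sy_ch_conv_concat",
        "--model-params", "embc:8|embs:8|conv:8|l1:6|do:0.1",
        "--data-dir", str(data_dir),
        "--output-dir", str(output_dir),
        "--epoch", str(epochs),
        "--batch-size", "1024",
        "--lr", "0.001",
        "--lr-schedule", "step:5|gamma:0.5",
    ]
```

Pass the returned list to `subprocess.run(..., check=True)` from the repository root. For a quick run, point `data_dir` at the sampling-10 folder instead; its exact name is not given in this thread, so check what the archive contains.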

cnlinxi commented on June 1, 2024

@heytitle Thank you very much. I have trained this model on BEST 2010. Great work :)

charlesfufu commented on June 1, 2024

https://codeforthailand.s3-ap-southeast-1.amazonaws.com/attacut-related/data.zip

Are words split by "~" in the "best-syllable-tokenized" dataset?
