ocr-post-correction's People

Contributors

shrutirij


ocr-post-correction's Issues

Error installing dependencies

Hi,

I'm trying to run and test this repo, but I'm getting an error when I try to install the dependencies.

Ubuntu 22.04 LTS

```
pip install -r postcorr_requirements.txt

...

  INFO:root:/usr/bin/g++ -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -g -fwrapv -O2 build/temp.linux-x86_64-cpython-310/tmp/pip-install-0pfrr3uk/dynet_233344f25cf84b9098f2fbac1c2cd338/build/py3.10-64bit/python/_dynet.o -L/tmp/pip-install-0pfrr3uk/dynet_233344f25cf84b9098f2fbac1c2cd338/build/py3.10-64bit/dynet/ -L. -L/usr/lib/x86_64-linux-gnu -Wl,--enable-new-dtags,-R/tmp/pip-install-0pfrr3uk/dynet_233344f25cf84b9098f2fbac1c2cd338/build/py3.10-64bit/dynet/ -Wl,--enable-new-dtags,-R/usr/lib/ -ldynet -o /tmp/pip-install-0pfrr3uk/dynet_233344f25cf84b9098f2fbac1c2cd338/build/py3.10-64bit/python/_dynet.cpython-310-x86_64-linux-gnu.so -Wl,-rpath='/usr/lib/',--no-as-needed
  INFO:root:Copying built extensions...
  [100%] Built target pydynet
  INFO:root:Installing...
  Consolidate compiler generated dependencies of target dynet
  [ 98%] Built target dynet
  [ 98%] Built target pydynet_precopy
  [100%] Built target pydynet
  Install the project...
  -- Install configuration: "Release"
  CMake Error at dynet/cmake_install.cmake:46 (file):
    file cannot create directory: /usr/include/dynet.  Maybe need
    administrative privileges.
  Call Stack (most recent call first):
    cmake_install.cmake:47 (include)
  
  
  make: *** [Makefile:100: install] Erro 1
  error: /usr/bin/make install
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for dynet
Failed to build dynet
ERROR: Could not build wheels for dynet, which is required to install pyproject.toml-based projects

```

I noticed this in the output:

```
file cannot create directory: /usr/include/dynet.  Maybe need
administrative privileges.
```

So I tried `sudo pip install -r postcorr_requirements.txt`

and got this output:

```
Collecting dynet
  Using cached dyNET-2.1.2.tar.gz (509 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting editdistance
  Using cached editdistance-0.6.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (282 kB)
Collecting Levenshtein
  Using cached Levenshtein-0.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (169 kB)
Collecting cython
  Using cached Cython-3.0.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Requirement already satisfied: numpy in /usr/lib/python3/dist-packages (from dynet->-r postcorr_requirements.txt (line 1)) (1.21.5)
Collecting rapidfuzz<4.0.0,>=3.1.0
  Using cached rapidfuzz-3.5.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
Building wheels for collected packages: dynet
  Building wheel for dynet (pyproject.toml) ... done
  Created wheel for dynet: filename=dyNET-2.1.2-cp310-cp310-linux_x86_64.whl size=3542372 sha256=af2a6936a12d7d17d77059ff66ba6353ab8d7ac543664134e29349e04c741b28
  Stored in directory: /root/.cache/pip/wheels/2d/39/d7/01b76ca1370da9de9825b7051a8fd9aff320b254e2bba7ccce
Successfully built dynet
Installing collected packages: rapidfuzz, editdistance, cython, Levenshtein, dynet
Successfully installed Levenshtein-0.23.0 cython-3.0.6 dynet-2.1.2 editdistance-0.6.2 rapidfuzz-3.5.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
```

Is there a better way to install?
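
The warning at the end of that output already suggests the likely fix: install into a virtual environment, so the DyNet build never tries to write to `/usr/include`. A minimal sketch using only the standard library (the `ocr-env` name is just a placeholder):

```python
# Sketch: create an isolated environment and install the requirements into it,
# instead of running pip as root. "ocr-env" is a placeholder directory name.
import subprocess
import venv

venv.create("ocr-env", with_pip=True)  # stdlib virtual-environment builder
subprocess.check_call(
    ["ocr-env/bin/pip", "install", "-r", "postcorr_requirements.txt"]
)
```

The shell equivalent is `python3 -m venv ocr-env` followed by `ocr-env/bin/pip install -r postcorr_requirements.txt`.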

Great paper

Need help with anything? Code refactoring, etc.?

Help with installing Dynet - GPU

Hi,
I am facing some issues while installing DyNet-GPU for CUDA 11.1. Could you let me know whether you used the CPU or the GPU version of DyNet? If GPU, could you share the versions of DyNet and Eigen that you used?

System specifications:
CUDA version: 11.1

I have tried the following versions (summarized in the table below).

Build command:
To avoid `Unsupported GPU architecture 'compute_30'` at build time, I used the command below:

```
cmake .. -DEIGEN3_INCLUDE_DIR=../eigen -DPYTHON=`which python` -DBACKEND=cuda -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-11.1
```

| DyNet | Eigen | Error |
| --- | --- | --- |
| latest (master branch) | 3.2 | `CX11` folder is not present |
| latest (master branch) | 3.3 | identifier `std::round` is undefined in device code |
| latest (master branch) | 3.3.7 | identifier `std::round` is undefined in device code |
| latest (master branch) | 3.4 | error while running `make` |
| 2.0.3 | eigen-2355b22 | Unsupported GPU architecture for `compute_30` (while running `make`) |
| 2.1 | eigen-b2e267dc99d4.zip | Unsupported GPU architecture for `compute_30` (while running `make`) |
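
From what I can tell, support for `compute_30` (Kepler) was removed in CUDA 11, which would explain the last two rows above. A quick check of what a given toolkit supports (the nvcc path is from my machine, and the `--list-gpu-arch` flag may not exist on older CUDA versions):

```python
# Sketch: ask the CUDA 11.1 toolkit which virtual architectures it supports;
# compute_30 should be absent, matching the build errors in the table above.
import subprocess

result = subprocess.run(
    ["/usr/local/cuda-11.1/bin/nvcc", "--list-gpu-arch"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # expected: compute_35 and newer only
```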

Dynet dynamic memory allocation

I installed DyNet with GPU support as described in the docs, and `--dynet-mem` is set in the `train_single-source.sh` file. Even so, I get the error below; the full traceback follows.

```
[dynet] Device Number: 2
[dynet] Device name: GeForce GTX 1080 Ti
[dynet] Memory Clock Rate (KHz): 5505000
[dynet] Memory Bus Width (bits): 352
[dynet] Peak Memory Bandwidth (GB/s): 484.44
[dynet] Memory Free (GB): 11.5464/11.7215
[dynet] Device(s) selected: 2
[dynet] random seed: 2652333402
[dynet] using autobatching
[dynet] allocating memory: 6000MB
[dynet] memory allocation done.
Param, load_model: None
Traceback (most recent call last):
  File "/mnt/data/souvik/sanskrit/ocr-post-correction/postcorrection/multisource_wrapper.py", line 65, in <module>
    pretrainer = PretrainHandler(
  File "/mnt/data/souvik/sanskrit/ocr-post-correction/postcorrection/pretrain_handler.py", line 81, in __init__
    self.pretrain_model(pretrain_src1, pretrain_src2, pretrain_tgt, epochs)
  File "/mnt/data/souvik/sanskrit/ocr-post-correction/postcorrection/pretrain_handler.py", line 88, in pretrain_model
    self.seq2seq_trainer.train(
  File "/mnt/data/souvik/sanskrit/ocr-post-correction/postcorrection/seq2seq_trainer.py", line 55, in train
    batch_loss.backward()
  File "_dynet.pyx", line 823, in _dynet.Expression.backward
  File "_dynet.pyx", line 842, in _dynet.Expression.backward
ValueError: Dynet does not support both dynamic increasing of memory pool size, and automatic batching or memory checkpointing. If you want to use automatic batching or checkpointing, please pre-allocate enough memory using the --dynet-mem command line option (details http://dynet.readthedocs.io/en/latest/commandline.html).
```
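
The error message itself points at the workaround: with autobatching enabled, the memory pools cannot grow, so the `--dynet-mem` pre-allocation has to be large enough up front, and raising the value set in `train_single-source.sh` is the direct fix. For completeness, a sketch of the equivalent in-Python configuration, assuming I am reading the `dynet_config` API correctly (the 10000MB figure is illustrative):

```python
# Sketch: configure DyNet before the first `import dynet`; this mirrors the
# --dynet-mem and --dynet-autobatch command-line flags. The memory budget (MB)
# is illustrative and must be big enough that the pools never need to grow.
import dynet_config
dynet_config.set(mem=10000, autobatch=True)

import dynet as dy  # picks up the configuration above
```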

Distributed training and auto-batching with this codebase

Hi @shrutirij, thanks for the great work and the very well-documented, clean codebase; I greatly appreciate it!

I adapted and tried running this codebase for some conceptually similar experiments, and I've observed a few quirks that I wanted to run by you to get your thoughts, since I haven't really used DyNet before.

  1. I ran the codebase with --dynet-gpus 8 (after also modifying opts.py to support this arg) and found that although 8 processes are spawned and attached to 8 GPUs, only the first process has > 0% GPU utilization. It appears that this codebase doesn't support distributed training in its current form. Is that accurate? Is there an equivalent to PyTorch's DistributedDataParallel and DistributedSampler that I can use to perform data-parallel training and inference with Dynet? It would greatly speed up my experiments.
  2. It appears that training time on a CPU is the same as on a GPU (using 1 GPU via the --dynet-gpu flag). Did you notice this too during your runs? If not, could you suggest how to get this running faster on a GPU?
  3. It appears that the Dynet auto-batching feature isn't working, because I tried running the code with and without the --dynet-autobatch 1 flag and the run-time doesn't seem to change. I see the main training loop looks like the following (where minibatch_size is always set to 1 here):
```python
for i in range(0, len(train_data), minibatch_size):
    cur_size = min(minibatch_size, len(train_data) - i)
    losses = []
    dy.renew_cg()
    for (src1, src2, tgt) in train_data[i : i + cur_size]:
        losses.append(self.model.get_loss(src1, src2, tgt))
    batch_loss = dy.esum(losses)
    batch_loss.backward()
    trainer.update()
    epoch_loss += batch_loss.scalar_value()
logging.info("Epoch loss: %0.4f" % (epoch_loss / len(train_data)))
```

Doesn't this mean that cur_size is always 1, so the inner for loop just iterates over a list of size 1 by default? If I override minibatch_size to, say, 32, how does DyNet ensure that one forward pass occurs per batch of 32 examples instead of 32 separate forward passes?
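
To make the question concrete, here is what I would expect to work, based on my possibly wrong reading of the autobatching docs (`get_loss`, `trainer`, and `train_data` are as in the snippet above):

```python
# Sketch: override minibatch_size and rely on autobatching. My understanding
# is that DyNet builds the graph lazily, so the 32 get_loss expressions are
# only executed when backward() forces evaluation, at which point
# --dynet-autobatch 1 merges identical operations into batched kernels.
minibatch_size = 32
for i in range(0, len(train_data), minibatch_size):
    dy.renew_cg()
    losses = [
        self.model.get_loss(src1, src2, tgt)
        for (src1, src2, tgt) in train_data[i : i + minibatch_size]
    ]
    batch_loss = dy.esum(losses)  # still a lazy expression at this point
    batch_loss.backward()         # triggers one batched forward + backward
    trainer.update()
```

Is that the intended usage, or does the batching have to happen at the input level?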

Thanks a lot for your time and thanks again for the great work toward protecting endangered languages!

Segmentation Fault: Pretraining Epoch 0

The training process is interrupted by a segmentation fault during the very first epoch of pretraining. The error encountered is as follows:

```
[dynet] random seed: 1678755796
[dynet] allocating memory: 512MB
[dynet] memory allocation done.
[dynet] random seed: 2822143777
[dynet] using autobatching
[dynet] allocating memory: 10000MB
[dynet] memory allocation done.
train_single-source.sh: line 65: 1310309 Segmentation fault (core dumped) python3 postcorrection/multisource_wrapper.py --dynet-mem $dynet_mem --dynet-autobatch 1 --pretrain_src1 $pretrain_src --pretrain_tgt $pretrain_tgt $params --single --vocab_folder $expt_folder/vocab --output_folder $expt_folder --model_name $pretrained_model_name --pretrain_only
```

I was able to narrow the problem down to `batch_loss.backward()` at line 80 of `lm_trainer.py`.
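
In case it helps with reproduction, a standard-library sketch for getting a traceback out of crashes like this (not part of the codebase; just generic debugging):

```python
# Debugging sketch: print a Python-level traceback when the process receives
# a fatal signal such as SIGSEGV, instead of dying silently. Enable this at
# the top of multisource_wrapper.py before training starts.
import faulthandler
faulthandler.enable()
```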

The following DyNet issue may hint at the underlying problem:
https://github.com/clab/dynet/issues/308

Has anyone encountered this problem before?

a vs "a" without a hat

In the example diagram of the training model for corrections, I noticed that "ka" is not corrected. How do you make corrections for the two variants of the IPA a?
