shrutirij / ocr-post-correction
Hi,
I'm trying to run and test this repo, but I'm getting an error when I try to install the dependencies.
Ubuntu 22.04 LTS
```
pip install -r postcorr_requirements.txt
...
INFO:root:/usr/bin/g++ -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -g -fwrapv -O2 build/temp.linux-x86_64-cpython-310/tmp/pip-install-0pfrr3uk/dynet_233344f25cf84b9098f2fbac1c2cd338/build/py3.10-64bit/python/_dynet.o -L/tmp/pip-install-0pfrr3uk/dynet_233344f25cf84b9098f2fbac1c2cd338/build/py3.10-64bit/dynet/ -L. -L/usr/lib/x86_64-linux-gnu -Wl,--enable-new-dtags,-R/tmp/pip-install-0pfrr3uk/dynet_233344f25cf84b9098f2fbac1c2cd338/build/py3.10-64bit/dynet/ -Wl,--enable-new-dtags,-R/usr/lib/ -ldynet -o /tmp/pip-install-0pfrr3uk/dynet_233344f25cf84b9098f2fbac1c2cd338/build/py3.10-64bit/python/_dynet.cpython-310-x86_64-linux-gnu.so -Wl,-rpath='/usr/lib/',--no-as-needed
INFO:root:Copying built extensions...
[100%] Built target pydynet
INFO:root:Installing...
Consolidate compiler generated dependencies of target dynet
[ 98%] Built target dynet
[ 98%] Built target pydynet_precopy
[100%] Built target pydynet
Install the project...
-- Install configuration: "Release"
CMake Error at dynet/cmake_install.cmake:46 (file):
  file cannot create directory: /usr/include/dynet. Maybe need
  administrative privileges.
Call Stack (most recent call first):
  cmake_install.cmake:47 (include)
make: *** [Makefile:100: install] Error 1
error: /usr/bin/make install
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for dynet
Failed to build dynet
ERROR: Could not build wheels for dynet, which is required to install pyproject.toml-based projects
```
I noticed this line in the output:

```
file cannot create directory: /usr/include/dynet. Maybe need
administrative privileges.
```

So I tried `sudo pip install -r postcorr_requirements.txt` and got this output:
```
Collecting dynet
  Using cached dyNET-2.1.2.tar.gz (509 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting editdistance
  Using cached editdistance-0.6.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (282 kB)
Collecting Levenshtein
  Using cached Levenshtein-0.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (169 kB)
Collecting cython
  Using cached Cython-3.0.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Requirement already satisfied: numpy in /usr/lib/python3/dist-packages (from dynet->-r postcorr_requirements.txt (line 1)) (1.21.5)
Collecting rapidfuzz<4.0.0,>=3.1.0
  Using cached rapidfuzz-3.5.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
Building wheels for collected packages: dynet
  Building wheel for dynet (pyproject.toml) ... done
  Created wheel for dynet: filename=dyNET-2.1.2-cp310-cp310-linux_x86_64.whl size=3542372 sha256=af2a6936a12d7d17d77059ff66ba6353ab8d7ac543664134e29349e04c741b28
  Stored in directory: /root/.cache/pip/wheels/2d/39/d7/01b76ca1370da9de9825b7051a8fd9aff320b254e2bba7ccce
Successfully built dynet
Installing collected packages: rapidfuzz, editdistance, cython, Levenshtein, dynet
Successfully installed Levenshtein-0.23.0 cython-3.0.6 dynet-2.1.2 editdistance-0.6.2 rapidfuzz-3.5.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
```
Is there a better way to install?
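One way to avoid both the `/usr/include/dynet` permission error and the `sudo pip` warning is to install into a virtual environment, as pip's own warning suggests. A minimal sketch (the environment path is illustrative, not part of the repo):

```shell
# Create an isolated environment so the dynet build installs under the
# venv prefix instead of trying to write to /usr/include (no sudo needed).
python3 -m venv ~/ocr-postcorr-env
source ~/ocr-postcorr-env/bin/activate
pip install --upgrade pip
pip install -r postcorr_requirements.txt
```

This is an installation sketch, not from the repo's docs; the key point is that inside a venv the build's install step targets a user-writable prefix.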
Need help with anything? Code refactoring, etc.?
Hi,
I am facing some issues while installing Dynet-GPU for CUDA 11.1. Could you tell me whether you used the CPU or GPU version of dynet? If GPU, which versions of dynet and eigen did you use?
System Specifications:
CUDA version: 11.1
I have tried the following versions.
Build Command:
To avoid the `Unsupported GPU architecture for compute_30` error at build time, I used the command below.
```
cmake .. -DEIGEN3_INCLUDE_DIR=../eigen -DPYTHON=`which python` -DBACKEND=cuda -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-11.1
```
Dynet | Eigen | Error
---|---|---
Latest (master branch) | Eigen 3.2 | `CX11` folder is not present
Latest (master branch) | Eigen 3.3 | identifier `std::round` is undefined in device code
Latest (master branch) | Eigen 3.3.7 | identifier `std::round` is undefined in device code
Latest (master branch) | Eigen 3.4 | Error while running `make`
Dynet 2.0.3 | Eigen-2355b22 | `Unsupported GPU architecture for compute_30` (while running `make`)
Dynet 2.1 | Eigen-b2e267dc99d4.zip | `Unsupported GPU architecture for compute_30` (while running `make`)
I have installed dynet with GPU compatibility as mentioned in the docs, and `--dynet-mem` is set in the train_single-source.sh file. Even then, I got this error. The full traceback follows:
```
[dynet] Device Number: 2
[dynet] Device name: GeForce GTX 1080 Ti
[dynet] Memory Clock Rate (KHz): 5505000
[dynet] Memory Bus Width (bits): 352
[dynet] Peak Memory Bandwidth (GB/s): 484.44
[dynet] Memory Free (GB): 11.5464/11.7215
[dynet] Device(s) selected: 2
[dynet] random seed: 2652333402
[dynet] using autobatching
[dynet] allocating memory: 6000MB
[dynet] memory allocation done.
Param, load_model: None
Traceback (most recent call last):
  File "/mnt/data/souvik/sanskrit/ocr-post-correction/postcorrection/multisource_wrapper.py", line 65, in <module>
    pretrainer = PretrainHandler(
  File "/mnt/data/souvik/sanskrit/ocr-post-correction/postcorrection/pretrain_handler.py", line 81, in __init__
    self.pretrain_model(pretrain_src1, pretrain_src2, pretrain_tgt, epochs)
  File "/mnt/data/souvik/sanskrit/ocr-post-correction/postcorrection/pretrain_handler.py", line 88, in pretrain_model
    self.seq2seq_trainer.train(
  File "/mnt/data/souvik/sanskrit/ocr-post-correction/postcorrection/seq2seq_trainer.py", line 55, in train
    batch_loss.backward()
  File "_dynet.pyx", line 823, in _dynet.Expression.backward
  File "_dynet.pyx", line 842, in _dynet.Expression.backward
ValueError: Dynet does not support both dynamic increasing of memory pool size, and automatic batching or memory checkpointing. If you want to use automatic batching or checkpointing, please pre-allocate enough memory using the --dynet-mem command line option (details http://dynet.readthedocs.io/en/latest/commandline.html).
```
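The error says the pre-allocated pool (6000MB here) was outgrown during autobatching, which DyNet forbids; the usual workaround is to raise `--dynet-mem` until no dynamic growth is needed. A hedged sketch of the invocation (the 10000MB value is illustrative, chosen because the device above reports ~11.5 GB free; the remaining arguments come from train_single-source.sh and are elided):

```shell
# Illustrative only: pre-allocate a larger DyNet pool so autobatching
# never triggers dynamic memory growth.
python3 postcorrection/multisource_wrapper.py \
    --dynet-mem 10000 \
    --dynet-autobatch 1 \
    ...   # remaining script arguments unchanged
```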
Hi @shrutirij thanks for the great work and very well-documented and clean codebase, I greatly appreciate it!
I adapted and tried running this codebase for some conceptually similar experiments and I've observed a few quirks that I wanted to run by you to get your thoughts since I haven't really used Dynet before.
1. I passed `--dynet-gpus 8` (after also modifying `opts.py` to support this arg) and found that although 8 processes are spawned and attached to 8 GPUs, only the first process has > 0% GPU utilization. It appears that this codebase doesn't support distributed training in its current form. Is that accurate? Is there an equivalent to PyTorch's `DistributedDataParallel` and `DistributedSampler` that I can use to perform data-parallel training and inference with Dynet? It would greatly speed up my experiments.
2. … the `--dynet-gpu` flag. Is this what you noticed too during your runs? If not, could you suggest how I can get this to run faster with a GPU?
3. I tried the `--dynet-autobatch 1` flag and the run-time doesn't seem to change. I see the main training loop looks like the following (where `minibatch_size` is always set to 1 here):

```python
for i in range(0, len(train_data), minibatch_size):
    cur_size = min(minibatch_size, len(train_data) - i)
    losses = []
    dy.renew_cg()
    for (src1, src2, tgt) in train_data[i : i + cur_size]:
        losses.append(self.model.get_loss(src1, src2, tgt))
    batch_loss = dy.esum(losses)
    batch_loss.backward()
    trainer.update()
    epoch_loss += batch_loss.scalar_value()
logging.info("Epoch loss: %0.4f" % (epoch_loss / len(train_data)))
```
Doesn't this mean that `cur_size` is always 1, causing the inner `for` loop to just iterate over a list of size 1 by default? If I were to override `minibatch_size` to, say, 32, how does Dynet ensure that 1 forward operation occurs per batch of 32 examples instead of 32 separate forward passes?
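To make the question concrete, here is a plain-Python sketch (no DyNet required; `count_graphs_and_examples` is a hypothetical helper written for illustration) of how the loop above partitions the data. With `minibatch_size=1` every example gets its own computation graph; with 32, each graph holds 32 loss expressions, which is what gives `--dynet-autobatch 1` something to merge at execution time, as I understand DyNet's autobatching.

```python
def count_graphs_and_examples(n_examples, minibatch_size):
    """Mirror the training loop's structure: return the number of
    computation graphs (dy.renew_cg() calls) and the examples per graph."""
    graphs = 0
    sizes = []
    for i in range(0, n_examples, minibatch_size):
        cur_size = min(minibatch_size, n_examples - i)
        graphs += 1             # one dy.renew_cg() per outer iteration
        sizes.append(cur_size)  # inner loop builds cur_size loss exprs
    return graphs, sizes

# Default minibatch_size=1: one graph per example.
print(count_graphs_and_examples(100, 1)[0])   # 100
# minibatch_size=32: 4 graphs holding 32, 32, 32, and 4 examples.
print(count_graphs_and_examples(100, 32)[1])  # [32, 32, 32, 4]
```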
Thanks a lot for your time and thanks again for the great work toward protecting endangered languages!
The training process is interrupted by a segmentation fault during the very first epoch as part of the pretraining process. The error encountered is as follows:
[dynet] random seed: 1678755796
[dynet] allocating memory: 512MB
[dynet] memory allocation done.
[dynet] random seed: 2822143777
[dynet] using autobatching
[dynet] allocating memory: 10000MB
[dynet] memory allocation done.
train_single-source.sh: line 65: 1310309 Segmentation fault (core dumped) python3 postcorrection/multisource_wrapper.py --dynet-mem $dynet_mem --dynet-autobatch 1 --pretrain_src1 $pretrain_src --pretrain_tgt $pretrain_tgt $params --single --vocab_folder $expt_folder/vocab --output_folder $expt_folder --model_name $pretrained_model_name --pretrain_only
I was able to narrow down the problem to the following piece of code:
`batch_loss.backward()` at line 80 of lm_trainer.py.
The following issue may hint towards the potential problem:
https://github.com/clab/dynet/issues/308
Has anyone encountered this problem before?
In the example diagram with the training model for corrections, I noticed that `ka` is not corrected. How do you make corrections for the two variants of the IPA `a`?