vacancy / nscl-pytorch-release

PyTorch implementation for the Neuro-Symbolic Concept Learner (NS-CL).

Home Page: http://nscl.csail.mit.edu

License: MIT License

Python 100.00%

concept-learning neuro-symbolic-learning vqa

nscl-pytorch-release's Issues

Error coming from Jacinle

I've followed all the instructions in the README.md. However, when I run the command jac-crun <gpu_id> scripts/trainval.py --desc experiments/clevr/desc_nscl_derender.py --training-target derender --curriculum all --dataset clevr --data-dir <data_dir>/clevr/train --batch-size 32 --epoch 100 --validation-interval 5 --save-interval 5 --data-split 0.95, I get this error:

(nscl) crytting@hatch:~$ jac-crun GPU-0ff7a712-8a42-af6d-7f21-17147fda6a7c --desc experiments/clevr/desc_nscl_derender.py --training-target derender --curriculum all --dataset clevr --data-dir ./clevr/train --batch-size 32 --epoch 100 --validation-interval 5 --save-interval 5 --data-split 0.95
11 09:56:09 Loading jacinle config: /home/crytting/Jacinle/jacinle.yml.
11 09:56:09 Loading vendor: ReservoirSample-PyTorch.
11 09:56:09 Loading vendor: AdvancedIndexing-PyTorch.
11 09:56:09 Loading vendor: SynchronizedBatchNorm-PyTorch.
11 09:56:09 Loading vendor: PreciseRoIPooling-PyTorch.
11 09:56:09 Loading vendor: SceneGraphParser.
/home/crytting/Jacinle/bin/jac-run: line 10: exec: --: invalid option
exec: usage: exec [-cl] [-a name] [command [arguments ...]] [redirection ...]

The line 10 that it references is

exec "$@"

from the mentioned file /home/crytting/Jacinle/bin/jac-run. Any ideas on how to fix?
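
A note on the session above: the command that was actually run omits scripts/trainval.py, so the first argument after the GPU id is --desc. If jac-run forwards its arguments to exec "$@" unchanged, as line 10 suggests, bash would then try to read --desc as an option to exec, which would match the "invalid option" message. The sketch below is a hypothetical pre-flight check (the function names are made up for illustration and are not part of Jacinle) that flags a command whose first argument after the GPU id looks like a flag rather than a script path.

```python
import shlex
from typing import Optional


def first_argument_after_gpu_id(command: str) -> Optional[str]:
    """Return the first token after the launcher and the GPU id, if any."""
    tokens = shlex.split(command)
    # Assumption: tokens[0] is the launcher (jac-crun) and tokens[1] is the GPU id.
    rest = tokens[2:]
    return rest[0] if rest else None


def looks_like_missing_script(command: str) -> bool:
    """True when no script path follows the GPU id (nothing, or a flag)."""
    first = first_argument_after_gpu_id(command)
    return first is None or first.startswith("-")
```

Applied to the two commands in this issue, the check would accept the README form (where scripts/trainval.py follows the GPU id) and flag the pasted form (where --desc follows it directly).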

Broken link to the project page and object detections for clevr dataset

Hi,

The links to the project page and the object detection files are broken.

json file miss

The files clevr/train/scenes.json and clevr/val/scenes.json cannot be found.

Stuck at Building dataloader & Testing

When I try to train, it gets stuck at the 'Building data loader' stage.

In addition, testing stops at 34% of validation.

The code may be incomplete?

When I run your code, the following message does not appear:
[07 16:30:54 [email protected]:/data/vision/billf/scratch/jiayuanm/projects/NSCL-PyTorch/nscl/datasets/factory.py] Filtering out questions containing "how big" and "made of", #before = 699989, #after = 633615.

Replicating experiments on VQS

Hi!

Just wanted to make a request. Would it be possible for you to add instructions in the README.md for how one can replicate results on the VQS dataset?

Thanks!

The links of project and data is broken

The links to the project page and the object detection files are broken.

Run Time Error

Would it be possible to access the pretrained model weights for vqs?

_use_shared_memory - PyTorch 1.1

The following error appears with PyTorch 1.1:

File "Jacinle/jactorch/data/collate.py", line 149, in _stack
if torchdl._use_shared_memory:
AttributeError: module 'torch.utils.data.dataloader' has no attribute '_use_shared_memory'

The dataloader module in PyTorch 1.1 does not have an attribute '_use_shared_memory'.
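
One defensive way to handle a private attribute that exists in some versions of a module and not in others is to look it up with a default instead of accessing it directly. The sketch below is only an illustration of that pattern: the helper name is hypothetical, stand-in objects are used in place of the real module so the snippet runs without PyTorch, and falling back to False assumes that skipping the shared-memory path is acceptable.

```python
from types import SimpleNamespace


def uses_shared_memory(dataloader_module) -> bool:
    """Read the private flag if the module has it, otherwise assume False."""
    # getattr with a default avoids the AttributeError shown in the traceback.
    return bool(getattr(dataloader_module, "_use_shared_memory", False))


# Stand-ins for a module that has the flag and one that does not.
module_with_flag = SimpleNamespace(_use_shared_memory=True)
module_without_flag = SimpleNamespace()

flag_present = uses_shared_memory(module_with_flag)
flag_missing = uses_shared_memory(module_without_flag)
```

In real use the argument would be the module referenced as torchdl in the traceback. Whether this is the right fix for Jacinle is not established here.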

How is vocab.json being used?

Suppose the model predicts a synonym for metal, e.g. "shiny". How does the program executor map "shiny" to what it is really supposed to mean, i.e. metal?
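
One plausible way to handle the situation described in the question is to normalize every predicted word to a canonical concept before execution. The sketch below shows that idea only. The synonym table and function names are illustrative assumptions, and nothing here is confirmed to be how this repository or its vocabulary file works.

```python
from typing import Dict, List

# Illustrative table: canonical concept -> alternative surface forms.
SYNONYMS: Dict[str, List[str]] = {
    "metal": ["metallic", "shiny"],
    "rubber": ["matte"],
}


def build_canonical_lookup(synonyms: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert the table so every surface form points at its canonical concept."""
    lookup: Dict[str, str] = {}
    for canonical, variants in synonyms.items():
        lookup[canonical] = canonical
        for variant in variants:
            lookup[variant] = canonical
    return lookup


def canonicalize(word: str, lookup: Dict[str, str]) -> str:
    """Map a predicted word to its canonical concept; unknown words pass through."""
    return lookup.get(word, word)


LOOKUP = build_canonical_lookup(SYNONYMS)
concept = canonicalize("shiny", LOOKUP)
```

With this kind of lookup, the executor would only ever see canonical concept names, so "shiny" and "metallic" would both be treated as metal.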

Training stuck at Epoch 15

Hello,
I can't train the model, since the process gets killed after this message:

Building the data loader. Curriculum = 3/8, length = 32218.
Epoch 15 acc/qa=1.000000 loss=0.046158 loss/qa=0.046158 time/data=0.008719 time/step=1.016501: 100%|##############################| 1006/1006 [18:08<00:00, 1.08s/it]
Epoch 15 (validation) validation/acc/qa=1.000000: 2%|#4 | 20/1094 [00:41<11:04, 1.62it/s]/home/colors/Desktop/nscl/Jacinle/bin/jac-crun: line 6: 3305 Killed $JACROOT/bin/jac-run "$@"

Does not support MultiGPU training

Hi, an AssertionError

assert len(progs) == len(batch_features)

appears when

args.gpu_parallel=True

Have you encountered this error?

Semantic parser training code

Hi!

We are currently doing research on your VQA papers and would really like to have the source code for zero-annotation parser training, as it was one of the benefits of the NS-CL paper. The README currently says:

Note: This current release contains only training codes for the visual modules. That is, currently we still assume that a semantic parser is pre-trained using program annotations. In the full NS-CL, this pre-training is not required. We also plan to release the full training code soon.

Is there any possibility the code will be released soon?

CUDA Error

Hi,

I have been trying to run your code to train a model for the CLEVR dataset, but I'm running into an issue. The log output is below:

...
09 15:59:43 Building the model.
09 16:01:08 Writing meter logs to file: "dumps/clevr/desc_nscl_derender/derender-curriculum_all-qtrans_off/meta/run-2019-07-09-15-58-56.meter.json".
09 16:01:08 Building the data loader.
09 16:05:22 Building the data loader. Curriculum = 3/4, length = 1930.
  0%|                                                                                                                                  | 0/60 [00:00<?, ?it/s]Using /tmp/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/torch_extensions/_prroi_pooling/build.ninja...
Building extension module _prroi_pooling...
ninja: no work to do.
Loading extension module _prroi_pooling...
cudaCheckError() failed : CUDA driver version is insufficient for CUDA runtime version

This was the command that I ran:
jac-crun 1 scripts/trainval.py --desc experiments/clevr/desc_nscl_derender.py --training-target derender --curriculum all --dataset clevr --data-dir clevr/train --batch-size 32 --epoch 100 --validation-interval 5 --save-interval 5 --data-split 0.95

I checked that my driver version (410.78) is in fact compatible with the CUDA version (10.0). Moreover, I am able to run other code that relies on PyTorch and uses the GPU. Am I missing something here?

Would appreciate any help, thanks!

semantic parser training codes

This current release contains only training codes for the visual modules. That is, currently we still assume that a semantic parser is pre-trained using program annotations. In the full NS-CL, this pre-training is not required. We also plan to release the full training code soon.

Just a reminder, will there be any updates about the training codes of the semantic parser? Thanks!

Help using the code

Hello, any chance you could release one or two self-contained files with the trained Mask R-CNN model? I'd like to use it to perform some experiments.

Thanks in advance

Where can I find the images in questions.json for validation?

For validation, a questions.json file is provided. It references 15,000 unique images with filenames ranging from CLEVR_val_000000.png to CLEVR_val_014999.png. However, the original CLEVR dataset has two validation splits: split A has filenames going from CLEVR_ValA_000000.png to CLEVR_ValA_014999.png, whereas split B has filenames going from CLEVR_ValB_000000.png to CLEVR_ValB_014999.png. The images in each split are different, so which filenames is questions.json referring to?
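
A quick way to check which image folder a questions file matches is to list the filenames it references that are absent from a candidate directory. The sketch below assumes the CLEVR-style layout in which the JSON has a top-level "questions" list and each entry has an "image_filename" field. The function name is hypothetical.

```python
import json
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def missing_images(questions_path: PathLike, image_dir: PathLike) -> List[str]:
    """Return the sorted image filenames referenced by the questions file
    that do not exist in image_dir."""
    with open(questions_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    # Assumed schema: {"questions": [{"image_filename": "..."}, ...]}
    referenced = {q["image_filename"] for q in data["questions"]}
    image_dir = Path(image_dir)
    return sorted(name for name in referenced if not (image_dir / name).is_file())
```

Running it once per candidate image folder and comparing the number of missing files would show which folder, if any, the questions file lines up with.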

confusion on dataset names

Hello,
Can anyone explain the difference between the 'scenes-raw.json' and 'scenes.json' files? The scenes.json files available via the direct link have exactly the same data format as the scenes files in the CLEVR dataset.

Module for training semantic parser

Hi,
You mention that the code for training the full semantic parser will be released later. It would be helpful if instructions for training the semantic parser could be provided or the full code could be released. I am trying to replicate the experiments for the CLEVR and VQS datasets.

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [3, 256, 7, 7]], which is output 0 of CudnnConvolutionBackward, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

I get the following error while running the training code using the command provided in the README:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [3, 256, 7, 7]], which is output 0 of CudnnConvolutionBackward, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
Further research showed that I could debug this error using the PyTorch anomaly detector, which showed that the error occurred when calling a forward function.
Full traceback:
/pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:57: UserWarning: Traceback of forward call that caused the error:
File "scripts/trainval.py", line 401, in <module>
main()
File "scripts/trainval.py", line 173, in main
main_train(train_dataset, validation_dataset, extra_dataset)
File "scripts/trainval.py", line 291, in main_train
train_epoch(epoch, trainer, train_dataloader, meters)
File "scripts/trainval.py", line 346, in train_epoch
loss, monitors, output_dict, extra_info = trainer.step(feed_dict, cast_tensor=False)
File "/home/user/Jacinle/jactorch/train/env.py", line 135, in step
loss, monitors, output_dict = self._model(feed_dict)
File "/home/user/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "experiments/clevr/desc_nscl_derender.py", line 40, in forward
f_sng = self.scene_graph(f_scene, feed_dict.objects, feed_dict.objects_length)
File "/home/weichen/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/data2/PycharmProjects/NSCL-PyTorch-Release/nscl/nn/scene_graph/scene_graph.py", line 125, in forward
this_object_features[sub_id], this_object_features[obj_id],

Traceback (most recent call last):
File "scripts/trainval.py", line 401, in <module>
main()
File "scripts/trainval.py", line 173, in main
main_train(train_dataset, validation_dataset, extra_dataset)
File "scripts/trainval.py", line 291, in main_train
train_epoch(epoch, trainer, train_dataloader, meters)
File "scripts/trainval.py", line 346, in train_epoch
loss, monitors, output_dict, extra_info = trainer.step(feed_dict, cast_tensor=False)
File "/home/user/Jacinle/jactorch/train/env.py", line 155, in step
loss.backward()
File "/home/user/.local/lib/python3.7/site-packages/torch/tensor.py", line 118, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/user/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [3, 256, 7, 7]], which is output 0 of CudnnConvolutionBackward, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

The error occurred in epoch 6. The loss value seemed to be decreasing normally.
Epoch 6 acc/qa=0.687500 loss=0.708516 loss/qa=0.708516 time/data=0.542045 time/step=1.222588: 0%| | 1/469 [00:01<13:45, 1.76s/i
Epoch 6 acc/qa=0.562500 loss=0.713307 loss/qa=0.713307 time/data=0.009401 time/step=1.012083: 0%| | 1/469 [00:02<13:45, 1.76s/i
Epoch 6 acc/qa=0.562500 loss=0.713307 loss/qa=0.713307 time/data=0.009401 time/step=1.012083: 0%| | 2/469 [00:02<11:59, 1.54s/i
Epoch 6 acc/qa=0.812500 loss=0.566920 loss/qa=0.566920 time/data=0.011552 time/step=1.050238: 0%| | 2/469 [00:03<11:59, 1.54s/i
Epoch 6 acc/qa=0.812500 loss=0.566920 loss/qa=0.566920 time/data=0.011552 time/step=1.050238: 1%| | 3/469 [00:03<10:51, 1.40s/it]/

Any ideas?
Thanks

about concept quantization

Hi, in your paper, Pr[object i is Red] is given by a shifted and scaled sigmoid function, but there seems to be no sigmoid function in your code, as shown below:

NSCL-PyTorch-Release/nscl/nn/reasoning_v1/concept_embedding.py

Line 181 in ef493d5

 logits = ((query_mapped * reference).sum(dim=-1) - 1 + margin) / margin / self._tau 

If so, the range of the 'logits' variable could be a problem when it is added to the 'belong' vector. Could you explain more about this? Thanks!
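
To make the quoted line concrete: algebraically, (c - 1 + margin) / margin / tau equals (c - (1 - margin)) / (margin * tau), where c is the summed product (a cosine similarity if both vectors are normalized, which is an assumption here). So the line computes a shifted and scaled similarity, and a probability of the form described in the paper would follow if a sigmoid were applied to it afterwards. The sketch below only checks that arithmetic with plain Python floats. The margin and tau values are illustrative, not taken from the repository's configuration, and whether the repository applies a sigmoid downstream is not established here.

```python
import math


def concept_logit(similarity: float, margin: float, tau: float) -> float:
    """Same arithmetic as the quoted line, for a single scalar similarity."""
    return (similarity - 1 + margin) / margin / tau


def shifted_scaled(similarity: float, margin: float, tau: float) -> float:
    """Equivalent form: shift by (1 - margin), scale by (margin * tau)."""
    return (similarity - (1 - margin)) / (margin * tau)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


MARGIN, TAU = 0.85, 0.25  # illustrative values only

for c in (-1.0, 0.15, 0.5, 1.0):
    logit = concept_logit(c, MARGIN, TAU)
    # The logit is unbounded in general; the sigmoid maps it into (0, 1).
    print(f"similarity={c:+.2f} logit={logit:+.3f} prob={sigmoid(logit):.3f}")
```

Under these assumptions the logit is zero (probability 0.5) when the similarity equals 1 - margin, and it reaches 1 / tau when the similarity is 1.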