facebookresearch / embodiedqa
Train embodied agents that can answer questions in environments
License: Other
I am trying to generate shortest paths from environments without questions. But the build_graph() function takes up more than 64 GB memory and takes more than an hour to finish. Is this situation normal or is there something wrong with the code?
I'm trying to run this model as per the instructions, and it keeps hanging, usually during or after optimizer.step() but sometimes in other places as well. I've found that completely removing the multiprocessing and just running train() on its own gets rid of the problem (I'm using a p3.2xlarge AWS instance, so memory and processing power are not an issue).
I also found this page, which appears to describe a very similar issue with the data loader you use in your code, so I'm wondering whether this could be the root of the problem. I have downloaded, installed, deleted, and reinstalled all the repositories and data numerous times, so I'm fairly certain the issue is not on my end. Thanks!
During training with train_nav.py, pdb.set_trace() is called in data.py, specifically in this part. I can't work out why exactly the len call is failing. Do you have a fix for this?
These would be very nice to have so we don't have to retrain the models ourselves. Would it be possible for you to provide the same results as in the paper, but on the uploaded dataset?
Hi all,
I can't find code for pretraining the MultitaskCNN. Is it available somewhere?
Thanks!
if len(pos_queue) < 5:
    pos_queue = train_loader.dataset.episode_pos_queue[len(pos_queue) - 5:] + pos_queue
In train_eqa.py, when the VQA model's input has fewer than 5 frames, the code pads it with episode_pos_queue[len(pos_queue) - 5:]. Doesn't this leak positions from the ground-truth shortest path?
For example, if the agent is randomly spawned somewhere far from the target object, it can stop immediately and still receive the final 5 frames from the ground-truth pos_queue, which would lead to an artificially high accuracy.
How about replacing it with the following code?
pos_queue = [pos_queue[0].copy() for _ in range(5 - len(pos_queue))] + pos_queue
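The difference between the two padding strategies can be seen in a small sketch (the position entries here are hypothetical string stand-ins for the real pose records, so no .copy() is needed):

```python
# Ground-truth shortest-path positions for the episode (hypothetical values).
episode_pos_queue = ['p%d' % i for i in range(10)]
# The agent stopped after only two steps of its own.
pos_queue = ['a0', 'a1']

# Current code: pads with the *tail* of the ground-truth path,
# so the VQA model may see frames the agent never actually reached.
padded_gt = episode_pos_queue[len(pos_queue) - 5:] + pos_queue

# Proposed: repeat the agent's own first position instead.
padded_self = [pos_queue[0] for _ in range(5 - len(pos_queue))] + pos_queue

print(padded_gt)    # ['p7', 'p8', 'p9', 'a0', 'a1']
print(padded_self)  # ['a0', 'a0', 'a0', 'a0', 'a1']
```

Both pads reach the required length of 5, but only the second one is built entirely from frames the agent itself observed.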
I used the load_graph() method from House3DUtils to load a previously computed graph (the file path was valid), but self.graph was empty after the following lines:
Lines 272 to 273 in 9113156
Not sure if anyone else has had this issue; in any case, the following code worked for me:
import pickle
from dijkstar import Graph

g = pickle.load(open(path, 'rb'))
self.graph = Graph(g)
After I prepared all the data as described in README.md and modified the House3D/tests/config.json file, I ran train_nav.py, and it always gives me the error "TypeError: local_create_house() takes 2 positional arguments but 3 were given".
Specifically,
Process Process-2:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.5/multiprocessing/pool.py", line 47, in starmapstar
return list(itertools.starmap(args[0], args[1]))
TypeError: local_create_house() takes 2 positional arguments but 3 were given
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "train_nav.py", line 817, in train
train_loader = EqaDataLoader(**train_loader_kwargs)
File "/home/jiayi/EmbodiedQA/training/data.py", line 890, in __init__
max_actions=max_actions)
File "/home/jiayi/EmbodiedQA/training/data.py", line 224, in __init__
self._load_envs(start_idx=0, in_order=True)
File "/home/jiayi/EmbodiedQA/training/data.py", line 326, in _load_envs
self.all_houses = pool.starmap(local_create_house, _args)
File "/usr/lib/python3.5/multiprocessing/pool.py", line 268, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
TypeError: local_create_house() takes 2 positional arguments but 3 were given
Any help?
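The arity mismatch behind this error can be reproduced with a minimal starmap sketch (the 2-argument signature below is hypothetical; the real fix is to check what local_create_house actually accepts against the tuples built in data.py's _load_envs):

```python
from multiprocessing import Pool

def local_create_house(env, cfg):  # hypothetical 2-argument signature
    return (env, cfg)

if __name__ == "__main__":
    # Each tuple passed to starmap must match the function's arity exactly;
    # a 3-tuple like ("env0", "cfg0", "extra") would raise
    # "local_create_house() takes 2 positional arguments but 3 were given".
    _args = [("env0", "cfg0"), ("env1", "cfg1")]
    with Pool(2) as pool:
        print(pool.starmap(local_create_house, _args))
```

If House3D's local_create_house signature changed between versions, the tuples assembled in _load_envs would carry one extra element, which matches this traceback.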
Hi,
I'd like to know how to keep the d_0_10, d_0_30, and d_0_50 spawn distances fixed,
e.g. d_0_10 = 0.35, d_0_30 = 1.89, and d_0_50 = 3.54 every time.
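I don't know the repo's actual spawn-sampling code, but the generic way to pin these distances is to drive the sampler with a fixed seed. A sketch with hypothetical candidate distances (sample_spawn is not a repo function):

```python
import random

def sample_spawn(candidates, seed=42):
    # A dedicated, seeded RNG makes the chosen spawn (and hence the
    # resulting d_0_* distance) identical on every run.
    rng = random.Random(seed)
    return rng.choice(candidates)

candidates = [0.35, 1.12, 1.89, 2.70, 3.54]  # hypothetical distances
assert sample_spawn(candidates) == sample_spawn(candidates)  # reproducible
```

Using a local random.Random instance (rather than seeding the global RNG) avoids perturbing any other randomness in training.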
I downloaded the SUNCG v2.1 dataset and ran "python make_houses.py".
We only get 742 house.obj files, while eqa_v1.json reports a total of 770 environments.
This causes errors during our training and evaluation.
Error:
house objects not found! objFile=</EmbodiedQA/data/suncg/house/8675a21d3eb31d8c69e85a945ceeec00/house.obj>
I am sure the house path is correct. Is this problem related to the version of the SUNCG dataset?
From what I can tell, in each of train_vqa, train_nav, and train_eqa, a model with shared parameters is fed to a thread running eval() and at least one thread running train() (more if specified via command-line arguments).
However, the train() method runs substantially slower, so for a fixed number of epochs (say, the default of 1000), on my machine eval() reaches the epoch cap when train() is only around epoch 160. After that, the train() thread(s) keep running, but there is no eval() thread left to checkpoint them, so I gain nothing by letting them continue.
For the models presented in the paper, what was your training paradigm to address this? Do you use the last (highest-accuracy) checkpoint that eval() writes, regardless of how far along the train() threads are?
train_vqa.py is running into out-of-memory issues. I managed to get a few runs through, but now it consistently runs out of memory, and even removing the multiprocessing does not solve it. Any help?
I have enough memory to run train_nav.py and train_eqa.py so it's not a problem of resources. I'm using AWS p3.2xlarge.
While I greatly appreciate you releasing the code, it would be great if you could add a bit of documentation in the code.
I notice there is no torch.save(checkpoint, checkpoint_path) call in the train() function of train_vqa.py and train_eqa.py. What should I do about this?
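One workaround is to checkpoint from train() itself. A minimal sketch, assuming train() has access to the shared model and an epoch counter (save_checkpoint is a hypothetical helper, not part of the repo):

```python
import os
import torch

def save_checkpoint(model, epoch, checkpoint_dir, every=10):
    # Hypothetical helper: call from inside the train() loop so checkpoints
    # are written even when no eval() thread is running.
    if epoch % every != 0:
        return None
    path = os.path.join(checkpoint_dir, 'train_epoch_%d.pt' % epoch)
    torch.save({'epoch': epoch, 'state': model.state_dict()}, path)
    return path
```

Since the model parameters are shared across processes, saving from any one train() worker captures the current shared weights.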
The choice of start location can have a dramatic effect on the agent's accuracy. Do you pick a random start location, and if so, do you record it in the dataset so it can be replicated?
I'm running into an assertion error when running train_nav.py.
I'm able to train VQA models, but when I run train_nav.py (after setting -target_obj_conn_map_dir to the appropriate path on my system), the code starts training on the first epoch and reaches about 2% before failing an assertion in training/data.py.
The issued command matches the GitHub example:
python train_nav.py -to_log 1 -model_type pacman -identifier pacman
I suspect I have failed to download or move some specific file, but as far as I can tell everything checks out, so it could also be a bug from the changes you have been making recently.
Let me know what you think; I'd really appreciate any help so we can get the system running!
I have a Linux server with 93 GB of physical memory and 64 GB of swap, but I still get the error below when I set num_processes=8:
File "/home/anaconda3/lib/python3.6/multiprocessing/popen_fork.py", line 73, in _launch
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
Update: I found that the problem is that each data-loader process needs 65 GB of memory, and the data in the data loader is not shared between them; the implementation uses multiprocessing throughout. Since A3C needs multiple workers to perform well, is there some way to allow a larger num_processes by sharing the data-loader data between processes?
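One option, sketched here with a plain multiprocessing.Array (the real loader data is far more complex, so treat this only as the general idea): allocate the large read-only data once in shared memory, so child processes map the same pages instead of each holding its own 65 GB copy.

```python
import multiprocessing as mp

def worker(shared, idx, out):
    # Children read from the shared buffer; no per-process copy is made.
    out.put(shared[idx])

if __name__ == "__main__":
    # lock=False is fine for data that is read-only after initialization.
    shared = mp.Array('d', [0.1, 0.2, 0.3], lock=False)
    out = mp.Queue()
    procs = [mp.Process(target=worker, args=(shared, i, out)) for i in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(sorted(out.get() for _ in range(3)))
```

For PyTorch tensors specifically, torch.multiprocessing together with tensor.share_memory_() achieves the same effect.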
The x-axis is the epoch number. This is the log from testing on the validation set after every training epoch. My training log shows that the training process of train_vqa.py is not very stable. Is this similar to yours?
Hi, I encountered some problems during training with train_eqa.py. The mode is set to "train":
Traceback (most recent call last):
File "train_eqa_em.py", line 961, in <module>
train(0, args, shared_nav_model, shared_ans_model)
File "train_eqa_em.py", line 641, in train
action = planner_prob.multinomial().data
TypeError: multinomial() missing 1 required positional arguments: "num_samples"
The Python is 3.7 from Anaconda3, and the torch version is 0.4.1.post2.
Any suggestions on this error? Thanks!
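In newer PyTorch versions, Tensor.multinomial requires an explicit num_samples argument, so the call likely needs to become planner_prob.multinomial(1). A minimal sketch (planner_prob here is a stand-in tensor, not the repo's variable):

```python
import torch

planner_prob = torch.tensor([0.1, 0.2, 0.7])  # stand-in for the planner's output
# Old API: planner_prob.multinomial() -> TypeError on recent torch versions.
action = planner_prob.multinomial(num_samples=1)  # sample one action index
```

Since Variable and Tensor were merged in torch 0.4, the trailing .data in the original line may also be replaceable with .detach().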
The SUNCG data is unavailable now. Is there another way to get obj + mtl files for the houses in EQA?
Hi, I encountered some errors when training the navigator.
Sometimes it would show:
AssertionError: [Environment] House object not found!
But when I cd into that house directory, there is a house.obj file.
The error is raised by House3D/core.py, line 47, in local_create_house.
And sometimes it would show:
python: vendor/csv.h:442: char * io::LineReader::next_line():
Assertion 'data_begin < data_end' failed
What might be causing these errors?
Thanks