facebookresearch / embodiedqa
Train embodied agents that can answer questions in environments
License: Other
I am trying to generate shortest paths from environments without questions. But the build_graph() function takes up more than 64 GB memory and takes more than an hour to finish. Is this situation normal or is there something wrong with the code?
I'm trying to run this model as per the instructions, and it keeps hanging, usually during or after optimizer.step() but sometimes in other places as well. I've found that completely removing the multiprocessing and just running train() on its own gets rid of the problem (I'm using a p3.2xlarge AWS instance, so memory and processing power are not an issue).
I also found this page, which appears to describe a very similar issue with the data loader you use in your code, so I'm wondering whether this could be the root of the problem. I have downloaded, installed, deleted, and reinstalled all the repositories and data numerous times, so I'm fairly certain the issue is not on my end. Thanks!
During training with train_nav.py, pdb.set_trace() is called in data.py, specifically in this part. I can't work out why exactly the len call is failing. Do you have a fix for this?
These would be very nice to have so we don't have to retrain the models ourselves. Would it be possible for you to provide the same results as in the paper, but on the uploaded dataset?
Hi all,
I can't find code for pretraining the MultitaskCNN. Is it available somewhere?
Thanks!
if len(pos_queue) < 5:
    pos_queue = train_loader.dataset.episode_pos_queue[len(pos_queue) - 5:] + pos_queue
In train_eqa.py, when the VQA model's input has fewer than 5 frames, the code pads it with episode_pos_queue[len(pos_queue) - 5:]. Doesn't this leak positions from the ground-truth shortest path?
For example, if the agent is randomly spawned somewhere far from the target object, it can stop immediately and still receive the final 5 frames from the ground-truth pos_queue, which would lead to an artificially high accuracy.
How about replacing it with the following code?
pos_queue = [pos_queue[0].copy() for _ in range(5 - len(pos_queue))] + pos_queue
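The difference between the two padding strategies can be seen in a small sketch (the position entries here are hypothetical string stand-ins for the real pose records, so no .copy() is needed):

```python
# Ground-truth shortest-path positions for the episode (hypothetical values).
episode_pos_queue = ['p%d' % i for i in range(10)]
# The agent stopped after only two steps of its own.
pos_queue = ['a0', 'a1']

# Current code: pads with the *tail* of the ground-truth path,
# so the VQA model may see frames the agent never actually reached.
padded_gt = episode_pos_queue[len(pos_queue) - 5:] + pos_queue

# Proposed: repeat the agent's own first position instead.
padded_self = [pos_queue[0] for _ in range(5 - len(pos_queue))] + pos_queue

print(padded_gt)    # ['p7', 'p8', 'p9', 'a0', 'a1']
print(padded_self)  # ['a0', 'a0', 'a0', 'a0', 'a1']
```

Both pads reach the required length of 5, but only the second one is built entirely from frames the agent itself observed.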
I used the load_graph() method from House3DUtils to load a previously computed graph (the file path was valid), but self.graph was empty after the following lines:
Lines 272 to 273 in 9113156
Not sure if anyone else has had this issue; in any case, the following code worked for me:
import pickle
from dijkstar import Graph

g = pickle.load(open(path, 'rb'))
self.graph = Graph(g)
After I prepared all the data as described in README.md and modified the House3D/tests/config.json file, I ran train_nav.py, and it always gives me the error "TypeError: local_create_house() takes 2 positional arguments but 3 were given".
Specifically,
Process Process-2:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.5/multiprocessing/pool.py", line 47, in starmapstar
return list(itertools.starmap(args[0], args[1]))
TypeError: local_create_house() takes 2 positional arguments but 3 were given
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "train_nav.py", line 817, in train
train_loader = EqaDataLoader(**train_loader_kwargs)
File "/home/jiayi/EmbodiedQA/training/data.py", line 890, in __init__
max_actions=max_actions)
File "/home/jiayi/EmbodiedQA/training/data.py", line 224, in __init__
self._load_envs(start_idx=0, in_order=True)
File "/home/jiayi/EmbodiedQA/training/data.py", line 326, in _load_envs
self.all_houses = pool.starmap(local_create_house, _args)
File "/usr/lib/python3.5/multiprocessing/pool.py", line 268, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
TypeError: local_create_house() takes 2 positional arguments but 3 were given
Any help?
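The arity mismatch behind this error can be reproduced with a minimal starmap sketch (the 2-argument signature below is hypothetical; the real fix is to check what local_create_house actually accepts against the tuples built in data.py's _load_envs):

```python
from multiprocessing import Pool

def local_create_house(env, cfg):  # hypothetical 2-argument signature
    return (env, cfg)

if __name__ == "__main__":
    # Each tuple passed to starmap must match the function's arity exactly;
    # a 3-tuple like ("env0", "cfg0", "extra") would raise
    # "local_create_house() takes 2 positional arguments but 3 were given".
    _args = [("env0", "cfg0"), ("env1", "cfg1")]
    with Pool(2) as pool:
        print(pool.starmap(local_create_house, _args))
```

If House3D's local_create_house signature changed between versions, the tuples assembled in _load_envs would carry one extra element, which matches this traceback.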
Hi,
I'd like to know how to keep the d_0_10, d_0_30, and d_0_50 spawn distances fixed,
e.g. d_0_10 = 0.35, d_0_30 = 1.89, and d_0_50 = 3.54 every time.
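I don't know the repo's actual spawn-sampling code, but the generic way to pin these distances is to drive the sampler with a fixed seed. A sketch with hypothetical candidate distances (sample_spawn is not a repo function):

```python
import random

def sample_spawn(candidates, seed=42):
    # A dedicated, seeded RNG makes the chosen spawn (and hence the
    # resulting d_0_* distance) identical on every run.
    rng = random.Random(seed)
    return rng.choice(candidates)

candidates = [0.35, 1.12, 1.89, 2.70, 3.54]  # hypothetical distances
assert sample_spawn(candidates) == sample_spawn(candidates)  # reproducible
```

Using a local random.Random instance (rather than seeding the global RNG) avoids perturbing any other randomness in training.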
I downloaded the SUNCG v2.1 dataset and ran "python make_houses.py".
We only get 742 house.obj files, while eqa_v1.json reports a total of 770 environments.
This causes errors during our training and evaluation.
Error:
house objects not found! objFile=</EmbodiedQA/data/suncg/house/8675a21d3eb31d8c69e85a945ceeec00/house.obj>
I am sure the house path is correct. Is this problem related to the version of the SUNCG dataset?
From what I can tell, in each of train_vqa, train_nav, and train_eqa, a model with shared parameters is fed to a thread running eval() and at least one thread running train() (more if specified via command-line arguments).
However, the train() method runs substantially slower, so for a fixed number of epochs (say, the default of 1000), on my machine eval() reaches the epoch cap when train() is only around epoch 160. After that, the train() thread(s) keep running, but there is no eval() thread left to checkpoint them, so I gain nothing by letting them continue.
For the models presented in the paper, what was your training paradigm to address this? Do you use the last (highest-accuracy) checkpoint that eval() writes, regardless of how far along the train() threads are?
train_vqa.py is running into out-of-memory issues. I managed to get a few runs through, but now it consistently runs out of memory, and even removing the multiprocessing does not solve it. Any help?
I have enough memory to run train_nav.py and train_eqa.py so it's not a problem of resources. I'm using AWS p3.2xlarge.
While I greatly appreciate you releasing the code, it would be great if you could add a bit of documentation in the code.
I notice there is no torch.save(checkpoint, checkpoint_path) call in the train() function of train_vqa.py and train_eqa.py. What should I do about this?
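One workaround is to checkpoint from train() itself. A minimal sketch, assuming train() has access to the shared model and an epoch counter (save_checkpoint is a hypothetical helper, not part of the repo):

```python
import os
import torch

def save_checkpoint(model, epoch, checkpoint_dir, every=10):
    # Hypothetical helper: call from inside the train() loop so checkpoints
    # are written even when no eval() thread is running.
    if epoch % every != 0:
        return None
    path = os.path.join(checkpoint_dir, 'train_epoch_%d.pt' % epoch)
    torch.save({'epoch': epoch, 'state': model.state_dict()}, path)
    return path
```

Since the model parameters are shared across processes, saving from any one train() worker captures the current shared weights.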
The choice of start location can have a dramatic effect on the agent's accuracy. Do you pick a random start location, and if so, do you record it in the dataset so it can be replicated?
I'm running into an assertion error when running train_nav.py.
I'm able to train VQA models, but when I run train_nav.py (after setting -target_obj_conn_map_dir to the appropriate path on my system), the code starts training on the first epoch and reaches about 2% before failing an assertion in training/data.py.
The issued command matches the GitHub example:
python train_nav.py -to_log 1 -model_type pacman -identifier pacman
I suspect I have failed to download or move some specific file, but as far as I can tell everything checks out, so it could also be a bug from the changes you have been making recently.
Let me know what you think; I'd really appreciate any help so we can get the system running!
I have a Linux server with 93 GB of physical memory and 64 GB of swap, but I still get the error below when I set num_processes=8:
File "/home/anaconda3/lib/python3.6/multiprocessing/popen_fork.py", line 73, in _launch
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
Update: I found that the problem is that each data-loader process needs 65 GB of memory, and the data in the data loader is not shared between them; the implementation uses multiprocessing throughout. Since A3C needs multiple workers to perform well, is there some way to allow a larger num_processes by sharing the data-loader data between processes?
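One option, sketched here with a plain multiprocessing.Array (the real loader data is far more complex, so treat this only as the general idea): allocate the large read-only data once in shared memory, so child processes map the same pages instead of each holding its own 65 GB copy.

```python
import multiprocessing as mp

def worker(shared, idx, out):
    # Children read from the shared buffer; no per-process copy is made.
    out.put(shared[idx])

if __name__ == "__main__":
    # lock=False is fine for data that is read-only after initialization.
    shared = mp.Array('d', [0.1, 0.2, 0.3], lock=False)
    out = mp.Queue()
    procs = [mp.Process(target=worker, args=(shared, i, out)) for i in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(sorted(out.get() for _ in range(3)))
```

For PyTorch tensors specifically, torch.multiprocessing together with tensor.share_memory_() achieves the same effect.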
The x-axis is the epoch number. This is the log from testing on the validation set after every training epoch. My training log shows that the training process of train_vqa.py is not very stable. Is this similar to yours?
Hi, I encountered some problems during training with train_eqa.py. The mode is set to "train":
Traceback (most recent call last):
File "train_eqa_em.py", line 961, in <module>
train(0, args, shared_nav_model, shared_ans_model)
File "train_eqa_em.py", line 641, in train
action = planner_prob.multinomial().data
TypeError: multinomial() missing 1 required positional arguments: "num_samples"
The Python is 3.7 from Anaconda3, and the torch version is 0.4.1.post2.
Any suggestions on this error? Thanks!
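In newer PyTorch versions, Tensor.multinomial requires an explicit num_samples argument, so the call likely needs to become planner_prob.multinomial(1). A minimal sketch (planner_prob here is a stand-in tensor, not the repo's variable):

```python
import torch

planner_prob = torch.tensor([0.1, 0.2, 0.7])  # stand-in for the planner's output
# Old API: planner_prob.multinomial() -> TypeError on recent torch versions.
action = planner_prob.multinomial(num_samples=1)  # sample one action index
```

Since Variable and Tensor were merged in torch 0.4, the trailing .data in the original line may also be replaceable with .detach().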
The SUNCG data is unavailable now. Is there another way to get obj + mtl files for the houses in EQA?
Hi, I encountered some errors when training the navigator.
Sometimes it would show:
AssertionError: [Environment] House object not found!
But when I cd into that house directory, there is a house.obj file.
The error is raised by House3D/core.py, line 47, in local_create_house.
And sometimes it would show:
python: vendor/csv.h:442: char * io::LineReader::next_line():
Assertion 'data_begin < data_end' failed
What might be causing these errors?
Thanks