Comments (15)
Hi, thanks for pointing this out. The check pts error has been fixed.
For the NaN error, I think it might be an optimization issue. You could try a smaller learning rate.
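For example (a generic sketch, not this repo's actual train.py; the optimizer choice and the names here are placeholders), lowering the step size is just the lr argument of the optimizer:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(8, 8)                                  # stand-in for the DNC model
optimizer = optim.RMSprop(model.parameters(), lr=1e-4)   # e.g. 1e-4 instead of 1e-3
```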
The NaN error seems to be rooted in the ReLU of the controller output, which can produce an all-zero key; that results in NaN in the backward pass of the cosine distance. The latest version should be more stable.
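To illustrate the failure mode, here is a minimal sketch (not the repo's actual implementation) of content-addressing cosine similarity with an epsilon inside the square root, which keeps both the norms and their gradients finite even when the key is all zeros:

```python
import torch

def cosine_similarity(keys, memory, eps=1e-6):
    # keys:   (batch, n_keys,  word_size)
    # memory: (batch, n_cells, word_size)
    # Putting eps inside the sqrt keeps the norms and their gradients finite,
    # so an all-zero key no longer yields NaN in the backward pass.
    dot = torch.bmm(keys, memory.transpose(1, 2))                       # (batch, n_keys, n_cells)
    key_norm = torch.sqrt((keys * keys).sum(2, keepdim=True) + eps)     # (batch, n_keys, 1)
    mem_norm = torch.sqrt((memory * memory).sum(2, keepdim=True) + eps) # (batch, n_cells, 1)
    return dot / (key_norm * mem_norm.transpose(1, 2))

keys = torch.zeros(1, 1, 4, requires_grad=True)   # the failing all-zero-key case
memory = torch.randn(1, 5, 4)
cosine_similarity(keys, memory).sum().backward()  # finite gradients, no NaN
```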
Thanks a lot 👍 - will try it, and give you feedback :)
Hi, I keep playing around with hyper-parameters, but even with the latest version I still get a lot of runs with NaNs early on. When it works it's cool, but it's a bit tiring restarting again and again.
Iteration 0/100000    Avg. Logistic Loss: 0.6933
Iteration 50/100000   Avg. Logistic Loss: 0.5672
Iteration 100/100000  Avg. Logistic Loss: 0.3351
Iteration 150/100000  Avg. Logistic Loss: 0.2968
Iteration 200/100000  Avg. Logistic Loss: 0.2923
Iteration 250/100000  Avg. Logistic Loss: 0.2872
Iteration 300/100000  Avg. Logistic Loss: 0.2894
Iteration 350/100000  Avg. Logistic Loss: 0.2839
Iteration 400/100000  Avg. Logistic Loss: 0.2857
Iteration 450/100000  Avg. Logistic Loss: nan
Iteration 500/100000  Avg. Logistic Loss: nan
Iteration 550/100000  Avg. Logistic Loss: nan
Iteration 600/100000  Avg. Logistic Loss: nan
Iteration 650/100000  Avg. Logistic Loss: nan
Iteration 700/100000  Avg. Logistic Loss: nan
Iteration 750/100000  Avg. Logistic Loss: nan
Iteration 800/100000  Avg. Logistic Loss: nan
Iteration 850/100000  Avg. Logistic Loss: nan
Iteration 900/100000  Avg. Logistic Loss: nan
Iteration 950/100000  Avg. Logistic Loss: nan
Have you got any advice on ways to make the training stable (good hyper-parameter settings, weight initialisation)? I'll keep playing around with it though. Thanks, Aj
That's strange. I tested it several times and it looks quite stable to me; it converges to 0.01.
Did you update all the files? You could try a fresh git clone.
I will also take a closer look.
Fresh pull from Git, and it's working OK this run!
How many iterations do you need to reach 0.001? It usually converges to 0.01 for me, and then I get NaNs.
I'm going to try it on a better GPU; maybe it's just this machine. Which GPU are you using?
Glad to know that it is working :D.
Did you get the 0.01 and NaN from the older version? I could get 0.001 with a smaller network config (nhid and mem_size = 64 and a shorter sequence); it usually takes more than 15000 iterations. I am using a laptop GeForce 940M GPU.
Let me know if you still get the annoying NaNs.
Erm, I've got 0.01 from the latest version too, so it looks like it's my GPU. What do you get if you use a CPU?
For some reason, when I try it on my CPU it crashes and gives
AssertionError: leaf variable was used in an inplace operation
and this is with the latest pull and a fresh PyTorch. To be honest, I don't understand why I get this with the CPU and not when I run it with CUDA. Very strange.
I can run it on CPU, and it's much faster than the GPU version. = =
The problem I have is that running in CPU mode gradually consumes more and more memory; other people have also reported similar issues with PyTorch.
I am using the latest torch.
How did you manage to run it on CPU? I might try reinstalling PyTorch without CUDA.
Yes, I've seen this memory leak too when running multi-threaded A3C! I managed to fix it by getting rid of all logging and anything unnecessary stored in the training loop.
It's a PyTorch thing (not an algorithm thing), as I never get it in TensorFlow. There's some sort of memory leak, which you can plug with garbage collection, but even then it still grows for long runs.
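For reference, the most common cause of this kind of gradual growth in PyTorch training loops is keeping references to tensors that still carry the autograd graph (e.g. accumulating the loss for logging). A generic toy sketch; nothing here is this repo's code:

```python
import torch

w = torch.randn(16, requires_grad=True)  # stand-in for model parameters
losses = []
for step in range(1000):
    x = torch.randn(16)
    loss = ((w * x).sum() - 1.0) ** 2
    loss.backward()
    w.grad = None                  # clear the gradient for the next step
    # losses.append(loss)          # anti-pattern: retains the whole graph each step
    losses.append(loss.item())     # log only the detached Python float
```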
Have you tried MXNet?
Well here's my error message with the latest version of the code, and a fresh install of PyTorch,
Iteration 0/100000
Traceback (most recent call last):
  File "train.py", line 153, in <module>
    output, _ = ncomputer.forward(input_data)
  File "../../neucom/dnc.py", line 110, in forward
    interface['erase_vector']
  File "../../neucom/memory.py", line 389, in write
    allocation_weight = self.get_allocation_weight(sorted_usage, free_list)
  File "../../neucom/memory.py", line 144, in get_allocation_weight
    flat_unordered_allocation_weight.cpu()
  File "/home/ajay/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 636, in scatter_
    return Scatter(dim, True)(self, index, source)
RuntimeError: a leaf Variable that requires grad has been used in an in-place operation.
Have you got any ideas how to fix this?
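Not this repo's code, but a standalone sketch of the same class of error (the traceback shows get_allocation_weight hitting it via scatter_): autograd rejects an in-place operation on a leaf tensor that requires grad, while the out-of-place variant, or writing into a buffer that does not require grad, is allowed.

```python
import torch

leaf = torch.zeros(1, 5, requires_grad=True)   # leaf tensor that requires grad
index = torch.tensor([[0, 2, 4]])
src = torch.ones(1, 3)

try:
    leaf.scatter_(1, index, src)               # in-place write on the leaf
except RuntimeError as err:
    print(err)                                 # "... leaf Variable ... in-place operation"

out = leaf.scatter(1, index, src)              # out-of-place scatter: allowed
buf = torch.zeros(1, 5)                        # buffer without requires_grad
buf.scatter_(1, index, src)                    # in-place on a non-grad tensor: allowed
```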
To run on CPU, I just set cuda to False and it runs seamlessly. Could you try it with Python 2.7? That is the only cause I can think of.
I got it to run on the CPU, i.e. without the leaf-variable error, by reversing the commenting-out of the lines around 144 in neucom/memory.py (the ones with cpu() in them); that let it run under Python 3.6. But it did not train properly: it just converged to about 0.26, which is much worse than the GPU.
So it does work with PyTorch on both CPU and GPU, but it behaves quite differently from TensorFlow, even though the algorithm and code are very similar.
I will give it a go with an Anaconda 2.7 install and CPU-only PyTorch. I've spent so much time on this and really want to get it to work!
If you changed that line, it will have issues when you run it in GPU mode.
At the time I wrote it, that function would complain if its inputs were hosted on the GPU.
You can just use the GPU; CPU mode has some memory issues as well that still need to be fixed.
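Purely as an illustration (the function and argument names here are assumptions, not the repo's actual API): a device-agnostic way to scatter the sorted allocation weights back into the original cell order, so no hard-coded cpu() round-trip is needed in either mode:

```python
import torch

def unorder_allocation_weight(flat_weights, free_list):
    # flat_weights: (batch, n_cells) allocation weights in sorted-usage order
    # free_list:    (batch, n_cells) long indices mapping back to cell order
    # new_zeros puts the destination on the same device/dtype as the input,
    # and the out-of-place scatter sidesteps the in-place-on-a-leaf restriction.
    out = flat_weights.new_zeros(flat_weights.size())
    return out.scatter(1, free_list, flat_weights)
```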
OK, thanks a lot :)
I won't be able to do any testing today, but please let me know if and when you fix the CPU capability - that will be really, really COOL 👍
Related Issues (7)
- AssertionError: leaf variable was used in an inplace operation HOT 4
- Compare with DeepMind's implementation HOT 2
- backward through cumprod HOT 12
- Error running copy example HOT 6
- Error,,which pytorch version? HOT 1
- TypeError: norm received an invalid combination of arguments - got (int, bool), but expected one of: HOT 2