Comments (4)
From your description, I guess there are some problems with your environment. Please give me the detailed error information.
from attentionxml.
I retried the training of level 2, and then I get this error:
```
RuntimeError                              Traceback (most recent call last)
in
      2 start_time = time.time()
      3 model = FastAttentionXML(labels_num, data_cnf, model_cnf, '')
----> 4 model.train(train_x, train_y, valid_x, valid_y, mlb)
      5 finish_time = time.time()
      6 print('Training Finished')

/home/hmouzoun/patent_classification/AttentionXML/deepxml/tree.py in train(self, train_x, train_y, valid_x, valid_y, mlb)
    204         kwargs=self.model_cnf['cluster'])
    205     cluster_process.start()
--> 206     self.train_level(self.level - 1, train_x, train_y, valid_x, valid_y)
    207     cluster_process.join()
    208     cluster_process.close()

/home/hmouzoun/patent_classification/AttentionXML/deepxml/tree.py in train_level(self, level, train_x, train_y, valid_x, valid_y)
     87     return train_y, model.predict(train_loader, k=self.top), model.predict(valid_loader, k=self.top)
     88 else:
---> 89     train_group_y, train_group, valid_group = self.train_level(level - 1, train_x, train_y, valid_x, valid_y)
     90     torch.cuda.empty_cache()
     91

/home/hmouzoun/patent_classification/AttentionXML/deepxml/tree.py in train_level(self, level, train_x, train_y, valid_x, valid_y)
    147     F'Number of Labels: {labels_num}, '
    148     F'Candidates Number: {train_loader.dataset.candidates_num}')
--> 149     model.train(train_loader, valid_loader, **model_cnf['train'][level])
    150     model.optimizer = model.state = None
    151     logger.info(F'Finish Training Level-{level}')

/home/hmouzoun/patent_classification/AttentionXML/deepxml/models.py in train(self, *args, **kwargs)
    170
    171 def train(self, *args, **kwargs):
--> 172     super(XMLModel, self).train(*args, **kwargs)
    173     self.save_model_to_disk()
    174

/home/hmouzoun/patent_classification/AttentionXML/deepxml/models.py in train(self, train_loader, valid_loader, opt_params, nb_epoch, step, k, early, verbose, swa_warmup, **kwargs)
     68     #if type(train_x)!=list:
     69     #    train_x = train_x.cpu()
---> 70     loss = self.train_step(train_x, train_y.cuda())  # change train_x to train_x.cuda()
     71     if global_step % step == 0:
     72         self.swa_step()

/home/hmouzoun/patent_classification/AttentionXML/deepxml/models.py in train_step(self, train_x, train_y)
    156     scores = self.network(train_x, candidates=candidates, attn_weights=self.attn_weights)
    157     loss = self.loss_fn(scores, train_y)
--> 158     loss.backward()
    159     self.clip_gradient()
    160     self.optimizer.step(closure=None)

/usr/local/lib/python3.8/dist-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    219     retain_graph=retain_graph,
    220     create_graph=create_graph)
--> 221     torch.autograd.backward(self, gradient, retain_graph, create_graph)
    222
    223 def register_hook(self, hook):

/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
    128     retain_graph = create_graph
    129
--> 130     Variable.execution_engine.run_backward(
    131         tensors, grad_tensors, retain_graph, create_graph,
    132         allow_unreachable=True)  # allow_unreachable flag

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
```
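Because CUDA kernels run asynchronously, the op named in a traceback like this is not always the one that actually failed. One way to get a more precise traceback (a sketch; set the variable before the first `import torch`, e.g. at the top of the training script) is:

```python
import os

# Must be set before `import torch` is executed, or it has no effect:
# forces CUDA calls to run synchronously so the error is raised at the real failing op.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # then import torch and launch training as usual
```

With this set, training is slower, but the traceback points at the kernel that actually failed, which helps distinguish an OOM from a genuine cuDNN problem.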
I think it's an OOM problem, given that I'm training the algorithm on a dataset of 3,720,000 examples of 300 words on average, on only one GPU.
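For what it's worth, a genuine CUDA OOM usually raises a `RuntimeError` whose message contains "out of memory", whereas `CUDNN_STATUS_EXECUTION_FAILED` more often points at a driver or library problem (though memory pressure can sometimes surface this way too). A hypothetical helper, not part of AttentionXML, to triage the exception text:

```python
def classify_cuda_error(exc: RuntimeError) -> str:
    """Rough triage of common CUDA-related RuntimeError messages."""
    msg = str(exc).lower()
    if "out of memory" in msg:
        return "oom"    # e.g. "CUDA out of memory. Tried to allocate ..."
    if "cudnn" in msg:
        return "cudnn"  # e.g. "cuDNN error: CUDNN_STATUS_EXECUTION_FAILED"
    return "other"

print(classify_cuda_error(RuntimeError("cuDNN error: CUDNN_STATUS_EXECUTION_FAILED")))  # → cudnn
```

If lowering the batch size in the model config makes the error go away, OOM is the likely cause; if it fails regardless of batch size, suspect the driver/CUDA stack instead.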
from attentionxml.
I think it's a problem with your NVIDIA driver or CUDA library, not OOM.
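One way to check this is to dump the CUDA/cuDNN versions PyTorch was built against and compare them with what `nvidia-smi` reports for the driver. A minimal sketch (guarded so it also runs where torch is not installed):

```python
import importlib.util

def cuda_env_report() -> dict:
    """Collect PyTorch/CUDA/cuDNN version info, if torch is installed."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if report["torch_installed"]:
        import torch
        report["torch"] = torch.__version__
        report["built_cuda"] = torch.version.cuda            # CUDA toolkit torch was built with
        report["cudnn"] = torch.backends.cudnn.version()     # bundled cuDNN version
        report["cuda_available"] = torch.cuda.is_available() # driver usable at runtime
    return report

print(cuda_env_report())
```

If `built_cuda` is newer than what the installed driver supports (see the "CUDA Version" field in `nvidia-smi`), installing a torch wheel built for a matching CUDA version, or updating the driver, often resolves `CUDNN_STATUS_EXECUTION_FAILED`.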
from attentionxml.
Hi @HMM2021 did you resolve this issue?
from attentionxml.