
Comments (21)

ruochiz avatar ruochiz commented on June 16, 2024

Thank you for your interest in Higashi! When using it through the CLI mode, did it just hang like this (stuck at 0% without any error), or did it quit with an error message? If it's the former, could you attach the log from when you kill the process (Ctrl+C), so I can try to figure out which process is hanging? Thanks!

from higashi.

EddieLv avatar EddieLv commented on June 16, 2024

This is the error; do you need the complete log file as well?

    (Training) : 0%| | 0/1000 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "/home/biogenger/Biosoftwares/Higashi/higashi/main_cell.py", line 1362, in <module>
        train(higashi_model,
      File "/home/biogenger/Biosoftwares/Higashi/higashi/main_cell.py", line 653, in train
        bce_loss, mse_loss, train_accu, auc1, auc2, str1, str2, train_pool, train_p_list = train_epoch(
      File "/home/biogenger/Biosoftwares/Higashi/higashi/main_cell.py", line 124, in train_epoch
        for p in as_completed(train_p_list):
      File "/home/biogenger/miniconda3/envs/higashi/lib/python3.9/concurrent/futures/_base.py", line 245, in as_completed
        waiter.event.wait(wait_timeout)
      File "/home/biogenger/miniconda3/envs/higashi/lib/python3.9/threading.py", line 574, in wait
        signaled = self._cond.wait(timeout)
      File "/home/biogenger/miniconda3/envs/higashi/lib/python3.9/threading.py", line 312, in wait
        waiter.acquire()
    KeyboardInterrupt
    Exception ignored in: <module 'threading' from '/home/biogenger/miniconda3/envs/higashi/lib/python3.9/threading.py'>
    Traceback (most recent call last):
      File "/home/biogenger/miniconda3/envs/higashi/lib/python3.9/threading.py", line 1411, in _shutdown
        atexit_call()
      File "/home/biogenger/miniconda3/envs/higashi/lib/python3.9/concurrent/futures/process.py", line 95, in _python_exit
        t.join()
      File "/home/biogenger/miniconda3/envs/higashi/lib/python3.9/threading.py", line 1029, in join
        self._wait_for_tstate_lock()
      File "/home/biogenger/miniconda3/envs/higashi/lib/python3.9/threading.py", line 1045, in _wait_for_tstate_lock
        elif lock.acquire(block, timeout):
    KeyboardInterrupt:
    Error in atexit._run_exitfuncs:
    Traceback (most recent call last):
      File "/home/biogenger/miniconda3/envs/higashi/lib/python3.9/multiprocessing/popen_fork.py", line 27, in poll
        pid, sts = os.waitpid(self.pid, flag)
    KeyboardInterrupt

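The traceback above shows the main process parked in `as_completed` while a worker never finishes, which is why it sits at 0% with no error. One generic way to make such hangs visible is to pass a timeout to `as_completed` so the stall raises instead of blocking forever. This is an editorial sketch with hypothetical names (`job`, `collect`), not Higashi's code; a thread pool is used only to keep the example self-contained, whereas Higashi uses a process pool:

```python
import concurrent.futures as cf
import time

def job(x):
    # Stand-in for a training worker.
    time.sleep(0.05)
    return x * x

def collect(jobs, timeout=10.0):
    # Passing timeout= to as_completed() raises TimeoutError if any
    # future is still pending at the deadline, turning a silent hang
    # into a visible, debuggable error.
    results = []
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(job, j) for j in jobs]
        for fut in cf.as_completed(futures, timeout=timeout):
            results.append(fut.result())
    return sorted(results)
```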

EddieLv avatar EddieLv commented on June 16, 2024

Hi, ruochi. I wonder if the bug is related to the PyTorch version? Or did I actually fail to install Higashi through git?


ruochiz avatar ruochiz commented on June 16, 2024

I don't think it has to do with the torch version, as 1.11.0 is something I have tested on. The deadlock seems to be triggered by the multiprocessing part. I will run some tests on my end. Meanwhile, could you share the config JSON file you created for this run? Thanks.


EddieLv avatar EddieLv commented on June 16, 2024

{
"config_name": "Cere-24-20220416",
"data_dir": "/media/biogenger/D/Projects/CZP/Cere-24-20220416/7_higashi_input",
"input_format": "higashi_v1",
"structured": "true",
"temp_dir": "/media/biogenger/D/Projects/CZP/Cere-24-20220416/8_higashi_out",
"genome_reference_path": "/media/biogenger/D/Projects/CZP/Cere-24-20220416/GRCm39.chr.sizes.txt",
"cytoband_path": "/media/biogenger/D/Projects/CZP/Cere-24-20220416/GRCm39_cytoband.txt",
"chrom_list": ["chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14","chr15","chr16","chr17","chr18","chr19"],
"resolution": 1000000,
"resolution_cell": 1000000,
"local_transfer_range": 1,
"dimensions": 64,
"loss_mode": "zinb",
"rank_thres": 1,
"embedding_epoch": 80,
"no_nbr_epoch": 80,
"with_nbr_epoch": 60,
"embedding_name": "Cere-24-20220416_zinb",
"impute_list": ["chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14","chr15","chr16","chr17","chr18","chr19"],
"minimum_distance": 1000000,
"maximum_distance": -1,
"neighbor_num": 5,
"cpu_num": -1,
"gpu_num": 0,
"UMAP_params": {"n_neighbors": 20}
}
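A config like the one above is plain JSON, so a quick sanity check before launching a long run can catch missing keys early. This is a hypothetical helper, not Higashi's actual validation logic; the set of "required" keys is only an illustration:

```python
import json

# Illustrative subset of keys; Higashi's real requirements may differ.
REQUIRED = ("config_name", "data_dir", "temp_dir", "chrom_list", "resolution")

def load_config(text):
    # Parse the JSON and fail with a clear message if keys are missing.
    config = json.loads(text)
    missing = [k for k in REQUIRED if k not in config]
    if missing:
        raise KeyError(f"missing config keys: {missing}")
    return config
```

By convention in this config, `cpu_num: -1` means "use all available cores".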


EddieLv avatar EddieLv commented on June 16, 2024

And my python version is 3.9.0. :)


ruochiz avatar ruochiz commented on June 16, 2024

Hi, I just updated the code base (specifically the main_cell.py file). Could you set cpu_num to 1 and run Higashi through the CLI (python higashi/main_cell.py -c ../...JSON -s 2)? The -s 2 makes sure the program starts at the training-for-imputation step, and setting cpu_num = 1 in the JSON file disables multiprocessing. Let's see whether any error occurs without multiprocessing. If it hangs again, please interrupt it and attach the logs. Thanks.


EddieLv avatar EddieLv commented on June 16, 2024

It seems to work, ruochi.

0%| | 0/24 [00:00<?, ?it/s]
100%|██████████| 24/24 [00:00<00:00, 412554.49it/s]

0%| | 0/24 [00:00<?, ?it/s]
100%|██████████| 24/24 [00:00<00:00, 521571.48it/s]

0%| | 0/24 [00:00<?, ?it/s]
25%|██▌ | 6/24 [00:00<00:00, 57.71it/s]
100%|██████████| 24/24 [00:00<00:00, 123.71it/s]

0%| | 0/24 [00:00<?, ?it/s]
100%|██████████| 24/24 [00:00<00:00, 759.27it/s]

(Training) : 0%| | 0/1000 [00:00<?, ?it/s]
(Training) : 0%| | 1/1000 [00:00<04:23, 3.79it/s]
(Training) BCE: 0.797 MSE: 0.000 Loss: 0.797 norm_ratio: 0.00: 0%| | 2/1000 [00:00<03:47, 4.39it/s]
(Training) BCE: 0.879 MSE: 0.000 Loss: 0.879 norm_ratio: 0.00: 0%| | 3/1000 [00:00<03:49, 4.34it/s]
(Training) BCE: 0.870 MSE: 0.000 Loss: 0.870 norm_ratio: 0.00: 0%| | 4/1000 [00:00<03:52, 4.29it/s]
(Training) BCE: 0.818 MSE: 0.000 Loss: 0.818 norm_ratio: 0.00: 0%| | 5/1000 [00:01<03:39, 4.53it/s]
(Training) BCE: 0.781 MSE: 0.000 Loss: 0.781 norm_ratio: 0.00: 1%| | 6/1000 [00:01<03:46, 4.38it/s]
(Training) BCE: 0.775 MSE: 0.000 Loss: 0.775 norm_ratio: 0.00: 1%| | 7/1000 [00:01<03:50, 4.32it/s]
(Training) BCE: 0.822 MSE: 0.000 Loss: 0.822 norm_ratio: 0.00: 1%| | 8/1000 [00:01<03:42, 4.46it/s]
(Training) BCE: 0.766 MSE: 0.000 Loss: 0.766 norm_ratio: 0.00: 1%| | 9/1000 [00:02<03:46, 4.37it/s]
(Training) BCE: 0.827 MSE: 0.000 Loss: 0.827 norm_ratio: 0.00: 1%| | 10/1000 [00:02<03:41, 4.48it/s]
(Training) BCE: 0.849 MSE: 0.000 Loss: 0.849 norm_ratio: 0.00: 1%| | 11/1000 [00:02<03:47, 4.34it/s]
(Training) BCE: 0.834 MSE: 0.000 Loss: 0.834 norm_ratio: 0.00: 1%| | 12/1000 [00:02<03:56, 4.18it/s]
(Training) BCE: 0.748 MSE: 0.000 Loss: 0.748 norm_ratio: 0.00: 1%|▏ | 13/1000 [00:03<04:08, 3.96it/s]
(Training) BCE: 0.856 MSE: 0.000 Loss: 0.856 norm_ratio: 0.00: 1%|▏ | 14/1000 [00:03<03:58, 4.13it/s]

But what if I want to use multiple CPUs?


EddieLv avatar EddieLv commented on June 16, 2024

And I tested it with cpu=-1; the same error occurs.


ruochiz avatar ruochiz commented on June 16, 2024

That's... unexpected. cpu=1 is just used for debugging; I thought the error would persist, since it's simply easier to debug without multiprocessing. What if you set cpu:2 or cpu:3? Would that trigger the error?


EddieLv avatar EddieLv commented on June 16, 2024

Yeap... I tried cpu=2 and cpu=8, and both trigger the same error, but cpu=1 works.
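One common cause of "works with one process, hangs with several" on Linux is the default `fork` start method: forked children inherit locks that background threads in the parent may hold, which can leave workers sleeping forever. A quick way to probe this (an editorial sketch, not Higashi's code; names are hypothetical) is to run the same pool under an explicit start method:

```python
import multiprocessing as mp

def square(x):
    # Stand-in for a batch-generation worker.
    return x * x

def run_jobs(xs, method):
    # On Linux the default start method is "fork", which can deadlock
    # when the parent already holds locks in background threads (a
    # classic cause of pools whose workers all sit in state S/sleeping).
    # "spawn" starts fresh interpreters and often sidesteps the hang.
    ctx = mp.get_context(method)
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, xs)
```

With `method="spawn"`, call `run_jobs` from inside an `if __name__ == "__main__":` block, since spawned children re-import the main module.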


ruochiz avatar ruochiz commented on June 16, 2024

Let me try to run the code on my CPU server and get back to you. If cpu=1 works, then it has nothing to do with the data itself. I have a suspicion about what the reason might be, though. Will get back with more details.


EddieLv avatar EddieLv commented on June 16, 2024

I found that it actually created multiple processes, but the processes seemed to be sleeping.
[Screenshot 2022-05-05 10-02-05]


EddieLv avatar EddieLv commented on June 16, 2024

Hi, ruochi. Has the issue been solved?


ruochiz avatar ruochiz commented on June 16, 2024

Sorry for the late reply; I was on a trip. I tested it on the (Linux) CPU machine I have, and the multiprocessing seems to be working fine. I am planning to test it on a Windows PC as well. Configuring the environment takes a while, as I have never used that PC to run a Python program before. I will post an update later.


EddieLv avatar EddieLv commented on June 16, 2024

Hi, ruochi. My computer runs Linux as well; I wonder whether I actually failed to install Higashi successfully.
Recently I ran into some more problems:
1. When I set cpu=1 and run through the CLI,
the .err file is:
0%| | 0/19 [00:00<?, ?it/s]
100%|██████████| 19/19 [00:00<00:00, 520861.28it/s]

0%| | 0/19 [00:00<?, ?it/s]
100%|██████████| 19/19 [00:00<00:00, 664098.13it/s]
Traceback (most recent call last):
  File "main_cell.py", line 1328, in <module>
    checkpoint = torch.load(save_path+"_stage1", map_location=current_device)
  File "/home/biogenger/miniconda3/envs/higashi/lib/python3.7/site-packages/torch/serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/biogenger/miniconda3/envs/higashi/lib/python3.7/site-packages/torch/serialization.py", line 231, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/biogenger/miniconda3/envs/higashi/lib/python3.7/site-packages/torch/serialization.py", line 212, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/media/biogenger/D/Projects/CZP/Cere-24-20220416/8_higashi_out/model/model.chkpt_stage1'
The relevant part of the .log file is:
layer_norm1.weight True torch.Size([64])
layer_norm1.bias True torch.Size([64])
layer_norm2.weight True torch.Size([64])
layer_norm2.bias True torch.Size([64])
extra_proba.w_stack.0.weight True torch.Size([4, 41])
extra_proba.w_stack.0.bias True torch.Size([4])
extra_proba.w_stack.1.weight True torch.Size([1, 4])
extra_proba.w_stack.1.bias True torch.Size([1])
extra_proba2.w_stack.0.weight True torch.Size([4, 41])
extra_proba2.w_stack.0.bias True torch.Size([4])
extra_proba2.w_stack.1.weight True torch.Size([1, 4])
extra_proba2.w_stack.1.bias True torch.Size([1])
extra_proba3.w_stack.0.weight True torch.Size([4, 41])
extra_proba3.w_stack.0.bias True torch.Size([4])
extra_proba3.w_stack.1.weight True torch.Size([1, 4])
extra_proba3.w_stack.1.bias True torch.Size([1])
attribute_dict_embedding.weight False torch.Size([4826, 20])
params to be trained 738082
initializing data generator
initializing data generator
2. When I set cpu=1 in a Jupyter notebook:
[Screenshot 2022-05-12 08-41-42]
[Screenshot 2022-05-12 08-41-53]
It seems Higashi broke during the imputation? I'm confused by the str-vs-int error, because the imputation process had already been running for a while.


ruochiz avatar ruochiz commented on June 16, 2024

These two have different causes. The first is caused by there being no stage-1 model trained for that JSON. If you haven't trained the model before when using CLI mode, you should run python main_cell.py -c xxx -s 1 instead of -s 2.
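The FileNotFoundError above surfaces deep inside torch.load; checking for the checkpoint up front gives a clearer message. This is a hypothetical guard, not Higashi's code; only the `save_path + "_stage1"` naming is taken from the traceback:

```python
import os

def require_checkpoint(save_path, stage=1):
    # Higashi loads save_path + "_stage1" with torch.load(); verifying
    # the file first turns an opaque FileNotFoundError into a message
    # that says which stage still needs to be run.
    path = f"{save_path}_stage{stage}"
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"{path} not found; run 'python main_cell.py -c <config> -s {stage}' first"
        )
    return path
```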

For the second one, the error is triggered by the cytoband file you provided containing a string in the "start" column. Could you attach your cytoband file here for reference? I can push a fix soon to make the code handle strings in the "start" column, but it would be helpful to see why there is a string there in the first place.


EddieLv avatar EddieLv commented on June 16, 2024

OK, here is my cytoband file.
GRCm39_cytoband.txt


ruochiz avatar ruochiz commented on June 16, 2024

Ah, I see: it's because the first line (#chrom, chromStart, chromEnd) is interpreted as content rather than as a header. Delete the first line and the code should be fine. The cytoband file I downloaded from UCSC doesn't contain a header, which is why I assumed there wouldn't be one by default. I can add some code to make the program ignore lines that start with #.
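The fix described above can be sketched as follows: skip `#`-prefixed lines while parsing, so header text like "chromStart" is never coerced into the integer "start" column. This is an illustrative parser, not Higashi's implementation; the five-column layout follows the UCSC cytoband format:

```python
def read_cytoband(lines):
    # Parse UCSC-style cytoband rows (chrom, start, end, band, stain),
    # skipping comment/header lines that start with '#' and blank lines.
    records = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, start, end, band, stain = line.rstrip("\n").split("\t")
        # int() fails loudly here if a stray header row slips through.
        records.append((chrom, int(start), int(end), band, stain))
    return records
```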


EddieLv avatar EddieLv commented on June 16, 2024

OK~thanks


ruochiz avatar ruochiz commented on June 16, 2024

I just added support for a new parameter in the JSON file. If you set "cpu_num_torch": -1 but "cpu_num": 1, the code will still use multiprocessing for the PyTorch training, but only one CPU process for generating training batches. This is a temporary solution and is not as optimized as the original version. But since I cannot replicate the error on my end, I would have to guess what triggers it, which could take a while.
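Based on that description, the relevant fragment of the config JSON would look like this (a sketch assembled from the comment above; only these two keys change, the rest of the config stays as before):

```json
{
  "cpu_num": 1,
  "cpu_num_torch": -1
}
```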

I will close this issue for now, but if I have more updates, I will post them here.

