
Comments (21)

ruochiz avatar ruochiz commented on June 16, 2024

Thank you for your interest in Higashi! When using it through the CLI mode, did it just hang like this (stuck at 0% without any error), or did it quit with an error message? If it's the former, could you attach the log from when you kill the process (Ctrl+C), so I can try to figure out which process is hanging? Thanks!

from higashi.

EddieLv avatar EddieLv commented on June 16, 2024

This is the error; do you need the complete log file as well?

    (Training) : 0%| | 0/1000 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "/home/biogenger/Biosoftwares/Higashi/higashi/main_cell.py", line 1362, in <module>
        train(higashi_model,
      File "/home/biogenger/Biosoftwares/Higashi/higashi/main_cell.py", line 653, in train
        bce_loss, mse_loss, train_accu, auc1, auc2, str1, str2, train_pool, train_p_list = train_epoch(
      File "/home/biogenger/Biosoftwares/Higashi/higashi/main_cell.py", line 124, in train_epoch
        for p in as_completed(train_p_list):
      File "/home/biogenger/miniconda3/envs/higashi/lib/python3.9/concurrent/futures/_base.py", line 245, in as_completed
        waiter.event.wait(wait_timeout)
      File "/home/biogenger/miniconda3/envs/higashi/lib/python3.9/threading.py", line 574, in wait
        signaled = self._cond.wait(timeout)
      File "/home/biogenger/miniconda3/envs/higashi/lib/python3.9/threading.py", line 312, in wait
        waiter.acquire()
    KeyboardInterrupt
    Exception ignored in: <module 'threading' from '/home/biogenger/miniconda3/envs/higashi/lib/python3.9/threading.py'>
    Traceback (most recent call last):
      File "/home/biogenger/miniconda3/envs/higashi/lib/python3.9/threading.py", line 1411, in _shutdown
        atexit_call()
      File "/home/biogenger/miniconda3/envs/higashi/lib/python3.9/concurrent/futures/process.py", line 95, in _python_exit
        t.join()
      File "/home/biogenger/miniconda3/envs/higashi/lib/python3.9/threading.py", line 1029, in join
        self._wait_for_tstate_lock()
      File "/home/biogenger/miniconda3/envs/higashi/lib/python3.9/threading.py", line 1045, in _wait_for_tstate_lock
        elif lock.acquire(block, timeout):
    KeyboardInterrupt:
    Error in atexit._run_exitfuncs:
    Traceback (most recent call last):
      File "/home/biogenger/miniconda3/envs/higashi/lib/python3.9/multiprocessing/popen_fork.py", line 27, in poll
        pid, sts = os.waitpid(self.pid, flag)
    KeyboardInterrupt

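The traceback above shows the main process parked in `as_completed` while a worker never finishes, which is why it sits at 0% with no error. One generic way to make such hangs visible is to pass a timeout to `as_completed` so the stall raises instead of blocking forever. This is an editorial sketch with hypothetical names (`job`, `collect`), not Higashi's code; a thread pool is used only to keep the example self-contained, whereas Higashi uses a process pool:

```python
import concurrent.futures as cf
import time

def job(x):
    # Stand-in for a training worker.
    time.sleep(0.05)
    return x * x

def collect(jobs, timeout=10.0):
    # Passing timeout= to as_completed() raises TimeoutError if any
    # future is still pending at the deadline, turning a silent hang
    # into a visible, debuggable error.
    results = []
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(job, j) for j in jobs]
        for fut in cf.as_completed(futures, timeout=timeout):
            results.append(fut.result())
    return sorted(results)
```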

EddieLv avatar EddieLv commented on June 16, 2024

Hi, ruochi. I wonder if the bug is related to the PyTorch version? Or did I actually fail to install Higashi through git?


ruochiz avatar ruochiz commented on June 16, 2024

I don't think it has to do with the torch version, as 1.11.0 is something I have tested on. The deadlock seems to be triggered by the multiprocessing part. I will run some tests on my end. Meanwhile, could you share the config JSON file you created for this run? Thanks.


EddieLv avatar EddieLv commented on June 16, 2024

{
"config_name": "Cere-24-20220416",
"data_dir": "/media/biogenger/D/Projects/CZP/Cere-24-20220416/7_higashi_input",
"input_format": "higashi_v1",
"structured": "true",
"temp_dir": "/media/biogenger/D/Projects/CZP/Cere-24-20220416/8_higashi_out",
"genome_reference_path": "/media/biogenger/D/Projects/CZP/Cere-24-20220416/GRCm39.chr.sizes.txt",
"cytoband_path": "/media/biogenger/D/Projects/CZP/Cere-24-20220416/GRCm39_cytoband.txt",
"chrom_list": ["chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14","chr15","chr16","chr17","chr18","chr19"],
"resolution": 1000000,
"resolution_cell": 1000000,
"local_transfer_range": 1,
"dimensions": 64,
"loss_mode": "zinb",
"rank_thres": 1,
"embedding_epoch": 80,
"no_nbr_epoch": 80,
"with_nbr_epoch": 60,
"embedding_name": "Cere-24-20220416_zinb",
"impute_list": ["chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14","chr15","chr16","chr17","chr18","chr19"],
"minimum_distance": 1000000,
"maximum_distance": -1,
"neighbor_num": 5,
"cpu_num": -1,
"gpu_num": 0,
"UMAP_params": {"n_neighbors": 20}
}
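A config like the one above is plain JSON, so a quick sanity check before launching a long run can catch missing keys early. This is a hypothetical helper, not Higashi's actual validation logic; the set of "required" keys is only an illustration:

```python
import json

# Illustrative subset of keys; Higashi's real requirements may differ.
REQUIRED = ("config_name", "data_dir", "temp_dir", "chrom_list", "resolution")

def load_config(text):
    # Parse the JSON and fail with a clear message if keys are missing.
    config = json.loads(text)
    missing = [k for k in REQUIRED if k not in config]
    if missing:
        raise KeyError(f"missing config keys: {missing}")
    return config
```

By convention in this config, `cpu_num: -1` means "use all available cores".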


EddieLv avatar EddieLv commented on June 16, 2024

And my python version is 3.9.0. :)


ruochiz avatar ruochiz commented on June 16, 2024

Hi, I just updated the code base (specifically the main_cell.py file). Could you set cpu_num to 1 and run Higashi through the CLI (python higashi/main_cell.py -c ../...JSON -s 2)? The -s 2 makes sure the program starts at the training-for-imputation step, and setting cpu_num = 1 in the JSON file disables multiprocessing. Let's see whether any error occurs without multiprocessing. If it hangs again, please interrupt it and attach the logs. Thanks.


EddieLv avatar EddieLv commented on June 16, 2024

It seems to work, ruochi.

0%| | 0/24 [00:00<?, ?it/s]
100%|██████████| 24/24 [00:00<00:00, 412554.49it/s]

0%| | 0/24 [00:00<?, ?it/s]
100%|██████████| 24/24 [00:00<00:00, 521571.48it/s]

0%| | 0/24 [00:00<?, ?it/s]
25%|██▌ | 6/24 [00:00<00:00, 57.71it/s]
100%|██████████| 24/24 [00:00<00:00, 123.71it/s]

0%| | 0/24 [00:00<?, ?it/s]
100%|██████████| 24/24 [00:00<00:00, 759.27it/s]

(Training) : 0%| | 0/1000 [00:00<?, ?it/s]
(Training) : 0%| | 1/1000 [00:00<04:23, 3.79it/s]
(Training) BCE: 0.797 MSE: 0.000 Loss: 0.797 norm_ratio: 0.00: 0%| | 2/1000 [00:00<03:47, 4.39it/s]
(Training) BCE: 0.879 MSE: 0.000 Loss: 0.879 norm_ratio: 0.00: 0%| | 3/1000 [00:00<03:49, 4.34it/s]
(Training) BCE: 0.870 MSE: 0.000 Loss: 0.870 norm_ratio: 0.00: 0%| | 4/1000 [00:00<03:52, 4.29it/s]
(Training) BCE: 0.818 MSE: 0.000 Loss: 0.818 norm_ratio: 0.00: 0%| | 5/1000 [00:01<03:39, 4.53it/s]
(Training) BCE: 0.781 MSE: 0.000 Loss: 0.781 norm_ratio: 0.00: 1%| | 6/1000 [00:01<03:46, 4.38it/s]
(Training) BCE: 0.775 MSE: 0.000 Loss: 0.775 norm_ratio: 0.00: 1%| | 7/1000 [00:01<03:50, 4.32it/s]
(Training) BCE: 0.822 MSE: 0.000 Loss: 0.822 norm_ratio: 0.00: 1%| | 8/1000 [00:01<03:42, 4.46it/s]
(Training) BCE: 0.766 MSE: 0.000 Loss: 0.766 norm_ratio: 0.00: 1%| | 9/1000 [00:02<03:46, 4.37it/s]
(Training) BCE: 0.827 MSE: 0.000 Loss: 0.827 norm_ratio: 0.00: 1%| | 10/1000 [00:02<03:41, 4.48it/s]
(Training) BCE: 0.849 MSE: 0.000 Loss: 0.849 norm_ratio: 0.00: 1%| | 11/1000 [00:02<03:47, 4.34it/s]
(Training) BCE: 0.834 MSE: 0.000 Loss: 0.834 norm_ratio: 0.00: 1%| | 12/1000 [00:02<03:56, 4.18it/s]
(Training) BCE: 0.748 MSE: 0.000 Loss: 0.748 norm_ratio: 0.00: 1%|▏ | 13/1000 [00:03<04:08, 3.96it/s]
(Training) BCE: 0.856 MSE: 0.000 Loss: 0.856 norm_ratio: 0.00: 1%|▏ | 14/1000 [00:03<03:58, 4.13it/s]

But what if I want to use multiple CPUs?


EddieLv avatar EddieLv commented on June 16, 2024

And I tested it with cpu=-1; the same error occurs.


ruochiz avatar ruochiz commented on June 16, 2024

That's... unexpected. cpu=1 is just used for debugging; I thought the error would persist, since it's simply easier to debug without multiprocessing. What if you set cpu:2 or cpu:3? Would that trigger the error?


EddieLv avatar EddieLv commented on June 16, 2024

Yeap... I tried cpu=2 and cpu=8, and both trigger the same error, but cpu=1 works.
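One common cause of "works with one process, hangs with several" on Linux is the default `fork` start method: forked children inherit locks that background threads in the parent may hold, which can leave workers sleeping forever. A quick way to probe this (an editorial sketch, not Higashi's code; names are hypothetical) is to run the same pool under an explicit start method:

```python
import multiprocessing as mp

def square(x):
    # Stand-in for a batch-generation worker.
    return x * x

def run_jobs(xs, method):
    # On Linux the default start method is "fork", which can deadlock
    # when the parent already holds locks in background threads (a
    # classic cause of pools whose workers all sit in state S/sleeping).
    # "spawn" starts fresh interpreters and often sidesteps the hang.
    ctx = mp.get_context(method)
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, xs)
```

With `method="spawn"`, call `run_jobs` from inside an `if __name__ == "__main__":` block, since spawned children re-import the main module.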


ruochiz avatar ruochiz commented on June 16, 2024

Let me try to run the code on my CPU server and get back to you. If cpu=1 works, then it has nothing to do with the data itself. I have a suspicion about what the reason might be, though. Will get back with more details.


EddieLv avatar EddieLv commented on June 16, 2024

I found that it actually created multiple processes, but the processes seemed to be sleeping.
[Screenshot 2022-05-05 10-02-05]


EddieLv avatar EddieLv commented on June 16, 2024

Hi, ruochi. Has the issue been solved?


ruochiz avatar ruochiz commented on June 16, 2024

Sorry for the late reply; I was on a trip. I tested it on the (Linux) CPU machine I have, and the multiprocessing seems to be working fine. I am planning to test it on a Windows PC as well. Configuring the environment takes a while, as I have never used that PC to run a Python program before. I will post an update later.


EddieLv avatar EddieLv commented on June 16, 2024

Hi, ruochi. My computer runs Linux as well; I wonder whether I actually failed to install Higashi successfully.
Recently I ran into some more problems:
1. When I set cpu=1 and run through the CLI,
the .err file is:
0%| | 0/19 [00:00<?, ?it/s]
100%|██████████| 19/19 [00:00<00:00, 520861.28it/s]

0%| | 0/19 [00:00<?, ?it/s]
100%|██████████| 19/19 [00:00<00:00, 664098.13it/s]
Traceback (most recent call last):
  File "main_cell.py", line 1328, in <module>
    checkpoint = torch.load(save_path+"_stage1", map_location=current_device)
  File "/home/biogenger/miniconda3/envs/higashi/lib/python3.7/site-packages/torch/serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/biogenger/miniconda3/envs/higashi/lib/python3.7/site-packages/torch/serialization.py", line 231, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/biogenger/miniconda3/envs/higashi/lib/python3.7/site-packages/torch/serialization.py", line 212, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/media/biogenger/D/Projects/CZP/Cere-24-20220416/8_higashi_out/model/model.chkpt_stage1'
The relevant part of the .log file is:
layer_norm1.weight True torch.Size([64])
layer_norm1.bias True torch.Size([64])
layer_norm2.weight True torch.Size([64])
layer_norm2.bias True torch.Size([64])
extra_proba.w_stack.0.weight True torch.Size([4, 41])
extra_proba.w_stack.0.bias True torch.Size([4])
extra_proba.w_stack.1.weight True torch.Size([1, 4])
extra_proba.w_stack.1.bias True torch.Size([1])
extra_proba2.w_stack.0.weight True torch.Size([4, 41])
extra_proba2.w_stack.0.bias True torch.Size([4])
extra_proba2.w_stack.1.weight True torch.Size([1, 4])
extra_proba2.w_stack.1.bias True torch.Size([1])
extra_proba3.w_stack.0.weight True torch.Size([4, 41])
extra_proba3.w_stack.0.bias True torch.Size([4])
extra_proba3.w_stack.1.weight True torch.Size([1, 4])
extra_proba3.w_stack.1.bias True torch.Size([1])
attribute_dict_embedding.weight False torch.Size([4826, 20])
params to be trained 738082
initializing data generator
initializing data generator
2. When I set cpu=1 in a Jupyter notebook:
[Screenshot 2022-05-12 08-41-42]
[Screenshot 2022-05-12 08-41-53]
It seems Higashi broke during the imputation? I'm confused by the str-vs-int error, because the imputation process had already been running for a while.


ruochiz avatar ruochiz commented on June 16, 2024

These two have different causes. The first is caused by there being no stage-1 model trained for that JSON. If you haven't trained the model before when using CLI mode, you should run python main_cell.py -c xxx -s 1 instead of -s 2.
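The FileNotFoundError above surfaces deep inside torch.load; checking for the checkpoint up front gives a clearer message. This is a hypothetical guard, not Higashi's code; only the `save_path + "_stage1"` naming is taken from the traceback:

```python
import os

def require_checkpoint(save_path, stage=1):
    # Higashi loads save_path + "_stage1" with torch.load(); verifying
    # the file first turns an opaque FileNotFoundError into a message
    # that says which stage still needs to be run.
    path = f"{save_path}_stage{stage}"
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"{path} not found; run 'python main_cell.py -c <config> -s {stage}' first"
        )
    return path
```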

For the second one, the error is triggered by the cytoband file you provided containing a string in the "start" column. Could you attach your cytoband file here for reference? I can push a fix soon to make the code handle strings in the "start" column, but it would be helpful to see why there is a string there in the first place.


EddieLv avatar EddieLv commented on June 16, 2024

OK, here is my cytoband file.
GRCm39_cytoband.txt


ruochiz avatar ruochiz commented on June 16, 2024

Ah, I see: it's because the first line (#chrom, chromStart, chromEnd) is interpreted as content rather than as a header. Delete the first line and the code should be fine. The cytoband file I downloaded from UCSC doesn't contain a header, which is why I assumed there wouldn't be one by default. I can add some code to make the program ignore lines that start with #.
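The fix described above can be sketched as follows: skip `#`-prefixed lines while parsing, so header text like "chromStart" is never coerced into the integer "start" column. This is an illustrative parser, not Higashi's implementation; the five-column layout follows the UCSC cytoband format:

```python
def read_cytoband(lines):
    # Parse UCSC-style cytoband rows (chrom, start, end, band, stain),
    # skipping comment/header lines that start with '#' and blank lines.
    records = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, start, end, band, stain = line.rstrip("\n").split("\t")
        # int() fails loudly here if a stray header row slips through.
        records.append((chrom, int(start), int(end), band, stain))
    return records
```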


EddieLv avatar EddieLv commented on June 16, 2024

OK~thanks


ruochiz avatar ruochiz commented on June 16, 2024

I just added support for a new parameter in the JSON file. If you set "cpu_num_torch": -1 but "cpu_num": 1, the code will still use multiprocessing for the PyTorch training, but only one CPU process for generating training batches. This is a temporary solution and is not as optimized as the original version. But since I cannot replicate the error on my end, I would have to guess what triggers it, which could take a while.
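Based on that description, the relevant fragment of the config JSON would look like this (a sketch assembled from the comment above; only these two keys change, the rest of the config stays as before):

```json
{
  "cpu_num": 1,
  "cpu_num_torch": -1
}
```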

I will close this issue for now, but if I have more updates, I will post them here.

