fingerid's People

Contributors

huibinshen

fingerid's Issues

About GNPS data problems

Hi, I already have about 20,000 GNPS spectra and have used SIRIUS to compute the fragmentation trees. If I want to use shen_ISMB2014.py, do I need to write my own parser.py for the GNPS dataset?
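
For what it is worth, here is a minimal sketch of such a parser, assuming the GNPS spectra are exported as MGF files (BEGIN IONS / END IONS blocks); it is not part of fingerid, and its output would still need to be mapped onto whatever structure fingerid's own parsers produce:

    # Hypothetical GNPS MGF parser sketch; not fingerid code.
    def parse_mgf(path):
        """Yield (precursor_mz, peaks) per spectrum, peaks = list of (mz, intensity)."""
        precursor, peaks = None, []
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line == "BEGIN IONS":
                    precursor, peaks = None, []
                elif line.startswith("PEPMASS="):
                    precursor = float(line.split("=", 1)[1].split()[0])
                elif line == "END IONS":
                    yield precursor, peaks
                elif line and line[0].isdigit():
                    mz, intensity = line.split()[:2]
                    peaks.append((float(mz), float(intensity)))

    # Usage (hypothetical file name):
    # for precursor, peaks in parse_mgf("gnps_spectra.mgf"):
    #     print(precursor, len(peaks))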

multiprocessing in train_test.py

Hi,
this is for fingerid 1.4, just a question: judging from the CPU use, there seems to be no multiprocessing or parallel training happening. I guess for thousands of molecules it could matter (so far it is decently fast).

[screenshot: CPU usage, only one thread busy]

This is on a 3.1 GHz workstation; the screenshot above shows only 1 of 16 CPUs (threads) in use.
That was with the standard train_test.py script.

~/fingerid/examples$ time python train_test.py 

real    0m55.436s
user    0m55.352s
sys 0m1.384s

If I change the number of CPUs to 8,

    trainModels(train_ckm, labels, "MODELS", select_c=False, n_p=8, prob=prob)

the time is still the same:

real    0m55.337s
user    0m55.452s
sys 0m1.528s
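
For reference, a generic (not fingerid-specific) way to check whether worker processes are actually being spawned is to collect the process IDs seen by a multiprocessing pool; with real parallelism the set of PIDs should be larger than one:

    # Generic multiprocessing check; nothing here is fingerid code.
    import multiprocessing
    import os

    def worker(i):
        return os.getpid()

    if __name__ == "__main__":
        pool = multiprocessing.Pool(processes=8)
        pids = pool.map(worker, range(64))
        pool.close()
        pool.join()
        # With working parallelism this prints a number > 1 (up to 8).
        print(len(set(pids)))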

Cheers
Tobias

ImportError: No module named setuptools

Hi,
probably trivial for Python experts, but on Ubuntu this is what happens:

$ sudo python setup.py install
Traceback (most recent call last):
  File "setup.py", line 3, in <module>
    from setuptools import setup, find_packages
ImportError: No module named setuptools

Ubuntu solution (maybe add this to the README):
sudo apt-get update
sudo apt-get install python-pip python-setuptools python-numpy
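
After that, this one-liner should succeed if setuptools is installed:

    python -c "import setuptools; print(setuptools.__version__)"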

Tobias

error in trainSVM.py with 16 CPUs

Hi,
I use 16 physical CPUs and fingerid 1.4, but once I change the number of processors to 16 in train_test.py:

    # MODELS is the folder to store trained models.
    prob = True  # Set prob=True if probability output is wanted
    trainModels(train_ckm, labels, "MODELS", select_c=False, n_p=16, prob=prob)

I get the error below.

~/fingerid/examples$ python train_test.py 

Computing combined kernel for ALIGN
train models and make prediction
Create directory MODELS to store the trained models
Traceback (most recent call last):
  File "train_test.py", line 93, in <module>
    trainModels(train_ckm, labels, "MODELS", select_c=False, n_p=16, prob=prob)
  File "../../fingerid/fingerid/model/trainSVM.py", line 76, in trainModels
    args=(x, labels, model_dir, task_dict[i], prob))
KeyError: 10

Just to double-check that they are correctly detected: there are indeed 16 CPUs visible to Python.

>>> import multiprocessing
>>> 
>>> multiprocessing.cpu_count()
16
>>> 

Anything above 10 CPUs throws the error. This does not happen in shen_ISMB2014.py; there I can set 32 CPUs and it still runs fine, although the run time is the same and only 1 CPU is in use (as reported before).
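
One possible reading of the traceback (an assumption; I have not stepped through trainSVM.py): task_dict seems to be keyed by fingerprint/task index, and the example data defines only 10 fingerprint tasks, so any worker index of 10 or above has no entry and task_dict[i] raises KeyError. A self-contained illustration of that pattern, with a cap on the worker count as a possible workaround:

    # Hypothetical illustration, not the actual trainSVM.py code.
    task_dict = dict((i, "fingerprint_%d" % i) for i in range(10))  # only 10 tasks

    n_p = 16
    # task_dict[10] would raise: KeyError: 10
    n_p = min(n_p, len(task_dict))  # never start more workers than tasks
    for i in range(n_p):
        print(i, task_dict[i])
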
Tobias

shen_ISMB2014.py issue with different fgtree_folder

Hi,
running the standard test example in shen_ISMB2014.py works fine, but changing the input to "metlin_trees" runs for a while and then produces the error below. So I guess I am missing the correct training cases and structures and/or some other parameters. It is not readily visible to me which parameters have to be changed in shen_ISMB2014.py if I use different spectra, or is that something that has to be changed in another script?

    # Set data info, default run on a small 50 compounds dataset.
    # To run the 978 compounds dataset, change fgtree_folder to "metlin_trees".
    # The MS/MS used in the paper can be downloaded from METLIN database
    # with the same metlin id used in the filenames of fgtrees.
    # fgtree_folder = "test_data/train_trees/"

    fgtree_folder = "metlin_trees"
    ms_folder = "test_data/train_ms/"
    fingerprints = "test_data/train_output.txt" # output we want to predict

It runs for a while and then just ends like this:

~/fingerid/examples$ python shen_ISMB2014.py 
computing kernel PPK
computing kernel NB
computing kernel NI
computing kernel LB
computing kernel LC
computing kernel CPC
computing kernel CP2
computing kernel CPK
computing kernel LI
computing kernel CSC
computing kernel RLB
computing kernel RLI
computing train kernel for 0'th example
computing train kernel for 1'th example
computing train kernel for 2'th example
computing train kernel for 3'th example
computing train kernel for 4'th example
computing train kernel for 5'th example
computing train kernel for 6'th example
computing train kernel for 7'th example
computing train kernel for 8'th example
computing train kernel for 9'th example
computing train kernel for 10'th example
computing train kernel for 11'th example
computing train kernel for 12'th example
computing train kernel for 13'th example
computing train kernel for 14'th example
computing train kernel for 15'th example
computing train kernel for 16'th example
computing train kernel for 17'th example
computing train kernel for 18'th example
computing train kernel for 19'th example
computing train kernel for 20'th example
computing train kernel for 21'th example
computing train kernel for 22'th example
computing train kernel for 23'th example
computing train kernel for 24'th example
computing train kernel for 25'th example
computing train kernel for 26'th example
computing train kernel for 27'th example
computing train kernel for 28'th example
computing train kernel for 29'th example
computing train kernel for 30'th example
computing train kernel for 31'th example
computing train kernel for 32'th example
computing train kernel for 33'th example
computing train kernel for 34'th example
computing train kernel for 35'th example
computing train kernel for 36'th example
computing train kernel for 37'th example
computing train kernel for 38'th example
computing train kernel for 39'th example
computing train kernel for 40'th example
computing train kernel for 41'th example
computing train kernel for 42'th example
computing train kernel for 43'th example
computing train kernel for 44'th example
computing train kernel for 45'th example
computing train kernel for 46'th example
computing train kernel for 47'th example
computing train kernel for 48'th example
computing train kernel for 49'th example
Writing PPK kernel to PPK_kernel.txt
Writing LB kernel to LB_kernel.txt
Writing NB kernel to NB_kernel.txt
Writing LC kernel to LC_kernel.txt
Writing RLB kernel to RLB_kernel.txt
Writing NI kernel to NI_kernel.txt
Writing LI kernel to LI_kernel.txt
Writing RLI kernel to RLI_kernel.txt
Writing CP2 kernel to CP2_kernel.txt
Writing CPC kernel to CPC_kernel.txt
Writing CSC kernel to CSC_kernel.txt
Writing CPK kernel to CPK_kernel.txt

Train SVM for kernel PPK
cv on 0'th fingerprint
cv on 1'th fingerprint
cv on 2'th fingerprint
cv on 3'th fingerprint
cv on 4'th fingerprint
cv on 5'th fingerprint
cv on 6'th fingerprint
cv on 7'th fingerprint
cv on 8'th fingerprint
cv on 9'th fingerprint
Writting prediction in cvpred_PPK.txt
Train SVM for kernel NB
cv on 0'th fingerprint
cv on 1'th fingerprint
cv on 2'th fingerprint
cv on 3'th fingerprint
cv on 4'th fingerprint
cv on 5'th fingerprint
cv on 6'th fingerprint
cv on 7'th fingerprint
cv on 8'th fingerprint
cv on 9'th fingerprint
Traceback (most recent call last):
  File "shen_ISMB2014.py", line 119, in <module>
    trainSVM(km_f, fingerprints, np=16, c_sel=True)
  File "shen_ISMB2014.py", line 91, in trainSVM
    cvpreds = internalCV_mp(train_km, labels, 5, select_c=c_sel, n_p=np, prob=prob)
  File "../../fingerid/fingerid/model/internalCV_mp.py", line 83, in internalCV_mp
    pred_fp[:,fp_ind] = res.pred_fp_
ValueError: could not broadcast input array from shape (978) into shape (50)
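
A guess from the shapes in the error (my assumption, not verified in the code): the kernels were computed from the 978 metlin trees, while ms_folder and fingerprints still point to the 50-compound test data, so a prediction vector of length 978 cannot be stored in an array sized for 50 compounds. All three inputs would have to describe the same compounds in the same order, along the lines of the following (placeholder file names, not shipped with fingerid):

    # Hypothetical, consistent configuration for the 978-compound METLIN set;
    # the ms folder and fingerprint file below are placeholders.
    fgtree_folder = "metlin_trees"
    ms_folder = "metlin_ms/"              # MS/MS spectra for the same 978 compounds
    fingerprints = "metlin_output.txt"    # fingerprint matrix with 978 rows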

Cheers
Tobias

trainSVM scaling with multiple CPUs

Hi,
this is for fingerid 1.4 and shen_ISMB2014.py.
Scaling for trainSVM(ckm_f, fingerprints, np=....) (np = number of processors):

CPUs   real        user        sys
1      0m19.516s   0m30.252s   0m2.136s
2      0m12.498s   0m30.048s   0m1.856s
4      0m9.657s    0m29.472s   0m1.800s
8      0m8.742s    0m31.916s   0m1.828s
16     0m7.935s    0m34.060s   0m2.280s

So for this small set (fgtree_folder = "test_data/train_trees/") there is basically no scaling beyond 4 CPUs. There is some multiprocessing during the first 5-10 seconds of the run. The most time-consuming part of the code (again only a single CPU at 100% use) is then writing the LI and RLI kernels (LI_kernel.txt, RLI_kernel.txt) and the general output.
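
If the *_kernel.txt files are written as plain text (an assumption on my part, e.g. via numpy.savetxt), that step is inherently single-threaded and slow for matrices of this size; a rough, fingerid-independent comparison against binary output:

    # Stand-alone timing sketch; the 978x978 size mirrors the METLIN kernel.
    import time
    import numpy as np

    km = np.random.rand(978, 978)

    t0 = time.time()
    np.savetxt("km_text.txt", km)   # text output, as the *_kernel.txt files appear to be
    t1 = time.time()
    np.save("km_binary.npy", km)    # binary alternative
    t2 = time.time()
    print("savetxt: %.2fs  np.save: %.2fs" % (t1 - t0, t2 - t1))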

Cheers
Tobias

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.