monarch-initiative / monarch-semantic-similarity-profiles
License: MIT License
Add a make install goal that installs all the latest dependencies (.PHONY goal).
Document in README.md how to run the whole pipeline, from install to generating everything.
Pseudocode
kg.ingestible.kgx->kg.ingestible.ttl (say g2p)
O = ROBOT_MERGE(kg.ingestible.ttl, PHENIO)
runoak semsimian O --relations g2p,p,i
kg.ingestible.kgx is an rdf dump created by Koza on a data modality we are interested in, such as gene to disease.
This ticket can be closed when we have one example implemented:
PHENIO + HPOA or
PHENIO + g2d
Right now we only pull in gene associations from HPOA, which means we only have HP->Gene associations. To facilitate gene-level semantic similarity between HP and MP, we also need MP->Gene associations and genetic orthologue relations.
@kevinschaper can you help us here wrt.
How would you compare two phenotypic profiles, one MP, one HP, only along their p2g associations?
Add a goal to the repo to gzip all the semantic similarity profiles, see https://github.com/INCATools/ontology-development-kit/blob/master/template/src/ontology/Makefile.jinja2#L905 for inspiration.
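As a rough illustration of what such a goal would do, here is a minimal Python sketch that writes a .gz copy next to every profile. The function name gzip_profiles and the *.tsv pattern are assumptions made for this example; the actual goal would more likely call gzip from the Makefile, as in the ODK template linked above.

```python
import gzip
import shutil
from pathlib import Path


def gzip_profiles(profiles_dir: str, pattern: str = "*.tsv") -> list[Path]:
    """Write a .gz copy next to every file matching `pattern` in `profiles_dir`.

    Hypothetical helper: the name and the default pattern are assumptions.
    """
    archives = []
    for source in sorted(Path(profiles_dir).glob(pattern)):
        target = source.with_name(source.name + ".gz")
        # Stream the copy so large profiles are never held in memory at once.
        with source.open("rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        archives.append(target)
    return archives
```

Calling gzip_profiles("profiles") would leave the original files in place and return the paths of the archives it created.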
When calculating SEMSIM profiles with an information content file (--information-content-file), the pipeline must ensure that the file is generated beforehand to avoid errors during OAK command execution. Declare the information content file as a dependency of the semsim profile Makefile goal.
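The dependency rule can be pictured with a small Python sketch of the check make performs: a target is rebuilt when it is missing or older than a prerequisite, and a missing prerequisite is an error. The function name needs_rebuild is hypothetical; in the repository this logic would live in the Makefile itself rather than in Python.

```python
from pathlib import Path


def needs_rebuild(target: str, prerequisites: list[str]) -> bool:
    """Return True if `target` is missing or older than any prerequisite.

    Raises FileNotFoundError if a prerequisite is missing, which is the
    situation the ticket wants to rule out for the information content file.
    """
    missing = [p for p in prerequisites if not Path(p).exists()]
    if missing:
        raise FileNotFoundError(f"missing prerequisites: {missing}")
    target_path = Path(target)
    if not target_path.exists():
        return True
    target_mtime = target_path.stat().st_mtime
    return any(Path(p).stat().st_mtime > target_mtime for p in prerequisites)
```

Listing the IC file among the prerequisites of the profile means the profile is never attempted before the IC file exists.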
From @caufieldjh:
To get graph embeddings (note this is just with grape - NEAT may be used to automate the process, but this is what runs):
Install grape: pip install grape
from grape.datasets.kghub import KGPhenio
from grape.embedders import FirstOrderLINEEnsmallen
graph = KGPhenio()
embedding = FirstOrderLINEEnsmallen().fit_transform(graph)
By default, embedding will be a pandas DataFrame; if you run `.fit_transform(graph, return_dataframe=False)`, it will be a NumPy array instead.
So the final step is to save accordingly, e.g. with embedding.to_csv('embedding.tsv', sep="\t")
For phenio, use the "normal phenio" for now, not the "Monarch version". Note that this may change in the future; the reason is that Monarch PHENIO is changed in ways that I don't fully understand.
During the SEMSIM calculation, the following warning is displayed: Failed to import custom IC map: Error parsing IC value: invalid float literal. The SEMSIM file is still generated without any errors.
However, the warning indicates that the IC file could not be loaded, which might be affecting the SEMSIM values.
I verified all values in the information content column to ensure they are numeric, and indeed they are.
id information_content
HP:0000001 4.096300700299448
HP:0000002 13.062826751687464
HP:0000003 18.192109768632434
HP:0000005 12.834557764014347
HP:0000006 18.192109768632434
HP:0000007 18.192109768632434
HP:0000008 11.237913458245558
HP:0000009 13.384754846574829
HP:0000010 16.607147267911277
runoak --stacktrace -vvv -i semsimian:sqlite:data/ontology/phenio-monarch.db similarity -p i \
--set1-file data/tmp/hp_terms.txt \
--set2-file data/tmp/hp_terms.txt \
--min-jaccard-similarity 0.4 \
--information-content-file data/tmp/phenio_monarch_hp_ic.tsv \
-O csv \
-o profiles/phenio-monarch-hp-hp.0.4.semsimian.tsv
Failed to import custom IC map: Error parsing IC value: invalid float literal
oaklib 0.6.14
semsimian 0.2.17
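One way to narrow this down is to list every row whose second column does not parse as a float. The sketch below assumes a tab-separated, two-column file like the listing above; the function name is hypothetical, and Python's float() is not guaranteed to accept exactly the same literals as the parser inside semsimian. Note that a header row such as "id information_content" would be reported by this check. Whether semsimian expects a header-free file is not established in this ticket, so treat that only as something to rule out.

```python
import csv


def find_unparseable_ic_rows(path: str) -> list[tuple[int, list[str]]]:
    """Return (line_number, row) for each row whose second column is not a float.

    Assumes a tab-separated file with the term ID in column 1 and the
    information content in column 2 (an assumption based on the listing).
    """
    bad_rows = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for line_number, row in enumerate(reader, start=1):
            if not row:
                continue  # ignore blank lines
            try:
                float(row[1])
            except (IndexError, ValueError):
                # Too few columns, or a value float() cannot parse.
                bad_rows.append((line_number, row))
    return bad_rows
```

An empty result would suggest the problem lies elsewhere, for example in the delimiter or in how the file is passed to the tool.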
Due to an issue during SEMSIM calculation (monarch-initiative/semsimian#133), we need to rerun the SEMSIM profiles experiments as soon as the bug is fixed.
profiles.yml contains:
ontologies:
  - id: upheno2-lattice
  - id: upheno1-equivalent
  - id: upheno1
semantic_similarity_profiles:
  - name: all
    method: semsimian
    ontology: upheno2-lattice
    branches:
      subject: UPHENO:0001001
      object: UPHENO:0001001
  - subset: hp-mp
    method: semsimian
    ontology: upheno2-lattice
    branches:
      subject: UPHENO:0001001
      object: UPHENO:0001001
    prefixes:
      subject: HP
      object: MP
  - subset: hp-mp
    method: semsimian
    ontology: upheno2-lattice
    branches:
      subject: UPHENO:0001001
      object: UPHENO:0001001
    prefixes:
      subject: HP
      object: MP
  - subset: hp-mp
    method: cosine
    ontology: upheno2-lattice
    branches:
      subject: UPHENO:0001001
      object: UPHENO:0001001
    prefixes:
      subject: HP
      object: MP
Makefile.j2 includes all make goals to:
ontologies section of the config.
Semsimian is OAK semsimian and cosine is NEAT cosine similarity. Work with Justin and Harry to set this up.
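For reference, the cosine method compares two embedding vectors by the cosine of the angle between them. The sketch below is only the textbook formula in plain Python, not NEAT's implementation; returning 0.0 for an all-zero vector is a convention chosen for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)
```

Applied to the rows of a node embedding for an HP term and an MP term, this yields a score between -1 and 1.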
We need a different suffix to prevent overwriting and enable the coexistence of SEMSIM profiles with and without the --information-content flag during its calculation.
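A possible naming scheme, sketched in Python. The base pattern is inferred from file names that appear in these tickets (profiles/upheno2-lattice-hp-mp.semsimian.tsv and profiles/phenio-monarch-hp-hp.0.4.semsimian.tsv); the "ic" suffix and the function name profile_path are hypothetical and only illustrate how the two variants could coexist.

```python
def profile_path(ontology, subset, method, threshold=None, with_ic=False):
    """Build an output path for a semantic similarity profile.

    The "ic" marker is a hypothetical suffix, not an agreed convention.
    """
    parts = [f"{ontology}-{subset}"]
    if threshold is not None:
        parts.append(str(threshold))
    parts.append(method)
    if with_ic:
        parts.append("ic")  # distinguishes runs that used an IC file
    return "profiles/" + ".".join(parts) + ".tsv"
```

With this scheme the same ontology, subset and threshold produce two different files depending on with_ic, so neither run overwrites the other.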
Adding custom.Makefile goals to include these associations in phenio ontology:
In monarch-initiative/semsimian#82 (comment), @caufieldjh showed us that HP:phenotypic abnormality has very different parents than MP:phenotypic abnormality.
Can we determine why? In particular, why does the HP term have Uberon ancestors?
@caufieldjh I will assign you for now, but feel free to talk to Chris and assign someone else. It is easier for me to work if I can assign while creating the ticket, so I am sure it's not dropping off the radar.
IC scores should be computed before we run runoak similarity and passed in there using the ic-map parameter.
See monarch-initiative/semsimian#124 (comment) for some context
And make sure it's uploaded in the right location
Add new profile to calculate SEMSIM using HP and XPO ontologies.
Add a make release goal to the Makefile that uploads all semantic similarity profiles as gzipped archives to a new versioned semsim profile release.
Inspiration: https://github.com/INCATools/ontology-development-kit/blob/master/template/src/ontology/Makefile.jinja2#L1084
Headers are valid #-commented YAML files. We use them like:
# ontology: upheno1
# branches:
# - HP:123
# - MP:123
# similarity_measure: jaccard
# similarity_threshold: 0.7
# tool: semsimian
# tool_version: 0.0.1
subject_id object_id diff
....
This will drastically reduce file sizes. See SSSOM for an example implementation.
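A minimal sketch of how a reader could separate such a header from the table, so that run-level metadata is stored once at the top instead of being repeated on every row. The function name is hypothetical, and the header is returned as de-commented text rather than parsed, since parsing would need a YAML library that this example deliberately avoids.

```python
import csv
import io


def split_commented_header(text: str) -> tuple[str, list[dict[str, str]]]:
    """Split a profile into its de-commented header text and its table rows."""
    header_lines = []
    body_lines = []
    for line in text.splitlines():
        if line.startswith("#") and not body_lines:
            # Drop the leading '#' and at most one following space.
            stripped = line[1:]
            header_lines.append(stripped[1:] if stripped.startswith(" ") else stripped)
        else:
            body_lines.append(line)
    table = io.StringIO("\n".join(body_lines))
    rows = list(csv.DictReader(table, delimiter="\t"))
    return "\n".join(header_lines), rows
```

The returned header text is plain YAML and can be handed to any YAML parser; the rows are dictionaries keyed by the column names in the first non-comment line.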
This is the replacement ticket for obophenotype/upheno-dev#38
We need to iterate over this goal, as it is, as of yet, not clear how to fix this. In the old uPheno,
MP:123 = HP:123.
In uPheno2, MP:123 sub UPHENO:111, HP:123 sub UPHENO:111, so the equivalence axiom is replaced by a common parent. This drastically changes the way (graph-based) semantic similarity algorithms behave.
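A toy calculation makes the effect concrete. The sketch below computes Jaccard similarity over reflexive ancestor sets on two tiny hand-written graphs. The term UPHENO:ROOT is a hypothetical placeholder, equivalence is modelled as mutual parenthood purely for this example, and real tools may treat equivalence axioms differently.

```python
def ancestors(term, parents):
    """Reflexive ancestor set of `term` in a child -> parents mapping."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def jaccard(a, b, parents):
    """Jaccard similarity of the reflexive ancestor sets of `a` and `b`."""
    anc_a, anc_b = ancestors(a, parents), ancestors(b, parents)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


# Old uPheno style: MP:123 = HP:123, modelled here as mutual parenthood so
# that each term appears in the other's ancestor set.
old_upheno = {
    "MP:123": ["HP:123", "UPHENO:ROOT"],
    "HP:123": ["MP:123", "UPHENO:ROOT"],
}

# uPheno2 style: both terms are distinct children of a shared parent.
upheno2 = {
    "MP:123": ["UPHENO:111"],
    "HP:123": ["UPHENO:111"],
    "UPHENO:111": ["UPHENO:ROOT"],
}

print(jaccard("MP:123", "HP:123", old_upheno))  # 1.0
print(jaccard("MP:123", "HP:123", upheno2))     # 0.5
```

In the equivalence case the two ancestor sets are identical, while under a common parent each term contributes itself as an unshared element, so the score drops even though the intended meaning of the two terms has not changed.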
This is the key ticket: INCATools/ontology-access-kit#634
@souzadevinicius This has a high priority, but right now, I don't know how to advise you on fixing it.
Can you make sure this does not fall off the radar, and mention it to me every time we meet? (add to your board as high priority)
runoak -i semsimian:sqlite:data/ontology/upheno2-lattice.db similarity -p i --set1-file data/tmp/upheno2-lattice_hp_terms.txt --set2-file data/tmp/upheno2-lattice_mp_terms.txt -O csv -o profiles/upheno2-lattice-hp-mp.semsimian.tsv
/usr/local/lib/python3.10/dist-packages/rdflib_jsonld/__init__.py:9: DeprecationWarning: The rdflib-jsonld package has been integrated into rdflib as of rdflib==6.0.0. Please remove rdflib-jsonld from your project's dependencies.
warnings.warn(
FileNotFoundError: [Errno 2] No such file or directory: 'profiles/upheno2-lattice-hp-mp.semsimian.tsv'
make: *** [Makefile:177: profiles/upheno2-lattice-hp-mp.semsimian.tsv] Error 1
Whenever a command is run that creates a file in a directory that does not have a dependency (direct or indirect) in that same directory, add mkdir -p dirname
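The same guard expressed in Python, for scripts in the pipeline that write outputs themselves. The helper name open_output is hypothetical; Path.mkdir(parents=True, exist_ok=True) is the standard-library equivalent of mkdir -p.

```python
from pathlib import Path


def open_output(path: str):
    """Open `path` for writing, creating its parent directory first."""
    target = Path(path)
    # Equivalent of `mkdir -p dirname`: no error if it already exists.
    target.parent.mkdir(parents=True, exist_ok=True)
    return target.open("w")
```

With this in place, writing to profiles/upheno2-lattice-hp-mp.semsimian.tsv no longer depends on the profiles directory having been created by an earlier step.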