incubator's Introduction

mirtop

Project Status: Active – The project has reached a stable, usable state and is being actively developed. biorxiv

Command line tool to annotate miRNAs and isomiRs using a standard naming.

This tool uses the miRNA GFF3 format agreed on here: https://github.com/miRTop/mirGFF3

Chat

Ask question, ideas Contributors to code

Cite

http://mirtop.github.io

Contributing

Everybody is welcome to contribute: fork the devel branch and start working!

If you are interested in miRNA or small RNA analysis, you can jump into the incubator issue pages to propose ideas, ask questions, or say hi:

https://github.com/miRTop/incubator/issues

About

Join the team: https://orgmanager.miguelpiedrafita.com/join/15463928

Read more: http://mirtop.github.io

Installation

Bioconda

conda install mirtop -c bioconda

PIP

pip install mirtop

Development version

The best solution is to install conda to get an independent environment.

wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh

bash Miniconda-latest-Linux-x86_64.sh -b -p ~/mirtop_env

export PATH=$PATH:~/mirtop_env/bin

conda install -c bioconda pysam pybedtools pandas biopython samtools

git clone http://github.com/miRTop/mirtop

cd mirtop

python setup.py develop

Quick start

Read complete commands at: https://mirtop.readthedocs.org

git clone http://github.com/miRTop/mirtop
cd mirtop/data
mirtop gff --sps hsa --hairpin examples/annotate/hairpin.fa --gtf examples/annotate/hsa.gff3 -o test_out sim_isomir.bam

Output

The mirtop gff command writes its output in the adapted GFF3 format, which captures miRNA variations. The output is explained here.
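As a rough illustration of how such an output file could be read back, here is a minimal Python sketch of a generic GFF3 reader. It is not part of mirtop: the function names are hypothetical, the example feature line is made up, and the attribute names (Read, Name, Filter) and the `key value;` layout are assumptions that should be checked against the format definition.

```python
def parse_attributes(field):
    """Parse a GFF3 attributes column into a dict.

    Accepts both `key=value` and `key value` pairs separated by `;`.
    """
    attributes = {}
    for item in field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        if "=" in item:
            key, value = item.split("=", 1)
        else:
            key, _, value = item.partition(" ")
        attributes[key.strip()] = value.strip()
    return attributes


def parse_gff3(lines):
    """Yield one dict per feature line, skipping headers and blank lines."""
    columns = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase"]
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GFF3 feature line
        record = dict(zip(columns, fields[:8]))
        record["start"] = int(record["start"])
        record["end"] = int(record["end"])
        record["attributes"] = parse_attributes(fields[8])
        yield record


# Hypothetical header and feature line, made up for illustration only.
example = [
    "## example header (hypothetical)",
    "hsa-let-7a-1\tmiRBase\tref_miRNA\t6\t27\t0\t+\t.\t"
    "Read TGAGGTAGTAGGTTGTATAGTT; Name hsa-let-7a-5p; Filter Pass",
]
records = list(parse_gff3(example))
print(records[0]["attributes"]["Name"])
```

To read a real file, pass an open file handle instead of the `example` list, for instance `parse_gff3(open(path))`, where `path` points to the GFF3 file written into the output folder.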

Contributors

Citizens

Here we cite anyone who has contributed to the project in ways other than code development and/or bioinformatic concepts.

Gianvito Urgese, Jan Oppelt (CEITEC Masaryk University, Brno, Czech Republic), Thomas Desvignes, Bastian, Kieran O'Neill (BC Cancer), Charles Reid (University of California Davis), Radhika Khetani (Harvard Chan School of Public Health), Shannan Ho Sui (Harvard Chan School of Public Health), Simonas Juzenas (CAU), Rafael Alis (Catholic University of Valencia), Aida Arcas (Instituto de Neurociencias (CSIC-UMH)), Yufei Lin (Harvard University), Victor Barrera (Harvard Chan School of Public Health), Marc Halushka (Johns Hopkins University)

incubator's Issues

GFF3::seqID

Hi all again!

cc: @lpantano @gurgese @ThomasDesvignes @mhalushka @mlhack @keilbeck @BastianFromm @ivlachos @TJU-CMC

I will open one issue per column type. Let's see if that makes it easier to get at least a few people commenting.

The first column is for the chromosome ID. That raises the question of whether we should use the genomic position or the precursor position, or allow both if the header gives the exact database version for the precursor.

I am increasingly inclined to use the genomic position because that should be the same among databases, on the condition that the hairpin is also in the file as a parent feature. It would look like:

chr1 mirbase hairpin start end .... id=hairpin1
chr1 mirbase mirna start end ... parent=hairpin1

The only thing I am not clear about is what we do with miRNAs that have multiple precursors on the genome. I can only think of adding an attribute like other_parents=hairpin2,hairpin3... and those parents should be in the GFF3 file as well.
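To make the requirement concrete that every referenced parent is present in the file, here is a small Python sketch of a consistency check. It assumes features have already been parsed into dictionaries; the attribute names id, parent and other_parents follow the proposal above, while the function name and the example features are hypothetical.

```python
def missing_parents(features):
    """Return parent IDs that are referenced but not defined in the file."""
    known = {f["attributes"]["id"] for f in features if "id" in f["attributes"]}
    missing = set()
    for feature in features:
        attrs = feature["attributes"]
        referenced = []
        if "parent" in attrs:
            referenced.append(attrs["parent"])
        if "other_parents" in attrs:
            referenced.extend(attrs["other_parents"].split(","))
        missing.update(p for p in referenced if p not in known)
    return sorted(missing)


# Hypothetical features: hairpin3 is referenced but never defined.
features = [
    {"type": "hairpin", "attributes": {"id": "hairpin1"}},
    {"type": "hairpin", "attributes": {"id": "hairpin2"}},
    {"type": "mirna", "attributes": {"parent": "hairpin1",
                                     "other_parents": "hairpin2,hairpin3"}},
]
print(missing_parents(features))  # ['hairpin3']
```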

Please comment with new ideas, or say whether you agree, disagree, or see a scenario I am missing.

Thanks!

Read name

Hi miRTop guys,

I have a question - what should be the contents of 'Read'? Also, I was reading up on GFF3 documentation here, and it seems that all 'additional' attributes should be lowercase. Does that mean that 'Read' should be lowercase as well?

Best,
Nikola

small RNAseq output format definition

Hi all again,

cc: @ThomasDesvignes @mhalushka @mlhack @keilbeck @BastianFromm @ivlachos @TJU-CMC

After giving it some thought, I realized we could frame the naming problem slightly differently. I will open a separate issue (tomorrow) for miRNA+isomiR naming, since we all agree the two are closely related, and we can discuss it there. Meanwhile, I think we can give some thought to:

Do we need a standard output for small RNAseq data? If yes, which?

Advantages:

  • Results become comparable across labs, pipelines, etc. This will promote sharing.
  • An easy way to run benchmarking analyses to determine the best pipeline or find errors.
  • With a centralized DB, like miRGeneDB, it is easy to incorporate new information after QC analysis.
  • Promotes development of tools for specific analyses, like visualization, de-novo discovery, QC reports, meta-analysis, as in other fields (RNAseq, variant calling...).
  • Promotes combining tools that focus on different parts, for example analyzing with miRAnalizer and then moving to R for further analysis.

All of that holds only if we set the minimum mandatory information that the file format should contain.

Information that should be included:

  • Commands used for the analysis (like in BAM/VCF)
  • Position on the genome/database where the small RNA is
  • The exact small RNA sequence can be recovered
  • QC of the alignment/reads
  • Type of small RNA: miRNA, tRNA, etc
  • Name of the small RNA, miRNA name, or isomiR, or similar. Accept multiple names.
  • Abundance, raw, normalized ...
  • What database used for the detection
  • If predicted or known
  • Multiple samples in the same file showing information that is specific for each sample if needed
  • Secondary structure of precursor
  • It should have an API/toolkit associated to promote usage
  • PASS/FAIL filter so that all data are kept in case of re-analysis (like BAM keeps unmapped reads or VCF keeps false variants)

(Please add other information the format should have if you agree on this)

I will develop this further as a comment in the issue. My first idea was a version of VCF files. They are focused on variation and can be adapted to small RNA variation using the different fields. Besides, the INFO column allows us to create any specific field needed, and there are already many tools supporting this file format, which can encourage people to use it.

Please, add any of your ideas and remember you can use the vote system on the issue page to give support, or, just reply. Feel free to modify this issue to add more information.

Deadline: May 21

Submit work to BOSC2018

Hi all

I plan to submit an abstract for BOSC2018: https://gccbosc2018.sched.com. I will send the draft and incorporate the feedback, but probably after submitting because the deadline is on Friday. I need to know if anybody is opposed to this. I'll add all the people who have been involved from the beginning, in order of contribution. BTW, any of you can present this whenever you want.

I'll use the affiliations listed here to add you as authors: http://mirtop.github.io. If you are not there, or want to change something, let me know.

cc:@lpantano @gurgese @ThomasDesvignes @mhalushka @mlhack @keilbeck @BastianFromm @ivlachos @TJU-CMC @phillipeloher

Road Plan for half of 2018 - Read please

@lpantano @gurgese @ThomasDesvignes @mhalushka @mlhack @keilbeck @BastianFromm @ivlachos @TJU-CMC @sbb25 @phillipeloher

Hi all,

This will be a rather long email, but please take 15 min to go over it. It will help us decide how to start spreading the word about this.

  • BOSC was great: people had a lot of questions and the format was received with open arms.
  • I got a lot of good ideas to help with the format and get people using it:
    • make a logo. We are holding the competition this week, so we are almost there.
    • create a separate repository with the format only, see here
    • submit to EDAM ontology and FAIRsharing, databases that keep track of formats and databases. I submitted to both and we are waiting for review. @BastianFromm maybe you want to submit mirGeneDB to FAIRSharing.org?
    • publish a very small paper with the format only. I am actually in favor of this. We can publish in F1000: it is open and allows very short papers. The main idea is to have something out soon so people become aware of the format; without a paper that is more difficult, and the current work with the tewari data is great but will need time. Can you tell me what you think?

The deadline for important modifications to the first version of the format is in 1 week (07/08/2018), so that we spread the word with the very first usable format.

For that I need all of you to go to the definition and open an issue for anything you think is important and that we don't have yet: anything you would need if you wanted to develop something on top of this format, in terms of query, visualization, re-mapping, or anything else you would need to know.

As a final idea, everybody recommended presenting this in as many places as possible, but I cannot do it alone. So even if it is just a slide in a talk, do it. Having a paper will help, but you can still do this at any time. If you go to a conference with a poster, mention this to people as well, so we can create an ecosystem of miRNA data analysis tools.

In summary:

  • need your thoughts about publishing the format in F1000 (very short paper), leaving the python tool and tewari data for the next publication.
  • need your feedback to make the format useful for everybody.

Thanks!

#18

bioinformatics tools

Hi @ThomasDesvignes , @keilbeck

I started to migrate my code here. It mainly annotates BAM files mapped against miRBase precursors according to some naming rules (https://github.com/miRTop/mirtop), in Python for now.

I will adapt the rules to the ones we decided on; for now it works.

I created a team for developers, so if you think somebody would like to help with the code, ask them to join and I will add them to the team.

I will try to get people from miRBase involved somehow. We really need them if we want this to be used by the community.

thanks for all your work!

Start and end on miRNA paralogs

Hi! I have a question for the miRTop community.

How would you define the "precursor start/end" for reads that can be assigned to paralogs (about 15% of described miRNAs have multiple copies with the exact same mature sequence)?

column4/5: start/end: precursor start/end as indicated by alignment tool

Hi and a Couple of Questions

Hello! I'm a new user of miRTop. I've started using it through the nf-core smrnaseq pipeline, providing computational support for an academic lab studying miRNA. As we've been examining the outputs of miRTop from that pipeline, we've come across a couple of things that I would like to understand better. I'm not sure if this is the best place to pose my questions, so please point me in the right direction if there is somewhere that would be more appropriate.

Here are my questions:

  1. iso_snv vs iso_snv_seed

One of the reads that is being fed into miRTop is TGGTGTTGTCCCCCCGAGTGGC. This is correctly being called a 2GA variant of TAGTGTTGTCCCCCCGAGTGGC (kshv-miR-K12-10a-3p), however it is odd that miRTop is calling this an iso_snv variant instead of an iso_snv_seed, since it is the second position affected. Is there some reason for this that I might be missing?
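For reference, here is a minimal Python sketch of how a single-nucleotide difference could be located and labeled by position. The position ranges are an assumption based on the mirGFF3 definition (seed 2-7, central offset 8, central 9-12, supplementary 13-17, iso_snv otherwise) and should be verified against the format version in use. The function names are hypothetical; this is not mirtop's implementation and does not explain the behavior reported above.

```python
# 1-based position ranges, assumed from the mirGFF3 definition.
SNV_LABELS = [
    (range(2, 8), "iso_snv_seed"),
    (range(8, 9), "iso_snv_central_offset"),
    (range(9, 13), "iso_snv_central"),
    (range(13, 18), "iso_snv_central_supp"),
]


def snv_positions(read, reference):
    """Return 1-based positions where two equal-length sequences differ."""
    if len(read) != len(reference):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(read, reference), start=1)
            if a != b]


def classify_snv(position):
    """Map a 1-based position to a label under the assumed ranges."""
    for positions, label in SNV_LABELS:
        if position in positions:
            return label
    return "iso_snv"


read = "TGGTGTTGTCCCCCCGAGTGGC"
reference = "TAGTGTTGTCCCCCCGAGTGGC"  # kshv-miR-K12-10a-3p, as quoted above
for position in snv_positions(read, reference):
    change = reference[position - 1] + ">" + read[position - 1]
    print(position, change, classify_snv(position))  # 2 A>G iso_snv_seed
```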

  2. Output Table Format (mirtop.tsv)

Another read, TGAGGTAGTAGGTTGTATAGTT, is called correctly as hsa-let-7a-5p. But it seems that since there are three different stem-loop records for hsa-let-7a that this sequence aligns equally well to (hsa-let-7a-1, hsa-let-7a-2, hsa-let-7a-3), each of which can be the parent of the mature sequence hsa-let-7a-5p, the output table ends up having three separate records for the read. And each sample seems to randomly get either the correct number of counts or a zero, as shown in the example below. It seems that we can get the correct counts for each read-miRNA combination by aggregating these rows, but I wanted to (1) make sure that this is an appropriate way to handle these rows and (2) understand why the output is this way.

UID               Read                    miRNA          Sample 1  Sample 2  Sample 3
iso-22-XKVLRYVPQ  TGAGGTAGTAGGTTGTATAGTT  hsa-let-7a-5p  172722    0         114082
iso-22-XKVLRYVPQ  TGAGGTAGTAGGTTGTATAGTT  hsa-let-7a-5p  0         124121    0
iso-22-XKVLRYVPQ  TGAGGTAGTAGGTTGTATAGTT  hsa-let-7a-5p  0         0         0

Any thoughts, suggestions, or guidance from someone on these issues would be greatly appreciated! Thank you!
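If summing the duplicated rows turns out to be appropriate, the aggregation itself could look like the Python sketch below. It only shows the mechanics on the three rows quoted above, assuming a tab-separated table with UID, Read and miRNA as key columns; the function name is hypothetical and the sketch makes no claim about why mirtop writes the rows this way.

```python
import csv
import io
from collections import defaultdict

# The three rows quoted above, written as a tab-separated table.
TABLE = (
    "UID\tRead\tmiRNA\tSample 1\tSample 2\tSample 3\n"
    "iso-22-XKVLRYVPQ\tTGAGGTAGTAGGTTGTATAGTT\thsa-let-7a-5p\t172722\t0\t114082\n"
    "iso-22-XKVLRYVPQ\tTGAGGTAGTAGGTTGTATAGTT\thsa-let-7a-5p\t0\t124121\t0\n"
    "iso-22-XKVLRYVPQ\tTGAGGTAGTAGGTTGTATAGTT\thsa-let-7a-5p\t0\t0\t0\n"
)


def aggregate_counts(handle, key_columns=("UID", "Read", "miRNA")):
    """Sum the per-sample counts of rows that share the same key columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    sample_columns = [c for c in reader.fieldnames if c not in key_columns]
    totals = defaultdict(lambda: [0] * len(sample_columns))
    for row in reader:
        key = tuple(row[c] for c in key_columns)
        for i, column in enumerate(sample_columns):
            totals[key][i] += int(row[column])
    return sample_columns, dict(totals)


samples, totals = aggregate_counts(io.StringIO(TABLE))
for key, counts in totals.items():
    print(key, dict(zip(samples, counts)))
```

For the rows above this yields a single record with counts 172722, 124121 and 114082 for the three samples.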

GFF3::source | GFF3::type

Hi all again!

cc: @lpantano @gurgese @ThomasDesvignes @mhalushka @mlhack @keilbeck @BastianFromm @ivlachos @TJU-CMC

I propose putting the database used by the tool in the second column: source

I propose these labels for the type column (3rd):

  • hairpin: this could be the parent
  • annotated: this could be the sequence annotated in the database, a child of hairpin. I am trying to avoid canonical, as we have discussed before. Maybe reference is another option, since the situation is similar to SNPs, where the reference was simply set by the first genomes sequenced, which doesn't mean it is the most abundant. miRNA is another label we could use, I guess.
  • isomiR/variant: this could be the detected sequence, a child of the previous one
  • other types of miRNAs?

Contribute more options or any thoughts you have about it! Thanks!

What is a miRNA?

  • "Do we need to define what a miRNA is before jumping into naming?"

Deadline to participate: May 1.

Suggested answers: a) we should do it before moving forward, b) we can move forward. Please add your thoughts about what you think will help to define real miRNAs.


"What is a miRNA?" is a big open question! When we think we've nailed it down, new evidence shows us new exceptions or variations. How should we deal with that? What should we do?

Everyone has a different vision because of their different interests in studying miRNAs. Some see miRNAs as evolutionary genetic entities and care a lot about their pure genetic nature and biogenesis; some see them as functional products and care less about how they were made than about what they do; others see them in a totally different way, or as a mix of all these visions.

Comments, suggestions, personal visions and interests: please share them here!

File organization in the Incubator

To make the information as accessible as possible, I think it would be best for each topic in the incubator to have its own folder containing separate files, instead of having everything in one file.
A typical topic folder would contain files with generic names:

  • Topic description (cf. my little list in the Incubator README)
  • Discussions
  • Agreement
  • Relevant literature?

That way, people who search for the final info can go straight to the Agreement file.

Maybe one file is better, I'm just not very familiar with GitHub yet.

mirgff3 and mirtop paper

Hi all,

@lpantano @gurgese @ThomasDesvignes @mhalushka @mlhack @keilbeck @BastianFromm @ivlachos @TJU-CMC @sbb25 @phillipeloher

We have a first draft for the paper:

https://docs.google.com/document/d/1m3ON4BSuaFIkiMsZT7ZxNFFZG0o7QoOjd_MZ7WMVoW0/edit?usp=sharing

You'll need to ask for permission to access it. I'll grant it as soon as I get the requests.

The plan is to submit to F1000 at the end of October. I'll send another reminder in two weeks.

To add a reference, just put the DOI or PubMed ID in parentheses and I'll add it later in the correct format.

There is a specific section for describing the tools we mention; it would be great if the main developers did that for their tools.

Thanks!

logo

We need some kind of logo; any help is welcome.

incubator/format/definition.md: Multiple variants

Are multiple variants allowed for the same reported isomiR sequence?

For example, isomiR Sequence TGGGGCGGAGCTTCCGGAGGC has the following possible locations just within miRBase22.
MIMAT0015058_2&hsa-miR-3180-3p&offsets|0|-1
MIMAT0018178_1&hsa-miR-3180&offsets|0|+2
MIMAT0015058_1&hsa-miR-3180-3p&offsets|0|-1
MIMAT0015058&hsa-miR-3180-3p&offsets|0|-1

If so, what would the recommendation of the above be? Should folks generating the GFF3 file be encouraged to report all four (including the 0|-1 and 0|+2 offsets) above within Column 9->Attributes->variant?
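To make the question concrete, the four candidate locations can be parsed and grouped as in the Python sketch below. The reading of the fields (accession, name, then 5' and 3' offsets separated by `|`) is an assumption inferred from the strings themselves, and the function name is hypothetical.

```python
from collections import defaultdict

HITS = [
    "MIMAT0015058_2&hsa-miR-3180-3p&offsets|0|-1",
    "MIMAT0018178_1&hsa-miR-3180&offsets|0|+2",
    "MIMAT0015058_1&hsa-miR-3180-3p&offsets|0|-1",
    "MIMAT0015058&hsa-miR-3180-3p&offsets|0|-1",
]


def parse_hit(text):
    """Split one hit string into accession, name and the two offsets."""
    accession, name, offsets = text.split("&")
    _label, offset_5p, offset_3p = offsets.split("|")
    return {
        "accession": accession,
        "name": name,
        "offset_5p": int(offset_5p),  # int() accepts "+2" and "-1"
        "offset_3p": int(offset_3p),
    }


# Group accessions by (name, offsets) to see how many distinct variant
# descriptions the same read sequence would need.
groups = defaultdict(list)
for hit in map(parse_hit, HITS):
    key = (hit["name"], hit["offset_5p"], hit["offset_3p"])
    groups[key].append(hit["accession"])

for key, accessions in groups.items():
    print(key, accessions)
```

Under this reading the four locations collapse to two distinct descriptions: hsa-miR-3180-3p with offsets 0|-1 (three accessions) and hsa-miR-3180 with offsets 0|+2 (one accession).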

GFF3::attributes

cc: @lpantano @gurgese @ThomasDesvignes @mhalushka @mlhack @keilbeck @BastianFromm @ivlachos @TJU-CMC

I'd like to discuss the last columns, since they will probably need more time, and I'd like a chance to get your thoughts before everybody goes on holidays, trips, conferences, etc. :)

  • ID: unique ID based on the sequence, like MINTmap has for tRNA: prefix-22-BZBZOS4Y1 (https://github.com/TJU-CMC-Org/MINTmap/tree/master/MINTplates). A good way to cross-map between different namings or future changes.
  • Name: miRNA name used in the database
  • Parent: hairpin precursor name
  • Alias or Dbxref: names from other databases, miRBase or miRgeneDB
  • Expression: raw counts separated by ,
  • Normalized_expression: normalization by the tool, if any. Same format as above
  • Filter: PASS or REJECT (this allows keeping all the data and selecting the features you really want to consider valid)
  • Variant: string similar to CIGAR showing the differences from the ref_miRNA
  • Target: to add other genomic positions where the sequence maps as well?
  • Seed: just to have nt 2-8 of the sequence

Any other attribute you normally use or would like to have?
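As a sketch of how a few of these attributes could be assembled into the last GFF3 column, the Python snippet below extracts the seed as nucleotides 2-8 and joins the values as `key=value` pairs. The helper names, the `key=value;` layout and the example counts are assumptions for illustration only; the ID and miRNA name are the ones quoted for this read in another issue on this page.

```python
def seed(sequence):
    """Return nucleotides 2-8 (1-based, inclusive) of a mature sequence."""
    return sequence[1:8]


def format_attributes(attributes):
    """Join key/value pairs as a GFF3 attributes column (key=value;...)."""
    parts = []
    for key, value in attributes.items():
        if isinstance(value, (list, tuple)):
            # Per-sample values, e.g. raw counts, are joined with commas.
            value = ",".join(str(v) for v in value)
        parts.append(f"{key}={value}")
    return ";".join(parts)


read = "TGAGGTAGTAGGTTGTATAGTT"
attributes = {
    "ID": "iso-22-XKVLRYVPQ",
    "Name": "hsa-let-7a-5p",
    "Parent": "hsa-let-7a-1",
    "Expression": [10, 0, 25],  # made-up raw counts for three samples
    "Filter": "PASS",
    "Seed": seed(read),
}
print(format_attributes(attributes))
```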

miRNA/isomiR naming

Hi all again,

cc: @ThomasDesvignes @mhalushka @mlhack @keilbeck @BastianFromm @ivlachos @TJU-CMC

Finally discussing naming proposals. Feel free to adapt the issue to ask more questions or clarify them.

Deadline: May 16.

Naming goals

In order to set up a functional and coherent annotation system, I think that we should first ask ourselves: what should a microRNA annotation nomenclature accomplish? (mlhack)

Say if you agree or not:

  1. Correspondence between mature and pre-microRNA names: Perfect correspondence between mature and hairpin names, i.e. if I have the name of the hairpin --> I can know the name of the mature sequences and vice-versa (this is not possible right now with miRBase because if multiple copies exist of a microRNA, no coherent nomenclature rules exist in miRBase)
  2. Definition of the canonical sequence: should define and name the canonical sequence & point out if it is a constitutive canonical (same sequence in all known tissues) or regulated canonical (depends on the tissue)
  3. Guide and passenger strand: If a clear distinction between guide and passenger strand can be made at a functional level, this must be reflected in the naming (with the good old ‘*’ for example)
  4. Evolutionary information and family naming (I): The naming should include information about paralogues and homologues (like in miRGeneDB): to achieve this, a (evolutionary family) seed definition is needed.
  5. Evolutionary information and family naming (II): If the seed region changes --> the function of the microRNA changes: should the microRNAs that are homologous but having different functions (regulate different genes because they have different seeds) receive the same name?

Other technical issues that could be improved:

Say if you agree or not:

A) Length of the hairpin sequences
Right now, in miRBase each hairpin sequence is pre-microRNA + X nt flanking sequences: X can be anything and is not defined by miRBase. This number needs to be fixed (5 nt, 10 nt, 15 nt, whatever).

miRNA

miRNA defined in miRBase: XXXXXXXXXXXXXXXX hsa-miR-X-5p

How to name if:

  1. The precursor of the miRNA, and the 3p miRNA
  2. Another copy of the precursor (identical)
  3. Another copy of the precursor is not identical, only the mature miRs-5/3p
  4. Another copy of the precursor is not identical, only miR-5p
  5. Another copy of the precursor is not identical, and miR-5p is on the 3p arm this time

In case of novel ‘mirna’ detected by sequencing, how to name if:

i) NGS + structure (‘looks like’ a microRNA due to 5’ homogeneity and Drosha/Dicer pattern),
ii) phylogenetic footprinting (miRNA is conserved),
iii) Ago loading,
iv) impact of Drosha/Dicer knockout, and
v) positive functional assay.

isomiR

Say if you agree or not, or offer other option:

  1. Should all isomiRs be annotated?
  2. Should we use labels for each type of variant?
  3. Other ideas?

Imagine the previous miRNA, how to annotate:

  1. InDels XXXXXX-XXXXXXXX
  2. template addition at 5’: xxXXXXXXXXXXXX
  3. template addition at 3': XXXXXXXXXXXXxx
  4. non-template addition at 3’: XXXXXXXXXXXXYY
  5. nucleotide substitution: XXXXXXYXXXXX

Star or not star?

Here I'd like to open the discussion on the use of the "star" symbol.

Originally, when we thought that only one arm of the hairpin was functional, the star symbol (*) was used to convey that this strand was a non-functional by-product of the functional miRNA biogenesis. But the star strand denomination '*' is no longer approved by any nomenclature consortium, including miRBase since April 2011 (Release 18), because many miRNA genes were then shown to produce mature miRNAs from both sides of the hairpin, and because expression levels can fluctuate: this denomination relies heavily on sequencing depth and the nature of the studied tissue/stage/etc. But in some cases of extreme differences in expression levels, this additional symbol can convey potentially useful functional information.

So, should we simply follow the gene nomenclature consortia and not support this symbol? Or try to find an agreement and define rules for using this symbol to make this symbol consistent and trustworthy?

For example, at what level of arm selection can we say that a strand is likely only a by-product? Fromm et al (2015) propose a one-fold change. To me this appears not strong enough of a difference to call the second strand star strand, given the non-representation of the complete expressed miRNAome of an organism and the sequence bias known in miRNA-Seq library preparation. For instance I would personally be more confident in a 10-fold change and a good representation of tissue types in the organism considered to call one strand the star strand.

If you have any comments or propositions to try to clarify this situation, please participate!

2019 mirtop 1-day meeting

Hi all,

We were discussing organizing a one-day or half-day event during a conference that people may attend this year.

To maximize the number of people, could you mention the conferences you are attending, or would attend if we do this?

We are proposing these; please reply with more options:

https://www.embo-embl-symposia.org/symposia/2019/EES19-10/ Heidelberg/October
http://www.ashg.org/2019meeting/ Houston/October
http://www.keystonesymposia.org/19D7 Korea/April

I will give it a couple of days and then send a doodle to see which place gets the most votes!

Thanks!

CC:
@miRTop/everybody
@lpantano @gurgese @ThomasDesvignes @mhalushka @mlhack @keilbeck @BastianFromm @ivlachos @TJU-CMC @sbb25 @phillipeloher @carriewright11 @arunhpatil

isomir naming

Here, people can ask about and discuss the isomiR naming document in this repository:

https://github.com/miRTop/incubator/blob/master/isomirs/isomir_naming.md

It will end with the first version of the isomiR naming and file format for use in bioinformatics tools and for mention in papers.

If you want to join and provide your ideas on isomiR naming and things to pay attention to, please just comment in this issue.

Re-analysis of Extracellular RNA Communication Consortium data

Dear miRTop colleague,

We are inviting you all to participate in an isomiR research project. We have been talking for some time about isomiR naming conventions. One of the potential challenges with isomiRs is that technical factors such as library preparation methods may skew the isomiR families. Marc is publishing on this in an upcoming Genome Research paper. We think it may be valuable to determine the extent to which biology or technical reasons impact on isomiRs of miRNAs.

To that end we have identified a dataset that we feel would be useful to solve this problem. Dr. Muneesh Tewari has graciously agreed to share data from the recent Extracellular RNA Communication Consortium (ERCC) study comparing RNA-seq of miRNAs from 9 different laboratories. The preprint is here and we encourage you to read it: http://www.biorxiv.org/content/biorxiv/early/2017/05/17/113050.full.pdf. The datasets are here, but are not yet released to the public: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?token=whipakmajrwprcv&acc=GSE94586. The group used a pooled plasma sample and a set of synthetic RNA libraries with multiple different library preparation methods. We believe we can use the 15 sequencing runs of the same plasma samples to ask some important questions about the etiology of isomiRs.

We can specifically address the extent of different library preparation kits on isomiR formation. We can also analyze the raw .fastq files with different miRNA alignment software to determine the role different methodologies have on isomiR calls. You likely have more ideas of what we can do with the data.

We propose to have a conference call to discuss the project for those of you who are interested. We can do some brain-storming, discuss the objectives, the methods, and settle on a division of labor based on everyone’s areas of expertise. We would like to publish a manuscript on the findings under the authorship of “The miRTop Consortium.” We believe this will give us credibility when we propose new naming conventions.

Please let us know your interest in such an endeavor. Participate in the doodle to agree on a date for the conference call: https://beta.doodle.com/poll/f59aetuganmu7hau.

Sincerely,
Lorena Pantano and Marc Halushka

Looking for developers

Hi @miRTop/everybody

Please, if you are interested in participating in the development part of this project, chime in here.

We'll start organizing this in September, and we'll work for 3 months to have a first version of the API.

cheers

incubator/format/definition.md: Definition clarifications

Not sure if opening an issue here is appropriate or if there is a better place to put these questions. No real issues to report in this case, just some clarification/documentation questions.

header:

  • what is 'source ontology', example?
  • for 'sample names used in attribute' - does this include items such as SRA accessions?
  • 'small RNA GFF version ' - is that the format of this specification? Or is this a different version?
  • Filter tags, really like the concept. Think we should give an example that's not just Pass/Fail. Such as CAUTION(ambiguous sequence, may not be an isomiR)

Attributes:

  • Read Name - is this optional when the attribute->expression flags are used?
  • Parent - is this precursor name the same as Column 1 precursor name? If so, is it redundant?
  • Hits - what is defined as a 'hit'? Is this different than Attributes->Expression?
  • Variant - what happens if there are multiple variants for the same isomiR? Will open a separate issue on this since it's more involved

Studying the isomiR accuracy with NGS technology

@lpantano @gurgese @ThomasDesvignes @mhalushka @mlhack @keilbeck @BastianFromm @ivlachos @TJU-CMC @sbb25 @phillipeloher @carriewright11 @arunhpatil

Hi all,

I am opening a new thread to organize the discussion of the tewari reanalysis.

Since this is now a re-analysis of a lot of data, I changed the tentative title of the project.

These are the minutes for today: https://github.com/miRTop/incubator/blob/master/tewari/minutes.md#01-17-2019

Next meeting 01-31-2019, 10am EST time, 4pm GMT+1 time. (https://zoom.us/j/553765969)

Thanks!

GFF3::Attributes::Variants

cc: @lpantano @gurgese @ThomasDesvignes @mhalushka @mlhack @keilbeck @BastianFromm @ivlachos @TJU-CMC @sinanugur @Bastami @haebhardt

Let's discuss the Variant attribute, which will give the information about the type of isomiR. I think the main idea is to get a CIGAR/TAG-like string that can be parsed and gives the full information about the change.

Some previous discussion is here: https://github.com/miRTop/incubator/blob/master/isomirs/isomir_naming.md

Does anybody have more ideas for this? For instance, how would you name this isomiR:

ref --AAAAAAAAAAAAAAAAAAAA
iso AAAACAAAAAAAAAAAAAA--TT

This isomiR starts 2 nucleotides before the reference and ends 2 nucleotides before it as well. It has TT as a nucleotide addition and a nucleotide change A->C at position 5.

I think there are two general ways to describe this, TAG-wise or CIGAR-like; please propose others if you work with different ones. We can use both and define an attribute for each of them as well, as @gurgese just mentioned in another issue.

TAG-wise or similar (it could be more general as well):

miRNA-5p.AAs.5AC.aa.TTe

CIGAR-like or similar, or just use the CIGAR exactly as in the BAM file:

AAI2M5C19MAADTT

Either way we need to define it exactly. So please, propose one example of what you use or would like to have or you are missing, and I'll try to merge them all and propose the final definition that we can discuss further for minor details.

Cheers
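Whatever string syntax is chosen, the events it has to encode can be extracted from a gapped pairwise alignment. The Python sketch below is one possible way to do that, not an agreed definition: the function name and the returned keys are hypothetical, and the example is an adaptation of the ref/iso pair above, padded so that both rows have the same length. Without the precursor sequence it cannot tell templated from non-templated 3' nucleotides, and internal indels are not reported.

```python
def describe_isomir(ref_aln, iso_aln):
    """Summarize end changes and substitutions from two gapped rows.

    Both rows must have the same length and use '-' for gaps.
    """
    if len(ref_aln) != len(iso_aln):
        raise ValueError("aligned rows must have the same length")
    columns = list(zip(ref_aln, iso_aln))

    # Leading columns containing a gap describe the 5' end.
    left = 0
    while left < len(columns) and "-" in columns[left]:
        left += 1
    extended_5p = sum(1 for r, _ in columns[:left] if r == "-")
    trimmed_5p = sum(1 for _, i in columns[:left] if i == "-")

    # Trailing columns containing a gap describe the 3' end.
    right = len(columns)
    while right > left and "-" in columns[right - 1]:
        right -= 1
    tail = columns[right:]
    extra_3p = "".join(i for r, i in tail if r == "-")
    trimmed_3p = sum(1 for _, i in tail if i == "-")

    # Substitutions are reported with 1-based positions on the isomiR.
    substitutions = []
    iso_position = extended_5p
    for r, i in columns[left:right]:
        if i != "-":
            iso_position += 1
        if "-" not in (r, i) and r != i:
            substitutions.append((iso_position, r, i))

    return {
        "extended_5p": extended_5p,
        "trimmed_5p": trimmed_5p,
        "trimmed_3p": trimmed_3p,
        "extra_3p": extra_3p,
        "substitutions": substitutions,
    }


# Hypothetical equal-length version of the example alignment.
ref = "--" + "A" * 20 + "--"
iso = "AAAAC" + "A" * 15 + "--" + "TT"
print(describe_isomir(ref, iso))
```

For this example the summary reports two extra nucleotides at the 5' end, an A to C substitution at isomiR position 5, two nucleotides trimmed at the 3' end, and TT as extra 3' nucleotides, matching the description given in the issue.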
