
comparison-analyzer's Introduction

Hi 👋 My name is Heidi Holappa

Software developer

I work for the Finnish Social Insurance Institution with a focus on wellbeing data and resource and authorization services. The technological landscape I work with is diverse and challenging. The demand to produce secure and reliable quality products is high. I am fortunate to work with a team of motivated and driven specialists working hard to provide as much value to our customers as possible. Some key technologies I work with on a daily or weekly basis: Java, Spring Boot, Spring Security, Spring Batch, FHIR HAPI, OpenShift, and the Atlassian product family.

Background:

  • BSc in Computer Science (GPA 4.92).
    • MSc in Computer Science ongoing
    • Focusing on Software development, embedded systems and data structures and algorithms.
  • MSc in Social Services with 10+ years of experience in the NGO field. Work experience:
    • data management, BI-tools and reporting
    • project documentation
    • cooperation with ICT-stakeholders
    • international partnerships
  • 🌍 I'm based in Loviisa, Finland
  • ✉️ You can contact me at [email protected]
  • 🧠 passionate about
    • software development
    • Frameworks such as .NET, Spring Boot
    • embedded systems and
    • all things RISC-V
  • Geeky facts:
    • Favorite languages: Python, Java, Haskell, Verilog HDL
    • Favorite IDE: IntelliJ IDEA
    • Idols: Alan Turing, Alonzo Church, Edsger Dijkstra, David Patterson and John Hennessy

Skills

Languages:

Java JavaScript Python rlang C#

Frameworks:
Flask Django .NET .NET

Other:

Git Bootstrap React Material UI NodeJS Express MongoDB PostgreSQL MySQL Heroku Photoshop



comparison-analyzer's Issues

As a user I can store a reference GTF and a comparison GTF to databases

Initial concept

  • Expand current features to accept both a gffcompare-output-gtf-file and a reference-gtf-file. The arguments could be -i or --input for the gffcompare-output-gtf-file and -r or --reference for the reference-gtf-file
  • Content from each file is required
  • Content from each file is stored into a separate database under the same conditions that are applied currently
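
The proposed arguments could be sketched with Python's standard argparse module as follows. This is only an illustration of the concept: the function name build_parser is hypothetical, and marking both arguments as required is an assumption based on the note that content from each file is required.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the two GTF inputs described above."""
    parser = argparse.ArgumentParser(prog="compana.py")
    # -i / --input: the GTF file produced by gffcompare
    parser.add_argument("-i", "--input", required=True,
                        help="gffcompare output GTF file")
    # -r / --reference: the reference GTF file
    parser.add_argument("-r", "--reference", required=True,
                        help="reference GTF file")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.input, args.reference)
```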

#1 Familiarize with gffutils

gffutils provides tools for storing gtf-file content into a relational database (sqlite3). This tool can be used in the creation of the wanted feature.

See: gffutils

Related user story: #1

Create more extensive test-sample-data

Details

The current sample-data consists only of data related to one transcript. It's good to keep the test dataset small, but it should also be more extensive. Create a new dataset with 3-4 transcripts. Some unittests will break. Fix them.

Update

A more extensive sample dataset is no longer needed. Closing issue.

As a user I can select whether I want stdout or a file to be created

Initial concept

A user can use an argument to specify an output file. The argument is -o or --output. If no argument is given, the output is written to stdout.

Revision

After consideration, the best solution seemed to be to store the stdout output to a file by default. This has been implemented. Closing issue.

Improve unittests for compana.py

Details

  • Initial tests added. An open issue is that the tests are more like integration tests testing the whole application. This might mask underlying issues.
  • Consider mocking called classes and methods to more precisely test that everything is called properly.

Update 3rd of July 2023

  • Code refactored, new unittests written
  • A persisting problem is that testing compana.py covers parts of the code that might not have unittests of their own.
  • Reserve time for writing missing unittests after this task is finished. Create new issues as needed.

#58 Write documentation for new code

Details

  • Write documentation explaining actions in alignment parsing and indel identification
  • Write documentation for BAM-processing
  • Write documentation for the pipeline

As a user I can get an output with offsets for a specified class code

Initial concept

Initial plan is to calculate the offset for exons. The steps for this could be:

  • the user can give the desired class code as an argument.
  • gffcompare-output-GTF-file: access a transcript containing the selected class code and extract the gene_id (to locate the exons in the reference file related to the gene)
  • gffcompare-output-GTF-file: access exons related to that transcript
  • compare the location of matching exons in both files (reference-GTF-file and gffcompare-output-GTF-file) and calculate the offset for each exon as a tuple (s, e), in which s = offset at start location and e = offset at end location

The output should have a row for each transcript with the following items:

  • Transcript identification
  • A list of offsets

An example:

<transcript id>;[ (s_1, e_1), (s_2, e_2), ..., (s_n, e_n)]

The output format and delimiter(s) still need more planning.
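
A minimal sketch of producing one such row is shown below. The function name format_offset_row, the default delimiter and the exact spacing inside the brackets are assumptions made for illustration, since the final format is still open.

```python
from typing import List, Tuple


def format_offset_row(transcript_id: str,
                      offsets: List[Tuple[int, int]],
                      delimiter: str = ";") -> str:
    """Format one output row: transcript id, delimiter, list of (s, e) offsets."""
    pairs = ", ".join(f"({s}, {e})" for s, e in offsets)
    return f"{transcript_id}{delimiter}[{pairs}]"


if __name__ == "__main__":
    # Hypothetical transcript id and offsets, for illustration only
    print(format_offset_row("transcript-1", [(2, 0), (-5, -10)]))
```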

As a user I am shown statistics on class codes

Key functionalities

  • by giving the argument -s or --stats the user is shown statistics on the given gtf-file. Syntax:
compana.py -i <filename> -s

Output

The output should be

Class code           n
<code>              <n>
<code>              <n>
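
One way to gather these counts is sketched below using only the standard library. It assumes that each transcript row of the gffcompare output GTF carries a class_code attribute in its ninth column; the function names count_class_codes and format_stats and the column widths are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Matches an attribute such as: class_code "j";
CLASS_CODE = re.compile(r'class_code "([^"]+)"')


def count_class_codes(lines: Iterable[str]) -> Counter:
    """Count class codes found on transcript rows of a GTF file."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript rows are counted
        match = CLASS_CODE.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


def format_stats(counts: Counter) -> str:
    """Render the counts as a two-column table."""
    rows = [f"{'Class code':<20}{'n':>5}"]
    for code, n in sorted(counts.items()):
        rows.append(f"{code:<20}{n:>5}")
    return "\n".join(rows)
```

With a file on disk this could be used as `print(format_stats(count_class_codes(open(filename))))`.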

#15 alter the code to do offset comparison as a list.

Initial idea

  • The transcript comparison should be its own function
  • The function takes two lists as an input
  • The function returns a list with smallest offsets computed for each value on the list.

Definitions

  • offset: an exon can have an offset in the start index and/or in the end index. The offset is a two-element tuple with the values (start_idx - reference.start_idx, end_idx - reference.end_idx)
  • total offset: the sum of the absolute start and end offsets: total_offset = |start_offset| + |end_offset|
  • smallest offset: the offset with the smallest total offset
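
The definitions above can be sketched as a small function. This is an illustration rather than the project's implementation: the names smallest_offsets and total_offset are hypothetical, exons are assumed to be (start_idx, end_idx) tuples, and every exon is compared against every reference exon.

```python
from typing import List, Tuple

Exon = Tuple[int, int]    # (start_idx, end_idx)
Offset = Tuple[int, int]  # (start offset, end offset)


def total_offset(offset: Offset) -> int:
    """Total offset: |start_offset| + |end_offset|."""
    return abs(offset[0]) + abs(offset[1])


def smallest_offsets(exons: List[Exon], reference: List[Exon]) -> List[Offset]:
    """For each exon, return the offset to the reference exon with the smallest total offset."""
    result = []
    for start, end in exons:
        candidates = [(start - r_start, end - r_end)
                      for r_start, r_end in reference]
        # min() raises ValueError if the reference list is empty
        result.append(min(candidates, key=total_offset))
    return result
```

Comparing every pair is quadratic in the number of exons, which is likely acceptable for the exons of a single transcript.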

Related user story: #15

As a user I can store gtf-content to a relational database

Initial concept

A user should be able to store data from CLI. The workflow could be:

  • The user calls the application with a command line instruction (e.g. python3 <filename> <arguments>)
  • As an argument the user provides the path to the file
  • The output is a db-file in the directory where the given file is located.

Directory example

The given input-file does not need to be located in the same directory as the utility.

|
└─── util
            util.py
|
|
└─── file location
            <file>.gtf
            <file>-ca.db

The initial idea for the functionality:

flowchart LR
    A{Start util} --> B[db exists?]
    B -->|Yes| C{Remove db?}
    C -->|No| D[End]
    C -->|Yes| E[delete db]
    E --> F[Run util]
    B -->|No| F
    F --> D
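
The decision flow in the chart can be sketched with pathlib. The helper names db_path_for and prepare_db are hypothetical, and the boolean remove_existing stands in for however the user's answer to "Remove db?" is collected.

```python
from pathlib import Path


def db_path_for(gtf_path: Path) -> Path:
    """Place <file>-ca.db next to the given <file>.gtf."""
    return gtf_path.with_name(f"{gtf_path.stem}-ca.db")


def prepare_db(gtf_path: Path, remove_existing: bool) -> bool:
    """Return True if the utility should run and build the database."""
    db_path = db_path_for(gtf_path)
    if db_path.exists():
        if not remove_existing:
            return False  # db exists and is kept: end
        db_path.unlink()  # db exists and is removed: delete, then run
    return True           # no db present (or it was deleted): run
```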

#58 write code to infer whether an indel or a mismatch happened at a given window

Details

Some initial code exists in the file cigar-parser.py. However, pysam's get_aligned_pairs might prove more effective for this use case. First get familiar with pysam's functionalities and then implement a solution for the given task. See documentation for sprint-6 for more details.

Update 20.6.2022: For now we will leave mismatches out and focus on indels.
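
As a rough sketch of the idea, the function below counts inserted and deleted bases inside a reference window. It takes a plain list of (query_pos, ref_pos) tuples in the shape that pysam's get_aligned_pairs returns, where None marks a gap on one side, so it runs without pysam. The name count_indels and the half-open window convention are assumptions, and soft-clipped bases or skipped reference regions would need separate handling.

```python
from typing import List, Optional, Tuple

AlignedPair = Tuple[Optional[int], Optional[int]]  # (query_pos, ref_pos)


def count_indels(pairs: List[AlignedPair],
                 window_start: int,
                 window_end: int) -> Tuple[int, int]:
    """Count (insertions, deletions) in the reference window [window_start, window_end)."""
    insertions = deletions = 0
    last_ref = None  # most recent reference position seen so far
    for query_pos, ref_pos in pairs:
        if ref_pos is not None:
            last_ref = ref_pos
        # Inserted bases have no reference position of their own,
        # so they are attributed to the preceding reference position.
        if last_ref is None or not window_start <= last_ref < window_end:
            continue
        if ref_pos is None:
            insertions += 1   # base present in the read only
        elif query_pos is None:
            deletions += 1    # base present in the reference only
    return insertions, deletions
```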

Finish first: #84
