heidi-holappa / comparison-analyzer

Python 98.81% HTML 1.19%

comparison-analyzer's Introduction

Hi 👋 My name is Heidi Holappa

Software developer

I work for the Finnish Social Insurance Institution with a focus on wellbeing data and resource and authorization services. The technological landscape I work with is diverse and challenging. The demand to produce secure and reliable quality products is high. I am fortunate to work with a team of motivated and driven specialists working hard to provide as much value to our customers as possible. Some key technologies I work with on a daily or weekly basis: Java, Spring Boot, Spring Security, Spring Batch, FHIR HAPI, OpenShift, and the Atlassian product family.

Background:

BSc in Computer Science (GPA 4.92).
- MSc in Computer Science ongoing
- Focusing on software development, embedded systems, and data structures and algorithms.
MSc in Social Services with 10+ years of experience in the NGO field. Work experience:
- data management, BI tools and reporting
- project documentation
- cooperation with ICT stakeholders
- international partnerships

🌍 I'm based in Loviisa, Finland
✉️ You can contact me at [email protected]
🧠 passionate about
- software development
- Frameworks such as .NET, Spring Boot
- embedded systems and
- all things RISC-V
Geeky facts:
- Favorite languages: Python, Java, Haskell, Verilog HDL
- Favorite IDE: IntelliJ IDEA
- Idols: Alan Turing, Alonzo Church, Edsger Dijkstra, David Patterson and John Hennessy


comparison-analyzer's Issues

Create initial DoD for project

#58 Iterate through created BAM-files to see whether an insertion/deletion exists at a given coordinate

#14 Update Instruction manual

As a user I can store a reference GTF and a comparison GTF to databases

Initial concept

Expand the current features to accept both a gffcompare output GTF file and a reference GTF file. The arguments could be -i or --input for the gffcompare output GTF and -r or --reference for the reference GTF.
Content from each file is required.
Content from each file is stored in a separate database, under the same conditions that currently apply to the single database.
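
The two arguments described above could be declared with argparse roughly as follows. This is a minimal sketch, not the project's actual parser: the function name build_parser, the help texts and the example file names are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two required GTF arguments."""
    parser = argparse.ArgumentParser(
        description="Compare a gffcompare output GTF against a reference GTF."
    )
    parser.add_argument(
        "-i", "--input", required=True,
        help="path to the gffcompare output GTF file",
    )
    parser.add_argument(
        "-r", "--reference", required=True,
        help="path to the reference GTF file",
    )
    return parser


# Example invocation with made-up file names.
args = build_parser().parse_args(["-i", "gffcompare.gtf", "-r", "reference.gtf"])
print(args.input, args.reference)
```

Marking both arguments as required lets argparse print a usage message when either file is missing.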

#15 Extract transcripts with given class code from gffcompare-gtf database

#50 extract end exon positions for a given offset

#1 Familiarize with gffutils

gffutils provides tools for storing GTF file content in a relational database (sqlite3). This tool can be used to build the wanted feature.

See: gffutils

Related user story: #1

#1 Add ArgParser and create argument for giving full path to gtf-file

Related user story: #1

#9 Provide an output from created statistic data

Create more extensive test-sample-data

Details

The current sample-data consists only of data related to one transcript. It's good to keep the test dataset small, but it should also be more extensive. Create a new dataset with 3-4 transcripts. Some unittests will break. Fix them.

Update

No more need for more extensive sample test. Closing issue.

As a user I can find interesting results for multiple offsets at one run

#14 Add argument for specifying reference GTF file

Task

Add argument -r or --reference
Argument is required

Related user story: #14

#1 Import os to parse file and directory paths

Related user story: #1

#58 Get familiar with pysam's get_aligned_pairs method

#58 Filter BAM-files with reads mapped to a selected transcript

Write unittests to verify that the program works correctly.

Create application structure

As a user I can specify the level of logging details shown and store logging details to a file

Tasks

Add logging library
Refactor logging into categories (info, debug, warning,...)
Add argument for enabling debug logging
Add argument for writing log information to a file (append, no overwrite)
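
The logging tasks above could be approached along these lines with the standard logging module. This is a sketch under assumptions: the logger name "compana", the function name configure_logging and the message format are illustrative and not taken from the project.

```python
import logging
from typing import Optional


def configure_logging(debug: bool = False, log_file: Optional[str] = None) -> logging.Logger:
    """Return a logger that writes to the console and optionally appends to a file."""
    logger = logging.getLogger("compana")  # hypothetical logger name
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    handlers = [logging.StreamHandler()]
    if log_file:
        # mode="a" appends to an existing log file instead of overwriting it
        handlers.append(logging.FileHandler(log_file, mode="a"))
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


log = configure_logging(debug=True)
log.debug("shown only when debug logging is enabled")
log.info("shown at the default level as well")
```

The debug flag and the log file path would typically come from two extra command line arguments.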

As a user I can select whether I want stdout or a file to be created

Initial concept

A user can use an argument to specify an output file. The argument is -o or --output. If no argument is given, output will be given as stdout.

Revision

After consideration the optimal solution seemed to be to store stdout to a file as a default. This has been implemented. Closing issue.

#50 extract exon's end and start positions for a given offset

#15 Compute offsets for each exon within the transcript based on matching exons in the reference data

#14 Create (if necessary) a database from the specified file

Task

Expand current model to manage creation / uploading of two databases

Related user story: #14

As a user I can verify whether an insertion or a deletion exists at a given coordinate.

Improve unittests for compana.py

Details

Initial tests added. An open issue is that the tests behave more like integration tests that exercise the whole application. This might mask underlying issues.
Consider mocking the called classes and methods to test more precisely that everything is called properly.

Update 3rd of July 2023

Code refactored, new unittests written.
A persisting problem is that testing compana.py covers parts of the code that might not have unittests of their own.
Reserve time for writing the missing unittests after this task is finished. Create new issues as needed.

#15 Extract related transcripts (based on transcript_id) from reference-gtf-database

#15 Find out what a strand is in the data files

Related user story: #15

#50 extract characters in a given distance from reference-FASTA

Related user story: #50
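
One possible reading of this task is sketched below: return a fixed number of characters starting at a 0-based position. The function name characters_at and its parameters are hypothetical, and the issue does not specify the exact semantics of "distance". A plain dict of strings stands in for the reference here; a pyfaidx Fasta object is expected to support the same genome[chromosome][start:end] indexing, but that is an assumption to verify against the pyfaidx documentation.

```python
def characters_at(genome, chromosome: str, position: int, count: int = 2) -> str:
    """Return `count` characters starting at the 0-based `position`."""
    return str(genome[chromosome][position:position + count])


# A plain dict of strings stands in for an indexed FASTA file.
toy_genome = {"chr1": "ACGTACGTAC"}
print(characters_at(toy_genome, "chr1", 4))
```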

#1 Create sqlite3 database from imported data using gffutils

Functionality

The user provides the full path to the GTF file as a CLI argument.
A database is created by using gffutils.
The database is created in the same directory as the GTF file. The database filename is <gtf-filename w/o extension>-ca.db
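
The naming rule and the database creation could be sketched as follows. The helper names db_path_for and create_database are hypothetical, and the gffutils.create_db options shown (force, keep_order, merge_strategy) are one plausible configuration, not necessarily the one the project uses.

```python
import os


def db_path_for(gtf_path: str) -> str:
    """Return '<gtf-filename w/o extension>-ca.db' in the GTF file's directory."""
    directory, filename = os.path.split(os.path.abspath(gtf_path))
    stem, _extension = os.path.splitext(filename)
    return os.path.join(directory, f"{stem}-ca.db")


def create_database(gtf_path: str, overwrite: bool = False):
    """Create the sqlite3 database with gffutils (third-party dependency)."""
    import gffutils  # imported lazily so db_path_for works without gffutils

    return gffutils.create_db(
        gtf_path,
        dbfn=db_path_for(gtf_path),
        force=overwrite,         # replace an existing database file
        keep_order=True,         # assumption: preserve attribute order
        merge_strategy="merge",  # assumption: merge features with duplicate IDs
    )


print(os.path.basename(db_path_for("/data/sample.gtf")))
```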

Related user story: #1

#58 Write documentation for new code

Details

Write documentation explaining actions in alignment parsing and indel identification
Write documentation for BAM-processing
Write documentation for the pipeline

#15 Add an argument for specifying one class code

Task

Add a new argument -c or --class-code
Argument is optional (at this stage)

Related user story: #15

#9 Write a functionality to return class codes from db

#15 offset statistics file: add strand (first find out what a strand is)

Details

Add strand to offset output information
First find out what a strand is

Related user story: #15

#15 Extract ref_transcript_id from each transcript and related exons

Bugfix: Fix issue with offset for mismatches not mapping correctly

Found an uncommon case in which offsets are not mapped correctly: it occurs when there are mismatching alignments in both the reference data and the aligned data.

#15 formulate definitions and formal explanations to show why offset calculation works

Intuitively the offset computation works, but a formal definition and explanation should be formulated. Add the definition and explanation into the documentation. Consider additionally creating a proof.

Related user story: #15

As a user I can get an output with offsets for a specified class code

Initial concept

Initial plan is to calculate the offset for exons. The steps for this could be:

The user can give the desired class code as an argument.
gffcompare-output-GTF file: access a transcript with the selected class code and extract the gene_id (to locate the exons in the reference file related to the gene)
gffcompare-output-GTF file: access the exons related to that transcript
Compare the location of matching exons in both files (reference-GTF file and gffcompare-output-GTF file) and calculate the offset for each exon as a tuple (s, e), in which s = offset at start location and e = offset at end location

The output should have a row for each transcript with the following items:

Transcript identification
A list of offsets

An example:

<transcript id>;[ (s_1, e_1), (s_2, e_2), ..., (s_n, e_n)]

The output format and delimiter(s) need to be planned more
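
A small sketch of the offset tuples and the example row format follows. It assumes that exons are already paired with their matching reference exons by list index and that an offset is computed as exon coordinate minus reference coordinate. The function names and the coordinates are made up for illustration.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) coordinates of one exon


def exon_offsets(exons: List[Exon], reference_exons: List[Exon]) -> List[Tuple[int, int]]:
    """Offset (s, e) of each exon against the reference exon at the same index."""
    return [
        (start - ref_start, end - ref_end)
        for (start, end), (ref_start, ref_end) in zip(exons, reference_exons)
    ]


def format_row(transcript_id: str, offsets: List[Tuple[int, int]]) -> str:
    """Render one output row as '<transcript id>;[(s_1, e_1), ...]'."""
    return f"{transcript_id};{offsets}"


offsets = exon_offsets([(100, 200), (305, 400)], [(100, 198), (300, 400)])
print(format_row("transcript1", offsets))  # transcript1;[(0, 2), (5, 0)]
```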

As a user I am shown statistics on class codes

Key functionalities

By giving the argument -s or --stats, the user is shown statistics on the given GTF file. Syntax:

compana.py -i <filename> -s

Output

The output should be

Class code           n
<code>              <n>
<code>              <n>
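
The counting and the table above could be produced along these lines. This sketch works on a plain list of class codes; in the application the codes would presumably be read from the database, which is not shown here. The function names and column widths are assumptions.

```python
from collections import Counter
from typing import Iterable


def count_class_codes(class_codes: Iterable[str]) -> Counter:
    """Count how many times each class code occurs."""
    return Counter(class_codes)


def format_stats(counts: Counter) -> str:
    """Render the counts as a two-column table sorted by class code."""
    lines = [f"{'Class code':<20}{'n':>6}"]
    for code, n in sorted(counts.items()):
        lines.append(f"{code:<20}{n:>6}")
    return "\n".join(lines)


codes = ["=", "j", "j", "c", "=", "j"]  # made-up class codes for illustration
counts = count_class_codes(codes)
print(format_stats(counts))
```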

#15 Update instruction manual

Write unittests for code written in sprint 5

Details

Lots of new code was written in sprint 5, but the quality is still weak. Improve code quality and write unittests for the new code.

As a user I can extract two characters from reference data at a given position.

As a user I am given natural language guidance if I give incorrect parameters for arguments

#15 alter the code to do offset comparison as a list.

Initial idea

The transcript comparison should be its own function.
The function takes two lists as input.
The function returns a list with the smallest offset computed for each value in the list.

Definitions

offset: an exon can have an offset in the start index and/or in the end index. The offset is a two element tuple with the following values (start_idx - reference.start_idx, end_idx - reference.end_idx)
total offset: the sum of the absolute start and end offsets: total_offset = |start_offset| + |end_offset|
smallest offset: the offset with the smallest total offset
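
The definitions above translate into a short function. This is a sketch with hypothetical names: for each exon it computes the offset against every reference exon and keeps the one with the smallest total offset. It assumes the reference list is not empty, since min raises ValueError on an empty sequence.

```python
from typing import List, Tuple

Exon = Tuple[int, int]    # (start_idx, end_idx)
Offset = Tuple[int, int]  # (start offset, end offset)


def total_offset(offset: Offset) -> int:
    """total_offset = |start_offset| + |end_offset|"""
    return abs(offset[0]) + abs(offset[1])


def smallest_offsets(exons: List[Exon], reference_exons: List[Exon]) -> List[Offset]:
    """For each exon, return the offset with the smallest total offset."""
    result = []
    for start, end in exons:
        candidates = [
            (start - ref_start, end - ref_end)
            for ref_start, ref_end in reference_exons
        ]
        result.append(min(candidates, key=total_offset))
    return result


reference = [(98, 200), (300, 400), (500, 600)]
print(smallest_offsets([(100, 200), (300, 410)], reference))  # [(2, 0), (0, 10)]
```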

Related user story: #15

As a user I can store gtf-content to a relational database

Initial concept

A user should be able to store data from CLI. The workflow could be:

The user calls the application with a command line instruction (e.g. python3 <filename> <arguments>).
As an argument the user provides the path to the file.
The output is a db-file in the directory where the given file is located.

Directory example

The given input-file does not need to be located in the same directory as the utility.

|
└─── util 
            util.py
|
|
└─── file location
            <file>.gtf
            <file>-ca.db

The initial idea for the functionality:

flowchart LR
    A{Start util} --> B[db exists?]
    B -->|Yes| C{Remove db?}
    C -->|No| D[End]
    C -->|Yes| E[delete db]
    E --> F[Run util]
    B -->|No| F
    F --> D
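
The flowchart could be expressed in code as follows. This is a sketch: the function name prepare_database and its two callbacks (one that builds the database, one that asks whether an existing database should be removed) are assumptions made to keep the example independent of gffutils and of user input.

```python
import os
import tempfile
from typing import Callable


def prepare_database(db_path: str,
                     build: Callable[[str], None],
                     confirm_removal: Callable[[str], bool]) -> bool:
    """Return True if the database was (re)built, False if an existing one was kept."""
    if os.path.exists(db_path):
        if not confirm_removal(db_path):
            return False       # db exists and the user keeps it: End
        os.remove(db_path)     # db exists and the user removes it
    build(db_path)             # Run util
    return True


def create_empty_file(path: str) -> None:
    """Stand-in for the real database creation."""
    open(path, "w").close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample-ca.db")
    print(prepare_database(path, create_empty_file, lambda p: False))  # no db yet, so it is built
    print(prepare_database(path, create_empty_file, lambda p: False))  # db exists and is kept
```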

#50 Import pyfaidx library

pyfaidx included in requirements.txt.

Related user story: #50

#58 Write code to infer whether an indel or a mismatch happened at a given window

Details

Some initial code exists in the file cigar-parser.py. However, pysam's get_aligned_pairs might prove more effective for this use case. First get familiar with pysam's functionalities and then implement a solution for the given task. See documentation for sprint-6 for more details.

Update 20.6.2022: For now we will leave mismatches out and focus on indels.

Finish first: #84
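
A possible approach to the indel part is sketched below. It works on a list of (query_pos, ref_pos) tuples in the shape that pysam's get_aligned_pairs returns, where None marks a gap on either side. The function name and the window convention (half-open, in reference coordinates) are assumptions. The sketch counts inserted and deleted bases rather than events, and it cannot tell deletions from skipped regions such as introns, or insertions from trailing soft-clipped bases; the CIGAR string would be needed for that.

```python
from typing import Dict, List, Optional, Tuple

AlignedPair = Tuple[Optional[int], Optional[int]]  # (query_pos, ref_pos)


def indels_in_window(pairs: List[AlignedPair],
                     window_start: int, window_end: int) -> Dict[str, int]:
    """Count inserted and deleted bases in the reference window [window_start, window_end)."""
    counts = {"insertions": 0, "deletions": 0}
    last_ref_pos = None
    for query_pos, ref_pos in pairs:
        if ref_pos is not None:
            last_ref_pos = ref_pos
        if query_pos is None and ref_pos is not None:
            # reference base with no query base: deletion
            if window_start <= ref_pos < window_end:
                counts["deletions"] += 1
        elif ref_pos is None and query_pos is not None:
            # query base with no reference base: insertion, anchored
            # to the closest preceding reference position
            if last_ref_pos is not None and window_start <= last_ref_pos < window_end:
                counts["insertions"] += 1
    return counts


# Made-up alignment: one inserted base after position 101, two deleted bases at 103-104.
pairs = [(0, 100), (1, 101), (2, None), (3, 102), (None, 103), (None, 104), (4, 105)]
print(indels_in_window(pairs, 100, 106))  # {'insertions': 1, 'deletions': 2}
```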

#9 Write a functionality to count n of each class code

#50 create an initial data structure for output

Details

Create an initial concept for a data structure and include it in the general information documentation.

Install and import gffutils
Add gffutils to requirements.txt

Related user story: #1
