Git Product home page Git Product logo

compare-annotations's Introduction

Compare annotations

This repo contains a script (compare_annotations.py) for quantifying the improvement in an annotation when a genome is reassembled and/or reannotated.

For example, imagine you had an annotated bacterial genome that's a couple of years old. You've now come back to this genome with new versions of the assembler and annotator and made an updated version. This script can tell you how things changed at the gene level. Hopefully they got better!

This script works by doing a global alignment of the genes of one annotation to the other. This means the two genomes must be roughly aligned at the gene level โ€“ i.e. they should start and end at the same places. If your new genome contains structural rearrangements, that will break this script!

A few other things to note:

  • The input annotated genomes must be in GenBank format.
  • This script only looks at 'CDS' features in the genomes, nothing else.

Requirements

This script uses Python 3 and Biopython. If you can run python3 -c "import Bio" without getting an error, you should be good to go!

No installation is required: just clone the repo and run the script:

git clone https://github.com/rrwick/Compare-annotations.git
Compare-annotations/compare_annotations.py --help

2023 update: When I tried this script using Biopython v1.81, it no longer works. But it still works using Biopython v1.78, so you might need to make a conda environment like this to run the script:

conda create -n old_biopython biopython=1.78
conda activate old_biopython

Example usage

Here are two versions of a genome you can try this script on: CP001172.1 and CP001172.2.

Download them in GenBank format and then run the script like this:

compare_annotations.py CP001172.1.gb CP001172.2.gb > results

Output

This script outputs a gene-by-gene analysis.

When two CDSs are identical, you'll see something like this:

Exact match
  old: Anhydro-N-acetylmuramic acid kinase(AnhMurNAc kinase) (14970-16098 -, 1128 bp)
  new: Anhydro-N-acetylmuramic acid kinase (14970-16098 -, 1128 bp)

Or if the CDSs are similar (i.e. the same gene) but not identical, you'll see something like this:

Inexact match (98.54% ID, old seq longer)
  old: tyrosyl-tRNA synthetase (16159-17392 +, 1233 bp)
  new: Tyrosine--tRNA ligase (16177-17392 +, 1215 bp)

And if a CDS is only in one of the two assemblies, you might see stuff like this:

In old but not in new:
  Glutathione S-transferase, C-terminal domain protein (24047-24413 +, 366 bp)

In new but not in old:
  Sodium:neurotransmitter symporter family protein (25765-27244 +, 1479 bp)

Summarising output

Here are a few lines of Bash to generate a summary file:

printf "Features in old assembly: %5s\n" $(grep "Features in old assembly" results | grep -oP "\d+") >> summary
printf "Features in new assembly: %5s\n" $(grep "Features in new assembly" results | grep -oP "\d+") >> summary
printf "\n" >> summary
printf "Exact match:              %5s\n" $(grep -c "Exact match" results) >> summary
printf "\n" >> summary
printf "Inexact match:            %5s\n" $(grep -c "Inexact match" results) >> summary
printf "  same length:            %5s\n" $(grep -c "same length" results) >> summary
printf "  new seq longer:         %5s\n" $(grep -c "new seq longer" results) >> summary
printf "  old seq longer:         %5s\n" $(grep -c "old seq longer" results) >> summary
printf "\n" >> summary
printf "In new but not in old:    %5s\n" $(grep -c "In new but not in old" results) >> summary
printf "In old but not in new:    %5s\n" $(grep -c "In old but not in new" results) >> summary
printf "\n" >> summary
printf "No longer hypothetical:   %5s\n" $(grep -c "no longer hypothetical" results) >> summary
printf "Still hypothetical:       %5s\n" $(grep -c "still hypothetical" results) >> summary
printf "Became hypothetical:      %5s\n" $(grep -c "became hypothetical" results) >> summary

The result of which should look something like this:

Features in old assembly:  3458
Features in new assembly:  3427

Exact match:               2937

Inexact match:              354
  same length:                3
  new seq longer:           286
  old seq longer:            65

In new but not in old:      136
In old but not in new:      167

No longer hypothetical:     239
Still hypothetical:         638
Became hypothetical:         84

License

GNU General Public License, version 3

compare-annotations's People

Contributors

rrwick avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

compare-annotations's Issues

Issues running script

Hi, I get the following error message when running the script:

Traceback (most recent call last):
File "compare_annotations.py", line 163, in
main()
File "compare_annotations.py", line 49, in main
old_record = next(old)
File "/Users/kathy/opt/anaconda3/lib/python3.8/site-packages/Bio/SeqIO/Interfaces.py", line 73, in next
return next(self.records)
StopIteration

Any ideas what might be going on?

Thank you!

Indentation off in else statement

else:
sys.exit('\nERROR: Failed to find alignment')

Am I right in thinking this else statement seems "out of place"? i.e. if you make it through the above for loop without hitting a break then this else statement is always triggered. Meaning the code below it is unreachable.

I tried indenting it one level, but I'm not sure that's correct either?

TypeError: cannot unpack non-iterable bool object

Hello,

When I try to run compare-annotations, the program runs fine but at the end I get the following message :

Traceback (most recent call last):
File "/home/user/Compare-annotations/compare_annotations.py", line 163, in
main()

File "/home/user/Compare-annotations/compare_annotations.py", line 82, in main
match, identity, length_diff = compare_features(old_feature, new_feature, old_record, new_record, args.match_identity_threshold)

TypeError: cannot unpack non-iterable bool object

The code from line 82 looks like this :

 match, identity, length_diff = compare_features(old_feature, new_feature, old_record, new_record, args.match_identity_threshold)
            if match:
                for j in range(old_offset):
                    print_in_old_not_new(old_features[old_i + j])
                for j in range(new_offset):
                    print_in_new_not_old(new_features[new_i + j])
                print_match(old_features[old_i + old_offset], new_features[new_i + new_offset], identity, length_diff)
                old_i += old_offset
                new_i += new_offset
                break
        else:
            sys.exit('\nERROR: Failed to find alignment')

        if old_feature is None and new_feature is None:
            break

        old_i += 1
        new_i += 1


def print_match(f1, f2, identity, length_diff):
    print('', flush=True)
    if identity == 1.0:
        print('Exact match')
    else:
        print('Inexact match (' + '%.2f' % (identity * 100.0) + '% ID, ', end='')
        if length_diff == 0:
            print('same length)')
        elif length_diff > 0:
            print('old seq longer)')
        elif length_diff < 0:
            print('new seq longer)')
    print('  old: ', end='')
    print_feature_one_line(f1)
    print('  new: ', end='')
    print_feature_one_line(f2)
    p1 = f1.qualifiers['product'][0].lower()
    p2 = f2.qualifiers['product'][0].lower()
    if 'hypothetical' in p1 and 'hypothetical' in p2:
        print('  still hypothetical')
    if 'hypothetical' in p1 and 'hypothetical' not in p2:
        print('  no longer hypothetical')
    if 'hypothetical' not in p1 and 'hypothetical' in p2:
        print('  became hypothetical')


def print_in_old_not_new(f):
    print('')
    print('In old but not in new:')
    print('  ', end='')
    print_feature_one_line(f)


def print_in_new_not_old(f):
    print('')
    print('In new but not in old:')
    print('  ', end='')
    print_feature_one_line(f)


def print_feature_one_line(f):
    f_str = f.qualifiers['product'][0]
    strand = '+' if f.location.strand == 1 else '-'
    f_str += ' (' + str(f.location.start) + '-' + str(f.location.end) + ' ' + strand + ', '
    f_str += str(f.location.end - f.location.start) + ' bp)'
    print(f_str)


def compare_features(f1, f2, r1, r2, match_identity_threshold):
    if f1 is None or f2 is None:
        return False
    s1 = f1.extract(r1).seq
    s2 = f2.extract(r2).seq
    score = pairwise2.align.globalms(s1, s2, 1, 0, 0, 0, score_only=True)
    identity = score / max(len(s1), len(s2))
    match = identity > match_identity_threshold
    length_diff = len(s1) - len(s2)
    return match, identity, length_diff


if __name__ == '__main__':
    main()

Can you provide answers that could help me debug this issue ?

Thank you.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.