ITIPs: Identification of transposable element insertion polymorphisms (TIPs)

We developed a pipeline to identify population-scale TIPs based on a pan-genome analysis and large-scale resequencing of accessions in B. rapa. The pipeline for identifying population-scale TIPs employed three sequential steps: identification of insertions and deletions, construction of the TE insertion dataset, and determination of TIPs at a population scale.

Introduction

A pipeline developed to adequately retrieve population-scale TIPs

Step 1: Identification of insertions and deletions in the B. rapa pan-genome.

In this step, we used each of the 20 B. rapa genomes in turn as the reference and identified insertions and deletions in the B. rapa pan-genome using the smartie-sv pipeline.

Step 2: Construction of the TE insertion dataset.

After obtaining insertions and deletions, we mapped each insertion or deletion onto the B. rapa TE library. If the similarity and coverage of a deletion (or insertion) were both greater than 80% (also called ‘the 80-80 rule’), then the deletion (or insertion) was defined as a TE insertion. Furthermore, we proposed the concepts of ‘aligned regions’ and ‘unaligned regions’ to describe TIPs in the pan-genome. The concepts were based on the Chiifu genomic sequences. If genomic sequences from the other 19 accessions could be covered by the Chiifu sequences, we denoted such regions as ‘aligned regions’; if the genomic sequences in the other genomes could not be covered by the Chiifu sequences, we defined them as ‘unaligned regions’.
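The 80-80 rule can be sketched as a simple filter. The Python below is a minimal illustration, not the pipeline's own code. The function name `passes_80_80`, the use of percent identity as "similarity", and the definition of coverage as aligned bases divided by the length of the insertion or deletion are all assumptions made for this example, as are the three alignment summaries.

```python
def passes_80_80(identity_pct, aligned_len, sv_len, threshold=80.0):
    """Return True if an insertion/deletion passes the 80-80 rule.

    identity_pct: percent identity of the alignment to a TE library entry
    aligned_len:  bases of the insertion/deletion covered by the alignment
    sv_len:       total length of the insertion/deletion
    Assumption: coverage is measured relative to the insertion/deletion.
    """
    if sv_len <= 0:
        raise ValueError("sv_len must be positive")
    coverage_pct = 100.0 * aligned_len / sv_len
    # Both values must be strictly greater than the threshold.
    return identity_pct > threshold and coverage_pct > threshold


# Hypothetical alignment summaries: (identity %, aligned bases, SV length)
candidates = {
    "sv1": (92.5, 650, 704),   # high identity, about 92% coverage
    "sv2": (95.0, 500, 3006),  # high identity, about 17% coverage
    "sv3": (78.0, 262, 262),   # full coverage, identity too low
}
te_insertions = [name for name, hit in candidates.items() if passes_80_80(*hit)]
print(te_insertions)
```

Only `sv1` satisfies both conditions here; the other two fail on coverage and identity respectively.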

Step 3: Determination of TIPs at a population scale.

We implemented this strategy by mapping short reads onto the TE insertions and their flanking sequences. If one or both boundaries of a TE insertion were covered by short reads, we defined the accession as harboring that TE insertion. The detailed process included three steps. First, we extracted the flanking sequences of each TE insertion (1 kb upstream and 1 kb downstream). Second, the upstream flanking sequence, the TE insertion sequence, and the downstream flanking sequence were joined in that order. Third, we mapped the population-scale resequencing short reads onto the constructed target sequences. If a read from an accession aligned directly to both the upstream and downstream flanking sequences, we considered that accession to lack the TE insertion; if a read aligned to the TE insertion sequence and at least one flanking sequence (upstream or downstream), the accession was considered to harbor the TE insertion.
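The decision rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the logic of 03.TE_insertions_genotype.pl: the function names, the representation of a read as a list of aligned blocks on the target sequence (0-based, end-exclusive), and the `min_overhang` threshold are all hypothetical, and converting real aligner output into such blocks is not shown.

```python
def build_target(upstream, te_seq, downstream):
    """Join upstream flank, TE insertion and downstream flank in order.

    Returns the target sequence plus the offsets of the two junctions.
    """
    target = upstream + te_seq + downstream
    left = len(upstream)          # upstream flank | TE
    right = left + len(te_seq)    # TE | downstream flank
    return target, left, right


def overlap(seg, lo, hi):
    """Number of bases shared by block (start, end) and interval [lo, hi)."""
    return max(0, min(seg[1], hi) - max(seg[0], lo))


def classify_read(segments, left, right, target_len, min_overhang=10):
    """Classify one read from its aligned blocks on the target.

    min_overhang is a hypothetical minimum number of aligned bases
    required in a region before the read counts as covering it.
    """
    up = sum(overlap(s, 0, left) for s in segments)
    te = sum(overlap(s, left, right) for s in segments)
    down = sum(overlap(s, right, target_len) for s in segments)
    if te >= min_overhang and (up >= min_overhang or down >= min_overhang):
        return "insertion"       # TE plus at least one flank
    if te == 0 and up >= min_overhang and down >= min_overhang:
        return "no_insertion"    # flank joined directly to flank
    return "uninformative"


# Toy target: 1 kb flanks around a 300 bp TE (sequences are placeholders).
target, left, right = build_target("A" * 1000, "C" * 300, "G" * 1000)
reads = {
    "r1": [(950, 1100)],                # crosses the upstream junction
    "r2": [(930, 1000), (1300, 1380)],  # skips the TE entirely
    "r3": [(100, 250)],                 # lies inside one flank only
}
calls = {name: classify_read(segs, left, right, len(target))
         for name, segs in reads.items()}
print(calls)
```

Reads that fall entirely within one region say nothing about the junctions, so the sketch keeps them in a separate "uninformative" class rather than forcing a call.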

Installation

ITIPs itself is installation-free but requires two dependencies: smartie-sv and bwa (version 0.7.17-r1188). A bwa binary is provided in the /ITIPs/scripts/ folder.

git clone https://github.com/caixu0518/ITIPs.git
cd ITIPs
chmod u+x *pl 
cd scripts
chmod u+x *

Inputs

Two types of input are required for ITIPs:

  1. Genome FASTA files, e.g. reference genome, query 1 genome, query 2 genome ......
  2. Population-scale resequencing reads, e.g. Sam1_1.fq.gz, Sam1_2.fq.gz ......

Outputs

Phase 1: The pipeline generates the reference and non-reference TE insertions, written to reference_TE.insertions.xls and non-reference_TE.insertions.xls.

Chr     Start   End     Type    SVlen   Upstream        Downstream      Gene    CDS
A10     90093   90797   deletion        704     query1;query2   BraA10g000190.3.1C      BraA10g000200.3.1C      -       -
A10     158873  161879  deletion        3006    query1  BraA10g000350.3.1C      BraA10g000340.3.1C      -       -
A10     161994  162256  deletion        262     query1  BraA10g000350.3.1C      -       -       -
A10     248968  253788  deletion        4820    query2  -       BraA10g000500.3.1C      -       -
A10     252712  253372  deletion        660     query1  -       BraA10g000500.3.1C      -       -
A10     253389  254595  deletion        1206    query1  -       BraA10g000500.3.1C      -       -
A10     253794  254442  deletion        648     query2  -       BraA10g000500.3.1C      -       -
A10     325403  326892  deletion        1489    query1  BraA10g000690.3.1C      -       -       -
A10     325405  325631  deletion        226     query2  BraA10g000680.3.1C      -       -       -
A10     325635  326898  deletion        1263    query2  BraA10g000690.3.1C      -       -       -
A10     329657  332373  deletion        2716    query2  -       BraA10g000690.3.1C      -       -
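For downstream analysis, rows like these can be loaded with a short parser. The sketch below is an assumption-laden example rather than part of ITIPs: it assumes whitespace-separated fields with "-" for empty values, as in the sample, and it treats the sixth field of each data row (which has no matching name in the header line shown) as the list of query genomes, under the hypothetical name `queries`. It also checks that SVlen equals End minus Start, which holds for the rows above.

```python
from typing import List, NamedTuple


class TEInsertion(NamedTuple):
    chrom: str
    start: int
    end: int
    sv_type: str
    sv_len: int
    queries: List[str]     # assumed meaning of the unlabelled sixth field
    annotation: List[str]  # remaining fields, kept as-is


def parse_row(line):
    """Parse one data row of a Phase 1 table into a TEInsertion."""
    f = line.split()
    return TEInsertion(f[0], int(f[1]), int(f[2]), f[3], int(f[4]),
                       f[5].split(";"), f[6:])


# Two rows copied from the sample output above.
rows = """\
A10     90093   90797   deletion        704     query1;query2   BraA10g000190.3.1C      BraA10g000200.3.1C      -       -
A10     248968  253788  deletion        4820    query2  -       BraA10g000500.3.1C      -       -
""".splitlines()

records = [parse_row(r) for r in rows]
# Sanity check: the reported length matches the coordinates.
consistent = all(rec.sv_len == rec.end - rec.start for rec in records)
# Entries listed for more than one query genome.
shared = [rec for rec in records if len(rec.queries) > 1]
print(consistent, len(shared))
```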

Phase 2: Genotypes of the TE insertions in each resequenced genome, e.g. Sam1.refereceTEinsertion and Sam1.Non-refereceTEinsertion.

TEindex AB      BC      AC      L       R       Genotype
Dref100 0       0       0       5       0       GG
Dref103 0       0       0       0       0       NA
Dref104 0       0       0       5       0       GG
Dref113 0       0       0       1       0       GG
Dref116 0       0       0       0       0       NA
Dref12  0       0       0       1       0       GG
Dref122 0       0       0       0       0       NA
Dref123 0       0       0       1       0       GG
Dref125 0       0       0       1       0       GG

CC indicates that the genotype in the corresponding accession is consistent with the reference genome, GG indicates that it differs from the reference genome, and NA represents a missing locus.
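A genotype table of this shape is easy to summarise per sample. The following Python is a minimal sketch, assuming a whitespace-separated file whose first line is the header shown above; the function name `read_genotypes` is hypothetical, and only the TEindex and Genotype columns are used.

```python
from collections import Counter


def read_genotypes(lines):
    """Map TEindex -> Genotype from a Phase 2 table (header line first)."""
    header = lines[0].split()
    idx, gt = header.index("TEindex"), header.index("Genotype")
    calls = {}
    for line in lines[1:]:
        fields = line.split()
        if fields:  # skip blank lines
            calls[fields[idx]] = fields[gt]
    return calls


# First rows of the sample output above.
table = """\
TEindex AB      BC      AC      L       R       Genotype
Dref100 0       0       0       5       0       GG
Dref103 0       0       0       0       0       NA
Dref104 0       0       0       5       0       GG
""".splitlines()

calls = read_genotypes(table)
counts = Counter(calls.values())
missing_rate = counts["NA"] / len(calls)
print(dict(counts), round(missing_rate, 2))
```

Reading the real file would only require replacing `table` with the lines of, for example, Sam1.refereceTEinsertion.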

Usage

There are three main sequential steps to identify and genotype TE insertions, corresponding to 01.Reference_Nonreference_TEinsertion.pl, 02.get_TE_insertions_and_flankingSequences.pl, and 03.TE_insertions_genotype.pl.

Step 1: Identification of reference and non-reference TE insertions between different genomes.

perl 01.Reference_Nonreference_TEinsertion.pl  -h

Usage: perl 01.Reference_Nonreference_TEinsertion.pl  -query <query.info.lst>  -ref <reference.info.lst>  -TElib <EDTA.TElib.fa>  -bin <the path to smartie-sv>  -script <the path to scripts>

-query	[required] the query id and query genome files. Two columns (queryName queryGenomeFile).
-ref    [required] the reference information. Three columns (referenceName ReferenceGenomeFile ReferenceGff3)
-TElib  [required] the species TE library
-bin   	[required] the path to smartie-sv. i.e. /10t/caix/src/smartie-sv/bin
-script [required] the path to perl scripts

Step 2: Extract flanking sequences of each TE insertion

perl 02.get_TE_insertions_and_flankingSequences.pl  -h

Usage: perl 02.get_TE_insertions_and_flankingSequences.pl  -refGenome <ref.fa>  -refName  <ref>   -script <the path to scripts>  reference_TE.insertions.xls  non-reference_TE.insertions.xls

-refGenome	[required] the reference genome in fasta format.
-refName	  [required] the reference genome name, same as provided in the first step.
-script   	[required] the path to perl scripts

Step 3: Genotype TE insertions using short reads

perl 03.TE_insertions_genotype.pl -h

perl 03.TE_insertions_genotype.pl   -Fasta  <ref.referenceTEinsertions_and_flanking1kb.fasta>  -leftRead <Sam1_1.fq.gz>  -rightRead <Sam1_2.fq.gz> -samId <Sam1>  -output <Sam1.refereceTEinsertion>  -script <the path to scripts>  -threads <threads>

-Fasta	[required] the TE insertion and flanking sequences in fasta format
-leftRead	[required] left read
-rightRead	[required] right read
-samId	[required] sample name i.e. Sam1
-output	[required] TE genotype results
-script	[required] the path to scripts
-threads	[optional] threads  default: 6 cores
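Because step 3 runs once per sample and per target FASTA, a small driver can generate the commands for a whole population. The sketch below only uses the options listed above; the helper name `genotype_command`, the sample name Sam2, the `scripts/` path, and the assumption that reads are named `<sample>_1.fq.gz` and `<sample>_2.fq.gz` are illustrative. The commands are printed rather than executed.

```python
def genotype_command(sample, fasta, output_suffix, script_dir, threads=6):
    """Build the argument list for one run of 03.TE_insertions_genotype.pl."""
    return [
        "perl", "03.TE_insertions_genotype.pl",
        "-Fasta", fasta,
        "-leftRead", f"{sample}_1.fq.gz",
        "-rightRead", f"{sample}_2.fq.gz",
        "-samId", sample,
        "-output", f"{sample}.{output_suffix}",
        "-script", script_dir,
        "-threads", str(threads),
    ]


samples = ["Sam1", "Sam2"]  # Sam2 is a hypothetical second sample
commands = [
    genotype_command(
        s,
        "ref.referenceTEinsertions_and_flanking1kb.fasta",
        "refereceTEinsertion",  # spelling follows the pipeline's file names
        "scripts/",
        threads=10,
    )
    for s in samples
]
for cmd in commands:
    print(" ".join(cmd))
    # To execute instead of printing: subprocess.run(cmd, check=True)
```

Building each command as a list avoids shell quoting problems if a path contains spaces.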

An example

We recommend that users format their input files to match those provided in testData.

cd  testData
sh  runMe.sh
cat runMe.sh

#!/bin/bash
path=`pwd`
script="${path}/../scripts/"

perl ../01.Reference_Nonreference_TEinsertion.pl  -query   query.info.lst  \
                                                  -ref     reference.info.lst  \
                                                  -TElib   EDTA.TElib.fa  \
                                                  -bin     /10t/caix/src/smartie-sv/bin  \
                                                  -script  ${script} 

perl ../02.get_TE_insertions_and_flankingSequences.pl  -refGenome ref.fa  \
                                                       -refName   ref  \
                                                       -script    ${script}

perl ../03.TE_insertions_genotype.pl      -Fasta  ref.Non-referenceTEinsertions_and_flanking1kb.fasta  \
                                          -leftRead Sam1_1.fq.gz  \
                                          -rightRead Sam1_2.fq.gz  \
                                          -samId  Sam1 \
                                          -output Sam1.Non-refereceTEinsertion  \
                                          -script  ${script}  \
                                          -threads 10 

perl ../03.TE_insertions_genotype.pl      -Fasta  ref.referenceTEinsertions_and_flanking1kb.fasta  \
                                          -leftRead Sam1_1.fq.gz  \
                                          -rightRead Sam1_2.fq.gz  \
                                          -samId  Sam1 \
                                          -output Sam1.refereceTEinsertion  \
                                          -script  ${script}  \
                                          -threads 10

Final results:
Phase 1: reference_TE.insertions.xls and non-reference_TE.insertions.xls
Phase 2: Sam1.refereceTEinsertion and Sam1.Non-refereceTEinsertion

Citations

Xu Cai, Runmao Lin, Jianli Liang, Graham J. King, Jian Wu, Xiaowu Wang. (2022). Transposable element insertion: a hidden major source of domesticated phenotypic variation in Brassica rapa. Plant Biotechnology Journal. https://doi.org/10.1111/pbi.13807

itips's Issues

Makefile issue for smartie-sv

I am unable to install smartie-SV because the system runs into a Makefile error. I have pasted the error output below. I have tried contacting the smartie-SV team, but they are not responding. Is there any way you can help me get it installed so that I can move forward with my analysis?
./Makefile: line 1: .PHONY:: command not found
./Makefile: line 3: all:: command not found
./Makefile: line 5: bin/blasr:: No such file or directory
./Makefile: line 6: MAKE: command not found
./Makefile: line 8: bin/printGaps:: No such file or directory
./Makefile: line 9: cd: src/print_gaps: No such file or directory
./Makefile: line 11: src/htslib/libhts.a:: No such file or directory
./Makefile: line 12: cd: src/htslib: No such file or directory
./Makefile: line 14: bin:: command not found
./Makefile: line 17: clean:: command not found

As it appears, there is an issue with .PHONY

Hello question about insertions and deletions

Hello! I used your very nice pipeline for my project concerning some plant genomes. I wanted to ask: I have 5 genomes, and I ran the pipeline 5 times. For each reference genome, there are insertion results and deletion results. In the resulting "info" file, an insertion has only a start coordinate, whereas a deletion has both start and end coordinates. I would like to understand how this works. For instance, I did a strict blast of the insertion sequence to see whether I would get unique hits between the reference and the queries that were reported to have it (indeed, I verified the unique hit). Now, concerning the deletion: you give start and end coordinates on the reference for the deletion. For example, reference "A" has a deletion sequence D1A, and only one of the queries, query "B" (for simplicity), is reported to have it. Does this mean that in the aligned region of "A" and query "B", "B" has a TE insertion while "A" does not?

Happy holidays!

About insertions and deletions?

Dear xu
Thank you for designing such a good process!
I currently have 30 mammalian genomes, 100 long-read ONT data corresponding to the species, and a lot of second-generation sequencing data. My goal is to study the polymorphism of TE insertions in this species. Your idea is very good. I will use two strategies to obtain accurate TE results.

  1. Identify insertions and deletions between genomes through genome alignment, and compare with TE lib to find the real TE.
  2. Directly detect TE in ONT data through the currently released third-generation sequencing data TE detection software. Or detect insertions and deletions by aligning to the genome, and align these insertions and deletions to TE lib.
    Finally, combine the results of these two parts. Do you think this strategy will have better accuracy and diversity than the individual strategies? In order to facilitate the use of second-generation sequencing data typing later.
    In addition, I would like to ask whether the choice of TE lib can directly run EDTA, and then merge each result through the 80-80-80 strategy to remove redundancy, and obtain the final usable TE lib.

Sincerely hope to get your reply
Best wishes
yulong

smartie-sv-dependency issue

Hi
I am trying to install smartie-sv in the server but it is causing many errors as follows.

(base) baicai@18:08:35 /home/data1/baicai/AWAIS$ cd smartie-sv-master && make
cd src/blasr && make && cp alignment/bin/blasr ../../bin/blasr && cp alignment/bin/sawriter ../../bin/sawriter
make[1]: Entering directory '/home/data1/baicai/AWAIS/smartie-sv-master/src/blasr'
make[1]: *** No targets specified and no makefile found. Stop.
make[1]: Leaving directory '/home/data1/baicai/AWAIS/smartie-sv-master/src/blasr'
make: *** [Makefile:6: bin/blasr] Error 2

Is there any way to install it, or to work around this error, so that I can install ITIPs on my server?
Please help, looking forward to your response.

Genomic data for NHCC001 and ZYCX renamed as PCC and PCD

Hi
I have been able to retrieve gff files for these genomes that you have used in the experiment but I am unable to retrieve their genomic fna files. Can you please provide me with the link to these genomic files, I need to build insertions and deletions datasets and I want to include these genotypes as well.
thanks in advance

Some questions about TIP identification at population scale

Hi,

Recently I read the paper "Transposable element insertion: a hidden major source of domesticated phenotypic variation in Brassica rapa". That's amazing work and very interesting to me, because I want to apply this method in my project.

I have one genome assembly with TE annotation files and some Illumina paired-end reads from several populations. I want to know whether I can skip step 1 and step 2 and just manually extract <ref.referenceTEinsertions_and_flanking1kb.fasta> as input for step 3, or whether there is another way to get the correct input file.

Also, if ITIPs is not compatible with our data, could you recommend some software that works with my data?

Thanks,
Wenjie
