vsbuffalo / bds-files

Supplementary files for my book, "Bioinformatics Data Skills"

License: MIT License

R 34.90% Python 59.90% Makefile 5.04% Shell 0.16%

bds-files's Issues

Minor chapter 2 error

Truly amazing book—thank you for this great service. I think, in chapter 2 p. 29, the examples should read:

$ ls seqs/zmays[AB]_R1.fastq
$ ls seqs/zmays[A-B]_R1.fastq

i.e., there should be seqs/ in front.
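
The effect of the seqs/ prefix can be checked in a scratch directory. This is only a sketch: the file names mimic the book's zmays examples, but the files are empty placeholders created in a temporary directory.

```shell
# Build a throwaway project layout with placeholder FASTQ files.
workdir=$(mktemp -d)
mkdir "$workdir/seqs"
touch "$workdir/seqs/zmaysA_R1.fastq" \
      "$workdir/seqs/zmaysB_R1.fastq" \
      "$workdir/seqs/zmaysC_R1.fastq"
cd "$workdir" || exit 1

# The directory prefix is part of the pattern, so the shell expands it
# against the contents of seqs/ rather than the current directory.
matched=$(ls seqs/zmays[AB]_R1.fastq)
echo "$matched"

# [A-B] is a character range and matches the same two files here.
matched_range=$(ls seqs/zmays[A-B]_R1.fastq)
echo "$matched_range"
```

Without the prefix, the same patterns would be matched against the current directory, where no zmays files exist in this layout.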

Errata and updates for Chapter 8

(Page 181)

You access access R’s built-in documentation

should be

You access R’s built-in documentation

(Page 181)

apropos(norm)

printed "Error: is.character(what) is not TRUE", and thus should be

apropos("norm")

(Page 197)

(or use asis;

should be

(or use as.is;

(Page 202)

by dd$start >= 25800000 and dd$end <= 29700000 and

should be

by d$start >= 25800000 and d$end <= 29700000 and

(Page 205)

because d$percent is a vector,

should be

because d$percent.GC is a vector,

(Page 206)

Thus, d[$Pi > 3, ] is identical to d[which(d$Pi > 3), ];

should be

Thus, d[d$Pi > 3, ] is identical to d[which(d$Pi > 3), ];

(Page 208)

gplot2’s grammar through examples.

should be

ggplot2’s grammar through examples.

Page 218

Finding the Right Bin Width

ggplot(d) + geom_bar(aes(x=Pi), binwidth=1) + scale_x_continuous(limits=c(0.01, 80))

printed the following Warning messages:

## Warning: Ignoring unknown parameters: binwidth
## Warning: Removed 3373 rows containing non-finite values (stat_count).

Page 211

—everything is saturated from about 0.05 and below.

should be

—everything is saturated from about 0.005 and below.

(Page 216)

a factor (e.g., d$binned.GC)

should be

a factor (e.g., d$GC.binned)

(Page 217)

Total.SNPs,

should be

total.SNPs,

(Page 219)

try class(d$repClass) and levels(d$repClass)

should be

try class(reps$repClass) and levels(reps$repClass)

Page 222

The pos column is already present in
https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-08-r/motif_recombrates.txt

> mtfs <- read.delim("motif_recombrates.txt", header=TRUE, stringsAsFactors=TRUE)
> head(mtfs, 3)
   chr motif_start motif_end    dist recomb_start recomb_end  recom
1 chrX    35471312  35471325 39323.0     35430651   35433340 0.0015
2 chrX    35471312  35471325 36977.0     35433339   35435344 0.0015
3 chrX    35471312  35471325 34797.5     35435343   35437699 0.0015
          motif           pos
1 CCTCCCTGACCAC chrX-35471312
2 CCTCCCTGACCAC chrX-35471312
3 CCTCCCTGACCAC chrX-35471312

(Page 231)

> set.seed(0) # we set the random number seed so this example is reproducible

should be added before

> ll <- list(a=rnorm(6, mean=1), b=rnorm(6, mean=4), c=rnorm(6, mean=6))

Page 232

> lapply(ll, mean) 
$a
[1] 0.5103648
$b
[1] 0.09681026 
$c
[1] -0.2847329

should be

> lapply(ll, mean)
$a
[1] 1.402273

$b
[1] 4.19003

$c
[1] 5.53541

(Page 244)

My output was not in the order "position GC.binned diversity" shown in the book, but in the order "diversity position GC.binned", as shown below:

> select(d_df, -(start:cent))
Source: local data frame [59,140 x 3]

   diversity position   GC.binned
       (dbl)    (dbl)      (fctr)
1  0.0000000  55500.5 (51.6,68.5]

because the order of R commands was as follows:

    setwd("~/bds-files/chapter-08-r")
    d <- read.csv("Dataset_S1.txt")
    colnames(d)[12] <- "percent.GC"
    d$cent <- d$start >= 25800000 & d$end <= 29700000
    d$diversity <- d$Pi / (10*1000) # rescale, removing 10x and making per bp
    d$position <- (d$end + d$start) / 2
    d$GC.binned <- cut(d$percent.GC, 5)

(Page 246)

d_df %>% filter(percent.GC > 40) becomes filter(d_df, percent.GC > 40.

should be

d_df %>% filter(percent.GC > 40) becomes filter(d_df, percent.GC > 40).

(Page 249) The Double Backslash

To actually include a blackslash in a string,

should be

To actually include a backslash in a string,

(Page 249)

Unlike grep(), regexpr(pattern, x)

should be

Unlike grep(), regexpr(pattern, text)

Page 251

> sub(" *[chrom]+(\\d+|X|Y|M) *", "chr\\1", c("chr19", "chrY"), perl=TRUE)

was changed to

> sub(" *[chrom]+(\\d+|X|Y|M) *", "chr\\1", c("chr19", "chromY"), perl=TRUE)

in the Japanese edition:
https://www.oreilly.co.jp/books/9784873118635/

Page 255

• RStudio’s Rmarkdown tutorial
http://bit.ly/rstudio-rmkdwn

The page no longer exists; it may have moved to:

https://rmarkdown.rstudio.com/

(Page 256)

https://stat.ethz.ch/R-manual/R-devel/library/base/html/args.html
Because base R already provides a function named args(), args should be avoided as a variable name.

## args.R -- a simple script to show command line args
args <- commandArgs(TRUE)
print(args)

Page 257

$ ls -l hotspots
[vinceb]% ls -l hotspots

should be

$ ls -l hotspots

(Page 261)

save.history()

should be

savehistory()

I got the following messages:

> savehistory()
Error in .External2(C_savehistory, file) : 
  'savehistory' is not currently implemented

p313 - import function

It is worth noting that the import function on p313 (used to import a zipped GFF file) will not work unless the rtracklayer package has been installed and loaded:

library(rtracklayer)
mm_gtf <- import('Mus_mus.....gtf.gz')

Page 418

$ find . -name "*.fastq" | xargs basename -s ".fastq"

I tried this but got the following error:

basename: invalid option -- 's'

I am running this on a Unix server and would appreciate some guidance.

$ basename --help
Usage: basename NAME [SUFFIX]
or: basename OPTION
Print NAME with any leading directory components removed.
If specified, also remove a trailing SUFFIX.

  --help     display this help and exit
  --version  output version information and exit

Examples:
basename /usr/bin/sort Output "sort".
basename include/stdio.h .h Output "stdio".

Report basename bugs to bug-coreutils@gnu.org
GNU coreutils home page: http://www.gnu.org/software/coreutils/
General help using GNU software: http://www.gnu.org/gethelp/
For complete documentation, run: info coreutils 'basename invocation'
[khorseik@ntu03 output]$ basename --version
basename (GNU coreutils) 8.4
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by David MacKenzie.
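
The help output above shows that this older basename still accepts the two-argument form, basename NAME [SUFFIX]. One possible workaround, sketched here with placeholder files in a temporary directory, is to have find call basename once per file instead of passing -s through xargs. The file names are illustrative only.

```shell
# Placeholder FASTQ files in a scratch directory (names are illustrative).
workdir=$(mktemp -d)
mkdir -p "$workdir/seqs"
touch "$workdir/seqs/zmaysA_R1.fastq" "$workdir/seqs/zmaysB_R1.fastq"

# Run basename once per file, using the NAME SUFFIX form that does not
# need the -s option.
sample_names=$(find "$workdir" -name "*.fastq" -exec basename {} .fastq \; | sort)
echo "$sample_names"
```

If xargs is preferred, xargs -I{} basename {} .fastq should behave the same way, since -I also runs the command once per input line.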

Chapter 6 curl options of -O and -o <filename>

The curl introduction in Chapter 6 mentions -O <filename>, but curl takes an output filename as an argument only with the -o option.
If you want the output file to have the same name as the remote file, you need the -O option, which does not take a filename. The book explains what happens "If you omit the filename argument", but -o always requires a filename and -O never accepts one, so there is nothing to omit in either case.
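
The difference between the two options can be tried without a network connection. In this sketch a local file reached through a file:// URL stands in for a remote server; the directory and file names are made up, and it assumes a curl build with file:// support.

```shell
# A local file served through a file:// URL stands in for a remote server,
# so the sketch runs offline. Paths and names are illustrative.
remote_dir=$(mktemp -d)
download_dir=$(mktemp -d)
printf 'chr1\t100\t200\n' > "$remote_dir/regions.bed"
url="file://$remote_dir/regions.bed"

cd "$download_dir" || exit 1

# -o takes the local filename as its argument.
curl -s -o renamed.bed "$url"

# -O takes no argument; the local name is taken from the URL.
curl -s -O "$url"

ls
```

After both commands the download directory should hold renamed.bed (named by -o) and regions.bed (named after the URL by -O).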

Mistake in Figure 9-2.

I think there is a mistake in Figure 9-2. As currently drawn, the ranges [1,5) and [2,5], and likewise [8,9) and [9,9], don't seem to represent the same sequences. I think this is because the index should start from 1 in the 1-based system, as shown in the figure below.
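
Setting the drawing itself aside, the arithmetic relating the two conventions can be checked directly. The sketch below assumes the usual definitions (0-based, half-open [start, end) versus 1-based, closed [start, end]) and uses a made-up sequence. The helper to_one_based is hypothetical; cut -c counts characters from 1 with an inclusive end, so it plays the role of the 1-based system.

```shell
# Made-up sequence; the 1-based positions are 1..10.
seq_string="ACGTTGCAAT"

# Convert a 0-based, half-open range to a 1-based, closed range:
# the start shifts by one, the end stays the same.
to_one_based() {
    echo "$(( $1 + 1 )) $2"
}

to_one_based 1 5   # 0-based [1,5) -> 1-based [2,5]
to_one_based 8 9   # 0-based [8,9) -> 1-based [9,9]

# Extract the bases with cut, which is 1-based and inclusive.
first=$(printf '%s\n' "$seq_string" | cut -c 2-5)
second=$(printf '%s\n' "$seq_string" | cut -c 9-9)
echo "$first $second"
```

Under these definitions both notations in each pair cover the same bases (four bases in the first pair, one in the second), so any discrepancy would lie in how the figure's index axis is drawn rather than in the range arithmetic.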

Page 164

The Mus_musculus.GRCm38.75.dna_rm.toplevel.fa.gz file described on page 164 does not appear in the GitHub files for Chapter 7.

Errors found in chapter 9

R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

p314

export(pseudogene_sample, con="five_random_pseudogene.gtf", format="GTF"))

bed_data <- pseudogene_sample

p315

The promoters() function extracts 2000 bp upstream, not 3000.

p316

"BSgenome.Mmusculus.UCSC.mm10" file size is 600.2MB

p323

The amy1_tx object should be assigned before amy1_exons, so the paragraph needs rearranging: the amy1_tx code chunk should precede the amy1_exons one.

An easier way may be:

amy1_txid <- amy1$tx_id  # vector of tx_id
mm_exons[amy1_txid] # this should produce a list of GRanges objects, one per transcript

psetdiff() raised an error; it seems psetdiff() is no longer supported for GRangesList instances. Using setdiff() instead worked fine.

p324

psetdiff() no longer works; I used setdiff() instead.

p326

My numbers are as follows:

> length(unique(queryHits(hits)))
[1] 57623
> length(unique(queryHits(hits)))/length(dbsnp137_resized)
[1] 0.02134185

The difference is small, but please check it again.

At the end of the paragraph on this page:
"...the number of unique query hits in this Hits object: 118,594."
Where does 118,594 come from? It should be 58343 (or 57623).

p327

The code works, but the number in the book differs from the one produced on my system. Please double-check that the number is right.

Page 287 - IntervalTree no longer exported in namespace:IRanges

It appears that IntervalTree is no longer exported by IRanges. The source package for 2.10.5 downloaded from Bioconductor has no entries in the NAMESPACE file for IntervalTree.

Errata and updates for Chapter 7

(Page 131)

i() { (head -n 2; tail -n 2) < "$1" | column -t}

should be

i() { (head -n 2; tail -n 2) < "$1" | column -t; }

Page 150

$ tail -n +6 Mus_musculus.GRCm38.75_chr1.gtf | awk -F "\t" '{print NF; exit}'
16

should be

$ tail -n +6 Mus_musculus.GRCm38.75_chr1.gtf | awk -F "\t" '{print NF; exit}'
9
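
The two counts can be reproduced on a single GTF-style record. This does not establish where the book's 16 came from; it only shows, as a sketch, that the same line yields 9 fields when split on tabs and 16 when split on whitespace. The record is an illustrative sample typed into the script, not read from the repository file.

```shell
# One GTF-style record: eight tab-separated columns plus an attribute
# column that itself contains spaces. The values are illustrative.
gtf_line=$(printf '1\tpseudogene\tgene\t3054233\t3054733\t.\t+\t.\tgene_id "ENSMUSG00000090025"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene";')

# Splitting on tabs counts the nine GTF columns.
tab_fields=$(printf '%s\n' "$gtf_line" | awk -F "\t" '{print NF; exit}')

# Default splitting on runs of whitespace also breaks up the attributes.
whitespace_fields=$(printf '%s\n' "$gtf_line" | awk '{print NF; exit}')

echo "$tab_fields $whitespace_fields"
```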

(Page 152)

(e.g., 50% with -S 50%).

should be

(e.g., 50% with -S 50%).

(Page 154)

uniq -d test.bed | wc -l had no output. uniq -d Mus_musculus.GRCm38.75_chr1.bed | wc -l printed 22925. Thus,

A file with duplicates, like the test.bed file, has multiple lines returned:

uniq -d test.bed | wc -l
22925

should be

A file with duplicates, like the Mus_musculus.GRCm38.75_chr1.bed file, has multiple lines returned:

uniq -d Mus_musculus.GRCm38.75_chr1.bed | wc -l
22925

(Page 156)

join -1 1 -2 1 -a 1 example_sorted.bed example_lengths_alt.txt # GNU join only

worked on Mac OS X 10.9.5

(Page 163)

brew tap homebrew/science; brew install bioawk

should be

brew tap brewsci/bio; brew install bioawk

https://github.com/Homebrew/homebrew-science/issues/6617

(Page 163)

(just as regular Awk sets the columns of a tabular text file to $1, $1, $2, etc.).

should be

(just as regular Awk sets the columns of a tabular text file to $1, $2, $3, etc.).

(Page 163)

let’s read in example.bed

should be

let’s read in Mus_musculus.GRCm38.75_chr1.gtf

(Page 164)

$ bioawk -c fastx '{print $name,length($seq)}' \
    Mus_musculus.GRCm38.75.dna_rm.toplevel.fa.gz > mm_genome.txt
$ head -n 4 mm_genome.txt

The file "Mus_musculus.GRCm38.75.dna_rm.toplevel.fa.gz" is located not in "chapter-07-unix-data-tools" but in "chapter-09-working-with-range-data" in the GitHub repository.

$ bioawk -c fastx '{print $name,length($seq)}' \
    ../chapter-09-working-with-range-data/Mus_musculus.GRCm38.75.dna_rm.toplevel_chr1.fa.gz > mm_genome.txt
$ head -n 4 mm_genome.txt
1	195471971
$ cat mm_genome.txt
1	195471971

Errata for Bioinformatics Data Skills

Confirmed Errata | O'Reilly Media

Computer environment

I’ve run examples in this book on Mac OS X 10.9.5 in which I installed Python 3.4.3 and R version 3.2.2.

Python 3.4.3 |Anaconda 2.3.0 (x86_64)| (default, Mar  6 2015, 12:07:41) 
[GCC 4.2.1 (Apple Inc. build 5577)] on darwin


> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

other attached packages:
[1] ggplot2_1.0.1

Chapter 3. Remedial Unix Shell

(Pages 38 and 45)

Doug McIlory

should be

Doug McIlroy

https://en.wikipedia.org/wiki/Douglas_McIlroy

(Page 49)

can can combine pipes and redirects easily:

should be

can combine pipes and redirects easily:

(Page 53)

The shell operator && executes subsequent commands only if previous commands have completed with a nonzero exit status:

should be

The shell operator && executes subsequent commands only if previous commands have completed with a zero exit status:
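
The corrected wording can be confirmed with the true and false commands, which exit with status 0 and 1 respectively. A small sketch (the variable names are illustrative):

```shell
# true exits with status 0 (success); false exits with status 1 (failure).
after_success="not run"
true && after_success="run"

after_failure="not run"
false && after_failure="run"

# || is the mirror image: it runs the next command only after a failure.
false || echo "previous command failed"

echo "after success: $after_success"
echo "after failure: $after_failure"
```

Only the assignment guarded by true is executed, which matches "completed with a zero exit status".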

Chapter 4. Working with Remote Machines

Chapter 5. Git for Scientists

(Page 89)

our remote repository to a local directory named zmay-snps-barbara/.

should be

our remote repository to a local directory named zmays-snps-barbara/.

(Page 89)

Now, in our original zmay-snps/ local

should be

Now, in our original zmays-snps/ local

Chapter 6. Bioinformatics Data

(Page 113)

The option -a enables wrsync’s archive mode,

should be

The option -a enables rsync’s archive mode,

(Page 122)

$ shasum cf5bb5f8bda2803410bb04b708bff59cb575e379 Mus_musculus.GRCm38.74.gtf.gz

should be

$ shasum Mus_musculus.GRCm38.74.gtf.gz
cf5bb5f8bda2803410bb04b708bff59cb575e379  Mus_musculus.GRCm38.74.gtf.gz

Chapter 12. Bioinformatics Shell Scripting, Writing Pipelines, and Parallelizing Tasks

(Page 403-404)

$ test -r some_file.txt ; echo $? $ is this file readable?
$ test -w some_file.txt ; echo $? $ is this file writable?

should be

$ test -r some_file.txt ; echo $? # is this file readable?
$ test -w some_file.txt ; echo $? # is this file writable?
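
For readers who want to try the corrected commands, here is a sketch using a scratch file in a temporary directory (the file names are placeholders). It relies on the fact that test prints nothing and reports its answer only through the exit status.

```shell
# Create a scratch file; the name is illustrative.
workdir=$(mktemp -d)
touch "$workdir/some_file.txt"

# test prints nothing; its answer is the exit status (0 means yes).
test -r "$workdir/some_file.txt"; readable_status=$?
test -w "$workdir/some_file.txt"; writable_status=$?

# A file that does not exist is neither readable nor writable.
test -r "$workdir/missing.txt"; missing_status=$?

echo "$readable_status $writable_status $missing_status"
```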

(Page 404-405)

./script.sh printed "./script.sh: line 6: $1: unbound variable".
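
An "unbound variable" message is what the shell reports when the nounset option (set -u) is active and a script reads $1 without having been given an argument. One way to guard against it is to check the argument count first. The sketch below writes a small hypothetical script to a temporary directory; it is not the book's script.sh, and the names demo.sh and reads.fastq are placeholders.

```shell
# Write a small hypothetical script; it is not the book's script.sh.
workdir=$(mktemp -d)
cat > "$workdir/demo.sh" <<'EOF'
set -u
# Check the argument count before touching $1, so that a missing
# argument produces a usage message instead of an unbound-variable error.
if [ "$#" -lt 1 ]; then
    echo "usage: demo.sh input_file" >&2
    exit 2
fi
echo "processing $1"
EOF

# Without an argument the guard fires; with one the script proceeds.
sh "$workdir/demo.sh"; no_arg_status=$?
with_arg_output=$(sh "$workdir/demo.sh" reads.fastq)
echo "$no_arg_status $with_arg_output"
```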

(Page 414)

find seqs -type f "!" -name "zmaysC*fastq"

printed "-bash: !: event not found" and should thus be

find seqs -type f ! -name "zmaysC*fastq"

(Page 414)

(we are still in the zmays/data directory):

should be

(we are still in the zmays-snps/data directory):

(Page 415)

find seqs -type f "!" -name "zmaysC*fastq" -and "!" -name "*-temp*"

should be

find seqs -type f ! -name "zmaysC*fastq" -and ! -name "*-temp*"
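
The corrected commands can be tried on placeholder files. In this sketch the file names are made up and live in a temporary directory; the find expressions are the ones from the corrections above.

```shell
# Placeholder files in a scratch directory (names are illustrative).
workdir=$(mktemp -d)
mkdir "$workdir/seqs"
touch "$workdir/seqs/zmaysA_R1.fastq" \
      "$workdir/seqs/zmaysC_R1.fastq" \
      "$workdir/seqs/zmaysB_R1-temp.fastq"
cd "$workdir" || exit 1

# Everything except the zmaysC files.
not_c=$(find seqs -type f ! -name "zmaysC*fastq" | sort)
echo "$not_c"

# Additionally drop anything with -temp in its name.
not_c_not_temp=$(find seqs -type f ! -name "zmaysC*fastq" -and ! -name "*-temp*" | sort)
echo "$not_c_not_temp"
```

In an interactive bash session, a bare ! followed by a space is not treated as a history reference; writing it as '!' in single quotes or as \! should also avoid the "event not found" error.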

Chapter 13. Out-of-Memory Approaches: Tabix and SQLite

Using Tabix version 1.2.1 on Mac OS X 10.9.5, man tabix printed "No manual entry for tabix".

Chapter 4 .tmux.conf error after updating tmux from 2.8 to newer version

The .tmux.conf file works only after changing set-window-option -g window-status-current-bg red to
set-window-option -g window-status-current-style bg=red.

Version of tmux used: tmux 3.0a

chapter5,page71

git://github.com/ih3/seqtk (wrong)
http://github.com/ih3/seqtk (correct)

Homebrew-science deprecated - alternative?

Hi Vince,

I'm setting up my first bioinformatics project by working through your book. I have just installed homebrew and then discovered that homebrew-science has been deprecated. Do you have any recommendations for an alternative I could use? I will be working with viral NGS data.

Thanks,

Kate

plotIRanges: Don't know how to add o to a plot

I need to do some work with ranges data, and naturally, I was re-reading the chapter on this. However, the plotting function no longer works:

> y = IRanges(c(1, 5, 5, 8), end = c(0, 6, 10, 10))
> plotIRanges(y)
Error: Don't know how to add o to a plot

This is a non-specific ggplot2 bug:

The error is caused by a namespace clash. A geom_rect function also exists in the bioinformatics package ggbio, which I had loaded. The fix is easy: simply add ggplot2:: in front of the calls.

Minor chapter 3 error

In chapter 3 p. 42, because the two protein .fasta files have no newlines at the end, I think the command

$ cat tb1-protein.fasta tga1-protein.fasta

will result in output that has the second fasta header beginning on the same line as the end of the previous protein sequence, rather than what is shown.
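
The behaviour described here can be reproduced with two small stand-in files. The sequences below are made up and the files are written to a temporary directory; only the missing final newline matters. The awk 1 line is one possible way to get the output the book shows, not necessarily what the author intended.

```shell
# Two FASTA-like files written without a final newline (contents are
# made up; printf adds no newline unless asked to).
workdir=$(mktemp -d)
printf '>tb1\nMLSTP' > "$workdir/tb1-protein.fasta"
printf '>tga1\nMDWDL' > "$workdir/tga1-protein.fasta"
cd "$workdir" || exit 1

# Plain cat glues the second header onto the end of the first sequence.
cat tb1-protein.fasta tga1-protein.fasta; echo

# awk prints every record followed by a newline, including an
# unterminated last line, so each header starts its own line.
awk 1 tb1-protein.fasta tga1-protein.fasta

# Count the headers that begin a line in each case.
headers_cat=$(cat tb1-protein.fasta tga1-protein.fasta | grep -c '^>')
headers_awk=$(awk 1 tb1-protein.fasta tga1-protein.fasta | grep -c '^>')
echo "$headers_cat $headers_awk"
```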

missing file

Hiya, I was wondering if this file gwascat2sqlite2table.py could be made available as well? Or was that never intended to be shared?

Cheers
Satz

Chapter 5, there might be two typos.

P91: "are two commits ahead of our local repository" might be "one commit"?
P105, Figure 5-6: "new-methods" might be "readme-changes"?

Regards,

Errata and updates for Chapter 10

Page 346

library(BiocInstaller)
biocLite('qrqc')

should be

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("qrqc")

https://www.bioconductor.org/packages/release/bioc/html/qrqc.html

Page 350-351

counts.update(seq.upper())

The Counter.update() method

should be

The counts.update() method

Page 351

Contributed by Itaya, KH.

For Python 3.7.3,
cat contam.fastq | ./nuccount.py printed the following messages:

  File "./nuccount.py", line 18
    print base + "\t" + str(counts[base])
             ^
SyntaxError: invalid syntax

#) nuccount.py

nuccount.py:18:    print base + "\t" + str(counts[base])

should be

nuccount.py:18:    print(base + "\t" + str(counts[base]))

#) readfq.py

readfq.py:39:    print n, '\t', slen, '\t', qlen

should be

readfq.py:39:    print(n, '\t', slen, '\t', qlen)
