
bds-files's People

Contributors

peterjc, vsbuffalo


bds-files's Issues

Minor chapter 2 error

Truly amazing book—thank you for this great service. I think, in chapter 2 p. 29, the examples should read:

$ ls seqs/zmays[AB]_R1.fastq
$ ls seqs/zmays[A-B]_R1.fastq

i.e., there should be seqs/ in front.

Errata and updates for Chapter 8


(Page 181)

You access access R’s built-in documentation

should be

You access R’s built-in documentation


(Page 181)

apropos(norm)

printed "Error: is.character(what) is not TRUE", and thus should be

apropos("norm")


(Page 197)

(or use asis;

should be

(or use as.is;


(Page 202)

by dd$start >= 25800000 and dd$end <= 29700000 and

should be

by d$start >= 25800000 and d$end <= 29700000 and


(Page 205)

because d$percent is a vector,

should be

because d$percent.GC is a vector,


(Page 206)

Thus, d[$Pi > 3, ] is identical to d[which(d$Pi > 3), ];

should be

Thus, d[d$Pi > 3, ] is identical to d[which(d$Pi > 3), ];


(Page 208)

gplot2’s grammar through examples.

should be

ggplot2’s grammar through examples.


Page 218

Finding the Right Bin Width

ggplot(d) + geom_bar(aes(x=Pi), binwidth=1) + scale_x_continuous(limits=c(0.01, 80))

printed the following Warning messages:

## Warning: Ignoring unknown parameters: binwidth
## Warning: Removed 3373 rows containing non-finite values (stat_count).

Page 211

—everything is saturated from about 0.05 and below.

should be

—everything is saturated from about 0.005 and below.


(Page 216)

a factor (e.g., d$binned.GC)

should be

a factor (e.g., d$GC.binned)


(Page 217)

Total.SNPs,

should be

total.SNPs,


(Page 219)

try class(d$repClass) and levels(d$repClass)

should be

try class(reps$repClass) and levels(reps$repClass)


Page 222

The pos column is already present in
https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-08-r/motif_recombrates.txt

> mtfs <- read.delim("motif_recombrates.txt", header=TRUE, stringsAsFactors=TRUE)
> head(mtfs, 3)
   chr motif_start motif_end    dist recomb_start recomb_end  recom
1 chrX    35471312  35471325 39323.0     35430651   35433340 0.0015
2 chrX    35471312  35471325 36977.0     35433339   35435344 0.0015
3 chrX    35471312  35471325 34797.5     35435343   35437699 0.0015
          motif           pos
1 CCTCCCTGACCAC chrX-35471312
2 CCTCCCTGACCAC chrX-35471312
3 CCTCCCTGACCAC chrX-35471312

(Page 231)

> set.seed(0) # we set the random number seed so this example is reproducible

should be added before

> ll <- list(a=rnorm(6, mean=1), b=rnorm(6, mean=4), c=rnorm(6, mean=6))

Page 232

> lapply(ll, mean) 
$a
[1] 0.5103648
$b
[1] 0.09681026 
$c
[1] -0.2847329

should be

> lapply(ll, mean)
$a
[1] 1.402273

$b
[1] 4.19003

$c
[1] 5.53541

(Page 244)

My output was not in the order "position GC.binned diversity" shown in the book but in the order "diversity position GC.binned", as shown below:

> select(d_df, -(start:cent))
Source: local data frame [59,140 x 3]

   diversity position   GC.binned
       (dbl)    (dbl)      (fctr)
1  0.0000000  55500.5 (51.6,68.5]

because the order of R commands was as follows:

    setwd("~/bds-files/chapter-08-r")
    d <- read.csv("Dataset_S1.txt")
    colnames(d)[12] <- "percent.GC"
    d$cent <- d$start >= 25800000 & d$end <= 29700000
    d$diversity <- d$Pi / (10*1000) # rescale, removing 10x and making per bp
    d$position <- (d$end + d$start) / 2
    d$GC.binned <- cut(d$percent.GC, 5)

(Page 246)

d_df %>% filter(percent.GC > 40) becomes filter(d_df, percent.GC > 40.

should be

d_df %>% filter(percent.GC > 40) becomes filter(d_df, percent.GC > 40).


(Page 249) The Double Backslash

To actually include a black‐ slash in a string,

should be

To actually include a back‐ slash in a string,


(Page 249)

Unlike grep(), regexpr(pattern, x)

should be

Unlike grep(), regexpr(pattern, text)


Page 251

> sub(" *[chrom]+(\\d+|X|Y|M) *", "chr\\1", c("chr19", "chrY"), perl=TRUE)

was changed to

> sub(" *[chrom]+(\\d+|X|Y|M) *", "chr\\1", c("chr19", "chromY"), perl=TRUE)

in Japanese edition.
https://www.oreilly.co.jp/books/9784873118635/


Page 255

• RStudio’s Rmarkdown tutorial
http://bit.ly/rstudio-rmkdwn

The page no longer exists; it may have moved to:

https://rmarkdown.rstudio.com/


(Page 256)

https://stat.ethz.ch/R-manual/R-devel/library/base/html/args.html
Because R already has a function named args(), the variable name args should be avoided.

## args.R -- a simple script to show command line args
args <- commandArgs(TRUE)
print(args)

Page 257

$ ls -l hotspots
[vinceb]% ls -l hotspots

should be

$ ls -l hotspots

(Page 261)

save.history()

should be

savehistory()

I got the following messages:

> savehistory()
Error in .External2(C_savehistory, file) : 
  'savehistory' is not currently implemented

p313 - import function

It is worth noting that the import function on p313 (used to import a gzipped GTF file) will not work unless the rtracklayer package has been installed and loaded:

library(rtracklayer)
mm_gtf <- import('Mus_mus.....gtf.gz')

Page 418

$ find . -name "*.fastq" | xargs basename -s ".fastq"

I tried this but got the following error:

basename: invalid option -- 's'

I am running this on a Unix server and would appreciate any guidance.

$ basename --help
Usage: basename NAME [SUFFIX]
or: basename OPTION
Print NAME with any leading directory components removed.
If specified, also remove a trailing SUFFIX.

  --help     display this help and exit
  --version  output version information and exit

Examples:
basename /usr/bin/sort Output "sort".
basename include/stdio.h .h Output "stdio".

Report basename bugs to bug-coreutils@gnu.org
GNU coreutils home page: http://www.gnu.org/software/coreutils/
General help using GNU software: http://www.gnu.org/gethelp/
For complete documentation, run: info coreutils 'basename invocation'
[khorseik@ntu03 output]$ basename --version
basename (GNU coreutils) 8.4
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by David MacKenzie.
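
One possible workaround, offered as a sketch rather than a fix tested on this server: the help text above shows that this older GNU basename accepts the form basename NAME [SUFFIX], so the suffix can be passed as a second argument, one file at a time, instead of using -s. The directory and file names below are made up for illustration.

```shell
# Hypothetical scratch directory with two empty FASTQ files
demo_dir=$(mktemp -d)
mkdir "$demo_dir/seqs"
touch "$demo_dir/seqs/sampleA.fastq" "$demo_dir/seqs/sampleB.fastq"

# basename NAME SUFFIX strips the directory and the trailing suffix;
# -exec ... \; runs basename once per file, so no -s option is needed
sample_names=$(find "$demo_dir" -name "*.fastq" -exec basename {} .fastq \; | sort)
echo "$sample_names"
```

Running basename once per file is slower than a single xargs call, but it relies only on the two-argument form listed in the help output.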

Chapter 6 curl options of -O and -o <filename>

In the curl introduction in Chapter 6, the text mentions -O <filename>, but curl takes an output filename as an argument only with the -o option.
In addition, if you want the output file to have the same name as the remote file, you need the -O option, which does not take a filename. The text explains what happens "If you omit the filename argument", but -o always requires a filename and -O never accepts one, so there is nothing to omit here.

Mistake in Figure 9-2.

I think there is a mistake in Figure 9-2. In the present Figure 9-2, the ranges [1,5) and [2,5], and likewise [8,9) and [9,9], don't seem to represent the same sequences. I think this is because the index should start from 1 in the 1-based system, as shown in the figure below.

image

Page 164

The Mus_musculus.GRCm38.75.dna_rm.toplevel.fa.gz file described on page 164 does not appear in the GitHub files for Chapter 7.

Errors found in chapter 9

R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

p314

export(pseudogene_sample, con="five_random_pseudogene.gtf", format="GTF"))

bed_data <- pseudogene_sample

p315

The promoters() function extracts 2000 upstream, not 3000.

p316

"BSgenome.Mmusculus.UCSC.mm10" file size is 600.2MB

p323

The amy1_tx object should be assigned before amy1_exons, so the paragraph needs rearranging: the amy1_tx code chunk should precede the amy1_exons one.

An easier way may be:

amy1_txid <- amy1$tx_id  # vector of tx_id
mm_exons[amy1_txid] # This should produce list of GRanges object by the transcripts

psetdiff() raised an error; it seems psetdiff() is no longer supported for GRangesList instances. I used setdiff() instead and it worked fine.

p324

psetdiff() no longer works here either. I used setdiff() instead.

p326

My numbers are as follows:

> length(unique(queryHits(hits)))
[1] 57623
> length(unique(queryHits(hits)))/length(dbsnp137_resized)
[1] 0.02134185

The difference is small, but please check it again.

At the end of the paragraph on this page:
"...the number of unique query hits in this Hits object: 118,594."
Where did 118,594 come from? It should be 58343 (or 57623).

p327

The code works, but the number is different from the one produced on my system. Please double-check that the number is right.

Errata and updates for Chapter 7


(Page 131)

i() { (head -n 2; tail -n 2) < "$1" | column -t}

should be

i() { (head -n 2; tail -n 2) < "$1" | column -t; }


Page 150

$ tail -n +6 Mus_musculus.GRCm38.75_chr1.gtf | awk -F "\t" '{print NF; exit}'
16

should be

$ tail -n +6 Mus_musculus.GRCm38.75_chr1.gtf | awk -F "\t" '{print NF; exit}'
9
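
As a small illustration of why the count is 9: awk's NF variable holds the number of fields on the current line, and with -F "\t" only tabs separate fields, so spaces inside the last column do not add to the count. The record below is a made-up, GTF-like line with nine tab-separated columns, not a line from the real file.

```shell
# A made-up GTF-like record: nine columns separated by tabs;
# the ninth (attribute) column contains spaces but no tabs
record=$(printf '1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "g1"; gene_name "n1";')

# With tab as the field separator, NF is the number of columns
tab_fields=$(printf '%s\n' "$record" | awk -F "\t" '{print NF; exit}')

# With awk's default whitespace splitting, the attribute column breaks apart
space_fields=$(printf '%s\n' "$record" | awk '{print NF; exit}')

echo "tab-separated fields: $tab_fields"
echo "whitespace-separated fields: $space_fields"
```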

(Page 152)

(e.g., 50% with -S 50%).

should be

(e.g., 50% with -S 50%).


(Page 154)

uniq -d test.bed | wc -l had no output. uniq -d Mus_musculus.GRCm38.75_chr1.bed | wc -l printed 22925. Thus,

A file with duplicates, like the test.bed file, has multiple lines returned:

uniq -d test.bed | wc -l
22925

should be

A file with duplicates, like the Mus_musculus.GRCm38.75_chr1.bed file, has multiple lines returned:

uniq -d Mus_musculus.GRCm38.75_chr1.bed | wc -l
22925
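
For readers who want to check this behavior without the book's files, here is a minimal sketch on two made-up BED-like files (the names dup_demo.bed and nodup_demo.bed are hypothetical). uniq -d prints one copy of each line that is repeated on adjacent lines, so wc -l then counts the distinct duplicated lines.

```shell
demo_dir=$(mktemp -d)

# Made-up BED-like file: the first two lines are identical and adjacent
printf 'chr1\t10\t20\nchr1\t10\t20\nchr1\t30\t40\n' > "$demo_dir/dup_demo.bed"
dup_count=$(uniq -d "$demo_dir/dup_demo.bed" | wc -l | tr -d ' ')

# A file with no repeated lines produces no output from uniq -d
printf 'chr1\t10\t20\nchr1\t30\t40\n' > "$demo_dir/nodup_demo.bed"
nodup_count=$(uniq -d "$demo_dir/nodup_demo.bed" | wc -l | tr -d ' ')

echo "duplicated lines: $dup_count, in the file without duplicates: $nodup_count"
```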

(Page 156)

join -1 1 -2 1 -a 1 example_sorted.bed example_lengths_alt.txt # GNU join only

worked on Mac OS X 10.9.5


(Page 163)

brew tap homebrew/science; brew install bioawk

should be

brew tap brewsci/bio; brew install bioawk

https://github.com/Homebrew/homebrew-science/issues/6617


(Page 163)

(just as regular Awk sets the columns of a tabular text file to $1, $1, $2, etc.).

should be

(just as regular Awk sets the columns of a tabular text file to $1, $2, $3, etc.).


(Page 163)

let’s read in example.bed

should be

let’s read in Mus_musculus.GRCm38.75_chr1.gtf


(Page 164)

$ bioawk -c fastx '{print $name,length($seq)}' \
    Mus_musculus.GRCm38.75.dna_rm.toplevel.fa.gz > mm_genome.txt
$ head -n 4 mm_genome.txt

The file "Mus_musculus.GRCm38.75.dna_rm.toplevel.fa.gz" is located not in "chapter-07-unix-data-tools" but in "chapter-09-working-with-range-data" in the GitHub repository.

$ bioawk -c fastx '{print $name,length($seq)}' \
    ../chapter-09-working-with-range-data/Mus_musculus.GRCm38.75.dna_rm.toplevel_chr1.fa.gz > mm_genome.txt
$ head -n 4 mm_genome.txt
1	195471971
$ cat mm_genome.txt
1	195471971

Errata for Bioinformatics Data Skills


Confirmed Errata | O'Reilly Media


Computer environment

I’ve run the examples in this book on Mac OS X 10.9.5, on which I installed Python 3.4.3 and R version 3.2.2.

Python 3.4.3 |Anaconda 2.3.0 (x86_64)| (default, Mar  6 2015, 12:07:41) 
[GCC 4.2.1 (Apple Inc. build 5577)] on darwin


> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

other attached packages:
[1] ggplot2_1.0.1

Chapter 3. Remedial Unix Shell


(Pages 38 and 45)

Doug McIlory

should be

Doug McIlroy

https://en.wikipedia.org/wiki/Douglas_McIlroy


(Page 49)

can can combine pipes and redirects easily:

should be

can combine pipes and redirects easily:


(Page 53)

The shell operator && executes subsequent commands only if previous commands have completed with a nonzero exit status:

should be

The shell operator && executes subsequent commands only if previous commands have completed with a zero exit status:
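
A minimal sketch of the corrected statement, using the standard true and false commands (the variable names are only for illustration): the command after && runs only when the command before it exits with status zero.

```shell
# true exits with status 0, so the command after && runs
true && after_success="ran"

# false exits with a nonzero status, so the command after && is skipped
false && after_failure="ran"

echo "after true:  ${after_success:-skipped}"
echo "after false: ${after_failure:-skipped}"
```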


Chapter 4. Working with Remote Machines


Chapter 5. Git for Scientists


(Page 89)

our remote repository to a local directory named zmay-snps-barbara/.

should be

our remote repository to a local directory named zmays-snps-barbara/.


(Page 89)

Now, in our original zmay-snps/ local

should be

Now, in our original zmays-snps/ local


Chapter 6. Bioinformatics Data


(Page 113)

The option -a enables wrsync’s archive mode,

should be

The option -a enables rsync’s archive mode,


(Page 122)

$ shasum cf5bb5f8bda2803410bb04b708bff59cb575e379 Mus_musculus.GRCm38.74.gtf.gz

should be

$ shasum Mus_musculus.GRCm38.74.gtf.gz
cf5bb5f8bda2803410bb04b708bff59cb575e379  Mus_musculus.GRCm38.74.gtf.gz


Chapter 12. Bioinformatics Shell Scripting, Writing Pipelines, and Parallelizing Tasks


(Page 403-404)

$ test -r some_file.txt ; echo $? $ is this file readable?
$ test -w some_file.txt ; echo $? $ is this file writable?

should be

$ test -r some_file.txt ; echo $? # is this file readable?
$ test -w some_file.txt ; echo $? # is this file writable?

(Page 404-405)

./script.sh printed "./script.sh: line 6: $1: unbound variable".
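
This is the message bash prints when a script that enables set -u refers to $1 but is run without arguments; whether that is the demonstration intended on these pages is not clear from the report. A minimal sketch of the behavior and of one common guard, using a hypothetical function name:

```shell
set -u   # treat references to unset variables as errors

# Hypothetical helper: ${1:-} expands to an empty string when no argument
# is given, so the check itself does not trip set -u
first_arg_or_default() {
    if [ -z "${1:-}" ]; then
        echo "no argument given"
    else
        echo "argument: $1"
    fi
}

first_arg_or_default
first_arg_or_default seqs/zmaysA_R1.fastq
```

Referring to a bare $1 inside the function instead of ${1:-} would reproduce the "unbound variable" error when the function is called without arguments.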


(Page 414)

find seqs -type f "!" -name "zmaysC*fastq"

printed "-bash: !: event not found" and should thus be

find seqs -type f ! -name "zmaysC*fastq"

(Page 414)

(we are still in the zmays/data directory):

should be

(we are still in the zmays-snps/data directory):


(Page 415)

find seqs -type f "!" -name "zmaysC*fastq" -and "!" -name "*-temp*"

should be

find seqs -type f ! -name "zmaysC*fastq" -and ! -name "*-temp*"
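
The corrected commands can be tried on a scratch directory. The sketch below uses made-up file names in the style of the chapter's examples, with an unquoted ! followed by a space, as in the corrections above.

```shell
demo_dir=$(mktemp -d)
mkdir "$demo_dir/seqs"
# Made-up file names, not the book's actual data
touch "$demo_dir/seqs/zmaysA_R1.fastq" "$demo_dir/seqs/zmaysC_R1.fastq" \
      "$demo_dir/seqs/zmaysB_R1-temp.fastq"
cd "$demo_dir"

# An unquoted ! negates the test that follows it
not_c=$(find seqs -type f ! -name "zmaysC*fastq" | sort)

# Two negated tests joined with -and
not_c_not_temp=$(find seqs -type f ! -name "zmaysC*fastq" -and ! -name "*-temp*")

echo "$not_c"
echo "$not_c_not_temp"
```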

Chapter 13. Out-of-Memory Approaches: Tabix and SQLite


Using Tabix Version: 1.2.1 on Mac OS X 10.9.5, man tabix printed "No manual entry for tabix"


Homebrew-science deprecated - alternative?

Hi Vince,

I'm setting up my first bioinformatics project by working through your book. I have just installed homebrew and then discovered that homebrew-science has been deprecated. Do you have any recommendations for an alternative I could use? I will be working with viral NGS data.

Thanks,

Kate

plotIRanges: Don't know how to add o to a plot

I need to do some work with ranges data, and naturally, I was re-reading the chapter on this. However, the plotting function no longer works:

> y = IRanges(c(1, 5, 5, 8), end = c(0, 6, 10, 10))
> plotIRanges(y)
Error: Don't know how to add o to a plot

This is a non-specific ggplot2 bug:

The bug is caused by a namespace conflict. The geom_rect functions also exist in the bioinformatics package ggbio, which I had loaded. The fix is easy: simply add ggplot2:: in front of the calls.

Minor chapter 3 error

In chapter 3 p. 42, because the two protein .fasta files have no newlines at the end, I think the command

$ cat tb1-protein.fasta tga1-protein.fasta

will result in output that has the second fasta header beginning on the same line as the end of the previous protein sequence, rather than what is shown.
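
The effect is easy to reproduce with two made-up FASTA files written without a final newline (a.fasta and b.fasta are hypothetical stand-ins, not the book's files). The sketch also shows one way to get each header on its own line, using awk, which prints every input record followed by a newline.

```shell
demo_dir=$(mktemp -d)
# Two made-up FASTA files written without a final newline
printf '>seq1\nMKV' > "$demo_dir/a.fasta"
printf '>seq2\nMLT' > "$demo_dir/b.fasta"

# Plain cat joins the last line of a.fasta to the header of b.fasta
joined=$(cat "$demo_dir/a.fasta" "$demo_dir/b.fasta")

# awk terminates every record with a newline, which supplies
# the newline missing at the end of each file
separated=$(awk '{print}' "$demo_dir/a.fasta" "$demo_dir/b.fasta")

printf '%s\n---\n%s\n' "$joined" "$separated"
```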

missing file

Hiya, I was wondering if this file gwascat2sqlite2table.py could be made available as well? Or was that never intended to be shared?

Cheers
Satz

Errata and updates for Chapter 10


Page 346

library(BiocInstaller)
biocLite('qrqc')

should be

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("qrqc")

https://www.bioconductor.org/packages/release/bioc/html/qrqc.html


Page 350-351

counts.update(seq.upper())

The Counter.update() method

should be

The counts.update() method


Page 351

Contributed by Itaya, KH.

For Python 3.7.3,
cat contam.fastq | ./nuccount.py printed the following messages:

  File "./nuccount.py", line 18
    print base + "\t" + str(counts[base])
             ^
SyntaxError: invalid syntax

#) nuccount.py

nuccount.py:18:    print base + "\t" + str(counts[base])

should be

nuccount.py:18:    print ( base + "\t" + str(counts[base]) )

#) readfq.py

readfq.py:39:    print n, '\t', slen, '\t', qlen

should be

readfq.py:39:    print ( n, '\t', slen, '\t', qlen )
