vsbuffalo / bds-files
Supplementary files for my book, "Bioinformatics Data Skills"
License: MIT License
Truly amazing book—thank you for this great service. I think, in chapter 2 p. 29, the examples should read:
$ ls seqs/zmays[AB]_R1.fastq
$ ls seqs/zmays[A-B]_R1.fastq
i.e., there should be seqs/ in front.
(Page 181)
You access access R’s built-in documentation
should be
You access R’s built-in documentation
(Page 181)
apropos(norm)
printed "Error: is.character(what) is not TRUE", and thus should be
apropos("norm")
(Page 197)
(or use asis;
should be
(or use as.is;
(Page 202)
by dd$start >= 25800000 and dd$end <= 29700000 and
should be
by d$start >= 25800000 and d$end <= 29700000 and
(Page 205)
because d$percent is a vector,
should be
because d$percent.GC is a vector,
(Page 206)
Thus, d[$Pi > 3, ] is identical to d[which(d$Pi > 3), ];
should be
Thus, d[d$Pi > 3, ] is identical to d[which(d$Pi > 3), ];
(Page 208)
gplot2’s grammar through examples.
should be
ggplot2’s grammar through examples.
Page 218
Finding the Right Bin Width
ggplot(d) + geom_bar(aes(x=Pi), binwidth=1) + scale_x_continuous(limits=c(0.01, 80))
printed the following Warning messages:
## Warning: Ignoring unknown parameters: binwidth
## Warning: Removed 3373 rows containing non-finite values (stat_count).
Page 211
—everything is saturated from about 0.05 and below.
should be
—everything is saturated from about 0.005 and below.
(Page 216)
a factor (e.g., d$binned.GC)
should be
a factor (e.g., d$GC.binned)
(Page 217)
Total.SNPs,
should be
total.SNPs,
(Page 219)
try class(d$repClass) and levels(d$repClass)
should be
try class(reps$repClass) and levels(reps$repClass)
Page 222
The pos column is already present in
https://raw.githubusercontent.com/vsbuffalo/bds-files/master/chapter-08-r/motif_recombrates.txt
> mtfs <- read.delim("motif_recombrates.txt", header=TRUE, stringsAsFactors=TRUE)
> head(mtfs, 3)
chr motif_start motif_end dist recomb_start recomb_end recom
1 chrX 35471312 35471325 39323.0 35430651 35433340 0.0015
2 chrX 35471312 35471325 36977.0 35433339 35435344 0.0015
3 chrX 35471312 35471325 34797.5 35435343 35437699 0.0015
motif pos
1 CCTCCCTGACCAC chrX-35471312
2 CCTCCCTGACCAC chrX-35471312
3 CCTCCCTGACCAC chrX-35471312
(Page 231)
> set.seed(0) # we set the random number seed so this example is reproducible
should be added before
> ll <- list(a=rnorm(6, mean=1), b=rnorm(6, mean=4), c=rnorm(6, mean=6))
Page 232
> lapply(ll, mean)
$a
[1] 0.5103648
$b
[1] 0.09681026
$c
[1] -0.2847329
should be
> lapply(ll, mean)
$a
[1] 1.402273
$b
[1] 4.19003
$c
[1] 5.53541
(Page 244)
My output was not in the order "position GC.binned diversity" as shown in the book, but in the order "diversity position GC.binned", as shown below:
> select(d_df, -(start:cent))
Source: local data frame [59,140 x 3]
diversity position GC.binned
(dbl) (dbl) (fctr)
1 0.0000000 55500.5 (51.6,68.5]
because the order of R commands was as follows:
setwd("~/bds-files/chapter-08-r")
d <- read.csv("Dataset_S1.txt")
colnames(d)[12] <- "percent.GC"
d$cent <- d$start >= 25800000 & d$end <= 29700000
d$diversity <- d$Pi / (10*1000) # rescale, removing 10x and making per bp
d$position <- (d$end + d$start) / 2
d$GC.binned <- cut(d$percent.GC, 5)
(Page 246)
d_df %>% filter(percent.GC > 40) becomes filter(d_df, percent.GC > 40.
should be
d_df %>% filter(percent.GC > 40) becomes filter(d_df, percent.GC > 40).
(Page 249) The Double Backslash
To actually include a black‐ slash in a string,
should be
To actually include a back‐ slash in a string,
(Page 249)
Unlike grep(), regexpr(pattern, x)
should be
Unlike grep(), regexpr(pattern, text)
Page 251
> sub(" *[chrom]+(\\d+|X|Y|M) *", "chr\\1", c("chr19", "chrY"), perl=TRUE)
was changed to
> sub(" *[chrom]+(\\d+|X|Y|M) *", "chr\\1", c("chr19", "chromY"), perl=TRUE)
in Japanese edition.
https://www.oreilly.co.jp/books/9784873118635/
Page 255
• RStudio’s Rmarkdown tutorial
http://bit.ly/rstudio-rmkdwn
The page doesn't exist; it may have moved to:
https://rmarkdown.rstudio.com/
(Page 256)
https://stat.ethz.ch/R-manual/R-devel/library/base/html/args.html
Because there is a function args(), args should be avoided as a variable name.
## args.R -- a simple script to show command line args
args <- commandArgs(TRUE)
print(args)
Page 257
$ ls -l hotspots
[vinceb]% ls -l hotspots
should be
$ ls -l hotspots
(Page 261)
save.history()
should be
savehistory()
I got the following messages:
> savehistory()
Error in .External2(C_savehistory, file) :
'savehistory' is not currently implemented
It is worth noting that the import function on p313 (to import zipped gff file) will not work unless the rtracklayer library has specifically been installed:
library (rtracklayer)
mm_gtf <- import('Mus_mus.....gtf.gz')
$ find . -name "*.fastq" | xargs basename -s ".fastq"
I tried this but got the following error:
basename: invalid option -- 's'
I am running this on a Unix server and would appreciate some guidance.
$ basename --help
Usage: basename NAME [SUFFIX]
or: basename OPTION
Print NAME with any leading directory components removed.
If specified, also remove a trailing SUFFIX.
--help display this help and exit
--version output version information and exit
Examples:
basename /usr/bin/sort Output "sort".
basename include/stdio.h .h Output "stdio".
Report basename bugs to bug-coreutils@gnu.org
GNU coreutils home page: http://www.gnu.org/software/coreutils/
General help using GNU software: http://www.gnu.org/gethelp/
For complete documentation, run: info coreutils 'basename invocation'
[khorseik@ntu03 output]$ basename --version
basename (GNU coreutils) 8.4
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by David MacKenzie.
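A possible workaround, assuming the server's basename only supports the `basename NAME [SUFFIX]` form shown in the help text above: call basename once per file and pass the suffix as the second argument. The function name strip_fastq_suffix is made up for this sketch and is not from the book.

```shell
# Portable stand-in for:
#   find . -name "*.fastq" | xargs basename -s ".fastq"
# It relies only on the older form: basename NAME [SUFFIX]
strip_fastq_suffix() {
    # $1: directory to search (hypothetical helper, not from the book)
    find "$1" -name "*.fastq" | while IFS= read -r path; do
        basename "$path" .fastq
    done
}
```

Running `strip_fastq_suffix .` from the directory that holds the FASTQ files should print one sample name per line. File names containing newlines are not handled by this sketch.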
In the curl introduction in Chapter 6, the text mentions -O <filename>, but curl takes an output filename as an argument only with the -o option.
In addition, if you want the output filename to be the same as the remote one, you need the -O option, which doesn't take a filename. The text explains what happens "If you omit the filename argument", but the -o option requires a filename and the -O option cannot take one, so there is nothing to 'omit' here.
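To make the difference concrete, here is a minimal sketch. It uses a local file:// URL so it runs without network access; the helper name curl_output_demo and the file names are invented for the example.

```shell
# -o FILE : write the download to FILE (the filename argument is required)
# -O      : write to a file named after the remote file (takes no argument)
curl_output_demo() {
    workdir=$(mktemp -d)
    mkdir "$workdir/remote" "$workdir/local"
    printf 'chr1\t100\t200\n' > "$workdir/remote/example.bed"
    url="file://$workdir/remote/example.bed"
    (
        cd "$workdir/local" || exit 1
        curl -s -o renamed.bed "$url"   # saved as renamed.bed
        curl -s -O "$url"               # saved as example.bed
        ls                              # list the two downloaded files
    )
    rm -rf "$workdir"
}
```

Calling `curl_output_demo` lists the two files that curl wrote, one for each option.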
The Mus_musculus.GRCm38.75.dna_rm.toplevel.fa.gz file described on page 164 does not appear in the github files for Chapter 7.
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
export(pseudogene_sample, con="five_random_pseudogene.gtf", format="GTF"))
bed_data <- pseudogene_sample
The promoters() function extracts 2000 upstream, not 3000.
The "BSgenome.Mmusculus.UCSC.mm10" file size is 600.2MB.
The amy1_tx object should be assigned before amy1_exons, so the paragraph needs rearranging: the amy1_tx code chunk should precede the amy1_exons one.
An easier way may be:
amy1_txid <- amy1$tx_id # vector of tx_id
mm_exons[amy1_txid] # This should produce list of GRanges object by the transcripts
psetdiff() raised an error; it seems psetdiff() is no longer supported for GRangesList instances. I used setdiff() instead and it worked fine.
My number is as below,
> length(unique(queryHits(hits)))
[1] 57623
> length(unique(queryHits(hits)))/length(dbsnp137_resized)
[1] 0.02134185
The difference is small, but please check it again.
At the end of the paragraph on this page:
"...the number of unique query hits in this Hits object: 118,594."
Where did 118,594 come from? It should be 58343 (or 57623).
The code works, but the number differs from the one produced on my system. Please double-check whether the number is right.
It appears that IntervalTree is no longer exported by IRanges. The source package for 2.10.5 downloaded from Bioconductor has no entries in the NAMESPACE file for IntervalTree.
(Page 131)
i() { (head -n 2; tail -n 2) < "$1" | column -t}
should be
i() { (head -n 2; tail -n 2) < "$1" | column -t; }
Page 150
$ tail -n +6 Mus_musculus.GRCm38.75_chr1.gtf | awk -F "\t" '{print NF; exit}'
16
should be
$ tail -n +6 Mus_musculus.GRCm38.75_chr1.gtf | awk -F "\t" '{print NF; exit}'
9
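For anyone who wants to double-check the column count on their own copy, a small sketch follows. It skips header lines that start with # instead of using tail -n +6, which is a swapped-in approach; the helper name count_columns is hypothetical.

```shell
# Print the number of tab-separated fields on the first non-header line.
count_columns() {
    # $1: path to a tab-delimited file whose header lines start with "#"
    grep -v '^#' "$1" | awk -F "\t" '{print NF; exit}'
}
```

For example, `count_columns Mus_musculus.GRCm38.75_chr1.gtf` should report the same number as the corrected command above.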
(Page 152)
(e.g., 50% with -S 50%
).
should be
(e.g., 50% with -S 50%
).
(Page 154)
uniq -d test.bed | wc -l
had no output, whereas
uniq -d Mus_musculus.GRCm38.75_chr1.bed | wc -l
printed 22925. Thus,
A file with duplicates, like the test.bed file, has multiple lines returned:
uniq -d test.bed | wc -l
22925
should be
A file with duplicates, like the Mus_musculus.GRCm38.75_chr1.bed file, has multiple lines returned:
uniq -d Mus_musculus.GRCm38.75_chr1.bed | wc -l
22925
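A related point when reproducing this: uniq only compares adjacent lines, so an unsorted file can hide duplicates. Here is a minimal sketch that sorts first; the helper name count_duplicated_lines is made up for illustration.

```shell
# Count how many distinct lines occur more than once in a file.
count_duplicated_lines() {
    # sort makes identical lines adjacent so that uniq -d can see them;
    # tr strips the padding that some wc implementations add.
    sort "$1" | uniq -d | wc -l | tr -d ' '
}
```

Comparing `count_duplicated_lines test.bed` with the plain `uniq -d test.bed | wc -l` shows whether sorting changes the answer for a given file.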
(Page 156)
join -1 1 -2 1 -a 1 example_sorted.bed example_lengths_alt.txt # GNU join only
worked on Mac OS X 10.9.5
(Page 163)
brew tap homebrew/science; brew install bioawk
should be
brew tap brewsci/bio; brew install bioawk
https://github.com/Homebrew/homebrew-science/issues/6617
(Page 163)
(just as regular Awk sets the columns of a tabular text file to $1, $1, $2,
etc.).
should be
(just as regular Awk sets the columns of a tabular text file to $1, $2, $3,
etc.).
(Page 163)
let’s read in example.bed
should be
let’s read in Mus_musculus.GRCm38.75_chr1.gtf
(Page 164)
$ bioawk -c fastx '{print $name,length($seq)}' \
Mus_musculus.GRCm38.75.dna_rm.toplevel.fa.gz > mm_genome.txt
$ head -n 4 mm_genome.txt
The file "Mus_musculus.GRCm38.75.dna_rm.toplevel.fa.gz" is located not in "chapter-07-unix-data-tools" but in "chapter-09-working-with-range-data" in the GitHub repository.
$ bioawk -c fastx '{print $name,length($seq)}' \
../chapter-09-working-with-range-data/Mus_musculus.GRCm38.75.dna_rm.toplevel_chr1.fa.gz > mm_genome.txt
$ head -n 4 mm_genome.txt
1 195471971
$ cat mm_genome.txt
1 195471971
Confirmed Errata | O'Reilly Media
I’ve run examples in this book on Mac OS X 10.9.5 in which I installed Python 3.4.3 and R version 3.2.2.
Python 3.4.3 |Anaconda 2.3.0 (x86_64)| (default, Mar 6 2015, 12:07:41)
[GCC 4.2.1 (Apple Inc. build 5577)] on darwin
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)
other attached packages:
[1] ggplot2_1.0.1
(Pages 38 and 45)
Doug McIlory
should be
Doug McIlroy
https://en.wikipedia.org/wiki/Douglas_McIlroy
(Page 49)
can can combine pipes and redirects easily:
should be
can combine pipes and redirects easily:
(Page 53)
The shell operator && executes subsequent commands only if previous commands have completed with a nonzero exit status:
should be
The shell operator && executes subsequent commands only if previous commands have completed with a zero exit status:
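The corrected wording is easy to confirm in a shell. The sketch below uses the standard true and false commands, which exit with status 0 and 1 respectively; the function name and_or_demo is just for illustration.

```shell
and_or_demo() {
    # && runs the next command only after a zero (success) exit status
    true  && echo "after success"
    # false exits nonzero, so this echo is skipped
    false && echo "after failure"
    # || is the opposite: it runs the next command only after a nonzero status
    false || echo "fallback"
}

and_or_demo
```

Only the first and third messages are printed.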
(Page 89)
our remote repository to a local directory named zmay-snps-barbara/.
should be
our remote repository to a local directory named zmays-snps-barbara/.
(Page 89)
Now, in our original zmay-snps/ local
should be
Now, in our original zmays-snps/ local
(Page 113)
The option -a enables wrsync’s archive mode,
should be
The option -a enables rsync’s archive mode,
(Page 122)
$ shasum cf5bb5f8bda2803410bb04b708bff59cb575e379 Mus_musculus.GRCm38.74.gtf.gz
should be
$ shasum Mus_musculus.GRCm38.74.gtf.gz
cf5bb5f8bda2803410bb04b708bff59cb575e379 Mus_musculus.GRCm38.74.gtf.gz
(Page 403-404)
$ test -r some_file.txt ; echo $? $ is this file readable?
$ test -w some_file.txt ; echo $? $ is this file writable?
should be
$ test -r some_file.txt ; echo $? # is this file readable?
$ test -w some_file.txt ; echo $? # is this file writable?
(Page 404-405)
./script.sh
printed "./script.sh: line 6: $1: unbound variable".
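That message is what bash prints under set -u when $1 is expanded but no argument was passed. One way to get a friendlier failure, sketched here rather than taken from the book's script, is to check the argument count before touching $1. The function name check_args and the usage text are made up.

```shell
# The body runs in a subshell (note the parentheses) so that set -u and
# exit do not affect the calling shell.
check_args() (
    set -u
    if [ "$#" -lt 1 ]; then
        # No argument given: explain instead of tripping over unbound $1
        echo "usage: script.sh <input_file>" >&2
        exit 1
    fi
    echo "processing $1"
)
```

`check_args reads.fastq` prints a message and succeeds, while `check_args` with no argument prints the usage line and returns a nonzero status.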
(Page 414)
find seqs -type f "!" -name "zmaysC*fastq"
printed "-bash: !: event not found" and should thus be
find seqs -type f ! -name "zmaysC*fastq"
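The "event not found" message comes from bash history expansion in an interactive shell. A sketch of the unquoted form wrapped in a function follows; the helper name list_non_c_fastq is hypothetical.

```shell
list_non_c_fastq() {
    # $1: directory to search
    # An unquoted ! followed by a space is not history-expanded by bash.
    # Writing it as \! or '!' (single quotes) is also safe interactively.
    find "$1" -type f ! -name "zmaysC*fastq"
}
```

With the book's layout, `list_non_c_fastq seqs` should list every file under seqs/ except those matching zmaysC*fastq.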
(Page 414)
(we are still in the zmays/data directory):
should be
(we are still in the zmays-snps/data directory):
(Page 415)
find seqs -type f "!" -name "zmaysC*fastq" -and "!" -name "*-temp*"
should be
find seqs -type f ! -name "zmaysC*fastq" -and ! -name "*-temp*"
Using Tabix Version: 1.2.1 on Mac OS X 10.9.5, man tabix
printed "No manual entry for tabix"
The .tmux.conf file works only after changing
set-window-option -g window-status-current-bg red
to
set-window-option -g window-status-current-style bg=red
Version of tmux used: tmux 3.0a
git://github.com/ih3/seqtk (wrong)
http://github.com/ih3/seqtk (correct)
Hi Vince,
I'm setting up my first bioinformatics project by working through your book. I have just installed homebrew and then discovered that homebrew-science has been deprecated. Do you have any recommendations for an alternative I could use? I will be working with viral NGS data.
Thanks,
Kate
I need to do some work with ranges data, and naturally, I was re-reading the chapter on this. However, the plotting function no longer works:
> y = IRanges(c(1, 5, 5, 8), end = c(0, 6, 10, 10))
> plotIRanges(y)
Error: Don't know how to add o to a plot
This is not a bug specific to ggplot2: it happens because of namespacing. The geom_rect functions also exist in the bioinformatics package ggbio, which I had loaded. The fix is easy: simply add ggplot2:: in front of the calls.
In chapter 3 p. 42, because the two protein .fasta files have no newlines at the end, I think the command
$ cat tb1-protein.fasta tga1-protein.fasta
will result in output that has the second fasta header beginning on the same line as the end of the previous protein sequence, rather than what is shown.
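If the files really do lack a final newline, one possible workaround is to add the missing newline while concatenating. This is a sketch with a made-up helper name, not something from the book.

```shell
# Concatenate files, adding a newline after any file that does not end with one.
cat_with_newlines() {
    for f in "$@"; do
        cat "$f"
        # tail -c 1 prints the last byte. Command substitution strips a
        # trailing newline, so the result is non-empty only when the file
        # ends with some other character.
        if [ -n "$(tail -c 1 "$f")" ]; then
            echo
        fi
    done
}
```

`cat_with_newlines tb1-protein.fasta tga1-protein.fasta` should then put the second FASTA header on its own line.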
Hiya, I was wondering if this file gwascat2sqlite2table.py could be made available as well? Or was that never intended to be shared?
Cheers
Satz
P91: "are two commits ahead of our local repository" might be "one commit"?
P105, Figure 5-6: "new-methods" might be "readme-changes"?
Regards,
Page 346
library(BiocInstaller)
biocLite('qrqc')
should be
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("qrqc")
https://www.bioconductor.org/packages/release/bioc/html/qrqc.html
Page 350-351
counts.update(seq.upper())
The Counter.update()
method
should be
The counts.update()
method
Page 351
Contributed by Itaya, KH.
For Python 3.7.3,
cat contam.fastq | ./nuccount.py
printed the following messages:
File "./nuccount.py", line 18
print base + "\t" + str(counts[base])
^
SyntaxError: invalid syntax
#) nuccount.py
nuccount.py:18: print base + "\t" + str(counts[base])
should be
nuccount.py:18: print ( base + "\t" + str(counts[base]) )
#) readfq.py
readfq.py:39: print n, '\t', slen, '\t', qlen
should be
readfq.py:39: print ( n, '\t', slen, '\t', qlen )