shawhahnlab / chiimp Goto Github PK
View Code? Open in Web Editor NEWComputational, High-throughput Individual Identification through Microsatellite Profiling
License: GNU Affero General Public License v3.0
Computational, High-throughput Individual Identification through Microsatellite Profiling
License: GNU Affero General Public License v3.0
There is currently no feature for automated scoring of results for known samples. Most information is already provided in the existing input files, though, and the mapping of sample identifiers to individual names could be done via an extra column in the input spreadsheet for the dataset. A logical home for this step might be inside summarize_dataset
, for the case where the known genotypes table is supplied and a name column was loaded from the dataset spreadsheet.
prepare_dataset should be updated to give a more intuitive error message if the data directory is missing at run time:
Error in locusmap[[locus]] :
attempt to select less than one element in get1index
Since an installation via R replaces the package files but doesn't include the extra step of gunzip "$chiimp_path/Contents/MacOS/droplet.gz"
from install_mac.command, the Desktop icon breaks.
I can never remember how the integers in the ord
argument to prepare_datset or dataset_opts$ord
configuration setting match up with the Replicate, Sample, and Locus fields. prepare_dataset should be updated to allow a case-insensitive character vector using field names instead.
Automated installation fails on Windows for a username with spaces in it. This is due to how install_windows.cmd builds command strings for R and for the shortcut creation command, so moving the bulk of the commands directly into R may simplify that and should avoid these shell escaping issues.
Currently all sample analysis options are provided in the configuration file and applied for all loci, but there's an obvious use case for allowing per-locus values (e.g. varying propensity for PCR stutter in one locus over another). Allowing optional columns in the locus attributes table would be a straightforward way to handle this.
Locus-multiplexed sequence files currently get separately loaded and analyzed for each locus, even though the output of analyze_sample
will be identical each time. It would be more efficient to track files separately from sample/locus pairs and avoid the duplicated processing.
The automated installer script for Mac OS (install_mac.command) isn't marked executable in the repository. This needs to be fixed for the one-click install.
Currently load_csv
always assigns its own row names, but that should be skipped if it was called with a row.names argument.
Running R CMD check chiimp_0.2.2.tar.gz
fails on recent development versions of R:
...
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether package ‘chiimp’ can be installed ... ERROR
Installation failed.
See ‘/data/home/jesse/dev/chiimp-r-ver/chiimp.Rcheck/00install.out’ for details.
* DONE
Status: 1 ERROR
See
‘/data/home/jesse/dev/chiimp-r-ver/chiimp.Rcheck/00check.log’
for details.
The file chiimp.Rcheck/00install.out shows:
* installing *source* package ‘chiimp’ ...
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
ERROR: hard-coded installation path: please report to the package maintainer and use ‘--no-staged-install’
* removing ‘/data/home/jesse/dev/chiimp-r-ver/chiimp.Rcheck/chiimp’
This is causing CI via Travis to fail for the "devel" and "bioc-devel" jobs. it's fine on the latest stable release, though. Is this a bug in the development version or a more stringent check that didn't previously exist?
Samples with reads present in the file but none matching for that particular sample cause spurious warnings when saving histograms, like:
Warning in min(x) : no non-missing arguments to min; returning Inf
Warning in max(x) : no non-missing arguments to max; returning -Inf
This is because str_hist_render
calls things like max
/min
/range
on potentially-empty data frames. The tests curently miss this because they skip rendering the plots.
The Desktop icon is "not marked as trusted" by default in the Ubuntu 18 desktop manager.
It looks like gio set ~/Desktop/CHIIMP.desktop "metadata::trusted" yes
should fix that as described in https://askubuntu.com/a/1070133
The CountTotal value from summarize_sample
is found by working backward from the filtered data frame supplied by analyze_sample
to determine the original raw read count, but if there are no matching reads for that particular locus, CountTotal is set to zero. It should always report the original read count for the input file, no matter the locus-specific details.
Blank entries in report tables and heatmaps are shown for sample/locus combinations that either failed to produce a genotype, or simply were not present in the dataset to begin with. (For example, eight loci could be tested across N samples, and then an additional two loci with just select samples. The output tables would currently be 10-by-N with blanks for those combinations not present and for any that failed to yield a genotype.)
tabulate_allele_names
already makes this distinction (empty string for blank results and NA for combinations not tested). The report and markdown functions would just need to take this into account.
While running devtools::test during installation on Windows, the test that "save_seqfile_data works with Windows-style paths" fails because of mixed path separators in the two path lists. Calling normalizePath on each just before the expect_equal call should fix this.
There are a bunch of CSV file reader/writer functions in io.R
repeating the same calls to the utils package's table functions. These should be standardized with one wrapper for reading and one for writing.
The compilation can be automated with osacompile -o chiimp.app chiimp.AppleScript
. If that works out of the box on Mac OS (are Developer Tools required?) it can be done at install time, either in R/installer.R or automatically with a configure script.
On a fresh install of Mac OS some of the developer tools needed when installing devtools aren't present by default, so we see a dialog box pop open prompting to install. The installer should handle this better instead of just proceeding and failing to install packages from that first attempt.
On the 10.15.1 system I have access to, the binary that handles the installation is:
/System/Library/CoreServices/Install Command Line Developer Tools.app/Contents/MacOS/Install Command Line Developer Tools
...so we could monitor that to see when it's finished. (The commands themselves that prompt for the install return immediately, so we can't wait on those.)
Maybe helpful:
http://osxdaily.com/2014/02/12/install-command-line-tools-mac-os-x/
Right now any relative paths for input files/directories supplied in a configuration file are accepted as-is, which means they are effectively taken relative to the current working directory of the R process (wherever that might be). These should instead be taken relative to the configuration file path itself.
The desktop icon is mentioned near the beginning but the drag-and-drop behavior is only mentioned much later near the end. There should be a sentence about it in the installation section.
Minor change, but save_csv
should create any missing parent directories prior to saving the file. It should also return the data invisibly instead of visibly as it currently does.
plot_heatmap
does not correctly handle the case where all attribute values are the same, but with interspersed NA values for blank entries. For example, if the Artifact column of the results summary is all FALSE, and some samples had no alleles called, the heatmap should be gray for the blanks and white for the rest but is instead all gray. na.rm = TRUE
should be used in the scale breaks definition to fix this.
On recent installs we're seeing the complied .app on Mac OS trying to access the script with a path like:
/Library/Frameworks/R.framework/Versions/3.5/Resources/library/chiimp//../exec/chiimp.command
But that ..
isn't needed with the prefix it's being added to so the command script is't found. (Did something change in the path-handling parts of AppleScript used to put that path together, maybe?)
Something in the YAML parsing has changed between pandoc 2.5 and 2.6 so that the --metadata=key:val
arguments we give keep internal quotes rather than parsing them as encapsulating strings.
In render_report
:
rmarkdown::render(fp_report_in, quiet = TRUE, output_file = fp_report_out,
output_options = list(pandoc_args = pandoc_args))
If pandoc_args
contains --metadata=title:"report title"
, the quotes would be parsed out in pandoc < 2.6 but kept in pandoc>=2.6.
I'm not sure the quotes were ever actually necessary (they'd be optional in YAML anyway, and these pandoc arguments are already handled as a vector so word-splitting shouldn't be an issue).
Currently the desktop icons just launch the command-line script as-is, so the usage instructions shown are a bit cryptic if the icon is double-clicked. This should be replaced with instructions tailored to the desktop icon usage specifically.
Right now some input files are tab-delimited and some comma-delimited. load_locus_attrs
and load_allele_names
should be made to expect comma-delimited text so that CSV is used consistently for input and output.
The output data tables, both per-file and per-sample, have FractionOfTotal
and FractionOfLocus
columns, and we have a configurable threshold for the fraction of reads required to consider a peak as a candidate allele, fraction.min
. But this fraction isn't either of those two listed columns; instead the denominator is the sum of the read counts in each processed-samples table, which is a more stringent set than just the matching locus via primer(s).
To summarize:
analyze_sample()
, which currently has no explicit column defined: denominator is the number of reads matching per-locus primer(s), repeat motif, and length rangeThis should be clarified in the documentation and outputs.
As of version 2022.02 RStudio has a newfangled thing called quarto in place of its previous rmarkdown-rendering system. This has the pandoc executable in a new place, for example on a test install on Windows:
C:\Program Files\RStudio\bin\quarto\bin
This means rendering reports fails (confirmed on Windows and expected on Mac and Linux) due to hardcoded RSTUDIO_PANDOC
paths in all three of the Windows/Mac/Linux wrapper scripts chiimp.cmd / chiimp.command / chiimp.sh. The correct path to pandoc should be (well, always should have been) auto-detected.
The prep_release.sh script is only useful for development and only when preparing a release. It should be moved to a less-prominent place in the repository.
We never enabled it because our forward primers alone matched well enough in practice, but the reverse primer really should be included in the locus-matching behavior (probably an option to start with but the default in a later compatibility-breaking release).
CHIIMP should support IUPAC codes for ambiguous nucleotides in primer sequences in order to recognize multiple sets of possible primers for a single locus.
We have demo input files available at https://upenn.app.box.com/v/chiimp-demo. This should go in the README.
Currently load_config
will return whatever structure is in a given YAML file, but there's no error or warning message for unrecognized entries. All valid names should already be present in config.defaults
(even if they just point to NULL
) so this should be used as a simple check when loading configurations. This would easily catch YAML formatting problems like key:val
versus key: val
since the former would show a mismatch with the default config.
Currently the read count ratio threshold for second candidate allele to first is hardcoded at 1/3 for any sort of suspected artifact sequence (see find_artifact/find_stutter count.ratio_max argument). This should be an option exposed via the configuration file.
plot_heatmap fails if the given attribute has only a single unique value and NA entries present:
Error in grid.pretty(range(as.vector(breaks))) :
infinite axis extents [GEPretty(nan,nan,5)]
We can fix this by updating the breaks calculation to exclude NA when finding the max attribute value.
For datasets producing no genotypes at all (e.g. consistently low quality samples, sequencing error, etc.) output files are saved but report generation fails. Tests should be added to explicitly check each report section to work in this case.
During reinstallation on Windows, devtools' default auto-update behavior attempts to update the "glue" package but fails to fully remove it. Re-running the installer again then fails because devtools remains installed as before, but has broken one of its own dependencies and no longer functions. Disabling the automatic updating during install is probably the best approach anyway and looks like it will avoid this problem.
Currently sequences including N for base calls (or anything outside of [ACTG]) are not handled specially, so they may appear in final output including in purported alleles. By default these sequences should marked in the output of analyze_sample
and filtered in the output of summarize_sample
to avoid this.
Within save_seqfile_data, remove_shared_root_dir calls normalizePath on path strings but then assumes / as a path separator. On Windows normalizePath uses \ as the separator, and so the returned paths from remove_shared_root_dir are garbled. This causes save_seqfile_data to fail. This is a regression in 0.2.0 from 03df4fa.
I think we may just be able to give winslash="/" in normalizePath and keep the rest the same. .Platform$file.sep
still returns "/" on Windows so apparently we can't just use that value instead of a literal /.
Currently the install scripts use the equivalent of Rscript --vanilla ./tools/install.R
during installation, but this can fail if, for example, the R installation requires a particular package repository setup such as what I just saw on CircleCI. Thinking it over again, why disallow loading of the site init and environment files anyway? The existing R installation should be taken as-is rather than skipping those files.
The only pre-processing on the FASTQ files absolutely required before running CHIIMP is adapter trimming, so if we can bundle this step into the package, it will be able to operate on output directly from a sequencing run.
We may as well also provide an option to strip out the PCR primers late in the processing, which would make the output more in line with what some users may expect and what MEGASAT produces.
The general CSV input/output functions (added in #43) should be able to handle arbitrary column names, but currently loading fails if none of the recognized columns are present (it tries to make convenient row names from certain columns and fails).
With the new reorganization in a0e288c the cross-locus information is kept in results$files
rather than results$samples
. Currently tally_cts_per_locus
still uses samples, so the cross-locus heatmap never shows other loci in a given row. This should be corrected to use files instead.
We should consider switching from the kable function to the DT package's datatable function to present interactive, sortable tables in the HTML reports.
The Mac OS and Linux install scripts could benefit from the same R-based method used on Windows now. (If all three pointed to a generalized R script that handled things like user library creation, the main install scripts would only need to focus on detecting R.) This might also be a good use of the tools subdirectory.
If the analyzed dataset includes locus names that are missing in the locus attributes table, tally_cts_per_locus
fails with Error in tbl[i, as.character(cols.match[i])] : subscript out of bounds
.
Currently processed samples are stored in an identical form to the files they came from, with all sequences from all loci combined. It would be more useful to store a filtered version and mark how CHIIMP categorized the sequences for each sample (allele, stutter, etc.). Then the file output would contain processed-files
for the full data for each sequence file and processed-samples
for the specific per-locus sequences for each sample.
The public functions in the chiimp R package should have usable code examples built in as described here. Since the code will actually be run during the package check, we could use the built-in test dataset to make usable examples.
Travis CI is no longer a viable option for automated testing under their new pricing model.
We will be offering an allotment of OSS minutes that will be reviewed and allocated on a case by case basis. Should you want to apply for these credits please open a request with Travis CI support stating that you’d like to be considered for the OSS allotment.
plot_dist_mat
is supposed to pass along any extra arguments to pheatmap
but the argument list isn't handled correctly so it doesn't actually work.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.