shawhahnlab / chiimp Goto Github PK

View Code? Open in Web Editor NEW

2.0 4.0 3.0 3.7 MB

Computational, High-throughput Individual Identification through Microsatellite Profiling

License: GNU Affero General Public License v3.0

R 95.99% Shell 2.65% CSS 0.01% Batchfile 0.61% AppleScript 0.75%

microsatellites r-package short-tandem-repeats tandem-repeats

chiimp's People

Contributors

Stargazers

Watchers

Forkers

omer-1 xiaoyi-2019 tong2200

chiimp's Issues

Support automatic scoring of analysis results for known samples

There is currently no feature for automated scoring of results for known samples. Most information is already provided in the existing input files, though, and the mapping of sample identifiers to individual names could be done via an extra column in the input spreadsheet for the dataset. A logical home for this step might be inside summarize_dataset, for the case where the known genotypes table is supplied and a name column was loaded from the dataset spreadsheet.

prepare_dataset gives unintuitive error if data directory missing

prepare_dataset should be updated to give a more intuitive error message if the data directory is missing at run time:

Error in locusmap[[locus]] : 
  attempt to select less than one element in get1index

Standard reinstall breaks Desktop icon on Mac OS

Since an installation via R replaces the package files but doesn't include the extra step of gunzip "$chiimp_path/Contents/MacOS/droplet.gz" from install_mac.command, the Desktop icon breaks.

Field order vector for prepare_dataset should optionally be text

I can never remember how the integers in the ord argument to prepare_datset or dataset_opts$ord configuration setting match up with the Replicate, Sample, and Locus fields. prepare_dataset should be updated to allow a case-insensitive character vector using field names instead.

Windows install fails if username has spaces

Automated installation fails on Windows for a username with spaces in it. This is due to how install_windows.cmd builds command strings for R and for the shortcut creation command, so moving the bulk of the commands directly into R may simplify that and should avoid these shell escaping issues.

Defining sample analysis options per-locus should be available

Currently all sample analysis options are provided in the configuration file and applied for all loci, but there's an obvious use case for allowing per-locus values (e.g. varying propensity for PCR stutter in one locus over another). Allowing optional columns in the locus attributes table would be a straightforward way to handle this.

Data files should only be analyzed once

Locus-multiplexed sequence files currently get separately loaded and analyzed for each locus, even though the output of analyze_sample will be identical each time. It would be more efficient to track files separately from sample/locus pairs and avoid the duplicated processing.

Mac install script should be executable

The automated installer script for Mac OS (install_mac.command) isn't marked executable in the repository. This needs to be fixed for the one-click install.

load_csv should accept row.names if given

Currently load_csv always assigns its own row names, but that should be skipped if it was called with a row.names argument.

R CMD check fails with R-devel_2019-03-11

Running R CMD check chiimp_0.2.2.tar.gz fails on recent development versions of R:

...
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether package ‘chiimp’ can be installed ... ERROR
Installation failed.
See ‘/data/home/jesse/dev/chiimp-r-ver/chiimp.Rcheck/00install.out’ for details.
* DONE

Status: 1 ERROR
See
  ‘/data/home/jesse/dev/chiimp-r-ver/chiimp.Rcheck/00check.log’
for details.

The file chiimp.Rcheck/00install.out shows:

* installing *source* package ‘chiimp’ ...
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
ERROR: hard-coded installation path: please report to the package maintainer and use ‘--no-staged-install’
* removing ‘/data/home/jesse/dev/chiimp-r-ver/chiimp.Rcheck/chiimp’

This is causing CI via Travis to fail for the "devel" and "bioc-devel" jobs. it's fine on the latest stable release, though. Is this a bug in the development version or a more stringent check that didn't previously exist?

Histogram rendering doesn't handle missing data gracefully

Samples with reads present in the file but none matching for that particular sample cause spurious warnings when saving histograms, like:

Warning in min(x) : no non-missing arguments to min; returning Inf
Warning in max(x) : no non-missing arguments to max; returning -Inf

This is because str_hist_render calls things like max/min/range on potentially-empty data frames. The tests curently miss this because they skip rendering the plots.

Desktop icon needs to be marked as trusted on Ubuntu

The Desktop icon is "not marked as trusted" by default in the Ubuntu 18 desktop manager.

It looks like gio set ~/Desktop/CHIIMP.desktop "metadata::trusted" yes should fix that as described in https://askubuntu.com/a/1070133

Sample summaries show zero CountTotal when no reads match locus

The CountTotal value from summarize_sample is found by working backward from the filtered data frame supplied by analyze_sample to determine the original raw read count, but if there are no matching reads for that particular locus, CountTotal is set to zero. It should always report the original read count for the input file, no matter the locus-specific details.

Report should distinguish missing results from untested sample/locus combinations

Blank entries in report tables and heatmaps are shown for sample/locus combinations that either failed to produce a genotype, or simply were not present in the dataset to begin with. (For example, eight loci could be tested across N samples, and then an additional two loci with just select samples. The output tables would currently be 10-by-N with blanks for those combinations not present and for any that failed to yield a genotype.)

tabulate_allele_names already makes this distinction (empty string for blank results and NA for combinations not tested). The report and markdown functions would just need to take this into account.

Spurious test failure during Windows install

While running devtools::test during installation on Windows, the test that "save_seqfile_data works with Windows-style paths" fails because of mixed path separators in the two path lists. Calling normalizePath on each just before the expect_equal call should fix this.

CSV input/output should be unified under one function pair

There are a bunch of CSV file reader/writer functions in io.R repeating the same calls to the utils package's table functions. These should be standardized with one wrapper for reading and one for writing.

Mac OS installer should build .app automatically

The compilation can be automated with osacompile -o chiimp.app chiimp.AppleScript. If that works out of the box on Mac OS (are Developer Tools required?) it can be done at install time, either in R/installer.R or automatically with a configure script.

Install should check for for cmd-line dev tools on Mac OS

On a fresh install of Mac OS some of the developer tools needed when installing devtools aren't present by default, so we see a dialog box pop open prompting to install. The installer should handle this better instead of just proceeding and failing to install packages from that first attempt.

On the 10.15.1 system I have access to, the binary that handles the installation is:

/System/Library/CoreServices/Install Command Line Developer Tools.app/Contents/MacOS/Install Command Line Developer Tools

...so we could monitor that to see when it's finished. (The commands themselves that prompt for the install return immediately, so we can't wait on those.)

Maybe helpful:

http://osxdaily.com/2014/02/12/install-command-line-tools-mac-os-x/

Relative paths in config file should be taken relative to config file path

Right now any relative paths for input files/directories supplied in a configuration file are accepted as-is, which means they are effectively taken relative to the current working directory of the R process (wherever that might be). These should instead be taken relative to the configuration file path itself.

User guide should mention icon usage early on

The desktop icon is mentioned near the beginning but the drag-and-drop behavior is only mentioned much later near the end. There should be a sentence about it in the installation section.

save_csv should handle missing directories

Minor change, but save_csv should create any missing parent directories prior to saving the file. It should also return the data invisibly instead of visibly as it currently does.

Heatmaps with a constant value are all gray

plot_heatmap does not correctly handle the case where all attribute values are the same, but with interspersed NA values for blank entries. For example, if the Artifact column of the results summary is all FALSE, and some samples had no alleles called, the heatmap should be gray for the blanks and white for the rest but is instead all gray. na.rm = TRUE should be used in the scale breaks definition to fix this.

Desktop icon has incorrect script path on Mac OS

On recent installs we're seeing the complied .app on Mac OS trying to access the script with a path like:

/Library/Frameworks/R.framework/Versions/3.5/Resources/library/chiimp//../exec/chiimp.command

But that .. isn't needed with the prefix it's being added to so the command script is't found. (Did something change in the path-handling parts of AppleScript used to put that path together, maybe?)

Report rendered with pandoc>=2.6 has quotes around title, author, and date

Something in the YAML parsing has changed between pandoc 2.5 and 2.6 so that the --metadata=key:val arguments we give keep internal quotes rather than parsing them as encapsulating strings.

In render_report:

rmarkdown::render(fp_report_in, quiet = TRUE, output_file = fp_report_out,
                      output_options = list(pandoc_args = pandoc_args))

If pandoc_args contains --metadata=title:"report title", the quotes would be parsed out in pandoc < 2.6 but kept in pandoc>=2.6.

I'm not sure the quotes were ever actually necessary (they'd be optional in YAML anyway, and these pandoc arguments are already handled as a vector so word-splitting shouldn't be an issue).

Desktop icons should give better message when opened directly

Currently the desktop icons just launch the command-line script as-is, so the usage instructions shown are a bit cryptic if the icon is double-clicked. This should be replaced with instructions tailored to the desktop icon usage specifically.

Make text file delimiter consistent

Right now some input files are tab-delimited and some comma-delimited. load_locus_attrs and load_allele_names should be made to expect comma-delimited text so that CSV is used consistently for input and output.

sample_analysis_opts fraction.min is confusing

The output data tables, both per-file and per-sample, have FractionOfTotal and FractionOfLocus columns, and we have a configurable threshold for the fraction of reads required to consider a peak as a candidate allele, fraction.min. But this fraction isn't either of those two listed columns; instead the denominator is the sum of the read counts in each processed-samples table, which is a more stringent set than just the matching locus via primer(s).

To summarize:

FractionOfTotal: denominator is the number of reads in the whole input file
FractionOfLocus: denominator is the number of reads for all entries sharing a MatchingLocus column (determined by forward primer and optionally reverse primer)
fraction applied when categorizing each row via analyze_sample(), which currently has no explicit column defined: denominator is the number of reads matching per-locus primer(s), repeat motif, and length range

This should be clarified in the documentation and outputs.

Report rendering fails with RStudio >= 2022.02 due to pandoc change

As of version 2022.02 RStudio has a newfangled thing called quarto in place of its previous rmarkdown-rendering system. This has the pandoc executable in a new place, for example on a test install on Windows:

C:\Program Files\RStudio\bin\quarto\bin

This means rendering reports fails (confirmed on Windows and expected on Mac and Linux) due to hardcoded RSTUDIO_PANDOC paths in all three of the Windows/Mac/Linux wrapper scripts chiimp.cmd / chiimp.command / chiimp.sh. The correct path to pandoc should be (well, always should have been) auto-detected.

Tidy release prep utility

The prep_release.sh script is only useful for development and only when preparing a release. It should be moved to a less-prominent place in the repository.

Reverse primer should be available for use in locus-matching

We never enabled it because our forward primers alone matched well enough in practice, but the reverse primer really should be included in the locus-matching behavior (probably an option to start with but the default in a later compatibility-breaking release).

Support IUPAC ambiguity codes in primer sequences

CHIIMP should support IUPAC codes for ambiguous nucleotides in primer sequences in order to recognize multiple sets of possible primers for a single locus.

Link to demo files in README

We have demo input files available at https://upenn.app.box.com/v/chiimp-demo. This should go in the README.

load_config should check for names not present in config.default

Currently load_config will return whatever structure is in a given YAML file, but there's no error or warning message for unrecognized entries. All valid names should already be present in config.defaults (even if they just point to NULL) so this should be used as a simple check when loading configurations. This would easily catch YAML formatting problems like key:val versus key: val since the former would show a mismatch with the default config.

Stutter/Artifact read count ratio should be configurable

Currently the read count ratio threshold for second candidate allele to first is hardcoded at 1/3 for any sort of suspected artifact sequence (see find_artifact/find_stutter count.ratio_max argument). This should be an option exposed via the configuration file.

Heatmap plotting fails when attribute has blanks and a single unique value

plot_heatmap fails if the given attribute has only a single unique value and NA entries present:

Error in grid.pretty(range(as.vector(breaks))) :
  infinite axis extents [GEPretty(nan,nan,5)]

We can fix this by updating the breaks calculation to exclude NA when finding the max attribute value.

Report generation fails when all results are blank

For datasets producing no genotypes at all (e.g. consistently low quality samples, sequencing error, etc.) output files are saved but report generation fails. Tests should be added to explicitly check each report section to work in this case.

Reinstall on Windows fails with broken glue package

During reinstallation on Windows, devtools' default auto-update behavior attempts to update the "glue" package but fails to fully remove it. Re-running the installer again then fails because devtools remains installed as before, but has broken one of its own dependencies and no longer functions. Disabling the automatic updating during install is probably the best approach anyway and looks like it will avoid this problem.

Ambiguous reads should be filtered by default

Currently sequences including N for base calls (or anything outside of [ACTG]) are not handled specially, so they may appear in final output including in purported alleles. By default these sequences should marked in the output of analyze_sample and filtered in the output of summarize_sample to avoid this.

save_seqfile_data fails with Windows paths

Within save_seqfile_data, remove_shared_root_dir calls normalizePath on path strings but then assumes / as a path separator. On Windows normalizePath uses \ as the separator, and so the returned paths from remove_shared_root_dir are garbled. This causes save_seqfile_data to fail. This is a regression in 0.2.0 from 03df4fa.

I think we may just be able to give winslash="/" in normalizePath and keep the rest the same. .Platform$file.sep still returns "/" on Windows so apparently we can't just use that value instead of a literal /.

Installer commands should obey site-wide configuration

Currently the install scripts use the equivalent of Rscript --vanilla ./tools/install.R during installation, but this can fail if, for example, the R installation requires a particular package repository setup such as what I just saw on CircleCI. Thinking it over again, why disallow loading of the site init and environment files anyway? The existing R installation should be taken as-is rather than skipping those files.

Include adapter trimming and optional primer removal

The only pre-processing on the FASTQ files absolutely required before running CHIIMP is adapter trimming, so if we can bundle this step into the package, it will be able to operate on output directly from a sequencing run.

We may as well also provide an option to strip out the PCR primers late in the processing, which would make the output more in line with what some users may expect and what MEGASAT produces.

CSV helpers should handle arbitrary columns

The general CSV input/output functions (added in #43) should be able to handle arbitrary column names, but currently loading fails if none of the recognized columns are present (it tries to make convenient row names from certain columns and fails).

Counts per Locus should use per-file and not per-sample data

With the new reorganization in a0e288c the cross-locus information is kept in results$files rather than results$samples. Currently tally_cts_per_locus still uses samples, so the cross-locus heatmap never shows other loci in a given row. This should be corrected to use files instead.

Add interactive tables in reports

We should consider switching from the kable function to the DT package's datatable function to present interactive, sortable tables in the HTML reports.

Reorganize installers

The Mac OS and Linux install scripts could benefit from the same R-based method used on Windows now. (If all three pointed to a generalized R script that handled things like user library creation, the main install scripts would only need to focus on detecting R.) This might also be a good use of the tools subdirectory.

tally_cts_per_locus fails if loci are missing in locus attributes table

If the analyzed dataset includes locus names that are missing in the locus attributes table, tally_cts_per_locus fails with Error in tbl[i, as.character(cols.match[i])] : subscript out of bounds.

Processed samples should be stored in filtered and categorized form

Currently processed samples are stored in an identical form to the files they came from, with all sequences from all loci combined. It would be more useful to store a filtered version and mark how CHIIMP categorized the sequences for each sample (allele, stutter, etc.). Then the file output would contain processed-files for the full data for each sequence file and processed-samples for the specific per-locus sequences for each sample.

Add examples to R package docs for public functions

The public functions in the chiimp R package should have usable code examples built in as described here. Since the code will actually be run during the package check, we could use the built-in test dataset to make usable examples.

Migrate from Travis CI for automated testing

Travis CI is no longer a viable option for automated testing under their new pricing model.

We will be offering an allotment of OSS minutes that will be reviewed and allocated on a case by case basis. Should you want to apply for these credits please open a request with Travis CI support stating that you’d like to be considered for the OSS allotment.

plot_dist_mat doesn't handle pheatmap arguments correctly

plot_dist_mat is supposed to pass along any extra arguments to pheatmap but the argument list isn't handled correctly so it doesn't actually work.