clintval / sample-sheet

Parse Illumina sample sheets with Python

Home Page: https://sample-sheet.rtfd.io

License: MIT License

Python 100.00%

illumina bioinformatics demultiplexing samplesheet experiment-manager python

sample-sheet's Issues

SyntaxError: invalid syntax

When running

import os
import sys
import csv
from sample_sheet import SampleSheet

url = 'https://raw.githubusercontent.com/clintval/sample-sheet/master/tests/resources/{}'
sample_sheet = SampleSheet(url.format('paired-end-single-index.csv'))

print(sample_sheet)

I get this

>> python3 SampleSheet.csv
  File "SampleSheet.csv", line 1
    [Header],,,,,,,,,,
             ^
SyntaxError: invalid syntax

both in the example SampleSheet and my own.

Running code installed with pip3, not the one on the repository.

CellRanger indexes are not recognized

Is it possible to bypass index validation?

    sample_sheet = SampleSheet(args.samplesheet)
  File "/home/ec2-user/cellranger-docker/env/lib/python3.6/site-packages/sample_sheet/__init__.py", line 418, in __init__
    self._parse(self.path)
  File "/home/ec2-user/cellranger-docker/env/lib/python3.6/site-packages/sample_sheet/__init__.py", line 524, in _parse
    self.add_sample(Sample(dict(zip(sample_header, line))))
  File "/home/ec2-user/cellranger-docker/env/lib/python3.6/site-packages/sample_sheet/__init__.py", line 296, in __init__
    raise ValueError(f'Not a valid index: {value}')
ValueError: Not a valid index: SI-GA-F8

Validations should be parameterized. Such as for 10x indexes vs Illumina.

See #95
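
One way the request for parameterized validation could look is sketched below. This is not the library's API: the `INDEX_VALIDATORS` table, the `validate_index` function, the `kinds` parameter, and both regular expressions are assumptions made for illustration only.

```python
import re

# Hypothetical validators; the patterns are illustrative assumptions,
# not the library's actual rules.
INDEX_VALIDATORS = {
    "illumina": re.compile(r"^[ACGTN]+$"),
    "10x": re.compile(r"^SI-[A-Z0-9]+-[A-Z][0-9]+$"),
}


def validate_index(value, kinds=("illumina",)):
    """Return ``value`` if it matches any enabled validator.

    Passing ``kinds=None`` skips validation entirely.
    """
    if kinds is None:
        return value
    if any(INDEX_VALIDATORS[kind].match(value) for kind in kinds):
        return value
    raise ValueError(f"Not a valid index: {value}")
```

With something like this, a caller could enable 10x-style names with `kinds=("illumina", "10x")` or bypass the check with `kinds=None`.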

Review and update `add_sample` docstring

Clearly describe the add_sample() method along with all validation checks performed.

sample-sheet/sample_sheet/_sample_sheet.py

Lines 482 to 502 in 87deb18

def add_sample(self, sample):
    """Validate and add a ``Sample`` to this ``SampleSheet``.

    All samples are checked against the first sample added to ensure they
    all have the sample ``read_structure`` attribute, if supplied. The
    ``SampleSheet`` will inherit the same ``read_structure`` attribute.

    Samples cannot be added if the following criteria is met:
        - ``Sample_ID`` and ``Sample_Library`` combination exists
        - ``index`` and/or ``index2`` combination exists
        - Samplesheet.reads and Sample.Read_Structure are incompatible
        - Sample does not have ``index`` defined but others do
        - Sample does not have ``index2`` defined but others do
        - If defined, sample ``read_structure`` is different than others

    Parameters
    ----------
    sample : Sample
        Sample to add to this sample sheet.

    """

Support entire Illumina specification

Manifests section
Arbitrary user-defined sections
Update README to advertise spec. adherence

https://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/sequencing-sheet-format-specifications-technical-note-970-2017-004.pdf

Feature Request: SampleSheet v2 support

Hi @clintval,

do you have any plans of supporting the new SampleSheet v2 format?

It seems Illumina has released new tools and, along with them, a new version of the sample sheet format.
(https://blog.software.illumina.com/2020/07/30/announcing-the-release-of-bcl-convert-software/)

Cheers,
Florian

Error parsing samplesheet

I have a sample sheet that looks like this:

[Header]
IEMFileVersion,4
Date,11/16/2015
Workflow,GenerateFASTQ
Application,RNA-Seq
Assay,TruSeq LT
Description
Chemistry,Default

[Reads]
75
75

[Settings]
Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

[Data]
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,GenomeFolder,Sample_Project,Description
MiSeq_L20151106_A1,MiSeq_L20151106_A1,,,AR001,ATCACG,Homo_sapiens\UCSC\hg19\Sequence\WholeGenomeFasta,QC,GeneexperessionQC
MiSeq_L20151106_B1,MiSeq_L20151106_B1,,,AR003,TTAGGC,Homo_sapiens\UCSC\hg19\Sequence\WholeGenomeFasta,QC,GeneexperessionQC

When reading this file using:

ss = IlluminaSampleSheet('SampleSheet.csv')

results in

$ python test.py 
Traceback (most recent call last):
  File "test.py", line 5, in <module>
    ss = IlluminaSampleSheet(sample_sheet_path)
  File "/Users/golharr/workspace/.venv/lib/python3.6/site-packages/sample_sheet/__init__.py", line 419, in __init__
    self._parse(self.path)
  File "/Users/golharr/workspace/.venv/lib/python3.6/site-packages/sample_sheet/__init__.py", line 537, in _parse
    key, value, *_ = line
ValueError: not enough values to unpack (expected at least 2, got 1)

It looks like the trailing *_ is causing the problem on line 537. If you remove that, the key and value get read correctly. The *_ is not used, so there is no need to include it here.

After making the change to line 537, the same problem arises for the Description line: since there is no comma, only a key is provided, with no corresponding value. I think a better check would be to execute these lines only if len(line) >= 2. I'll submit a PR that works for me.
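
A minimal sketch of the length check described above, using only the standard library. The function name `parse_key_value_section` is hypothetical, and storing an empty string for a missing value is an assumption, not the behavior of the library or of the PR mentioned.

```python
import csv
import io


def parse_key_value_section(lines):
    """Parse ``key,value`` rows, tolerating rows with a key but no value."""
    section = {}
    for row in csv.reader(lines):
        if not row or not row[0].strip():
            continue  # skip blank rows
        key = row[0]
        # A row such as "Description" has one field only; default to "".
        value = row[1] if len(row) >= 2 else ""
        section[key] = value
    return section


header = parse_key_value_section(
    io.StringIO("Assay,TruSeq LT\nDescription\nChemistry,Default\n")
)
```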

SampleSheet.write()?

Hi there. Cool lib!

Do you have any intention of adding the ability to create and write new sample sheets with this? Most of the mechanics seem to be in place already; it probably just needs a function that writes back to a string.

Support for empty lines

Hi, thanks for putting together this great library. It seems that it does not currently support empty lines, which the spec allows:

Empty lines or lines that consist entirely of commas and/or whitespace characters are valid, but ignored.

Traceback (most recent call last):
  File "/snap/pycharm-community/62/helpers/pydev/pydevd.py", line 1664, in <module>
    main()
  File "/snap/pycharm-community/62/helpers/pydev/pydevd.py", line 1658, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/snap/pycharm-community/62/helpers/pydev/pydevd.py", line 1068, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/snap/pycharm-community/62/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/cborroto/projects/samplesheet/scripts/wdlhelper.py", line 210, in <module>
    main()
  File "/home/cborroto/projects/samplesheet/scripts/wdlhelper.py", line 205, in main
    'summary': summary,
  File "/home/cborroto/.pyenv/versions/miniconda3-4.3.30/envs/samplesheet/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/cborroto/.pyenv/versions/miniconda3-4.3.30/envs/samplesheet/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/cborroto/.pyenv/versions/miniconda3-4.3.30/envs/samplesheet/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "/home/cborroto/projects/samplesheet/scripts/wdlhelper.py", line 39, in launch
    sample_sheet = SampleSheet(sample_sheet)
  File "/home/cborroto/.pyenv/versions/miniconda3-4.3.30/envs/samplesheet/lib/python3.6/site-packages/sample_sheet/_sample_sheet.py", line 376, in __init__
    self._parse(str(self.path))
  File "/home/cborroto/.pyenv/versions/miniconda3-4.3.30/envs/samplesheet/lib/python3.6/site-packages/sample_sheet/_sample_sheet.py", line 468, in _parse
    header_match = self._section_header_re.match(line[0])
IndexError: list index out of range
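
The rule quoted from the spec can be expressed as a small filter applied before any section matching. This is a sketch, not the library's parser; `is_ignorable` is a hypothetical helper name.

```python
import csv


def is_ignorable(row):
    """True for rows that are empty or hold only commas and/or whitespace."""
    return all(not cell.strip() for cell in row)


raw_lines = ["[Header],,", "", ",,,", "  , ,", "Date,11/16/2015,"]

# csv.reader yields [] for an empty line, so the filter must run before
# any code that indexes into the row (such as ``line[0]``).
rows = [row for row in csv.reader(raw_lines) if not is_ignorable(row)]
```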

Bugs on README

Self-host static badges
Documentation link is broken

Error installing with pip

Getting a syntax error when installing on a fresh virtualenv system

(pipeline_run3) data_management> python
Python 3.4.5 (default, Sep 08 2016, 13:41:53) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
(pipeline_run3) pipeline@SequenceKing:~/cancerplus/code/data_management> pip3 install sample_sheet
Collecting sample_sheet
  Using cached https://files.pythonhosted.org/packages/fc/75/69cab3b91ea745a909bedc53f30789414eedc11ce2ebe6189733560a9583/sample_sheet-0.6.0.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-5eqifgel/sample-sheet/setup.py", line 12
        URL = f'https://github.com/clintval/{PACKAGE_NAME}'
                                                          ^
    SyntaxError: invalid syntax

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-5eqifgel/sample-sheet/
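
The interpreter shown above is Python 3.4.5, and f-strings were only added in Python 3.6, which explains the SyntaxError at the f'...' literal. A sketch of a spelling that parses on older Python 3 releases follows; the value of `PACKAGE_NAME` is assumed here, since the traceback does not show it.

```python
# Assumed value for illustration; the real setup.py defines its own.
PACKAGE_NAME = "sample-sheet"

# str.format works on every Python 3 release, unlike f-strings (3.6+).
URL = "https://github.com/clintval/{}".format(PACKAGE_NAME)
```

Alternatively, a project that intends to require Python 3.6 or later can declare that with the setuptools `python_requires` keyword, so that a recent enough pip can skip incompatible releases instead of failing inside setup.py.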

Samples to have `sample_sheet` reference when instantiated through parsing

Something like:

>>> sample_sheet = SampleSheet('SampleSheet.csv')
>>> first_sample = sample_sheet.samples[0]
>>> first_sample.sample_sheet
SampleSheet("SampleSheet.csv")

But not on a Sample instantiated plainly:

>>> sample = Sample(dict(Sample_Name='test', Sample_ID='XXX', index='ACGT'))
>>> sample.sample_sheet
None

Would help with tracking the source of samples when reading in more than one sample sheet into an environment and then processing on the samples in other data structures like dictionaries.
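
A toy sketch of the proposed back-reference. Both classes below are simplified stand-ins written for this example, not the library's `Sample` and `SampleSheet`; only the `sample_sheet` attribute name comes from the request above.

```python
class Sample:
    """Toy stand-in for the library's ``Sample`` (hypothetical)."""

    def __init__(self, fields):
        self.fields = dict(fields)
        self.sample_sheet = None  # set only when attached to a sheet


class SampleSheet:
    """Toy stand-in for the library's ``SampleSheet`` (hypothetical)."""

    def __init__(self, path=None):
        self.path = path
        self.samples = []

    def add_sample(self, sample):
        sample.sample_sheet = self  # back-reference to the owning sheet
        self.samples.append(sample)


sheet = SampleSheet("SampleSheet.csv")
sheet.add_sample(Sample({"Sample_ID": "XXX", "index": "ACGT"}))
loose = Sample({"Sample_ID": "YYY"})
```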

Make `requests` installation mandatory or drop dependency support

We only use it for a case-insensitive dictionary.

test_to_picard_basecalling_params_output_files fails on macOS

Running the test suite on the master branch fails on macOS.
I am running macOS Monterey 12.2.1 on an Intel Mac with Python 3.7.

pytest Error:

E           AssertionError: 'BARC[68 chars]TC\t/System/Volumes/Data/home/user/49-tissue.e[278 chars]\t\n' != 'BARC[68 chars]TC\t/home/user/49-tissue.exp001/49-tissue.GAAC[218 chars]\t\n'
E           Diff is 748 characters long. Set self.maxDiff to None to see it.

Full Error:

>tox
...cut
________________________________________ TestSampleSheet.test_to_picard_basecalling_params_output_files _________________________________________

self = <test_sample_sheet.TestSampleSheet testMethod=test_to_picard_basecalling_params_output_files>

    def test_to_picard_basecalling_params_output_files(self):
        """Test ``to_picard_basecalling_params()`` output files"""
        bam_prefix = '/home/user'
        lanes = [1, 2]
        with TemporaryDirectory() as temp_dir:
            sample1 = Sample(
                {
                    'Sample_ID': 49,
                    'Sample_Name': '49-tissue',
                    'Library_ID': 'exp001',
                    'Description': 'Lorum ipsum!',
                    'index': 'GAACT',
                    'index2': 'AGTTC',
                }
            )
            sample2 = Sample(
                {
                    'Sample_ID': 23,
                    'Sample_Name': '23-tissue',
                    'Library_ID': 'exp001',
                    'Description': 'Test description!',
                    'index': 'TGGGT',
                    'index2': 'ACCCA',
                }
            )

            sample_sheet = SampleSheet()
            sample_sheet.add_sample(sample1)
            sample_sheet.add_sample(sample2)
            sample_sheet.to_picard_basecalling_params(
                directory=temp_dir, bam_prefix=bam_prefix, lanes=lanes
            )

            prefix = Path(temp_dir)
            assert_true((prefix / 'barcode_params.1.txt').exists())
            assert_true((prefix / 'barcode_params.2.txt').exists())
            assert_true((prefix / 'library_params.1.txt').exists())
            assert_true((prefix / 'library_params.2.txt').exists())

            barcode_params = (
                'barcode_sequence_1\tbarcode_sequence_2\tbarcode_name\tlibrary_name\n'  # noqa
                'GAACT\tAGTTC\tGAACTAGTTC\texp001\n'  # noqa
                'TGGGT\tACCCA\tTGGGTACCCA\texp001\n'
            )  # noqa

            library_params = (
                'BARCODE_1\tBARCODE_2\tOUTPUT\tSAMPLE_ALIAS\tLIBRARY_NAME\tDS\n'  # noqa
                'GAACT\tAGTTC\t/home/user/49-tissue.exp001/49-tissue.GAACTAGTTC.{lane}.bam\t49-tissue\texp001\tLorum ipsum!\n'  # noqa
                'TGGGT\tACCCA\t/home/user/23-tissue.exp001/23-tissue.TGGGTACCCA.{lane}.bam\t23-tissue\texp001\tTest description!\n'  # noqa
                'N\tN\t/home/user/unmatched.{lane}.bam\tunmatched\tunmatchedunmatched\t\n'
            )  # noqa

            self.assertMultiLineEqual(
                (prefix / 'barcode_params.1.txt').read_text(), barcode_params
            )
            self.assertMultiLineEqual(
                (prefix / 'barcode_params.2.txt').read_text(), barcode_params
            )
            self.assertMultiLineEqual(
                (prefix / 'library_params.1.txt').read_text(),
>               library_params.format(lane=1),
            )
E           AssertionError: 'BARC[68 chars]TC\t/System/Volumes/Data/home/user/49-tissue.e[278 chars]\t\n' != 'BARC[68 chars]TC\t/home/user/49-tissue.exp001/49-tissue.GAAC[218 chars]\t\n'
E           Diff is 748 characters long. Set self.maxDiff to None to see it.

tests/test_sample_sheet.py:535: AssertionError

Supporting TruSight 170 sample sheets

It would be wonderful if the library supported TruSight 170 style sample sheets.

How to update the sample sheet object?

I am trying to write some simple logic to reverse complement the i5 index within the sample_sheet object, using a custom reverse_complement function.

Can you let me know how I can update the object to accommodate the reverse-complemented i5 index and then write an updated CSV sample sheet file?

for sample in sample_sheet:
    index_i5 = sample.index2
    index_i5_rc = reverse_complement(index_i5)
    sample.index2 = index_i5_rc
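
The issue does not show the custom reverse_complement function. One possible stdlib-only version is sketched here; the handling of `N` and lowercase bases is an assumption.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]
```

Samples without a second index would need a guard before calling this, since the function expects a string.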

Installation issues

I'm experiencing two separate issues when trying to install this package:

Install with conda

The README says to do the following:

conda install -c bioconda sample-sheet

However this results in a missing dependency:

Collecting package metadata (current_repodata.json): done
Solving environment: failed
Collecting package metadata (repodata.json): done
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - sample-sheet -> terminaltables

Current channels:
<truncated>

This can be fixed by explicitly including -c conda-forge like so:

conda install -c bioconda -c conda-forge sample-sheet

Install with pip on bitbucket pipelines

Using Bitbucket pipelines to test a package, I get the following output when listing sample-sheet as a dependency in my setup.py and installing my package:

  Downloading https://files.pythonhosted.org/packages/e1/4e/c4af36fcf9f7d3364723a49ec802637e6dbe73725bd2be97e5c7647a0669/sample-sheet-0.9.0.tar.gz
    ERROR: Complete output from command python setup.py egg_info:
    ERROR: Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-py7rsp3d/sample-sheet/setup.py", line 20, in <module>
        long_description=Path('README.md').read_text(),
      File "/opt/miniconda3/envs/test/lib/python3.6/pathlib.py", line 1197, in read_text
        return f.read()
      File "/opt/miniconda3/envs/test/lib/python3.6/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1207: ordinal not in range(128)
    ----------------------------------------
ERROR: Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-py7rsp3d/sample-sheet/

Note that I do not get this error in a local development environment, so it could be specific to Bitbucket pipelines.
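
The traceback shows `read_text()` falling back to the ASCII codec, which suggests the default text encoding in that environment is ASCII. Passing the encoding explicitly removes the dependence on the environment. The sketch below assumes the README is UTF-8 and uses a temporary file in place of the real one.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    readme = Path(tmp) / "README.md"
    # U+251C encodes to bytes starting with 0xe2, the byte in the traceback.
    readme.write_bytes("tree: \u251c\u2500 sample1\n".encode("utf-8"))

    # Explicit encoding: the result no longer depends on the locale.
    long_description = readme.read_text(encoding="utf-8")
```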

Duplicated Sample_ID

Hi Clint,

we had an issue before (#32) where the same Sample_ID caused issues, even if the lanes were different.

Now we are having the same issue, but without specifying lanes. This time it's 10X data, which you mentioned briefly in the previous issue.

Basically we would like to merge across lanes and indexes and the corresponding sample sheets end up having the same Sample_ID across multiple Data lines.

An example:

Sample_ID,Sample_Name,Sample_Plate,Sample_Well,Index_Plate_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
PRJ180538_VPH20T,,,,,SI-GA-G6_1,CTGACGCG,,,,
PRJ180538_VPH20T,,,,,SI-GA-G6_2,GGTCGTAC,,,,
PRJ180538_VPH20T,,,,,SI-GA-G6_3,TCCTTCTT,,,,
PRJ180538_VPH20T,,,,,SI-GA-G6_4,AAAGAAGA,,,,

Here we merge across the four indexes and all lanes. However, your library does not currently allow this.

Could you allow duplicated Sample_IDs or recommend an alternative approach?

Thanks!
Florian
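
Until duplicated Sample_IDs are supported, the merge described above can be done outside the library. A sketch with the standard library, using a trimmed copy of the example rows (only three of the columns are kept for brevity):

```python
import csv
import io
from collections import defaultdict

# Trimmed copy of the [Data] rows from the example above.
data_section = """\
Sample_ID,I7_Index_ID,index
PRJ180538_VPH20T,SI-GA-G6_1,CTGACGCG
PRJ180538_VPH20T,SI-GA-G6_2,GGTCGTAC
PRJ180538_VPH20T,SI-GA-G6_3,TCCTTCTT
PRJ180538_VPH20T,SI-GA-G6_4,AAAGAAGA
"""

# Collect every index that belongs to the same Sample_ID.
indexes_by_sample = defaultdict(list)
for row in csv.DictReader(io.StringIO(data_section)):
    indexes_by_sample[row["Sample_ID"]].append(row["index"])
```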

Use type annotations

import typing

Confusing error on file with no sections

If the file passed to the SampleSheet constructor contains data but no sections the error produced is:

'SampleSheet' object has no attribute ''

It should instead raise an error saying that the sample sheet is invalidly formatted, for example "missing sections".

Order of columns in Data section

Hi,

I am using your fantastic library to manipulate existing sample sheets and I have a question/request:

Is it possible to retain/control the order of the columns in the Data section?

I know the order should not really matter, but we try to stick to internal conventions for readability. Besides, the reordering makes it a bit more difficult to compare the original and modified versions of the sample sheets we process.

Perhaps using an OrderedDict instead of just a dict could be a possibility...?
https://docs.python.org/3/library/collections.html#collections.OrderedDict

Thanks!
Florian
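
As an illustration of order-preserving round trips with the standard library only (this is not how the library itself stores columns), `csv.DictReader` exposes the header order as `fieldnames`, and `csv.DictWriter` writes columns in whatever order it is given:

```python
import csv
import io

original = "Sample_ID,Sample_Name,index\n1823A,1823A-tissue,GAATCTGA\n"

reader = csv.DictReader(io.StringIO(original))
columns = reader.fieldnames  # header order as read from the file
rows = list(reader)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
round_tripped = out.getvalue()
```

Plain dict objects also keep insertion order from Python 3.7 onward, so an OrderedDict is mainly needed on older interpreters.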

__iter__() should be a generator

def __iter__(self):
    for sample in self.samples:
        yield sample

This will remove the need for __next__() and for binding a temporary _iter = [] variable.

Intended behavior of empty samplesheet with missing sections

One of our lab operators provided an empty samplesheet intending to fill it out later, but never did. As a result, this run made it into our pipeline where we use this package to read the samplesheet. The samplesheet provided is:

[Data],,,,,,,,,
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description

No other sections, and no samples. When I loaded this samplesheet using

from sample_sheet import SampleSheet as IlluminaSampleSheet
ss = IlluminaSampleSheet(sample_sheet_path)

no error was thrown. Is that the intended behavior?

Use Optional instead of Union[None, ...]

from typing import Optional
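
The two spellings describe the same type, so the change is purely about readability. A small sketch follows; `find_index` is a made-up function for the example.

```python
from typing import Optional, Union


def find_index(sample: dict) -> Optional[str]:
    """Return the sample's index, or None when it has none."""
    return sample.get("index")


# Optional[X] is shorthand for Union[X, None].
same_type = Optional[str] == Union[None, str]
```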

Support Python2

I finally have a need. And it's a sad sad but very important need.

smart_open Version 1.8.1 Incompatibilities

The smart_open.smart_open function has been deprecated in version 1.8.1 of the smart_open library in favor of smart_open.open.

This throws a warning in the console whenever a sample sheet is opened.
piskvorky/smart_open#268

Version 1.8.1 also added SSH/SCP/SFTP support which throws a warning if paramiko is not installed.
piskvorky/smart_open#267

This could be solved by adding a paramiko requirement and changing smart_open calls or setting the version of smart_open to 1.8.0 in the setup file.

Index vs index

Recently downloaded some example data from Illumina and I noticed the sample sheet had "Index" in it, not "index". Unfortunately wasn't able to read the sample sheet because of this. FYI, the data header line looked like:

Sample_ID,Sample_Name,BASESPACE_SAMPLE_RESOURCE_ID,GenomeFolder,I7_Index_ID,I5_Index_ID,Index,Index2,Sample_Well,Manifest,FastqFolder
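
A possible workaround before parsing is to normalize the spelling of the index columns in the header row. The `CANONICAL` mapping and `normalize_header` helper are assumptions for illustration, not part of the library.

```python
# Canonical spellings assumed for illustration.
CANONICAL = {"index": "index", "index2": "index2"}


def normalize_header(columns):
    """Lower-case only the columns that have a known canonical spelling."""
    return [CANONICAL.get(col.strip().lower(), col) for col in columns]
```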

Request - handling nonstandard tables

We have a sample sheet where we've added a custom section which is a headered table (along the lines of the [Data] section). It would be handy to be able to parse such a section using the sample-sheet library. (Currently, the library assumes that the custom section must be a dictionary of key-value pairs.)

A made-up example of such a section follows:
[Animals],,
Name,Species,Status
Benji,Dog,Good
Sparky,Cat,Aloof
Sparky,Bird,Sleeping
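
For comparison, a headered table like this one parses directly with `csv.DictReader` once the `[Animals],,` section line has been stripped. This is a standalone sketch, not a feature of the library.

```python
import csv

# Lines of the custom section, without the "[Animals],," header line.
section_lines = [
    "Name,Species,Status",
    "Benji,Dog,Good",
    "Sparky,Cat,Aloof",
    "Sparky,Bird,Sleeping",
]

# The first line supplies the column names; each later line becomes a mapping.
animals = list(csv.DictReader(section_lines))
```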

Improve ReadStructure support and logic

Support inline barcodes
Confirm ReadStructure class supports all example read structures
- https://github.com/nh13/read-structure-examples
- samtools/hts-specs#270
Support + operator on last read structure token

Best practices badge?

https://bestpractices.coreinfrastructure.org/en/projects/2330

relationship with PEP

Hey, I just came across this repo. I'm not sure if you're still developing this, but I wanted to alert you to our work on PEP and, more specifically, peppy. I thought you might find it interesting, and it may open up some possibility for collaborating.

Unit test for IPython Interpreters

Try this:

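from unittest import TestCase

from sample_sheet import _sample_sheet
from sample_sheet._sample_sheet import is_ipython_interpreter


class TestIsIpythonInterpreter(TestCase):
    """Unit tests for ``is_ipython_interpreter()``"""

    def test_is_ipython_interpreter(self):
        """Test if this test framework is run in an IPython interpreter."""
        self.assertFalse(is_ipython_interpreter())

        # Simulate IPython by defining the global name the function looks up.
        # The function also imports IPython, so IPython must be installed.
        _sample_sheet.__IPYTHON__ = None
        # Remove the fake global afterwards so other tests are unaffected.
        self.addCleanup(delattr, _sample_sheet, '__IPYTHON__')

        self.assertTrue(is_ipython_interpreter())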

Add code coverage hook to remote repository

Add code coverage hook to commits and PRs:

https://codecov.io/

Same sample ID on multiple lanes causes error

Hi, I have a sample sheet that has the same sample ID but on multiple lanes, which in our case can happen quite frequently. This case is currently not supported. Could this be added?

Example [Data] section:

[Data],,,,,,,
Lane,Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,Sample_Project,Description
1,WES013BL,,,,A010,TAGCTT,,
1,WES013FR,,,,A027,ATTCCT,,
1,MDx150891,,,,A012,CTTGTA,,
1,MDx150892,,,,A016,CCGTCC,,
2,WES013BL,,,,A010,TAGCTT,,
2,WES013FR,,,,A027,ATTCCT,,
2,MDx150891,,,,A012,CTTGTA,,
2,MDx150892,,,,A016,CCGTCC,,

Re-write a few un-Pythonic lines

self.__dict__.get(attr)

sample-sheet/sample_sheet/_sample_sheet.py

Line 259 in 87deb18

return self.__dict__.get(attr, None)

There must be a better way to change a class name without needing useless inheritance

sample-sheet/sample_sheet/_sample_sheet.py

Lines 320 to 325 in 87deb18

class Header(SampleSheetSection):
    pass


class Settings(SampleSheetSection):
    pass

return None if len(self.Reads) == 0 else len(self.Reads) == 1

sample-sheet/sample_sheet/_sample_sheet.py

Lines 437 to 439 in 87deb18

if len(self.Reads) == 0:
    return None
return len(self.Reads) == 1

VALID_ASCII = {string.ascii_letters + string.digits + '-_'}

sample-sheet/sample_sheet/_sample_sheet.py

Lines 31 to 36 in 87deb18

VALID_ASCII_CODES = [
    10,
    13,
    32,
    *list(range(33, 47)),
    *list(range(48, 127))]
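
A note on the last suggestion: braces around a concatenated string build a set with a single element, the whole string, so a per-character lookup needs `set(...)`. The sketch below also compares the characters with the code ranges quoted above; the variable names are hypothetical.

```python
import string

# Set of single characters: letters, digits, hyphen, underscore.
valid_chars = set(string.ascii_letters + string.digits + "-_")

# Braces around the concatenation build a one-element set instead.
one_element = {string.ascii_letters + string.digits + "-_"}

# Characters admitted by the code ranges quoted above.
from_codes = {chr(c) for c in [10, 13, 32, *range(33, 47), *range(48, 127)]}
```

The character set is a strict subset of what the code list admits, so swapping one for the other would tighten validation rather than merely restyle it.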

Multiple samples with same index

When opening a SampleSheet in which multiple samples share the same index (usual for me), and the final index for each sample is determined by the pair rather than by the uniqueness of each single index, we get a ValueError:

ValueError: Sample index combination for XX-XX-4Y has already been added: XX-XX-5G

It would be good to allow shared indexes among samples.

Create a conda recipe

https://conda.io/docs/build.html

Project may be suited for the bioconda channel:

https://github.com/bioconda/bioconda-recipes

index validation

This 10X index name is being rejected, SI-P03-C9. I know why. I submitted a few fixes to support 10X indices, but would be nice to turn off validation of indexes, generally speaking.

Offline test support for doctests

https://github.com/astropy/pytest-remotedata

Index validation on indexes with spaces at the end

Our users copy/paste to create the samplesheet. They save the samplesheet using Excel. Somewhere along the way a space is added to the end of the index column. When we parse the samplesheet with this package, we get a validation error on the index since the space is present. Is it possible to strip the field of invalid characters before parsing?
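
As a stopgap, the cells can be stripped before the rows reach the parser. A minimal sketch follows; `clean_row` is a hypothetical helper, and it only removes surrounding whitespace, not other invalid characters.

```python
def clean_row(row):
    """Strip leading and trailing whitespace from every cell."""
    return [cell.strip() for cell in row]
```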

package fails installing on Windows

I tried to install the package on a Windows machine with pip install sample_sheet following the docs. Unfortunately it resulted in an error that looks like a Linux/Windows path compatibility issue:

Traceback (most recent call last):
     File "<string>", line 1, in <module>
     File "C:\Users\Foo\AppData\Local\Temp\pip-install-a02d2846\sample-sheet\setup.py", line 28, in <module>
       packages=setuptools.find_packages(where='./'),
     File "c:\users\tpham1.unimelb\appdata\local\programs\python\python37\lib\site-packages\setuptools\__init__.py", line 71, in find
       convert_path(where),
     File "c:\users\tpham1.unimelb\appdata\local\programs\python\python37\lib\distutils\util.py", line 112, in convert_path
       raise ValueError("path '%s' cannot end with '/'" % pathname)
   ValueError: path './' cannot end with '/'

   ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in C:\Users\TPHAM1~1.UNI\AppData\Local\Temp\pip-install-a02d2846\sample-sheet\

Support comments

Technically this is not specified in Illumina's sample sheet specification, but it would be handy if the parser could handle "comments" i.e. lines that start with a # character. If these lines could be stored in the SampleSheet class and then saved when writing to a file that would be very handy.

Move tutorial to readthedocs.com

Move CONTRIBUTING.md
Move tutorial
Link to readthedocs.com

Cleaner and more explicit documentation

Inline short assignments

>>> sample = Sample(dict(
>>>     Sample_ID='1823A',
>>>     Sample_Name='1823A-tissue',
>>>     index='ACGT'))

Output as json

Would be useful to have the option to output the sample sheet as json.
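
A rough sketch of what such output might look like, assuming the sheet is first reduced to plain dictionaries and lists. The layout of `sheet` below is hypothetical and does not reflect the library's attributes.

```python
import json

# Hypothetical in-memory layout; the library's own attributes may differ.
sheet = {
    "Header": {"IEMFileVersion": "4", "Date": "11/16/2015"},
    "Reads": [75, 75],
    "Settings": {"Adapter": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"},
    "Data": [{"Sample_ID": "MiSeq_L20151106_A1", "index": "ATCACG"}],
}

as_json = json.dumps(sheet, indent=2, sort_keys=True)
```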

Conda package

Make one

Error when attempting to parse non-comma-padded sample sheet files

Hi there, I'm having some trouble parsing sample sheet files that don't use the comma-padded format. E.g.

[Header]
IEM1FileVersion,4
Investigator Name,jdoe
Experiment Name,exp001
Date,11/16/2017
Workflow,SureSelectXT
Application,NextSeq FASTQ Only
Assay,SureSelectXT
Description,A description of this flow cell
Chemistry,Default

[Reads]
151
151

[Settings]
CreateFastqForIndexReads,1
BarcodeMismatches,2

[Data]
Sample_ID,Sample_Name,index,Description,Library_ID,Read_Structure,Reference_Name,Sample_Project,Target_Set
1823A,1823A-tissue,GAATCTGA,0.5x treatment,2017-01-20,151T8B151T,mm10,exp001,Intervals-001
1823B,1823B-tissue,AGCAGGAA,0.5x treatment,2017-01-20,151T8B151T,mm10,exp001,Intervals-001
1824A,1824A-tissue,GAGCTGAA,1.0x treatment,2017-01-20,151T8B151T,mm10,exp001,Intervals-001
1825A,1825A-tissue,AAACATCG,10.0x treatment,2017-01-20,151T8B151T,mm10,exp001,Intervals-001
1826A,1826A-tissue,GAGTTAGC,100.0x treatment,2017-01-20,151T8B151T,mm10,exp001,Intervals-001
1826B,1823A-tissue,CGAACTTA,0.5x treatment,2017-01-17,151T8B151T,mm10,exp001,Intervals-001
1829A,1823B-tissue,GATAGACA,0.5x treatment,2017-01-17,151T8B151T,mm10,exp001,Intervals-001

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "sample_sheet/_sample_sheet.py", line 376, in __init__
    self._parse(str(self.path))
  File "sample_sheet/_sample_sheet.py", line 468, in _parse
    header_match = self._section_header_re.match(line[0])
IndexError: list index out of range
list index out of range

It seems the lack of commas is causing csv.reader to produce an empty list for the empty lines. Moving the empty line check to the top of the for block fixes this.

Experimental support SampleCollections and Loading_ID

A SampleCollection is a container for samples which may have originated from multiple sample sheets / flow cells / lanes.

A SampleCollection will facilitate organizing samples by their Sample_Name or Library_ID. A few methods will help with merge strategies for identical samples that have either been topped-off (same library, sequenced on different flow cells or lanes) or re-prepared (different library, can exist on same flow cell or lane).

>>> from sample_sheet import SampleCollection

>>> collection = SampleCollection(samples)
>>> collection.visualize()
"""
collection(n=4)
├─ sample1
│  ├─ library1
│  │  ├─ loading1
│  │  └─ loading2
│  └─ library2
│     └─ loading1
└─ sample2
   └─ library1
      └─ loading1
"""

Grouping samples by loading returns a new collection. Samples that can be merged at this level will be equivalent (see L261-L265)

>>> collection = collection.group_by_loading(attr='Loading_ID')
>>> collection.visualize()
"""
collection(n=3)
├─ sample1
│  ├─ library1
│  └─ library2
└─ sample2
   └─ library1
"""

Grouping samples by library returns a final collection.

>>> collection = collection.group_by_library(attr='Library_ID')
>>> collection.visualize()
"""
collection(n=2)
├─ sample1
└─ sample2
"""

validator

Hi,

thanks for this.

I have a lot of problems with incorrect and weird samplesheets from the lab generated with "copy-paste" and strange barcode schemes, such as mixed Truseq and Nextera.

I was starting to write a very simple validator to pick up on the worst errors, but now see you have done much more.

Are you planning to write a standalone validator or is this already possible via your library?

Thanks, Colin

def is_ipython_interpreter():
    try:
        # The presence of this global name indicates we are in an
        # IPython interpreter and are safe to render Markdown.
        __IPYTHON__  # noqa
        # Attempt to import the IPython library
        import IPython  # noqa
        return True
    except (ImportError, NameError):
        return False

