malariagen / fits

File tracking system for group DK

Makefile 0.04% Shell 0.02% C++ 51.70% Jupyter Notebook 46.54% PLpgSQL 1.70%

fits's Issues

Create automated process to update iRODs/MLWH/subtrack databases

Sync from MLWH
Sync from iRods (baton)
Sync from subtrack

Use case: Determine which samples from a given study have been sequenced

This is currently considered satisfied for the case of sequencescape study IDs. For more recent samples there should be a 1-to-1 mapping from sequencescape to Alfresco study codes, and therefore this use case can be considered satisfied going forwards.

Possible with current version; for “sample” meaning “Multi-LIMS warehouse sample ID”:

SELECT `value` AS sequenscape_study_id,count(*) AS cnt FROM vw_sample_tag WHERE tag_id=3593 GROUP BY `value` ORDER BY cnt DESC;

Expanding the scope to include pipeline outputs, e.g. release files

FITS is agnostic to file types. Processes to import and update such data will have to be developed.

Samples missing sample accessions

The following query pulled in run accessions but not sample accessions:

SELECT # The main SELECT statement
    # List of columns to return for each file
    (SELECT full_path FROM file WHERE file.id=sample2file.file_id) AS file_full_path, # The full path of the file
    (SELECT `value` FROM sample2tag WHERE sample2file.sample_id=sample2tag.sample_id AND tag_id=3594 LIMIT 1) AS `study_name`, # Sequenscape study name
    (SELECT value FROM sample2tag WHERE sample2tag.sample_id=sample2file.sample_id AND tag_id=3561 LIMIT 1) AS oxford_code, # The Oxford code of the sample; note the "LIMIT 1" to ensure only one value!
    (SELECT `value` FROM file2tag WHERE sample2file.file_id=file2tag.file_id AND tag_id=3602 LIMIT 1) AS `ebi_run_acc`,
    (SELECT `value` FROM file2tag WHERE sample2file.file_id=file2tag.file_id AND tag_id=3587 LIMIT 1) AS `ebi_sample_acc`
FROM sample2file # The "backbone table" of the query
WHERE sample_id IN (SELECT DISTINCT sample_id FROM sample2tag WHERE tag_id=3600 AND value IN (5833,36329,5847,57267,137071)) # Get all P falciparum samples
AND NOT EXISTS (SELECT * FROM file WHERE file.id=file_id AND (full_path LIKE "%phix%" OR full_path LIKE "%\_human%") ) # Exclude phiX/human
AND NOT EXISTS (SELECT * FROM file2tag WHERE sample2file.file_id=file2tag.file_id AND tag_id=8 AND `value`='0') # Exclude zero-size files, where file size is set
AND file_id IN (SELECT file_id FROM file2tag WHERE tag_id=3576 AND value IN ('bam','cram')) # Only BAM or CRAM
AND file_id NOT IN (SELECT parent FROM file_relation WHERE relation=3595) # Only one of multiple data files (e.g. BAM and CRAM with identical data)
AND file_id NOT IN (SELECT DISTINCT file_id FROM file2tag WHERE tag_id=3592 AND `value`='GBS') # Exclude genotyping-by-sequencing files
AND file_id NOT IN (SELECT DISTINCT file_id FROM file2tag WHERE tag_id=3577 AND `value`='0') # Exclude files with zero reads
AND file_id NOT IN (SELECT DISTINCT file_id FROM file2tag WHERE tag_id=3581 AND `value`='0') # Exclude manual_qc 0
HAVING
    `study_name` LIKE "%1126%"
;

Is there a problem with the query, or is the data missing from FITS?

Review data sources and updates document

Human Burkina Faso samples missing oxford codes

From the solaris_human database, the samples with sequencescape_sample_name like ILB_BFFUL* have oxford codes like BB<4 digits>-<C, M, F>.

For example:

SELECT sanger_sample_code, oxford_code 
FROM solaris_human.oxford_to_sequencescape 
where sanger_sample_code like '%ILB_BFFUL%' limit 3;

sanger_sample_code	oxford_code
ILB_BFFUL5541309	BB0001-C
ILB_BFFUL5541325	BB0001-F
ILB_BFFUL5541317	BB0001-M

In the fits database, the oxford_code is NULL, and one of the samples ILB_BFFUL5541309 seems to be missing:

select fits_sample_id, oxford_sample_id, sequenscape_sample_name from mm6_fits.vw_pivot_sample where sequenscape_sample_name in 
('ILB_BFFUL5541309', 'ILB_BFFUL5541325', 'ILB_BFFUL5541317');

fits_sample_id	oxford_sample_id	sequenscape_sample_name
20389		ILB_BFFUL5541317
20403		ILB_BFFUL5541325

Creating Anopheles manifest for Alistair

file-to-sample relations check

Solaris/VRpipe and the Multi-LIMS warehouse database disagree in some cases about which sample a file belongs to. This should be resolved, using VRpipe as the "truth".

Rewrite database description document

Based on previous comments from Richard (email, 28/06/2018 13:59).

Add source column to tag table

At present, some rows in the tag table have the "source" of the data in the note column, but this is not done consistently or completely. As an example, the note column for tag 3587 (manual qc) has "Sequenscape manual QC (1 or 0)". However, I don't think Sequencescape has been queried directly, right? So it would be good to know whether this information has come from mlwh (and if so, which table and column) or from iRODS (and if so, which tag).

Could we have a new source column in the tag table, which details the source of where the data came from?

In cases where the source is a relational database, I would suggest populating the source column with <database name>.<table name>.<column name>, e.g. subtrack.submission.ebi_sample_acc for id 3587. I think the current DB sources are one of the following (please expand as necessary):

mlwh
subtrack
solaris

In cases where the source is iRODS /seq, perhaps the source column could be populated as, for example, iRODS.seq.sample_supplier_name.

If the data is from some other data source, e.g. a flat file supplied by someone, please could you make sure the source of the file is clearly indicated, and also that the file is stored in github?
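The naming convention suggested above could be checked mechanically when the source column is populated. The following is a minimal sketch, assuming every source is written as exactly three dot-separated parts; the `TagSource` type and `parse_source` helper are hypothetical names used for illustration and are not part of FITS.

```python
from typing import NamedTuple


class TagSource(NamedTuple):
    """One entry of the proposed tag.source column (hypothetical helper type)."""
    system: str      # e.g. "mlwh", "subtrack", "solaris" or "iRODS"
    container: str   # table name, or the iRODS zone such as "seq"
    field: str       # column name, or the iRODS attribute name


def parse_source(source: str) -> TagSource:
    """Split '<database>.<table>.<column>' and reject anything else."""
    parts = source.split(".")
    if len(parts) != 3 or not all(part.strip() for part in parts):
        raise ValueError(f"expected <database>.<table>.<column>, got {source!r}")
    return TagSource(*(part.strip() for part in parts))


if __name__ == "__main__":
    print(parse_source("subtrack.submission.ebi_sample_acc"))
    print(parse_source("iRODS.seq.sample_supplier_name"))
```

Sources that come from a flat file would need a different convention (for example a repository path), which this sketch does not cover.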

Fix taxon IDs 62297 to 62323

The queries at #29 highlight that there was just one sample for each of the taxon IDs from 62297 to 62323. This is an issue I have previously spotted with Pf samples (see rows 189-273 at https://github.com/malariagen/SIMS/blob/master/meta/mlwh/mlwh_taxon_exceptions/mlwh_taxon_exceptions_20180203.txt). My guess is that these were entered incorrectly, by copying down in Excel or similar, and that they should all be taxon 62324 (A. funestus), but it would be good to double-check the sample IDs.

MalariaGEN sample ID tag?

When creating build manifests, we need an unambiguous set of files and file-sample mappings, where by sample I here mean either Oxford code or ROMA ID. At present, this sample ID can come from either the tag 3561 (oxford sample ID) or 3589 (sample supplier name). This requires a certain amount of domain knowledge when creating a build manifest, e.g. knowing that you should use 3561 if populated, else use 3589.

I am wondering if it might make things easier going forwards if we have a new tag, (MalariaGEN_sample_ID?) that is populated with the definitive sample ID for the sample (either Oxford code or ROMA ID). I haven't fully thought through the possible consequences of this, but thought I would put it out there as an idea. Do you think there could be any value in this?
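The fallback rule described above (use tag 3561 if populated, else 3589) can be written down in a few lines. This is only a sketch of that rule: the function name and the dictionary-of-tags input are assumptions made for illustration, not an existing FITS interface.

```python
from typing import Dict, Optional

OXFORD_SAMPLE_ID_TAG = 3561      # "oxford sample ID" tag, as named in the issue
SAMPLE_SUPPLIER_NAME_TAG = 3589  # "sample supplier name" tag, as named in the issue


def malariagen_sample_id(tags: Dict[int, Optional[str]]) -> Optional[str]:
    """Return the Oxford code if populated, otherwise the sample supplier name."""
    for tag_id in (OXFORD_SAMPLE_ID_TAG, SAMPLE_SUPPLIER_NAME_TAG):
        value = tags.get(tag_id)
        if value is not None and value.strip():
            return value.strip()
    return None  # neither tag is populated


if __name__ == "__main__":
    print(malariagen_sample_id({3561: None, 3589: "BK0042-C"}))
```

Storing the result in a dedicated tag would move this domain knowledge out of each manifest query and into the update process.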

Fix other taxon IDs

As well as the issues identified in #53 and #54 , the queries at #29 highlighted various other taxa with very few samples, many of which look like nothing we might have wanted to sequence. We should attempt to figure out what species these samples are, e.g. by looking at sample IDs. It is recommended this is done after #53 and #54. I would suggest looking at all taxa with < 50 samples, but also the 384 samples with taxon 5283 (Sporidiobolales?).

Once this has been done, we should hopefully have all samples attached to a taxon that is something to do with malaria (e.g. parasite, vector or human).
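The selection rule suggested above (all taxa with fewer than 50 samples, plus taxon 5283) might be sketched as follows. The input is assumed to be a mapping from sample ID to taxon ID, for instance built from the tag 3600 values; the function name and the sample IDs in the usage example are made up.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def taxa_needing_review(
    sample_taxa: Dict[Hashable, int],
    threshold: int = 50,
    always_review: Iterable[int] = (5283,),
) -> Dict[int, int]:
    """Return {taxon_id: sample_count} for taxa that should be looked at by hand."""
    counts = Counter(sample_taxa.values())
    forced = set(always_review)
    flagged = {
        taxon: n for taxon, n in counts.items() if n < threshold or taxon in forced
    }
    # Smallest groups first, since these are the most likely data-entry errors.
    return dict(sorted(flagged.items(), key=lambda item: (item[1], item[0])))


if __name__ == "__main__":
    demo = {f"pf{i}": 5833 for i in range(60)}
    demo["odd_sample"] = 62297
    print(taxa_needing_review(demo))
```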

Compare FITS to Richard's manifest

https://github.com/wtsi-team112/Pipelines_issues/issues/66

Review database description document

Wrong ENA sample accession given to some samples

Sometimes fits lists an additional erroneous ENA sample accession for a sample.

Here is an example:

select fits_sample_id, mlw_sample_id, oxford_sample_id, ena_sample_accession_id from vw_pivot_sample
where vw_pivot_sample.oxford_sample_id like 'AC0009-C%';

# fits_sample_id	mlw_sample_id	oxford_sample_id	ena_sample_accession_id
13498	1579623	AC0009-Cx	ERS223874
104435	1579623	AC0009-C	ERS177536|ERS223874

Sample AC0009-C was resequenced and given oxford code AC0009-Cx, which would explain why AC0009-Cx has the same ENA sample accession id as AC0009-C. However, AC0009-C has two ENA sample accessions: ERS223874, which is accurate and ERS177536, which is incorrect.

ERS177536 should belong to sample 'AD0687-C', but is attributed to multiple samples in fits:

select fits_sample_id, mlw_sample_id, oxford_sample_id, ena_sample_accession_id from vw_pivot_sample
where vw_pivot_sample.ena_sample_accession_id like '%ERS177536%';

# fits_sample_id	mlw_sample_id	oxford_sample_id	ena_sample_accession_id
3700	1465120	AD0687-C	ERS177536
18403	2590377		ERS177536
104435	1579623	AC0009-C	ERS177536|ERS224852

The files for sample accession ERS177536 are not multiplexed with samples with other ENA accessions:

SELECT file_name,ebi_sample_acc from submission as submission1
where submission1.ebi_sample_acc like '%ERS177536%';

# file_name	ebi_sample_acc
8763_2#22.bam	ERS177536
8812_7#22.bam	ERS177536
8812_8#22.bam	ERS177536

select  file_name, ebi_sample_acc from submission
where submission.file_name in 
('8763_2#22.bam', '8812_7#22.bam', '8812_8#22.bam') ;

# file_name	ebi_sample_acc
8763_2#22.bam	ERS177536
8812_7#22.bam	ERS177536
8812_8#22.bam	ERS177536
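To find other accessions affected in the same way, the pipe-delimited ena_sample_accession_id values could be split and grouped by accession. The sketch below assumes rows of (fits_sample_id, ena_sample_accession_id) as returned by the vw_pivot_sample queries above; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def accessions_on_multiple_samples(
    rows: Iterable[Tuple[int, Optional[str]]]
) -> Dict[str, List[int]]:
    """Map each ENA sample accession to the FITS samples carrying it, if more than one."""
    samples_by_accession = defaultdict(set)
    for fits_sample_id, accession_field in rows:
        # A field may hold several accessions separated by "|", or be NULL.
        for accession in (accession_field or "").split("|"):
            if accession:
                samples_by_accession[accession].add(fits_sample_id)
    return {
        accession: sorted(samples)
        for accession, samples in samples_by_accession.items()
        if len(samples) > 1
    }


if __name__ == "__main__":
    rows = [
        (3700, "ERS177536"),
        (18403, "ERS177536"),
        (104435, "ERS177536|ERS223874"),
        (104459, "ERS177536|ERS224852"),
    ]
    print(accessions_on_multiple_samples(rows))
```

Not every hit is an error: resequenced samples such as AC0009-C and AC0009-Cx legitimately share an accession, so the output is a list of candidates for manual review.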

Write documentation for fits command line tool

Use case: Create a build manifest containing all samples from a species

In the following, a build manifest is considered to be a file with one line per file, and to contain at the minimum:

iRODs path
Sample ID (Oxford code or ROMA ID)
Alfresco study code
NCBI taxon ID
Manual QC status
Date manual QC complete
ENA run accession
ENA sample accession
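As an illustration of the minimum content listed above, a manifest writer could enforce that every record carries all eight fields. This is a sketch only: the column names, the tab-separated layout and the `write_manifest` helper are assumptions, and the values in the usage example are placeholders rather than real FITS data.

```python
import csv
import io
from typing import Dict, Iterable, TextIO

# One column per minimum field listed above; the names themselves are assumed.
MANIFEST_COLUMNS = [
    "irods_path",
    "sample_id",
    "alfresco_study_code",
    "ncbi_taxon_id",
    "manual_qc",
    "manual_qc_date",
    "ena_run_accession",
    "ena_sample_accession",
]


def write_manifest(records: Iterable[Dict[str, object]], handle: TextIO) -> None:
    """Write one tab-separated line per file, failing loudly on missing fields."""
    writer = csv.DictWriter(
        handle, fieldnames=MANIFEST_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        missing = [column for column in MANIFEST_COLUMNS if column not in record]
        if missing:
            raise ValueError(f"record is missing columns: {missing}")
        writer.writerow({column: record[column] for column in MANIFEST_COLUMNS})


if __name__ == "__main__":
    placeholder = {column: "placeholder" for column in MANIFEST_COLUMNS}
    buffer = io.StringIO()
    write_manifest([placeholder], buffer)
    print(buffer.getvalue())
```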

Possible with current version; for “sample” meaning “Multi-LIMS warehouse sample ID”, and species defined as Sequenscape taxon ID for “Plasmodium falciparum”:

SELECT DISTINCT sample_id,(SELECT group_concat(DISTINCT `value`) FROM vw_sample_tag st2 WHERE tag_id IN (3589,3586) AND st1.sample_id=st2.sample_id) AS sample_name FROM vw_sample_tag st1 WHERE tag_id=3600 AND `value`='5833';

System for retrieving all samples from an Alfresco study

This is a placeholder for a potential future piece of work.

I get a few requests from @sclaugoncalves of the type "could you please send me a list of the samples sequenced under study 1128-PV-GSKMULTI-RAYNER". It would be great to have a system whereby Sonia can do this for herself.

This is not for the first release of FITS, and will probably need some discussion with Sonia before starting any actual work.

Command line query tool

A command line tool exists. It can be used to update FITS from the Multi-LIMS warehouse database. A query engine to ease generation of manifest files without compromising generic queries is under development.

Sequencing not done at WSI

FITS can store any file, given a unique path and a storage “engine”. Currently, the only engine supported is Sanger Sequencing iRODs, but more can be added without changing the database schema. Processes to import and update such data will have to be developed.

The first use case here is likely to be storing ENA run accessions.

Add historical context to manifest creation doc

I think it would be useful to have some historical context on how build manifests have been created in the past, in order to understand what some of the issues are.

Expanding the scope to include human data

FITS is agnostic to species; however, storing data about human samples might influence the storage location (US cloud?).

Create Pf 6.2 manifest from FITS

(https://github.com/wtsi-team112/Pipelines_issues/issues/64, spec can be found at https://github.com/wtsi-team112/Pipelines_issues/raw/master/work/63_Pf_6_2_build_manifest_spec/Pf%206.2%20build%20manifest%20specification%2020180906.docx).

Also compare to Richard's manifest: https://github.com/wtsi-team112/Pipelines_issues/blob/master/work/65_Pf_6_2_Pv4_like_manifest/Pf_62_Pv4_like_manifest_20180906.txt

Import relevant data from Solaris vw_vrpipe into FITS

Decision on whether to keep mlwh/iRODS and FITS in sync

At a recent meeting, @sclaugoncalves , @roamato and @podpearson decided it might be good to ensure mlwh, iRODS and FITS are kept in sync going forwards, e.g. if any mistakes such as incorrect file-sample mappings are identified, these should be updated in all three systems.

Before making the decision on whether we want to do this, we first need to understand what the data flows within Sanger Institute core systems are, e.g. if we make updates to SequenceScape, do these flow through to mlwh and iRODS? There is an issue within the SIMS project to determine this (https://github.com/malariagen/SIMS/issues/31, https://github.com/malariagen/SIMS/issues/34). Of particular interest here will be understanding if there are cases where changes might be made to mlwh or iRODS without our explicit consent (that at present might not get automatically pulled through to FITS).

If we do decide to do this, there are three possible ways we might implement it:

Make any updates to both Sanger systems and FITS at the same time
Ensure any updates to Sanger systems get pulled through to FITS in automatic updates
Ensure any updates to FITS get pushed through to Sanger systems

This issue is to make a decision on whether we want to attempt to keep the systems in sync.

Some anopheles samples missing alfresco study

The samples for the Sequencescape study with name "Ag 1000g" should be associated with the Alfresco study "1087-AN-HAPMAP-DONNELLY". However, some of these samples have null vw_pivot_sample.alfresco_study:

select count(*) from vw_pivot_sample where vw_pivot_sample.alfresco_study is null and vw_pivot_sample.sequenscape_study_name like 'Ag 1000g%';

count(*)
1486

This might affect other samples and studies, but I'm only listing the details of this particular study since we know that the SequenceScape study "Ag 1000g" should map to the Alfresco study "1087-AN-HAPMAP-DONNELLY".

Populate Alfresco study using sequencescape-alfresco study mapping file

In many cases there is a 1-to-1 mapping between sequencescape study and alfresco study. In some cases, the names are identical. In some cases, alfresco study can be (and has been) inferred from sequencescape study name (e.g. "IHTP_PWGS 1134-PF-ML-CONWAY" study is Alfresco study 1134-PF-ML-CONWAY). In some cases, the two are subtly different (IHTP_1131-PF-BN-BERTIN vs 1131-PF-BJ-BERTIN - note BN vs BJ). In some cases, domain knowledge is required (e.g. sequencescape study "Plasmodium HB3xDD2 progeny" maps to Alfresco study "1041-PF-US-FERDIG").

Rather than inferring based on some rule, a more complete and accurate method for populating "Alfresco study" from sequencescape study might be to use a mapping file. I previously created such a thing when I was building manifests. A symlink to the latest version can be found at /nfs/team112_internal/rp7/src/github/malariagen/SIMS/meta/mlwh/sequencescape_alfresco_study_mappings.txt.

Could we consider incorporating such a mapping file into the process of populating the "Alfresco study" tags?
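A lookup driven by such a mapping file could look roughly like the sketch below. It assumes a two-column, tab-separated file (sequencescape study name, then Alfresco study code); the actual layout of sequencescape_alfresco_study_mappings.txt is not described here, so the parsing would need adjusting to match it. The function names are hypothetical.

```python
import csv
import io
from typing import Dict, Optional, TextIO


def load_study_mappings(handle: TextIO) -> Dict[str, str]:
    """Read 'sequencescape study<TAB>Alfresco study' lines into a dictionary."""
    mappings = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank, malformed or comment lines
        mappings[row[0].strip()] = row[1].strip()
    return mappings


def alfresco_study_for(sequencescape_study: str, mappings: Dict[str, str]) -> Optional[str]:
    """Return the mapped Alfresco study, or None so unmapped studies stay visible."""
    return mappings.get(sequencescape_study.strip())


if __name__ == "__main__":
    example = io.StringIO(
        "IHTP_PWGS 1134-PF-ML-CONWAY\t1134-PF-ML-CONWAY\n"
        "IHTP_1131-PF-BN-BERTIN\t1131-PF-BJ-BERTIN\n"
        "Plasmodium HB3xDD2 progeny\t1041-PF-US-FERDIG\n"
    )
    table = load_study_mappings(example)
    print(alfresco_study_for("Plasmodium HB3xDD2 progeny", table))
```

Returning None instead of guessing keeps studies without a mapping explicit, so they can be added to the file rather than inferred by rule.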

Use case: Create a build manifest given a set of sequencescape IDs, oxford code/ROMA IDs or Alfresco study codes

In the following, a build manifest is considered to be a file with one line per file, and to contain at the minimum:

iRODs path
Sample ID (Oxford code or ROMA ID)
Alfresco study code
NCBI taxon ID
Manual QC status
Date manual QC complete
ENA run accession
ENA sample accession

Create a build manifest given a set of sequencescape IDs

Possible with current version:

SELECT vw_sample_tag.value,full_path FROM vw_sample_tag,vw_sample_file WHERE tag_id=3585 AND vw_sample_tag.sample_id=vw_sample_file.sample_id AND `value` IN (list_of_sequenscape_IDs);

See [how to build a manifest](https://github.com/wtsi-team112/fits/blob/master/documentation/How_to_build_a_manifest.md).

Create a build manifest given a set of Oxford codes and/or ROMA IDs

Possible with current version, similar to above.

Create a build manifest given a set of Alfresco study codes

The current version does not track Alfresco study codes. These can be imported, though a sample tracking system might be a more appropriate place for this information. Alfresco study names (number and text), imported from Solaris, are present in FITS for many samples, as are study names from Sequencescape, in various stages of completeness and correctness.

Some samples are missing oxford codes but they exist in solaris.vw_samples

For example:

SELECT * FROM solaris.vw_samples where oxford_code like '%BK0042%';

# id	oxford_code	oxford_src_code	oxford_donor_code	taxon_id	type
70585	BK0042-C	RCA_49	RCA_49	36	dna

From FITS the sample_supplier_name is populated but not the oxford_sample_id.

SELECT vw_pivot_sample.oxford_sample_id, vw_pivot_file.sample_supplier_name 
FROM mm6_fits.vw_pivot_file, vw_pivot_sample, vw_sample_file 
where vw_pivot_file.full_path like '%19884_8#1.cram'
and vw_pivot_file.fits_file_id = vw_sample_file.file_id
and vw_sample_file.sample_id = vw_pivot_sample.fits_sample_id;

# oxford_sample_id	sample_supplier_name
	BK0042-C

Perhaps the culprit is solaris.vw_vrpipe? It's missing an entry for that oxford code.

SELECT count(*) FROM solaris.vw_vrpipe where oxford_code like '%BK0042%';

# count(*)
0

Investigate discrepancies between files and Quan's list of submissions

At the time of writing this is waiting on a file of all submissions from Quan. See following email thread:

Hi Richard,

Just to let you know, Quan will generate the full list but it will take her some time.

Best, Sonia

From: Sonia Morgado Goncalves
Sent: 20 November 2018 09:46
To: 'Richard Pearson'
Cc: Roberto Amato
Subject: RE: FW: files with missing run accessions

Hi Richard,

We submitted 274 samples from study 1209-PF-VN-IMPEQN-GENRE to sequencing. So, yes, probably all other files are from miseq runs.

Yes, all gets released immediately (both miseq and hiseq) because the release policy is for the study as a whole.

I’ll ask Quan the full list.

Thanks, best

Sonia

From: Richard Pearson
Sent: 19 November 2018 19:23
To: Sonia Morgado Goncalves
Cc: Roberto Amato
Subject: Re: FW: files with missing run accessions

Hi Sonia

The four files highlighted by Quan are very old. These are all from PG0002-C which was a version of 3D7 from a study called 1032-PF-BRHN-SMITHEE which I think is essentially an R&D study. Although there are no run accessions for these, they all have the sample accession ERS010299. PG0002-C has more different accessions than any other sample, having 28 different run accessions and 16 different sample accessions. It is strange that only these 4 files are missing run accessions, and it would be interesting to know why this is, but probably not worth the effort since I'm fairly sure we'll never use these files.

What I think is worth chasing up from Quan's email is the attached spreadsheet for which the number of files I think we have is less than the number that were submitted. Especially striking is study 1209-PF-VN-IMPEQN-GENRE. I have 274 files from this study, but Quan's spreadsheet suggests we have submitted 1,042 files to the ENA. My guess would be that Quan is including amplicon sequencing files, but if this was the case, I would expect many more differences than what we actually see in the spreadsheet. Do you know how many samples from 1209-PF-VN-IMPEQN-GENRE were submitted for whole genome sequencing? Also, what is our policy for submitting amplicon sequencing files to the ENA - is it the same as for whole genome, i.e. we submit everything immediately?

I think it would be good to know why we have these discrepancies in case we are missing samples. Could you ask Quan to send the full list of files that the spreadsheet was based on, ideally including study and our sample identifier (supplier_sample_name)?

Thanks, Richard

On 14/11/2018 11:54, Sonia Morgado Goncalves wrote:
Richard,

Can you please help here?

Thanks, best

Sonia

From: Quan Lin
Sent: 14 November 2018 11:20
To: Sonia Morgado Goncalves; Quan Lin
Cc: Elizabeth Cook; Catherine Baker; Data submission service
Subject: Re: files with missing run accessions

Hi Sonia,

I've checked all of your studies. Number of submitted files match "number of files" you gave for most studies.  For the ones that differ I have added the number in the "submitted files" column on the attached spreadsheet.

It's not possible to submit the following files as I could not find any metadata for them in the database.
245_8_nonhuman.bam
245_7_nonhuman.bam
368_8_nonhuman.bam
368_7_nonhuman.bam

Quan

Web front end

A web frontend is not planned at this time; however, one could be developed rather easily.

Check out Alistair's suggestions re. managing the development process

(email 19/07/2018 14:48)

Use case: Manually alter FITS mappings

The key mappings that need to be captured are:

File-to-sample
Sample-to-taxon
Sample-to-alfresco_study_code

The changes will need to be done in such a way that data can be overwritten or removed by later updates from mlwh/iRODS or elsewhere.

Finalise MVP overview draft

Use case: Create a detailed BAM/CRAM manifest file with metadata, based on species common names

Possible with current version; example for Plasmodium vivax V4 (this included a lot of filtering of unwanted files/samples):

SELECT file.id,file.full_path,
(SELECT group_concat(value) FROM vw_sample_tag,vw_sample_file WHERE file.id=file_id AND vw_sample_file.sample_id=vw_sample_tag.sample_id AND tag_id=3561) AS ox_code,
(SELECT group_concat(value) FROM vw_file_tag WHERE file.id=file_id AND tag_id=3591) AS common_name,
(SELECT group_concat(value) FROM vw_sample_tag,vw_sample_file WHERE file.id=file_id AND vw_sample_file.sample_id=vw_sample_tag.sample_id AND tag_id=3600) AS taxon_code
FROM file
WHERE storage=1
AND file.id IN (SELECT file_id FROM vw_file_tag WHERE tag_id=3591 AND value in ('Plasmodium vivax','vivax','P.vivax','Plasmodium Vvax','P. Vivax')) /*SPECIES NAMES*/
AND full_path NOT LIKE "%\_human%" AND full_path NOT LIKE "%\_phix%" /*NOT HUMAN OR PHIX; "\_" matches a literal underscore, so "nonhuman" files are kept*/
AND file.id NOT IN (SELECT parent FROM file_relation WHERE relation=3595) /*NOT IDENTICAL TO OTHER FILE*/
AND EXISTS (SELECT * FROM vw_file_tag WHERE file.id=vw_file_tag.file_id AND tag_id=3576 AND value IN ('bam','cram')) /*FILE TYPE*/
AND EXISTS (SELECT * FROM vw_file_tag WHERE file.id=vw_file_tag.file_id AND tag_id=3581 AND value=1) /*MANUAL QC*/
AND NOT EXISTS (SELECT * FROM vw_file_tag WHERE file.id=vw_file_tag.file_id AND tag_id=3582 AND value=1) /*NO R&D*/

Moving the system to a public cloud

The FITS database currently resides at Sanger. For better interoperability with Oxford and/or cloud locations, a move to a cloud-based database system is planned once the MVP has stabilized. A database-as-a-service, rather than a generic server running MySQL, would be preferred, for ease of maintenance, backups, availability etc.
Storage of human sample data might complicate finding an adequate cloud location.

Review build manifest creation document

Add Sanger sequencing and Sequenom raw data files to FITS

Expanding the scope to include amplicon sequencing data

FITS is agnostic to file types. Storage engines such as “Team 112 iRODs” or “Sanger NFS” can be added. Processes to import and update such data will have to be developed.

Files with no sample_supplier_name in iRODS metadata

FITS contains two files for each of 8 samples in study 1126-PF-LAB-FAIRHURST (PG0469-C, PG0454-C, PG0446-C, PG0455-C, PG0457-C, PG0470-C, PG0456-C, PG0445-C). It seems that in each case, the two files both have the same sequencescape sample ID, but only one of the files has a sample_supplier_name attached. It also appears to be the case that the files without the sample_supplier_name have far fewer reads and as such are unlikely to be useful even if they are actually from the same sample. The reason my query against mlwh is not pulling in the second files for these samples is that iseq_product_metrics.id_iseq_flowcell_tmp is not populated for them. See https://github.com/malariagen/fits/blob/master/work/40_sample_supplier_name_not_in_irods/20181130_FITS_vs_mlwh_differences_for_study_1126.ipynb for further details.

It is recommended that we do the following:

Talk to Sonia about following up with sequencing core to understand why the second file was given the same sequencescape ID, but not a sample_supplier_name
Determine how many other cases we have in FITS for which the supplier_sample_name is not populated
Make a decision on what we should do with such samples, e.g.
- Exclude entirely
- Keep in FITS but mark in some way
- Leave them as are
Determine and document a way to implement the above decision
Implement the decision

A second issue is that sample_accessions are not being populated with the query above run against FITS. I'll create a separate issue to track this.

Write document describing the data sources and database update process

Use case: Determine how many samples have been sequenced, broken down by species

Possible with current version; for “sample” meaning “Multi-LIMS warehouse sample ID”:

SELECT `value`,count(*) AS cnt,(SELECT taxon_name FROM sequenscape_taxa WHERE taxon_id=`value`) AS name FROM vw_sample_tag WHERE tag_id=3600 GROUP BY `value` ORDER BY cnt DESC;

Use case: Update the file tracking database with the latest from Multi-LIMS warehouse/baton (iRODs)

A command-line tool exists. The command to perform the above operation on Sanger systems is:
./fits update_sanger

Decision on whether to apply solaris-vs-mlwh differences to mlwh

FITS has data feeds from both Sanger core systems (e.g. mlwh and iRODS) and the team112 system Solaris. Where there have been conflicts between the data from the two, we have been treating Solaris as the "truth set". For completeness, and to minimise data loss through redundancy, we might decide we want to apply any difference between Solaris and Sanger core systems back to those core systems. This issue is to make a decision on this. Note this probably only makes sense if we also decide to keep FITS in sync with core systems, and hence this issue depends on #48 .

To date, we have not used any manifest created from FITS for any Plasmodium builds, but this is the aim going forwards. @magnusmanske has written code relevant to this task (https://github.com/malariagen/fits/blob/master/documentation/How_to_build_a_manifest.md) and has been comparing results (#4) with a manifest created by @podpearson using mlwh and a set of "exceptions files" (https://github.com/malariagen/SIMS/tree/master/meta/mlwh).

In this issue, I intend to create a manifest with FITS to convince myself I know how to do it.

Write document describing how to create a build manifest from the database

vw_pivot_file.alignment_filter column sometimes different from IRODS

Sometimes the alignment_filter is empty in FITS, but populated in IRODS.

For example, the FITS query:

SELECT * FROM mm6_fits.vw_pivot_file where full_path like '%_human.bam' and alignment_filter is null


# full_path, alignment_filter
/seq/245/245_3_nonhuman.bam, 
/seq/245/245_5_nonhuman.bam, 
/seq/245/245_6_nonhuman.bam, 
/seq/245/245_7_nonhuman.bam, 
/seq/245/245_8_nonhuman.bam, 
/seq/368/368_3_nonhuman.bam, 
/seq/368/368_5_nonhuman.bam, 
/seq/368/368_6_nonhuman.bam, 
/seq/368/368_7_nonhuman.bam, 
/seq/368/368_8_nonhuman.bam, 
/seq/531/531_5_nonhuman.bam, 
/seq/531/531_6_nonhuman.bam, 
/seq/585/585_1_nonhuman.bam, 
/seq/585/585_2_nonhuman.bam, 
/seq/585/585_3_nonhuman.bam, 
/seq/585/585_5_nonhuman.bam,

IRODS query:

$ imeta ls -d /seq/245/245_3_nonhuman.bam alignment_filter 
AVUs defined for dataObj /seq/245/245_3_nonhuman.bam:
attribute: alignment_filter
value: nonhuman
units:
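To check many files rather than one, the `imeta ls -d` output shown above could be parsed and compared with the FITS value. The sketch below assumes the output always follows the attribute/value/units layout shown; how the output is captured (for example via a subprocess call or baton) is left out, and both function names are hypothetical.

```python
from typing import Dict, List, Optional


def parse_imeta_avus(output: str) -> Dict[str, List[str]]:
    """Collect {attribute: [values]} from the text printed by 'imeta ls -d'."""
    avus: Dict[str, List[str]] = {}
    attribute = None
    for line in output.splitlines():
        key, separator, value = line.partition(":")
        if not separator:
            continue
        key, value = key.strip(), value.strip()
        if key == "attribute":
            attribute = value
        elif key == "value" and attribute is not None:
            avus.setdefault(attribute, []).append(value)
    return avus


def missing_in_fits(
    fits_value: Optional[str], imeta_output: str, attribute: str = "alignment_filter"
) -> bool:
    """True when FITS has no value but iRODS does, as in the example above."""
    irods_values = parse_imeta_avus(imeta_output).get(attribute, [])
    return not fits_value and bool(irods_values)


if __name__ == "__main__":
    sample_output = (
        "AVUs defined for dataObj /seq/245/245_3_nonhuman.bam:\n"
        "attribute: alignment_filter\n"
        "value: nonhuman\n"
        "units:\n"
    )
    print(parse_imeta_avus(sample_output))
    print(missing_in_fits(None, sample_output))
```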