whitews / flowio
A Python library for reading and writing Flow Cytometry Standard (FCS) files
Home Page: https://flowio.readthedocs.io
License: BSD 3-Clause "New" or "Revised" License
I have some flow cytometry data in LMD format. Is it possible to work with it in Python using FlowIO, or should I look for another library?
Thanks for helping me!
The FlowData class uses the HEADER values to locate the TEXT section. The TEXT segment is then parsed, and the values in the TEXT metadata are used to locate the DATA section. However, there are reports of rare FCS files where the HEADER & TEXT data offset locations differ. FlowIO should raise a custom exception when there is a discrepancy in the data section byte offset location. However, in the case of large files (>99,999,999 bytes) a different value is expected, because the HEADER offset fields are limited to 8 bytes.
Proposed solution:
By default, FlowIO will use the TEXT data offsets and load a file normally when the HEADER & TEXT values agree, or when the HEADER value is 0 for large files.
When the two values disagree and the discrepancy is not due to the large-file scenario, FlowIO will raise an exception.
A user who wants to ignore a discrepancy and use the TEXT value can force loading of the file with a new ignore_offset_discrepancy option (which defaults to False).
A user who wants to ignore a discrepancy and use the HEADER value can force loading of the file with a new use_header_offsets option (which also defaults to False).
Setting both ignore_offset_discrepancy & use_header_offsets to True will be equivalent to just setting use_header_offsets to True.
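The proposed resolution logic can be sketched as a small helper. This is a sketch only: choose_data_offsets is a hypothetical function, not part of flowio's API, and the large-file case is assumed to be signalled by zeroed HEADER offsets.

```python
def choose_data_offsets(header_start, header_stop, text_start, text_stop,
                        ignore_offset_discrepancy=False,
                        use_header_offsets=False):
    # Sketch of the proposed resolution logic (not flowio's actual code).
    # Large-file case: HEADER offsets are zeroed when the segment lies
    # beyond the 8-character field limit, so only TEXT offsets are usable.
    large_file_case = header_start == 0 and header_stop == 0
    if (header_start, header_stop) == (text_start, text_stop) or large_file_case:
        return text_start, text_stop
    if use_header_offsets:
        # user explicitly trusts the HEADER values
        return header_start, header_stop
    if ignore_offset_discrepancy:
        # user explicitly trusts the TEXT values
        return text_start, text_stop
    raise ValueError("HEADER and TEXT data offsets disagree")
```

Checking use_header_offsets first also gives the behavior described above: setting both options to True is equivalent to use_header_offsets alone.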
Hi there,
I'm sure this is going to be trivial, but I just can't seem to get my head around it:
Opening up a standard FCS2.0 file in FlowIO results in values which appear as integers (312, 412 etc).
Opening the same file in FlowKit seems to result in the values being imported as floats with reasonable precision which is what I would expect. I lose quite a bit of information when importing with FlowIO vs FlowKit because of this.
As I said, I'm sure this is something trivial like a missing transform argument, but I was hoping I could be pointed in the right direction. Thanks!
Hi Scott,
I just came across a FlowKit-created file that couldn't be read by flowCore because checkOffset failed (it could be read with FlowKit, though). After some debugging we found that the data segment offset was larger than 99,999,999. In that case, start and end should be set to zero according to the standard (see below). This limit is not checked in create_fcs.py, but the code contains the following comment:
# TODO: set zeroes if data offsets are greater than header max data size
It's an easy fix. Will try to issue a PR over the coming days
From https://www.genepattern.org/attachments/fcs_3_1_standard.pdf:
"FCS 3.1 maintains support introduced in FCS 3.0 for data sets larger than 99,999,999 bytes.
When any portion of a segment falls outside the 99,999,999 byte limit, '0's are substituted in the
HEADER for that segments begin and end byte offset. The byte offsets for begin DATA, end
DATA, begin ANALYSIS, end ANALYSIS (begin and end supplemental TEXT if appropriate) will
then only be found as keyword-value pairs in the primary TEXT segment. Note, when a segment
is contained completely within the first 99,999,999 bytes of a data set, the byte offsets for that
segment will be duplicated in the TEXT segment as keyword values. Note also, if the ANALYSIS
offsets in the HEADER are zero, the $BEGINANALYSIS and $ENDANALYSIS keywords must be
checked to determine if an ANALYSIS segment is present. "
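The zero-substitution rule quoted above can be sketched for the writer side. This is a hypothetical helper, not the actual create_fcs.py code; the 8-character, right-justified HEADER field format follows the FCS standard.

```python
def header_offset_fields(begin, end, max_offset=99_999_999):
    # Per FCS 3.0/3.1: if any portion of a segment falls beyond the
    # 8-digit HEADER limit, substitute '0' for both offsets; the real
    # values then live only in TEXT keywords ($BEGINDATA/$ENDDATA etc.).
    if end > max_offset:
        begin, end = 0, 0
    # HEADER offsets are right-justified, space-padded 8-character fields
    return '%8d%8d' % (begin, end)
```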
In R with flowCore, the file 100715.fcs has sensible values when using the truncate_max_range option:
library(flowCore)
ff <- read.FCS("100715.fcs", truncate_max_range = FALSE)
print(summary(ff))
FSC-A FSC-H SSC-A B515-A R780-A R710-A
Min. 23406.00 27008.50 -8.014621 -67.28254 -67.11903 -44.55855
1st Qu. 34158.00 33993.69 168.284748 1988.90176 528.52808 1127.26749
Median 41644.25 40842.88 219.902458 2851.75439 898.31769 1732.62372
Mean 44672.32 43189.90 331.918749 3203.20944 1253.38825 2342.41606
3rd Qu. 50878.94 49323.50 278.990265 3344.79010 1515.64224 2637.69879
Max. 262143.50 256543.75 46248.464844 261572.65625 261566.53125 261455.73438
R660-A V800-A V655-A V585-A V450-A
Min. -79.8198 -110.4093 -66.27671 -110.4727 -28.87656
1st Qu. 738.0033 1302.3726 1136.13980 2360.6066 1785.75531
Median 1100.6382 1879.1382 1725.22186 3601.5178 2450.64038
Mean 1507.4539 2668.0656 2587.16628 4795.2020 2889.03029
3rd Qu. 1521.9344 2623.6032 2211.24213 4681.3687 3266.69647
Max. 261584.7656 261410.8906 261585.40625 261586.5312 253481.90625
G780-A G710-A G660-A G610-A G560-A
Min. -110.5278 -89.50339 -51.95771 -61.93935 -33.25663
1st Qu. 2011.9602 1462.61804 1812.38644 1411.36789 1940.51962
Median 3073.3077 2159.56616 2705.18494 2117.23547 2812.36877
Mean 4140.8475 2629.49203 3336.95553 2536.82272 3581.05912
3rd Qu. 4898.1101 2993.53485 3459.33716 2751.39990 4092.05743
Max. 261539.9219 261563.79688 261581.21875 261537.01562 261576.21875
Warning message:
No '$PnE' keyword available for the following channels: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
Using '0,0' as default.
With FlowIO (and also fcsparser), the raw values are far more variable, missing the correction of truncate_max_range:
import pandas as pd
import numpy as np
import flowio

fcs_data = flowio.FlowData('example.fcs')
npy_data = np.reshape(fcs_data.events, (-1, fcs_data.channel_count))
df_describe = pd.DataFrame(npy_data)
df_describe.describe()
0 1 2 3 4 \
count 6.498900e+04 6.498500e+04 6.498000e+04 6.497800e+04 6.498300e+04
mean inf inf inf inf inf
std inf inf inf inf inf
min -1.186825e+29 -1.186825e+29 -1.186825e+29 -1.186825e+29 -1.186825e+29
25% -1.483879e+00 -1.483879e+00 -1.483879e+00 -1.483879e+00 -1.483879e+00
50% -4.832434e-02 -4.832434e-02 -4.832434e-02 -4.820263e-02 -4.832434e-02
75% 1.710476e+02 1.722441e+02 1.485543e+02 1.969365e+02 1.921069e+02
max 3.393232e+38 3.286675e+38 3.379840e+38 3.339942e+38 3.339699e+38
5 6 7 8 9 \
count 6.498400e+04 6.498500e+04 6.498600e+04 6.498100e+04 6.498800e+04
mean inf inf inf inf inf
std inf inf inf inf inf
min -1.186825e+29 -1.186825e+29 -1.186825e+29 -1.186825e+29 -1.186825e+29
25% -1.483879e+00 -1.483879e+00 -1.483879e+00 -1.483879e+00 -1.483879e+00
50% -4.832434e-02 -4.832434e-02 -4.820570e-02 -4.832434e-02 -4.821291e-02
75% 1.402682e+02 1.932709e+02 2.060314e+02 1.809365e+02 1.839689e+02
max 3.335518e+38 3.379818e+38 3.401980e+38 3.379732e+38 3.393125e+38
10 11 12 13 14 \
count 6.497800e+04 6.498600e+04 6.498200e+04 6.497900e+04 6.498600e+04
mean inf inf inf inf inf
std inf inf inf inf inf
min -1.186825e+29 -1.186825e+29 -1.186825e+29 -1.186825e+29 -1.186825e+29
25% -1.483879e+00 -1.483879e+00 -1.483879e+00 -1.483879e+00 -1.483879e+00
50% -4.832434e-02 -4.819864e-02 -4.832434e-02 -4.832434e-02 -4.820315e-02
75% 1.917931e+02 2.158215e+02 1.689365e+02 2.039365e+02 2.217711e+02
max 3.379818e+38 3.180088e+38 3.326649e+38 3.393111e+38 3.261503e+38
15
count 6.498800e+04
mean inf
std inf
min -1.186825e+29
25% -1.483879e+00
50% -4.817861e-02
75% 2.440290e+02
max 3.299823e+38
It seems to be a mismatch between 64- and 32-bit integer types, which should raise a warning. Is there a parsing option similar to truncate_max_range in Python?
Hey,
The current release doesn't allow users to choose which dataset they want to load if a file contains multiple datasets, or am I missing something?
It seems like all the functions are already prepared to read a selected dataset, but there is no check of the NEXTDATA keyword to recognize whether more than one dataset is available.
I could try to open a pull request for this, but wanted to check in with you first.
Best wishes
Max
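The $NEXTDATA chaining described above could be sketched like this. Purely illustrative: read_text_segment is a hypothetical stand-in for a parser callback, not a flowio API, and $NEXTDATA is taken to be relative to the start of the current dataset, with 0 meaning no further datasets.

```python
def collect_dataset_offsets(read_text_segment, start=0):
    # Walk the $NEXTDATA chain to find every dataset's byte offset.
    # `read_text_segment(offset)` returns the TEXT keywords (a dict)
    # of the dataset beginning at `offset`.
    offsets = [start]
    while True:
        text = read_text_segment(offsets[-1])
        next_rel = int(text.get('nextdata', 0))
        if next_rel == 0:
            return offsets
        offsets.append(offsets[-1] + next_rel)
```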
When using write_fcs to save a FlowData object to file, the PnN and PnS labels are mixed up if there are more than 9 channels.
This is because the channel dictionary is sorted in lexicographic order in this method.
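The sorting pitfall is easy to reproduce in plain Python. The 'pNn'-style key names below are illustrative; the exact dictionary keys flowio uses may differ.

```python
# Channel keys like 'p1n', 'p2n', ... sort lexicographically, which
# scrambles the order once there are more than 9 channels:
keys = ['p%dn' % i for i in range(1, 12)]  # p1n .. p11n

lexicographic = sorted(keys)  # 'p10n' and 'p11n' sort before 'p2n'
# Sorting numerically by the embedded parameter index restores the order:
numeric = sorted(keys, key=lambda k: int(k[1:-1]))
```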
FlowData.write_fcs will fail in Python 3. Running the following under Python 3.x:
data = flowio.FlowData(fcsfile)
data.write_fcs(fcsfile, extra=annotations)
will result in:
File "/home/campus.ncl.ac.uk/b8051106/.local/lib/python3.8/site-packages/flowio/flowdata.py", line 396, in write_fcs
fh.write('FCS3.1')
TypeError: a bytes-like object is required, not 'str'
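A minimal reproduction and fix, assuming the file handle is opened in binary mode (a sketch only, using io.BytesIO in place of a real file):

```python
import io

# In Python 3, a binary file handle rejects str; text must be encoded
# to bytes before writing.
fh = io.BytesIO()
# fh.write('FCS3.1')                # TypeError: a bytes-like object is required
fh.write('FCS3.1'.encode('ascii'))  # OK: write ASCII bytes instead
```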
When parsing an FCS file that contains int data with variable lengths, this line raises an exception in Python3:
Line 259 in 51d10fb
OS: Windows, but it should happen on every platform.
This is pretty easy to fix, but a secondary issue is that parsing each value individually is very slow (multiple minutes for a decent-sized file). I've got a fix for both (which also simplifies that part of the code); I'll make a pull request!
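On the speed issue: when every channel happens to share the same $PnB width, the whole DATA segment can be decoded in one vectorized call instead of one struct unpack per value. This is a sketch of that general approach under those assumptions, not the submitted fix, and it does not cover truly variable widths:

```python
import numpy as np

def parse_uniform_int_data(raw, n_channels, bits=32, big_endian=False):
    # Decode the entire DATA segment in one call when all channels
    # share the same unsigned-integer width ($PnB).
    dtype = np.dtype('%su%d' % ('>' if big_endian else '<', bits // 8))
    return np.frombuffer(raw, dtype=dtype).reshape(-1, n_channels)
```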
I am using FlowIO to parse an FCS 3.0 file generated by the iQue. When I parse the file, no event data is returned even though there are 72,374 events across 32 channels. Looking at the code, I believe this is because in the file's metadata end_data = begin_data, which makes the estimation of data_start and data_stop for the function __calc_data_item_count
incorrect. If I instead set data_stop equal to data_start + event_count * num_channels * 4 - 1, I am able to correctly read out the event data. FCSParser also produces the expected behavior on this file.
Is this an edge-case scenario, and if so, is there some way to account for it via a parameter (i.e. estimate data_stop from the event count), or am I not instantiating the object correctly?
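The workaround described above amounts to the following. A sketch only: estimated_data_stop, the parameter names, and the 4-byte value size are assumptions based on this report (4 bytes matches 32-bit values).

```python
def estimated_data_stop(data_start, event_count, num_channels, bytes_per_value=4):
    # When $ENDDATA equals $BEGINDATA, estimate the end of the DATA
    # segment from the event count ($TOT), channel count ($PAR), and
    # the per-value size instead of trusting the metadata.
    return data_start + event_count * num_channels * bytes_per_value - 1
```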
Hi
I am looking to manipulate a certain set of channels from an .fcs file. I have been reading the code, and it is not entirely clear to me how I would proceed to do that using your library. Is it possible?
Thanks in advance!
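One way to get at individual channels is to reshape the flat FlowData.events sequence into an events-by-channels array. This is a sketch: scale_channel is a hypothetical helper, though events and channel_count are documented FlowData attributes.

```python
import numpy as np

def scale_channel(flat_events, channel_count, channel_index, factor):
    # Reshape flowio's flat event sequence (FlowData.events) into an
    # (events x channels) array, then scale one channel in place.
    # `channel_index` is 0-based.
    events = np.reshape(np.asarray(flat_events, dtype=float),
                        (-1, channel_count))
    events[:, channel_index] *= factor
    return events
```

After editing, flowio's create_fcs() can write a modified array back to a new FCS file (see the library docs for its exact signature).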
I'm just starting to look into implementing a python-based interpretation of fcs files using flowkit. However, I'm having trouble right at the beginning, with flowIO unable to load the fcs files I'm working with.
I'm getting the warning:
UserWarning: text in segment does not start and end with delimiter
and later:
error Traceback (most recent call last)
<ipython-input-5-bca60f0715fe> in <module>
----> 1 fd = flowio.FlowData('G11.fcs')
~/anaconda2/envs/analyzefacsmore/lib/python3.6/site-packages/flowio/flowdata.py in __init__(self, filename)
81 d_start,
82 d_stop,
---> 83 self.text)
84
85 try:
~/anaconda2/envs/analyzefacsmore/lib/python3.6/site-packages/flowio/flowdata.py in __parse_data(self, offset, start, stop, text)
192 stop,
193 data_type.lower(),
--> 194 order)
195 else: # ascii
196 data = self.__parse_ascii_data(
~/anaconda2/envs/analyzefacsmore/lib/python3.6/site-packages/flowio/flowdata.py in __parse_float_data(self, offset, start, stop, data_type, order)
256
257 tmp = unpack('%s%d%s' % (order, num_items, data_type),
--> 258 self.__read_bytes(offset, start, stop))
259 return tmp
260
error: unpack requires a buffer of 277676 bytes
I'm consistently getting this error for all the fcs files we are producing (on an Attune; not sure of the software version). The FlowIO load works great for an example fcs file from the FlowKit examples.
I uploaded one file as an example.
issue_file.zip
Hey @whitews,
there is the new FCS 3.2 standard, which has some new keywords, especially PnDATATYPE. This allows single columns to have a different datatype than the one set by the DATATYPE keyword. See 3.3.41 on page 41. Do you already have plans to support that? It might be a problem, because the values in the array used must all be of the same type, as I understand it.
I will also think about a solution but wanted to check in with you first.
Best wishes
Max
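One possible direction for mixed per-channel datatypes is a NumPy structured array, which allows a different dtype per field. This is an illustrative sketch only, not a committed design; the channel names are made up.

```python
import numpy as np

# Per-channel datatypes (FCS 3.2 $PnDATATYPE) cannot live in one
# homogeneous ndarray, but a structured array can hold them:
mixed = np.zeros(4, dtype=[('FSC-A', '<u4'), ('FL1-A', '<f4')])
mixed['FSC-A'] = [10, 20, 30, 40]      # integer channel ($PnDATATYPE=I)
mixed['FL1-A'] = [0.5, 1.5, 2.5, 3.5]  # float channel ($PnDATATYPE=F)
```

The trade-off is that structured arrays lose the simple (events x channels) matrix shape that downstream code may expect.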
Hi,
Would it be possible to have a license file so we can use your package?
Cheers,
Andrea
The error Exception: REPORT BUG: error calculating text offset is thrown when exporting some FCS files. See the FlowKit issue here.