
csvtk's Introduction

csvtk - a cross-platform, efficient and practical CSV/TSV toolkit

Introduction

Similar to the FASTA/Q formats in the field of bioinformatics, CSV/TSV are basic and ubiquitous file formats in both bioinformatics and data science.

People usually use spreadsheet software like MS Excel to process tabular data. However, this is all done by clicking and typing, which is not automated and is time-consuming to repeat, especially when you want to apply similar operations to different datasets or for different purposes.

You can also accomplish some CSV/TSV manipulations using shell commands, but more code is needed to handle the header line, and shell commands do not support selecting columns by name.

csvtk is convenient for rapid data investigation and easy to integrate into analysis pipelines. It could save you lots of time (not) writing Python/R scripts.

Features

  • Cross-platform (Linux/Windows/Mac OS X/OpenBSD/FreeBSD)
  • Lightweight and out-of-the-box: no dependencies, no compilation, no configuration
  • Fast, with multiple CPUs supported by some commands
  • Practical functions provided by 53 subcommands
  • Supports STDIN and gzipped input/output files, making it easy to use in pipes
  • Most of the subcommands support unselecting fields and fuzzy fields, e.g. -f "-id,-name" for all fields except "id" and "name", or -F -f "a.*" for all fields with the prefix "a." (see the sketch after this list)
  • Supports some common plots (see usage)
  • Seamless support for data with a separator-declaration meta line (e.g., sep=,) as used by MS Excel
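
A minimal sketch of the field-selection syntax above, using the cut subcommand on a hypothetical data.csv:

    # all fields except "id" and "name"
    csvtk cut -f "-id,-name" data.csv

    # fuzzy fields: all fields with the prefix "a."
    csvtk cut -F -f "a.*" data.csv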

Subcommands

53 subcommands in total.

Information

  • headers: prints headers
  • dim: dimensions of CSV file
  • nrow: print number of records
  • ncol: print number of columns
  • summary: summary statistics of selected numeric or text fields (optionally grouped by group fields)
  • watch: online monitoring and histogram of selected field
  • corr: calculate Pearson correlation between numeric columns

Format conversion

  • pretty: converts CSV to a readable aligned table
  • csv2tab: converts CSV to tabular format
  • tab2csv: converts tabular format to CSV
  • space2tab: converts space delimited format to TSV
  • csv2md: converts CSV to markdown format
  • csv2rst: converts CSV to reStructuredText format
  • csv2json: converts CSV to JSON format
  • csv2xlsx: converts CSV/TSV files to XLSX file
  • xlsx2csv: converts XLSX to CSV format

Set operations

  • head: prints first N records
  • concat: concatenates CSV/TSV files by rows
  • sample: sampling by proportion
  • cut: select and arrange fields
  • grep: greps data by selected fields with patterns/regular expressions
  • uniq: unique data without sorting
  • freq: frequencies of selected fields
  • inter: intersection of multiple files
  • filter: filters rows by values of selected fields with arithmetic expression
  • filter2: filters rows by awk-like arithmetic/string expressions
  • join: join files by selected fields (inner, left and outer join)
  • split: splits CSV/TSV into multiple files according to column values
  • splitxlsx: splits XLSX sheet into multiple sheets according to column values
  • comb: compute combinations of items at every row

Edit

  • fix: fix CSV/TSV with different numbers of columns in rows
  • fix-quotes: fix malformed CSV/TSV caused by double-quotes
  • del-quotes: remove extra double-quotes added by fix-quotes
  • add-header: add column names
  • del-header: delete column names
  • rename: renames column names with new names
  • rename2: renames column names by regular expression
  • replace: replaces data of selected fields by regular expression
  • round: round float to n decimal places
  • mutate: creates new columns from selected fields by regular expression
  • mutate2: creates a new column from selected fields by awk-like arithmetic/string expressions
  • fmtdate: format date of selected fields

Transform

  • transpose: transposes CSV data
  • sep: separate column into multiple columns
  • gather: gather columns into key-value pairs, like tidyr::gather/pivot_longer
  • spread: spread a key-value pair across multiple columns, like tidyr::spread/pivot_wider
  • unfold: unfold multiple values in cells of a field
  • fold: fold multiple values of a field into cells of groups

Ordering

  • sort: sorts by selected fields

Plotting

  • plot: see the plotting examples below (plot hist, plot box, plot line)

Misc

  • cat: stream file and report progress
  • version: print version information and check for updates
  • genautocomplete: generate shell autocompletion scripts (bash|zsh|fish|powershell)

Installation

Download Page

csvtk is implemented in the Go programming language; executable binary files for most popular operating systems are freely available on the release page.

Method 1: Download binaries (latest stable/dev version)

Just download the compressed executable file for your operating system and decompress it with the tar -zxvf *.tar.gz command or other tools. Then:

  1. For Linux-like systems

    1. If you have root privilege, simply copy it to /usr/local/bin:

       sudo cp csvtk /usr/local/bin/
      
    2. Or copy to anywhere in the environment variable PATH:

       mkdir -p $HOME/bin/; cp csvtk $HOME/bin/
      
  2. For Windows, just copy csvtk.exe to C:\WINDOWS\system32.

Method 2: Install via conda (latest stable version)

conda install -c bioconda csvtk

Method 3: Install via homebrew

brew install csvtk

Method 4: For Go developer (latest stable/dev version)

go get -u github.com/shenwei356/csvtk/csvtk

Method 5: For ArchLinux AUR users (may not be the latest)

yaourt -S csvtk

Command-line completion

Bash:

# generate completion shell
csvtk genautocomplete --shell bash

# configure it if you haven't already.
# install bash-completion if the "complete" command is not found.
echo "for bcfile in ~/.bash_completion.d/* ; do source \$bcfile; done" >> ~/.bash_completion
echo "source ~/.bash_completion" >> ~/.bashrc

Zsh:

# generate completion shell
csvtk genautocomplete --shell zsh --file ~/.zfunc/_csvtk

# configure it if you haven't already
echo 'fpath=( ~/.zfunc "${fpath[@]}" )' >> ~/.zshrc
echo "autoload -U compinit; compinit" >> ~/.zshrc

fish:

csvtk genautocomplete --shell fish --file ~/.config/fish/completions/csvtk.fish

Compared to csvkit

csvkit. Note: this table has not been updated for many years.

Features                csvtk   csvkit   Note
Read gzip               Yes     Yes      read gzip files
Fields ranges           Yes     Yes      e.g. -f 1-4,6
Unselect fields         Yes     --       e.g. -1 for excluding the first column
Fuzzy fields            Yes     --       e.g. ab* for columns with the name prefix "ab"
Reorder fields          Yes     Yes      -f 1,2 is different from -f 2,1
Rename columns          Yes     --       rename with new name(s) or from existing names
Sort by multiple keys   Yes     Yes      bash-sort-like operations
Sort by number          Yes     --       e.g. -k 1:n
Multiple sort           Yes     --       e.g. -k 2:r -k 1:nr
Pretty output           Yes     Yes      convert CSV to a readable aligned table
Unique data             Yes     --       unique data of selected fields
Frequency               Yes     --       frequencies of selected fields
Sampling                Yes     --       sampling by proportion
Mutate fields           Yes     --       create new columns from selected fields
Replace                 Yes     --       replace data of selected fields

Similar tools:

  • csvkit - A suite of utilities for converting to and working with CSV, the king of tabular file formats. http://csvkit.rtfd.org/
  • xsv - A fast CSV toolkit written in Rust.
  • miller - Miller is like sed, awk, cut, join, and sort for name-indexed data such as CSV and tabular JSON http://johnkerl.org/miller
  • tsv-utils - Command line utilities for tab-separated value files written in the D programming language.

Examples

More examples and tutorial.

Attention

  1. By default, csvtk assumes input files have a header row; if not, switch on the flag -H. (A combined sketch of these global flags follows this list.)

  2. By default, csvtk handles CSV files; use the flag -t for tab-delimited files.

  3. Column names should be unique.

  4. By default, lines starting with # will be ignored; if the header row starts with #, please assign another rare symbol to the flag -C, e.g. $.

  5. Do not mix field (column) numbers and names when specifying the columns to operate on.

  6. The CSV parser requires all lines to have the same number of fields/columns. Even lines containing only spaces will cause an error. Use -I/--ignore-illegal-row to skip these lines if necessary. You can also use "csvtk fix" to fix files with different numbers of columns in rows.

  7. If double-quotes exist in fields not enclosed with double-quotes, e.g.,

     x,a "b" c,1
    

    It would report error:

     bare `"` in non-quoted-field.
    

    Please switch on the flag -l or use csvtk fix-quotes to fix it.

  8. If some fields have only a double-quote either at the beginning or at the end, e.g.,

     x,d "e","a" b c,1
    

    It would report error:

     extraneous or missing " in quoted-field
    

    Please use csvtk fix-quotes to fix it, and use csvtk del-quotes to reset to the original format as needed.
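
A minimal combined sketch of the global flags above; the file names are hypothetical:

     # no header row (-H), tab-delimited input (-t)
     csvtk dim -H -t data.tsv

     # header row starts with '#': assign '$' as the comment character via -C
     csvtk cut -f 1,2 -C '$' data.csv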

Examples

  1. Pretty result

     $ csvtk pretty names.csv
     id   first_name   last_name   username
     --   ----------   ---------   --------
     11   Rob          Pike        rob
     2    Ken          Thompson    ken
     4    Robert       Griesemer   gri
     1    Robert       Thompson    abc
     NA   Robert       Abel        123
    
     $ csvtk pretty names.csv -S 3line
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
      id   first_name   last_name   username
     ----------------------------------------
      11   Rob          Pike        rob
      2    Ken          Thompson    ken
      4    Robert       Griesemer   gri
      1    Robert       Thompson    abc
      NA   Robert       Abel        123
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
     $ csvtk pretty names.csv -S bold -w 5 -m 1-
     ┏━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
     ┃  id   ┃ first_name ┃ last_name ┃ username ┃
     ┣━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━┫
     ┃  11   ┃    Rob     ┃   Pike    ┃   rob    ┃
     ┣━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━┫
     ┃   2   ┃    Ken     ┃ Thompson  ┃   ken    ┃
     ┣━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━┫
     ┃   4   ┃   Robert   ┃ Griesemer ┃   gri    ┃
     ┣━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━┫
     ┃   1   ┃   Robert   ┃ Thompson  ┃   abc    ┃
     ┣━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━┫
     ┃  NA   ┃   Robert   ┃   Abel    ┃   123    ┃
     ┗━━━━━━━┻━━━━━━━━━━━━┻━━━━━━━━━━━┻━━━━━━━━━━┛
    
  2. Summary of selected numeric fields, supporting "group-by"

     $ cat testdata/digitals2.csv \
         | csvtk summary -i -f f4:sum,f5:sum -g f1,f2 \
         | csvtk pretty
     f1    f2     f4:sum   f5:sum
     bar   xyz    7.00     106.00
     bar   xyz2   4.00     4.00
     foo   bar    6.00     3.00
     foo   bar2   4.50     5.00
    
  3. Select fields/columns (cut)

    • By index: csvtk cut -f 1,2
    • By names: csvtk cut -f first_name,username
    • Unselect: csvtk cut -f -1,-2 or csvtk cut -f -first_name
    • Fuzzy fields: csvtk cut -F -f "*_name,username"
    • Field ranges: csvtk cut -f 2-4 for columns 2,3,4 or csvtk cut -f -3--1 for discarding columns 1,2,3
    • All fields: csvtk cut -f 1- or csvtk cut -F -f "*"
  4. Search by selected fields (grep) (matched parts will be highlighted as red)

    • By exactly matching: csvtk grep -f first_name -p Robert -p Rob
    • By regular expression: csvtk grep -f first_name -r -p Rob
    • By pattern list: csvtk grep -f first_name -P name_list.txt
    • Remove rows containing missing data (NA): csvtk grep -F -f "*" -r -p "^$" -v
  5. Rename column names (rename and rename2)

    • Setting new names: csvtk rename -f A,B -n a,b or csvtk rename -f 1-3 -n a,b,c
    • Replacing with original names by regular expression: csvtk rename2 -f 1- -p "(.*)" -r 'prefix_$1' for adding a prefix to all column names.
  6. Edit data with regular expression (replace)

    • Remove Chinese characters: csvtk replace -F -f "*_name" -p "\p{Han}+" -r ""
  7. Create new column from selected fields by regular expression (mutate)

    • By default, copy a column: csvtk mutate -f id
    • Extract prefix of data as group name (get "A" from "A.1" as group name): csvtk mutate -f sample -n group -p "^(.+?)\." --after sample
  8. Sort by multiple keys (sort)

    • By single column : csvtk sort -k 1 or csvtk sort -k last_name
    • By multiple columns: csvtk sort -k 1,2 or csvtk sort -k 1 -k 2 or csvtk sort -k last_name,age
    • Sort by number: csvtk sort -k 1:n or csvtk sort -k 1:nr for reverse number
    • Complex sort: csvtk sort -k region -k age:n -k id:nr
    • In natural order: csvtk sort -k chr:N
  9. Join multiple files by keys (join)

    • All files have same key column: csvtk join -f id file1.csv file2.csv
    • Files have different key columns: csvtk join -f "username;username;name" names.csv phone.csv adress.csv -k
  10. Filter by numbers (filter)

    • Single field: csvtk filter -f "id>0"
    • Multiple fields: csvtk filter -f "1-3>0"
    • Using --any to print record if any of the field satisfy the condition: csvtk filter -f "1-3>0" --any
    • fuzzy fields: csvtk filter -F -f "A*!=0"
  11. Filter rows by awk-like arithmetic/string expressions (filter2)

    • Using field index: csvtk filter2 -f '$3>0'
    • Using column names: csvtk filter2 -f '$id > 0'
    • Both arithmetic and string expressions: csvtk filter2 -f '$id > 3 || $username=="ken"'
    • More complicated: csvtk filter2 -H -t -f '$1 > 2 && $2 % 2 == 0'
  12. Plotting

    • plot histogram with data of the second column:

        csvtk -t plot hist testdata/grouped_data.tsv.gz -f 2 | display
      

      histogram.png

    • plot boxplot with data of the "GC Content" (third) column, group information is the "Group" column.

        csvtk -t plot box testdata/grouped_data.tsv.gz -g "Group" \
            -f "GC Content" --width 3 | display
      

      boxplot.png

    • plot horiz boxplot with data of the "Length" (second) column, group information is the "Group" column.

       csvtk -t plot box testdata/grouped_data.tsv.gz -g "Group" -f "Length"  \
           --height 3 --width 5 --horiz --title "Horiz box plot" | display
      

    boxplot2.png

    • plot line plot with X-Y data

        csvtk -t plot line testdata/xy.tsv -x X -y Y -g Group | display
      

      lineplot.png

    • plot scatter plot with X-Y data

        csvtk -t plot line testdata/xy.tsv -x X -y Y -g Group --scatter | display
      

      scatter.png

Acknowledgements

We are grateful to Zhiluo Deng and Li Peng for suggesting features and reporting bugs.

Thanks to Albert Vilella for feature suggestions, which make csvtk feature-rich.

Contact

Create an issue to report bugs, propose new functions or ask for help.

Or leave a comment.

License

MIT License

Starchart

Stargazers over time

csvtk's People

Contributors

botond-sipos, bsipos, cbf27d85, cbrueffer, ctb, dependabot[bot], dvrkps, jcftang, moorereason, peteryates, shenwei356, thezetner, thunfischbrot, y9c


csvtk's Issues

Always 'bad file descriptor ' on OSX

I installed csvtk_darwin_amd64.tar.gz on my OSX 10.11.6 and get the error below:

csvtk stat query_result.csv
[ERRO] read query_result.csv: bad file descriptor

Is there anything wrong?

Problem running dynamically linked binary

Hi,
Thanks for creating this very useful library. I have one observation which might help other users.
The binaries that you helpfully provide for users should be statically linked instead of dynamically linked, to avoid problems with library paths.

I downloaded the Linux 64 bit binary but could not run it because of a library path mismatch for my distro (NixOS). When I created a statically linked binary it ran successfully.

Regards,
Chris

sort by custom order

Sorting according to the value of some field in a custom-defined order.

For example, a custom order file:

$ cat order.txt
C
A
T

Data:

A,hello
C,wei
B,dear
T,shen

Wanted output:

$ csvtk sort -H -k 1:c -L 1:order.txt
C,wei
A,hello
T,shen
B,dear

csvtk cut Error: unknown shorthand flag: 'i' in -i

Prerequisites

  • make sure you are using the latest version with csvtk version
  • read the usage

Describe your issue

  • describe the problem
  • provide a reproducible example

Test code

cat seqIDvsChrName.txt | csvtk cut -f -F "chr*" -i 

Error: unknown shorthand flag: 'i' in -i
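
A hedged sketch of the intended fuzzy-field selection, based on the cut usage shown elsewhere in this README (the -i flag that triggered the error is dropped, and -F is given as its own flag rather than as the value of -f):

    cat seqIDvsChrName.txt | csvtk cut -F -f "chr*"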

Feature Request: Include handling of space character

Hi,

You explicitly noted that the space character will lead to errors. Do you plan to remove that restriction at some future update? I would love to use this tool, but pretty much all of my datasets contain spaces.

Best,
Carsten

csvtk filter2 problem when header column names contain "_"

Prerequisites

  • make sure you are using the latest version with csvtk version

Version: 0.13.0

Describe your issue

  • describe the problem
    if the header names contain "_", like “ExAC_ALL”,

cat test.hg19_multianno.csv |csvtk filter2 -f ' $6 < 0.05 && $7<0.05' works

cat test.hg19_multianno.csv |csvtk filter2 -f ' $1000g2015aug_all < 0.05 && $1000g2015aug_eas<0.05'

doesn't work

  • provide a reproducible example

cat test.hg19_multianno.csv

Chr,Start,End,Ref,Alt,1000g2015aug_all,1000g2015aug_eas,ExAC_ALL,ExAC_AFR,ExAC_AMR,ExAC_EAS,ExAC_FIN,ExAC_NFE,ExAC_OTH,ExAC_SAS
1,907540,907540,T,C,0.00499201,0.0238,0.0020,0,8.672e-05,0.0252,0,1.533e-05,0.0011,0.0012
1,1153073,1153073,G,A,0.000399361,0.002,0.0002,0,8.66e-05,0.0021,0,0,0,6.061e-05
1,1273569,1273569,G,C,0.000399361,0.002,0.0002,0,0,0.0025,0,0,0,0
1,1372668,1372668,C,T,0.00698882,0.0347,0.0033,0,0,0.0441,0,0,0.0014,0.0003
1,1455586,1455586,C,T,0.00119808,0.006,0.0009,0,0,0.0087,0,0.0004,0.0013,0.0002
1,1458237,1458237,G,A,0.00758786,0.0377,0.0033,0,0,0.0430,0,0.0002,0.0039,0.0004
1,1572708,1572708,C,A,0.00319489,0.0089,,,,,,,,
1,2440518,2440518,C,T,0.00279553,0.0099,0.0039,0.0005,0.0260,0.0192,0,0.0005,0.0023,0.0008
1,6257863,6257863,G,A,0.00259585,0.0119,0.0013,0.0001,0.0001,0.0130,0.0030,0.0001,0.0037,0.0004
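
A hedged guess at a workaround: the failing names start with digits (1000g2015aug_all), which the expression parser may read as field indexes rather than column names. Renaming those columns first, using rename and filter2 as shown elsewhere in this README, might avoid it:

    cat test.hg19_multianno.csv \
        | csvtk rename -f 1000g2015aug_all,1000g2015aug_eas -n g2015aug_all,g2015aug_eas \
        | csvtk filter2 -f '$g2015aug_all < 0.05 && $g2015aug_eas < 0.05'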

Homebrew support

Could you submit this CLI tool to Homebrew so it can be conveniently installed and upgraded on macOS with brew?

[ERRO] parse error on line 2, column 104: extraneous or missing " in quoted-field

Prerequisites

  • make sure you are using the latest version with csvtk version
  • read the usage

Describe your issue

I am generating a .tsv file which contains single-quotes and double-quotes in the text, and when I pass it through csvtk pretty, I get this error:

[ERRO] parse error on line 2, column 105: extraneous or missing " in quoted-field

I tried getting rid of it by escaping the double-quotes as shown below, but the problem persists:

      my $this_record = "$this_created_at\t$this_created_time\t$this_user\t$this_message\t$this_comments";
      $this_record        =~ s/\"/\\\"/g;

Any ideas?
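
A possible fix, per the Attention section of this README: switch on the global lazy-quotes flag -l, or (in later versions) repair the file with fix-quotes; a sketch with a hypothetical data.tsv:

    csvtk pretty -t -l data.tsv
    # or
    csvtk fix-quotes -t data.tsv | csvtk pretty -t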

example split query

Hi, I have an example operation that I think I will be able to do with csvtk but I am unsure how.
From a table that looks like this:

col1,col2,col3
100,2,ABC
101,2,DEF

I would like to have an operation on col2, so that for each entry, two new entries are created where the value of col2 is divided by 2, and the value in col1 is suffixed with 'A' and 'B', e.g.:

col1,col2,col3
100A,1,ABC
100B,1,ABC
101A,1,DEF
101B,1,DEF

Any ideas how to construct that operation with csvtk?

Some performance improvements

  • csvtk inter: decrease the RAM occupation of keysMaps by adding an option to delete keys that are not shared by all files handled so far, right after handling each file.

    • Or use an inverse bloom filter, which may be better.
  • csvtk uniq: use bloomfilter (BoomFilters) instead of map

      for every record:
          if record not in bloomfilter:
              output record
              add record to bloomfilter
    
  • csvtk grep: decrease RAM occupation when finding unique elements of a relatively small dataset against a very big (millions of records) dataset, i.e. A - B (csvtk grep -v -P very-big small). We may add a new command subtract

    • Workaround: traverse the very-big file, mark elements that exist in small, and exclude them from small:

        csvtk grep -P small very-big > tmp
        csvtk grep -v -P tmp small > result

    • building a bloom filter from very-big and checking every element of small, outputting only those not in the bloom filter

use mutate2 for concat with separator behaviour

Hi, how can I use mutate2 to create a new column as the concatenation of two columns with a separator?

E.g. something like:

echo "A,1" | csvtk -H mutate -e '$1,"/",$2' -n new 

To produce:

A,1,A/1

Any ideas?
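
A hedged sketch of one possible answer, assuming mutate2 accepts string concatenation with + in its expressions and that -s (seen in another issue below) keeps digit-only fields as strings:

    echo "A,1" | csvtk mutate2 -H -s -n new -e '$1 + "/" + $2'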

csvtk xlsx2csv file -a truncates worksheet name if too long

Prerequisites

  • [x] make sure you are using the latest version with csvtk version
  • [x] read the usage

Describe your issue

I have an .xlsx file with long worksheet names that get truncated when applying -a, as shown below:

csvtk xlsx2csv file.xlsx -a

The file.xlsx contains a worksheet like the one on the left, but csvtk returns a truncated name with the last character missing:

Import_XTP101_20181018102046_551 > Import_XTP101_20181018102046_55

  • [x ] describe the problem
  • provide a reproducible example

request for more options of keep in "csvtk join"

csvtk join:

keep rows according to specified files (one or more based on row union)

For instance:

-k 1,2 # to keep all rows of the union of first and second file
-k 1 # to keep all rows of first file
-k 2-7 # to keep all rows of the union of file 2 to 7 
-k 1- # to keep all rows of the union of all files

csvtk on Arch Linux

Hi! Thanks for the useful tool, started using it today.

Just wanted to let you know I am maintaining a package for this utility for anyone on Arch Linux here: https://aur.archlinux.org/packages/csvtk/

Anyone using an Arch OS can install with their package manager now.

Feel free to close, just wanted to inform you!

Support for separator declaration (used by Excel)

In order to get Excel to always accept CSV files, some of our legacy lab equipment adds a line such as

sep=,

as the first line. Trying to manipulate them, e.g. with csvtk transpose, results in

[ERRO] read .\Testing.csv: The handle is invalid.

Would you consider catching this line, ignoring it and adding it to the resulting file if present? Could very well be a command-line option as well.

TODO

  • csvtk join: keep source records with the same ID. v0.4.5 and later versions support it.
  • new command header
  • new command head
  • new command sample
  • new command csvtk filter2: support comparison between fields (like awk '$1 < $4') using govaluate.
  • csvtk grep: keep output in order of input
  • add alias to stat and stat2
  • fix english expression: number fields -> digital
  • add functional tests. too many cases to cover 😭

bug of "csvtk cut"

$ cat t 
a,b,c
1,2,3

$ cat t | csvtk cut -f -a
[ERRO] column "-a" not existed in file: -

$ cat t | csvtk cut -f -1
b,c
2,3

submodule doc/site does not exist

Tried to clone the repo, got a failure:

$ git clone --recursive https://github.com/shenwei356/csvtk.git
...
fatal: No url found for submodule path 'doc/site' in .gitmodules

In the GitHub UI it shows up as a non-clickable link. Commands like:

$ go get -u github.com/shenwei356/csvtk/csvtk

fail as well, showing the same issue.
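
A possible workaround (a sketch, not a fix for the missing .gitmodules entry): clone without --recursive, since doc/site only holds the documentation site:

    git clone https://github.com/shenwei356/csvtk.git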

mutate2 converts datetime expression to float

Prerequisites

  • make sure you are using the latest version with csvtk version
  • read the usage

Describe your issue

  • describe the problem
  • provide a reproducible example

Datetime expressions with single quotes (e.g. '2018-10-31 01:23:45') are converted to float when using csvtk mutate2 to append a new column to the original file.
Is this expected behavior or not?

$ csvtk version
csvtk v0.15.0

Checking new version...
You are using the latest version of csvtk

$ cat source.csv
aaa,bbb
1,x
2,y
3,z

$ cat source.csv | csvtk mutate2 -n ts -e "'2018-10-31 01:23:45'"
aaa,bbb,ts
1,x,1.540916625e+09
2,y,1.540916625e+09
3,z,1.540916625e+09

$ cat source.csv | csvtk mutate2 -s -n ts -e "'2018-10-31 01:23:45'"
aaa,bbb,ts
1,x,1.540916625e+09
2,y,1.540916625e+09
3,z,1.540916625e+09

problem with running csvtk on SGI UV server

I tested csvtk on an SGI UV server with 1024 cores, and got the following error:

fatal error: procresize: invalid arg

runtime stack:
runtime.throw(0xa2f880, 0x17)
    /usr/local/app/go/src/runtime/panic.go:530 +0x90
runtime.procresize(0x400, 0x0)
    /usr/local/app/go/src/runtime/proc.go:3158 +0xaee
runtime.schedinit()
    /usr/local/app/go/src/runtime/proc.go:452 +0x13d
runtime.rt0_go(0x7fffffffde28, 0x2, 0x7fffffffde28, 0x0, 0x0, 0x2, 0x7fffffffe17f, 0x7fffffffe185, 0x0, 0x7fffffffe18a, ...)
    /usr/local/app/go/src/runtime/asm_amd64.s:138 +0x132

It seems to be due to the default setting of the max number of processors; see golang/go#13908

mutate2 concatenate strings without string evaluation

I have an example .csv with the values:

patient_id,frequency
102900008,2
102100016,2

And I want to create two more columns like so:

patient_id,frequency,extraction_from_sample_name,extraction_to_sample_name
102900008,2,102900008pool,102900008xt
102100016,2,102100016pool,102100016xt

When trying to append "pool" to the patient_id, it evaluates it as a number and changes it to scientific notation, see below. How can I avoid this and simply get the string as text?

cat example.csv | csvtk mutate2 -n extraction_from_sample_name -e ' $patient_id + "pool"'
patient_id,frequency,extraction_from_sample_name
102900008,2,1.02900008e+08pool
102100016,2,1.02100016e+08pool
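
A hedged suggestion, assuming the -s flag shown in the datetime issue above keeps digit-only fields as strings and therefore prevents the float conversion before concatenation:

    cat example.csv | csvtk mutate2 -s -n extraction_from_sample_name -e '$patient_id + "pool"'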

Shall we support data with duplicated column names?

If it's supported, involved commands include:

  • The usage may be confusing: should the command operate on just one of the duplicated columns, or on all of them?
    • cut
    • filter
    • grep
    • inter
    • mutate
    • stat2
    • uniq
  • Not affected, operating on all the columns with the same names:
    • rename
    • rename2
    • replace

Well, I think the answer is no.

Solutions for handling data with duplicated colnames:

  • Getting the column indexes by csvtk headers and using them instead of colnames.
  • Renaming duplicated colnames by csvtk rename to make them unique.
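
A small sketch of the two workarounds above, with hypothetical file and column names:

    # get column indexes with csvtk headers, then select by index instead of by name
    csvtk headers data.csv
    csvtk cut -f 3,5 data.csv

    # or rename one of the duplicates (by index) to make the names unique
    csvtk rename -f 5 -n Length_2 data.csv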

csvtk cut remove multiple occurrences

Prerequisites

  • make sure you are using the latest version with csvtk version
  • read the usage

Describe your issue

Hi, first thanks for this awesome tool! Works quite well with my data but I couldn't find a way to do the following with featureCounts output:

I have multiple files with this header (always the same header!):

Geneid	Chr	Start	End	Strand	Length	gene_name	blablabla.bam

If I join these together by Geneid, the table looks nice but "duplicated columns" exist (Chr, Start, End, Strand, Length, gene_name).

I'd like to remove all columns matching Chr,Start,End,Strand,Length (I assume with - as a prefix) but this only removes the first occurrence of such a column. Can one specify to remove all occurrences of such a pattern, e.g. with an extra option -g?

Thanks, feel free to ask more questions if I'm being unclear!

wrong error message on grep -f option

From lines 76-78:

if len(patterns) == 0 && patternFile == "" {
    checkError(fmt.Errorf("one of flags -p (--pattern) or -f (--pattern-file) should be given"))
}

As I understand it, here -f should be the "--fields string" flag, and it is -P that gives the "--pattern-file".

easiest way to add percentage column with stats/mutate2

Prerequisites

  • make sure you are using the latest version with csvtk version
  • read the usage

Describe your issue

I couldn't find an easy way to do this in the docs. I presume it would involve stats/mutate2, but I couldn't find a recipe. What I want is the following: I would like to add a percentage column next to column 3, by grouping over columns 1,2:

CGT     0       5094192
CGT     1       1031270
CGT     2       141416 
CGT     3       25743  
CGG     4       16268  
CGG     5       6180   
CGG     6       2855   
CGG     7       1231   
CGG     8       477    
CGG     9       157    
CGG     10      77     
CGG     11      16     
CGG     12      3      
CGG     13      2      
CGG     0       5123225
CGG     1       911882 
CGG     2       187348 
CGG     3       47305  
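
A hedged sketch of one possible recipe, interpreting the request as: each row's count (column 3) as a percentage of the total within its column-1 group. It uses add-header, summary, join and mutate2 as listed above; the column names and the file name data.tsv are hypothetical, and exact flag spellings may differ by version:

    # name the columns first, then compute per-group sums of column 3
    csvtk add-header -t -n context,class,n data.tsv > data.h.tsv
    csvtk summary -t -i -g context -f n:sum data.h.tsv > sums.tsv

    # join the sums back and append the percentage (fields referenced by index)
    csvtk join -t -f context data.h.tsv sums.tsv \
        | csvtk mutate2 -t -n percent -e '$3 / $4 * 100'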

Enhancement: to_json command?

Would be useful!

% cat data.csv

ID,room,name
3,G13,Simon
5,103,Anna

% csvtk to_json data.csv

{
  ID: 3,
  room: G13,
  name: Simon
},
{
  ID: 5,
  room: 103,
  name: Anna
}
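
For reference, the subcommand list above now includes csv2json, which covers this request; a minimal sketch:

    csvtk csv2json data.csv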

How can two or more files be merged/joined with all data from both files kept?

Prerequisites

  • make sure you are using the latest version with csvtk version
  • read the usage

Describe your issue

  • describe the problem
  • provide a reproducible example

I would like to merge two or more files with all data from both files kept. However, it seems that csvtk join only allows unmatched data from the first file to be kept, via the -k parameter. More specifically, I would like to merge the following example data:
username,phone
gri,11111
rob,12345
ken,22222
shenwei,999999

name,region
ken,nowhere
gri,somewhere
shenwei,another
Thompson,there

and get the following result:
username,phone,region
gri,11111,somewhere
rob,12345,NA
ken,22222,nowhere
shenwei,999999,another
Thompson,NA,there

How can I do it with csvtk? Thank you!

Xuhang
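
A hedged note: the join description above mentions inner, left and outer joins, so in recent versions something like the following may work; the flag names --outer-join and --na, as well as the file names, are assumptions to be checked against csvtk join --help:

    csvtk join -f 'username;name' --outer-join --na NA phone.csv region.csv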

add new command to list items

Example data

dept   name    item
A      Tom     I_A
A      Tom     I_B
A      Jerry   I_A
B      Wei     I_B

What I want

$ csvtk xxx -t -f dept -v name
dept   name
A      Tom;Jerry
B      Wei

$ csvtk xxx -t -f item -v name
item   name
I_A    Tom;Jerry
I_B    Tome;Wei
$ csvtk xxx -t -f dept,name -v item
dept    name    item
A       Tom     I_A;I_B
A       Jerry   I_A
B       Wei     I_B
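
For reference, this looks like what the later fold subcommand (listed above) does; a hedged sketch with flag names assumed from its current usage:

    csvtk fold -t -f dept -v name -s ';' data.tsv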

use mutate2 to add new column(s) based on bash variables

Prerequisites

  • make sure you are using the latest version with csvtk version
  • read the usage

Describe your issue

I am trying to use mutate2 to add a column or series of columns but based on values from my bash command-line. E.g. this does not work as it does not pick up the value of the bash variable $val:

val=RUN515 && cat testdata/region.csv | csvtk mutate2 -n run_id -e '1 > 0 ? "${val}" : "${val}"'

The context in which I am doing this is that I have a bunch of worksheets on an xlsx, and I want to aggregate them together, but add some extra columns for each of them (runid) that I can pass as bash variables like in the example below ($runid is the name of each worksheet):

csvtk xlsx2csv -n RUNLIST $meta | grep completed | csvtk cut -T -f 1,7 | xargs -n2 echo | while read runid runcomplete; do csvtk xlsx2csv -n $runid $meta | head -n19 | csvtk transpose -; done

My thinking was that I could pass the bash variable $runid to the csvtk mutate2 command somehow, e.g.:

csvtk xlsx2csv -n RUNLIST $meta | grep completed | csvtk cut -T -f 1,7 | xargs -n2 echo | while read runid runcomplete; do csvtk xlsx2csv -n $runid $meta | head -n19 | csvtk transpose - | csvtk mutate2 -n run_id -e ' 1 > 0 ? $runid : $runid'; done

Any ideas?
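
One likely culprit is shell quoting rather than csvtk itself: inside single quotes bash never expands ${val}, so mutate2 receives the literal text. Wrapping the expression in double quotes, with single quotes inside for the string literal, lets bash substitute the value first; a sketch:

    val=RUN515 && cat testdata/region.csv | csvtk mutate2 -n run_id -e "'${val}'"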

ignore empty rows flag

Hi,

I have a use case where I am using csvtk on a table from an .xlsx file where there could be rows at the bottom of the worksheet that are completely empty, e.g. where all the fields have empty cells. This is common if people manually edit the worksheet and just leave the empty rows in there rather than explicitly deleting them.

Could there be a flag in csvtk so that the empty rows are ignored in the csvtk operation?

There may be a workaround using csvtk filter or some other operation, but I couldn't find one.

E.g. in this example, all the values for the last two rows are empty

Col1,Col2,Col3
A,1,2
B,2,3
C,3,4
,,
,,

Currently when applying any operation, e.g. csvtk freq, it takes into account the empty rows:

csvtk freq -f 1 example_empty_rows.csv 
Col1,frequency
A,1
B,1
C,1
,2

It would be great to have an --ignore-empty e.g. -e flag that would filter out the rows where all fields are empty, then return the equivalent of the operation, e.g.:

csvtk freq --ignore-empty -f 1 example_empty_rows.csv 
Col1,frequency
A,1
B,1
C,1
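
Until such a flag exists, a possible workaround based on the grep example in this README (remove rows containing missing data) is to drop the empty rows before the operation; note that this also drops rows where any single field is empty:

    csvtk grep -F -f "*" -r -p "^$" -v example_empty_rows.csv | csvtk freq -f 1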

-H always blocked/masked automatically.

Such as:

gyangdeiMac:untitled folder gyang$ csvtk csv2md  -H -t chr2L_RNA.matrix | head -10 
[WARN] colnames detected, flag -H (--no-header-row) ignored
yw_wt|1    |chr2L |62 |0  |1  |0  |0  |0  |1
:----|:----|:-----|:--|:--|:--|:--|:--|:--|:--
yw_wt|chr2L|63    |0  |1
yw_wt|chr2L|64    |0  |1
yw_wt|chr2L|65    |0  |1
yw_wt|chr2L|66    |0  |1
yw_wt|chr2L|67    |0  |1
yw_wt|chr2L|68    |0  |1
yw_wt|chr2L|69    |0  |1
yw_wt|chr2L|70    |0  |1

collate / concatenate / aggregate sets of files with identical fields

Prerequisites

  • make sure you are using the latest version with csvtk version
  • read the usage

Describe your issue

I have a collection of worksheets in an xlsx file; each worksheet should be of the same type. I want to collate them together (or concatenate or aggregate, not sure what the best word is), so that the order in which I concatenate them is preserved, but the duplicated headers disappear. I am currently doing this with datamash rmdup 1 below, but I would rather have a pure csvtk recipe for this:

csvtk xlsx2csv -n LISTOFWORKSHEETS $meta | csvtk cut -f1 | while read ws; do csvtk xlsx2csv -n $ws $meta; done | datamash rmdup 1 -t ',' > final_collated_file.csv

What would be the recipe for this collate operation in csvtk? Thanks in advance.
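
A possible pure-csvtk recipe (a sketch; intermediate file names are hypothetical): export each worksheet to its own file, then concatenate them by rows with the concat subcommand, which should keep a single header for files with identical columns:

    # export each worksheet to its own CSV (del-header drops the sheet-list header row)
    csvtk xlsx2csv -n LISTOFWORKSHEETS $meta | csvtk cut -f 1 | csvtk del-header \
        | while read ws; do csvtk xlsx2csv -n "$ws" $meta > "sheet_${ws}.csv"; done

    # concatenate by rows
    csvtk concat sheet_*.csv > final_collated_file.csv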
