Requires the Rust programming language.
# select option 1, default installation
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Then install the package using the following command:
install.packages("https://github.com/mrvillage/lmutils.r/archive/refs/heads/master.tar.gz", repos=NULL) # use .zip for Windows
# OR
devtools::install_github("mrvillage/lmutils.r")
- RData files CANNOT be read in parallel like the other formats. It is HIGHLY recommended to convert RData files to another format using lmutils::convert_file before processing them. The fastest and smallest format is rkyv.gz.
- RData files are assumed to be compressed without looking for a .gz file extension.
- All files are looked for in the current working directory.
- All files are written to the current working directory.
- All files are assumed to be matrices of floats, unless otherwise specified.
Converts matrix files from one format to another. Supported formats are:
- csv (requires column headers)
- tsv (requires column headers)
- txt (requires column headers)
- json
- cbor
- rkyv
- rdata (NOTE: these files can only be processed sequentially, not in parallel like the rest)
All files can be optionally compressed with gzip. rdata files are assumed to be compressed without looking for a .gz file extension.
lmutils::convert_file(
  c("file1.csv", "file2.RData"),
  c("file1.json", "file2.rkyv.gz")
)
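Since RData files cannot be read in parallel, a common first step is to convert a batch of them to rkyv.gz. The sketch below derives the target names in base R and only calls lmutils::convert_file when the package is installed and the inputs exist. The file names are hypothetical placeholders.

```r
# Hypothetical input names; replace with your own files.
rdata_files <- c("block1.RData", "block2.RData")

# Derive one rkyv.gz target name per input by swapping the extension.
targets <- sub("\\.RData$", ".rkyv.gz", rdata_files, ignore.case = TRUE)

# Only attempt the conversion when lmutils is installed and the inputs exist.
if (requireNamespace("lmutils", quietly = TRUE) && all(file.exists(rdata_files))) {
  lmutils::convert_file(rdata_files, targets)
}
```

The converted files can then be passed to the other functions in place of the original RData files.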
Calculates the R^2 and adjusted R^2 values for blocks and outcomes.
The first argument is a character vector of file names to read the blocks from, a list of matrices to use as the blocks, or a single matrix.
The second argument is a single file name or matrix to use as the outcomes. Each outcome is a column in the matrix.
The function returns a data frame with columns r2, adj_r2, data, outcome, n, m, and predicted.
results <- lmutils::calculate_r2(
  c("block1.csv", "block2.rkyv.gz"),
  "outcomes1.RData"
)
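To make the r2 and adj_r2 columns concrete, here is a base R sketch of the textbook definitions for one block and one outcome. It assumes n is the number of rows and m the number of predictor columns in the block, and that the model includes an intercept; the source does not confirm that lmutils computes the values exactly this way.

```r
set.seed(1)
n <- 50                                   # observations (rows)
m <- 3                                    # predictors (columns in the block)
block <- matrix(rnorm(n * m), nrow = n)
outcome <- as.vector(block %*% c(1, -2, 0.5) + rnorm(n))

fit <- lm(outcome ~ block)                # ordinary least squares with an intercept
predicted <- fitted(fit)

ss_res <- sum((outcome - predicted)^2)    # residual sum of squares
ss_tot <- sum((outcome - mean(outcome))^2) # total sum of squares
r2 <- 1 - ss_res / ss_tot
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - m - 1)
```

Both values agree with what summary(fit) reports, which makes this a convenient cross-check on a small subset of the data.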
Calculates the R^2 and adjusted R^2 values for blocks and outcomes for a range of columns. Each block is one range of columns in the provided matrix.
You will likely never need this function; lmutils::calculate_r2 is much more useful.
The first argument is a file name to read the matrix from or a matrix.
The second argument is a single file name or matrix to use as the outcomes. Each outcome is a column in the matrix.
The third argument is a matrix with two columns, the start and end columns to use (inclusive).
The function returns a data frame with columns r2, adj_r2, and outcome corresponding to each range in order.
results <- lmutils::calculate_r2_ranges(
  "blocks1.csv",
  "outcomes1.RData",
  # one (start, end) pair per row: columns 1-10 and 11-20
  matrix(c(1, 10, 11, 20), ncol=2, byrow=TRUE)
)
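R fills a matrix column by column unless byrow = TRUE is given, so it is worth printing the ranges matrix to confirm that each row holds the intended start and end columns. A small base R check, assuming the intended ranges are columns 1 to 10 and 11 to 20:

```r
# Column-major fill (R's default): rows end up as (1, 11) and (10, 20).
by_col <- matrix(c(1, 10, 11, 20), ncol = 2)

# Row-major fill: rows are (1, 10) and (11, 20), one (start, end) pair per row.
by_row <- matrix(c(1, 10, 11, 20), ncol = 2, byrow = TRUE)

print(by_col)
print(by_row)
```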
Combines matrices into a single matrix. The matrices must have the same number of rows.
The first argument is a character vector of file names to read the matrices from or a list of matrices.
The second argument is a file name to write the combined matrix to.
If the second argument is NULL, the function will return the combined matrix.
lmutils::combine_matrices(
  c("matrix1.csv", "matrix2.rkyv.gz"),
  "combined_matrix.rkyv.gz"
)
Removes rows from a matrix.
The first argument is a string file name or a matrix to remove rows from.
The second argument is a vector of row indices to remove.
The third argument is a string file name to write the new matrix to.
If the third argument is NULL, the function will return the new matrix.
lmutils::remove_rows(
  "matrix1.csv",
  c(1, 2, 3),
  "matrix1_removed_rows.csv"
)
Saves a matrix to a file.
The first argument is a matrix to save.
The second argument is a string file name to save the matrix to.
lmutils::save_matrix(
  matrix(1:9, nrow=3),
  "matrix1.csv"
)
Converts a numeric data frame, numeric matrix, or RData file to a matrix.
The first argument is the data frame, matrix, or file name to convert.
The second argument is a string file name to save the matrix to.
If the second argument is NULL, the function will return the matrix.
lmutils::to_matrix(
  data.frame(a=1:3, b=4:6),
  "matrix1.csv"
)
lmutils::to_matrix(
  matrix(1:9, nrow=3),
  "matrix2.csv"
)
lmutils::to_matrix(
  "matrix1.RData",
  "matrix3.csv"
)
Calculates the cross product of a matrix with itself. Equivalent to t(data) %*% data.
The first argument is a string file name or a matrix to read the matrix from.
The second argument is a string file name to save the matrix to.
If the second argument is NULL, the function will return the cross product matrix.
lmutils::crossprod(
  "matrix1.csv",
  "crossprod_matrix1.csv"
)
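The equivalence with t(data) %*% data can be checked on a small matrix in base R. This sketch uses base::crossprod as the reference, not lmutils itself:

```r
data <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)  # 3 rows, 2 columns

explicit <- t(data) %*% data       # 2 x 2 matrix of column inner products
builtin <- base::crossprod(data)   # same result without forming t(data)

print(explicit)
```

The result is square, with one row and one column per column of the input.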
Recursively converts all files in a directory to matrices.
The first argument is a string directory name to read the files from.
The second argument is a string directory name to write the matrices to.
The third argument is a string file extension to write the matrices as.
If the second argument is NULL, the matrices will be written to the input directory.
lmutils::to_matrix_dir(
  "data",
  "matrices",
  "csv.gz"
)
Standardize a matrix. All NaN values are replaced with the mean of the column and each column is scaled to have a mean of 0 and a standard deviation of 1.
The first argument is a string file name or a matrix to read the matrix from.
The second argument is a string file name to save the matrix to.
If the second argument is NULL, the function will return the standardized matrix.
lmutils::standardize(
  "matrix1.csv",
  "standardized_matrix1.csv"
)
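The two steps described above (mean imputation of NaN values, then scaling) can be sketched in base R. The helper name standardize_sketch is hypothetical, and the sketch assumes the sample standard deviation (n - 1 denominator); the source does not say which denominator lmutils uses.

```r
standardize_sketch <- function(m) {
  for (j in seq_len(ncol(m))) {
    col <- m[, j]
    col[is.nan(col)] <- mean(col, na.rm = TRUE)  # impute NaN with the column mean
    m[, j] <- (col - mean(col)) / sd(col)        # sd() uses the n - 1 denominator
  }
  m
}

x <- matrix(c(1, 2, NaN, 4, 10, 20, 30, 40), nrow = 4)
z <- standardize_sketch(x)
```

An imputed entry sits exactly at the column mean, so it becomes 0 after scaling.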
Computes the p-values of a linear regression between each pair of columns in two matrices.
The first argument is a character vector of file names to read from, a list of matrices to use as the blocks, or a single matrix.
The second argument is a string file name or a matrix to read from.
The function returns a data frame with columns p_value, data, data_column, and outcome, with one row for each pair of data column and outcome column.
results <- lmutils::column_p_values(
  c("block1.csv", "block2.rkyv.gz"),
  "outcomes1.RData"
)
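As a rough picture of what one row of the result represents, the base R sketch below fits a single-predictor regression for every pair of data column and outcome column and keeps the slope's p-value. The intercept term and the use of the slope's t-test are assumptions; the source does not spell out the exact model.

```r
set.seed(42)
block <- matrix(rnorm(60), ncol = 2, dimnames = list(NULL, c("x1", "x2")))
outcomes <- matrix(rnorm(30), ncol = 1, dimnames = list(NULL, "y1"))

# One row per (data column, outcome column) pair.
pairs <- expand.grid(
  data_column = colnames(block),
  outcome = colnames(outcomes),
  stringsAsFactors = FALSE
)

pairs$p_value <- mapply(
  function(dc, oc) {
    fit <- lm(outcomes[, oc] ~ block[, dc])
    summary(fit)$coefficients[2, "Pr(>|t|)"]  # p-value of the slope
  },
  pairs$data_column,
  pairs$outcome,
  USE.NAMES = FALSE
)
```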
Matches rows of a matrix to the values in a vector by a column.
The first argument is a string file name or a matrix to read the matrix from.
The second argument is a numeric vector to match the rows to.
The third argument is the column name to match the rows by.
The fourth argument is a string file name to write the new matrix to, or NULL to return the new matrix.
lmutils::match_rows(
  "matrix1.csv",
  c(1, 2, 3),
  "eid",
  "matrix1_matched_rows.csv"
)
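One plausible reading of row matching, sketched in base R: keep the rows whose key column appears in the vector, in the vector's order. Whether lmutils preserves the vector's order or drops unmatched values is an assumption here, not something the source states.

```r
m <- matrix(c(3, 1, 2, 4, 30, 10, 20, 40), ncol = 2,
            dimnames = list(NULL, c("eid", "value")))
ids <- c(1, 2, 3)

idx <- match(ids, m[, "eid"])                    # row position of each id, NA if absent
matched <- m[idx[!is.na(idx)], , drop = FALSE]   # rows reordered to follow ids
```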
Matches rows of all matrices in a directory to the values in a vector by a column.
The first argument is a string directory name to read the matrices from.
The second argument is a string directory name to write the matched matrices to.
The third argument is a numeric vector to match the rows to.
The fourth argument is the column name to match the rows by.
lmutils::match_rows_dir(
  "matrices",
  "matched_matrices",
  c(1, 2, 3),
  "eid"
)
Compute a new column for a data frame from a regex and an existing column.
The first argument is a data frame to read from.
The second argument is the column name to read from.
The third argument is the regex to match.
The fourth argument is the column name to write to.
lmutils::new_column_from_regex(
  data.frame(a=c("a1", "b2", "c3")),
  "a",
  "([a-z])",
  "b"
)
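The example pattern contains one capture group, which suggests the new column holds the captured text. A base R sketch under that assumption (the source does not confirm which part of the match is written, or how non-matching rows are handled):

```r
df <- data.frame(a = c("a1", "b2", "c3"), stringsAsFactors = FALSE)
pattern <- "([a-z])"

# regexec() returns the full match followed by each capture group.
hits <- regmatches(df$a, regexec(pattern, df$a))

# Take the first capture group, or NA when the pattern does not match.
df$b <- vapply(
  hits,
  function(h) if (length(h) >= 2) h[2] else NA_character_,
  character(1)
)
```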
Converts two character vectors into a named list, where the first vector is the names and the second vector is the values. Only the first occurrence of each name is used, essentially creating a map.
The first argument is a character vector of names.
The second argument is a character vector of values.
lmutils::map_from_pairs(
  c("a", "b", "c"),
  c("1", "2", "3")
)
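The "first occurrence wins" rule can be illustrated in base R. The helper name map_from_pairs_sketch is hypothetical and only mirrors the behavior described above:

```r
map_from_pairs_sketch <- function(names, values) {
  keep <- !duplicated(names)   # TRUE only for the first occurrence of each name
  stats::setNames(as.list(values[keep]), names[keep])
}

# "a" appears twice; only its first value, "1", is kept.
m <- map_from_pairs_sketch(c("a", "b", "a"), c("1", "2", "3"))
```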
Compute a new column for a data frame from a list of values and an existing column, matching by the names of the values.
The first argument is a data frame to read from.
The second argument is the column name to read from.
The third argument is a named list of values.
The fourth argument is the column name to write to.
lmutils::new_column_from_map(
  data.frame(a=c("a", "b", "c")),
  "a",
  lmutils::map_from_pairs(
    c("a", "b", "c"),
    c("1", "2", "3")
  ),
  "b"
)
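Conceptually this is a lookup of each value in the existing column against the names of the list. A base R sketch, assuming that keys missing from the map produce NA (the source does not say how missing keys are handled):

```r
df <- data.frame(a = c("a", "b", "c"), stringsAsFactors = FALSE)
map <- list(a = "1", b = "2", c = "3")

df$b <- vapply(
  df$a,
  function(key) {
    value <- map[[key]]                        # NULL when the key is not in the map
    if (is.null(value)) NA_character_ else value
  },
  character(1),
  USE.NAMES = FALSE
)
```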
Compute a new column for a data frame from two character vectors of names and values, matching by the names.
The first argument is a data frame to read from.
The second argument is the column name to read from.
The third argument is a character vector of names.
The fourth argument is a character vector of values.
The fifth argument is the column name to write to.
lmutils::new_column_from_map_pairs(
  data.frame(a=c("a", "b", "c")),
  "a",
  c("a", "b", "c"),
  c("1", "2", "3"),
  "b"
)
Sorts a data frame in place, in ascending order, by multiple columns. All columns must be numeric (double or integer), character, or logical vectors.
The first argument is a data frame to sort.
The second argument is a character vector of column names to sort by.
df <- data.frame(a=c(3, 3, 2, 2, 1, 1), b=c("b", "a", "b", "a", "b", "a"))
lmutils::df_sort_asc(
  df,
  c("a", "b")
)
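For comparison, the same ordering in base R, which returns a sorted copy rather than modifying the data frame in place:

```r
df <- data.frame(
  a = c(3, 3, 2, 2, 1, 1),
  b = c("b", "a", "b", "a", "b", "a"),
  stringsAsFactors = FALSE
)

# Sort by a first, then break ties with b; both ascending.
sorted <- df[order(df$a, df$b), ]
rownames(sorted) <- NULL
```

This can serve as a reference when checking the result of the in-place sort on a small data frame.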
Combine a list of double vectors into a single matrix.
The first argument is a list of double vectors.
The second argument is a string file name to write the matrix to.
If the second argument is NULL, the function will return the matrix.
lmutils::combine_vectors(
  list(c(1, 2, 3), c(4, 5, 6)),  # double vectors; 1:3 would be integer
  "combined_matrix.csv"
)
Extend matrices into a single matrix by rows.
The first argument is a character vector of file names to read the matrices from or a list of matrices.
The second argument is a string file name to write the extended matrix to.
If the second argument is NULL, the function will return the extended matrix.
lmutils::extend_matrices(
  c("matrix1.csv", "matrix2.rkyv.gz"),
  "extended_matrix.rkyv.gz"
)
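Extending by rows is the counterpart of combining matrices with equal row counts. In base R terms the two operations look like rbind and cbind; treating lmutils::combine_matrices as a column bind is an inference from its "same number of rows" requirement, not something the source states outright.

```r
m1 <- matrix(1:6, nrow = 2)    # 2 x 3
m2 <- matrix(7:12, nrow = 2)   # 2 x 3

combined <- cbind(m1, m2)      # columns side by side: 2 x 6
extended <- rbind(m1, m2)      # rows stacked: 4 x 3
```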
Deduplicates a matrix by a column. The first row containing each value in that column is kept.
The first argument is a string file name or a matrix to read the matrix from.
The second argument is the column name to deduplicate by.
The third argument is a string file name to write the new matrix to.
If the third argument is NULL, the function will return the new matrix.
lmutils::dedup(
  "matrix1.csv",
  "eid",
  "matrix1_dedup.csv"
)
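The "first occurrence is kept" behavior corresponds to this base R sketch on a small in-memory matrix:

```r
m <- matrix(c(1, 2, 1, 3, 10, 20, 30, 40), ncol = 2,
            dimnames = list(NULL, c("eid", "value")))

# duplicated() flags every repeat after the first occurrence of an eid.
deduped <- m[!duplicated(m[, "eid"]), , drop = FALSE]
```

Here the second row with eid 1 (value 30) is dropped and the first (value 10) is kept.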
lmutils exposes three global config options that can be set using environment variables or the lmutils package functions:
- LMUTILS_LOG / lmutils::set_log_level to set the log level (default: info).
- LMUTILS_NUM_MAIN_THREADS / lmutils::set_num_main_threads to set the number of main threads to use (default: 16). This is the number of primary operations to run in parallel.
- LMUTILS_NUM_WORKER_THREADS / lmutils::set_num_worker_threads to set the number of worker threads to use (default: num_cpus::get() / 2). This is the number of threads to use for parallel operations. Once an operation has been run, this value cannot be changed.
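The environment variables can be set from within R using base R's Sys.setenv. The values below are illustrative only: "debug" is assumed to be an accepted log level, and the thread counts are arbitrary. Setting the variables before running any lmutils operation is the cautious choice, since the source does not say when they are read.

```r
# Illustrative values; adjust to your machine and workload.
Sys.setenv(
  LMUTILS_LOG = "debug",
  LMUTILS_NUM_MAIN_THREADS = "8",
  LMUTILS_NUM_WORKER_THREADS = "4"
)

# Confirm what the current R session will expose to lmutils.
Sys.getenv(c("LMUTILS_LOG", "LMUTILS_NUM_MAIN_THREADS", "LMUTILS_NUM_WORKER_THREADS"))
```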