
Benchmarking file formats for cloud storage

This repository holds test scripts for benchmarking different file formats. CSV is a relatively uncompressed, sparse format, but it is very common for data tasks such as import, export, and storage. When it comes to the performance of creating, reading, and writing files, how does CSV stand up against other formats?

Using Python

Covered formats with Python


  1. CSV
  2. AVRO
  3. Parquet
  4. Pickle
  5. ORC
  6. TXT

Python scripts for the benchmark tests

import timeit

create_df()  # build the test data frame used by all benchmarks

# results for write; number_of_runs and the WRITE_*_fun_timeIt wrappers
# are defined in the benchmark script
print(timeit.Timer(WRITE_CSV_fun_timeIt).timeit(number=number_of_runs))
print(timeit.Timer(WRITE_ORC_fun_timeIt).timeit(number=number_of_runs))
print(timeit.Timer(WRITE_PARQUET_fun_timeIt).timeit(number=number_of_runs))
print(timeit.Timer(WRITE_PICKLE_fun_timeIt).timeit(number=number_of_runs))

CLEAN_files()  # remove the generated test files
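
Each WRITE_*_fun_timeIt is a zero-argument callable that timeit can invoke repeatedly. A minimal sketch of what such wrappers might look like, assuming a shared pandas DataFrame df (the names here are illustrative, not the repository's exact code):

import pandas as pd

df = create_df()  # hypothetical: the shared test DataFrame

def WRITE_CSV_fun_timeIt():
    df.to_csv("test_df.csv", index=False)

def WRITE_PARQUET_fun_timeIt():
    df.to_parquet("test_df.parquet")

def WRITE_PICKLE_fun_timeIt():
    df.to_pickle("test_df.pickle")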

Using R

Covered formats with R

  1. CSV
  2. Parquet
  3. Feather

R scripts for benchmarking

benchmark_write <- data.frame(summary(microbenchmark(
          "test_df.csv"           = write.csv(test_df, file = file_csv),
          "test_df_readr.csv"     = readr::write_csv(test_df, file = file_csv_readr),
          "test_df_datatable.csv" = data.table::fwrite(test_df, file = file_csv_datatable),
          "test_df.feather"       = write_feather(test_df, file_feather),
          "test_df.parquet"       = write_parquet(test_df, file_parquet),
          "test_df.RData"         = save(test_df, file = file_rdata),
          "test_df.rds"           = saveRDS(test_df, file_rds),
  times = nof_repeat)))
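
Note that save() produces .RData files while saveRDS() produces .rds files, so each label above pairs a function with its native extension. The read benchmark presumably mirrors this block, with read.csv, readr::read_csv, data.table::fread, read_feather, read_parquet, load, and readRDS substituted for the corresponding write functions.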

Comparing read and write times

Compare the read and write times for each file format to see which one performs better for a given task.
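
On the Python side, the per-format totals from timeit can be collected and ranked directly. A minimal sketch, assuming hypothetical READ_*_fun_timeIt wrappers analogous to the WRITE_* ones:

import timeit

# hypothetical READ_* wrappers, analogous to the WRITE_* ones above
results = {
    "csv":     timeit.timeit(READ_CSV_fun_timeIt, number=number_of_runs),
    "parquet": timeit.timeit(READ_PARQUET_fun_timeIt, number=number_of_runs),
    "pickle":  timeit.timeit(READ_PICKLE_fun_timeIt, number=number_of_runs),
}

# rank formats from fastest to slowest mean read time
for fmt, total in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{fmt}: {total / number_of_runs:.4f} s per read")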

An example of the results when testing with R:

[Chart: read and write benchmark results by file format]

Cloning the repository

You can clone the repository with:

git clone https://github.com/tomaztk/Benchmarking-file-formats-for-cloud.git

Using Azure Blob storage for data lake

If you are running SQL Server on-premises and uploading data to a data lake, there is a Python script (Jupyter notebook) with detailed steps.

Link
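
The core upload step looks roughly like the following; a minimal sketch using the azure-storage-blob package, with the connection string, container name, and blob name as placeholder assumptions:

from azure.storage.blob import BlobServiceClient

# placeholder values -- substitute your own storage account details
service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="datalake", blob="test_df.parquet")

# upload the local benchmark file to Blob storage
with open("test_df.parquet", "rb") as data:
    blob.upload_blob(data, overwrite=True)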

Related blog posts

Contributors and co-authors

Thanks to these wonderful R community people for upgrading and improving these benchmarks. Your contributions are highly appreciated!

Ryan Duryea

