Insight Data Engineering Challenge

Dependencies

The parallel solution uses one external Python module, filesplit, to split files. It can be installed using:

python3 -m pip install filesplit

Apart from this, I have used the following Python standard library modules: csv, os, glob, shutil, multiprocessing, collections, functools

Solution

I solved the problem using two different approaches. The first uses a single-threaded process and is memory efficient, but takes a long time to finish. The second uses multiple processes running in parallel and is time efficient, but assumes that the system has enough free space in secondary storage. Both approaches read the data row by row so that the solution scales to large inputs, and both use Python dictionaries as the underlying data structure to store relationship maps and resulting statistics for fast access. The first approach is implemented in the DeptOrderStat class. The second approach is implemented in the DeptOrderStatMP class, which extends the DeptOrderStat class.
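As a rough illustration of the row-by-row, dictionary-based approach, the sketch below streams two CSV inputs with csv.DictReader and keeps only dictionaries in memory. The function name, the default column names (product_id, department_id, reordered), and the choice of statistics (total orders and first-time orders per department) are assumptions made for illustration. This is not the actual DeptOrderStat implementation.

```python
import csv
from collections import defaultdict


def department_stats(product_rows, order_rows,
                     product_col="product_id",
                     dept_col="department_id",
                     reorder_col="reordered"):
    """Aggregate per-department order counts from two iterables of CSV lines.

    Hypothetical helper: the column names and statistics are assumptions.
    """
    # Relationship map: product id -> department id.
    product_to_dept = {}
    for row in csv.DictReader(product_rows):
        product_to_dept[row[product_col]] = row[dept_col]

    # Resulting statistics: department id -> [orders, first-time orders].
    stats = defaultdict(lambda: [0, 0])
    for row in csv.DictReader(order_rows):  # only one row is held at a time
        dept = product_to_dept.get(row[product_col])
        if dept is None:
            continue  # skip orders that reference an unknown product
        stats[dept][0] += 1
        if row[reorder_col] == "0":
            stats[dept][1] += 1
    return dict(stats)
```

Because csv.DictReader accepts any iterable of lines, open file objects can be passed directly, so the order file is never loaded into memory as a whole.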

The parallel solution assumes that the machine has enough free disk space to store temporary files of about the same total size as the input files. It splits the order request file into separate files, processes each file in a separate process, and then consolidates the results. For large input files, this can reduce the time required to generate the report by up to a factor of about m, where m is the number of CPUs in the system.
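The split, process, and consolidate steps can be sketched as follows. This is a simplified sketch under stated assumptions: it splits a list of lines in memory instead of writing temporary files with filesplit, it counts rows per key rather than computing the real report, and all function names are hypothetical. The map_fn parameter defaults to the builtin map so that the example runs anywhere.

```python
from collections import Counter
from functools import reduce


def count_chunk(lines):
    """Count rows per key (the first comma-separated field) in one chunk."""
    return Counter(line.split(",")[0] for line in lines)


def split_into_chunks(lines, n_chunks):
    """Split a list of lines into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def parallel_count(lines, n_chunks, map_fn=map):
    """Split the input, process each chunk independently, merge the results."""
    partials = map_fn(count_chunk, split_into_chunks(lines, n_chunks))
    # Consolidation: adding Counters sums the counts key by key.
    return reduce(lambda a, b: a + b, partials, Counter())
```

To run the chunks in separate processes, the map method of a multiprocessing.Pool can be passed as map_fn, since count_chunk is a top-level function and its Counter results can be pickled. The merge step stays the same either way, which is why the sequential and parallel variants can share an interface.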

The interface is the same for both classes. To run the parallel solution, pass --mp after the script name.

To run the parallel solution use: python3 report_generator.py --mp

To run the sequential solution use: python3 report_generator.py

To pass the product file, order request file, and output file manually, use the following format: python3 report_generator.py --orderfile <order_file_path> --productfile <product_file_path> --output <output_file_path>

To pass the names of the CSV columns that identify the product id, department id, and reorder information, use: python3 report_generator.py --productcol <column_name> --deptcol <column_name> --reordercol <column_name>
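For reference, the options above could be declared with argparse roughly as follows. This is a sketch, not the parser used in report_generator.py: the flag names come from the usage lines above, while the build_parser name, the help strings, and the default column names are assumptions.

```python
import argparse


def build_parser():
    """Build a command-line parser for the flags described in this README."""
    parser = argparse.ArgumentParser(prog="report_generator.py")
    parser.add_argument("--mp", action="store_true",
                        help="use the parallel solution")
    parser.add_argument("--orderfile", help="path to the order request file")
    parser.add_argument("--productfile", help="path to the product file")
    parser.add_argument("--output", help="path to the output file")
    # The default column names below are assumptions for illustration.
    parser.add_argument("--productcol", default="product_id",
                        help="name of the product id column")
    parser.add_argument("--deptcol", default="department_id",
                        help="name of the department id column")
    parser.add_argument("--reordercol", default="reordered",
                        help="name of the reorder column")
    return parser
```

Calling build_parser().parse_args() with no arguments reads the flags from the command line. Because --mp is a store_true flag, it is False unless the flag is given.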

I have refrained from implementing a highly abstract solution that could be extended 'easily' to other analytics problems. Hence, the solution is specific to the input data models described in the requirements.

Test cases

I have implemented one test case to check that the solution works 'properly' with missing or invalid data.

Bugs

To speed up the running time, I have omitted a range check on the reordered column. Hence, any numeric value other than 0 or 1 in that column will still pass through. This check can be added with one simple data validation condition, if required.
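A validation condition of this kind might look like the sketch below. The function name is hypothetical, and treating the field as a string that may carry surrounding whitespace is an assumption about the input.

```python
def is_valid_reorder_flag(value):
    """Return True only if the field is '0' or '1', ignoring whitespace."""
    return value.strip() in {"0", "1"}
```

Rows for which this returns False could then be skipped or reported, at the cost of one extra set lookup per row.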
