etlpro

These are the programs I discussed in my talk "ETL for Pros – Getting Data Into MongoDB The Right Way" at MongoDB World 2016. They show various ways to get from a relational table structure to MongoDB documents.

The script create_data.py sets up the source tables in a local MySQL database.

There are four approaches to get the data into MongoDB. The first three are bad for various reasons and to varying degrees. The final one, which I call the co-iteration pattern, attempts to overcome the shortcomings of the earlier ones. For each solution, there's a simple variant that writes data directly to MongoDB, document by document, and a second variant that batches MongoDB operations in groups of 1,000 each.
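To illustrate the difference between the two variants, here is a minimal sketch of batching writes in groups of 1,000. It is not taken from the scripts: `batched`, `load`, and `FakeCollection` are hypothetical names, and `FakeCollection` is only a stand-in so the example runs without a MongoDB server. With PyMongo, a real collection's `insert_many` could take its place; which bulk operation the `_batch` scripts actually use is not stated here.

```python
from itertools import islice


def batched(docs, size=1000):
    """Yield lists of at most `size` documents from any iterable."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


class FakeCollection:
    """Hypothetical stand-in for a MongoDB collection.

    It only records how many documents each insert_many call received.
    """

    def __init__(self):
        self.calls = []

    def insert_many(self, documents):
        self.calls.append(len(documents))


def load(docs, collection, size=1000):
    """Write docs in batches and return the number of write calls made."""
    round_trips = 0
    for batch in batched(docs, size):
        collection.insert_many(batch)
        round_trips += 1
    return round_trips


if __name__ == "__main__":
    coll = FakeCollection()
    trips = load(({"_id": i} for i in range(2500)), coll)
    print(trips, coll.calls)
```

Writing 2,500 documents this way takes three write calls instead of 2,500, which is the point of the batch variants.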

Here's a full list:

  • The first solution (etl_1.py, etl_1_batch.py) reads data from the source system in nested queries, leading to 2n+1 individual queries: one for the main table, plus two lookup queries for each of its n rows.
  • The second solution (etl_2.py, etl_2_batch.py) only runs three queries against the source system (each getting one full table), and builds the documents in the target database, using individual MongoDB updates.
  • The third solution (etl_3.py, etl_3_batch.py) starts by reading the lookup tables completely into the application's memory, and then assembles the output documents in a loop over the main source table.
  • The final solution (etl_co.py, etl_co_batch.py) introduces the co-iteration pattern, which opens all source tables at once, performs a single pass over all of them, and assembles the output documents in the process.
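The co-iteration pattern from the last item can be sketched roughly as follows. This is an illustration under assumptions, not the code in etl_co.py: the tables `person`, `phone`, and `address`, their columns, and the function names are hypothetical, and an in-memory SQLite database stands in for MySQL so the example is self-contained. The key assumption is that every query is ordered by the same key, so each cursor only ever moves forward.

```python
import sqlite3


def co_iterate(main_rows, *child_streams):
    """Yield one document per main row in a single pass over all inputs.

    main_rows: iterable of (id, name) tuples, sorted by id.
    child_streams: (field_name, rows) pairs, where rows is an iterable of
    (parent_id, value) tuples sorted by parent_id.
    """
    # Prime each child stream with its first pending row (None if empty).
    cursors = []
    for field, rows in child_streams:
        it = iter(rows)
        cursors.append([field, it, next(it, None)])

    for main_id, name in main_rows:
        doc = {"_id": main_id, "name": name}
        for cur in cursors:
            field, it, pending = cur
            values = []
            # Skip child rows whose parent id has no match in the main table.
            while pending is not None and pending[0] < main_id:
                pending = next(it, None)
            # Collect every child row that belongs to the current main row.
            while pending is not None and pending[0] == main_id:
                values.append(pending[1])
                pending = next(it, None)
            cur[2] = pending  # remember where this stream stopped
            doc[field] = values
        yield doc


def demo():
    """Build a tiny hypothetical source schema and co-iterate over it."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE phone (person_id INTEGER, number TEXT);
        CREATE TABLE address (person_id INTEGER, city TEXT);
        INSERT INTO person VALUES (1, 'Ada'), (2, 'Bob'), (3, 'Cy');
        INSERT INTO phone VALUES (1, '555-0100'), (1, '555-0101'), (3, '555-0300');
        INSERT INTO address VALUES (2, 'Berlin'), (3, 'Oslo');
    """)
    # Three queries in total, all open at once and all sorted by the same key.
    people = db.execute("SELECT id, name FROM person ORDER BY id")
    phones = db.execute(
        "SELECT person_id, number FROM phone ORDER BY person_id, number")
    addresses = db.execute(
        "SELECT person_id, city FROM address ORDER BY person_id, city")
    return list(co_iterate(people, ("phones", phones), ("addresses", addresses)))


if __name__ == "__main__":
    for document in demo():
        print(document)
```

In this sketch the number of queries stays fixed at three regardless of table size, and only the rows for the current document are held in memory, in contrast to the nested-query and load-everything approaches above.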

For an actual tool that does all this, and more, in a more generic manner, see John Page's MongoSyphon.
