Git Product home page Git Product logo

sql_dataload's Introduction

SQL DataLoad Project

This project is an ETL (Extract-Transform-Load) project to convert a live database to a new system.

It took as input:

  • an existing MS SQL Server Database representing the live system;
  • several production Excel files containing accounting information (i.e., invoicing and payments). The conversion process covered over a year of live operation, so the historical files, contained in a nested directory structure, had to be found, parsed, and processed for the information they contained;
  • a number of special Excel 'fixup' files (created to deal with bad data in the SQL Server tables; these were parsed by ruby scripts which converted the detailed information into a SQL script to apply the corrections to the live database before the actual conversion process). and created the new database, properly defined and related.

Some of the issues that had to be dealt with included:

  • Multiple individuals assigned the same ids in the original database; thus, James Smith and Jesse James could all have the ID 12345
  • Ids and names were changed at various times without correcting the existing data
  • The only reference between the tables and the external spreadsheets was individual's names, which had many mis-spellings and typos. I.e., one part of the system might use the name 'James Smith', whereas external systems might refer to that individual using the name 'James S Smith', 'James S. Smith', 'James Smith Jr.', etc... This required special processing to 'normalize' the names and improve the percentage matched.
  • During the original creation of the live system, not all records could be captured (data errors, typos, missing data,...) and thus it was incomplete. However, other instances of that data did appear in several external, related systems.

The goal was to correct as many of these errors as possible and to clearly identify any missing data.

Languages Used

  • Windows CMD scripts to drive the process
  • Ruby to process the external spreadsheets and perform other tasks such as generating SQL scripts fo data fixups from the spreadsheets
  • SQL scripts to perform the data conversion
  • Sed scripts to cleanup data and/or to quickly generate SQL code

Along the way, a number of one-off tests were run to determine precisely how various programs handled specific error or unusual conditions. These simple test scripts are included in the event they help someone else.

Design Approach

The entire process was designed so that the conversion could be re-run as many times as necessary against a clean/updated database. This is in contrast to the approach followed in some other conversions where data is converted once to spreadsheets in a 'standard' format and then changes to the original database are processed manually. The output database structure was originally provided as MySQL statements, so these were converted to MS SQL Server format (via a ruby script) as part of the initial handiling. Since this was run in a Windows environment, the process was executed through a series of CMD files to drive the process. A single CMD file - createmors.cmd - drives the entire process. At a few points in the process, the SQL Server database instance is restarted; this was done to avoid problems wherein MS SQL Server would hang for no apparent reason. Also notice that the free version of the SQLSharp extension was used: this is an awesome library for MS SQL Server users - highly recommended!

Overall, the process was designed to be quite chatty and all output was redirected to various .out files for subsequent review. [Note that some of these files are provided; however, sever

Process

Overall, the process runs from a small number of Windows CMD scripts, designed to make it easy to run the process through to completion without user intervention. Once a short manual process is initiated, the entire script can run straight through. The CMD code that exists here does have a few prompts throughout the process, but they were easily removed to allow unattended operation.

Manual

  • Restore the conversion database from the current production database
  • Run the createmors_sqlsharp.cmd to install the extension

createmors.cmd

  • Ensure that the SQLSharp extension has been loaded
  • Run createmors1.cmd to perform the basic creation/conversion
    • Run various data fixup procedures
    • Create some basic views and new client tables
    • Insert data into the tables
    • Test the various tables for accuracy / completeness
    • Start the invoice/payment (remittance) processing
  • Run createmors2.cmd to incorporate the invoice/payment data from external spreadsheets and produce final statistics
    • Process the remittance data and update output tables with data acquired from this step
    • Export data for missing information
    • Print summary statistics & error information

createmors_list.cmd

This is a convenience command to display the results, sans irrelevant output from MS SQL Server. This is a simple grep with options to exclude the constant "Changed database" and other "informational" messages from SQL Server that clutter up the output.

Results

Output from the various steps was redirected to a variety of .out files, so that the intermediate results could be examined and changes made. Numerous errors were tracked throughout the process and matched back to external information about the database contents. In the end, not all errors could be handled: aproximately 1.5% of the data could not be matched, due to poor data quality in the original database, and had to be handled through an external, manual process.

NOTE

The original process ran with all scripts in the project's root directory; however, the ruby and SQL scripts have been extracted into their own sub-directories for easier reviewing.

sql_dataload's People

Contributors

jesii avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.