Git Product home page Git Product logo

Comments (1)

keif888 avatar keif888 commented on June 24, 2024

I would challenge that best practice, as described, but only where using Multiple Hash for ALL hash generation within your solution. The best practice is there for cases where you are using HashBytes (or it's equivalent in other DBMS').

On face value, as described, it will increase your risk of collisions due to dissimilar records generating the same hash.
In my view what is suggested will result in the following scenario where dissimilar records generate the same hash value:

Col 1 Col 2 Col 3 Hash
Text NULL Text 2 0x123456
Text Empty String Text 2 0x123456

If you are writing your own hashing from scratch, or using hashbytes, then other than empty string, that best practice will reduce the risk of dissimilar records generating the same hash. You need to include replacement of NULL with something that you can guarantee will never appear in your data to be safe.

Col 1 Col 2 Col 3 Hash
Text NULL -> Never In Text Text 2 0xab4f35
Text Empty String Text 2 0x123456

See Best practices on developing Data Vault in SQL Server (including SSIS)

Version 1.3 and greater of my component has a "Safe Null Hashing" option that is enabled by default to prevent this type of issue.

My component generates a byte array of the input to hash by repeating the following for each input column in the hashed field:

  1. Append input column content to a data byte array, ignoring null columns, and formatting all input data into maximum precision
  2. Append Null indicator to indicator byte array
  3. If the input column is variable length, append the length of the field to the indicator byte array

This guarantees that there can be no duplicate hash values generated for input data (with the exception of hash collisions), and is a better solution than the "best practice".

That being said, if you really want an option added to support Pipe delimiting with null replacement hashing, which would have to be mutually exclusive of Safe Null Handling, it can be done.

It would require an increment to the version (1.8), as this will need a new property added (or the SafeNullHandling property updated to support On, Off and DataVault.), which needs the upgrade to execute when opening an old package.
The CaclulateHash function in Utility.cs will need updating to support creating the inputByteBuffer in the correct format.
And of course the UI's need updating to support the new option.

The license doesn't prevent changes, (fork or pull), or running your own repository (clone). You just have to use the same license...

from ssismhash.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.