This repository documents line of code calculation bugs present in the TruffleHog secrets scanner, as reported in the following issues in the TruffleHog GitHub issue tracker:
- #2502 - Line of code calculation is wrong for sequential identical secrets
- #2504 - Presence of 'line of code' values are inconsistently presented in results, depending upon the data source configured
When the same secret occurs multiple times in a contiguous sequence, TruffleHog reports the line of code value of the first occurrence for every subsequent occurrence. The first instance of the secret therefore has the correct line number, while all following instances repeat that same, incorrect, value.
Analysis indicates that this bug was introduced in PR #520 which was merged on May 04 2022, and first appeared in TruffleHog v3.4.3 on May 05 2022.
Data chunking is used internally in TruffleHog to optimise for performance, by setting a maximum amount of data that a secret detector will process at a time. However, the chunking implementation loses some context, including the occurrence number (index) of the found secret.
Because of this loss of context, if the exact same raw secret value is present sequentially in a data chunk, the line of code is calculated using the first instance of the raw secret value as the reference point, instead of the location where the secret was actually found.
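To illustrate what the lost occurrence index would make possible, the following is a minimal sketch, not TruffleHog code. The function name `lineOfOccurrence` and its parameters are hypothetical; it assumes the caller knows which occurrence (zero-based) of the raw secret it is reporting on.

```go
package main

import (
	"bytes"
	"fmt"
)

// lineOfOccurrence is a hypothetical helper (not part of TruffleHog).
// It returns the zero-based line offset of the n-th (zero-based) occurrence
// of raw within chunk, or -1 if there are fewer than n+1 occurrences.
func lineOfOccurrence(chunk, raw []byte, n int) int {
	if len(raw) == 0 || n < 0 {
		return -1
	}
	pos := 0
	for i := 0; ; i++ {
		idx := bytes.Index(chunk[pos:], raw)
		if idx < 0 {
			return -1
		}
		start := pos + idx
		if i == n {
			// Count the newlines that precede this specific occurrence.
			return bytes.Count(chunk[:start], []byte("\n"))
		}
		// Continue searching after the occurrence just found.
		pos = start + len(raw)
	}
}

func main() {
	chunk := []byte("key=SECRET\nkey=SECRET\nkey=SECRET\n")
	raw := []byte("SECRET")
	for n := 0; n < 3; n++ {
		fmt.Printf("occurrence %d: line offset %d\n", n, lineOfOccurrence(chunk, raw, n))
	}
}
```

With the occurrence index available, each of the three identical secrets in the sample chunk resolves to its own line offset (0, 1 and 2) rather than all sharing the first one.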
In the file /pkg/engine.go, in the FragmentLineOffset method, on line number 900, the data chunk is split using the bytes.Cut function, which returns three values (before, after, and found):
The behaviour of this function is such that it splits the data at the first occurrence of the supplied separator, in this case the raw secret value (result.Raw). Consequently, each time the engine calculates the line of code value, it calculates it from the same position, regardless of which occurrence of the secret the calculation is intended for.
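The first-match behaviour of `bytes.Cut` can be reproduced in isolation. The sketch below is not the TruffleHog implementation; `lineOffsetFirstMatch` is a hypothetical stand-in that assumes the line offset is derived by counting newlines in the `before` slice.

```go
package main

import (
	"bytes"
	"fmt"
)

// lineOffsetFirstMatch is a hypothetical stand-in (not TruffleHog code).
// It counts the newlines that precede the FIRST occurrence of raw in chunk.
func lineOffsetFirstMatch(chunk, raw []byte) (int, bool) {
	// bytes.Cut always splits around the first instance of the separator.
	before, _, found := bytes.Cut(chunk, raw)
	if !found {
		return 0, false
	}
	return bytes.Count(before, []byte("\n")), true
}

func main() {
	// The same raw secret appears on three consecutive lines.
	chunk := []byte("key=SECRET\nkey=SECRET\nkey=SECRET\n")
	raw := []byte("SECRET")

	// Simulate one calculation per finding: the inputs are identical each
	// time, so the computed offset is identical too.
	for i := 1; i <= 3; i++ {
		offset, _ := lineOffsetFirstMatch(chunk, raw)
		fmt.Printf("finding %d: line offset %d\n", i, offset)
	}
}
```

Every finding reports line offset 0 here, because nothing in the inputs distinguishes the second or third occurrence from the first.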
Prior to TruffleHog v3.28.0, similar behaviour was implemented using the bytes.Split function, first introduced in v3.4.3 on line 234:
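The original snippet is not reproduced here. As a rough illustration only, the sketch below shows why a `bytes.Split`-based approach has the same limitation, assuming the offset is taken from the first element of the split result. The name `lineOffsetViaSplit` is hypothetical and is not taken from the TruffleHog source.

```go
package main

import (
	"bytes"
	"fmt"
)

// lineOffsetViaSplit is a hypothetical illustration (not TruffleHog code).
// The first element returned by bytes.Split is everything before the first
// occurrence of raw, so counting its newlines always yields the line offset
// of the first occurrence.
func lineOffsetViaSplit(chunk, raw []byte) int {
	if len(raw) == 0 {
		return 0 // guard: an empty separator splits per UTF-8 sequence
	}
	parts := bytes.Split(chunk, raw)
	return bytes.Count(parts[0], []byte("\n"))
}

func main() {
	chunk := []byte("header\nkey=SECRET\nkey=SECRET\n")
	raw := []byte("SECRET")
	fmt.Println(lineOffsetViaSplit(chunk, raw))
}
```

Note that when `raw` does not occur in the chunk, `parts[0]` is the whole chunk, so a caller using this pattern would also need to check that the secret was found.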
Bug #2504 - Presence of 'line of code' values are inconsistently presented in results, depending upon the data source configured
Whether TruffleHog includes line of code values in its results depends upon the data source selected. It has been observed that the filesystem data source does not always include line of code values in its results.
In the below screenshot, which shows /results/filesystem_loc_inaccurate.json, we can see that of the 10 results presented (lines 2-11), only 5 contain a line number, despite all of the findings being produced from the same file, using the same regular expression pattern, with similar 'raw secret' values:
Comparing the above screenshot to the one below, which is taken from /results/git_loc_inaccurate.json, we can see that every finding produced with the 'git' data source contains a line of code value. These values are incorrect due to bug #2502 described above; however, they are present for the 'git' data source but not consistently for the 'filesystem' data source, which is the issue at hand in bug #2504.
Observed in TruffleHog v3.68.0; no other versions have been tested at this point.
Unknown.