opendata / ssn-redaction

A redaction tool for Social Security Numbers in PDFs. [RETIRED]

License: MIT License

Ruby 100.00%

ssn-redaction's Issues

Require either 0 or 2 hyphens

Our present regex ((?!666|000|9\d{2})\d{3}(-?)(?!00)\d{2}(-?)(?!0{4})\d{4}) treats each hyphen as independently optional, so it also matches numbers with exactly one hyphen, which probably isn't realistic: SSNs are likely to be written with zero hyphens or two, but not one. In testing, this produced a false match in a page footer reading 15351125-781074. I suggest that, if feasible, we allow either zero hyphens or two hyphens, but not one.
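One possible way to get "zero or two" is to capture the first separator and reuse it with a backreference (\1), so the second separator must be identical to the first. The sketch below assumes Ruby's built-in regex engine; the constant name is hypothetical and not taken from the project.

```ruby
# Sketch only: SSN_ZERO_OR_TWO_HYPHENS is a hypothetical name.
# (-?) captures either "-" or the empty string; \1 then requires the
# second separator to be exactly what the first one was.
SSN_ZERO_OR_TWO_HYPHENS = /(?!666|000|9\d{2})\d{3}(-?)(?!00)\d{2}\1(?!0{4})\d{4}/

# The footer text from this issue no longer matches, because it
# contains one hyphen and no run of nine consecutive digits.
footer_matches = SSN_ZERO_OR_TWO_HYPHENS.match?("15351125-781074")

two_hyphens = SSN_ZERO_OR_TWO_HYPHENS.match?("123-45-6789")
no_hyphens  = SSN_ZERO_OR_TWO_HYPHENS.match?("123456789")
one_hyphen  = SSN_ZERO_OR_TWO_HYPHENS.match?("123-456789")
```

With this pattern, two_hyphens and no_hyphens are true, while footer_matches and one_hyphen are false.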

Test candidate regex on actual PDFs

Review at least a few dozen SSN-bearing PDFs, testing the candidate regular expression against real data, to fine-tune it.

Save the fully redacted PDF

After the target text has been identified and replaced (#8), and the background image modified to overwrite the original text (#6), save the modified PDF.

Support spaces as SSN separators

People sometimes use spaces instead of hyphens.
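A minimal sketch of one way to accept spaces, assuming the separator is captured and reused so that both separators must be the same character (both hyphens, both spaces, or both absent). The constant name is hypothetical, and whether mixed separators should be rejected is an assumption rather than something this issue specifies.

```ruby
# Sketch only: SSN_WITH_SPACES is a hypothetical name.
# ([- ]?) captures a hyphen, a space, or nothing; \1 requires the
# second separator to repeat whatever the first one was.
SSN_WITH_SPACES = /(?!666|000|9\d{2})\d{3}([- ]?)(?!00)\d{2}\1(?!0{4})\d{4}/

spaced  = SSN_WITH_SPACES.match?("123 45 6789")
hyphens = SSN_WITH_SPACES.match?("123-45-6789")
mixed   = SSN_WITH_SPACES.match?("123 45-6789")
```

Here spaced and hyphens are true and mixed is false.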

Identify target text within PDF text

Use the regex to identify the target text within the PDF text.
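As a rough illustration of this step, the sketch below walks a string of already-extracted PDF text and records each match along with its character offsets, which a later redaction step would presumably need. The method name find_ssns and the returned hash keys are hypothetical; the pattern is the project's present regex as quoted in the hyphen issue, and any refined pattern could be substituted.

```ruby
# Sketch only: find_ssns and SSN_PATTERN are hypothetical names.
SSN_PATTERN = /(?!666|000|9\d{2})\d{3}(-?)(?!00)\d{2}(-?)(?!0{4})\d{4}/

# Returns one hash per match: the matched text plus its start and
# stop character offsets within the input string.
def find_ssns(text)
  hits = []
  pos = 0
  # Regexp#match accepts a start position, so resume after each hit.
  while (m = SSN_PATTERN.match(text, pos))
    hits << { text: m[0], start: m.begin(0), stop: m.end(0) }
    pos = m.end(0)
  end
  hits
end

hits = find_ssns("Name: Jane, SSN 123-45-6789, alt 234567890.")
```

Each entry in hits satisfies text[start...stop] == the matched string, so the offsets can be used to locate the target within the page text.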

Deal with em dashes, en dashes, etc.

OCR software (including, notably, Tesseract) sometimes gets a little too clever, and believes that hyphens are actually em dashes, en dashes, or minus signs. (Possibly other characters too, I'm not sure.) For example, it may produce 012—34—5678 instead of 012-34-5678. These, of course, are not found by our regex.

I suggest that we convert the character set down to ASCII (assuming that Ruby can do such a thing), so that all hyphen-like characters become, simply, hyphens.
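Ruby can do something close to this. Rather than transliterating the whole character set to ASCII, the sketch below takes a narrower approach and replaces only dash-like characters with a plain hyphen before the regex runs. It relies on the Unicode "Dash_Punctuation" property (\p{Pd}), which covers en and em dashes among others, plus the minus sign U+2212, which belongs to a different Unicode category. The method name is hypothetical.

```ruby
# Sketch only: normalize_dashes is a hypothetical helper name.
# \p{Pd} matches Unicode dash punctuation (hyphen, en dash, em dash,
# and similar); \u2212 is the minus sign, which is not in \p{Pd}.
DASH_LIKE = /[\p{Pd}\u2212]/

def normalize_dashes(text)
  text.gsub(DASH_LIKE, "-")
end

normalized = normalize_dashes("012\u2014345\u20136789") # em dash, then en dash
```

Running the SSN regex on the normalized text, instead of the raw OCR output, lets the existing hyphen handling apply unchanged.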

Require that SSNs be neither preceded nor followed by digits

Virtually any 10-plus-digit number is going to be matched, in part, by our regular expression. We really only want to match 9 digit numbers that are not preceded by or followed by digits. I've tried this:

(?:\D)(?!666|000|9\d{2})\d{3}(-?)(?!00)\d{2}\1(?!0{4})\d{4}(?:\D)

but that also matches the preceding and following characters, returning results like A123-45-6789,. (The ?: keeps each group from being captured, but the characters it matches are still consumed and returned as part of the overall match.) Figure out how to require the presence of \D before and after, but not return them.
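One candidate answer is to use zero-width lookarounds, which test the neighboring characters without consuming them. The sketch below uses a negative lookbehind (?<!\d) and a negative lookahead (?!\d). Note that this is slightly different from requiring \D on each side: it also accepts an SSN at the very start or end of the text, which is assumed here to be desirable. The constant name is hypothetical.

```ruby
# Sketch only: SSN_NOT_ADJACENT_TO_DIGITS is a hypothetical name.
# (?<!\d) asserts "no digit immediately before"; (?!\d) asserts
# "no digit immediately after". Neither is included in the match.
SSN_NOT_ADJACENT_TO_DIGITS =
  /(?<!\d)(?!666|000|9\d{2})\d{3}(-?)(?!00)\d{2}\1(?!0{4})\d{4}(?!\d)/

# The surrounding "A" and "," are checked but not returned.
bounded = "A123-45-6789,"[SSN_NOT_ADJACENT_TO_DIGITS]

# A longer run of digits no longer yields a partial match.
long_number = "15351125781074"[SSN_NOT_ADJACENT_TO_DIGITS]
```

Here bounded is "123-45-6789" and long_number is nil.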

Modify background image to draw a black box

Create a test harness to evaluate effectiveness

Replace identified target string with equivalent-length redaction vector data

OCR software confuses letters and numbers
