opendata / ssn-redaction
A redaction tool for Social Security Numbers in PDFs. [RETIRED]
License: MIT License
Our present regex, (?!666|000|9\d{2})\d{3}(-?)(?!00)\d{2}(-?)(?!0{4})\d{4}, makes each hyphen optional on its own. That means it also accepts exactly one hyphen, which probably isn't realistic: SSNs are likely to have zero hyphens or two hyphens, but not one. In testing, this made a false match in a footer at the bottom of a page (reading 15351125-781074), where the regex matched 51125-7810. I suggest that, if feasible, we allow either zero hyphens or two hyphens, but not one.
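One possible way to get "zero or two hyphens, but not one" is to replace the second (-?) with a backreference to the first, so the second separator must equal whatever the first one matched. The sketch below is only an illustration under that assumption; the constant name SSN_SAME_SEPARATOR is hypothetical and not part of the project.

```ruby
# Hypothetical constant: the project's pattern with the second (-?)
# replaced by \1, a backreference to the first separator group.
SSN_SAME_SEPARATOR = /(?!666|000|9\d{2})\d{3}(-?)(?!00)\d{2}\1(?!0{4})\d{4}/

samples = [
  "123-45-6789",     # two hyphens
  "123456789",       # no hyphens
  "123-456789",      # one hyphen
  "15351125-781074"  # the footer text that caused the false match
]

samples.each do |sample|
  # String#[] with a Regexp returns the matched text, or nil if none.
  puts "#{sample.inspect} => #{sample[SSN_SAME_SEPARATOR].inspect}"
end
```

With this pattern the two-hyphen and zero-hyphen samples match in full, while the one-hyphen sample and the footer text produce no match.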
Review at least a few dozen SSN-bearing PDFs, testing the candidate regular expression against real data, to fine-tune it.
People sometimes use spaces instead of hyphens.
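If spaces should be accepted, one option is to widen the separator from a hyphen to a small character class, and to require the second separator to repeat the first so that mixed forms are rejected. This is a sketch under that assumption, not the project's pattern; the constant name SSN_HYPHEN_OR_SPACE is hypothetical.

```ruby
# Hypothetical constant: the separator group accepts a hyphen, a space,
# or nothing, and \1 forces the second separator to be the same.
SSN_HYPHEN_OR_SPACE = /(?!666|000|9\d{2})\d{3}([- ]?)(?!00)\d{2}\1(?!0{4})\d{4}/

["123 45 6789", "123-45-6789", "123456789", "123 45-6789"].each do |sample|
  puts "#{sample.inspect} => #{sample[SSN_HYPHEN_OR_SPACE].inspect}"
end
```

The first three samples match in full; the mixed "123 45-6789" does not, because the second separator differs from the first.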
Use the regex to identify the target text within the PDF text.
OCR software (including, notably, Tesseract) sometimes gets a little too clever and reads hyphens as em dashes, en dashes, or minus signs. (Possibly other characters too, I'm not sure.) For example, it produces 012–34–5678 instead of 012-34-5678. These, of course, are not found by our regex.
I suggest that we convert the character set down to ASCII (assuming that Ruby can do such a thing), so that all hyphen-like characters become, simply, hyphens.
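Rather than converting the whole text to ASCII, a narrower option is to map only the dash-like characters to a plain hyphen before running the regex. The sketch below assumes the extracted text is a UTF-8 string; the names DASH_LIKE and normalize_dashes are hypothetical, and the list of code points is illustrative, not exhaustive.

```ruby
# Dash-like code points that OCR output might contain (illustrative list):
# U+2010 hyphen through U+2015 horizontal bar (this range includes the
# en dash U+2013 and em dash U+2014), plus U+2212 minus sign.
DASH_LIKE = /[\u2010-\u2015\u2212]/

# Returns a copy of text with every dash-like character replaced by "-".
# Assumes text is UTF-8; other encodings would need converting first.
def normalize_dashes(text)
  text.gsub(DASH_LIKE, "-")
end

ocr_text = "SSN: 012\u201334\u20135678" # en dashes, as OCR might emit
puts normalize_dashes(ocr_text)         # prints "SSN: 012-34-5678"
```

Mapping a fixed set of characters leaves the rest of the text untouched, which matters if the same string is later used to locate the match within the PDF.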
Virtually any 10-plus-digit number is going to be matched, in part, by our regular expression. We really only want to match 9-digit numbers that are not preceded or followed by digits. I've tried this:
(?:\D)(?!666|000|9\d{2})\d{3}(-?)(?!00)\d{2}\1(?!0{4})\d{4}(?:\D)
but that also matches the preceding and following characters, returning results like A123-45-6789, with the A and the comma included. (The ?: ensures that the group isn't captured, but the characters it matches are still consumed and returned as part of the result.) Figure out how to require the presence of \D before and after, but not return them.
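One candidate approach is to use zero-width lookarounds in place of the non-capturing groups: a negative lookbehind (?<!\d) and a negative lookahead (?!\d) check the neighbouring characters without consuming them. Unlike (?:\D), they also succeed at the very start or end of the text. The sketch below is an illustration only; SSN_BOUNDED and find_ssns are hypothetical names.

```ruby
# Hypothetical constant: the pattern wrapped in zero-width assertions.
# (?<!\d) = not preceded by a digit; (?!\d) = not followed by a digit.
SSN_BOUNDED = /(?<!\d)(?!666|000|9\d{2})\d{3}(-?)(?!00)\d{2}\1(?!0{4})\d{4}(?!\d)/

# Returns every full match in text. String#scan would return only the
# captured separator because the pattern has a group, so the block form
# with Regexp.last_match(0) is used to collect the whole match instead.
def find_ssns(text)
  matches = []
  text.scan(SSN_BOUNDED) { matches << Regexp.last_match(0) }
  matches
end

p find_ssns("A123-45-6789, ref 15351125781074") # prints ["123-45-6789"]
p find_ssns("1234567890")                       # prints []
```

The surrounding A and comma are no longer part of the result, and the longer digit runs are rejected because every candidate position inside them has a digit on at least one side.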
Are we concerned that OCR software frequently mistakes digits for letters? O for 0 and I for 1 are both very common substitutions. Or do we regard OCR as the user's problem? That is, is it up to them to get their OCR correct?