During a demo this week for a potential OpenAnnotate redaction customer, we were asked, “So you guys don’t do redaction like the lawyers for Paul Manafort did, do you?” To explain how to redact PDF documents without repeating that unsuccessful redaction, we thought we would provide some additional detail on how to redact (and not redact) PDF documents.
Manafort Lawyer Redacting – What went wrong?
The issue occurred recently when Manafort’s lawyers filed a response to special counsel Robert Mueller’s team’s allegations that Manafort lied to prosecutors. On pages five, six, seven and nine, either the lawyers or the special counsel staffers attempted to redact sensitive passages. While the redaction blocks prevent the words from being read at first glance, anyone with Adobe Acrobat, another PDF viewer or even a browser-based viewer could copy the text that still existed under the redaction blocks and paste it into another document to read the redacted passages.
Redacting Text from a PDF – Our guess at what went wrong
PDF documents come in several types, and the type could have played a role in how the redaction went wrong. A scanned document is typically referred to as a PDF – Image. Like a fax, it is made up only of black and white dots (or colored dots if a color scanner was used) and contains no text to copy and paste. Redacting this type of document can be as simple as converting all of the dots in and around the image of the text to be redacted to black. Given how many documents lawyers scan for signatures, we imagine that whoever did the redacting assumed this was a scanned document, where just drawing boxes over the text would have successfully hidden the black and white dots that make up the words.
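To make the image case concrete, here is a minimal sketch in Python. It assumes the scanned page is nothing more than a grid of grayscale values (0 for black, 255 for white). The function name `redact_region` and the tiny sample page are our own illustration, not part of any PDF library or of OpenAnnotate.

```python
# Toy model of a scanned page: a grid of pixels, 0 = black, 255 = white.
# redact_region and the sample page are illustrative assumptions only.

def redact_region(pixels, left, top, right, bottom):
    """Return a copy of the image with the given rectangle set to black.

    Coordinates are pixel indices; right and bottom are exclusive.
    """
    redacted = [row[:] for row in pixels]  # copy so the original is untouched
    for y in range(max(top, 0), min(bottom, len(redacted))):
        row = redacted[y]
        for x in range(max(left, 0), min(right, len(row))):
            row[x] = 0  # overwrite the dot itself; nothing is left underneath
    return redacted


# A 4x6 "page" of white pixels with a few dark dots standing in for text.
page = [
    [255, 255, 255, 255, 255, 255],
    [255,  40, 255,  40,  40, 255],
    [255,  40,  40, 255,  40, 255],
    [255, 255, 255, 255, 255, 255],
]

clean = redact_region(page, left=1, top=1, right=5, bottom=3)
for row in clean:
    print(row)
```

Because the dots themselves are overwritten, the redacted copy holds no trace of the original marks. That only holds when the page really is just an image.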
Unfortunately, there are two other types of PDF documents that, in addition to the image of the text, contain text data. In these PDFs, the text data is what allows searching and copy-pasting of the document’s text.
Such documents can be created in two ways. An image document can be run through an Optical Character Recognition (OCR) module, with the resulting text embedded behind the image to enable search and other text capabilities like copy and paste. Alternatively, the document can be created directly as a PDF, including text and fonts, from a word processor or other font-capable program. From our quick review, we would guess that the Manafort document was created directly and never scanned, as it is very clean (no stray dots from scanning) and very small compared to an image of the same text. In either case, simply drawing a box over the words does not remove the text underneath.
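The difference can be sketched with a simplified page content stream. The Python below uses a hand-written, uncompressed stream built from standard PDF operators (`BT`/`ET` for a text object, `Tj` to show a string, `re` and `f` to fill a rectangle). The sample text and the helpers `extract_text`, `draw_box` and `remove_text` are hypothetical. Real PDFs usually compress their streams and encode text in more varied ways, so treat this as an illustration of the idea rather than a working redaction tool.

```python
import re

# A simplified, uncompressed page content stream (hypothetical example).
PAGE = (
    "BT /F1 12 Tf 72 700 Td (The meeting was held in) Tj ET\n"
    "BT /F1 12 Tf 72 680 Td (SECRET CITY) Tj ET\n"
)

# Matches a simple literal string followed by the Tj (show text) operator.
TEXT_OP = re.compile(r"\(([^()]*)\)\s*Tj")


def extract_text(stream):
    """Collect every string shown with Tj, roughly what copy and paste sees."""
    return TEXT_OP.findall(stream)


def draw_box(stream, x, y, width, height):
    """'Redact' by painting a black rectangle on top; the text is untouched."""
    return stream + f"0 0 0 rg {x} {y} {width} {height} re f\n"


def remove_text(stream, secret):
    """Drop the line whose text-showing operation contains the secret."""
    kept = [line for line in stream.splitlines() if f"({secret})" not in line]
    return "\n".join(kept) + "\n"


# Drawing a box only: the secret is hidden on screen but still in the data.
boxed = draw_box(PAGE, 70, 676, 90, 14)
print(extract_text(boxed))     # ['The meeting was held in', 'SECRET CITY']

# Removing the text first, then drawing the box: nothing is left to copy.
redacted = draw_box(remove_text(PAGE, "SECRET CITY"), 70, 676, 90, 14)
print(extract_text(redacted))  # ['The meeting was held in']
```

In the first case the black rectangle is simply painted after the text, so the string is still present in the page data. In the second, the text operation is gone before the box is drawn.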
OpenAnnotate Redaction
In the short demo below, we will use the Manafort document with OpenAnnotate to show the difference between true text redaction and just drawing blocks over the text.
Thanks again for reading. Let us know if you have any additional thoughts or comments below.