Most security agencies fail to properly sanitize Portable Document Format (PDF) files before publishing them, thus exposing potentially sensitive information and opening the door for attacks, researchers have discovered.
An analysis of roughly 40,000 PDFs published by 75 security agencies in 47 countries has revealed that these files can be used to identify employees who use outdated software, according to Supriya Adhatarao and Cédric Lauradoux, two researchers with the University Grenoble Alpes and France’s National Institute for Research in Computer Science and Automation (Inria).
The analysis also revealed that the adoption of sanitization within security agencies is rather low, as only 7 of them used it to remove hidden sensitive information from some of their published PDF files. What’s more, 65% of the sanitized files still contained hidden data.
“Some agencies are using weak sanitization techniques: it requires to remove all the hidden sensitive information from the file and not just to remove the data at the surface. Security agencies need to change their sanitization methods,” the academic researchers say.
PDF files, the researchers note, are collections of indirect objects that store data. There are eight object types: arrays, booleans, dictionaries, names, numbers, streams, strings, and the null object. These objects may contain hidden data that is not visible when the PDF is viewed.
Per the NSA, there are 11 main types of hidden data in PDF files, namely metadata; embedded content and attached files; scripts; hidden layers; embedded search index; stored interactive form data; reviewing and commenting; hidden page, image and update data; obscured text and images; PDF comments that are not displayed; and unreferenced data.
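To make the object structure and the "unreferenced data" category more concrete, here is a minimal Python sketch that lists the indirect objects in a PDF fragment and flags those that nothing else points to. It is an illustration only, not the researchers' tooling: the function names and the sample bytes are made up for this example, and the regular expressions only work on uncompressed objects (real files often pack objects into compressed streams, which this does not handle).

```python
import re

# Header of an indirect object: "<object number> <generation> obj".
OBJ_HEADER = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")
# Reference to an indirect object: "<object number> <generation> R".
OBJ_REFERENCE = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+R\b")


def list_indirect_objects(pdf_bytes: bytes) -> list[tuple[int, int]]:
    """Return (object number, generation) pairs declared in the bytes."""
    return [(int(n), int(g)) for n, g in OBJ_HEADER.findall(pdf_bytes)]


def find_unreferenced(pdf_bytes: bytes) -> list[tuple[int, int]]:
    """Return declared objects that no 'N G R' reference points to."""
    referenced = {(int(n), int(g)) for n, g in OBJ_REFERENCE.findall(pdf_bytes)}
    return [obj for obj in list_indirect_objects(pdf_bytes) if obj not in referenced]


# Hand-written fragment for illustration; it is not a complete, valid PDF.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
    b"3 0 obj\n(leftover draft text)\nendobj\n"
    b"trailer\n<< /Root 1 0 R >>\n"
)

print(list_indirect_objects(sample))  # [(1, 0), (2, 0), (3, 0)]
print(find_unreferenced(sample))      # [(3, 0)]
```

Object 3 is never referenced by the catalog, the page tree, or the trailer, so a viewer would not display it, yet its contents remain in the file.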
Metadata associated with images embedded in a PDF file can be used to gather information about the author, as can the PDF’s own metadata and any comments and annotations that were not removed before publishing.
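As a rough illustration of how such metadata can be read straight out of a file, the Python sketch below pulls the Author, Creator, and Producer entries from a PDF information dictionary. It is a simplified example under stated assumptions: the function name and the sample values are hypothetical, and the pattern only handles plain literal strings without escapes or nested parentheses. Real files may also store metadata as hex strings, in XMP streams, or inside compressed objects, which this sketch ignores.

```python
import re

# Matches entries such as "/Author (J. Doe)" written as simple literal strings.
INFO_FIELD = re.compile(rb"/(Author|Creator|Producer)\s*\(([^()\\]*)\)")


def extract_info_fields(pdf_bytes: bytes) -> dict[str, str]:
    """Return the first value found for each tracked metadata field."""
    fields: dict[str, str] = {}
    for key, value in INFO_FIELD.findall(pdf_bytes):
        # Keep the first occurrence; later revisions may repeat the key.
        fields.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return fields


# Hypothetical information dictionary; the names and versions are invented.
sample = (
    b"4 0 obj\n"
    b"<< /Author (J. Doe) /Creator (Example Writer 1.0) "
    b"/Producer (Example PDF Library 2.3 for Windows) >>\n"
    b"endobj\n"
)

for name, value in extract_info_fields(sample).items():
    print(f"{name}: {value}")
```

In this made-up example, the Producer string alone exposes both a tool version and an operating system, which is the kind of leak the study counts.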
Several tools can be used to sanitize PDF files, including Adobe’s Acrobat, and there are four levels of sanitization: Level-0 (full metadata, no sanitization), Level-1 (partial metadata), Level-2 (no metadata), and Level-3 (properly cleaned files, meaning full sanitization with all hidden objects removed).
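One way to read the four levels is as a simple decision rule, sketched below in Python. This is an interpretation for illustration, not the researchers' classifier: the list of tracked fields, the function name, and the `has_hidden_data` flag (which some separate check would have to supply) are all assumptions introduced here.

```python
# Hypothetical set of fields that counts as "full" metadata in this sketch.
TRACKED_FIELDS = ("Author", "Creator", "Producer", "CreationDate")


def sanitization_level(metadata: dict[str, str], has_hidden_data: bool) -> int:
    """Map a file's metadata and hidden-data status to Level 0 through 3."""
    present = [field for field in TRACKED_FIELDS if metadata.get(field)]
    if len(present) == len(TRACKED_FIELDS):
        return 0  # full metadata, no sanitization
    if present:
        return 1  # partial metadata
    # No metadata left: Level-3 only if nothing hidden remains either.
    return 2 if has_hidden_data else 3


print(sanitization_level({"Producer": "Example PDF Library 2.3"}, False))  # 1
print(sanitization_level({}, True))                                        # 2
print(sanitization_level({}, False))                                       # 3
```

The point of the sketch is that stripping metadata only reaches Level-2; a file qualifies as Level-3 only when the hidden data is gone as well.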
For their research, the academics used a set of 39,664 PDF files. Of these, 1,783 (4%) included the author’s name, 30,155 (76%) contained metadata identifying the PDF producer tool, and 16,805 (42%) revealed the operating system used.
The files also leaked email addresses, including official ones (52 files), hardware brands (581 files), and file paths (1,814 files).
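For readers who want to check the proportions, the short Python snippet below recomputes each count as a share of the 39,664 files analyzed. The counts come from the figures above; the labels and the helper name are only for this example.

```python
TOTAL_FILES = 39_664

# Number of files leaking each kind of information, as reported above.
leak_counts = {
    "author name": 1_783,
    "producer tool": 30_155,
    "operating system": 16_805,
    "email address": 52,
    "hardware brand": 581,
    "path": 1_814,
}


def share(count: int, total: int = TOTAL_FILES) -> float:
    """Return count as a percentage of total."""
    return 100 * count / total


for label, count in leak_counts.items():
    print(f"{label:>16}: {count:>6} files ({share(count):.1f}%)")
```

Rounded to whole numbers, the first three rows give the 4%, 76%, and 42% quoted in the study.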
“During our analysis we observed that many agencies include more than one author publishing the PDF files. It is possible to download all the PDF files published on a security agency’s website and observe the author habits, OS trends,” the researchers note.
The analysis also identified 159 employees at 19 agencies who had not updated their tools over a period of two years, a weakness that threat actors could abuse in targeted attacks, especially since 42% of the PDF files leaked operating system data.
While 9,509 (24%) of the analyzed PDF files had been sanitized before publishing, only 3,313 (8%) were sanitized to Level-3. The researchers note that only 3 of the 7 agencies that appear to care about sanitization are doing it properly.
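The kind of tracking described here can be approximated with a simple heuristic: group files by author and flag anyone whose producer string never changes across files created at least two years apart. The Python sketch below shows the idea. It is not the researchers' method, the records and names are invented, and an unchanged producer string is only a hint that software may be outdated, not proof.

```python
from collections import defaultdict
from datetime import date

TWO_YEARS_IN_DAYS = 2 * 365


def find_stale_authors(records):
    """Flag authors whose producer string is identical over two or more years.

    records: iterable of (author, producer, creation_date) tuples.
    """
    by_author = defaultdict(list)
    for author, producer, created in records:
        by_author[author].append((created, producer))

    stale = []
    for author, entries in by_author.items():
        producers = {producer for _, producer in entries}
        dates = [created for created, _ in entries]
        span_days = (max(dates) - min(dates)).days
        if len(producers) == 1 and span_days >= TWO_YEARS_IN_DAYS:
            stale.append(author)
    return sorted(stale)


# Invented records for illustration only.
records = [
    ("author-a", "Example PDF Library 2.3", date(2018, 3, 1)),
    ("author-a", "Example PDF Library 2.3", date(2020, 6, 15)),
    ("author-b", "Example PDF Library 2.3", date(2018, 3, 1)),
    ("author-b", "Example PDF Library 3.0", date(2020, 6, 15)),
    ("author-c", "Example PDF Library 2.3", date(2020, 1, 1)),
    ("author-c", "Example PDF Library 2.3", date(2020, 9, 1)),
]

print(find_stale_authors(records))  # ['author-a']
```

Only author-a is flagged: author-b switched versions, and author-c's files span less than two years.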
“The issue is that popular PDF producer tools are keeping metadata by default with many other information while creating a PDF file. They provide no option for sanitization or it can only be achieved by following a complex procedure. Software producing PDF files need to enforce sanitization by default. The user should be able to add metadata only as an option,” the academics conclude.
