Research: Security Agencies Expose Information via Improperly Sanitized PDFs

Most security agencies fail to properly sanitize Portable Document Format (PDF) files before publishing them, thus exposing potentially sensitive information and opening the door for attacks, researchers have discovered.

An analysis of roughly 40,000 PDFs published by 75 security agencies in 47 countries has revealed that these files can be used to identify employees who use outdated software, according to Supriya Adhatarao and Cédric Lauradoux, two researchers with the University Grenoble Alpes and France’s National Institute for Research in Computer Science and Automation (Inria).

The analysis also revealed that the adoption of sanitization within security agencies is rather low, as only 7 of them used it to remove hidden sensitive information from some of their published PDF files. What’s more, 65% of the sanitized files still contained hidden data.

“Some agencies are using weak sanitization techniques: it requires to remove all the hidden sensitive information from the file and not just to remove the data at the surface. Security agencies need to change their sanitization methods,” the academic researchers say.

PDF files, the researchers note, are collections of indirect objects of eight types (arrays, booleans, dictionaries, names, numbers, streams, strings, and the null object) that are used to store data. These objects may contain hidden data that is not visible when the PDF is viewed.
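To make the object model concrete, the following minimal Python sketch lists the indirect objects declared in a raw PDF byte string and flags those that nothing else refers to. The sample bytes are a hand-written fragment (an assumption for illustration, not a complete PDF), and the function names are hypothetical. Real files often store objects in compressed streams, which a plain text scan like this would miss.

```python
import re

# Hand-written PDF fragment used only for illustration (not a complete file).
SAMPLE = b"""%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R >>
endobj
4 0 obj
(This string object is not referenced by any page)
endobj
"""

# An indirect object starts with "<object number> <generation number> obj".
OBJ_HEADER = re.compile(rb"(?m)^(\d+)\s+(\d+)\s+obj\b")
# An indirect reference looks like "<object number> <generation number> R".
OBJ_REF = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def list_objects(data):
    """Return the object numbers of all indirect objects declared in data."""
    return [int(m.group(1)) for m in OBJ_HEADER.finditer(data)]


def unreferenced_objects(data, root=1):
    """Return declared objects that no reference points to (the root is excluded)."""
    declared = set(list_objects(data))
    referenced = {int(m.group(1)) for m in OBJ_REF.finditer(data)}
    return sorted(declared - referenced - {root})


print(list_objects(SAMPLE))
print(unreferenced_objects(SAMPLE))
```

In this fragment, object 4 is declared but never referenced, so a viewer would not display it even though its content is still in the file.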

Per the NSA, there are 11 main types of hidden data in PDF files, namely metadata; embedded content and attached files; scripts; hidden layers; embedded search index; stored interactive form data; reviewing and commenting; hidden page, image and update data; obscured text and images; PDF comments that are not displayed; and unreferenced data.
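As a rough illustration of how some of these categories can be spotted, the sketch below counts a few PDF name tokens that typically accompany them. The mapping from category to token is an assumption made for this example and covers only part of the NSA list; the function name is hypothetical. A hit suggests the feature is present but does not prove that sensitive data is exposed, and tokens inside compressed streams go undetected.

```python
import re

# Illustrative pairing of hidden-data categories with PDF name tokens.
# The pairing is an assumption for this sketch, not taken from the NSA guidance.
MARKERS = {
    "metadata": [b"/Metadata", b"/Info"],
    "embedded files": [b"/EmbeddedFiles"],
    "scripts": [b"/JavaScript", b"/JS"],
    "hidden layers": [b"/OCProperties"],
    "form data": [b"/AcroForm"],
    "comments and annotations": [b"/Annots"],
}

# Hand-written fragment used only for illustration.
SAMPLE = (
    b"<< /Type /Catalog /OpenAction << /S /JavaScript /JS (app.alert(1)) >> "
    b"/AcroForm 5 0 R >>\n"
    b"<< /Type /Page /Annots [6 0 R] >>\n"
    b"trailer << /Info 7 0 R >>\n"
)


def find_hidden_data_markers(data):
    """Count marker tokens per category; categories without hits are omitted."""
    hits = {}
    for category, names in MARKERS.items():
        count = 0
        for name in names:
            # Match the name only when it is not the prefix of a longer name.
            pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
            count += len(re.findall(pattern, data))
        if count:
            hits[category] = count
    return hits


print(find_hidden_data_markers(SAMPLE))
```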

Metadata attached to images embedded in a PDF file can be used to gather information about the author, as can the PDF’s own metadata and any comments and annotations that were not removed before publishing.

Several tools, including Adobe’s Acrobat, can be used to sanitize PDF files. There are four levels of sanitization: Level-0, full metadata (no sanitization); Level-1, partial metadata; Level-2, no metadata; and Level-3, properly cleaned files (full sanitization, with all hidden objects removed).
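The four levels can be read as a simple decision rule. The sketch below is one possible interpretation, not the researchers’ classifier: the set of tracked metadata fields, the boundary between full and partial metadata, and the `hidden_objects` count are all assumptions introduced for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    FULL_METADATA = 0     # Level-0: no sanitization
    PARTIAL_METADATA = 1  # Level-1
    NO_METADATA = 2       # Level-2
    FULLY_SANITIZED = 3   # Level-3


# Hypothetical set of metadata fields that counts as "full metadata".
TRACKED_FIELDS = ("Author", "Creator", "Producer", "CreationDate")


def sanitization_level(info, hidden_objects):
    """Map metadata fields and a hidden-object count to a sanitization level.

    info: dict of metadata field name -> value found in the file.
    hidden_objects: number of hidden objects still present (assumed to be
    computed elsewhere).
    """
    present = [field for field in TRACKED_FIELDS if info.get(field)]
    if len(present) == len(TRACKED_FIELDS):
        return Level.FULL_METADATA
    if present:
        return Level.PARTIAL_METADATA
    if hidden_objects > 0:
        # Metadata is gone, but hidden data remains below the surface.
        return Level.NO_METADATA
    return Level.FULLY_SANITIZED


print(sanitization_level({"Producer": "ToolA 9.0"}, hidden_objects=2).name)
print(sanitization_level({}, hidden_objects=0).name)
```

Under this reading, a file reaches the highest level only when both the metadata and the remaining hidden data are gone, which matches the researchers’ point that removing surface data is not enough.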

For their research, the academics used a set of 39,664 PDF files. Of these, 1,783 (4%) were found to include author name, 30,155 (76%) contained metadata on the PDF producer tool, and 16,805 (42%) revealed the operating system used.
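The fields counted here usually live in the PDF document information dictionary. The sketch below pulls a few of them out of raw bytes with a regular expression and guesses the operating system from the tool strings. It assumes an uncompressed dictionary with literal (parenthesized) strings and no nested parentheses; the sample values and function names are hypothetical, and the article does not say how the researchers derived the operating system.

```python
import re

# Hand-written information dictionary with made-up values.
SAMPLE = b"""7 0 obj
<< /Author (J. Doe)
   /Creator (Microsoft Word 2013)
   /Producer (Acrobat Distiller 10.1.16 \\(Windows\\))
   /CreationDate (D:20190312101500+01'00') >>
endobj
"""

# Key followed by a literal string; backslash escapes are allowed inside.
FIELD = re.compile(
    rb"/(Author|Creator|Producer|CreationDate)\s*\(((?:\\.|[^\\()])*)\)"
)


def extract_info(data):
    """Return the metadata fields found in data as a dict of str -> str."""
    info = {}
    for key, raw in FIELD.findall(data):
        # Simplification: drop the backslash of every escape sequence.
        value = re.sub(rb"\\(.)", rb"\1", raw)
        info[key.decode("ascii")] = value.decode("latin-1")
    return info


def guess_os(info):
    """Guess the operating system from the Creator and Producer strings."""
    text = " ".join(info.get(key, "") for key in ("Creator", "Producer"))
    for name in ("Windows", "Macintosh", "Mac OS", "Linux"):
        if name in text:
            return name
    return None


info = extract_info(SAMPLE)
print(info["Author"], "|", info["Producer"], "|", guess_os(info))
```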

The files also leaked email addresses, including official ones (52 files), hardware brands (581 files), and file paths (1,814 files).

“During our analysis we observed that many agencies include more than one author publishing the PDF files. It is possible to download all the PDF files published on a security agency’s website and observe the author habits, OS trends,” the researchers note.

The analysis also identified 159 employees at 19 agencies who had not updated their tools over a period of two years. Threat actors could abuse this information in targeted attacks, especially since nearly half of the PDF files leaked operating system data.
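A simple way to picture this kind of profiling: group the harvested metadata by author and check whether the producer string ever changes over a long enough window. The sketch below does that on made-up records. The record layout, the 730-day threshold, and the function name are assumptions for illustration; the article does not describe the researchers’ exact procedure.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (author, producer string, creation date).
RECORDS = [
    ("alice", "ToolA 9.0", date(2017, 1, 10)),
    ("alice", "ToolA 9.0", date(2019, 3, 2)),
    ("bob", "ToolA 9.0", date(2017, 5, 1)),
    ("bob", "ToolA 11.2", date(2019, 6, 1)),
    ("carol", "ToolB 3.1", date(2018, 1, 1)),
    ("carol", "ToolB 3.1", date(2018, 9, 1)),
]


def stale_authors(records, min_days=730):
    """Return authors whose producer string never changed over min_days or more."""
    by_author = defaultdict(list)
    for author, producer, created in records:
        by_author[author].append((created, producer))

    stale = []
    for author, entries in by_author.items():
        entries.sort()  # oldest document first
        producers = {producer for _, producer in entries}
        span_days = (entries[-1][0] - entries[0][0]).days
        if len(producers) == 1 and span_days >= min_days:
            stale.append(author)
    return sorted(stale)


print(stale_authors(RECORDS))
```

Here only the first author is flagged: the second changed tools within the window, and the third has too short a publishing history to conclude anything.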

While 9,509 (24%) of the analyzed PDF files had been sanitized before publishing, only 3,313 (8%) were sanitized to Level-3. The researchers note that only 3 of the 7 agencies that appear to care about sanitization are doing it properly.
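The percentages quoted in the article can be checked directly from the reported counts. The short sketch below recomputes them against the 39,664-file total and also derives the share of sanitized files that reached Level-3. The labels and the helper name are chosen here for illustration.

```python
TOTAL = 39_664

# Counts as reported in the article.
COUNTS = {
    "author name": 1_783,
    "producer tool": 30_155,
    "operating system": 16_805,
    "sanitized (any level)": 9_509,
    "sanitized to Level-3": 3_313,
}


def share(count, total=TOTAL):
    """Return count as a whole-number percentage of total."""
    return round(100 * count / total)


shares = {label: share(count) for label, count in COUNTS.items()}

# Share of sanitized files that were fully cleaned.
level3_of_sanitized = share(
    COUNTS["sanitized to Level-3"], COUNTS["sanitized (any level)"]
)

for label, pct in shares.items():
    print(f"{label}: {pct}%")
print(f"Level-3 share of sanitized files: {level3_of_sanitized}%")
```

The last figure comes out at about 35%, which is consistent with the statement that 65% of the sanitized files still contained hidden data.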

“The issue is that popular PDF producer tools are keeping metadata by default with many other information while creating a PDF file. They provide no option for sanitization or it can only be achieved by following a complex procedure. Software producing PDF files need to enforce sanitization by default. The user should be able to add metadata only as an option,” the academics conclude.

Written By

Ionut Arghire is an international correspondent for SecurityWeek.
