Enumerating Metadata: Part2 pdf files

anestisb — Tue, 03 May 2011 00:26:57 +0000

In my article Gathering & Analyzing Metadata Information I empasized the security risk of hidden metadata info of publicly shared documents and how this info can be gathered massively through certain tools. So I begun writing a series of articles in order to analyze the different types of file metadata and what tools can someone use in order to view and edit/remove them. In the first part, I analyzed the case of exif jpeg metadata and in this article I will continue with the famous Portable Document Format (PDF) file, presenting the appropriate tools to handle the metadata information.

We all use PDF files due to professional or personal needs of document sharing with others. PDF metadata is usually populated by PDF converting applications and might expose undesirable information to third-parties. Especially after the adoption of XMP (after version 1.6) in PDF metadata, there has been an increase in the available hidden information fields. Adobe Acrobat Pro offers an extended editor in order to edit metadata fields, but the Adobe Reader and many other editors and converters do not. Some of the metadata information fields are:

AdHocReviewCycleID
Appligent
Author
AuthorEmail
AuthorEmailDisplayName
Company
CreationDate
Creator
EmailSubject
Keywords
ModDate
PreviousAdHocReviewCycleID
Producer
PTEX.Fullbanner
SourceModified
Subject
Title

There exist a lot of tools that can extract/edit/remove PDF metadata information, but I prefer to use open source tools. So I will analyze the use of the PDF Toolkit (pdftk) under a linux environment. PDFTk does not require Acrobat and can run under Windows, Linux, Mac OS X, FreeBSD and Solaris systems. PDF Toolkit has many features but in this article I will cover the ones that we need for metadata manipulation.

Initially you will have to install pdftk using your distribution’s package manager or by compiling the sources.

In order to extract metadata information from a pdf file you can use the dump_data option as follows:

$pdftk file.pdf dump_data
InfoKey: Creator
InfoValue: PScript5.dll Version 5.2.2
InfoKey: Title
InfoValue: Microsoft Word - Ergastiriaki_Askisi_2011.doc
InfoKey: Author
InfoValue: Administrator
InfoKey: Producer
InfoValue: GPL Ghostscript 8.15
InfoKey: ModDate
InfoValue: D:20110406122119
InfoKey: CreationDate
InfoValue: D:20110406122119
PdfID0: bb8f9ac70cc66e8cabecb4144806f
PdfID1: bb8f9ac70cc66e8cabecb4144806f
NumberOfPages: 3

In order to edit metadata fields you have to extract metadata into a file, edit the desired values in the file and then update the pdf by importing the edited metadata file.

To extract metada to file use the output option:

$pdftk file.pdf dump_data output pdf-metada

Using your preferred text editor, you can edit the pdf-metadata InfoValues (I prefer to leave every field blank). Then you can update the initial file using the edited metadata file.

$pdftk file.pdf update_info pdf-metadata output no-metadata.pdf

In order to automate the above steps, I have wrote a simple script to work in a whole directory containing pdf files.

#!/bin/bash
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
 
if [ $# -ne 2 ] ; then
        echo "Usage: $0 [dir] [meta-file]"
        echo -e "\t[search_dir]"
        echo -e "\t\tDirectory with pdf files"
        echo -e "\t[metafile]"
        echo -e "\t\tFile containing desired metadata"
        exit
fi
 
PDFTK="/usr/bin/pdftk"
SOURCEDIR="$1"
METAFILE="$2"
PDFTMPFILE="/tmp/temp.pdf"
 
for i in $( find $SOURCEDIR -type f -name "*.pdf" ); do
  cp $i $PDFTMPFILE
  $PDFTK $PDFTMPFILE update_info $METAFILE output $i
  rm $PDFTMPFILE
done
 
IFS=$SAVEIFS

And here is a clean metadata file that you can use:

InfoKey: Author
InfoValue:
InfoKey: Company
InfoValue:
InfoKey: CreationDate
InfoValue:
InfoKey: Creator
InfoValue:
InfoKey: ModDate
InfoValue:
InfoKey: Producer
InfoValue:
InfoKey: SourceModified
InfoValue:
InfoKey: Title
InfoValue:

A. Bechtsoudis

Anestis Bechtsoudis » pdf

Enumerating Metadata: Part2 pdf files