Enumerating Metadata: Part 2, PDF Files
In my article Gathering & Analyzing Metadata Information I emphasized the security risk posed by hidden metadata in publicly shared documents and how this information can be harvested at scale with certain tools. So I began writing a series of articles analyzing the different types of file metadata and the tools one can use to view and edit or remove them. In the first part I covered EXIF metadata in JPEG files; in this article I continue with the well-known Portable Document Format (PDF), presenting the appropriate tools for handling its metadata.
We all use PDF files to share documents for professional or personal purposes. PDF metadata is usually populated by the application that converts a document to PDF and might expose undesirable information to third parties. Especially since the adoption of XMP in PDF metadata (from PDF version 1.4 onward), the number of available hidden information fields has increased. Adobe Acrobat Pro offers an extended editor for metadata fields, but Adobe Reader and many other editors and converters do not. Some of the metadata fields are:
- AdHocReviewCycleID
- Appligent
- Author
- AuthorEmail
- AuthorEmailDisplayName
- Company
- CreationDate
- Creator
- EmailSubject
- Keywords
- ModDate
- PreviousAdHocReviewCycleID
- Producer
- PTEX.Fullbanner
- SourceModified
- Subject
- Title
Many tools can extract, edit or remove PDF metadata, but I prefer open source ones, so I will cover the PDF Toolkit (pdftk) in a Linux environment. pdftk does not require Acrobat and runs on Windows, Linux, Mac OS X, FreeBSD and Solaris. It has many features, but in this article I will cover only the ones needed for metadata manipulation.
First, install pdftk using your distribution’s package manager or by compiling it from source.
In order to extract metadata information from a pdf file you can use the dump_data option as follows:
    $ pdftk file.pdf dump_data
    InfoKey: Creator
    InfoValue: PScript5.dll Version 5.2.2
    InfoKey: Title
    InfoValue: Microsoft Word - Ergastiriaki_Askisi_2011.doc
    InfoKey: Author
    InfoValue: Administrator
    InfoKey: Producer
    InfoValue: GPL Ghostscript 8.15
    InfoKey: ModDate
    InfoValue: D:20110406122119
    InfoKey: CreationDate
    InfoValue: D:20110406122119
    PdfID0: bb8f9ac70cc66e8cabecb4144806f
    PdfID1: bb8f9ac70cc66e8cabecb4144806f
    NumberOfPages: 3
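Because each key and its value sit on separate lines, this output is awkward to grep. As a minimal sketch, the following pairs each InfoKey with its InfoValue on a single line. The function name info_pairs is my own invention and not part of pdftk, and the sketch assumes the dump format shown above.

```shell
# info_pairs: hypothetical helper, not part of pdftk.
# Reads dump_data output on stdin and prints one "Key=Value" line per pair.
# Lines that are neither InfoKey nor InfoValue (PdfID0, NumberOfPages, ...)
# are ignored.
info_pairs() {
    awk '
        /^InfoKey:/   { sub(/^InfoKey: */, "");   key = $0; next }
        /^InfoValue:/ { sub(/^InfoValue: */, ""); print key "=" $0 }
    '
}

# Usage (assumes pdftk is installed and file.pdf exists):
#   pdftk file.pdf dump_data | info_pairs
```

With output in this shape, a quick check such as `info_pairs | grep '^Author='` shows whether a given field is populated.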
In order to edit metadata fields you have to extract metadata into a file, edit the desired values in the file and then update the pdf by importing the edited metadata file.
To extract the metadata to a file, use the output option:
    $ pdftk file.pdf dump_data output pdf-metadata
Using your preferred text editor, you can edit the pdf-metadata InfoValues (I prefer to leave every field blank). Then you can update the initial file using the edited metadata file.
    $ pdftk file.pdf update_info pdf-metadata output no-metadata.pdf
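If you want every field blank, the manual editing step can be replaced by a one-line filter. This is only a sketch: blank_info_values is a hypothetical helper name, and it simply empties each InfoValue line of a dump while passing all other lines through unchanged, which mirrors what you would do by hand in a text editor.

```shell
# blank_info_values: hypothetical helper, not part of pdftk.
# Empties every InfoValue line of a dump_data listing read from stdin
# and leaves all other lines untouched.
blank_info_values() {
    sed 's/^InfoValue:.*$/InfoValue:/'
}

# Usage (assumes pdftk is installed and file.pdf exists):
#   pdftk file.pdf dump_data | blank_info_values > pdf-metadata
#   pdftk file.pdf update_info pdf-metadata output no-metadata.pdf
```

Running dump_data on no-metadata.pdf afterwards is a simple way to confirm what pdftk actually wrote.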
To automate the above steps, I have written a simple script that processes a whole directory of pdf files.
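    #!/bin/bash
    # Apply one metadata file to every pdf under a directory, in place.

    if [ $# -ne 2 ]; then
        echo "Usage: $0 [search_dir] [metafile]"
        echo -e "\t[search_dir]"
        echo -e "\t\tDirectory with pdf files"
        echo -e "\t[metafile]"
        echo -e "\t\tFile containing desired metadata"
        exit 1
    fi

    PDFTK="/usr/bin/pdftk"
    SOURCEDIR="$1"
    METAFILE="$2"

    # NUL-delimited find output keeps file names with spaces intact.
    while IFS= read -r -d '' i; do
        # pdftk needs distinct input and output files, so read from a
        # private temporary copy and write back over the original.
        PDFTMPFILE=$(mktemp /tmp/pdfmeta.XXXXXX) || exit 1
        cp "$i" "$PDFTMPFILE"
        "$PDFTK" "$PDFTMPFILE" update_info "$METAFILE" output "$i"
        rm -f "$PDFTMPFILE"
    done < <(find "$SOURCEDIR" -type f -name "*.pdf" -print0)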
And here is a clean metadata file that you can use:
InfoKey: Author
InfoValue:
InfoKey: Company
InfoValue:
InfoKey: CreationDate
InfoValue:
InfoKey: Creator
InfoValue:
InfoKey: ModDate
InfoValue:
InfoKey: Producer
InfoValue:
InfoKey: SourceModified
InfoValue:
InfoKey: Title
InfoValue:
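If you need a different set of keys, a file in the same format can be generated rather than typed. A minimal sketch follows; the function name make_clean_meta and the output file name clean-meta are hypothetical.

```shell
# make_clean_meta: hypothetical helper, not part of pdftk.
# Prints an InfoKey line followed by an empty InfoValue line
# for every key name given as an argument.
make_clean_meta() {
    for key in "$@"; do
        printf 'InfoKey: %s\nInfoValue:\n' "$key"
    done
}

# Usage: recreate the file listed above.
#   make_clean_meta Author Company CreationDate Creator \
#       ModDate Producer SourceModified Title > clean-meta
```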
A. Bechtsoudis
It should be mentioned that pdftk does not change the XMP metadata. This can be accomplished with exiftool. https://gist.github.com/hubgit/6078384
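A minimal sketch of the exiftool approach mentioned above, assuming exiftool is installed. The wrapper name strip_xmp is hypothetical, and the linked gist may take a different route.

```shell
# strip_xmp: hypothetical wrapper around exiftool.
# -XMP:all= deletes every tag in the XMP group;
# -overwrite_original skips the "<file>_original" backup copy.
strip_xmp() {
    exiftool -XMP:all= -overwrite_original "$1"
}

# Usage (assumes exiftool is installed and no-metadata.pdf exists):
#   strip_xmp no-metadata.pdf
```

Keep in mind that, according to the exiftool documentation, its PDF edits are written as incremental updates, so the old metadata may remain recoverable unless the file is subsequently rewritten by another tool.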