Using PDF/A as a Preservation Format
What is PDF/A?
PDF/A (Portable Document Format Archival) is a file format designed for the long-term preservation of digital records, particularly documents. It can also be used for scanned documents. It is an international standard and a subset of the PDF format. One of the great values of the PDF formats is that they are open standards, used widely across the world, and designed to record both images and machine-readable text in one document.
Uses of PDF/A
PDF/A can be used to store many types of records, but it is most valuable as a format for storing long-term copies of digital textual documents, such as Microsoft Word files. When you convert such a file into a PDF/A, the resulting file retains the look and feel of the original document. Each page of the original document appears as a single page in the preservation file, the same fonts are used in both documents, and you can search the text of the PDF/A just as you could in the original. If the document is in color, the color is still there as well. For these reasons, PDF/A is a good format in cases where the appearance of the document matters to interpreting and understanding it.
Other digital files may also be converted to PDF/A, including regular PDFs, email, digital images, and spreadsheets. You could even convert a sequence of digital images into one PDF/A. Any digital file that can be printed can be converted to a PDF/A, though this format is better for some documents than others. The format works best for static files that do not change. It is not appropriate for files that are always in flux, such as databases.
Paper documents can also be converted to PDF/A during scanning. If you do so, it is best to also use optical character recognition (OCR) software to convert the images of letters in the document into electronic text. Whenever you OCR a document, however, expect some errors in the converted text. (See the State Archives’ 2013 Digital Imaging Guidelines for guidance on scanning and OCR’ing textual documents.)
Advantages of PDF/A
PDF/A has many advantages as a file format for the storage of records with long or permanent retention periods. If you are considering other digital file formats as options for long-term or permanent storage, compare their advantages to those of PDF/A. The advantages of PDF/A given below will serve as a checklist of features necessary in any preservation format. Note that you will find file formats that have one or even a few of these advantages, but it is the accumulation of these that makes PDF/A a good preservation format. Microsoft Word, for instance, is ubiquitous and long-lived, but it is missing other essential features that would make it a candidate for long-term storage of records.
Since its inception, the PDF format has been accessible across computing platforms, and the PDF/A format has this same advantage. What this means is that a PDF/A created in a Windows environment will be perfectly readable and usable in a Mac environment, or vice versa.
Something ubiquitous is something that you can find everywhere, and the PDF and PDF/A formats are used by hundreds of millions of people across the world every day. The value of this universal use is that PDF is unlikely to die out as a format anytime soon. Also, since PDF/A is merely a subset of PDF, any software product that can read a PDF can read a PDF/A. Adobe distributes its Adobe Reader software for free, allowing everyone to read a PDF/A at no additional software cost. (This is downloadable at http://get.adobe.com/reader/.)
The PDF format has been in public use since 1993, so this format is unlikely to disappear soon. Again, so long as PDF is around, PDF/As will be easy to read and use.
To understand a digital file, you often need good metadata to give context to the file. This metadata can include many pieces of information, such as the name of the author and the date of a file. For digital files, metadata is often stored within the file itself, so it is important to be able to save this metadata (and even add to it) when converting one digital file to another. PDF/A is specifically designed to support rich metadata.
Supporting perfect conversion
The goal of any conversion program, even microfilming and scanning, is to create a new record that is as much like the original as possible. PDF/A is designed to do much in this area: It saves the look and searchability of the original file, and it requires that the original fonts, colors, and layout be preserved in the PDF/A you produce. The PDF/A format does this by being self-contained, meaning that it saves within the file itself all of the information it needs to display the document. (This includes the fonts and the color definitions, which are not always saved in other file formats.)
An open file format is one in which the specifications are available to anyone and where anyone can use those specifications to develop a software product to create and read the file format. PDF/A has been a published international (ISO) standard since its initial release in 2005, so it clearly meets this criterion.
In the digital world, even more so than the analog world, it is important to ensure that records retain their authenticity: that they are not modified after their creation, and that they do not come to hold information different from what they originally held. No file format alone can ensure authenticity, but PDF/A supports authenticity by being difficult (though not impossible) to modify and by providing document security (such as digital signatures).
All that extensible means is that the readability of a digital file will extend into the future, that a file will not become unreadable as software changes. The PDF/A standard is designed so that the earliest PDF/A will always be readable in the most current PDF viewer. This is assured by the fact that every version of PDF/A is always a subset of the one that comes after it, meaning that the PDF/A-3 standard always supports all the characteristics of the original PDF/A-1—along with a few extra features.
Disadvantages of PDF/A
Versions of PDF/A
Since PDF/A is designed to be a format that extends its features over time, there are already a number of different versions of PDF/A (PDF/A-1, -2, and -3). Beyond this, each generation of the format has different conformance levels, which indicate the degree to which each meets the highest goals of PDF/A.
Defining characteristics of PDF/A
All versions of PDF/A are joined together by a certain subset of supported features, which can primarily be boiled down to one idea: each PDF/A file has to be self-contained, holding within itself all the information needed for it to be read as a complete file. It might seem that all digital files are self-contained, that each one carries everything needed to make it readable as it was intended to be read, but this is not the case. For instance, if you work on a Microsoft Word file at work and then open it at home, it may look much different: If you don’t have the same font at home that you have at work, then the Word file will choose the closest font it can find on your computer. A Word file does not have to store the fonts it uses within itself; instead, it stores only information about the font it uses and then searches for that font in whatever computing environment it is in.
A PDF/A, however, must embed all of its fonts within itself, so that it never has to search for the fonts it needs to display itself fully to a user. To save space, the file will typically store only the subset of each font that it needs, so if the file does not have a capital X within it, the information to show that character is not stored in the file. The fonts embedded in a PDF/A must also be legally usable without restriction, because otherwise the file might not be viewable accurately in the future. Some fonts carry metadata that does not allow them to be embedded in a PDF or that limits the timeframe in which the font may be legally used. If such fonts are in a document you are trying to convert to a PDF/A, then you will not be able to produce a PDF/A from it.
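The idea of font subsetting can be illustrated with a short Python sketch. The function name glyphs_needed is hypothetical, and real PDF software works with glyph identifiers inside font programs rather than with plain characters; the sketch only shows which characters a subset would have to cover for a given text.

```python
def glyphs_needed(text: str) -> set:
    """Return the distinct non-whitespace characters a font subset must cover."""
    return {ch for ch in text if not ch.isspace()}


sample = "Meeting minutes, 12 March"
subset = glyphs_needed(sample)

# The sample never uses a capital X, so a subset font built for it
# would carry no outline for that character.
print("X" in subset)   # False
print("M" in subset)   # True
```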
Besides embedded fonts, a PDF/A also needs device-independent color, which means that the display of color in a file cannot be dependent on the computing device you use to view it. A PDF/A has to use one of two kinds of color encoding to ensure device independence. These two requirements, embedded fonts and device-independent color, are part of a larger rule that a PDF/A file cannot have any reference to outside content.
Also essential to the definition of a PDF/A are the metadata requirements. Because PDF/As are archival files, they must include metadata describing the file, and the file must identify itself as a PDF/A of a certain version. Since the file extension for a PDF/A is the same as for any kind of PDF (they are all .pdf), the file must store metadata within itself that identifies precisely what version of PDF/A it is.
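As a rough illustration of this self-identification, the Python sketch below looks for the two identification properties that the PDF/A standard defines in the file's XMP metadata, pdfaid:part and pdfaid:conformance. Those property names come from the standard rather than from this guide. The function name claimed_pdfa_level is hypothetical, and the sketch assumes the metadata is stored as uncompressed text that a simple pattern search can find. A match shows only what the file claims to be; it does not prove conformance, which requires a validation tool.

```python
import re

# The identification properties may appear as XML elements or as attributes.
_PART = re.compile(rb"pdfaid:part(?:>|\s*=\s*[\"'])\s*(\d)")
_CONF = re.compile(rb"pdfaid:conformance(?:>|\s*=\s*[\"'])\s*([ABUabu])")


def claimed_pdfa_level(data: bytes):
    """Return a label such as 'PDF/A-1b' if the bytes declare a PDF/A part, else None."""
    part = _PART.search(data)
    if part is None:
        return None
    conf = _CONF.search(data)
    level = conf.group(1).decode("ascii").lower() if conf else ""
    return "PDF/A-" + part.group(1).decode("ascii") + level


# A hand-written metadata fragment used only for demonstration.
sample = b"""<rdf:Description rdf:about=""
    xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
  <pdfaid:part>1</pdfaid:part>
  <pdfaid:conformance>B</pdfaid:conformance>
</rdf:Description>"""

print(claimed_pdfa_level(sample))        # PDF/A-1b
print(claimed_pdfa_level(b"%PDF-1.4"))   # None
```

To check a real file, you could pass the function the bytes read from disk, for example Path("report.pdf").read_bytes(), where the file name is a placeholder.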
ISO Standard 19005-1:2005
Based on PDF Reference 1.4 (Acrobat 5)
PDF/A-1 supports the fewest features of any version of PDF/A. It does not support transparency (a feature used to create effects such as text shadowing), because the means of supporting transparency over the long term had not yet been worked out. This version also does not support JPEG2000 compression or embedded files, which are supported in all subsequent versions.
PDF/A-1a conformance level
The highest conformance level of PDF/A-1 is the 1a level, where the “a” stands for “accessible.” This level has all the general features of a PDF/A but also preserves the document’s logical structure. What this means is that the PDF/A-1a stores information to preserve the text stream (or text streams) of a document in reading order. If you create a PDF/A-1a of a newsletter, for instance, the file will know to direct you from one story on the front page directly to where it continues on the fifth. This feature is especially important for the visually impaired, whose screen readers will understand the metadata in the PDF/A-1a and guide them logically through the file. A PDF/A-1a must also specify within itself the language it is written in, and it must include Unicode mapping. Unicode can be thought of as a far larger successor to ASCII. Where ASCII encodes only the basic, unaccented Latin alphabet along with digits and common punctuation, Unicode aims to encode the characters of all the world’s writing systems, which makes for a file with a more accurate representation of text.
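The difference between ASCII and Unicode can be seen in a few lines of Python. This is only an illustration of the two character sets, not of how a PDF/A stores its Unicode mapping, and the function name fits_in_ascii is hypothetical.

```python
def fits_in_ascii(text: str) -> bool:
    """Report whether every character of text can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(fits_in_ascii("Annual Report"))                    # True
print(fits_in_ascii("R\u00e9sum\u00e9"))                 # False (accented e)
print(fits_in_ascii("\u0395\u03bb\u03bb\u03ac\u03b4\u03b1"))  # False (Greek)

# Every character, Latin or not, has a Unicode code point.
print([hex(ord(ch)) for ch in "\u00e9 \u03a9"])          # ['0xe9', '0x20', '0x3a9']
```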
PDF/A-1b conformance level
The PDF/A-1b conformance level is a step down from the 1a. (The “b” in this level stands for “basic.”) This level preserves the visual appearance of the files, just as all PDF/As do, but it does not require as much descriptive information, the use of Unicode, or the preservation of the reading order of the text stream. This makes PDF/A-1b a less accessible format (for the visually impaired), but it still produces a usable preservation file. Since all PDF/As in conformance level b are easier to make, these also tend to be more common.
ISO Standard 19005-2:2011
Based on PDF Reference 1.7
PDF/A-2 extends the format by supporting a number of different features: the embedding of OpenType fonts (instead of only PostScript fonts), JPEG2000 image compression, transparent objects, and layers (which can be hidden to support the viewing of a multi-layer document). This version also defines use of digital signatures (thus better supporting security), specifies requirements for the creation of metadata by the person creating the PDF/A, and allows the embedding of documents within a PDF/A. In the last case, only PDF/As can be embedded in a PDF/A, but this allows users to create sets of documents in one file (such as a series of emails or related reports).
PDF/A-2a conformance level
This level is the same as PDF/A-1a, but with PDF/A-2 extensions.
PDF/A-2b conformance level
This level is the same as PDF/A-1b, but with PDF/A-2 extensions.
PDF/A-2u conformance level
The PDF/A-2u level is identical to the PDF/A-2b, except in one regard: it requires the use of Unicode. (The “u” stands for Unicode.) As with level 2b, level 2u does not represent the logical structure of the document, but it is a bit better than 2b because it better represents text in multiple writing systems.
ISO Standard 19005-3:2012
Based on PDF Reference 1.7
Currently, PDF/A-3 is the latest version of the format, but new versions of PDF/A are expected and inevitable. This newest version includes only one change to the PDF/A-2 version: it allows the embedding of any type of file in a PDF/A. The value of this change is that it supports the practice of maintaining the original source file along with the PDF/A created from it. If you follow the recommended digital preservation practice of always retaining the original digital files along with their preservation copies, this allows you to maintain both versions as part of one file, thus simplifying your preservation practices.
PDF/A-3a conformance level
This level is the same as PDF/A-2a, but with the PDF/A-3 extension.
PDF/A-3b conformance level
This level is the same as PDF/A-2b, but with the PDF/A-3 extension.
PDF/A-3u conformance level
This level is the same as PDF/A-2u, but with the PDF/A-3 extension.
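The relationship among the three versions described above can be summarized as nested sets of features. The following Python sketch is illustrative only: the feature labels are informal shorthand drawn from this guide, not terms from the ISO standards, and the function name minimum_version is hypothetical.

```python
# Informal feature labels taken from the descriptions in this guide.
PDFA1 = frozenset({"embedded_fonts", "device_independent_color", "metadata"})
PDFA2 = PDFA1 | {"opentype_fonts", "jpeg2000", "transparency", "layers",
                 "embedded_pdfa"}
PDFA3 = PDFA2 | {"embedded_any_file"}

VERSIONS = {1: PDFA1, 2: PDFA2, 3: PDFA3}


def minimum_version(required: set) -> int:
    """Return the earliest PDF/A version whose features cover `required`."""
    for version in sorted(VERSIONS):
        if required <= VERSIONS[version]:
            return version
    unsupported = sorted(set(required) - PDFA3)
    raise ValueError(f"no PDF/A version supports: {unsupported}")


# Each version's feature set is a proper subset of the next one.
print(PDFA1 < PDFA2 < PDFA3)                     # True
print(minimum_version({"embedded_fonts"}))       # 1
print(minimum_version({"jpeg2000"}))             # 2
print(minimum_version({"embedded_any_file"}))    # 3
```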
Choosing a version of PDF/A to use
A number of considerations come into play when deciding what version of PDF/A to use, but to some degree any version is fine. If your software can produce only PDF/A-1b, and that version supports all the features you need, then it is a good choice, and a permanent one. Remember, given the extensibility of the PDF/A series, the first version of PDF/A is compliant with all later versions, and there is never a reason to convert a PDF/A to a more recent version of the PDF/A format.
There are a couple of basic rules you can follow in making your choices. The first is that the best conformance level to use is always level a, which will always produce the most accessible file. Barring that, you should choose level u, for its Unicode encoding, but keep in mind that the basic level (b) will almost always be sufficient for your needs. It also makes sense to use the latest version of the series that you can produce, because doing so will allow you to support the greatest number of features.
What might be a more important consideration is your color encoding. If you will need to print out high-quality copies of a document, then you should choose CMYK encoding (which stands for Cyan, Magenta, Yellow, and Black). But if you expect only to be reading your files on a computer screen then RGB Color (for Red, Green, and Blue) is your better choice.
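The rules of thumb above can be written down as a small decision sketch in Python. The function names and parameters are hypothetical. The can_produce argument stands for whatever conformance levels your software and chosen PDF/A version offer, and the scanned flag reflects this guide's note, in the section on converting scanned images, that scanned records can reach only levels b and u.

```python
def choose_conformance(can_produce: set, scanned: bool = False) -> str:
    """Pick the most preferred available level: a first, then u, then b."""
    preference = ["u", "b"] if scanned else ["a", "u", "b"]
    for level in preference:
        if level in can_produce:
            return level
    raise ValueError("no suitable conformance level is available")


def choose_color(needs_high_quality_print: bool) -> str:
    """CMYK for high-quality printing, RGB for on-screen reading."""
    return "CMYK" if needs_high_quality_print else "RGB"


print(choose_conformance({"a", "b"}))                      # a
print(choose_conformance({"a", "u", "b"}, scanned=True))   # u
print(choose_conformance({"b"}))                           # b
print(choose_color(needs_high_quality_print=False))        # RGB
```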
To create a PDF/A, you need a product that can produce PDF/As. One of the most commonly used products is Adobe Acrobat Professional, versions 8 and later. Keep in mind, however, that there are many other software products you can use as well, and some of them have different features that you might find useful. (For a list of some of these products, see “Appendix A: PDF/A Tools.”) Also, a number of general products, such as the Microsoft Office suite, now include tools that create PDF/As, so you may not need to purchase any new software at all, depending on your needs. If you need to create many PDF/As all at once, though, you’ll need to purchase a product focused on the creation of PDF/As, because those products support batch processing, which allows you to convert multiple documents at once.
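To make the idea of batch processing concrete, here is a minimal Python sketch of the loop such a product performs. It is not tied to any real product: the convert argument is a hypothetical stand-in for whatever conversion routine your software provides, and fake_convert below writes only a placeholder file, not a real PDF/A.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def batch_convert(source_dir: Path, target_dir: Path,
                  convert: Callable[[Path, Path], None]) -> List[Path]:
    """Run `convert` on every file in source_dir; return the files that failed."""
    target_dir.mkdir(parents=True, exist_ok=True)
    failures = []
    for source in sorted(p for p in source_dir.iterdir() if p.is_file()):
        target = target_dir / (source.stem + ".pdf")
        try:
            convert(source, target)
        except Exception:
            # Keep going so one bad file does not stop the whole batch.
            failures.append(source)
    return failures


def fake_convert(source: Path, target: Path) -> None:
    """Hypothetical converter used only for this demonstration."""
    if source.suffix == ".db":
        raise ValueError("not a static document")
    target.write_bytes(b"placeholder, not a real PDF/A")


with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp) / "inbox"
    inbox.mkdir()
    for name in ("report.docx", "minutes.docx", "ledger.db"):
        (inbox / name).write_text("example")
    failed = batch_convert(inbox, Path(tmp) / "pdfa", fake_convert)
    print([p.name for p in failed])   # ['ledger.db']
```

Returning the list of failures, rather than stopping at the first error, lets you review and reconvert problem files after the batch finishes.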
The process of converting a digital file into a preservation file is technically called normalization. In this process, the target format (in this case, the PDF/A) has to be one that meets the requirements of a preservation format, so it has to be a format that is not expected to disappear or become unusable in the near future.
What you need to ensure before converting any files is that you have the necessary fonts installed on the computer you are using for the normalization. Without the necessary fonts, you will not be able to create a PDF/A. Of course, this is not an issue when converting a scanned image into a PDF/A.
When to create a PDF/A
You have a choice of when to create a PDF/A, and you may choose to create PDF/As at different points in a record’s life cycle based on your business processes for different records.
At the point of creation
Sometimes, you can create a PDF/A as the original document, thus sidestepping the issue of conversion altogether. Doing so allows you to begin the life of a document in a format you know will last. Do this only with documents that you will not need to modify over time. Usually, PDF/As are made at creation only as output from large databases.
At the point of recordation
Recordation is the process of making a document a record. For instance, you may create multiple versions of a report, but only the final version will be the record. So when you have completed the writing and editing of that report, you can save the file as a PDF/A, which freezes the file, making it more difficult to modify. The other advantage of this technique is that it allows you to distinguish easily between interim drafts and the final version, because the PDF/A will always be the final.
At the point of archiving
Most people still convert documents, paper and electronic, to PDF/As at the point of archiving, that is, at the point when they decide to store the record as an archival record by creating a preservation copy of it.
Scanning from paper
When scanning from paper, you have to set your scanner to create a PDF/A-compliant file. You then scan the document, keeping all pages of the document in one PDF/A, and run OCR, if needed, to convert the text within the document into machine-readable text.
Converting existing scanned images
If you have existing digital images of text documents to convert to PDF/A, you can use PDF software to run OCR and save the file in your chosen version of PDF/A. Only conformance levels b and u are possible when scanning records, and level u is preferred.
Using the Distiller engine
One method of converting a file to a PDF/A is available only in Adobe Acrobat, and that is the Distiller engine. Distiller works separately from Adobe Acrobat, but it is also part of that software. It is usually accessible on the taskbar of your computer. To create a PDF/A with the Distiller, you would choose the appropriate PDF setting and then save or export the file. The Distiller engine may be a little more convenient sometimes, but it has no other advantages, and it cannot produce a fully accessible file (meaning one that meets conformance level a).
Converting from within proprietary software products
You can also create a PDF/A from within many software products that do much more than create PDFs. These include word-processing, spreadsheet, and page layout software. You can usually create PDF/As by “printing” or saving files to PDF/A, but you must be sure to change the PDF settings to your PDF/A preference. You can also set the software’s default to your preferred settings, for ease of use later on.
Converting from regular PDFs
Many people have stores of regular PDFs that they want to convert to PDF/As for preservation purposes. To do this, you might first have to remove any features that are prohibited in PDF/A, or you can run the conversion and see if any errors occur during the conversion. If using Adobe Acrobat, you will have to use its Preflight function to convert a regular PDF to a PDF/A. Since PDF-to-PDF/A conversions are notoriously unsuccessful, you might want to purchase a product designed for such conversions. The product 3-Heights PDF to PDF/A analyzes files in more detail to afford you a higher success rate in conversion. Still, no product will always be able to produce a PDF/A from regular PDFs.
Quality Control Practices
Any form of reprographics (such as microfilming, imaging, or preservation photocopying) must include a quality control step to ensure an accurate copy of the original has been produced. The same is true of the process of normalization.
There are two basic steps to the quality control of a PDF/A. First you must visually inspect the document to ensure that the new file looks just like the old file. If the conversion has somehow gone awry, you should be able to see this in the file and then repeat your conversion processes, after rechecking your settings and methodology. The second step in quality control is to validate the created files’ conformance to the version of the PDF/A standard you are using. To do this, you’ll have to use any of a number of validation tools, including Adobe Acrobat’s Preflight function. For a list of such products, see “Appendix B: PDF/A Validation Tools.”
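One way to keep track of the two quality control steps is to record the outcome of each for every file. The Python sketch below is a hypothetical record-keeping structure, not part of any validation product: the visual check is entered by a human reviewer, and the validation result is whatever your chosen validation tool reports.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QualityCheck:
    filename: str
    visually_matches_original: bool   # step 1: recorded by a human reviewer
    passed_validation: bool           # step 2: reported by a validation tool

    @property
    def accepted(self) -> bool:
        """A file is accepted only if it passes both steps."""
        return self.visually_matches_original and self.passed_validation


def needs_reconversion(checks: List[QualityCheck]) -> List[str]:
    """List the files that failed either step and should be converted again."""
    return [c.filename for c in checks if not c.accepted]


checks = [
    QualityCheck("report.pdf", True, True),
    QualityCheck("minutes.pdf", True, False),
    QualityCheck("newsletter.pdf", False, True),
]
print(needs_reconversion(checks))   # ['minutes.pdf', 'newsletter.pdf']
```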
The preservation of records involves much more than simply creating PDF/As. It requires much work over time, and constant vigilance. You must develop solid conversion procedures, followed by good quality control practices. You will have to create and maintain metadata on the files to make them accessible and usable. You will need to ensure that your environmental controls are good for the storage of electronic files and that your data management controls (especially backup procedures) are sensible and consistent. And you’ll have to ensure one other fact: that your chosen file format for storage remains a valid preservation format. Right now, PDF/A is a good format for the long-term storage of documents, particularly digital textual documents, but that might not be the case ten years from now.
Appendix A: PDF/A Tools
PDF Tools AG
Appendix B: PDF/A Validation Tools
Adobe Acrobat Preflight Function
Callas Software’s pdfaPilot
PDF Tools AG’s 3-Heights PDF Validator
Appendix C: Additional Resources
Extensible Metadata Platform (XMP)
General PDF Resources