The PDF Tar Pits: Where content is trapped, struggles, sinks, and dies…

I’m working on a small government web project at the moment, and I was asked to assess the content to propose some content types. As I looked across the landscape, there were very few content types, really. But then, as I continued my survey, I noticed these suspicious, dark black patches. I couldn’t see beneath their surface, so I started probing, moving closer…THEN, I was caught in sticky, black goo that started to pull me under, as panic rose in my throat…

Caught in the PDF Tar Pits

This is web content that was authored in MS Word, converted to portable document format (PDF) files, and then uploaded to the website, rather than loading it into a content management system (CMS) as text and images. PDF document libraries sprawl insidiously across the internet landscape, trapping living, breathing content in their depths, ossifying into solid rock—unusable, un-reusable—until some content strategist chips away the asphalt to discover the bones of content that is probably extinct, or at least years out of date.

I understand how these PDF libraries are created and why. Really, I do. It’s easier to have content owners whack up content however they want, then just toss the PDFs online, rather than spend the time to consider the content carefully, giving it the time, attention, and respect it deserves.

Let me offer, though, some reasons for helping to pull the poor, thrashing, doomed content back out of the tar.

  1. “Oh, PDFs are searchable, so it’s OK to dump…er…upload them.”

    It’s true. Over the years, Adobe has made provision for lots and lots of embedded metadata, which makes PDFs easier to find. But while search engines can index PDFs so that they can be found, the real human beings searching for that content cannot tell whether a given PDF holds what they’re seeking without opening it. Don’t make your visitors become content excavators. Don’t make them open that PDF to skim it.

  2. “But this is how I want it to look.”

    It’s true. There are many times when our designers spend hours creating beautiful, high-end, printed publications. That’s good. That’s their art and craft. But creating print-ready publications does not release us from the responsibility of making all that content directly accessible as HTML text and images. You can certainly make BOTH available, as indeed you should.

  3. “We’ll just make a content type for PDFs.”

    It’s true. It is indeed important to have content types to represent files in libraries, ready for download. They need to make their metadata available to the CMS, so that appropriate related files can be offered up alongside other primary content. But when text and image content are caught in the PDF tar, their own content type is masked. Unless the content is pulled from the PDF, your CMS cannot manage the true content types correctly. Files of the “PDF” type will be indistinguishable, one from another, all sticky and black as they are.

  4. “This is how we got it from the content owner, so that’s how we’re going to publish it.”

    It’s true. Content owners own their content. (I know, it sounds silly.) They spend a lot of time laboring in MS Word to format it just so. When they hand over their œuvre to you for posting, you’re stuck between appreciation of their efforts, compassion that they spent so much time on it, and horror that it will have to be stripped of all its formatting before it can be reformatted for the CMS. If you have content owners who are open to the liberation of “just give me the text,” then you can make their lives (and yours) easier, and the content escapes oblivion. If not, then although it means a longer road, reformatting the content will take you safely around the tar.

  5. PDFs of unstructured documents can never be reused as structured content.

    Finally, the most important reason for eschewing the PDF is that when content owners create MS Word documents, they almost never—like, ever!—understand the difference between “format” and “structure.” So they skip blithely through their document, clicking bold here, italic there, and changing fonts and colors according to how they think it will communicate their intentions, without capturing the meaning of those formatting changes in the structure of the content. If unstructured documents are then converted to PDFs and put online, they will be unusable as structured content, and meaningless to semantic search.
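
To make the difference between “format” and “structure” concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `Notice` content type, its field names, the sample text, and the two rendering functions are illustrative assumptions, not the API of any particular CMS. The point is only that once the meaning of each piece is captured as a named field, the same content can be reused in more than one place, whereas the formatting-only string gives software nothing to work with.

```python
from dataclasses import dataclass
from html import escape

# Formatting only: the bold text *looks* like a title, but nothing says it is one.
FORMATTED_ONLY = "<b>Road Closure</b><br><i>Main St. is closed on May 3.</i>"


@dataclass
class Notice:
    """Hypothetical content type: each field names what the text means."""
    title: str
    summary: str
    body: str


def render_page(notice: Notice) -> str:
    """Build a full HTML fragment from the structured fields."""
    return (
        f"<article><h1>{escape(notice.title)}</h1>"
        f'<p class="summary">{escape(notice.summary)}</p>'
        f"<p>{escape(notice.body)}</p></article>"
    )


def render_teaser(notice: Notice) -> str:
    """Build a plain-text preview (for search results, say) from the same fields."""
    return f"{notice.title}: {notice.summary}"


# Sample content, invented for illustration.
notice = Notice(
    title="Road Closure",
    summary="Main St. is closed on May 3.",
    body="Crews will repave the block between 1st and 2nd Ave.",
)

print(render_teaser(notice))
print(render_page(notice))
```

The teaser and the page come from one source, so a correction to `summary` shows up in both. A PDF made from a document that was merely formatted offers no such handle: the software sees bold and italic runs, not a title and a summary.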

Time to Drain the Tar Pits

The simplest guidance you can give your clients, content owners, and stakeholders is to reserve PDF files for content that has been designed to be printed, and then only as a supplement to the live web content. You can probably get away with making PDFs available for content that no one will ever really need, like legal reports and other specific content types that will actually be easier to consume as printed documents rather than as web documents. Even in those circumstances, abstracts of that content should be posted, so that content consumers will be able to preview the documents before committing themselves to downloading them.

About: rsgracey

@rsgracey has spent his life moving from one area of interest to another, collecting knowledge, skills, and experience (and TOOLS!) for a wide range of creative and professional fields. If you need someone to help you "think through" any problem of information, communication, and the community, don't hesitate to call him in.

2 comments

  1. Great piece!

    I’ve had similar experiences with the PDF scourge.

    In my experience, I’ve seen another great (and admittedly rare) use of this format along with the printable items you mentioned. In some workflows with regulated content that varies from week to week, the PDF serves a unique purpose. We’re able to clear and change the file rather than the whole page. The content is rarely accessed (and that’s okay). It satisfies the legal needs, and actually shortens the workflow overall. Everyone wins!

  2. rsgracey says:

    Excellent. Thanks so much, Clinton–your reading means a lot to me. Stephen