The Trouble with Semantic Markup: Response to schema.org

First thing this morning, checking in on the Twitter streams, I saw Jeff Evans (@joffaboy) announce the article, “Google, Bing & Yahoo’s New Schema.org Creates New Standards for Web Content Markup.”

My heart began pounding as soon as I read the title. The arch-rivals of search, the biggest dogs in the yard, the great institutions of the web were collaborating to propose a solution to the problem of markup that has plagued me from the beginning: Markup doesn’t really address the substance of the web, just its most basic structure. My hopes were further raised by the mention of a “recipe” content type, which, if you follow my writings, you’ll recognize as a regular example.

I retweeted in a flash: This is what I’ve been looking for!

Then, I visited schema.org, and all my hopes came crashing to Earth again. The Search Giant monsters have created a new monster.

Quick Overview

As I understand it, schema.org is proposing additions to HTML that the “Big Three” search engines are going to interpret, in order to improve the accuracy of search results. By augmenting the markup in web content, they are together settling on a standard vocabulary, so that they will all be recognizing the same language. Presumably, once they’ve built this standard language into their sorting algorithms, any content that has these augmentations will rise to the top of search results, above content that doesn’t.

In principle, that sounds good, doesn’t it?

I’d like to offer some reflections on a few practical implications of this effort.

Corporations try to head off the “free” Semantic Web

For twenty years, for-profit companies have watched in dismay the rise of the “free” World Wide Web. Content is free. Software is free. Social networking is free. And more and more of the web is being driven by “free” efforts, like the World Wide Web Consortium. Volunteerism is a huge threat to capitalism, and they know it.

Among the greatest of these free efforts is the quest for the Semantic Web, which in its simplest terms, seeks a set of standards for describing the meaning of content. Human language is always problematic—as are those who use it—because words are never just words. The meaning of words is rich, contextual, ambiguous, and worst of all, ever changing. There are a lot of really, really smart people, all over the world, almost exclusively volunteer (with some corporate support), working hard to figure this out. If you want to get a sense of the complexity of it all, talk to Rachel Lovinger (@rlovinger) at Razorfish. She’s one of the true semantic geeks, and I’ll just have to take her word on most of what she says. She’s fab.

But instead of supporting this “free” effort, the Search Giants have imposed a de facto standard for the Semantic Web, and they’re pushing it with the strength of their size and popularity. Like the Zen question of the tree in the forest:

If a search engine doesn’t support your semantic standard, will anyone find your content?

I am suspicious of their motives. I read it as an effort to bypass all the work that’s already gone into the Semantic Web.

Markup is more than basic structure and presentation

It has been a great struggle since the beginning of the web to strike the appropriate balance between the structure of content and its presentation. In other words, what content is should be distinct from how content looks. But HTML—even up to HTML5—still only addresses the most basic aspects of content, and even now, offers only tags that address the pieces of the “webpage”—like the “header” and “navigation.” There isn’t markup to describe the content’s substance.

CSS as semantic markers

Cascading Style Sheets (CSS) offer one roundabout approach to the problem, although they were originally meant to control the presentation of content. Let me give an example.

Lists are a primary content structure. We create lists for everything (ingredients, footnotes, archives, contacts, links, Q&A, references, et cetera ad nauseam), but HTML offers us only two general-purpose choices: “Ordered lists” (numbered) and “Unordered lists” (bulleted).

If your website had a list of links in a sidebar and a list of staff names on a contact page, you would use the same basic markup for both:

<ul>
    <li><a href="http://url.for.link/1" title="This is the first list item">Link Text 1</a></li>
    <li><a href="http://url.for.link/2" title="This is the second list item">Link Text 2</a></li>
</ul>

…and then…

<ul>
    <li>Contact Name 1</li>
    <li>Contact Name 2</li>
</ul>

Here’s the problem: The web browser has a default way of rendering these lists, and they will look exactly the same, except that the links will be underlined. If you want to distinguish them from each other, you can add CSS classes, which give you a way to style them differently.

Now, CSS gurus (the best of whom are really content strategists underneath it all) will tell you that you should NEVER use class names that describe how something looks, like class="blue_text". The class names should describe what they are, which is, in fact, a semantic indication:

<ul class="links">
    […]
</ul>

…versus…

<ul class="contacts">
    […]
</ul>

Using these identifiers, the designer can define precisely how each component of a website should look. In a better world, however, they could also be used to identify what they are. Defining standard CSS classes and identifiers as part of XHTML would be one approach to encoding the meaning into markup.
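
To make that concrete, here is a minimal sketch of how a designer might style the two lists differently. The class names are the ones used above; the specific style rules are only assumptions chosen for illustration, not a recommendation.

```html
<style>
    /* Illustrative rules only: the selectors reuse the class names above,
       but the property values are assumptions made for this example. */
    ul.links {
        list-style-type: none;   /* sidebar links: no bullets */
        padding-left: 0;
    }
    ul.contacts {
        list-style-type: square; /* staff names: square bullets */
        font-weight: bold;
    }
</style>

<ul class="links">
    <li><a href="http://url.for.link/1">Link Text 1</a></li>
</ul>

<ul class="contacts">
    <li>Contact Name 1</li>
</ul>
```

The markup of the two lists is still identical apart from the class attribute, which is the only place where the difference in meaning is recorded.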

But not Google, Bing, and Yahoo—Noooooooo.

The Search Giants, though, instead of building on CSS or any other existing approach, have introduced another “standard,” which superimposes another layer of markup on top of the feeble XHTML we already have. Here is the example from schema.org:

<div>
    <h1>Avatar</h1>
    <span>Director: James Cameron (born August 16, 1954)</span>
    <span>Science fiction</span>
    <a href="../movies/avatar-theatrical-trailer.html">Trailer</a>
</div>

Before I go any further, I have to say that this code doesn’t look like any real XHTML I’ve ever seen, and that’s a worry right from the start. Nevertheless…

Once they’ve applied their markup augmentations, again right from schema.org, it becomes:

<div itemscope itemtype="http://schema.org/Movie">
    <h1 itemprop="name">Avatar</h1>
    <div itemprop="director" itemscope itemtype="http://schema.org/Person">
        Director: <span itemprop="name">James Cameron</span> (born <span itemprop="birthDate">August 16, 1954</span>)
    </div>

    <span itemprop="genre">Science fiction</span>
    <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>

There are many, many, many things wrong with this picture.

All the complexity of XML without any of its simplicity

XML is the mother of all markup. In fact, XHTML is just one markup language based on the XML standard. Using XML as the basis of your web code is an elegant—but very complex—solution to defining your content. When it’s all worked out, however, it lets you replace that gobbledygook above with something more like this:

<movie>
    <title>Avatar</title>
    <director>
        <name>James Cameron</name>
        <birthdate>August 16, 1954</birthdate>
    </director>
    <genre>Science fiction</genre>
    <trailer url="../movies/avatar-theatrical-trailer.html" />
</movie>

Putting it simply, by augmenting XHTML with another layer of markup, the Search Giants have complicated the code immensely, making it just as complex as if they had done it in XML, but without any of the benefits of XML’s simple elegance.

Content is rarely this simple

The examples above deceive us, in any case: Yes, we can add fields to CMS templates for isolated metadata like “title” and “director,” but what about the main content itself? What about the meaning embedded in the article? Let’s say we’re writing an article about motion picture history, and we include the following sentence:

<p>James Cameron, best known for directing the sci-fi thriller,
“Avatar,” was born on August 16, 1954.</p>

All of the information in the schema.org example is present in that sentence, and if we were searching for content about James Cameron, we would have to rely on full-text searching.

If we were to use the schema.org augmentation, in order to make it all accessible to the search engines, it would get very messy, something like:

<p>
    <span itemscope itemtype="http://schema.org/Movie">
        <span itemprop="director" itemscope itemtype="http://schema.org/Person">
            James Cameron
        </span>
    </span>,
    best known for directing the
    <span itemscope itemtype="http://schema.org/Movie">
        <span itemprop="genre">sci-fi thriller</span>,
        “<span itemprop="name">Avatar</span>,”
    </span>
    was born on
    <span itemscope itemtype="http://schema.org/Movie">
        <span itemprop="director" itemscope itemtype="http://schema.org/Person">
            <span itemprop="birthDate">August 16, 1954</span>
        </span>
    </span>.
</p>
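
And that still leaves the date as free text. As a rough sketch, assuming the search engines want an unambiguous date rather than “August 16, 1954,” the birth date alone might need something like the HTML5 time element. The datetime value here is my own assumed ISO 8601 rendering of the date in the sentence:

```html
<!-- Sketch only: the datetime value is an assumed ISO 8601 rendering
     of the date in the sentence above. -->
<span itemscope itemtype="http://schema.org/Person">
    <span itemprop="name">James Cameron</span> was born on
    <time itemprop="birthDate" datetime="1954-08-16">August 16, 1954</time>.
</span>
```

That is one more attribute for the author to get right, on top of everything else wrapped around a single sentence.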

Not for mere mortal content authors

Now we come to the main practicality of content: Content authors.

I have marked up a lot of content in my career, and I am an obsessive, precise, exacting author. On the other hand, I’ve implemented CMS templates and tried to configure the best WYSIWYG editors to be able to apply the right CSS classes within content. And I’ve worked with a lot of content owners to teach them the importance of good markup.

Here’s the hard reality: No matter how powerful the technology, no matter how carefully designed and coded the CMS templates, no matter how sophisticated the WYSIWYG editor, and no matter how much training we offer, any markup will ultimately succeed or fail on the content authors’ ability to use it.

And that brings me to my main issue with the Semantic Web.

The Semantic Web cannot rely on encoding alone

If the main difficulty of searching the web is in understanding the meaning of the content (given all the languages, people, markup skill, and so many more factors), then we can really only solve it the hard way: Intelligent reading. We cannot rely on the human beings who create content to make it speak for itself, by making sure that everything is tagged correctly. They just can’t do it.

We cannot rely on markup because XHTML is insufficient, XML is too complicated for more than data structures, and the schema.org effort is unrealistic. In the end, each method may play a limited role in addressing the findability of content, but ultimately, it will require some other kind of intelligence—intelligence in the interpreting of meaning, rather than its encoding.

I don’t know what will happen with the schema.org markup augmentations. Personally, I hope that it just sags under its own weight and disappears into the marshes from whence it came. And I heartily encourage all the folks who are working on this problem to keep at it: There’s no path to success here but the long one. Eventually, perhaps new kinds of computers will be able to understand us weird, wonderful human beings, but for now, we remain inscrutable to the mechanical, algorithmic mind.

Content Modeling is more than “fields”

When content management folk talk about “content modeling,” they are usually referring to the process of building templates for a CMS. Besides the Content Management Bible by Bob Boiko, which is a great place to see how a lot of CMSes work, I found a series of excellent overviews of the discipline by Deane Barker of Blend Interactive, Inc., at Gadgetopia.

Barker says:

“Content modeling is the process of converting logical content concepts into content types, attributes, and datatypes.”

In academia, you can find inscrutably technical research on content modeling as the process of identifying the structure of documents algorithmically. (This gem from MIT scintillates! Content Modeling Using Latent Permutations, by Chen, Branavan, Barzilay, and Karger. 2009.)

But if that’s what is meant by “content modeling,” then there are essential aspects missing.

As content strategists, we face this technical view all the time, which I believe is descended from IT disciplines like “data modeling” for database design. We come on the scene talking about content purpose and process, and technologists ask us for template requirements, metadata fields, and data types. In these days of XML standards and the quest for the Holy Semantic Web, we find ourselves pushed into the thick of technical specification before we’ve had a chance to imagine what the content is supposed to be and do, let alone how it should be structured.

Returning to art

In my view, we’d be nearer the truth of “modeling” if we took our cues from other disciplines:

  • When a painter undertakes a monumental work of art, she doesn’t just run in with paintbrushes blazing. She sketches from life. She does études. She makes early decisions about what works and what doesn’t.
  • Murals often begin as drawings in miniature, which are enlarged to scale, then transferred to the wall.
  • The sculptor “models” in clay before casting in bronze.
  • The industrial designer creates digital “models” before production.
  • Developers create prototypes (just “models” by another name) before turning the coders loose.

Models serve as demonstration and instruction to the producers, the assistants, and the artists themselves. They remind and guide. They provide format and boundaries to inspire greater creativity.

Content must be modeled in this creative sense, as well as in the technical sense.

Some suggestions for modeling

  • Banish the “basic page” from your content types. The “webpage” is the content parallel to the “miscellaneous” category in information architecture. Far from being your standard content type, it should be your very last resort.
  • Ask the simple questions. Why are we creating this content form? What are people supposed to do with it? What does that mean for the other kinds of content we produce? How can they be combined into content “super-types?”
  • Do some content studies and sketches. Before you define technical requirements, spend time whipping up some real content to see how it behaves in your domain. If you already have content, gauge the consistency of its form from one piece to the next.
  • Test the usability of your content. Like a user interface, you should see whether people can actually use your content in the way it was intended. Do they get from it what you hoped they would?
  • Define the “rules” for each content type. You’re establishing conventions for the content creators, so they know what they’re doing, and so they can do it consistently over time.

By modeling your content in the artistic sense—by setting the forms and boundaries even before the content is “designed”—all the technical content management exigencies, like “fields” and “data types,” are set in their proper perspective. Templates are simply the mold into which your material is poured and out of which the sculpture emerges, fully formed.

Sophie’s choice: Well-crafted content or empowered content owners?

It’s only recently that I’ve come to appreciate a hard truth about myself: I’m a content geek. I know I’m not the only one. If you’re reading this post, you’re probably a content geek, too. But if you’re like me, the realization that you might be fundamentally different from the normal people around you has been a long time in coming, and it’s only after years of stripping the formatting out of other people’s documents and spending more hours in “code view” than in WYSIWYG that it becomes clear: Not everyone can do what we do.

And as a content manager, I have a terrible choice to make: Do I apply my content geek powers toward crafting web content myself, or do I hand the keys of my CMS over to the content owners, who say that if only they had access, they’d create and maintain all their own content?

This is a timely question of content strategy because not only does a content strategy shape the form and substance of your web content, but it also specifies how it gets designed and produced. So who’s going to do it: The geeks or the owners? Two recent blog posts make the case very well:

Seth Gottlieb at Content Here debunks the “Myth of the Occasional CMS User,” and calls on all organizations not to believe the promises:

“Often, one of the big justifications for a CMS is removing the webmaster bottleneck and delegating content entry to the people who have the information. The implicit assumption is that everyone wants to directly maintain their portion of the website but technology is standing in the way. But if you visit a CMS customer a while after implementation you are likely to find that the responsibility of adding content is still concentrated in a relatively small proportion of the employee population.”

Jeff Cram at The CMS Myth expands on Gottlieb’s post and advises that you “Stop Letting People Use Your CMS.”

“So, I’ll take it one step further than Seth. Stop letting people use your CMS unless they are an integrated part of your web and editorial team and need to be in it on a regular basis. Even then, they may not need to be in the tool.”

What is Content Craft?

Being a content geek, at least for me, means that I see the crafting of content through insect-like, multifaceted eyes:

First, there’s the substance of the content. What is it? For whom is it intended? What’s its underlying message? What are we expecting it to accomplish?

Second, there’s the fashioning of it. Have we chosen the right language, the right images, the right arrangement, the right granularity, and the right length to accomplish our goals?

So far, so good. Any good writer can do as much.

But then, there’s the structure of the content. Not in the sense of how the piece is composed, but of the technical aspects of the headings, the various kinds of paragraphs, the selection of appropriate keywords for linking to other content, and its position within the website.

THEN, there are the content modeling and metadata. How is this class of content the same as or different from other classes? Into which section of the site does this content go? How will it be tagged so that it comes up in the right places or at the tops of searches? Can I really build this specific set of attributes into my CMS templates?

And finally, there’s the markup. What HTML elements are we using (and NOT using)? How have we chosen identifiers and classes for the CSS code, so that it reads like Ibsen in the source view?

Content geeks can manage all these facets like playing with Legos. We have an instinctive compass that points true north: We connect the pieces across web space and keep the links consistent.

Subject Matter Experts, Not Content Experts

Once upon a time, I was all about empowering my content owners. I tried to teach them the difference between “bold” and a “heading.” I tried to teach them to use “styles” in MS Word, rather than formatting each piece on top of “normal.” I showed them how beautiful and consistent content could be when you paid attention to these simple details, how you could instantly reshape the whole piece by shifting templates. Their eyes would just glaze over, or they would simply decide that it was far too much work. Now, I’ve decided that for the really important stuff, I do it myself, and with pride.

In the end, there is a profound difference between subject matter expertise and content crafting skill. Every now and then, the two can coincide in a single human being. For the most part, however, when content owners pour their subject matter expertise into web pages, someone else ends up going through it to “clean it up,” not out of a pathological need for beautiful code, but because the whole user experience will be best served by clean, consistent, well-crafted content. And isn’t our website really there to serve the visitors?

The Bottleneck is the Real Work

When your CMS sales rep sings the praises of the system you’re evaluating, and especially how content owners’ creativity and productivity will be unleashed because they won’t need any “technical skill” to build web pages, don’t you believe the bull. Publishing web content takes technical skill and time, no matter what system or tools you use, and just as in every other professional endeavor, it is best entrusted to web content gee…er…professionals like you and me.

The Mythic Bestiary: Content Owners

Shhhhh! Look over there!

The Neverland of content strategy is full of wondrous creatures. We invoke their names in meetings. Their titles appear on project plans. We assign them tasks and responsibilities, and we expect them to deliver. Of these legendary beasts, none is more elusive than the Content Owner.

Though we never see them working, like little elves, content owners mysteriously fill our web pages with high-quality, completely relevant, irresistible content. Chanting their spell of “lorem ipsum dolor sit amet…” they spin strawcases to gold and keep every project on time, etc., etc.

Some of their powers they claim for themselves, and some we confer upon them:

Content owners want responsibility for creating and posting content.

If only we gave them complete freedom and access to the content management system (CMS), content owners could—and would!—take full responsibility for creating and posting all their own content. There would be no more bottlenecks! They would follow the styleguide. They would keep content fresh and current. And of course, because of the CMS’s WYSIWYG editor, they wouldn’t need to learn HTML or tagging.

Content owners are content experts.

Content owners know exactly what they want and how it should appear on a web page. They know how the navigation should work and which labels will eliminate confusion. They require minimal technical support, and can be relied upon to make savvy decisions. This is because…

Content owners understand the deepest desires of their audiences.

Content owners are continually in touch with their audiences and understand their requirements intimately. They have no need of data or testing. They have no time for research: Their content is too important for research, anyway. When cornered and pressed to support their assertions, they turn nasty and threaten spells to bring down the wrath of the C-Suite.

But if you ask me…

I don’t think Content Owners really exist—certainly not in these mythical terms. It’s all superstition, fairytale, and wishful thinking about some of the hardest work in publishing: Content Strategy.

There is no easy path to successful content, and the hard work cannot be foisted off onto content owners, even if they’re real—and real good—people serving in that role. They can be invited to help in the production process, but it’s too much to expect of them that they can do it all.

Yet on the other hand, content owners need to understand that they can’t do it by themselves. Content ownership is not content dictatorship. They may indeed know the information and subject matter that eventually becomes content, but it is precisely because they own it that they are not in the best position to turn it into good content. It requires distance and collaboration with content strategists.

Content must be planned and created in the context of all the disciplines of user experience. We can’t rely on elves or fairies—or even content owners—to make it happen by magic.

You’ll Wish You’d Had a Content Strategy Before Implementing Content Management

Two years ago, I had my first experience of implementing a content management system (CMS). I will withhold all names of organizations, people, and platforms to protect the guilty, but let’s just say it’s painful even now to recall the confusion, the groping, the anger, and the desperation that my group went through. There are probably support groups out there to help us recovering CMS implementers, but in case you’re teetering on the brink of your own CMS implementation, about to go over the edge into the abyss, I want to offer some perspectives from our experience.
