First thing this morning, checking in on the Twitter streams, I saw Jeff Evans (@joffaboy) announce the article, “Google, Bing & Yahoo’s New Schema.org Creates New Standards for Web Content Markup.”
My heart began pounding as soon as I read the title. The arch-rivals of search, the biggest dogs in the yard, the great institutions of the web were collaborating to propose a solution to the problem of markup that has plagued me from the beginning: Markup doesn’t really address the substance of the web, just its most basic structure. My hopes were further raised by the mention of a “recipe” content type, which, if you follow my writings, you’ll recognize as a regular example.
I retweeted in a flash: This is what I’ve been looking for!
Then, I visited schema.org, and all my hopes came crashing to Earth again. The Search Giant monsters have created a new monster.
Quick Overview
As I understand it, schema.org proposes additions to HTML that the “Big Three” search engines will interpret in order to improve the accuracy of search results. By augmenting the markup in web content with a standard vocabulary they have settled on together, all three engines will recognize the same language. Presumably, once they’ve built this standard language into their sorting algorithms, any content that has these augmentations will rise to the top of search results, above content that doesn’t.
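In the simplest case, the augmentation looks something like this (my own minimal illustration, not an example taken from schema.org):

<!-- Plain markup: a search engine sees only the word "Avatar" -->
<span>Avatar</span>

<!-- Augmented markup: the same word, declared to be the name of a movie -->
<span itemscope itemtype="http://schema.org/Movie">
  <span itemprop="name">Avatar</span>
</span>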
In principle, that sounds good, doesn’t it?
I’d like to offer some reflections on a few practical implications of this effort.
Corporations try to head off the “free” Semantic Web
For-profit companies have watched in dismay for twenty years as the “free” World Wide Web has risen. Content is free. Software is free. Social Networking is free. And more and more of the web is being driven by “free” efforts, like the World Wide Web Consortium. Volunteerism is a huge threat to capitalism, and they know it.
Among the greatest of these free efforts is the quest for the Semantic Web, which in its simplest terms, seeks a set of standards for describing the meaning of content. Human language is always problematic—as are those who use it—because words are never just words. The meaning of words is rich, contextual, ambiguous, and worst of all, ever changing. There are a lot of really, really smart people, all over the world, almost exclusively volunteer (with some corporate support), working hard to figure this out. If you want to get a sense of the complexity of it all, talk to Rachel Lovinger (@rlovinger) at Razorfish. She’s one of the true semantic geeks, and I’ll just have to take her word on most of what she says. She’s fab.
But instead of supporting this “free” effort, the Search Giants have imposed a de facto standard for the Semantic Web, and they’re pushing it with the strength of their size and popularity. Like the Zen question of the tree in the forest:
If a search engine doesn’t support your semantic standard, will anyone find your content?
I am suspicious of their motives. I read it as an effort to bypass all the work that’s already gone into the Semantic Web.
Markup is more than basic structure and presentation
It has been a great struggle since the beginning of the web to strike the appropriate balance between the structure of content and its presentation. In other words, what content is should be distinct from how content looks. But HTML, even up to HTML5, still addresses only the most basic aspects of content; the newest tags describe only the pieces of the “webpage,” like the “header” and the “navigation.” There isn’t markup to describe the content’s substance.
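To illustrate what I mean (this snippet is mine, not from any specification): even HTML5’s newest structural tags can say where a block sits on the page, but not what its content is about.

<header>Site banner</header>
<nav>Site navigation</nav>
<article>
  <!-- HTML5 can declare "this is an article," but it has no way to say
       that the article is a recipe, a movie review, or a staff directory -->
  Article content
</article>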
CSS as semantic markers
Cascading Style Sheets, in a roundabout way, offer one approach to the problem, although they were originally meant to control the presentation of content. Let me give an example.
Lists are a primary content structure. We create lists for everything—ingredients, footnotes, archives, contacts, links, Q&A, references, etcetera ad nauseam—but HTML offers us only two choices: “ordered lists” (numbered) and “unordered lists” (bulleted).
If your website had a list of links in a sidebar and a list of staff names on a contact page, you would use the same basic markup:
<ul>
  <li><a href="http://url.for.link/1" title="This is the first list item">Link Text 1</a></li>
  <li><a href="http://url.for.link/2" title="This is the second list item">Link Text 2</a></li>
</ul>
…and then…
<ul>
  <li>Contact Name 1</li>
  <li>Contact Name 2</li>
</ul>
Here’s the problem: The web browser has a default way of rendering these lists, and they will look exactly the same, except that the links will be underlined. If you want to distinguish them from each other, you can add CSS classes, which give you a way to style them differently.
Now, CSS gurus (the best of whom are really content strategists underneath it all) will tell you that you should NEVER use class names that describe how something looks, like class="blue_text". The class names should describe what the content is, which is, in fact, a semantic indication:
<ul class="links"> […] </ul>

…versus…

<ul class="contacts"> […] </ul>
Using these identifiers, the designer can define precisely how each component of a website should look. In a better world, however, they could also be used to identify what they are. Defining standard CSS classes and identifiers as part of XHTML would be one approach to encoding the meaning into markup.
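Here is a sketch of the dual role I mean (the class names and styles are my own, purely illustrative): the same names that drive the presentation could, if standardized, be read as semantics.

/* Styling keyed to what the lists ARE, not how they look */
ul.links li {
  display: inline;      /* render the link list as a horizontal menu */
  padding-right: 1em;
}
ul.contacts li {
  list-style: none;     /* render the contact list as plain rows */
  font-weight: bold;
}

A search engine that agreed on a standard set of such names could, in principle, read them exactly the way the stylesheet does.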
But not Google, Bing, and Yahoo—Noooooooo.
The Search Giants, though, instead of building on CSS or any other existing approach, have introduced another “standard,” which superimposes another layer of markup on top of the feeble XHTML we already have. Here is the example from schema.org:
<div>
  <h1>Avatar</h1>
  <span>Director: James Cameron (born August 16, 1954)</span>
  <span>Science fiction</span>
  <a href="../movies/avatar-theatrical-trailer.html">Trailer</a>
</div>
Before I go any further, I have to say that this code doesn’t look like any real XHTML I’ve ever seen, and that’s a worry right from the start. Nevertheless…
Once they’ve applied their markup augmentations, again right from schema.org, it becomes:
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <div itemprop="director" itemscope itemtype="http://schema.org/Person">
    Director: <span itemprop="name">James Cameron</span> (born <span itemprop="birthDate">August 16, 1954</span>)
  </div>
  <span itemprop="genre">Science fiction</span>
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>
There are many, many, many things wrong with this picture.
All the complexity of XML without any of its simplicity
XML is the mother of all markup. In fact, XHTML is just one markup language based on the XML standard. Using XML as the basis of your web code is an elegant—but very complex—solution to defining your content. When it’s all worked out, however, it lets you replace that gobbledygook above with something more like this:
<movie>
  <title>Avatar</title>
  <director>
    <name>James Cameron</name>
    <birthdate>August 16, 1954</birthdate>
  </director>
  <genre>Science fiction</genre>
  <trailer url="../movies/avatar-theatrical-trailer.html" />
</movie>
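And “very complex” is no exaggeration: a custom vocabulary like this means nothing until it is formally defined somewhere. Here is a minimal sketch of a DTD for the hypothetical elements above (my own invention, just to show the kind of machinery required):

<!ELEMENT movie (title, director, genre, trailer)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT director (name, birthdate)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT birthdate (#PCDATA)>
<!ELEMENT genre (#PCDATA)>
<!ELEMENT trailer EMPTY>
<!ATTLIST trailer url CDATA #REQUIRED>

And every browser, CMS, and search engine would have to know about that definition before the elegant version could replace the gobbledygook.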
Putting it simply, by augmenting XHTML with another layer of markup, the Search Giants have complicated the code immensely, making it just as complex as if they had done it in XML, but without any of the benefits of XML’s simple elegance.
Content is rarely this simple
The examples above deceive us, in any case: Yes, we can add fields to CMS templates for isolated metadata like “title” and “director,” but what about the main content itself? What about the meaning embedded in the article? Let’s say we’re writing an article about motion picture history, and we include the following sentence:
<p>James Cameron, best known for directing the sci-fi thriller,
“Avatar,” was born on August 16, 1954.</p>
All of the information in the schema.org example is present in that sentence, but if we were searching for content about James Cameron, we would have to rely on full-text searching to find it.
If we were to use the schema.org augmentation to make it all accessible to the search engines, it would get very messy. Notice, in what follows, that the sentence forces the same movie to be declared as three separate items, with nothing in the markup tying them together:
<p>
  <span itemscope itemtype="http://schema.org/Movie">
    <span itemprop="director" itemscope itemtype="http://schema.org/Person">
      James Cameron
    </span>
  </span>, best known for directing the
  <span itemscope itemtype="http://schema.org/Movie">
    <span itemprop="genre">sci-fi thriller</span>, “<span itemprop="name">Avatar</span>
  </span>,”
  was born on
  <span itemscope itemtype="http://schema.org/Movie">
    <span itemprop="director" itemscope itemtype="http://schema.org/Person">
      <span itemprop="birthDate">August 16, 1954</span>
    </span>
  </span>.
</p>
Not for mere mortal content authors
Now we come to the main practicality of content: Content authors.
I have marked up a lot of content in my career, and I am an obsessive, precise, exacting author. I have also implemented CMS templates and tried to configure the best WYSIWYG editors to apply the right CSS classes within content. And I’ve worked with a lot of content owners to teach them the importance of good markup.
Here’s the hard reality: No matter how powerful the technology, no matter how carefully designed and coded the CMS templates, no matter how sophisticated the WYSIWYG editor, and no matter how much training we offer, any markup will ultimately succeed or fail on the content authors’ ability to use it.
And that brings me to my main issue with the Semantic Web.
The Semantic Web cannot rely on encoding alone
If the main difficulty of searching the web is in understanding the meaning of the content (given all the languages, people, markup skill, and so many more factors), then we can really only solve it the hard way: Intelligent reading. We cannot rely on the human beings who create content to make it speak for itself, by making sure that everything is tagged correctly. They just can’t do it.
We cannot rely on markup because XHTML is insufficient, XML is too complicated for more than data structures, and the schema.org effort is unrealistic. In the end, each method may play a limited role in addressing the findability of content, but ultimately, it will require some other kind of intelligence—intelligence in the interpreting of meaning, rather than its encoding.
I don’t know what will happen with the schema.org markup augmentations. Personally, I hope that it just sags under its own weight and disappears into the marshes from whence it came. And I heartily encourage all the folks who are working on this problem to keep at it: There’s no path to success here but the long one. Eventually, perhaps new kinds of computers will be able to understand us weird, wonderful human beings, but for now, we remain inscrutable to the mechanical, algorithmic mind.