Since we're having so much fun talking about tables and CSS and whatnot I thought I'd see how long I could keep the party rolling. Yes, I'm being ironic. I actually didn't want to get sucked into this. But I made an offhand comment in my last post that needs clarification:
HTML has no semantics beyond how it is rendered!
That resulted in comments like this one:
No. You're very, very wrong about this. The whole field of SEO exists because you are wrong. The title element, depths of heading, everything is used. Just because you can't see it being used, it doesn't mean it isn't.
It's commonly said that Google is the biggest blind user in the world. If your content isn't blind-accessible, it's likely to be less Google-accessible. Accessibility is for everyone.
This is too big a misconception to let slide, so here I go again.
First, notice that I did not say that HTML has no semantics. I said that HTML has no semantics beyond how it is rendered. It turns out that this is not quite correct, but the way in which it is not quite correct is subtle and rather difficult to explain. So I'll start with an example. There are two links there. They both go to documents whose HTML content is identical. The only difference between the two documents is the style sheet. Those of you who use custom style sheets in your browser will probably need to turn them off. Those of you who are blind (do I have any blind readers?) will completely not get the point of the example.
And that is precisely the point of the example.
Back in 1995 I wrote an essay for the Risks forum called the source of semantic content. Back then it was in the context of an attempt by then-Senator Jim Exon of Nebraska to pass legislation limiting the transmission of pornography on the Internet. The key sentence of that essay is as relevant today as it was fourteen years ago:
[T]he semantic content of bit streams is in the eye of the beholder, and ... the apparent correspondence between bits and semantics is the result of engineering convention and not an inherent property of the bits.
Let me try to explain what I mean by that by way of a non-computer-related example. What gives words their meaning? One answer is that their meanings are given by their dictionary definitions. (The French take this to a whole nuther level.) This would be analogous to saying that the semantics of HTML are given by the standards published by the W3C. It is not an unreasonable answer. But it is wrong.
The best example I know of to show that it is wrong is so powerful that I almost dare not use it. It is the word "niggard", with an A, no E, and a D at the end. The dictionary definition of this word is, "A stingy, grasping person; a miser." A negative connotation to be sure, but on its face less offensive than some of the epithets that people regularly sling at one another. And yet the actual semantics of the word, which is to say, its actual meaning as measured in terms of the effect it has when it is employed, is radically different from its dictionary definition. (The word "ignorant" has a similar property. So does "liberal", at least in the U.S.)
Neither words nor computer programs derive their true semantics by fiat. They derive their semantics by the effects they have on the world. And the principal effect that HTML has on the world is that it gets rendered in browsers and those renderings are viewed by humans. The W3C and CSS purists can rant and rave all they want, but the fact of the matter is that what the HTML in the above example really means depends on which style sheet you use to render it.
And that is true for any HTML document. It's fairly easy to make a CSS style sheet that will take any HTML document and cause it to render in a completely arbitrary way. (How to do it is left as an exercise for the reader.) This is not just an academic observation; techniques like this are frequently used by spammers to get past filters.
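To make the point concrete without spoiling the exercise, here is a toy sketch in Python. It is not a browser and not real CSS. The "style sheet" is a hypothetical dictionary mapping class names to properties, the only property it understands is `display: none`, and the names `VisibleText`, `render_text`, `SHEET_ONE` and `SHEET_TWO` are invented for this illustration. Even so, it shows the same markup producing two different visible texts depending on which sheet is applied.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not touch the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class VisibleText(HTMLParser):
    """Collect only the text that a (toy) style sheet leaves visible.

    Assumption: `stylesheet` maps a class name to a dict of CSS properties.
    Real selectors, the cascade, and everything except `display: none`
    are deliberately ignored.
    """

    def __init__(self, stylesheet):
        super().__init__()
        self.stylesheet = stylesheet
        self.hidden_stack = []  # one bool per open element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        hidden = any(
            self.stylesheet.get(c, {}).get("display") == "none" for c in classes
        )
        # An element inside a hidden element is hidden too.
        parent_hidden = bool(self.hidden_stack) and self.hidden_stack[-1]
        self.hidden_stack.append(hidden or parent_hidden)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_stack:
            self.hidden_stack.pop()

    def handle_data(self, data):
        if not (self.hidden_stack and self.hidden_stack[-1]):
            self.chunks.append(data)


def render_text(html, stylesheet):
    """Return the whitespace-normalized text left visible by `stylesheet`."""
    parser = VisibleText(stylesheet)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# One document, two style sheets.
HTML = (
    '<p><span class="a">Buy cheap pills now.</span> '
    '<span class="b">Minutes of the garden club.</span></p>'
)
SHEET_ONE = {"a": {"display": "none"}}
SHEET_TWO = {"b": {"display": "none"}}

if __name__ == "__main__":
    print(render_text(HTML, SHEET_ONE))
    print(render_text(HTML, SHEET_TWO))
```

A filter that reads only the markup sees both sentences. A human looking at the rendered page sees one of them, and which one depends entirely on the style sheet.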
The counter to this is that these people are undermining the true semantics of HTML. They are somehow "cheating" or "ruining the web" or some such thing. I have a certain amount of sympathy for this position. I am no fan of spam. The world would be a better place if everyone followed the rules. But this is just like arguing that the world would also be a better place if everyone agreed to abide by the dictionary definition of the word "niggard". You can argue this until you are blue in the face. That will not change the fact that if you call a person with black skin a "niggard" you will likely cause more offense than if you called her a "miser". That is reality.
The reality for HTML is that its semantics are determined primarily by how it renders on the browsers that people actually use. And at the moment that includes IE6. The H1 tag doesn't mean "top level heading" because the W3C says it does. It means top-level heading because browsers by default render it in big bold type and as a block rather than inline. And if you believe that I have the causality backwards, that browsers render H1 big and bold because it means top-level heading, imagine an alternate world where the default style for H1 was NOT big bold type, and ask yourself how many people would use <font size=+10> instead. (Actually, you don't have to imagine. Just look at how many people use the "I" tag instead of the "EM" tag, or how many web forms are out there that don't use LABELs. Actually, I use "I" instead of "EM" myself. It's partly out of habit (when I started writing HTML there was no EM tag) and partly because "I" is less typing, uses up less screen real estate, and accomplishes the same thing for my purposes.)
Now, I said at the beginning that all this was not quite true, and the thing that makes it not quite true is SEO, which is to say, Google. Google imposes a set of operational semantics on HTML that are substantially different from those imposed by browsers, and it is those differences that lie at the root of many a web designer's sleepless nights and are responsible for the existence of the SEO industry. But here's the thing: even mighty Google has to yield to the operational semantics of browser rendering. Google puts in enormous effort to try to glean how a page will render, and not just extract its apparent content as defined by the W3C. The reason they do this is simple: if they don't, they will be overwhelmed with spam. If Google could generate their index by rendering every page and running OCR on it, they would. The only reason they don't is that it's prohibitively expensive.
By the way, if you still doubt that the semantics of HTML are inextricably bound to rendering, consider the P tag and tell me how you define a paragraph without talking about rendering. A paragraph is an inherently visual concept. If you doubt this, do the following experiment: take an audio recording of Barack Obama's inauguration speech and transcribe it. Now compare your transcription with the original text. Count the number of places you put in a paragraph break where there was none in the original text, and the number of places there was a paragraph break in the text that you missed. Now compare that count to the number of places where you missed (or added) the end of a sentence. Now ask yourself: why is there markup for paragraphs but not for sentences?