July 26th, 2005 - 6:30 - 9:30pm - Tag Tuesday July meeting - Odeo - at 77 Natoma Street at 2nd Street, San Francisco, CA USA
Meet other Bay Area tag developers, and hear how Odeo is using tags in their podcast directory.
See you there!
Numerous smart folks have written about these topics over the years, and I've wanted a collection of references all in one place to point folks to for some time.
Most recently (almost two months ago) Anne van Kesteren wrote a post titled Why generic XML on the web is a bad idea, in response to some comments on Dave Shea's blog post Who Cares about Semantics Anyway?. Side note: be sure to read Dave Shea's excellent Markup Guide.
The marketing message of XML has been for people to develop their own tags to express whatever they wanted, rather than being stuck with the limited predefined tag set in HTML. This approach has often been labeled "plain XML" or "generic XML" or "SGML, but easier, better, and designed just for the Web".
The problem with this approach is that while having the freedom to make up all your own tags and attributes sounds like a huge improvement over the (mostly perceived) limits of HTML, making up your own XML has numerous problems, both for the author, and for users / readers, especially when sharing with others (e.g. anything you publish on the Web) is important.
This post by no means contains a complete set of arguments against plain/generic XML and presentational markup, nor are the arguments presented as definitive proofs. Mostly I wanted to share a bunch of reinforcing resources in one place. Readers are encouraged to improve upon the arguments made here.
If everyone invents their own tags and attributes, pretty soon you get people calling the same thing by different names and different things by the same name. While avoiding both of those occurrences completely is very difficult (many of the microformats principles are designed to help avoid those problems), downright encouraging authors to make up their own tags and attributes makes it much worse and all you end up with are a bunch of documents that give you the illusion of self-description.
Now if you don't care about sharing/publishing your data on the Web, for example if you're just writing a custom data format for some behind-the-firewall application, this becomes a lot less important.
Note that while having authors make up their own tags (element names) is bad for data formats and interoperability, this is quite different from the more and more popular practice of having authors make up their own tags (keywords), which has been shown to be a very effective alternative to explicit taxonomies and ontologies for categorizing content.
What happens all too often when authors or developers make up their own tags is that they choose tags that are tightly tied to a specific presentation rather than abstracting them with semantics. Quite similar to the phenomenon of authors picking presentational class names.
It's bad enough that there are still web authors (even a few professional web authors) who continue to use presentational HTML on the Web. However, perhaps worse is that even a few W3C efforts have compromised on the principle of separating presentation from markup (perhaps because markup has been seen as a generic data format rather than just a way to mark up text content; it's called "markup" for a reason).
Sometimes something is a bad idea not just in absolute terms, but also relative to other approaches and solutions.
A while ago I wrote about a semantic richness spectrum on the www-style mailing list which went into a bit more detail.
Håkon Wium Lie wrote a paper that both predated my rough summary by a couple of years, and provided a much more thorough analysis.
I strongly recommend reading Håkon's paper.
Languages with well-known semantics are preferred to proprietary/made-up XML. This is for many reasons, including accessibility, cross-device support, and future user agent support.
You should never, ever send arbitrary markup in a language you made up over the network (unless you have full control over the target UA).
Making up your own vocabulary is one of the worst possible things to do in terms of accessibility, semantic web content analysis, and user control.
Later, Ian wrote this up in his blog:
And that's it for now on this topic. No pithy conclusions or summary statement. Just some food for thought on a Sunday evening.
P.S. I'm sure I've missed a few other good related write-ups and articles. Feel free to point them out to me and I'll update this post appropriately. Thanks.
I met Dare at Gnomedex last month (we sat together at lunch the last day), and chatted about a bunch of things. Dare knows I'm a fan of XML, so it didn't surprise me that much of his post confirmed what I wrote above, if indirectly. But some of his post surprised me a bit, and many readers as well, as the comments on his post refuted a lot of what he had to say. A few points in particular:
Dare wrote in response to the "Tower of Babel Problem" with plain XML that I pointed out (should have been problemS, emphasis on that plural):
Didn't the XML world solve this with XML namespaces like six or seven years ago?
Sigh, XML namespaces. The sad thing is that while namespaces theoretically addressed one of the problems I pointed out (calling different things by the same name), they actually WORSENED the other problem: calling the same thing by different names. XML Namespaces encouraged document/data silos, with little or no reuse, probably because every person/political body defining their elements wanted "control" over the definition of any particular thing in their documents. The <svg:a> tag is the perfect example of needless duplication.
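To make the duplication concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. The sample document, the example.com URLs, and the helper names xhtml_links and all_links are made up for illustration; only the three namespace URIs are the real ones. It shows that a consumer which knows about XHTML hyperlinks does not see the SVG one, and has to special-case each vocabulary that reinvented "a link":

```python
import xml.etree.ElementTree as ET

# Real namespace URIs; everything else in this sketch is hypothetical.
XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"

# A made-up document mixing an XHTML link and an SVG link.
DOC = f"""<html xmlns="{XHTML_NS}" xmlns:svg="{SVG_NS}" xmlns:xlink="{XLINK_NS}">
  <body>
    <a href="http://example.com/one">one</a>
    <svg:svg>
      <svg:a xlink:href="http://example.com/two"><svg:text>two</svg:text></svg:a>
    </svg:svg>
  </body>
</html>"""


def xhtml_links(root):
    """Collect href values from XHTML <a> elements only."""
    return [a.get("href") for a in root.iter(f"{{{XHTML_NS}}}a")]


def all_links(root):
    """Collect links from both vocabularies; each one needs its own rule."""
    links = xhtml_links(root)
    # Same concept, different element name AND different attribute name.
    links += [a.get(f"{{{XLINK_NS}}}href") for a in root.iter(f"{{{SVG_NS}}}a")]
    return links


root = ET.fromstring(DOC)
print(xhtml_links(root))  # only the XHTML link is found
print(all_links(root))    # both links, after adding an SVG-specific rule
```

Every additional vocabulary that defines its own link element would need one more branch in all_links, which is the "same thing by different names" problem in practice.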
And if something was theoretically supposed to have solved something but effectively hasn't 6-7 years later, then in our internet-time-frame, it has failed. Dare continues:
I personally haven't seen a good explanation of why <strong> is better than <b>...
A statement like that begs some homework. The accessibility, media independence, alternative devices, and web design communities all figured this out years ago. This is Semantic (X)HTML 101. Please read any modern web design book like those on my SXSW Required Reading List, and we'll continue the discussion afterwards.
At the end of Dare's post, he seems to imply a tension with RSS and Atom. I'm a little unclear where Dare got those notions, as every microformat spec is built to work in semantic (X)HTML, Atom, RSS, or even "generic XML" (but only if you really want to hurt yourself and those viewing your content). RSS and Atom are essentially envelope formats, which work quite well for carrying visible microformatted content payloads.
Finally, one has to seriously cast doubt on XML opinions on a page that is INVALID markup. I suppose following the XML-way, I should have simply stopped reading Dare's post as soon as I ran into the first well-formedness error. Only 1/2 ;) And the duplicate is also invalid.
Randy has the beginnings of a good history lesson about HTML and SGML and XML. But his lesson ends around the introduction of XML 1.0, which was many years ago. To get caught up where he leaves off, take a look at the microformats wiki introduction page.
Note: email sent to cs.stanford.edu or tantek.com since 3pm yesterday has almost certainly not been received by me.
Some of the infrastructure behind my cs.stanford.edu email address has failed and will take a day or two to repair. Yeah, ouch.
Tantek.com email should be working fine again as of about 1pm today so you can now use that.
P.S. I've definitely been overwhelmed with email and behind (much more than "usual") on replying since SXSW in March of this year. If you've sent me email and I haven't replied, my apologies. I try to respond to each email I get but some of them I tend to lump in similar buckets until I get around to responding to that topic. One tip: short emails with bounded questions/issues get much faster responses than essay emails. Thanks.
Shankar Gupta writes about how 'Mainstream Media Harnesses Blogosphere'. In short, by showing readers the feedback from blogs, and linking to blogs, media sites like Salon are encouraging a virtuous feedback loop with bloggers. Salon's articles use Technorati to link to the bloggers talking about Salon's articles, thus encouraging more bloggers to do so. As a result more people reading blogs visit and read Salon's articles, and some of Salon's users click thru to blogs.
Expect to see more media sites in the near future directly engaging the ongoing conversation about their articles in the blogosphere. If you want to integrate Technorati support into your media site, send me an email at email@example.com, and I'll make sure that Technorati's business development staff gets back to you.
Two months ago (When I was in Japan! Yes, can you tell I'm catching up on some blogging?), Salon launched integrated Technorati support. Kevin and Rodney and all the folks at Salon have done a great job and you should check it out.
What's new for Salon readers and bloggers?
First, right there on the Salon home page, in the second column, about 2-3 page-scrolls down, are the top five Salon stories, ranked by new links from blogs within the past four days (AKA "mini blog roundup"). For example, right now the top article is "The time of revenge has come". Just below the link to the article is the link to the blog discussion about the article, showing that the article has 33 new links from blogs. Clicking on that link shows the first 20 results, prefixed with a nice Salon logo, the article title, and a short article abstract, all providing the user a sense of continuity from where they clicked from.
Second, just below the mini blog roundup on the home page, there is a link to a full blog roundup page which shows the top 10 hottest Salon stories, as determined by new links from blogs in the past four days, as well as the most recent blog post regarding each, including direct links to those blogs.
Third, every Salon article has a link to "Technorati: Blogs discussing this story" so that after reading any Salon article, you can read follow up discussions and commentary.
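The ranking behind the two roundups described above (stories ordered by new links from blogs within the past four days) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Salon or Technorati actually implement it: the top_stories function, the (story, blog, seen_at) tuple shape, and the sample data are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def top_stories(links, now, days=4, limit=5):
    """Rank stories by the number of blog links seen within the window.

    links: iterable of (story_url, blog_url, seen_at) tuples (assumed shape).
    Returns a list of (story_url, link_count), most-linked first.
    """
    cutoff = now - timedelta(days=days)
    counts = Counter(
        story for story, _blog, seen_at in links if seen_at >= cutoff
    )
    return counts.most_common(limit)


# Hypothetical sample data: two recent links to story A, one to story B,
# and one link to story C that is too old to count.
now = datetime(2005, 7, 20, 12, 0)
sample = [
    ("salon/story-a", "blog-1", now - timedelta(days=1)),
    ("salon/story-a", "blog-2", now - timedelta(days=3)),
    ("salon/story-b", "blog-3", now - timedelta(hours=5)),
    ("salon/story-c", "blog-4", now - timedelta(days=9)),
]
print(top_stories(sample, now))
```

Passing limit=5 gives the shape of the home page mini roundup and limit=10 the full roundup page; a real system would presumably also deduplicate repeat links from the same blog.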
What does all this mean? For one, it means that Salon is paying attention to blogs and what blogs are saying about Salon stories. Salon's writers are especially paying attention to what blogs are saying about their articles, and how many bloggers are linking to each article (sound familiar? How many bloggers do you know who are constantly checking their Technorati ranking and inbound links?).
Salon has done an excellent job integrating Technorati support, and showing the blogging community that they are listening. So if you're a blogger, read some Salon articles, and if/when you see some worthy of a commentary, blog something and link back to the article. Who knows, your very commentary might show up on Salon's blog roundup page.
Technorati Japan launched with super Japanese keyword and URL real-time blog search, including a few Japanese-only features, such as the Top 25 DVDs blogged in Japanese, Top 10 (or so) CDs, Top 10 Games, and Top 100 Japanese Blogs. Well done Technorati Japan team!
I met Eric Hayes at the recent Gnomedex conference. Eric is a co-founder of Attensa, a company actively working on supporting Attention.xml. We had a great discussion about some of the issues in the current draft (and proposed resolutions for them).
As an implementer, I asked Eric to join the Attention.xml effort as a co-author and he accepted. Please welcome new Attention.xml co-author Eric Hayes. We'll be documenting and fixing issues both related to turning Attention.xml into one or more microformats, and the issues that Eric has uncovered from his implementation experience.
Today I have a little more freedom than I did yesterday.
Today is exactly one year after my last day at Microsoft. This means I can recruit former Microsoft co-workers.
This also means my one year noncompete has expired, and so for example, I can write code for, and fix bugs in, other web browsers. Or browser engines. Like open source browser engines such as Gecko, KHTML, and WebKit (I was joking with Jonas tonight (Happy Birthday man!) that I could build a custom Technorati browser, based on open source).
Now that doesn't mean it will happen. First, there is the problem of copious spare time, or lack thereof. Second, I do wonder if the maintainers of those open source projects would accept patches from me, seeing as it has been a number of years since I actually wrote browser code myself. In any case, the option is there, and it feels good just to know that.
It's another new month, and I think that perhaps in appreciation of that, we should make the 1st of the month a day to try something you've never tried before. Do something you've never done before. Or go somewhere you've never gone before.
Here are the new things I've done so far today:
What new things will you do/see/go to on the next 1st of the month?
Almost six months ago I wrote about cognitive overload and the things I've done to cope. Here is one more suggestion:
In contrast, some companies are adopting "Email Free Fridays". This seems entirely the wrong approach. The most common problem I have found with email, especially in corporate settings, is the sending of useless, humorous, or otherwise irrelevant emails on work mailing lists. I've seen this happen at every place I have worked (with the exception of 6prime, which was too small to have this problem). Banning email one day a week won't solve this problem. Eventually you learn that different people have different signal-to-noise ratios of sent email, that those ratios are independent of the seniority or managerial level of the sender, and that the only solution is liberal use of the delete key.
However, that's not the issue. Contrasting the values implicit in "Phone Free Fridays" vs. "Email Free Fridays", which is better?
I think the answer is obvious.
Let's minimize phone calls and phone conferences on Fridays. Try out Phone Free Fridays and see how it works for you.