While researching a way to describe different parts of a URL for a runtime interface, I was shocked to discover that over the years, different specifications, implementations, and communities had developed an incredible variety of ways to slice and name the pieces of a URL.
I remember seeing XKCD 927 recently and laughing at the familiarity, and at what seemed at the time to be quite a bit of exaggeration. 14 competing standards, hah.
I was developing a small, single-purpose microsite and decided to build it using CASSIS, not just for application logic but for the server-side runtime execution and flow as well. I figured the needs of a simple real-world site would work well to drive the design of a simple runtime.
The properties of window.location seem reasonable, until you get to "search" for the "?" query part of a URL. What about the source, the specs for URL and HTTP? And that's when I started to see the problem.
With a little more research I found a half-dozen different ways to slice and dice URLs. Kevin Marks asked me, what about Python? And that made seven. I documented my research on the microformats wiki, which is a good place to document existing formats for something (a key step in the microformats process).
Among all the differences (and overloading of the same terms to mean different things) it did seem that there were some patterns. So I made a diagram of a sample URL, chopped into pieces and named according to seven different conventions over the years, in the hopes that doing so might reveal such patterns.
|~1994 URL RFC||scheme||internet domain name||port number||path||fragmentid|
|~1996 HTTP RFC||absoluteURI||fragment|
|~1996 DOM window.location||protocol||host||pathname||search||hash|
|~1997-99 CGI||scheme||SERVER_NAME or HTTP_HOST||SERVER_PORT||SCRIPT_NAME / PATH_INFO||QUERY_STRING|
|host_port or authority||path||query|
|1999 Mozilla nsIURI / nsIURL||prePath||path|
|~2000 Python 2||scheme||netloc||path||query||fragment|
|~2005 URI RFC||scheme||hier-part||query||fragment|
|2007 Googler||protocol||host or hostname||port||path||parameters||fragment|
|2010-2011 Node.js URL||href|
|2011 jQuery Mobile parseUrl||href|
|protocol||host or authority||directory||filename|
I hope you find this diagram useful to both understand the many names for different parts of URLs, and what someone might mean when they use one of them.
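To make one row of the diagram concrete, here is a minimal sketch of the Python naming, assuming Python 3's `urllib.parse` (the successor to the Python 2 `urlparse` module the diagram refers to). The sample URL and the `name_parts` helper are hypothetical, chosen for illustration; they are not the URL used in the diagram.

```python
from urllib.parse import urlsplit

# Hypothetical sample URL, not the one from the original diagram.
SAMPLE_URL = "http://www.example.com:8080/dir/file.html?key=value#section"


def name_parts(url):
    """Return the pieces of a URL under the names Python's urlsplit() uses."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,      # no trailing ":"
        "netloc": parts.netloc,      # host and port together
        "hostname": parts.hostname,  # host only, lowercased
        "port": parts.port,          # int, or None when absent
        "path": parts.path,
        "query": parts.query,        # no leading "?"
        "fragment": parts.fragment,  # no leading "#"
    }


if __name__ == "__main__":
    for name, value in name_parts(SAMPLE_URL).items():
        print(f"{name:9} {value!r}")
```

Note that `netloc` bundles host and port, while `hostname` and `port` are offered separately as attributes of the result.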
Also available in image form (as of 2011-08-26, 720x350).
If you publish it, please link it to this blog post: http://tantek.com/b/4DY1
And if you know of other standards, implementations, or even cultural conventions that split up URLs and name the pieces differently than the above, please let me know. Note: username & password were omitted for simplification (and you shouldn't be using http-auth anyway); params omitted because it's obsolete.
A few conclusions:
- scheme is more prevalent than protocol. Yet anecdotally developers use protocol more, and in practice most schemes are protocols.
- hostname is used consistently (to mean the same thing) as are port and query.
- path has been used consistently for the past 10+ years and in a way consistent with its operating system roots.
- fragment(id) is used inconsistently as to whether or not it includes the leading "#" hash/pound symbol. However, notably absent from any specification or platform was the alternative phrase named anchor.
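The scheme/protocol and fragment/hash differences noted above are easy to see side by side. The following is only a sketch, assuming Python 3's `urllib.parse`; the `location_style` helper is hypothetical and simply renames the pieces the way window.location does, where protocol carries a trailing ":" and search and hash carry their leading "?" and "#".

```python
from urllib.parse import urlsplit


def location_style(url):
    """Rename urlsplit() output using window.location-style property names.

    Hypothetical helper for illustration only. It assumes the URL has no
    username or password, which this post omits as well.
    """
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme + ":",  # scheme plus trailing ":"
        "host": parts.netloc,            # hostname and port together
        "hostname": parts.hostname or "",
        "port": "" if parts.port is None else str(parts.port),
        "pathname": parts.path,
        # search and hash keep their delimiter, and are empty when absent
        "search": "?" + parts.query if parts.query else "",
        "hash": "#" + parts.fragment if parts.fragment else "",
    }


if __name__ == "__main__":
    url = "http://www.example.com:8080/dir/file.html?key=value#section"
    print(urlsplit(url).fragment)       # fragment without the "#"
    print(location_style(url)["hash"])  # hash with the "#"
```

The same piece of the URL comes back as `section` under one convention and `#section` under the other, which is the kind of mismatch to check for when passing values between the two.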
The highlighted subset of terms should work just fine for a new CASSIS web app runtime convention, thus inevitably fulfilling the expectations as documented and foretold by XKCD 927.
Thanks to Erin Jo Richey, Ian Fung, Kevin Marks, Ben Metcalfe, and Violet Blue for feedback and reviewing drafts of the diagram.
- 2011-08-26 Hacker News commentary thread
- 2011-09-08 Saveen: A Great Diagram: URL Slicing by Tantek Çelik
- Thanks for the kind words, Saveen!