How many ways can you slice a URL and name the pieces?

While researching a way to describe different parts of a URL for a runtime interface, I was shocked to discover that over the years, different specifications, implementations, and communities had developed an incredible variety of ways to slice and name the pieces of a URL.

I remember seeing XKCD 927 recently, laughing at the familiarity, and what appeared at the time to be quite a bit of exaggeration. 14 competing standards, hah.

I was developing a small single purpose microsite and decided to build it using CASSIS not just for application logic, but for the server-side runtime execution and flow as well. I figured the needs of a simple real world site would work well to drive the design of a simple runtime.

No need to invent anything new, just re-use Apache/CGI environment variables (e.g. as used in PHP, like SERVER_NAME). But they look like old C constants, and CASSIS coders will be more familiar with Javascript.

Window.location's properties seem reasonable, until you get to "search" for the "?" query part of a URL. What about the source, the specs for URL and HTTP? And that's when I started to see the problem.

With a little more research I found a half-dozen different ways to slice and dice URLs. Kevin Marks asked me, what about Python? And that made seven. I published my research publicly on the microformats wiki, which is a good place to document existing formats for something (a key step in the microformats process).

Among all the differences (and overloading of the same terms to mean different things) it did seem that there were some patterns. So I made a diagram of a sample URL, chopped into pieces and named according to ~~seven~~ ~~eight~~ ~~nine~~ ~~ten~~ twelve different conventions over the years, in the hopes that doing so might reveal such patterns.

	http	:	//	microformats.org	:	80	/	wiki	/	index.php	?	title=url-formats	#	Why
~1994 URL RFC	scheme			internet domain name		port number	path						fragmentid
~1994 URL RFC	scheme			internet domain name		port number					search		fragmentid
~1996 HTTP RFC	absoluteURI													fragment
	scheme		relativeURI (net_path)
				net_loc			abs_path
				host		port		rel_path
								path				query
								fsegment		segment		query
~1996 DOM window.location	protocol			host			pathname				search		hash
~1996 DOM window.location	protocol			hostname		port	pathname				search		hash
~1997-99 CGI	scheme			SERVER_NAME or HTTP_HOST		SERVER_PORT	SCRIPT_NAME / PATH_INFO					QUERY_STRING
~1998 Perl	scheme		opaque											fragment
				host_port or authority			path					query
				host		port	path					query
1999 Mozilla nsIURI / nsIURL	prePath						path
	scheme			hostPort			filePath					query		ref
	scheme			host		port	directory			fileName		query		ref
~2000 Python 2	scheme			netloc			path					query		fragment
~2000 Python 2	scheme			hostname		port	path					query		fragment
~2005 URI RFC	scheme		hier-part									query		fragment
~2005 URI RFC	scheme			authority			path					query		fragment
2007 Googler	protocol			host or hostname		port	path				parameters		fragment
2010-2011 Node.js URL	href
	protocol*			host			pathname				search		hash
	protocol*			hostname		port	pathname					query	hash
2011 jQuery Mobile parseUrl	href
	hrefNoHash												hash
	hrefNoSearch										search
	domain						pathname
	protocol			host or authority			directory			filename
	protocol			hostname		port	directory			filename
2011 Adactio	protocol			domain		port	path				???		???
	http	:	//	microformats.org	:	80	/	wiki	/	index.php	?	title=url-formats	#	Why

I hope you find this diagram useful to both understand the many names for different parts of URLs, and what someone might mean when they use one of them.

Also available in ~~image form (as of 2011-08-26, 720x350)~~ image form (as of 2011-09-18, 748x644)

If you publish it, please link it to this blog post: http://tantek.com/b/4DY1

And if you know of other standards, implementations, or even cultural conventions that split up URLs and name the pieces differently than the above, please let me know. Note: username & password were omitted for simplification (and you shouldn't be using http-auth anyway); params omitted because it's obsolete.

A few conclusions:

scheme is more prevalent than protocol. Yet anecdotally developers use protocol more, and in practice most schemes are protocols.
hostname is used consistently (to mean the same thing) as are port and query.
path has been used consistently for the past 10+ years and in a way consistent with its operating system roots.
fragment(id) is used inconsistently as to whether or not it includes the leading "#" hash/pound symbol. However, notably absent from any specification or platform was the alternative phrase named anchor.

The highlighted subset of terms should work just fine for a new CASSIS web app runtime convention, thus inevitably fulfilling the expectations as documented and foretold by XKCD 927.

Thanks to Erin Jo Richey, Ian Fung, Kevin Marks, Ben Metcalfe, and Violet Blue for feedback and reviewing drafts of the diagram.

Update 2011-08-31: Thanks also to Jeremy Leader for IMing me the info about Perl, including a sample Perl script that slices up the example URL, as well as its output, and 1998 reference to URI library 0.01 readme.

Update 2011-09-01: Thanks to Wladimir Palant for reminding me about the Mozilla XPCOM interfaces nsIURI and nsIURL and their respective IDL files, used by Firefox extension developers.

Update 2011-09-02: I've added Node.js's URL class, originally documented in 2010 because of the debate on github, as well as jQuery Mobile's path.parseUrl thanks to Scott Jehl sending me the results of parsing the example URL into the various properties.

Comments

2011-08-26 Hacker News commentary thread
- Funny that the first comment is not the post itself but about Falcon post UI (arrowkey navigation between items as numerous content hosting sites do, e.g. Facebook, Flickr, etc.). The "/* " problem noted was due to a very old stylesheet rule which has been fixed. t
2011-09-08 Saveen: A Great Diagram: URL Slicing by Tantek Çelik
- Thanks for the kind words Saveen! t