By Sir Tim Berners-Lee, inventor of URIs, URLs, HTTP, HTML and the World Wide Web, current head of the W3C. Written in 1998

What makes a URI cool?

One that doesn't change.

How do URIs change?

URIs don't change: people change them.

In theory, there is no reason for people to change URIs (or stop maintaining documents), but in practice there are millions.

In theory, the owner of a domain name space owns that name space and therefore all the URIs in it. Apart from insolvency, nothing prevents the owner of a domain name from keeping it. And in theory, the URI space under your domain name is completely under your control, so you can make it as stable as you like. Pretty much the only good reason for a document to disappear from the Web is that the company that owned the domain name has gone out of business or can no longer afford to keep the server running. Then why are there so many dangling links in the world? Part of it is just a lack of forethought. Here are some of the reasons you hear:

We just reorganized the site to make it better.

Do you really feel that the old URIs cannot be kept working? If so, you chose them very poorly. Choose the new ones so that you will be able to keep them working after the next redesign.

We have so much material that we cannot keep track of what is outdated, what is confidential, and what is still relevant, and so we thought it was better to just turn it off.

That I can sympathize with. The W3C went through a period like that, when we had to sift archival material carefully for confidentiality before making the archives public. The solution is forethought: make sure that you record with each document its acceptable distribution, its creation date and, ideally, its expiry date. Keep this metadata.

Well, we found we needed to move files ...

This is one of the lamest excuses. Many people don't know that web servers give you a lot of control over the relationship between an object's URI and the actual location of the file that represents it in the file system. Think of the URI space as an abstract space, perfectly organized. Then map it onto whatever reality you actually use to implement it. Then tell your web server. You can even write bits of your server to get it just right.
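As a rough illustration of keeping the URI space separate from the file system, here is a minimal Python sketch. The table contents, the file paths and the function name `resolve` are all hypothetical; a real server would express the same idea in its own configuration.

```python
from typing import Optional

# Hypothetical table: stable public URI path -> wherever the file lives today.
# When files move, only the right-hand side changes; the URIs stay put.
URI_TO_FILE = {
    "/1998/12/01/chairs": "/srv/archive/vol3/meetings/chairs-dec98.html",
    "/reports/annual": "/home/jane/docs/annual_report_v7.html",
}


def resolve(uri_path: str) -> Optional[str]:
    """Return the current file location for a public URI path, or None."""
    # Treat "/reports/annual/" and "/reports/annual" as the same resource.
    key = uri_path.rstrip("/") or "/"
    return URI_TO_FILE.get(key)


# Moving the file is an update to the table, not to the URI.
URI_TO_FILE["/reports/annual"] = "/srv/docs/2024/annual.html"
```

After the move, `resolve("/reports/annual")` returns the new location, and every existing link keeps working.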

John no longer maintains this file, now Jane does.

What was John's name doing in the URI in the first place? The file was just in his directory? I see.

We used to use a CGI script for this, but now we use a binary program.

There is a crazy notion that pages produced by scripts have to be located in a "cgibin" or "cgi" area. This exposes the mechanism of how you run your web server. Change the mechanism (even while keeping the content the same) and whoops - all your URIs change.

Take the National Science Foundation (NSF) for example:

NSF Online Documents

http://www.nsf.gov/cgi-bin/pubsys/browser/odbrowse.pl

This, the first page for starting to look for documents, is clearly not something you can trust to be there in a few years. "cgi-bin", "oldbrowse" and ".pl" all give away bits of how-we-do-it-now. If you use the page to search for a document, the first result you get is equally bad:

Report of the working group on cryptology and coding theory

http://www.nsf.gov/cgi-bin/getpub?nsf9814

for the index page of the document, although the html document itself looks much better:

http://www.nsf.gov/pubs/1998/nsf9814/nsf9814.htm

Here the "pubs/1998" prefix will give any future archive service a good clue that the old 1998 document classification scheme is in effect. While the document numbers may look different in 2098, I can imagine this URI still being valid, and the NSF, or whatever organization carries on the archive, not being at all embarrassed by it.

I didn't think URLs had to be persistent - that was URNs.

This is probably one of the worst side effects of the URN discussion. Some people think that because there is research on more persistent namespaces, they can be careless about dangling links, since "URNs will fix all that." If you are one of these people, allow me to disillusion you.

Most of the URN schemes I've seen look like an authority identifier followed by either the date and string you select, or just the string you select. This is very similar to the HTTP URI. In other words, if you think your organization will be able to create long-lived URNs, then prove it now by using them for your HTTP URIs. There is nothing in HTTP itself that makes your URI unstable. Only your organization. Create a database that maps the document's URN to the current filename and let the web server use that to actually retrieve the files.
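The database described above can be sketched in a few lines of Python with the standard `sqlite3` module. This is only an illustration under assumptions: the table name `documents`, its columns and the example file paths are invented here, and a production server would need persistence, access control and so on.

```python
import sqlite3
from typing import Optional


def open_registry() -> sqlite3.Connection:
    """Create an in-memory registry: persistent name -> current filename."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        " name TEXT PRIMARY KEY,"   # the persistent identifier; never changes
        " filename TEXT NOT NULL)"  # the current location; free to change
    )
    return conn


def register(conn: sqlite3.Connection, name: str, filename: str) -> None:
    """Add a document, or record that an existing one has moved."""
    conn.execute(
        "INSERT OR REPLACE INTO documents (name, filename) VALUES (?, ?)",
        (name, filename),
    )


def lookup(conn: sqlite3.Connection, name: str) -> Optional[str]:
    """Return the current filename for a persistent name, or None."""
    row = conn.execute(
        "SELECT filename FROM documents WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


conn = open_registry()
register(conn, "/pubs/1998/nsf9814", "/disk2/pubs/nsf9814.htm")
# Later the file moves; the persistent name is untouched.
register(conn, "/pubs/1998/nsf9814", "/archive/1998/nsf9814.htm")
```

The web server would call something like `lookup` on each request, so moving a file means one `register` call rather than a broken link.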

If you have come this far, then unless you have the time, money and contacts to get some software developed, you might offer the next excuse:

We wanted to, but we just don't have the right tools.

Now here is one I can sympathize with. I agree entirely. What you need is for the web server to look up a persistent URI in an instant and return the file, wherever your current crazy file system has it stored at the moment. You would like to store the URI in the file as a check, and to keep the database in step with reality at all times. You would like to preserve the relationships between different versions and translations of the same document, and to keep an independent record of the checksum to guard against accidental corruption of the file. And web servers just don't come out of the box with these features. When you want to create a new document, your editor asks you for a URI instead of telling you one.

You need the ability to change ownership, document access, archive-level security, and so on in the URI space without changing the URI.

Too bad. But we will get there. At the W3C, we use the Jigedit functionality of the Jigsaw server for editing, which does keep track of versions, and we are experimenting with document creation scripts. If you are developing tools, servers and clients, take note!
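For tool builders, the independent checksum record mentioned above might look something like the following Python sketch. The `Record` fields and function names are hypothetical, and SHA-256 is simply one reasonable choice of checksum; the essay does not name one.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Record:
    uri: str       # persistent URI of the document
    filename: str  # where the bytes are currently stored
    sha256: str    # checksum kept separately from the file itself


def checksum(data: bytes) -> str:
    """Hex SHA-256 digest of the document bytes."""
    return hashlib.sha256(data).hexdigest()


def make_record(uri: str, filename: str, data: bytes) -> Record:
    """Record a document's URI, location and checksum at publication time."""
    return Record(uri=uri, filename=filename, sha256=checksum(data))


def is_intact(record: Record, data: bytes) -> bool:
    """True if the bytes served now still match the recorded checksum."""
    return checksum(data) == record.sha256


record = make_record("/1998/12/01/chairs", "/srv/a.html", b"<p>minutes</p>")
```

A periodic job could re-read each file and call `is_intact` to catch accidental corruption before a reader does.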

This excuse applies to many W3C pages as well, including this one: so do what I say, not what I do.

Why should I care?

When you change a URI on your server, you can never completely tell who still has links to the old URI. They might be links from regular web pages. People may have bookmarked your page. The URI may have been scrawled in the margin of a letter to a friend.

When someone follows a link and it is broken, they usually lose confidence in the owner of the server. They are also frustrated, both emotionally and practically, at being unable to accomplish their goal.

Enough people complain about broken links all the time that I hope the damage is obvious. I hope that the reputational damage to the maintainer of the server where the document disappeared is also obvious.

So what should I do? URI design

It is the responsibility of the webmaster to allocate URIs that can be used in 2 years, in 20 years, in 200 years. This requires thoughtfulness, organization and commitment.

URIs change when some information in them changes. How you design them is therefore very important. (What, URI design? I have to design a URI? Yes, you have to think about it.) Design mostly means leaving information out.

The creation date of the document - the date the URI was issued - is one thing that will never change. It is very useful for separating requests that use a new system from those that use an old system. It is a good thing to start a URI with. If a document is in any way dated, even if it will remain of interest for generations, then the date is a good start.

The only exception is a page that is intentionally the "latest" version, for example, for the entire organization or a large part of it.

http://www.pathfinder.com/money/moneydaily/latest/

This is the latest "Money Daily" column in Money magazine. The main reason this URI does not need a date is that there is no reason for the URI to outlast the magazine. The concept of "today's Money Daily" disappears when Money disappears. If you want to link to the content, you should link to where it appears separately in the archives:

http://www.pathfinder.com/money/moneydaily/1998/981212.moneyonline.html

(Looks good. Assumes that "money" will mean the same thing for the life of pathfinder.com. There is a duplicate "98" and an unnecessary ".html", but otherwise it looks like a strong URI.)

What to leave out

Everything! Aside from the creation date, putting any information in a URI is asking for trouble one way or another.

Author's name. Authorship may change with new versions. People leave organizations and hand things on to others.
Subject. This is tricky. It always looks good at first, but changes surprisingly quickly. I'll talk more about this below.
Status. Directories like "old", "draft", and so on, not to mention "latest" and "cool", appear all over file systems. Documents change status - otherwise there would be no point in producing drafts. The latest version of a document needs a persistent identifier, whatever its status. Keep the status out of the name.
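Access. At the W3C, we divide the site into "Team access", "Member access" and "Public access". It sounds good, but of course documents start off as team ideas, are discussed with members, and then go public. It would be a shame if, every time a document was opened up to wider discussion, all the old links to it failed! We are now switching to a simple date code.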
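File name extension. This is a very common one. "cgi", and even ".html", is something that will change. In 20 years you may not be using HTML for that page, but you may still want today's links to it to be valid. The canonical way of linking to the W3C site does not use the extension (see the supplement below on how).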
Software mechanisms. Look in the URI for "cgi", "exec" and other terms that scream "look at what software we are using." Anyone want to commit to using Perl CGI scripts all their life? No? Then cut out the .pl. Read the server manual on how to do this.
Disk name. Give me a break! But I've seen it.

So the best example from our site is simply

http://www.w3.org/1998/12/01/chairs

... a report of the minutes of a meeting of W3C chairs.
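To make the advice concrete, here is a small Python sketch that builds a date-led path like the one above and flags path parts from the "leave out" list. The function names and the two sets of "risky" terms are assumptions made for illustration; the lists are deliberately incomplete and only meant to prompt a human review.

```python
from datetime import date
from typing import List

# Illustrative, incomplete lists of parts that expose mechanism or status.
RISKY_PARTS = {"cgi", "cgi-bin", "exec", "old", "draft", "latest"}
RISKY_EXTENSIONS = {".pl", ".cgi", ".html", ".htm", ".shtml"}


def mint_path(created: date, slug: str) -> str:
    """Build a URI path that starts with the document's creation date."""
    return f"/{created.year:04d}/{created.month:02d}/{created.day:02d}/{slug}"


def risky_parts(path: str) -> List[str]:
    """List the segments and extensions in a path that may not age well."""
    found = []
    for segment in path.strip("/").split("/"):
        if segment.lower() in RISKY_PARTS:
            found.append(segment)
        dot = segment.rfind(".")
        if dot > 0 and segment[dot:].lower() in RISKY_EXTENSIONS:
            found.append(segment[dot:])
    return found


chairs = mint_path(date(1998, 12, 1), "chairs")
```

Here `chairs` is `/1998/12/01/chairs`, for which `risky_parts` finds nothing, whereas the NSF browser path quoted earlier is flagged for both "cgi-bin" and ".pl".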

Topics and classification by topic

I'll go into more detail about this danger, as it is one of the most difficult to avoid. Typically, topics end up in URIs when you classify your documents according to a breakdown of the work you are doing. That breakdown will change over time. The names of areas will change. At the W3C, we wanted to change "MarkUP" to "Markup" and then to "HTML" to reflect the actual content of the section. Also, beware that this is often a flat name space. In 100 years, are you sure you won't want to reuse anything? In our short life, we have already wanted to reuse "History" and "Style Sheets", for example.

This is a tempting way to organize a website, and indeed a tempting way to organize anything, including the entire Web. It is an excellent medium-term solution, but it has serious drawbacks in the long term.

Part of the reason lies in the philosophy of meaning. Every term in the language is a potential clustering subject, and each person may have a different idea of what it means. Because the relationships between subjects are more like a web than a tree, even people who agree on the web may pick a different tree representation of it. These are my (often repeated) general remarks about the dangers of hierarchical classification as a general solution.

In effect, when you use a topic name in a URI, you are binding yourself to some classification. You may want a different one in the future. Then the URI will be liable to break.

One reason for using a subject area as part of a URI is that responsibility for sub-parts of a URI space is usually delegated, and then you need a name for the organizational body - the unit, group, or whatever - that is responsible for that sub-space. This binds the URI to the organizational structure. It is usually safe only when protected by a date further up the URI (to its left): "1998/pics" can mean to your server "what we meant in 1998 by pics" rather than "what we did in 1998 with what we now call pics."

Don't forget your domain name

Remember that this applies not only to the path in the URI, but also to the server name. If you have separate servers for different things, remember that this division cannot be changed without destroying many, many links. Some classic "look at what software we are using today" domain names are "cgi.pathfinder.com", "secure" and "lists.w3.org". They are chosen to make server administration easier. Whether a domain name represents a department within your company, a document status, an access level, or a security level, be very, very careful before using more than one domain name for more than one type of document. Remember that you can hide many web servers inside one visible web server, using redirection and proxying.
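The idea of hiding several servers behind one visible name can be sketched as a routing table. The Python below is only an illustration: the internal host names, ports and the function name `backend_url` are invented, and a real deployment would do this in the front-end server's proxy configuration.

```python
# Hypothetical routing table: public path prefix -> internal origin server.
# Readers only ever see one host name; these origins can change freely.
ROUTES = [
    ("/1998/", "http://archive.internal:8001"),
    ("/1999/", "http://archive.internal:8002"),
    ("/", "http://main.internal:8000"),  # fallback for everything else
]


def backend_url(path: str) -> str:
    """Return the internal URL that should serve a public path."""
    matches = [(prefix, origin) for prefix, origin in ROUTES
               if path.startswith(prefix)]
    if not matches:
        raise ValueError(f"not an absolute path: {path!r}")
    # The longest matching prefix is the most specific route.
    _, origin = max(matches, key=lambda match: len(match[0]))
    return origin + path
```

Moving the 1998 archive to another machine then means editing one line of `ROUTES`, while the public URI stays exactly as it was.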

Yes, and also think about your domain name. You don't want to be referred to as soap.com after you change your product line and stop making soap (I apologize to whoever owns soap.com at the moment).

Conclusion

Keeping a URI alive for 2, 20, 200, or even 2000 years is clearly not as easy as it sounds. However, all over the Web, webmasters are making decisions that will make things really difficult for themselves in the future. Often this is because they are using tools whose job is seen as presenting the best site at the moment, and no one has evaluated what will happen to the links when things change. The message here, however, is that many, many things can change, and your URIs can and should stay the same. They can only do so if you think about how you design them.

See also:

Jakob Nielsen's tirade on the same topic

Supplements

How do I remove file extensions from URIs in a current file-based web server?

If you are using Apache, for example, you can configure it to do content negotiation. You keep the file extension (for example, .png) on the file (for example, mydog.png), but you refer to the web resource without it. Apache then checks the directory for all files with that name and any extension, and can choose the best one from the set (for example, GIF and PNG). You do not have to put different types of files in different directories; in fact, content negotiation won't work if you do.

Configure your server to negotiate content
Always reference URIs without extension

Links that include the extension will still work, but will not allow your server to choose the best of the formats available now and in the future.

(In fact, mydog, mydog.png and mydog.gif are each valid web resources. mydog is content-type-generic, while mydog.png and mydog.gif are content-type-specific.)
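To show roughly what such negotiation involves, here is a simplified Python sketch that picks a file for an extensionless name from the client's Accept header. It is not Apache's actual algorithm (Apache's MultiViews mechanism is more elaborate); the table `MEDIA_TYPES` and the function names are assumptions made for this example.

```python
from typing import Dict, List, Optional

# Hypothetical extension -> media type table for the files in one directory.
MEDIA_TYPES = {".png": "image/png", ".gif": "image/gif"}


def parse_accept(header: str) -> Dict[str, float]:
    """Parse an Accept header into {media_type: quality}."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0  # quality defaults to 1 when no q parameter is given
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        prefs[parts[0]] = q
    return prefs


def quality(media_type: str, prefs: Dict[str, float]) -> float:
    """Quality for a type: exact match, then 'major/*', then '*/*'."""
    major = media_type.split("/")[0]
    for key in (media_type, major + "/*", "*/*"):
        if key in prefs:
            return prefs[key]
    return 0.0


def choose_variant(name: str, files: List[str], accept: str) -> Optional[str]:
    """Pick the file called name.<ext> that the client prefers, if any."""
    prefs = parse_accept(accept)
    best, best_q = None, 0.0
    for filename in files:
        stem, dot, ext = filename.rpartition(".")
        if not dot or stem != name:
            continue
        media_type = MEDIA_TYPES.get("." + ext)
        if media_type is None:
            continue
        q = quality(media_type, prefs)
        if q > best_q:
            best, best_q = filename, q
    return best
```

A request for `mydog` from a client that sends `Accept: image/png, image/gif;q=0.5` would be served mydog.png, while a client that only accepts `image/gif` would get mydog.gif, all under the same extensionless URI.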

Of course, if you are writing your own web server, it is a good idea to use a database to bind persistent ids to their current form, although beware of unlimited database growth.

Shame Board - Story 1: Channel 7

Throughout 1999, I tracked school closings due to snow on the page http://www.whdh.com/stormforce/closings.shtml. It was an alternative to waiting for the information to scroll past at the bottom of the TV screen! I linked to it from my home page. The first big snow storm of 2000 arrived and I checked the page. It said:

- As of.

Nothing is currently closed. Please come back in case of weather warnings.

It can't be that big a storm, then. Funny that the date is missing. But if you go to the main page of the site, there is a large "Closed Schools" button, which leads to the page http://www.whdh.com/stormforce/ with a long list of closed schools.

Maybe they changed the system for getting the list - but they didn't need to change the URI.

Shame Board - Story 2: Microsoft Netmeeting

With the growing dependence on the Web came the smart idea that applications could have links to the manufacturer's website built in. This has been used and abused a great deal, but it means you cannot change the URL. Just the other day I tried a link from the Microsoft Netmeeting 2.something client, under the Help / Microsoft on the Web / Free stuff menu, and got a 404 error - no such response from the server. Maybe it has been fixed by now ...

© 1998 Tim BL

Historical note: At the end of the 20th century, when this was written, “cool” was an epithet of approval, especially among young people, indicating fashionableness, quality or appropriateness. In the rush, URI paths were often chosen more for apparent "coolness" than for usefulness or longevity. This note is an attempt to redirect the energy behind the quest for cool.

Cool URIs don't change