August 27, 2005

Semi-structured meta-data has a posse: A response to Gene Smith

I’m starting to work my way back to the tagging debate, and want to start with Gene Smith’s post from last spring at Atomiq: Market Populism in the Folksonomies Debate. Smith regards Ontology is Overrated with some skepticism, concluding that I am overstating the case for effect. He is instead trying to carve out a more reasonable position, arguing for the usefulness of tags in some limited number of cases, and peaceful coexistence with other sorts of classification schemes.

I, on the other hand, am of the unreasonable view that classification schemes are going to be largely displaced by tagging for the same reasons that search has largely displaced directories for finding things, namely that distributed intelligence, for all its faults, tends to beat the work of a professional class when dealing with large, dynamic systems.

Gene’s label for this view is market populism, which seems to me to be a misreading of what is at work here. Tagging is not a populist technology but a libertarian one — it is precisely because the populace does not need to come to a consensus that tagging better expresses both the fluidity and polyvalence of meaning than formal classification systems do. If tagging were populist, it would have all the disadvantages of classification schemes, because it is the very requirement of forced convergence on an agreed-upon set of metadata that causes the problems with classification in the first place.

Gene’s general critique, as befits his more reasonable point of view, is to point out that many institutions still use classification schemes:

The separation of the USSR into 15 or so independent states was no easy transition–politically or categorically–so I suspect the Library of Congress wasn’t the only one stuck on how to describe the Soviet Union after the fall. The Washington Post abandoned the “Former USSR” label only recently.

This is true, but is in my view merely a restatement of the problem. The Washington Post has the same dilemma as the Library of Congress because they have taken on the same (unrequired) constraints of consistency and clarity in labeling, which makes their classification schemes a poor fit for inconsistent and unclear situations.

Consider the value of the tagging approach, relative to the “One category now, another later” approach: the independent nature of Soviet states would always be represented in such a system, so long as anyone labeled things happening in Ukraine, say, with that tag. And after the breakup, more Ukraine tags and fewer USSR tags would be used, because the system would react, immediately, to the new information. We don’t yet have tag clouds old enough to take advantage of such time slices (“Show me how this has been categorized in the last month vs. the last year”), but we will, soon, and the ability to look at dynamic tag signatures will be better able to handle cases of redefinition than waiting for professional catalogers to reconsider their judgment.
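To make the time-slice idea concrete, here is a minimal sketch in Python of what "last month vs. the last year" might look like. Everything in it is an assumption for illustration: the event data is invented, and the names tag_signature and compare_slices are hypothetical, not taken from any real tagging service.

```python
from collections import Counter
from datetime import datetime, timedelta

# Invented tagging events for illustration: (timestamp, item, tag).
# A real system would pull these from its own store of user tags.
events = [
    (datetime(2005, 1, 10), "kiev-report", "ussr"),
    (datetime(2005, 2, 3), "kiev-report", "ussr"),
    (datetime(2005, 3, 15), "kiev-report", "ukraine"),
    (datetime(2005, 8, 5), "kiev-report", "ukraine"),
    (datetime(2005, 8, 20), "kiev-report", "ukraine"),
    (datetime(2005, 8, 21), "other-page", "ussr"),
]


def tag_signature(events, item, start, end):
    """Count the tags applied to `item` with start <= timestamp < end."""
    return Counter(
        tag for ts, it, tag in events if it == item and start <= ts < end
    )


def compare_slices(events, item, now,
                   recent=timedelta(days=30),
                   baseline=timedelta(days=365)):
    """Return (recent, baseline) tag counts for `item`, both ending at `now`."""
    recent_sig = tag_signature(events, item, now - recent, now)
    baseline_sig = tag_signature(events, item, now - baseline, now)
    return recent_sig, baseline_sig


now = datetime(2005, 8, 27)
recent_sig, baseline_sig = compare_slices(events, "kiev-report", now)

print("last month:", dict(recent_sig))   # only 'ukraine' tags remain
print("last year: ", dict(baseline_sig))  # a mix of 'ussr' and 'ukraine'
```

The point of the sketch is only that the shift from one label to another falls out of the raw tag stream by changing the window; no one has to go back and re-classify anything.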

He goes on in this vein, asking what I think is the key question:

(And would it be any easier to go through thousands of tagged URLs and decide which was about Georgia and which was about Azerbaijan and which was about Turkmenistan? Even with social metadata, issues of aboutness persist.)

Yes, of course it would be easier. This is what is so radical about tagging — it would be easier because other people would do it for you. It is much, much easier for new terminology to establish itself in a tagged system than in one where there is a professional group of classifiers, because going back over old material is simply too expensive. In most classification systems, the arrearage problem — the buffer of as-yet uncategorized new material — is so acute that taking time out to re-classify existing things is out of the question in all but the most extreme cases. Categorization systems favor stable categories not because the world is stable but because categorizers are busy and their time is expensive.

As for issues of aboutness persisting, of course they persist. They are in fact permanent, which is why classification systems are both unduly definitional when covering ambiguous cases, and grow increasingly brittle over time. The cataloger can’t replicate the mental models of the users better than the users can themselves, nor can they predict how stable their proposed categorizations will be over time. (These are presented as the Mind Reading and Fortune Telling problems in Ontology Is Overrated.)

This is not to say that Clay is wrong about tags being a useful, even vital, way of organizing certain kinds of information.

Coming from someone who claims to be enthusiastic about tagging, this is an oddly tepid endorsement. I’d be especially curious what kinds of things are not included in “certain kinds of information.” Tags are just labeled links; they are a useful, even vital, tool for anything that can be referred to with a URI. Which is a pretty considerable subset of everything.

The trouble is that most of the practical objections to folksonomies–as well as the arguments for a peaceful co-existence between classification schemes–have been met with the forced move response. The argument from inevitability is a great way to simultaneously sidestep your opponent’s objections while confirming your own assumptions.

In my view, people who believe that tagging will co-exist peacefully with classification schemes have underestimated tagging.

As for the ‘forced move’ argument, this is a fair cop, and I’ll try to be more explicit about why I am predicting the rise of tagging at the expense of traditional classification schemes. Note, first, that I am restricting my comments to tagging, rather than to the more general folksonomies. I think in particular that folksonomies like the Wikipedia hierarchy are making the same mistakes that older systems make. Folksonomies that take on the structural rigidities of classification schemes also take on their weaknesses.

For all the many differences between tagging and classification, the key one is cost. It is simply too expensive to hire professionals to do the work once a system that uses peer production is also available. I will in the future refrain from making generic ‘forced move’ arguments, and will instead predict that the economic advantages of moving from classification to tagging will be attractive enough to make tagging the preferred strategy among people choosing between the two.

It’s also good for keeping the discussion in the abstract rather than concrete. Because it’s a forced move, there’s no point asking why both Amazon and Wikipedia use categories. Or why does the failure to organize the whole web into a hierarchical taxonomy (like the Yahoo Directory) mean that taxonomies are useless?

I’ll take each of these three cases in turn:

Amazon is currently experimenting with tagging, because the signal loss created from categories costs them money. Amazon doesn’t want to create a mental model that I need to conform to, they want to reflect to me whatever model will help me find things most easily. Whenever anyone wants to find something and can’t, Amazon loses a potential sale.

Categories create an elaborate and expensive round trip: try to understand the users’ mental model, then reflect that in the categories, then hand those categories back to the user in hopes that it will help them find things. Tagging takes out the middle step — the tag cloud is a better fit to the users’ mental model than the categorization scheme is. Amazon is only an accidental purveyor of ontology, a function that raises their costs while lowering their effectiveness. They’ll get out of that business if they can, and now they can. (The Lists feature is a kind of proto-tagging, where users organize things for one another in ways Amazon simply could not re-create with the work of professionals.)

Wikipedia’s classification scheme, as I noted above, has all the problems of older classification schemes because it is a classification scheme; being ‘folksonomic’ or whatever isn’t magic pixie dust. Wikipedia classification is the usual mess we get with such systems: Toys are in Personal Life while Food and Drink is in Culture; Agriculture is in Technology, but not Society; Sociology is in Society but not Science, and so on. It reads like something out of Metacrap. And, possibly as a result, no one actually points to the Category: pages on Wikipedia.

To offer a prediction, both Amazon and Wikipedia have seen the high-water mark of the usefulness of their respective categories; both will shift to include tagging more natively; and you will see some of those changes in the next 12 months.

As for Yahoo, the failure to organize the whole web shows us that there are limits to the effectiveness of classification in systems that have three characteristics: large size, heterogeneous content, and dynamic growth. Sound like any data sets you know?

Even more surprising is how low those limits are. DMOZ claims to have 4.6M unique URLs classified, and claims to have surpassed the size of Yahoo’s directory. Let’s assume for the sake of argument that 4.6M is the largest publicly available set of classified URLs. is larger than that now, and it’s currently adding 20 or so new URLs a minute.

Most organizations will not be able to support even a fraction of the effort that went into the Yahoo or DMOZ directories, and given the examples of tagging systems that have already blown past those limits, the constraints of scale alone will push a number of groups away from formal classification, as the volume of material they want to cover grows even to hundreds of thousands of things. More importantly, the superior economic model of tagging as a by-product of self-interest, best described by Dan Bricklin in The Cornucopia of the Commons, will mean that even before those limits are reached, tagging will be the more attractive choice on practical grounds.

Or, like, do you seriously mean that tagging will replace all other kinds of categorization? Across the whole freaking web? Surely not.

That ‘Surely not’ is an argument from personal incredulity, and one I think I can dispel by re-stating my position: Yes, I really do mean that tagging will replace other kinds of formal categorization. Across, like, the whole freaking web. Modeling the group mind with the group mind is both better and cheaper than making some small group formalize their guesses on behalf of the larger group. When labeling strategies are concerned, it’s formal, accurate, large: pick two.

The only asterisk I’d place on that belief is for cases where forcing intellectual conformity on a user base is both desirable and feasible. We want there to be general agreement as to the categories of mental illness or distress, and the American Psychiatric Association has both the authority and wherewithal to produce that agreement in the US. But those cases are rare, and even when they are on balance desirable, their existence as a single source of doctrinal authority creates perverse incentives like the fight over defining homosexuality as an illness.

But, as with Amazon, most places that offer digital categorization are only in the expensive and frustrating business of classification because they wrongly believe that is the best way to serve their users. That will change.

One can be enthusiastic about tags and folksonomies (I am) and still confront the serious problems that face them as a stand-alone tool for organizing information. Turning a blind eye to those problems is what turns strange zeitgeist into irrational exuberance.

This is both correct and a key point. I didn’t discuss this in Ontology is Overrated because I hadn’t yet understood it back in March: tags do indeed have serious problems as a stand-alone tool for organizing information, and compared to formal classification schemes, tags are not an acceptable replacement.

But tags aren’t a stand-alone tool. Tagging is the first post-search tool for information organization; tagging only makes sense in a world where Google has already become normal. Traditional cataloging systems, for all their faults (faults well understood by the catalogers themselves, I might add) had one clear reason for their continued viability: there was no real alternative. Pre-digital finding tools — book indices and card catalogs and thesauri and so on — were the only game in town.

No longer. Full text indexing, link analysis, trust networks, and related techniques now accomplish about 80% of what classification used to do for us. The reason I wrote, earlier, that tagging is displacing classification, rather than replacing it, is that tagging is merely handling the residual value that comes from labeling in a world where search has already taken over many of the important functions previously handled by classification systems.

The question is not whether tagging systems can do everything formal classification schemes used to do — they can’t, but they don’t need to. The question is: which is a better fit for the requirements of labeling in a post-search world — tagging, or formal classification? And my answer is tagging.

This is, yes, irrationally exuberant. To predict that something that’s been around less than two years is going to displace an activity that’s been around for centuries is, well, you can provide your own label for that belief yourself, but it’s a safe bet ‘rational’ won’t be high in the resulting tag cloud. But here’s the thing: you can’t understand technological change if you assume that new systems replace existing ones when the new systems outperform the existing ones.

One of the seminal moments for me in understanding the net came reading Paul Feyerabend’s Against Method. Feyerabend, a historian and philosopher of science, pointed out that new theories don’t in fact spread because they better explain the facts than old theories. They spread because they are a better mental fit, even when they explain fewer of the details.

One of his examples was the switch from a geo-centric to a helio-centric view of the motion of the planets. The calculations of the motions of Mars made with geo-centric models, with all their retrograde motion, were highly accurate, and when the helio-centric model first appeared, it was less useful for predicting Mars’ position than the well-debugged geo-centric tables. The increase in quality came only after the mental shift to helio-centrism happened, because once astronomers had understood the new model, they took on the job of building new tools.

That’s tagging vs classification. Formal classification has centuries of practice supporting it, while tagging merely has a handful of early, incomplete examples. As a result, tagging does not have anything like the sophistication of classification systems — for tagging to work broadly and well we still need, inter alia, group tags; private tags; better user-defined thesauri; better tools for discovering latent communities; better tools for making time series; better routing labels like for: and via:; better traversal of the resulting for/via graphs; and ways of turning a collection of tags into site navigation, a sort of permanent card-sorting game that continually optimizes site navigation. For tagging to take over from classification, we need all those things and more, and we don’t have them.

But tagging is already a better fit for discovering and reflecting both personal and group mental models; does a better job of handling ambiguous or dynamic cases; provides judgment-related context (‘funny’, ‘cool’); allows better mapping to communities of the like-minded; and is, on top of all of that, cheap cheap cheap. These advantages are driving adoption, and the early adopters are now suffering from the lack of well-developed tools, but new inventions will arise to service those users, and this will lead to more later-but-still-early adopters, sharpening the problems further but with a bigger user base, thus increasing the incentive for still more improvement, lather, rinse, repeat.

You could make a lot of money or win a lot of bar bets when thinking about the digital realm if you compare technologies between hard or easy, rigorous or sloppy, sophisticated or naïve, expensive or cheap, professional or amateur, and then bet on the things that have the most checkmarks in the right hand column. Tagging has a checkmark in all those boxes.


  1. Mentioning “Against Method” reminded me of an article (Discipline vs “Field” Discourse) I wrote many years ago (1991) where I also cited Feyerabend. I revisited my article and saw another good reference there:

    “The Sophists… proposed a theory claiming that the world itself was in full motion and contradiction, and that consequently the motion of la langue was only corresponding to real mobility … language could not express anything fixed or stable, since it was in full motion itself.” (Kristeva, Julia. “Le langage, cet inconnu, Seuil”, Paris, 1981).

    Comment by emilsotirov — August 28, 2005 @ 1:58 pm

  2. “file under Uncategorized… Tags: none”

    I love that.

    Comment by bobdc — August 29, 2005 @ 12:49 pm

  4. particularly thought-provoking for me is this notion: “…ways of turning a collection of tags into site navigation, a sort of permanent card-sorting game that continually optimizes site navigation…” — very much on my mind these days.

    one question, clay, to which i think an answer will help some of us more fully understand your position:

    on a site like, the creator of the content has “categorized” his blog entries in order to create a navigation method for accessing the archived content.

    do you consider this practice to be a form of “self-tagging” or a “classification system” or neither?

    when you propose “most places that offer digital categorization are only in the expensive and frustrating business of classification because they wrongly believe that is the best way to serve their users. That will change…”, would this include the “self-classification” (or whatever it is) of, say,

    i’m a die-hard proponent of the “metadata IS navigation/navigation IS metadata” view, and i very much buy the idea that user-generated metadata can drive navigation that is highly responsive to the mental models of users, but i wonder about publishing content without ANY initial navigational structure at all, in the hopes that the “masses” will eventually accidentally run across it and “tag” it; which, after a critical mass is reached, would finally provide navigational access to that content. for

    so, are you really proposing that site owners/content creators will eventually stop self-classifying their stuff? won’t most (perhaps even the amazons of the world) still want to “seed” their information space with a “default” navigational view?

    Comment by onpause — August 30, 2005 @ 4:46 am

  7. Clay Shirky’s ‘rebuttal’ of Gene Smith failed to counter several of his original arguments and many of us are still no further forward to learning how resource discovery in a globally distributed information environment is to be enhanced by folksonomies or tagging. Even some of the most ardent classification proponents wouldn’t deny that tagging can potentially play a huge role in the task of Personal Information Management (PIM) or for resource discovery within small communities of practice, but it remains totally unclear how such an approach can enhance resource discovery (esp. by subject) for the greater good.

    More generally, I find the number of entirely spurious and misinformed statements within the ‘tagging literature’ to be deeply concerning. For example:

    “But tagging is already a better fit for discovering and reflecting both personal and group mental models; does a better job of handling ambiguous or dynamic cases; provides judgment-related context (‘funny’, ‘cool’); allows better mapping to communities of the like-minded; and is, on top of all of that, cheap cheap cheap”.

    How, for example, can “judgment related context” possibly enhance resource discovery for users in a global information environment?!?! Mr Shirky states that tagging is “cheap, cheap, cheap”??? Well, I guess it depends on your definition of ‘cheap’ and precisely how frugal we want to be in our future information society. The implication of such frugality is, for many, quite obvious. Indeed, every statement in the above excerpt is exceedingly debatable but is nevertheless taken as absolute fact. Where is the evidence? Where? Even a session of rigorous desk-based research would not lead one to make such spurious claims. The evidence simply does not exist.

    The bottom line is that folksonomies and tagging have yet to be exposed to any rigorous scientific enquiry and no amount of blogging can possibly substitute for this. It is something that my colleague and I wish to remedy (particularly since it is an extremely interesting area of research!). However, I’m confident that other work will gradually filter through the international information and computing science communities in due course and we will all be the better for it.

    Comment by CDLR — September 19, 2005 @ 5:14 am

  10. Clay Shirky makes false dichotomies

    There seems to be a misunderstanding in the tagging world that taxonomies are rigid hierarchies defined from authorities on high. Good taxonomies are multifaceted hierarchies which grow among users as trees of knowledge.

    If there is such a thing as good tagging versus poor tagging, I suggest that good tagging will take on board some of the lessons of multifaceted classification.

    Tagging and classification are the same thing. Tags in blogs are no different from personal classifications. The use of multiple descriptors is an attempt by individuals to create multi-faceted taxonomies.

    As our quick analysis of Technorati’s top tags already indicates, even the simplest multifaceted taxonomy adds value and an element of definition to an otherwise cloudy view.

    The goal of supertaggers is to encourage the best tagging, as well as providing added value of tagging super structures.

    Comment by supertaggers — December 23, 2005 @ 12:49 pm

  12. There is an interesting investigative paper in D-Lib magazine regarding tagging which attempts to square Clay Shirky’s claims with a small empirical study. Conclusions are as you might expect. The conclusions also hint at some of the analyses provided at

    Comment by CDLR — January 23, 2006 @ 8:45 am

