Great post by Pietro Speroni explaining the difference between a tag set (an unordered or arbitrarily ordered list of tags associated with a URL) and a tag cloud (a list of such tags ordered by frequency).

He notes that tag clouds tend towards a power law distribution, and goes on to say:

If a tag cloud was a power law, we could express it as an ordered list of tags, and a number. The number representing the steepness of the curve, and the list representing the tags, ordered from the most popular to the least popular.

The fact that tag clouds only approximate power laws, means that if we try to express a tag cloud as a power law, we will be making an error, and although we might expect the error eventually to go to zero, at the beginning it might be quite massive. Also global cultural changes, even small ones, might shift the resulting aim from one approximated power law to another, thus temporarily generating a bigger error, until the change has been integrated in the curve.

Apropos yesterday’s post about the meaning of pages changing over time, he finds another example of same.

For example the paper: Clay Shirky: Power Laws, Weblogs, and Inequality, has by now being bookmarked by 113 person on delicious. When it came out the term ‘long tail’, was not used. [...] On the issue of October 2004 the article from Wired: The Long Tail came out. [...] This article changed the way people looked at power law. Thus it changed the way people perceived the previous article from Clay Shirky. At the moment 8 people have tagged his paper as ‘longtail’, and 3 as ‘long_tail’.

He builds on this intuition to talk about ways of not just finding similarity or difference between two links based on their tag clouds, but of computing distance in information space, an idea with potentially fantastic value. Read the whole thing.

I think Pietro is really onto something here, with one caveat: I think that the goodness or badness of fit between any given tag cloud and a pure power law distribution is likely to persist over time, and that the misfit is itself informative.

Now I like power law distributions as much as the next person (well, way more, actually), but in this case I’m seeing a kind of variability in the tag clouds that isn’t easily explained away as an artifact of small scale.

For example, take two links on del.icio.us as I write this: one pointing to anti-virus software, which has currently been tagged by 91 people, and one about Amazon image hacks, which is rising fast, and which I measured when it had also been tagged by 91 people.

The virus page has 65 unique tags, for a tag/user ratio of 0.71; the Amazon page has 79 unique tags, and a ratio of 0.87. The tail is longer for the Amazon page than for the virus page, and, at the same time, the Amazon head is higher. The tag cloud for the Amazon page begins “42 amazon/18 images/15 hacks/14 web/12 cool”, while the virus page is “20 antivirus/15 virus/10 security/10 free/9 online”, a much flatter distribution.
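
For anyone who wants to check the arithmetic, here is a minimal Python sketch using only the figures quoted above. The names `tag_user_ratio` and `head_drop` are made up for illustration, and `head_drop` (top count divided by fifth count) is just one crude, assumed way of putting a number on "flatter"; it is not a measure Pietro proposes.

```python
# Figures quoted in the post; both links had been tagged by 91 people.
USERS = 91
UNIQUE_TAGS = {"virus": 65, "amazon": 79}

# Top five tag counts as quoted; the rest of each cloud is not given in the post.
HEAD = {
    "virus": [20, 15, 10, 10, 9],
    "amazon": [42, 18, 15, 14, 12],
}


def tag_user_ratio(unique_tags, users):
    """Unique tags per tagging user."""
    return unique_tags / users


def head_drop(counts):
    """Hypothetical flatness measure: first count divided by last count."""
    return counts[0] / counts[-1]


for page in ("virus", "amazon"):
    ratio = tag_user_ratio(UNIQUE_TAGS[page], USERS)
    drop = head_drop(HEAD[page])
    print(f"{page}: tag/user ratio={ratio:.2f}, head drop={drop:.2f}")
```

The ratios round to the 0.71 and 0.87 given above, and the Amazon head falls off faster over its first five tags than the virus head does.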

I agree with Pietro that the curves will become clearer as the scale increases. However, I doubt that they will also converge on power law curves neatly enough to be mapped to one another based on their coefficient of steepness. (Edit made after mail from Pietro explaining his idea further.) I think the tag cloud for the anti-virus page is more constrained because it’s clearer to the users what that page is about. Put another way, I am betting, based on early observations of tag cloud development, that the curves for each of those two pieces will become more like some idealized curve, but I do not think that the pieces have the same ideal curve.
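
To make "coefficient of steepness" concrete, here is a rough Python sketch that fits a straight line to log(count) against log(rank) for the five head counts quoted for each page. The least-squares fit and the name `fit_steepness` are assumptions chosen for illustration, not Pietro's method, and five points are far too few for a serious fit; the point is only that the two heads give different slopes and different amounts of misfit.

```python
import math


def fit_steepness(counts):
    """Least-squares fit of log(count) = a - s * log(rank).

    Returns (s, rms_error): the steepness s and the root-mean-square
    residual in log space, a rough measure of how badly the line fits.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    rms_error = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return -slope, rms_error


# Head counts as quoted in the post (top five tags only).
amazon = [42, 18, 15, 14, 12]
virus = [20, 15, 10, 10, 9]

for name, counts in (("amazon", amazon), ("virus", virus)):
    steepness, error = fit_steepness(counts)
    print(f"{name}: steepness={steepness:.2f}, rms log error={error:.3f}")
```

On these numbers the Amazon head comes out steeper than the virus head, and neither lies exactly on its fitted line. That residual is the misfit I am arguing is worth keeping.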

Now there are source and audience issues here. What the compressed tag cloud of the anti-virus page could be telling us includes both “that page is obviously about one or a few things” and “the audience for that page has a coherent view of what it is about”. The Amazon link, of course, could be telling us both “that page covers a lot of different issues” and “several different audiences have tagged it based on local perspectives.”

It will be hard to isolate those effects, but I am convinced that they are there, and that waiting for every tag cloud to settle into a pure power law is actually giving up on some really valuable information that is better regarded as communication about a particular intersection of content and community than simply being an artifact of small scale.