Great post by Pietro Speroni explaining the difference between a tag set (an un- or arbitrarily ordered list of tags associated with a URL) and a tag cloud (a list of such tags ordered by frequency).
He notes that tag clouds tend towards a power law distribution, and goes on to say:
“If a tag cloud was a power law, we could express it as an ordered list of tags, and a number. The number representing the steepness of the curve, and the list representing the tags, ordered from the most popular to the least popular.
“The fact that tag clouds only approximate power laws, means that if we try to express a tag cloud as a power law, we will be making an error, and although we might expect the error eventually to go to zero, at the beginning it might be quite massive. Also global cultural changes, even small ones, might shift the resulting aim from one approximated power law to another, thus temporarily generating a bigger error, until the change has been integrated in the curve.”
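To make the "ordered list plus a number" idea concrete, here is a minimal sketch, assuming the steepness is estimated with an ordinary least-squares line on log(rank) versus log(count). That fitting method, the function names, and the example cloud are all assumptions for illustration, not something taken from Pietro's post.

```python
import math

def fit_power_law(counts):
    """Fit count ~ scale * rank ** -alpha by least squares in log-log space.

    Assumes at least two counts, all greater than zero.
    Returns (alpha, scale); alpha is the "steepness" number.
    """
    ranked = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for count in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope, math.exp(mean_y - slope * mean_x)

def relative_error(counts, alpha, scale):
    """Total absolute gap between real and fitted counts, as a share of all counts."""
    ranked = sorted(counts, reverse=True)
    gap = sum(abs(count - scale * rank ** -alpha)
              for rank, count in enumerate(ranked, start=1))
    return gap / sum(ranked)

def compress_cloud(cloud):
    """Reduce a {tag: count} cloud to (tags ordered by popularity, steepness)."""
    ordered = sorted(cloud, key=cloud.get, reverse=True)
    alpha, _scale = fit_power_law(list(cloud.values()))
    return ordered, alpha

# Hypothetical cloud, invented for this example.
hypothetical_cloud = {"python": 40, "programming": 21, "code": 13,
                      "tutorial": 10, "reference": 8}

ordered_tags, steepness = compress_cloud(hypothetical_cloud)
alpha, scale = fit_power_law(list(hypothetical_cloud.values()))
print(ordered_tags, round(steepness, 2))
print("relative error:", round(relative_error(hypothetical_cloud.values(), alpha, scale), 3))
```

The error figure is the part that matters for the argument: it is what you give up by keeping only the ordered list and the steepness.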
Apropos yesterday’s post about the meaning of pages changing over time, he finds another example of same.
“For example the paper: Clay Shirky: Power Laws, Weblogs, and Inequality, has by now being bookmarked by 113 person on delicious. When it came out the term ‘long tail’, was not used. [...] On the issue of October 2004 the article from Wired: The Long Tail came out. [...] This article changed the way people looked at power law. Thus it changed the way people perceived the previous article from Clay Shirky. At the moment 8 people have tagged his paper as ‘longtail’, and 3 as ‘long_tail’.”
He builds on this intuition to talk about ways of finding not just similarity or difference between two links based on their tag clouds, but of computing distance in information space, an idea with potentially fantastic value. Read the whole thing.
I think Pietro is really onto something here, with one caveat: I think that the goodness or badness of fit between any given tag cloud and a pure power law distribution is likely to persist over time, and that the misfit is itself informative.
Now I like power law distributions as much as the next person (well, way more, actually), but in this case I’m seeing a kind of variability in the tag clouds that isn’t easily explained away as an artifact of small scale.
For example, take two links on del.icio.us as I write this: one pointing to anti-virus software, which has currently been tagged by 91 people, and one about Amazon image hacks, which is rising fast, and which I measured when it had also been tagged by 91 people.
The virus page has 65 unique tags, for a tag/user ratio of 0.71; the Amazon page has 79 unique tags, and a ratio of 0.87. The tail is longer for the Amazon page than for the virus page, and, at the same time, the Amazon head is higher. The tag cloud for the Amazon page begins “42 amazon/18 images/15 hacks/14 web/12 cool”, while the virus page is “20 antivirus/15 virus/10 security/10 free/9 online”, a much flatter distribution.
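The arithmetic behind those figures can be checked with a short sketch. The counts are the ones quoted above; the "head drop" measure (count of the top tag divided by count of the fifth) is an assumption of this example, just one rough way to put a number on "flatter".

```python
USERS = 91  # both links were measured at 91 taggers

# Top five tags for each page, as quoted above.
amazon_head = [("amazon", 42), ("images", 18), ("hacks", 15), ("web", 14), ("cool", 12)]
virus_head = [("antivirus", 20), ("virus", 15), ("security", 10), ("free", 10), ("online", 9)]

def tag_user_ratio(unique_tags, users):
    """Unique tags per tagging user; higher means a longer tail."""
    return unique_tags / users

def head_drop(head):
    """Top count divided by last listed count; higher means a steeper head."""
    return head[0][1] / head[-1][1]

print("virus ratio: ", round(tag_user_ratio(65, USERS), 2))
print("amazon ratio:", round(tag_user_ratio(79, USERS), 2))
print("virus head drop: ", round(head_drop(virus_head), 2))
print("amazon head drop:", round(head_drop(amazon_head), 2))
```

On both measures the Amazon page comes out ahead: more unique tags per user, and a sharper fall from the first tag to the fifth.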
I agree with Pietro that the curves will become clearer as the scale increases. However, I doubt that they will also converge on neat enough mappings to power law curves to be mapped to one another based on their coefficient of steepness. (Edit made after mail from Pietro explaining his idea further.) I think the tag cloud for the anti-virus page is more constrained because it’s clearer to the users what that page is about. Put another way, I am betting, based on early observations of tag cloud development, that the curves for each of those two pieces will become more like some idealized curve, but I do not think that the pieces have the same ideal curve.
Now there are source and audience issues here, so what the compressed tag cloud of the anti-virus page could be telling us includes both “that page is obviously about one or a few things” and “the audience for that page has a coherent view of what it is about”, and, of course, the Amazon link is telling us both “that page covers a lot of different issues” and “several different audiences have tagged it based on local perspectives.”
It will be hard to isolate those effects, but I am convinced that they are there, and that waiting for every tag cloud to settle into a pure power law is actually giving up on some really valuable information that is better regarded as communication about a particular intersection of content and community than simply being an artifact of small scale.
Hello Clay,
thanks for your kind words.
I must have been quite drunk if I ever meant that two different tag clouds, coming from different URLs, would tend to the same idealised curve. Of course they don’t or there would be no gold to dig, nor distance to measure.
Regards,
Pietro Speroni
Comment by pietrosperoni — May 25, 2005 @ 10:21 am
Tag Clouds and Recommeding Systems
Tags and Recommenders: Is there some synergy? There is a very interesting post by Pietro Speroni on tag sets and tag clouds. This was followed by some discussion by Clay Shirky on the fit of tag distributions to power laws,…
Trackback by shimenawa — May 25, 2005 @ 4:16 pm
An interesting aspect of the idea of applying a metric to “tag space” is that it only works if enough people tag the URLs. Until this critical point is reached, you might have a URL tagged with “tags” and “shirky”, and another with “tagging” and “cshirky”, and their distance would be maximal for two tags. This effect will diminish as the number of people tagging the URLs increases, but will always be present for less popular tags.
Comment by Adam — May 25, 2005 @ 4:25 pm
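Adam's example can be sketched in a few lines. The metric here is cosine distance over tag counts, which is an assumption chosen for illustration; nothing in this thread says it is the metric Pietro has in mind. The "later" clouds are invented numbers.

```python
import math

def cosine_distance(cloud_a, cloud_b):
    """1 - cosine similarity of two {tag: count} clouds.

    0.0 means identical proportions, 1.0 means no tags in common.
    """
    shared = set(cloud_a) & set(cloud_b)
    dot = sum(cloud_a[tag] * cloud_b[tag] for tag in shared)
    norm_a = math.sqrt(sum(v * v for v in cloud_a.values()))
    norm_b = math.sqrt(sum(v * v for v in cloud_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty cloud is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

# Early on: one tagger per URL, near-synonyms but no exact overlap.
url_a_early = {"tags": 1, "shirky": 1}
url_b_early = {"tagging": 1, "cshirky": 1}

# Later (hypothetical counts): more taggers, so the vocabularies overlap.
url_a_later = {"tags": 5, "tagging": 4, "shirky": 6, "cshirky": 1}
url_b_later = {"tags": 3, "tagging": 6, "shirky": 4, "cshirky": 2}

print("early distance:", cosine_distance(url_a_early, url_b_early))
print("later distance:", round(cosine_distance(url_a_later, url_b_later), 3))
```

With exact-match tags the early clouds share nothing, so they sit at the maximum distance even though a human would call them close. Only more taggers, or some merging of variant spellings, pulls them together.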
A possible experimental hypothesis: places description will converge
I was recently following this discussion ([1], [2]) on Tag Clouds and the fact that in the long run, tags will converge to a Tag Set. I think the same can be verified for annotations to a shared map that are used to describe a place. In the same wa…
Trackback by Mauro Cherubini's weblog — May 30, 2005 @ 12:05 pm
They do converge. And fairly quickly.
http://www.terrellrussell.com/projects/cloudalicious/
Pietro has written more since posting his original hypothesis – based on some of the Cloudalicious graphs:
http://blog.pietrosperoni.it/2005/05/28/tagclouds-and-cultural-changes/
Comment by terrellrussell — May 30, 2005 @ 5:08 pm
[...] his text On Tag Clouds, Metric, Tag Sets and Power Laws, and Clay Shirky comments: Sets Bad, Tag Clouds Good.
This entry was posted
[...]
Pingback by IB Weblog » Blog Archive » Tag in den Wolken — June 1, 2005 @ 8:28 am
Archives are at the heart of decentralized communities
In decentralized, emergent communities, the community archive defines the community over time. Therefore, designers of such communities need to pay attention to the processes by which these archives emerge. The ongoing debate over folksonomy provides…
Trackback by The Community Engine Blog — June 2, 2005 @ 11:32 am
[...] that visually describe, through font size, the occurrence of tags, in this post on tagsonomy.net, a blog about the magical world of categorization. (ho l [...]
Pingback by » Tag Cloud | Central Scrutinizer - [ il web in 92 comode rate ] — June 7, 2005 @ 11:46 am
[...] and that visually describe, through font size, the occurrence of tags; you can find it here on tagsonomy.com, a blog about the magical world of categorization. (h [...]
Pingback by Tag Cloud | Central Scrutinizer - [ il web in 270 comode rate ] — February 15, 2006 @ 10:52 am
Connect people, places, and things to ideas and information.
Look up solstice from Doc Searls. His post is about what you find looking up information on regular search engines vs. what you find using Technorati. "The difference won’t just be the number of finds (as Tristan Louis just studied), but the el…
Trackback by Social networking software - innovation platform - collaboration sites — November 4, 2006 @ 12:28 pm