Google Hot Trends Clustering: The 100 Hottest Queries Tell You About 67.76 Stories On Average

Did you notice that among the 100 (hourly updated) Google Hot Trends, there are always several hot queries that are related to one another?

Let’s take a look at the Hot Trends for the current hour as I write this post: Hot Trends of September 24 at 11 PM PST (clicking on the keywords won’t work; it is just a local copy of the file at that time). Within a few seconds, we can spot some similar queries. For instance, Hot Trend #5 “sean salisbury” is clearly related to Hot Trend #45 “sean salisbury internet postings” and also to Hot Trend #57 “sean salisbury cell phone incident” (click the picture to enlarge).

SeanClust3

Now, a small quiz: is there a link between Hot Trend #48 “julia grovenburg” and Hot Trend #8 “superfetation”, and what the hell is “superfetation”?

So first, yes, there is a link between those two queries, and you can discover it by clicking on “superfetation”, which gives you its related searches:

superfetationDetails

So if you had time to lose, you could click on each of the 100 queries and use this method to eventually build this cluster of 8 queries:

superfetationClust8

  • The words in the cluster give more insight into what this story is all about: Julia Grovenburg was pregnant and became pregnant again (apparently during the same pregnancy), a phenomenon called superfetation. You can verify it in a news article from the same day:

newsPregnancy

  • Looking at the cluster, you might also think that the baby was a “19 pound baby” at birth, but this is actually a completely different breaking news story, not linked at all to the previous one. This misleading link shows that related searches are a great feature but not an exact science, and sometimes (though not often) errors creep in:

wrongRelatedSearches

I have some intuitions about how those related searches are detected and how those errors happen. It’s beyond the scope of this post, but if you are interested, shoot me an email.

So I implemented a link-based clustering algorithm that plugs into Google Hot Trends data and builds all of this automatically. Two queries are in the same cluster if one of the following 3 conditions is true:

  • the queries themselves are similar
  • one of the queries is similar to one of the related searches of the other
  • one of the related searches of one query is similar to one of the related searches of the other

I used a similarity measure that works well for short texts like queries, along with a blacklist of words so that words like “the” or “a” do not disturb the similarity. I also empirically determined different thresholds for the three cases described above. If you have more questions about any of this, feel free to leave a comment or contact me.
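To make the idea more concrete, here is a minimal Python sketch of this kind of link-based clustering. It is not my actual program: the Jaccard word-overlap similarity, the stop-word list, the three threshold values and the toy related searches are all placeholders chosen for illustration, and I am assuming that links are transitive (if A links to B and B links to C, all three end up in the same cluster).

```python
import re
from itertools import combinations

# Hypothetical blacklist; the real list of ignored words is not given in this post.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "to"}

# Hypothetical thresholds, one per linking case; the real values were tuned empirically.
THRESHOLDS = {"query_query": 0.35, "query_related": 0.6, "related_related": 0.8}


def tokens(text):
    """Lowercased set of words, with blacklisted words removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}


def similarity(a, b):
    """Jaccard overlap of the two word sets (a stand-in for the real measure)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def linked(q1, q2, related):
    """True if at least one of the three conditions listed above holds."""
    r1, r2 = related.get(q1, []), related.get(q2, [])
    # Case 1: the queries themselves are similar.
    if similarity(q1, q2) >= THRESHOLDS["query_query"]:
        return True
    # Case 2: one query is similar to a related search of the other.
    t = THRESHOLDS["query_related"]
    if any(similarity(q1, r) >= t for r in r2) or any(similarity(q2, r) >= t for r in r1):
        return True
    # Case 3: a related search of one is similar to a related search of the other.
    t = THRESHOLDS["related_related"]
    return any(similarity(a, b) >= t for a in r1 for b in r2)


def cluster(queries, related):
    """Group queries into connected components of the 'linked' relation (union-find)."""
    parent = {q: q for q in queries}

    def find(q):
        while parent[q] != q:
            parent[q] = parent[parent[q]]  # path halving keeps the trees shallow
            q = parent[q]
        return q

    for q1, q2 in combinations(queries, 2):
        if linked(q1, q2, related):
            parent[find(q1)] = find(q2)

    groups = {}
    for q in queries:
        groups.setdefault(find(q), []).append(q)
    return list(groups.values())


# Toy input: the queries come from this post, the related searches are made up.
queries = [
    "sean salisbury",
    "sean salisbury internet postings",
    "sean salisbury cell phone incident",
    "julia grovenburg",
    "superfetation",
]
related = {"superfetation": ["julia grovenburg"]}

for group in cluster(queries, related):
    print(group)
```

On this toy input, the three “sean salisbury” queries end up in one cluster and “julia grovenburg” joins “superfetation” in another, the latter only thanks to the (made-up) related search. Comparing every pair of queries is quadratic, which is fine for a list of 100.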

So How Many Clusters Can I Build Out Of The 100 Google Hot Trends Queries?

You got it from this post’s title: 67.76 clusters on average (based on crawled data covering a few months of hot trends). Each cluster is supposed to represent a single “story” or breaking news item. Note that this number also depends on my thresholds, and that other algorithms and/or thresholds (more or less strict) can yield slightly different numbers.

Of course, some errors can also arise, either because of misleading related searches (as shown above) or because in some cases two queries look very similar but are in reality about two different things.

As an example of output, see the file generated for the 100 keywords studied in this post.

What Is It Useful For?

First of all, it is fun :). Second, in information retrieval, order is always better than the opposite. But much more than that: if you run a breaking news website or blog, you’d better use all the keywords of the same cluster in your article, since they are the hottest search queries for the story that cluster represents! From an SEO point of view, I think the interest is pretty clear.

BONUS

If you read the post up to here, I’d like to offer you a small bonus :). It is the HUGEST cluster that I was able to observe when running my program on the last few years of Google Hot Trends data. I think you have already guessed which breaking news it relates to. Check it out!

Update: By coincidence, the day after I wrote this post the Hot Trends list was reduced from 100 to 40, so the screenshots and data above are a souvenir of the older version :).