Monthly Archives: September 2009

Google Hot Trends Clustering: The 100 Hottest Queries Tell You About 67.76 Stories On Average

Did you notice that among the 100 (hourly updated) Google Hot Trends, there are always several hot queries that are related to one another?

Let’s take a look at the Hot Trends of the current hour as I’m writing this post: Hot Trends of September 24 at 11 PM PST (clicking on the keywords won’t work, it is just a local copy of the file at that time). In a few seconds, we can spot some similar queries: for instance, Hot Trend #5 “sean salisbury” is clearly related to Hot Trend #45 “sean salisbury internet postings” and also to Hot Trend #57 “sean salisbury cell phone incident” (click the picture to enlarge).

SeanClust3

Now, a small quiz: is there a link between Hot Trend #48 “julia grovenburg” and Hot Trend #8 “superfetation”, and what the hell is “superfetation”?

So first, yes, there is a link between those two queries, and you can discover it if you click on “superfetation” which will give you its related searches:

superfetationDetails

So if you had time to lose, you could click on each of the 100 queries and use this method to eventually build this cluster of 8 queries:

superfetationClust8

  • The words in the cluster give more insight into what this story is all about: Julia Grovenburg was pregnant and became pregnant again (apparently during the same pregnancy), a phenomenon called superfetation. You can verify it in a news article of the same day:

newsPregnancy

  • Looking at the cluster, you might also think that the baby was a “19 pound baby” at birth, but this is actually a completely different breaking news story, not linked at all to the previous one. This misleading link shows that related searches are a great feature but not an exact science, and that sometimes (not often, however) errors can arise in them:

wrongRelatedSearches

I have some intuitions about how those related searches are detected and how those errors happen. It’s beyond the scope of this post, but if you are interested in it, shoot me an email.

So I implemented a link-based clustering algorithm that plugs into the Google Hot Trends data and builds all that stuff automatically. Two queries are in the same cluster if one of the following 3 conditions is true:

  • the queries themselves are similar
  • one of the queries is similar to one of the related searches of the other
  • one of the related searches of one query is similar to one of the related searches of the other

I used a similarity measure that works well for short texts like queries, along with a black list of words such as “the” or “a”, so that they do not disturb the similarity. I also empirically determined different thresholds for the three cases described above. If you have more questions about that stuff, feel free to leave a comment or to contact me.
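The three rules above can be sketched in Java as follows. This is only an illustration, not the program behind the numbers in this post: the Jaccard word overlap, the tiny stop-word list, the class name HotTrendsClusterer and the three threshold parameters are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class HotTrendsClusterer {
    // Illustrative black list (assumption): the real list is not given in the post.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of"));

    // Assumed similarity: Jaccard overlap between the word sets of two short texts.
    static double similarity(String q1, String q2) {
        Set<String> a = tokens(q1);
        Set<String> b = tokens(q2);
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) common.size() / union.size();
    }

    private static Set<String> tokens(String text) {
        Set<String> words = new HashSet<>(Arrays.asList(text.toLowerCase().trim().split("\\s+")));
        words.removeAll(STOP_WORDS);
        words.remove("");
        return words;
    }

    private static boolean anySimilar(List<String> xs, List<String> ys, double threshold) {
        for (String x : xs) {
            for (String y : ys) {
                if (similarity(x, y) >= threshold) {
                    return true;
                }
            }
        }
        return false;
    }

    // related maps each hot query to its (non-null) list of related searches.
    // t1, t2, t3 are the thresholds for rules 1, 2 and 3 respectively.
    static List<List<String>> cluster(Map<String, List<String>> related,
                                      double t1, double t2, double t3) {
        List<String> queries = new ArrayList<>(related.keySet());
        int[] parent = new int[queries.size()];
        for (int i = 0; i < parent.length; i++) {
            parent[i] = i;
        }
        for (int i = 0; i < queries.size(); i++) {
            for (int j = i + 1; j < queries.size(); j++) {
                String qi = queries.get(i);
                String qj = queries.get(j);
                List<String> ri = related.get(qi);
                List<String> rj = related.get(qj);
                boolean linked = similarity(qi, qj) >= t1                      // rule 1
                        || anySimilar(Collections.singletonList(qi), rj, t2)   // rule 2
                        || anySimilar(Collections.singletonList(qj), ri, t2)
                        || anySimilar(ri, rj, t3);                             // rule 3
                if (linked) {
                    parent[find(parent, i)] = find(parent, j); // merge the two clusters
                }
            }
        }
        Map<Integer, List<String>> groups = new LinkedHashMap<>();
        for (int i = 0; i < queries.size(); i++) {
            groups.computeIfAbsent(find(parent, i), k -> new ArrayList<>()).add(queries.get(i));
        }
        return new ArrayList<>(groups.values());
    }

    // Union-find lookup with path halving.
    private static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];
            i = parent[i];
        }
        return i;
    }
}
```

Linking is transitive in this sketch (a union-find merges any two linked queries), which is how chains of related searches end up in one cluster, and also how a single misleading related search can merge two unrelated stories.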

So How Many Clusters Can I Build Out Of The 100 Google Hot Trends Queries?

You got it from the title of this post: 67.76 clusters on average (based on crawled data that represents a few months of hot trends). Each cluster is supposed to represent a single “story” or breaking news item. Note that this number also depends on my thresholds, and that other algorithms and/or thresholds (more or less strict) can produce slightly different numbers.

Of course, some errors can also arise, either because of misleading related searches (as shown above) or because in some cases two queries look very similar but are in reality about two different things.

As an example of output, see the file generated for the 100 keywords studied in this post.

What Is It Useful For?

First of all, it is fun :). Second, in information retrieval, order is always better than the opposite. But much more than that: if you run a breaking news website or blog, you’d better use all the keywords of a cluster in your article, since they are the hottest searched queries for the story that cluster represents! From an SEO point of view, I think the interest is pretty clear.

BONUS

If you read the post up to here, I’d like to offer you a small bonus :). It is the BIGGEST cluster that I was able to observe when running my program on the last few years of Google Hot Trends data. I think you have already guessed which breaking news it is related to. Check it out!

Update: by coincidence, the day after I wrote this post the hot trends list was reduced from 100 to 40, so the screenshots and data above are a souvenir of the older version :).

Open Calais From Java: Get Ready To Extract Entities, Facts And Events In 4 Minutes!

I’m a big fan of Open Calais, the well-known web service that allows you to perform Named Entity, Facts and Events Extraction on free English text (and, since version 4.0, on French text too).

In the video tutorial below, I show you how, in only 4 minutes, you can build what you need to call the Open Calais web service from a Java program and to perform Entity, Facts and Events Extraction on a news article taken from CNN.

The tutorial assumes that you already have Java and Eclipse for Java EE developers installed, along with an Open Calais API developer key (otherwise go get one here; obtaining the key is a very light process).

Note that you can watch the tutorial in HD.

Also, check the remarks below: they make the tutorial easier to reproduce and explain in more detail what you’ll see in it.

To see the video in its best quality, just click here.

http://www.youtube.com/watch?v=zUAvGh42tw4

Remarks/Complementary information:

  • The Open Calais web service WSDL shown in the demo is: http://api.opencalais.com/enlighten/?wsdl
  • The enlighten method, which lets you call the Open Calais web service via SOAP, has three parameters:
    • licenseId. This is your API key, which you can get here.
    • paramsXML. These are the INPUT parameters of the service in XML format (documentation here). In the tutorial, for the sake of simplicity, I put the parameters in a raw String; of course it is better to read them from a file. Here are the parameters that I used: calaisParams.xml.
    • content. This is the content on which the extraction will be performed. Again, for the sake of simplicity, I put it in a raw String, and again, it is of course better to read it from a file (put whatever free text you want there). Here is the content I used (from CNN).

  • Pasting a long text copied from the web into Java source code can be a nightmare because of the characters that need escaping. The workaround I used in the demo is this general converter, which knows (among other things) where to add the ” automatically in the right places.
  • Here is the output of the tutorial.
  • Here is the list of possible Open Calais outputs.
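Reading the two long arguments from files, as recommended above, also sidesteps the escaping problem. Here is a minimal sketch: the class name CalaisInputs and the helper readUtf8 are hypothetical, and the files are assumed to be UTF-8 encoded.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class CalaisInputs {
    // Reads a whole file into a String, assuming it is encoded in UTF-8.
    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }
}
```

For instance, CalaisInputs.readUtf8(Paths.get("calaisParams.xml")) would give the paramsXML value, and the same call on a text file of your choice would give the content value; both strings are then passed to the enlighten method of the client generated from the WSDL.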

If you’re like me, you’re obviously more interested in the algorithms behind the scenes. To know more about the methods/algorithms involved, you can read about morphological analysis, POS tagging and shallow parsing. On the Open Calais website, they also mention in a discussion that they have developed their own rule-based system with their own programming language. They also use lexicons.

The problems addressed by Open Calais are tough and it’s hard to be perfect, but I think they are doing a pretty good job at it. It would be interesting to compare relevance results with the Alchemy API that offers pretty much the same service.

The Trick To Write A Fast (Universal) Java URL Expander

140 characters. Means something to you?

This is pretty much how Twitter (and micro-blogging) was born. Even if some profane Firefox extensions try to work around the limit, you may struggle to stick to the rule as soon as you need to insert (long) URLs.

And here come URL shortening services.

Pretty simple: The long URL http://philippeadjiman.com/blog/2009/09/01/can-you-guess-what-is-the-hottest-trend-of-google-hot-trends/ becomes http://bit.ly/miUkz that will nicely fit in your next tweet.

Now everyone wants to shorten URLs. Here is a list of 90+ URL shortening services (!!), not counting the ones that you can build yourself.

How can we (developers) survive in this jungle if we want to retrieve the real, expanded version of those tons of URLs?

Well, a naive Java version would be:

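import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public String NaiveURLExpander(String address) throws IOException {
        URLConnection conn = new URL(address).openConnection();
        // Opening the stream makes the connection follow every redirect,
        // exactly as a browser would.
        try (InputStream in = conn.getInputStream()) {
            // Once the redirects are followed, getURL() is the final (expanded) address.
            return conn.getURL().toString();
        }
    }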

Nice. It works. But it is terribly slow.
Why? Because when you analyze what happens behind the scenes, the HTTP response returned for the short URL starts with the line

HTTP/1.1 301 Moved

If you check the status code definitions of the HTTP protocol, you will see that this means the URL has moved permanently and that the new one is given in the Location field of the HTTP header. In other words, the above Java code behaves exactly like your browser: it follows the redirection and contacts the target site as well, which is terribly slow.

So here is the trick:

  1. Use an HttpURLConnection object so that you can tell it, via the setInstanceFollowRedirects method, not to follow the redirect automatically while connecting (as a browser would).
  2. Extract the Location value in the HTTP header.

Here you go:

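import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.URL;

public String expandShortURL(String address) throws IOException {
        URL url = new URL(address);

        // Bypass any configured proxy: going through one may increase latency.
        HttpURLConnection connection = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
        // Do not follow the 301 automatically (unlike a browser).
        connection.setInstanceFollowRedirects(false);
        connection.connect();
        // The expanded URL is the value of the Location header (null if there is none).
        String expandedURL = connection.getHeaderField("Location");
        connection.getInputStream().close();
        return expandedURL;
    }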

If you are more of a PHP guy, I saw a similar post that explains how to do it using PHP and curl.

Note that, for the sake of conciseness, I do not handle errors in the code. Also, since I cannot guarantee that all the URL shortening services in the world use this exact approach (but I think most of them do), to make the code really universal you just have to handle the case where the Location field is null. A further improvement would be to find some heuristics to detect whether the input URL is a real one (I mean, not a short one), which would avoid calling the openConnection() bottleneck method uselessly.
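One possible heuristic, sketched below, is to check the host of the URL against a set of known shortener hosts before opening any connection. This is an assumption for illustration, not something measured in this post: the class ShortUrlHeuristics, its two methods and the host set passed in are all hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

final class ShortUrlHeuristics {
    // True if the host of the address belongs to the given set of shortener hosts.
    // Malformed addresses are treated as "not shortened".
    static boolean looksShortened(String address, Set<String> shortenerHosts) {
        try {
            String host = new URI(address).getHost();
            return host != null && shortenerHosts.contains(host.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Falls back to the original address when no Location header was returned.
    static String expandedOrOriginal(String address, String locationHeader) {
        return locationHeader != null ? locationHeader : address;
    }
}
```

A host list such as bit.ly and tcrn.ch can never be complete, so a URL that fails the check may still be a short one; the check only saves connections for URLs that are certainly not worth expanding if you accept that trade-off.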

Finally, if some URL shortening services are not robust enough to check their own URLs, you may also have to deal with a corner case of “transitive shortening” (I’m sure there will always be some curious people who try to shorten an already shortened URL…). Update: check this example: http://bit.ly/4XzVxm points to http://tcrn.ch/6c8AU4, which is itself another short URL!
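Transitive shortening can be handled by repeating the single-hop expansion until no Location comes back. The sketch below is an illustration under assumptions: the class TransitiveExpander, the maxHops limit and the singleHop function (which should return the Location value, or null when there is none) are hypothetical names, and relative Location values are not handled.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.UnaryOperator;

final class TransitiveExpander {
    // Follows redirects one hop at a time until there is no further Location,
    // the hop limit is reached, or a URL repeats (redirect loop).
    static String expandFully(String address, UnaryOperator<String> singleHop, int maxHops) {
        String current = address;
        Set<String> seen = new HashSet<>();
        for (int hop = 0; hop < maxHops && seen.add(current); hop++) {
            String next = singleHop.apply(current);
            if (next == null) {
                break; // no Location header: current is the final URL
            }
            current = next;
        }
        return current;
    }
}
```

Passing the single-hop lookup as a function keeps the loop independent of the network code, so it can be exercised with a plain map of redirects before being wired to a real HTTP call.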

Also, to achieve real performance, such code should be multithreaded. If you have to expand millions of URLs, you will probably need many machines. A time limit should also be added to avoid connections that take too long, with a mechanism similar to a TimerTask.
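A minimal sketch of the multithreaded version with a time limit, using a thread pool instead of a TimerTask. The class ParallelExpander, the expander function (any single-URL expansion) and the parameters are assumptions for illustration; note that the timeout here applies to the whole batch, not to each URL.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

final class ParallelExpander {
    // Expands all URLs on a fixed pool of threads. URLs whose expansion failed or
    // did not finish before the batch timeout are mapped to null.
    static Map<String, String> expandAll(List<String> shortUrls,
                                         Function<String, String> expander,
                                         int threads, long batchTimeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Map<String, String> result = new LinkedHashMap<>();
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String shortUrl : shortUrls) {
                tasks.add(() -> expander.apply(shortUrl));
            }
            // invokeAll cancels every task still running when the timeout expires.
            List<Future<String>> futures =
                    pool.invokeAll(tasks, batchTimeoutMillis, TimeUnit.MILLISECONDS);
            for (int i = 0; i < futures.size(); i++) {
                Future<String> future = futures.get(i);
                String expanded = null;
                if (!future.isCancelled()) {
                    try {
                        expanded = future.get();
                    } catch (ExecutionException e) {
                        // expansion failed: keep null for this URL
                    }
                }
                result.put(shortUrls.get(i), expanded);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        } finally {
            pool.shutdownNow();
        }
        return result;
    }
}
```

For a limit per connection rather than per batch, the setConnectTimeout and setReadTimeout methods of URLConnection can be set inside the expander itself.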

Note that this trick makes the code 5 to 6 times faster. When it comes to dealing with millions of short URLs, it makes a difference.

Can You Guess What Is The Hottest Trend Of Google Hot Trends?

Whether you work in SEO, are a “trends hacker”, or love doing useless comparisons like hanukkah vs passover as I do, you obviously know the fantastic Google Trends tool.

I’m even more fascinated by the Google Hot Trends functionality, which shows the 100 hottest English queries typed in the world right now (actually the 100 fastest-rising ones in the current hour; otherwise you would always see generic terms like ‘weather’).

I asked myself a simple question: are there some queries that keep appearing over and over in this top 100 list? Can we discover patterns of queries? To answer it, I wrote, for fun, a simple crawler that fetches the daily list for every day since the service started (May 15, 2007), and I generated a list of the hottest phrases (meaning the hottest n-grams of words, not queries).
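Counting the hottest n-grams over the crawled queries can be sketched as below. This is an illustration under assumptions, not the actual crawler code: the class NGramCounter, the whitespace tokenisation and the maxN parameter are hypothetical choices.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NGramCounter {
    // Counts every word n-gram (for n from 1 to maxN) over all the given queries.
    // Every occurrence counts, including repeats within a single query or day.
    static Map<String, Integer> count(List<String> queries, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (String query : queries) {
            String normalized = query.toLowerCase().trim();
            if (normalized.isEmpty()) {
                continue;
            }
            String[] words = normalized.split("\\s+");
            for (int n = 1; n <= maxN; n++) {
                for (int start = 0; start + n <= words.length; start++) {
                    String phrase = String.join(" ", Arrays.copyOfRange(words, start, start + n));
                    counts.merge(phrase, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Sorting the resulting map by value then gives the ranking of phrases; restricting the input to the top 10, top 5 or top 1 query of each day gives the other settings.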

Can you guess if there is a clear winner?

Actually there is one: the phrase “lyrics”. As of today (August 31, 2009), it always comes out as the most frequent hot keyword, in different settings:

  • 759 occurrences if you consider the whole daily top 100 list. Think about it: since May 15, 2007, it’s been 839 days (thanks Jeffrey). Even if it sometimes appears several times in a single day, it means that almost every day the word lyrics appears in the 100 hottest English queries in the world!
  • 207 occurrences if you consider only the daily top 10 list.
  • 124 occurrences if you consider only the daily top 5 list.
  • 34 occurrences if you consider only the daily hottest keyword.

But again, ‘lyrics’ is always the top-ranked phrase in all the lists I generated. It does seem to be a decreasing trend, however.

What about other phrases? Here are a few other examples of the top phrases appearing over and over in the daily top queries. Note that you don’t necessarily want to build a business around one of those hot topics, since all of them are in general already overcrowded niches.

What about patterns? If you perform some entity extraction, you can observe recurring patterns like ‘XXX death’ or ‘XXX divorce’, where XXX is the name of a celebrity. I also noticed that users are much more interested in celebrities’ divorces than in their marriages :).

In summary, Google Hot Trends is fun. Amid the new real-time web buzz, this service is not really meant to be a competitor, but it is still my favorite way of feeling the pulse of the web.