I’m a big fan of Open Calais, the well-known web service that performs named entity, fact, and event extraction on free English text (and, since version 4.0, on French text as well).
In the video tutorial below, I show how, in only 4 minutes, you can build everything you need to call the Open Calais web service from a Java program and perform entity, fact, and event extraction on a news article taken from CNN.
The tutorial assumes that you already have Java and Eclipse for Java EE developers installed, along with an Open Calais API developer key (otherwise go get one here; obtaining the key is a very light process).
Note that you can watch the tutorial in HD.
Also, check the remarks below to reproduce the demo more easily and to get more detailed explanations of what you’ll see in the tutorial.
To see the video in its best quality, just click here.
- The Open Calais web service WSDL shown in the demo is: http://api.opencalais.com/enlighten/?wsdl
- The enlighten method, which lets you call the Open Calais web service via SOAP, takes three parameters:
  - licenseId. This is your API key, which you can get here.
  - paramsXML. These are the input parameters of the service in XML format (documentation here). In the tutorial, for the sake of simplicity, I pass the parameters as a raw String; of course, it is better to read them from a file. Here are the parameters that I used: calaisParams.xml.
  - content. This is the content on which the extraction will be performed. Again, for the sake of simplicity, I pass it as a raw String, and again, it is of course better to read it from a file (put whatever free text you want there). Here is the content I used (from CNN).
- Pasting a long text copied from the web into Java source code can be a nightmare because of the escape characters. The workaround I used in the demo is this general converter, which knows (among other things) where to add the " automatically in the right place.
- Here is the output of the tutorial.
- Here is the list of possible Open Calais outputs.
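To make the remarks about raw Strings a bit more concrete, here is a minimal sketch of two small helpers. It is not the code from the video. The class name CalaisInputs, the method names, and the default file name content.txt are my own assumptions for illustration. The first helper reads paramsXML or content from a file, which avoids the escaping problem entirely. The second does by hand roughly what the converter does: it turns free text into a Java string literal.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CalaisInputs {

    /** Reads a whole UTF-8 text file into a String. */
    public static String readText(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    /** Wraps text in double quotes and escapes the characters that would break a Java literal. */
    public static String toJavaLiteral(String text) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file names; pass your own paths as arguments.
        Path paramsFile = Paths.get(args.length > 0 ? args[0] : "calaisParams.xml");
        Path contentFile = Paths.get(args.length > 1 ? args[1] : "content.txt");

        String paramsXML = readText(paramsFile);
        String content = readText(contentFile);

        // These two Strings can then be handed, together with your licenseId,
        // to the enlighten method of the client generated from the WSDL.
        System.out.println("paramsXML: " + paramsXML.length() + " chars");
        System.out.println("content: " + content.length() + " chars");
    }
}
```

Reading from files is the easier route when the text is long. The literal helper is only useful if you really want the text embedded in the source code, as in the demo.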
If you’re like me, you’re obviously more interested in the algorithms behind the scenes. To learn more about the methods and algorithms involved, you can read about morphological analysis, POS tagging, and shallow parsing. On the Open Calais website, they also mention in a discussion that they have developed their own rule-based system with their own programming language. They also use lexicons.
The problems addressed by Open Calais are tough and it’s hard to be perfect, but I think they are doing a pretty good job. It would be interesting to compare the relevance of the results with those of the Alchemy API, which offers pretty much the same service.
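To give a rough feel for what "using lexicons" can mean, here is a toy sketch of a lexicon lookup. This is in no way how Open Calais works internally, which I don't know in detail. The class LexiconTagger, its methods, and the sample entries are all made up for illustration. It simply reports which known surface forms appear as whole words in a text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class LexiconTagger {

    // Maps a surface form (e.g. "CNN") to an entity type (e.g. "Company").
    private final Map<String, String> lexicon = new LinkedHashMap<>();

    public void add(String surfaceForm, String entityType) {
        lexicon.put(surfaceForm, entityType);
    }

    /** Returns "surfaceForm/type" for each lexicon entry found as a whole word, in insertion order. */
    public List<String> tag(String text) {
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, String> entry : lexicon.entrySet()) {
            // \b anchors the match on word boundaries, so "Paris" does not match "Parisian".
            Pattern p = Pattern.compile("\\b" + Pattern.quote(entry.getKey()) + "\\b");
            if (p.matcher(text).find()) {
                found.add(entry.getKey() + "/" + entry.getValue());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        LexiconTagger tagger = new LexiconTagger();
        tagger.add("CNN", "Company");
        tagger.add("Paris", "City");
        System.out.println(tagger.tag("CNN reported from Paris today."));
    }
}
```

A real system would combine such lookups with the morphological analysis, POS tagging, and rules mentioned above to handle ambiguity and unknown names. A plain lookup like this one cannot do that.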