Tuesday, October 18, 2005

A Cool Semantic Search Engine - Finally

Search Engine Watch yesterday wrote about the new specialized medical search engine healthline. I had a look at it and I'm happy to report that it is the first really cool semantic search engine in the wild. Granted, building such a system is not rocket science - but still, I'm not aware of any public website that has a comparable semantic search engine.

The architecture of healthline is a classic "Pre/Post semantic search engine" - the simplest kind of semantic search engine (I'll write about the other possibilities, "pseudosemantic search engine" and "semantic search engine", some other day). At the core of such a search engine is a traditional text index; retrieval is based only on the text of the documents, and no metadata is used whatsoever. The background knowledge is used only to augment a query before it is posed to the text index and, at the end, to enrich the result that is displayed to the user. Once a query comes in, it is first examined for references to the ontology/taxonomy. If some are found, the terms that reference the ontology/taxonomy may be removed or replaced, or more terms may be added. An example for healthline would be a search that uses a casual term for a disease, which is replaced by the standardized medical term. The altered query is then run against the text index. Finally, the result returned from the text index is enriched based on the background knowledge. Healthline enriches the result with links offering query refinement ("narrow your search") and query relaxation ("broaden your search"). As a nice touch they sometimes offer "information maps" (search for "Prostate cancer" to see one) for important topics.
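To make the Pre/Post idea a bit more concrete, here is a minimal Python sketch of the three steps described above: rewrite the query using the taxonomy, run it against a plain text index, then enrich the result with "narrow" and "broaden" suggestions. Everything in it is an assumption made for illustration: the toy taxonomy, the three documents, and the names `Concept`, `rewrite_query`, `text_search` and `search` are all made up, and this is certainly not how healthline is actually implemented.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One taxonomy entry (hypothetical structure, for illustration only)."""
    preferred: str
    synonyms: list = field(default_factory=list)
    broader: list = field(default_factory=list)
    narrower: list = field(default_factory=list)


# Toy background knowledge; a real system would load a medical taxonomy.
TAXONOMY = [
    Concept(
        preferred="myocardial infarction",
        synonyms=["heart attack"],
        broader=["heart disease"],
        narrower=["silent myocardial infarction"],
    ),
]

# Toy document collection; only the text is indexed, no metadata.
DOCUMENTS = {
    1: "Myocardial infarction symptoms include chest pain.",
    2: "Regular exercise lowers the risk of heart disease.",
    3: "Treatment after a myocardial infarction often includes aspirin.",
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Build a plain inverted index: token -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index.setdefault(token, set()).add(doc_id)
    return index


INDEX = build_index(DOCUMENTS)


def rewrite_query(query):
    """Pre step: replace casual terms with the preferred taxonomy term."""
    rewritten = query.lower()
    matched = []
    for concept in TAXONOMY:
        for term in [concept.preferred] + concept.synonyms:
            if term in rewritten:
                rewritten = rewritten.replace(term, concept.preferred)
                matched.append(concept)
                break
    return rewritten, matched


def text_search(index, query):
    """Core step: AND query over the text index; knows nothing of the taxonomy."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)


def search(query):
    """Pre step, text retrieval, then the Post step that enriches the result."""
    rewritten, concepts = rewrite_query(query)
    return {
        "query": rewritten,
        "hits": text_search(INDEX, rewritten),
        "narrow": [n for c in concepts for n in c.narrower],
        "broaden": [b for c in concepts for b in c.broader],
    }
```

Calling `search("heart attack symptoms")` rewrites the query to the medical term before it ever reaches the index, and the "narrow"/"broaden" lists come purely from the taxonomy. The text index itself never sees the background knowledge, which is what keeps this kind of engine so simple to build.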

If you think you need such a search engine you can contact me or one of my employers (fzi or ontoprise). We have already built similar systems (sadly only used in corporate intranets, so I can't link to them here) and would love to make more!

Update: In their response to my feedback, the makers of healthline point out that

we're doing more than just full text index/retrieval.
My guess is that they are increasing recall by using either automatic classification (unlikely, because they would need training data) or latent semantic indexing.