Wednesday, December 14, 2005

Alexa Web Search Platform Is For Information Extraction

Yesterday's announcement of Alexa's Web Search Platform left me wondering: "Well, this sounds cool, but what exactly are those new applications that could not be built using small-scale crawlers plus Google's/Alexa's old APIs?"
After a few minutes of thought it became clear: Alexa's Web Search Platform is for novel information extraction and text mining applications - all the applications that gather data from web sites that normally doesn't make it into the index. The two examples Alexa gave - search engines that index the metadata of images and music - illustrate this further. So if you have algorithms that can get more information out of websites than Google/MSN/Alexa do (recognizing people in pictures? any kind of complex named entity recognition? NLP in general? recognizing Technorati tags not just on blogs but across the whole web? reading XMP metadata from files?), Alexa offers you the framework to apply them (a minimal sketch of what I mean follows below). Or if you have algorithms that use large corpora of text to learn "something" (ontology learning, anyone?), Alexa is the chance to process a really, really large corpus (my very rough estimate: it's less than $16,500 to process the entire German-speaking web*).
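To make the Technorati example concrete, here is a minimal sketch (in Python) of the kind of extraction pass I mean: scanning one crawled HTML page for Technorati-style rel="tag" links. How you would actually pull page content out of Alexa's platform is not shown - the sample string below just stands in for a crawled page.

```python
from html.parser import HTMLParser

class TagLinkParser(HTMLParser):
    """Collects the targets of <a rel="tag"> links (the Technorati convention)."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # rel can hold several space-separated values, e.g. rel="tag nofollow".
        if tag == "a" and "tag" in (attrs.get("rel") or "").split():
            href = attrs.get("href") or ""
            # By convention the tag name is the last path segment of the link URL.
            self.tags.append(href.rstrip("/").rsplit("/", 1)[-1])

def extract_tags(html: str) -> list[str]:
    """Return all rel="tag" tag names found in one crawled page."""
    parser = TagLinkParser()
    parser.feed(html)
    return parser.tags

if __name__ == "__main__":
    page = '<p>Filed under <a href="http://technorati.com/tag/websearch" rel="tag">websearch</a></p>'
    print(extract_tags(page))  # ['websearch']
```

Run this over every page in the crawl instead of one sample string and you have exactly the kind of whole-web tag index that the old query APIs could never give you.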

*: The entire index holds 100TB. The German-speaking web will be <25% of that, and >75% of that will be images and the like, which leaves <7,000GB of text. Processing this data at $1/50GB comes to $160 (generously assuming that we create an additional 1,000GB of traffic in the process). Estimating the runtime without knowing the algorithm is impossible, but assuming a pretty quick 100kB/s per CPU (what kind of CPU is that anyway?), we spend about 16,000 CPU hours, i.e. $16,000 (better make that algorithm parallel - you could be waiting >2 years ;-) ). The 100GB we use for intermediate storage ($100) doesn't add very much.
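The same back-of-envelope estimate in runnable form, using only the prices implied above ($1 per CPU-hour, $1 per 50GB of traffic, $1 per GB of intermediate storage) - all rough assumptions, not a price sheet:

```python
# Back-of-envelope version of the footnote's estimate.
corpus_gb = 7000          # upper bound: <25% of 100TB is German, <25% of that is text
extra_traffic_gb = 1000   # generous allowance for traffic created while processing
throughput_kb_s = 100     # assumed processing speed per CPU

traffic_cost = (corpus_gb + extra_traffic_gb) / 50          # -> $160
cpu_hours = corpus_gb * 1_000_000 / throughput_kb_s / 3600  # -> ~19,400 h
cpu_cost = cpu_hours                                        # at $1 per CPU-hour
storage_cost = 100                                          # 100GB intermediate storage

print(f"traffic ${traffic_cost:,.0f} + cpu ${cpu_cost:,.0f} "
      f"({cpu_hours:,.0f} CPU hours) + storage ${storage_cost}")
# Note: the full 7,000GB upper bound works out nearer 19,000 CPU hours;
# the rounder 16,000 figure above corresponds to a corpus of roughly 5,800GB.
```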
