Tech 24 May 2005 04:23 pm
MSN Search seminar
Today I attended a seminar about the history of, and the technology behind, the new MSN Search. The speakers were Jim Walsh, Development Manager, and Hugh Williams, Senior Software Design Engineer, both of MSN Search.
The first few minutes were mostly about the history of the service. Up until 2002, Microsoft didn’t really care about search: the MSN home page had a search box that was only 10 characters wide, and they outsourced the service to Inktomi (now owned by Yahoo!). It was not until early 2003 that they decided to write their own search engine from scratch; they unveiled it in January 2005.
As for the technology, I think one of the interesting bits is that, like Google, they run the service on a large cluster of consumer-grade machines, with reliability provided by software (with the proviso that the software itself is also expected to fail every now and then). Unlike Google, however, they run the 64-bit version of Windows Server 2003 on all boxes.
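The speakers didn't show how their software handles failing machines, so the following is only my own minimal sketch of the general idea: treat every machine as unreliable and let the caller move on to another replica when one fails. All the names here (`ReplicaDown`, `query_with_failover`, the toy `broken` and `healthy` replicas) are made up for illustration and have nothing to do with the actual MSN Search code.

```python
class ReplicaDown(Exception):
    """Raised by a (hypothetical) replica that cannot answer a query."""


def query_with_failover(replicas, query):
    """Try each replica in turn and return the first successful answer.

    `replicas` is a list of callables that take a query string and either
    return a list of results or raise ReplicaDown.
    """
    failures = 0
    for replica in replicas:
        try:
            return replica(query)
        except ReplicaDown:
            # A dead machine is an expected event, not a fatal one.
            failures += 1
    raise ReplicaDown(f"all {failures} replicas failed for {query!r}")


# Two toy replicas: one simulates a hardware failure, one works.
def broken(query):
    raise ReplicaDown("disk failure")


def healthy(query):
    return [f"result for {query}"]


results = query_with_failover([broken, healthy], "msn search")
```

A real system would presumably add timeouts, health checks and load balancing on top of this; the point is only that no single cheap box has to be reliable.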
They also discussed their spam-prevention techniques for a while, without delving into too much detail. They claim that 15 to 20% of all web pages reached by their crawler are junk or spam (things that no user wants to see listed as search results, in short) and need to be discarded.
A final interesting point is that approximately 10% of search queries are misspelled (in the USA; less in some other countries, more in others). They have algorithms for dealing with that, and will return results for the correct spelling if there is not enough data for the incorrect one. Very often, though, there are very good results for the misspellings, especially common ones (the example they used was Britney Spears, which users apparently spell in many different ways; curiously enough, that was the same example used by the Google engineers who were here last year).
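The fallback rule described above (keep the query as typed when it has enough results, otherwise switch to a similar, better-covered spelling) can be sketched in a few lines. This is only my own illustration, not the algorithm they presented: the toy counts, the `MIN_RESULTS` threshold and the `resolve_query` function are all assumptions, and the string similarity from Python's standard `difflib` stands in for whatever they actually use.

```python
from difflib import get_close_matches

# Hypothetical toy index: query string -> number of matching documents.
INDEX_COUNTS = {
    "britney spears": 1000,
    "brittany spears": 400,  # a common misspelling with plenty of results
    "britny speers": 2,      # a rare misspelling with almost none
}

MIN_RESULTS = 10  # assumed threshold for "enough data"


def resolve_query(query, counts=INDEX_COUNTS, min_results=MIN_RESULTS):
    """Return the spelling that should actually be searched for."""
    query = query.lower().strip()
    # Enough results for the query as typed: keep it, misspelled or not.
    if counts.get(query, 0) >= min_results:
        return query
    # Otherwise look for a similar spelling that is well covered.
    well_covered = [q for q, n in counts.items() if n >= min_results]
    candidates = get_close_matches(query, well_covered, n=1, cutoff=0.6)
    return candidates[0] if candidates else query


kept = resolve_query("brittany spears")   # popular misspelling, kept as typed
corrected = resolve_query("britny speers")  # rare misspelling, replaced
```

Note that the popular misspelling is left alone, which matches what the speakers said about common misspellings having good results of their own.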
I asked them about plans for indexing non-textual content; the response was that they are working on it, starting with images: their current image search technology comes from a third party and is, in their words, “not very good”. Audio and video will come later.





