I meant to post on this earlier today – I found a post on Petabyte Scale Data-Analysis and the Scientific Method which had a very interesting and revealing comment about Google:
…. the fact that we don’t know why our search algorithms produce particular results for a particular query. That’s absolutely true. Do a search for a particular set of keywords, and we can’t, without a lot of work, figure out just why our algorithms produced that result. That doesn’t mean that we don’t understand our search. Each component of the search process is well understood and motivated by some real theory of how to discover information from links. The general process is well understood; the specific results are a black box. Mr. Anderson is confusing the fact that we don’t know what the result will be for a particular query with the idea that we don’t know why our system works well. Web-search systems aren’t based on any kind of randomness: they find relationships between pages based on hypotheses about how links between pages can provide information about the subject and quality of the link target. Those hypotheses are implemented and tested – and if they test out – meaning if they produce the results that the hypothesis predicts they should – then they get deployed. It’s basic science: look at data, develop a hypothesis; test the hypothesis. Just because the hypothesis and the test are based on huge quantities of data doesn’t change that.
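The “hypothesis about links” idea the quote describes can be made concrete with a toy sketch. The following is a minimal PageRank-style iteration – one well-known example of a link-based hypothesis (“a link from page A to page B is a vote for B’s quality”), and emphatically not Google’s actual production algorithm. The graph and all names here are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank-style iteration, illustrating a link-based hypothesis.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict of page -> score; scores sum to 1.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}       # start with uniform scores
    for _ in range(iterations):
        # every page keeps a small baseline score
        new = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # a page splits its current score among the pages it links to
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new[target] += share
            else:
                # dangling page with no outlinks: spread its score evenly
                for p in pages:
                    new[p] += damping * rank[page] / n
        rank = new
    return rank

# Hypothetical three-page web: "c" is linked to by both "a" and "b",
# so under the hypothesis it should come out as the highest-quality page.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(links))
```

The point of the sketch is the one the quote makes: the hypothesis (links signal quality) is explicit and testable, even though predicting the exact score a given page ends up with requires actually running the computation.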
The article in question is a recent piece from Wired magazine, titled “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”. Its basic idea is that now that we can retrieve and analyze truly massive quantities of data, the whole scientific project of trying to understand why and how things work is irrelevant. The article proposes that when supercomputers analyze very large sets of data, they can find just about any pattern you’d want within the data.
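That last claim – that big enough data will yield whatever patterns you go looking for – is easy to demonstrate, and it cuts against the article’s thesis rather than for it. Here is a small illustration of my own (not from the article): generate columns of pure random noise, and with enough columns some pair will correlate strongly just by chance.

```python
import random

random.seed(0)

# 200 "variables", each a column of 30 samples of pure Gaussian noise
rows, cols = 30, 200
data = [[random.gauss(0, 1) for _ in range(rows)] for _ in range(cols)]

def corr(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sxy / (sx * sy)

# Search all pairs of columns for the strongest correlation
best = max(
    (abs(corr(data[i], data[j])), i, j)
    for i in range(cols)
    for j in range(i + 1, cols)
)
print(best)  # a strong-looking |r| despite the data being pure noise
```

The “pattern” found here is real in the data but meaningless – which is exactly why the hypothesis-and-test loop described in the quote above matters more, not less, as data sets grow.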