Tuesday, April 16, 2013

Big Data is Big But is Not Everything

I still consider myself a keen student of statistics and of quantitative approaches to understanding reality and events. For that reason, I am particularly interested in reading and understanding the claims made by the "Big Data" movement: that the digitization of transactions and the availability of high-powered chips make it possible today to obtain large data sets for analysis. The claim goes on to state that big data is the future, and that the availability of information will make all of us very smart and create a deeper understanding of commercial, social and other transactions of life.

As with all claims that arrive as conventional wisdom, I am suspicious of the unquestioned exuberance over the possibilities created by "Big Data". And yet the narrative is so strong that few people question it, especially now that it is loudly proclaimed by governments and by the large management and business firms that lead in providing policy and business advice.

David Brooks, writing in the NYT here, provides an incisive view of the "Big Data" movement and dissects the claims being made about it. He raises two important points, the first being that there are certain areas of individual life in which subjective preferences are still dominant, so it is important for "Big Data" enthusiasts to be alert to the limits of this movement. To my mind, the most important refutation in the article is the pushback against the claim that the surfeit of data obviates the need to create theories, because correlations and other statistical techniques will reveal the connections between variables on their own.

This preposterous claim by the "Big Data" fundamentalists that theory is obsolete is rightly questioned by the author. In addition, Nate Silver, who is himself a very creative and competent statistician, tackles the claim in this book. Those who claim that the mere existence of large troves of data makes theory building unnecessary are overstating the case, because any attempt to review and determine the degree of connection between two variables already presumes a theory about their connection. Of significance too is that prediction, and the establishing of linkages between phenomena, is poor not because of the absence of data but because of the inability of most professionals to distinguish between signal and noise. In other words, a spurious connection may exist, and unless a plausible theory is used to examine the claim, big data will find all manner of connections that are just noise.
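The point about noise can be made concrete with a small simulation. What follows is only a minimal sketch, not anything taken from Brooks or Silver: it generates columns of purely random, unrelated numbers and counts how many pairs nonetheless look strongly correlated. The function names (pearson, count_spurious_pairs), the choice of 200 variables with 30 observations each, and the 0.5 cutoff for a "strong" correlation are all assumptions picked for illustration.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def count_spurious_pairs(n_variables, n_observations, threshold, seed=0):
    """Count pairs of independent random variables whose sample
    correlation is at least `threshold` in absolute value.

    Every column is drawn independently, so any "strong" pair found
    here is noise by construction. All parameters are illustrative.
    """
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0.0, 1.0) for _ in range(n_observations)]
        for _ in range(n_variables)
    ]
    strong = sum(
        1
        for a, b in itertools.combinations(columns, 2)
        if abs(pearson(a, b)) >= threshold
    )
    total = n_variables * (n_variables - 1) // 2
    return strong, total


if __name__ == "__main__":
    strong, total = count_spurious_pairs(
        n_variables=200, n_observations=30, threshold=0.5
    )
    print(f"{strong} of {total} unrelated pairs have |r| >= 0.5")
```

Since no column has anything to do with any other, every pair the script reports is a connection that is just noise. Adding more variables adds more pairs to search through, which is why a larger trove of data, examined without a theory, tends to turn up more such accidents rather than fewer.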

Just because more data will be conveniently available does not mean that statistical ken will develop in proportion to it. Indeed, my expectation is that the supply of poor statistical reasoning will rise. Society will still need to find the good quantitative thinkers among the large "Big Data" crowd.
