Showing posts with label Statistics. Show all posts

Tuesday, April 16, 2013

Big Data is Big But is Not Everything

I still consider myself a keen student of statistics and of the use of quantitative approaches to understanding reality or events. For that reason, I am particularly keen to read and understand the claims being made by the "Big Data" movement: that the digitization of transactions and the availability of high-powered chips make it possible today to obtain large data sets for analysis. The claim proceeds to state that big data is now the future and that the availability of information will make all of us very smart and create a deeper understanding of the commercial, social and other transactions of life.

As with all claims that come with conventional wisdom, I am suspicious of the unquestioned exuberance over the possibilities created by "Big Data". And yet the strength of this narrative is such that few people question it, especially as it is now loudly proclaimed by governments and by the large management and business firms that are leaders in providing policy and business advice.

David Brooks, writing in the NYT here, provides an incisive view of the "Big Data" movement and dissects the claims being made about it. He raises two important points. The first is that there are certain areas of individual life in which subjective preferences are still dominant, so it is important for "Big Data" enthusiasts to be alert to the limits of this movement. The second, and to my mind the most important refutation in the article, is the pushback against the claim that the surfeit of data obviates the need to create theories because correlations and other statistical techniques will reveal connections between variables.

This preposterous claim by the "Big Data" fundamentalists that theory is obsolete is rightly questioned by the author. In addition, Nate Silver, who himself is a very creative and competent statistician, tackles the claim in this book. Those who claim that the mere existence of large troves of data makes theory building unnecessary are overstating the case, because any attempt to review and determine the degree of connection between two variables means that a theory already exists about their connection. Of significance too is that prediction and the establishment of linkages between phenomena are poor not because of the absence of data but because of the inability of most professionals to distinguish between the signal and the noise. In other words, a spurious connection may exist, and unless a plausible theory is used to examine the claim, big data will find all manner of connections that are just noise.
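The point about noise can be illustrated with a small simulation. This is only a sketch under assumptions of my own (the function names, the 200 variables, the 20 observations and the seed are all hypothetical and come from neither Brooks nor Silver): it generates a target series of pure noise, compares it against many unrelated noise series, and reports the strongest correlation found.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(n_variables=200, n_observations=20, seed=1):
    """Largest |r| between a noise target and unrelated noise variables."""
    rng = random.Random(seed)  # hypothetical seed, for repeatability only
    target = [rng.gauss(0, 1) for _ in range(n_observations)]
    best = 0.0
    for _ in range(n_variables):
        candidate = [rng.gauss(0, 1) for _ in range(n_observations)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    print(round(strongest_spurious_correlation(), 2))
```

None of the series is related to any other by construction, yet searching across enough of them will usually turn up at least one correlation that looks respectable. Without a theory about why two variables should move together, there is no way to tell such a find from a real signal.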

Just because more data will be conveniently available does not mean that statistical ken will develop in proportion to it. Indeed, my guess and expectation is that the supply of poor statistical reasoning will rise. Society will still need to find the good quantitative thinkers among the "Big Data" crowd.

Tuesday, November 20, 2012

Post-Elections Analysis 2012

A number of columns and purported analyses have already been written about what determined the outcome of the US presidential election, which President B. Obama handily won. Again, I am not a US citizen, but I had some understanding of the race from following closely the various contentions between the two candidates. The outcome confirms to me that, as stated before, most political commentators and pundits, like their economic counterparts, do not know what they speak about. Not only was the conventional wisdom that the race was a statistical dead heat untrue, but many people opted to go with the pundits on TV with known ideological and party biases, while ignoring both Intrade and Nate Silver's blog, which both suggested that Romney had made gains in the last month but was still an underdog late in the game.

To my own disappointment too, many libertarians and market-friendly commentators stuck to a very ignorant mantra that Romney had momentum after the first debate and would win. Their reasoning was that the endorsement of the Tea Party on one side and his stellar record in corporate reengineering on the other were enough. Sorry, they were not, purely because the president was not really as weak as was thought. It was only in a single comment on the Samizdata blog that I saw the caution that the celebrations were unjustified because the information markets still firmly predicted an Obama win.

Looking now at the result, it is clear that there will be many books and tracts trying to explain the manner of Obama's win. Starting with this piece in the NYT, there is emerging evidence that President Obama's campaign team not only worked with a "dream team" of academics on the cutting edge of research, but also employed sophisticated data analysis that informed both media buying and face-to-face outreach. Since both campaigns had professionals advising them, it is essential to compare one set against the other in order to determine how one side bested the other.

Monday, October 29, 2012

Quoting Nate Silver

"In science, one rarely sees all the data point toward one precise conclusion. Real data is noisy - even if the theory is perfect, the strength of the signal will vary." Nate Silver, in The Signal and the Noise: Why So Many Predictions Fail - but Some Don't. Loc. 6920-32

Tuesday, July 10, 2012

Are Hollywood Studios Aware of Epagogix?

In this recent edition of the NYT's Magazine, Adam Davidson wonders how the movie industry makes its money. The interesting article goes through the organization of Hollywood and uses examples of movies to show that the industry spends a lot of money on production and marketing but seems not to have a definite idea about what distinguishes commercially successful movie projects from the rest. For all the glamour, the article confirms my suspicion that commercial success for most movies is really rare, judging by the overall return on investment, which is barely 1%. Only Disney and Pixar, which both make animated movies, seem to have strong brand recognition, to the extent that it matters.

What comes to my mind is whether any of the directors and owners have read my blog post here but, more importantly, whether they have heard of Epagogix. This firm is known to have a quantitative prediction tool for assessing the likelihood of commercial success for movies based on the script. Among the findings of this tool is that it does not recommend the hiring of high-profile movie stars, who are also expensive, because they do not guarantee success for those movies. While I am unaware of published tests on this tool by Epagogix, I am surprised that so few people in Hollywood are using it keenly. So is it ignorance, or have they tried it and found it unsuitable for their purposes?

Monday, May 21, 2012

Big City Team Wins With Bad Soccer

Over the last weekend, a soccer record was established in Europe when Chelsea FC won the European Club Champions Cup (the UEFA Champions League) after beating Bayern Munich of Germany. With that win, Chelsea FC established its own record, because this was the first time that Chelsea became the soccer champions of Europe. Many journalists also noted that it marked the end of a jinx in which the English national team and teams based in England have consistently lost to German opposition on penalties.

Typical of events such as these is that the analysis captures trite issues, with every pundit attempting to explain why the victory and loss were altogether inevitable. To my mind, the one factor that only very keen analysts would have noted and even explored is that this event marked a break with the pattern of big-city clubs failing to win this championship. This obscure fact is one that I realized while reading this book by Simon Kuper and Stefan Szymanski. In one of the chapters, they state the curious fact that none of the soccer clubs based in Europe's largest cities, such as London, Paris or Istanbul, had won that trophy, while teams domiciled in smaller cities such as Marseille, Manchester and Milan had achieved that feat.

With Chelsea's improbable win, it is clear that some change has taken place and that big-city clubs outside Madrid may have found a way to crack this problem. The event is also an achievement in the sense that it proves that a team that consistently pumps a single investor's money into soccer in Europe could ultimately win that prize even if it makes no profit while doing so. Without data and more incisive analysis, it is possible that pundits are making too much of a single event, but one sees that many teams will probably be encouraged to be pragmatic in terms of game strategy and to win regardless of the aesthetics. Maybe South American soccer will be what helps the game retain its claim to being the beautiful game. Chelsea are winners but won with no style or respect for entertainment value. Since the club is making no money while at it, I am not thrilled that the better team, Bayern Munich, lost.

Saturday, November 19, 2011

Reviewing Rugby World Cup 2011

Many sports fans are today aware of the fact that statistical analysis of sports events has become common. This often manifests itself in the application of statistical techniques to the purchase of players and even to the choice of play at set pieces. The one place in which reasoning with data is more overt and subject to clear assessment is in the development of predictive models for ranking teams in a tournament.

A couple of articles written by Eoin O'Connell here and here in Significance Magazine present a clear narrative of the author's reasoning about the pools, the form teams, and the determinants of qualification towards the finals. This model is fascinating because his predictions correctly name seven of the eight teams that qualified from the pool stages. Although the predictive model is presented as a narrative, it is easy to see how the logic of the model worked and how it turned out where results did not go as predicted. Going towards the semi-finals and finals, the second piece states clearly that the New Zealand team has the advantage of form and of superior performance during the pool stage, which makes the team less likely to lose. The model proves correct, though Wales replaced Ireland as the other finalist. The narrative is impressive as it highlights a systematic approach to analysis of the games, with data used to strengthen the stories. What one concludes is that home advantage and team form are hugely powerful predictors of the outcome.

Tuesday, August 30, 2011

Vettel is Not All Time Best in Formula 1

Journalism is a profession that has immense value for bloggers, and it has been the basis upon which a large proportion of this blog has based its analysis or commentary. Many people seem to think that it is only in the area of political coverage that partisanship overtakes objectivity, but I consider that sports journalism too is especially prone to commentary that includes exaggeration or even outright misstatements.

Take as an example this piece by David Coulthard, a retired driver on the Formula 1 circuit, writing in the Daily Telegraph. It is true that during the last race, the Red Bull team, for which Sebastian Vettel is a driver, took the top two positions on the podium, as they have dominated racing this season and the last. The performance of that team has been very good and it has been the most consistent over the last couple of years. That notwithstanding, the article gives the impression that Sebastian Vettel's dominance is so pronounced that he is altogether worthy of consideration for the crown of best driver ever.

I disagree with this, firstly for the reason that comparisons across time such as David Coulthard makes should of necessity come with a caveat that time references are tricky because of changes in rules and circumstances. Secondly, it strikes me as odd that he chooses to concentrate his criticism on Michael Schumacher and Lewis Hamilton as drivers who supposedly rely on the ability to steer very fast and are therefore inferior to the more rounded Vettel. That criticism is allowed, but it is so limited to the supposed faults of two drivers alone that it is unworthy of much consideration. Knowing as well that the writer raced against Michael and obviously came out second best means that he is unlikely to be fully objective.

To my mind, he should be aware enough or honest enough to state that the constant change in regulations makes the sport particularly prone to shifts in dominance that may have nothing to do with the capability of individual drivers. Formula 1 is also peculiar in the sense that drivers are hostage to the reliability and consistency of their teams. As it is today, the Red Bull team has a superior car in stability and fitness for the rules, and the rest are catching up. To conclude, while I defer to his opinions because I have been close to but never driven a Formula 1 machine, I am also reasonably certain that the differences in capability between the drivers are much smaller than those between the cars. Formula 1 is at least as much about engineering as it is about the capability of individual drivers. That explains why drivers in the same team tend to finish in roughly the same positions. One would expect David Coulthard to know that or to ask for data to prove it.

Friday, June 03, 2011

Boris Johnson Argues That Rugby Causes Less Violent Crime

Sometime last week, the UEFA Champions League final took place at Wembley Stadium in London. The general conclusion was that on that day, the better club, FC Barcelona of Spain, won the trophy. As is expected, the punditry went out to express opinions and theories on why FC Barcelona completely dominated the match and what that implies about the football tradition of England, where the vanquished team is based. As usual, most of the commentary was not worth reading, let alone taking seriously.

In my view, the standout analysis of the match and its outcome came from the Mayor of London, Boris Johnson. His article is definitely worth reading and reflecting upon. And yet he too missed a delicate point and made a common error. By way of summary, his argument is that the dominance of FC Barcelona suggests that the approaches chosen by English teams are manifestly inferior. Going further, he posits that England may be better suited to rugby, the sport in which it has produced a recent world-beating team. No errors so far, except that he then asserts that rugby is a distinct sport in the sense that areas with the highest participation in rugby are also areas with the lowest levels of crime.

Be that as it may, it is still a leap in abstraction unless he can prove that the direction of causality runs from rugby towards low violence. Is it not just possible that areas that experience low levels of violence in the first instance are attracted to rugby? Mayor Johnson, correlation is not causation.

Wednesday, April 27, 2011

Luck Comes in Threes for Irish Bettor

Niall Smyth, a part-time poker player, has shown a remarkably lucky streak by taking small bets successively to convert an initial bet of €10 into €550,000. Needless to mention, the odds of winning a bet on a horse race, then taking the prize into a qualifying game for an elite tournament, before beating the field of 614 poker players, are very slim. As reported in this article by the Irish Times, this represents an interesting mix of luck and capability on the part of the player that a statistician would claim should happen very rarely.

Tuesday, January 25, 2011

Sports Franchises in Big Cities

One of the memorable points from reading this very perceptive expose of modern soccer was the observation that the three most populous cities in Europe have never produced a team that won the continental event. And this is in spite of the obvious fact that franchises in these cities would have some of the largest fan bases and access to quality personnel. The authors highlight a number of smaller cities in France, England and Germany that have won the championship more than once.

Reading a NYT article by Richard Sandomir leads to the understanding that a similar fate has met some sports franchises in the US too. To my mind, it seems that the mere size and diversity of a city may not be sufficient for building a winning sports franchise. It is unclear to me whether this is a highly selected sample of poor performers or whether there is a reason why metropolises seem to struggle when it comes to creating winning teams. A data-based review is necessary to settle this point, because the Los Angeles Lakers, for one, have been quite successful as a sports franchise. Is there empirical proof that mid-size cities really have better sports franchises?

Tuesday, January 18, 2011

Statistics Meets the Decline Effect

A week ago, I went through this extremely perceptive piece by Jonah Lehrer in the New Yorker Magazine, which addresses the "decline effect" in scientific studies. The claims are quite strange, because among the things one learns about the validity of a scientific study is that its results should hold if a well-designed study is replicated. And yet the author makes the claim that almost all scientific studies that make a claim find that the power of the effect being measured often erodes with time. To my mind, this must concern scholars of all kinds, because recent policy debates are increasingly reliant on the power of replicable and scientifically sound studies in both the social and physical sciences.

The author of the piece explores a number of reasons to explain this equally strange phenomenon but reaches no firm conclusion, save that there is a discernible bias towards positive results in scientific journals and a tendency to chase results that meet the significance test. The first thing that came to my mind was the possibility that the systematic weakening of results that are initially robust is itself a form of reversion towards the mean, but this too is discounted. So what it leaves is the fact that there is a lot of randomness and that proving an idea one way or another is at some level a matter that has no certainty.
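To see how a bias towards positive results could, on its own, produce an apparent decline, here is a rough simulation. It is a sketch under assumptions of my own that come from nowhere in the article: a modest true effect of 0.2, a standard error of 0.2 for every study, and a rule that a study is "published" only if its first estimate is significantly positive. Every published study is then replicated once.

```python
import random


def simulate_decline(true_effect=0.2, std_error=0.2, n_studies=2000, seed=7):
    """Return (mean published estimate, mean replication estimate).

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    threshold = 1.96 * std_error  # estimate needed to pass the significance test
    published = []
    replications = []
    for _ in range(n_studies):
        first_estimate = rng.gauss(true_effect, std_error)
        if first_estimate > threshold:
            # Only significant positive results make it into the journal.
            published.append(first_estimate)
            # The replication is an independent draw around the same true effect.
            replications.append(rng.gauss(true_effect, std_error))
    mean_published = sum(published) / len(published)
    mean_replication = sum(replications) / len(replications)
    return mean_published, mean_replication


if __name__ == "__main__":
    first, second = simulate_decline()
    print(round(first, 2), round(second, 2))
```

Under these assumptions the published estimates average well above the true effect, while the replications fall back towards it, even though nothing about the underlying effect has changed. This does not settle which explanation is right for the cases in the article; it only shows that selection on significance is enough to generate the pattern.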

Friday, November 19, 2010

Rerunning Formula 1 2008 On 2010 Scoring System

While writing the last blog post, I made reference to the continuing tinkering with the points scoring system in Formula 1 races. I argued that the more recent changes do not seem to make a difference to the outcome of the championship. Towards the end, the blog post stated that given the fact that the 2008 Championship also went down to the very last lap of the last race, it is possible that the scoring system adopted in 2010 would have yielded a different winner. Concerned that I may have overstated, I went to the Formula 1 results archive to test that claim. The simple test involved ranking the championship winner and all other drivers who won at least one race in that year by using the 2010 scoring format.

The table above shows the finishing position of each of the seven drivers in the columns for each of the 18 races of that season, with zeroes as the code for drivers who did not finish that race. Clearly, I was mistaken to think that the championship results would have favored Felipe Massa due to his having won more races than Lewis Hamilton. It remains that while the finish was still very close and was determined by that last race, Lewis Hamilton would still have been the championship winner. Truly, the scoring system requires massive tinkering to generate the excitement that Formula 1 bosses seem to think is lacking.
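For anyone who wants to repeat the test, the re-scoring can be sketched in a few lines. The two points tables are the published ones (25-18-15-12-10-8-6-4-2-1 from 2010, and 10-8-6-5-4-3-2-1 for 2003 to 2009), and zero is used as the code for a non-finish as in the table. The function names and the three-race example are hypothetical; the example is not the real 2008 data, and ties on points are not broken by countback as the real rules would require.

```python
# Points per finishing position; index 0 is first place.
POINTS_2010 = [25, 18, 15, 12, 10, 8, 6, 4, 2, 1]
POINTS_2003_2009 = [10, 8, 6, 5, 4, 3, 2, 1]


def season_points(finishes, points_table):
    """Total points for a list of finishing positions; 0 means did not finish."""
    total = 0
    for position in finishes:
        # Positions outside the scoring places, and the DNF code 0, earn nothing.
        if 1 <= position <= len(points_table):
            total += points_table[position - 1]
    return total


def rank_drivers(results, points_table):
    """Return (driver, points) pairs sorted from most to fewest points."""
    totals = {
        driver: season_points(finishes, points_table)
        for driver, finishes in results.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical three-race results, NOT the real 2008 season.
example = {
    "Driver A": [1, 0, 1],  # two wins and one retirement
    "Driver B": [2, 2, 2],  # consistently second
    "Driver C": [3, 1, 0],
}

if __name__ == "__main__":
    print(rank_drivers(example, POINTS_2003_2009))
    print(rank_drivers(example, POINTS_2010))
```

In this made-up example the consistent Driver B tops the standings under both systems despite never winning a race, which mirrors the pattern found for Hamilton and Massa.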

Thursday, November 18, 2010

Formula 1: Fixing An Unbroken Points System

It is unlikely that any other driver will dominate the sport of Formula 1 racing as much as Michael Schumacher did. Not only did he win the highly technical championship seven times, but he also retains the record for the highest number of individual race wins. One of the more subtle indices of his dominance is the fact that the scoring system was revised by the sport's administrators, ostensibly to add to the competition. While I have remained skeptical of these revisions, which are often sold to the public as instruments for creating more exciting competition, I have never got to compare them against one another to determine whether they do make a difference.

Michael Wallace has gone a step further and written a piece in Significance Magazine here in which he compares the performance of the top five drivers of 2010 under the scoring systems used in this and previous seasons. His findings show that despite the manipulation of the scoring systems, the result would not have changed significantly. To my mind, this shows that sometimes even the people who run sports simply go for change for change's sake. Having said that, the results would probably have been different for the 2008 season, during which Felipe Massa came so close to winning the championship but lost to Lewis Hamilton. I suspect so because Massa had won more races than Hamilton, who had been more consistently placed as a podium finisher.

Monday, August 17, 2009

Usain Bolt Takes 36 Strides to the Record

I have casually reviewed the data from last year's Olympic Games. I told myself that the men's 100 meters sprint record was, in my view, so far off the curve that within two Games the timers would have to include at least three decimal places. My reasoning then was that with a world record established at 9.69 seconds by Usain Bolt, it would soon not be possible to break that record by large margins.

It is a good thing that I did not place a large bet on that, because Usain Bolt ran a race at the World Championships that set the record at 9.58 seconds. To my mind, it was not just the way that this runner made the world's best-trained sprinters look ordinary; the margin was also huge in two ways. First, by my calculation, to cut that record of 9.69 seconds by 0.11 seconds within a year is an amazing feat, notwithstanding the new training methods and technological aids. The second point is that his challengers ran a fast race, with times that would have won gold medals for the first four two Games ago.

As the story in the Irish Times here states, he did it all in 36 strides. That does not sound like many, but they were really quick strides. Just like a bolt of lightning. It just makes me wonder how far the record can go, and with the impression that Bolt has some more sprint power in reserve, we need not start taking 4 decimal places yet. Not until Bolt is done with running.

Thursday, November 20, 2008

Basing Vaccine Administration on Samples

One of the greatest ideas in biological and medical sciences is the development of vaccines. I stand in awe of the effectiveness of vaccines especially since the idea behind their development is for me rather simple though counter-intuitive.

In this story in Slate Magazine, Sydney Spiesel makes the case for the administration of flu vaccines on a regular basis because new strains emerge predictably. As a result, the vaccines developed for the year before are not as effective subsequently.

While the logic behind the argument for vaccination of the most vulnerable is fairly solid, I am less convinced that universal vaccination is necessarily cost-effective. As the piece argues, the benefits of vaccination come from reduced mortality, reduced hospitalization and a reduction in deaths among the most vulnerable. However, given that vaccination for influenza provides herd immunity, in the sense that the vaccinated are less likely to pass on the disease or become ill, it is perhaps less efficient to vaccinate all.

The critical factor then is to determine the proportion of a population that would need vaccination and to ensure that the greatest number are protected. This would require knowledge of the demographic profile of the population in order to concentrate the vaccines on those most vulnerable on the one hand and on those who are the most efficient spreaders on the other. I argue that this would require vaccination by sampling and determining the most appropriate moments for administering the vaccines to the distinct cohorts in the overall sample.
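As a first approximation of that proportion, one can use the textbook herd immunity threshold of 1 - 1/R0, where R0 is the average number of people one case infects in a fully susceptible population. The sketch below is my own illustration and not something from the Slate piece: the function name is hypothetical, the R0 and efficacy values must be supplied by the reader, and the formula assumes that everyone mixes evenly, which ignores exactly the vulnerable and high-spreading cohorts that I argue should be targeted.

```python
def required_coverage(r0, vaccine_efficacy=1.0):
    """Share of the population to vaccinate so each case infects fewer than one other.

    Uses the textbook threshold 1 - 1/R0, scaled up for an imperfect vaccine.
    Returns None when the target cannot be reached by vaccination alone.
    """
    if r0 <= 1:
        return 0.0  # the outbreak fades out without any vaccination
    coverage = (1 - 1 / r0) / vaccine_efficacy
    return coverage if coverage <= 1 else None


if __name__ == "__main__":
    # Hypothetical inputs: R0 of 1.5 and a vaccine that protects 60% of recipients.
    print(required_coverage(1.5, vaccine_efficacy=0.6))
```

Even this crude calculation supports the point that the required proportion can fall well short of the whole population. A version that split the population into cohorts with different contact rates would be needed to say who should be vaccinated first.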