Big Data or Pig Data?

I don’t know if my reading of this Orwellian* piece is in sync with what Rameez intended, but he thought it was fine for me to post it here. See what you think:

“Big Data or Pig Data” (A fable on huge amounts of data and why we don’t need models) by Rameez Rahman, computer scientist: posted at Realm of the SCENSCI

 There was a pig who wanted to be a scientist. He was not interested in models. When asked how he planned on making sense of the world, the pig would say in a deep mysterious voice, “I don’t do models: the world is my model” and then with a twinkle in his eyes, look at his interlocutor smugly.

By his phrase, “I don’t do models, the world is my model”, he meant that the world’s data was enough for him, the pig scientist. The more data he had, the pig declared, the more accurately he would be able to predict what might happen in the world.

Around that time, some dogs opened a pub called “Doogle”, which was visited by all the animals in the jungle. The wine was delicious and the traffic at the pub was unprecedented. The dogs became rich and famous; they also obtained a lot of data from the visiting animals. They bought even more pubs and collected even more data about their customers.

Now, they wanted to analyze this data to attract even more customers towards Doogle. The pig saw this as a big opportunity and gathered other like-minded pigs. The drove of pigs helped Doogle in applying pigstatistical methods (ham-correlation formulation etc.) to predict various things, including: the kinds of animals attracted to the kinds of beverages; drinking patterns of different animals; the kinds of tables liked by classes of animals; arrival times; number of glasses Doogle would need in the near future, etc, etc, etc. To an astonishing degree, the pigs made quite accurate predictions using their pigstatistics.

Pigs looking at pig data, and applying pigstatistics

The services of our pigs were acquired by other entities including FaceSlap, Barker, and Snorter, among others. Our heroic pigs helped their clients in outshining the competition. In fact, the pigs’ method of collecting huge amounts of data and then applying pigstatistics to it came to be known as “Pig Data” in their honor.

In the meantime, somewhere in the jungle, the group of owl scientists who had throughout history been making models and theories and performing experiments based on them were now being told that it was all meaningless; that their approach was worthless. The owls didn’t pay any attention, even though everyone else was euphoric. However, if the truth be told, some owls did lose heart and became so demoralized that they gradually transformed into pigs and immersed themselves deep in the world of pig-data!

From time to time, Doogle, FaceSlap and others would make some modifications, such as changing the color of the wine-glass and seeing how quickly people reached for the glass based on the color. Upon analysis of the customers’ reactions, the pigs could then determine which color resulted in the fastest response-time. So this was the era of pig-data. The pigs had won the battle. Pig data was everywhere.

But the fact is that our hero-pig, whom we met at the beginning of our fable, was still not happy. He felt that things were only getting started. He wanted to replace the owls completely. What’s more, he wanted to predict EVERYTHING. He wanted psychohistory, as the ‘good doctor’ of old had dreamt. Yes sir, predicting everything was his goal!

He decided to start his quest by studying falling bodies. As was his norm, he collected data about all instances of all objects falling down all over the place. He now had huge amounts of data, and he applied pigstatistics to it. He discovered that more things fell in the morning and during day-time, when animals were awake, and fewer things fell during the night when animals were sleeping!

He shared his findings in front of the whole jungle, looking directly at the owls, who were also present. The chief owl, called Owlileonewtein, countered that while such information could be useful, it did not explain much. Why did bodies fall? At what rate did they fall? What were the relevant factors, etc?

On hearing this, the pig positively beamed with joy because he had come prepared.

Our Hero ecstatic on discovering the law of falling bodies

He announced proudly that he had found a correlation between the weight of the body and the speed of falling. His stats told him that while heavy things fell at a great speed, light things such as animal hair, bird feathers, etc fell much more slowly. “So therefore”, he thundered, “I have discovered the law of falling bodies. Heavy goes fast; light goes slow.” All the animals clapped in joy. The law of falling bodies had been discovered!

Upon hearing all this, Owlileonewtein, the chief owl, said forcefully, “But this is not correct. If we ignore friction and air resistance, I can tell you that all bodies, regardless of their heaviness, fall at the same rate. Indeed consider a frictionless plane…”

But as soon as he said this, the pig snorted, “Frictionless plane? My dear animals, has anyone ever heard of such an oxymoron?” All animals laughed.

Owlileonewtein protested: “No, based on my model, we can do suitable experiments to test it…”

On hearing this, the pig suddenly got very serious and menacing. He lifted his paw and pointed it at Owlileonewtein, “You sir, are a relic of the past. Your way of doing things is over. Haven’t you heard what my fellow pig scientist, Peter Norpig, head of pig intelligence at Doogle, has said: ‘All models are wrong, and we can learn models from data’? So enough of your models and enough of your model-based experiments. We need neither! All we need is pig-data!” And with this, the pig in his furious excitement stood up on his hind-legs, and shouted, stretching the word ‘pig’ with the full force of his pig personality:
“Piiiiiiiiiiiiiiiiiiiiig!” And the animals responded: “DATA!”

“Piiiiiiiiiiiiiiiiiiiig” — “DATA”! “Heavy goes fast; light goes slow!”

Having demonstrated his power to the owls, as a last act of annihilation, he picked up a stone from the ground and tore away a strand of hair from his tail. Holding one object in each fore-leg, he dropped them at the same time. The stone reached the ground much earlier than the strand. With this, he dusted one fore-leg against the other, and then turned around to show his backside to the owls. He shouted triumphantly, one last time:

“Heavy goes fast; light goes slow!”, “Heavy goes fast; light goes slow!”

“Piiiiiiiiiiiiiiiiiiiig” — “DATA!”

* Think “Animal Farm”.
Share your idea for the “moral of the story” in the comments, as will I. You can see what Rameez wrote in his endnote.  
Categories: Statistics | 22 Comments

22 thoughts on “Big Data or Pig Data?”

  1. That’s great. (And I think Tukey, as well as Newton, would approve.)

    I believe in model-based analysis. The real utility of statistics is in the context of model-based analyses. To tie that thought in with Pig Data, one thing that’s always bothered me about (for example) neural networks is that they lack the ability to extrapolate. (Someone will hopefully correct me if I’m wrong about that.) They’re useful for interpolating – particularly when outputs are nonlinear functions of input variables – but not useful for making predictions outside the range of the data.

    Undoubtedly the fact that I work in the physical sciences colors my view, but physical observables have physical origins and I believe you need models to establish the connections between cause and effect. Write down an equation. Make a prediction. Conduct an experiment. How does observation compare with prediction? Use the discrepancy to develop a deeper understanding of the phenomena being observed. Revise your equation and update your prediction. Repeat as necessary.
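
    The interpolation-versus-extrapolation worry above can be illustrated with a minimal sketch. It does not use a neural network and does not settle the claim about them: as an assumption for illustration, a k-nearest-neighbour averager stands in for a purely data-driven predictor, and the "law" y = x² and the names `knn_predict` and `true_law` are hypothetical.

```python
def knn_predict(train_x, train_y, query, k=2):
    # Average the targets of the k training points closest to the query.
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]
    return sum(train_y[i] for i in nearest) / k

def true_law(x):
    # Hypothetical "physical law" that generates the data: y = x squared.
    return x ** 2

# Observations only cover x = 0, 1, ..., 10.
train_x = [float(x) for x in range(11)]
train_y = [true_law(x) for x in train_x]

inside = 5.5    # within the observed range (interpolation)
outside = 20.0  # far outside the observed range (extrapolation)

interp_error = abs(knn_predict(train_x, train_y, inside) - true_law(inside))
extrap_error = abs(knn_predict(train_x, train_y, outside) - true_law(outside))

print(f"interpolation error at x={inside}: {interp_error}")
print(f"extrapolation error at x={outside}: {extrap_error}")
```

    Inside the observed range the data-driven prediction is close to the law; outside it, the predictor can only repeat values seen near the edge of the data, whereas an equation such as y = x² keeps making testable predictions.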

  2. Thanks Chris G. To avoid biasing reactions, I’m withholding “the moral of the story according to me” until at least 6 people comment.

  3. fredhkw

    I try to use one sentence to conclude the story: Theory without data is useless, while data without theory is dangerous.

    btw @Chris, I don’t think I have heard of the poor extrapolation of ANNs; can you please suggest a reference that I can read? Thanks.

    • Sorry, no reference – just my experience asking people who do neural nets about using their algorithms to extrapolate/forecast. My takeaway is that they’re excellent for classification but not useful for forecasting. Please correct me if that’s not a fair characterization.

  4. Christian Hennig

    I think I have a broader definition of “model” than the guy who wrote this. How on earth do you make predictions from data without assuming anything formal? Even if it’s not a probability model?
    What does pig statistics have to say about predicting the next observation after 1,2,4,4,5,4,6,6,7?

    This is an issue that I have with many methods that are sold to us as “model-free”. Usually it means that their assumptions are just better hidden.
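
    The point about hidden assumptions can be made concrete with the sequence quoted above. The following is only a sketch: the three predictor functions are hypothetical names chosen for illustration, and each one looks "model-free" while quietly assuming something different about how the data continue.

```python
def predict_last_value(ys):
    # Hidden assumption: the series stays where it is (persistence).
    return ys[-1]

def predict_mean(ys):
    # Hidden assumption: the values scatter around a constant level.
    return sum(ys) / len(ys)

def predict_linear_trend(ys):
    # Hidden assumption: a straight-line trend, fitted by ordinary least squares.
    n = len(ys)
    xs = range(n)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # The next observation sits at index n.
    return intercept + slope * n

observations = [1, 2, 4, 4, 5, 4, 6, 6, 7]
predictions = {
    "last_value": predict_last_value(observations),
    "mean": predict_mean(observations),
    "linear_trend": predict_linear_trend(observations),
}
for name, value in predictions.items():
    print(f"{name}: {value:.2f}")
```

    The same nine numbers give three different "next observations" (7, about 4.33, and about 7.67); nothing in the data alone says which assumption is the right one.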

    • Christian: Yes I think Rahman would agree. Presumably, some empirically based rule or another, extracted from associations in the data, directs the predictions.

    • Hi Christian,
      I am the guy who wrote this :-). Yes, I agree with what you say. Even at the level that you talk about, saying that the thing is model-free is disingenuous. Even finding patterns or associations or regularity in data pre-supposes some sort of model or hypothesis about what these are. I agree!

      However, you are right when you say that I wrote the piece with a different concept of what a model is. I hope my comment to Deborah below makes my position clear. In short, when I use these words (model/theory), I mean models/theories about the domain/system under investigation, and that really goes beyond the data, I think. Thanks for reading!

      • But they extract that empirical model from the data, with a certain criterion for model fit. It’s predictive not explanatory.

        • I agree. You will get prediction – approximation to the data – and not understanding and explanation

          • Paul

            You will get predictions that work for a while then fail, sometimes catastrophically.

  5. Deborah,

    Thank you for putting up the post. First, I want to be clear about something: I recognize that Big Data is being (and has been) successfully used in engineering domains. In fact, many colleagues of mine are working on it and doing great stuff. And that’s fine and very useful and no one is denigrating that.

    The problem starts when people start saying that they can do science without models and theories, all based on huge amounts of data. It is said that theories can be induced from the data. This seems like a rather alien concept of science, for the reasons that I tried to bring out in the post. Theories are usually under-determined by the data. Famous examples include Galileo not being able to explain why objects don’t go flying off a rotating earth (and ignoring this little problem); Darwin ignoring the data (the fossil record) and coming up with the gradualist program for evolution; Mendel ignoring a lot of the data that didn’t fit his model, etc. Their models could not have been induced from the data. If we start considering all kinds of data without any model/theory guiding us in selecting, distorting, or even ignoring the data, then it’s hard to see how progress can be made.
    There is another reaction that I have received from people about this post. Some have felt that it is a straw-man attack. What is described doesn’t really happen anywhere, so goes the criticism. Is it really a straw-man attack? I don’t think so. I often quote Sydney Brenner’s views on the genome project and other such data oriented approaches in biology. Sydney says: “The orgy of fact extraction in which everybody is currently engaged has, like most consumer economies, accumulated a vast debt. This is a debt of theory, and some of us are soon going to have an exciting time paying it back – with interest, I hope”.

    I feel that this big-data-oriented approach is being used (sometimes quite proudly) to do “science”, and therefore critical discussion is needed.

    • I don’t think that the pig leader would have gotten to the point of even the crude model on falling bodies without seeking and making assumptions about how to generalize (employing other laws and instruments). I think the fable would end with “From time to time, Doogle, FaceSlap and others, would make some modifications, such as changing the color of the wine-glass and seeing how quickly people reached for the glass based on the color”. This is what Doogle research is about.

      This suffices for such things as determining my highest “lifetime value” to the retailer.
      http://online.wsj.com/article/SB10001424052702304458604577488822667325882.html

    • So what about the pig statistics on genome data? You’re right, it’s scarcely a straw man argument for that domain, but what reasons would you give? I.e., what are some pigstat things being done in genomics?

      • Such approaches end up collecting huge amounts of data. The subsequent step is to find statistical relationships to analyze, for instance, susceptibility or resistance to some disease or other such things. But this happens without any understanding of the underlying biological systems. The hope is that based on huge computing resources and statistical approaches, we can analyze massive amounts of data to separate meaningful indicators from all the noise. Having underlying principles or explanation is not necessary, or so goes the thinking. This basically means that experiments too go out the window. Reverse-engineering is the name of the game, I guess.

        • Rameez: I have been reading Efron’s large-scale inference and various other approaches using microarrays, etc. The screening work, as I see it, goes under the heading of a “purely behavioristic” goal, where controlling the noise in the network of outputs is of central interest, not appraising evidence for specific scientific hypotheses. The statistics are fascinating, but it’s not clear how to judge the success of such screening procedures. Rather, even if I know this method gave low false discovery probabilities and it outputs gene 610 (as potentially associated with disease D), what have I learned about gene 610? Can they go on to test whether it’s truly involved, and in what way? I thought the next step was to do experiments on genes determined to be “interesting” from the purely statistical analyses. No?

          I have no expertise in this arena whatsoever, but I have always had a sneaking suspicion that an advance on substantive knowledge (of various genetic regulatory mechanisms, or whatever) would supersede these seemingly crude screening methods. Even if this happens, it isn’t clear these earlier methods were not needful as a first attempt. By the way, I understand that some of the techniques do take into account biological function in some way; what’s your take on this?

          • Deborah: My knowledge of the arena is pretty limited as well, and also biased in the sense that I have mostly read criticisms of this approach. The basic problem seems to me to be: some in the field think that it is possible to derive models of biological systems from observations of their behavior. Put another way (and hopefully answering your question partly), according to my understanding, some people claim that substantive explanation of, to use your words, “various genetic regulatory mechanisms” (or cellular behavior) can be obtained through an analysis of the behavioral data. All that is needed is more data and sophisticated computer programs and resources.

            I think this is problematic for a host of reasons that we discussed earlier. Having said this, it might eventually turn out that the problem is so complex that the best we can hope for is whatever big-data kind of results that we get, and that we cannot frame theories for such systems. But that is obviously an empirical question and giving up the notion of trying to formulate a theoretical framework from the get-go is kind of a self-fulfilling prophecy of failure, I think.

  6. “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”
    – John Tukey

    Perhaps not directly relevant to the discussion but a quote I think worth sharing.

  7. E. Berk

    Why did the pigs not report on associations between # of falling objects and # of pigs who are awake, and use this to predict one from the other?

    Pigstatistics alone would not have brought them even as far as flawed Newton.

  8. Jeff Walker

    Another version of this story was written as a critique of biology many years ago by some combination of Richard Lewontin, Leigh van Valen, Richard Levins, and Robert MacArthur. It’s a classic read:
    http://www.autodidactproject.org/other/sn-nabi2.html

  9. Pie Tutors

    Wow, it is really a hilarious integration of big data with the pig and its jungle-mates. It is really good to see a different approach to talking about big data and the existing models. I believe the problem is that the different steps in building a model cannot be considered separately. The modeling method to be selected depends on the properties of the data: how many records, how many variables, the proportion of missing values, and the type of outcome to be predicted (for example binary, numerical or categorical). Accordingly, the data preparation activities will also interact with the modeling technique that is chosen. Hence it is quite difficult to say whether models have lost their relevance or data is more important.

I welcome constructive comments for 14-21 days. If you wish to have a comment of yours removed during that time, send me an e-mail.
