Sunday, December 16, 2012

Chomsky vs Norvig

I respect Peter Norvig, and there is no denying that he has made many contributions to science, but in this argument I tend to side with Chomsky, and here's why:

Article sources:

Chomsky: http://www.theatlantic.com/technology/archive/2012/11/noam-chomsky-on-where-artificial-intelligence-went-wrong/261637/?single_page=true


Short Q-A-Comment excerpt:

Norvig: I take Chomsky's points to be the following:
Chomsky: Statistical language models have had engineering success, but that is irrelevant to science.
Norvig: I agree that engineering success is not the goal or the measure of science. But I observe that science and engineering develop together, and that engineering success shows that something is working right, and so is evidence (but not proof) of a scientifically successful model.
Bobev: The engineering success in the current case can only be evidence of "something working right" with the statistical model - and it has long been proven that statistics is a scientifically successful model.

Chomsky: Accurately modeling linguistic facts is just butterfly collecting (I'd put it as "cataloging butterflies in an attempt to determine how they fly"); what matters in science (and specifically linguistics) is the underlying principles.
Norvig: Science is a combination of gathering facts and making theories; neither can progress on its own. I think Chomsky is wrong to push the needle so far towards theory over facts; in the history of science, the laborious accumulation of facts is the dominant mode, not a novelty. The science of understanding language is no different than other sciences in this respect.
Bobev: I agree that science is about gathering facts, but in the current case the facts being gathered find application not in science but in engineering. What scientific value can be derived from the fact that the probability of "am" following "I" is, say, 50%? Any other facts relate to automating and resolving the difficulties of obtaining, sorting, and storing these probabilities. At the same time, very little is being done toward obtaining, sorting, and analyzing the languages themselves.
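To make concrete what such a "fact" looks like, here is a minimal sketch of how a bigram probability like P("am" | "I") is estimated from counts. The toy corpus is invented purely for illustration; real systems do the same thing over billions of words.

```python
from collections import Counter

# Invented toy corpus; in practice these counts come from billions of words.
corpus = "i am here . i am happy . i was here . you are happy".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

# Maximum-likelihood estimate: P(w2 | w1) = count(w1, w2) / count(w1)
def bigram_prob(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1]

# 2 of the 3 occurrences of "i" are followed by "am"
print(bigram_prob("i", "am"))  # → 0.666...
```

The number itself says nothing about how the mind produces "I am"; it only records how often the pair occurred in this particular corpus.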

Chomsky: Statistical models are incomprehensible; they provide no insight.
Norvig: I agree that it can be difficult to make sense of a model containing billions of parameters. Certainly a human can't understand such a model by inspecting the values of each parameter individually. But one can gain insight by examining the properties of the model—where it succeeds and fails, how well it learns as a function of data, etc.
Bobev: As Chomsky says, it's not that a statistical model cannot provide any insight, but that it cannot provide insight into the question we are interested in: how the brain uses language at the physiological level.

Chomsky: Statistical models may provide an accurate simulation of some phenomena, but the simulation is done completely the wrong way; people don't decide what the third word of a sentence should be by consulting a probability table keyed on the previous two words, rather they map from an internal semantic form to a syntactic tree-structure, which is then linearized into words. This is done without any probability or statistics.
Norvig: I agree that a Markov model of word probabilities cannot model all of language. It is equally true that a concise tree-structure model without probabilities cannot model all of language. What is needed is a probabilistic model that covers words, trees, semantics, context, discourse, etc. Chomsky dismisses all probabilistic models because of shortcomings of particular 50-year old models. I understand how Chomsky arrives at the conclusion that probabilistic models are unnecessary, from his study of the generation of language. But the vast majority of people who study interpretation tasks, such as speech recognition, quickly see that interpretation is an inherently probabilistic problem: given a stream of noisy input to my ears, what did the speaker most likely mean? Einstein said to make everything as simple as possible, but no simpler. Many phenomena in science are stochastic, and the simplest model of them is a probabilistic model; I believe language is such a phenomenon and therefore that probabilistic models are our best tool for representing facts about language, for algorithmically processing language, and for understanding how humans process language.
Bobev: I totally agree that the two approaches should be combined at some point. What I'm unhappy about is that the current focus of the entire field is on probabilistic models. Every new paper is based solely on statistics.


Chomsky: Statistical models have been proven incapable of learning language; therefore language must be innate, so why are these statistical modelers wasting their time on the wrong enterprise?
Norvig: In 1967, Gold's Theorem showed some theoretical limitations of logical deduction on formal mathematical languages. But this result has nothing to do with the task faced by learners of natural language. In any event, by 1969 we knew that probabilistic inference (over probabilistic context-free grammars) is not subject to those limitations (Horning showed that learning of PCFGs is possible). I agree with Chomsky that it is undeniable that humans have some innate capability to learn natural language, but we don't know enough about that capability to rule out probabilistic language representations, nor statistical learning. I think it is much more likely that human language learning involves something like probabilistic and statistical inference, but we just don't know yet.
Bobev: I agree that we cannot rule out the involvement of a probabilistic element in some aspects of language use, but I think it's pretty obvious that language cannot be based only on a probabilistic representation. I cannot cite who proved what and when, but I know that if I can "invent" new words and other people still understand me, I have stepped outside any statistical representation of the language while still adhering to the language model used by the mind. What you are working with then is a "statinguage" - how can a statistical model handle that?
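As a side note on the PCFGs mentioned above: a probabilistic context-free grammar simply attaches a probability to each rewrite rule, and a derivation's probability is the product of the rules it uses. The grammar below is a toy invented for illustration; the rule names and probabilities are assumptions, not anything from Horning's paper.

```python
# Toy PCFG: rule probabilities for the same left-hand side sum to 1.
# All rules and probabilities here are invented for illustration.
rules = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("I",)):       0.5,
    ("NP", ("they",)):    0.5,
    ("VP", ("sleep",)):   0.4,
    ("VP", ("run",)):     0.6,
}

def tree_prob(tree):
    """Probability of a derivation = product of the probabilities of the
    rules used. A tree is (label, child, ...) with plain strings as leaves."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rules[(label, rhs)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p

# P(S -> NP VP) * P(NP -> I) * P(VP -> sleep) = 1.0 * 0.5 * 0.4
print(tree_prob(("S", ("NP", "I"), ("VP", "sleep"))))  # → 0.2
```

This is the kind of model Norvig points to as combining tree structure with probabilities.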

In-line comments on some points made by Norvig

Norvig:  If you have a vocabulary of 100,000 words and a second-order Markov model in which the probability of a word depends on the previous two words, then you need a quadrillion (10^15) probability values to specify the model. The only feasible way to learn these 10^15 values is to gather statistics from data.
Bobev: At 4 bytes per float value, that makes about 3.5 petabytes, and the most generous estimates of the human brain's capacity for non-chemical storage are around 2.5 petabytes. But even if we allow for chemical storage, or assume the brain stores each value in a single byte, the vocabulary we actually use is closer to 300,000 words, which would require far more than 10^15 values. And what if we go to a third-order Markov model? What about learning a second language? What about all other knowledge and memories? To me it's obvious that we do not store such information in our brains. Since our brains use language both more efficiently and more correctly, there must be a different representation of language in the mind.
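The arithmetic behind these estimates can be checked directly; the 4-bytes-per-value figure is the assumption stated above:

```python
# Back-of-the-envelope check of the storage claims above.
vocab = 100_000
params_2nd_order = vocab ** 3        # P(w3 | w1, w2): one value per word triple
print(params_2nd_order)              # 10^15, Norvig's quadrillion

bytes_needed = params_2nd_order * 4  # assuming 4 bytes per float value
print(bytes_needed / 2 ** 50)        # ≈ 3.55 pebibytes

# A 300,000-word vocabulary, or a third-order model, explodes further:
print(300_000 ** 3)                  # 2.7 * 10^16 values
print(vocab ** 4)                    # 10^20 values for a third-order model
```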

Norvig: Clearly, it is inaccurate to say that statistical models (and probabilistic models) have achieved limited success; rather they have achieved a dominant (although not exclusive) position.
Bobev: Statistical models (and probabilistic models) have achieved no success in explaining how language is represented or used in our minds at a physiological level. Statistical models dominate specific language-based contests and tasks because they are cheats - they strive to replicate only the outcome, not the process. And all the "progress" is due to computational power - after all, Bayesian networks have not changed since the 1980s. I consider hybrid models useful if they provide some understanding of how to combine statistical and rule-based processing.

Norvig: Another measure of success is the degree to which an idea captures a community of researchers. As Steve Abney wrote in 1996, "In the space of the last ten years, statistical methods have gone from being virtually unknown in computational linguistics to being a fundamental given. ... anyone who cannot at least use the terminology persuasively risks being mistaken for kitchen help at the ACL [Association for Computational Linguistics] banquet."
Bobev: That is exactly why I'm so upset. Many researchers have been pushed toward statistical methods by conformity. The rapid success of the engineering side of the field created a lot of hype, which has shifted interest, support, and funding away from understanding how language works and toward merely simulating its results.

Norvig: A dictionary definition of science is "the systematic study of the structure and behavior of the physical and natural world through observation and experiment," which stresses accurate modeling over insight,...
Bobev: Wait, what? 1) I don't find any mention of "modeling" in that definition, let alone a stressed one. 2) If you want to read "study of the structure and behavior" as "modeling", it should be read in the sense of "discover the underlying model" rather than "create a model that simulates". It's important to clarify what exactly you will study, because it's one thing to study the human use of language and another to study the machine use of language.

Norvig: It certainly seems that this article is much more focused on "accurately modeling the world" than on "providing insight."
Bobev: Again, where do you see modeling? Science aims at understanding what governs the observed phenomenon - if a paper addresses the efficiency of some electrodes, it does so in order to explain why they behave as they do. Scientists do experiments to confirm a specific idea - an insight, if you will. The discipline that strives to create a system yielding specific results is engineering - the outcomes of that process are called "prototypes", not "experiments".

Norvig: and for the 2010 Nobel Prizes in science:
Physics: for groundbreaking experiments regarding the two-dimensional material graphene
Chemistry: for palladium-catalyzed cross couplings in organic synthesis
Physiology or Medicine: for the development of in vitro fertilization
My conclusion is that 100% of these articles and awards are more about "accurately modeling the world" than they are about "providing insight," although they all have some theoretical insight component as well. I recognize that judging one way or the other is a difficult ill-defined task, and that you shouldn't accept my judgements, because I have an inherent bias.
Bobev: Well, you got it wrong. Unless they had the theoretical insight "component" (probably more of a core, really), no one would consider them science. There is a rule of thumb in science that says experiments should be chosen so that they can clearly prove or disprove a hypothesis you have already formed. There are, of course, accidental discoveries, and sometimes running a random experiment with no clear goal or idea might set you on the right track, but even with this approach you have to fit the results into a theoretical insight.

Norvig:  I repeated the experiment, using a much cruder model with Laplacian smoothing and no categories, trained over the Google Book corpus from 1800 to 1954, and found that (a) is about 10,000 times more probable. If we had a probabilistic model over trees as well as word sequences, we could perhaps do an even better job of computing degree of grammaticality.
Furthermore, the statistical models are capable of delivering the judgment that both sentences are extremely improbable, when compared to, say, "Effective green products sell well." Chomsky's theory, being categorical, cannot make this distinction; all it can distinguish is grammatical/ungrammatical.
Bobev: OK, let's place the probability results for these three sentences on a scale - plot them on the interval (0,1). You haven't provided the exact values, but I have some idea of what numbers such statistics yield. The first two sentences are so close to 0 that you feel it won't be in favor of your argument to show all the zeroes between the decimal point and the meaningful digits. In that respect the first two sentences might differ by a factor of 10,000 and still be really close to each other, both being in the zone of 10^-15 as a value. These two sentences might have got the right probability order just as a fluke - I seriously doubt that the same result would be obtained with the same sentences but the colors swapped. Go through all the colors and let me know if the results aren't wrong at least 50% of the time.
And why would we care that some other sentence is far more probable? What does that have to do with determining whether a given sentence is grammatical? Because that's what we need. If we as humans can determine with certainty whether a sentence is grammatical, the desired model should be able to do the same. Only then can we argue that such a model may be physiologically implemented in the human brain.
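For the record, here is a minimal sketch of the kind of Laplace-smoothed bigram model Norvig describes, on an invented toy corpus rather than the Google Books corpus he used. It shows the mechanism by which an attested sentence comes out more probable than an unattested one - and also why the unattested sentence still gets a nonzero probability rather than a grammatical/ungrammatical verdict.

```python
from collections import Counter

# Invented toy corpus; nothing like the Google Books corpus Norvig used.
corpus = "effective green products sell well . green ideas spread fast".split()
vocab = set(corpus) | {"colorless", "ideas", "sleep", "furiously"}
V = len(vocab)

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def prob(w1, w2):
    # Laplace (add-one) smoothing: unseen bigrams get a small nonzero probability.
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

def sentence_prob(sentence):
    words = sentence.split()
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= prob(w1, w2)
    return p

seen = sentence_prob("effective green products sell well")
unseen = sentence_prob("colorless green ideas sleep furiously")
print(seen, unseen)  # the attested sentence comes out far more probable
```

Note that both values are tiny and both are nonzero; the model ranks sentences but never declares one ungrammatical, which is exactly the objection above.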

Norvig: "All grammars leak."
Bobev: Agreed. We need to incorporate statistics into modeling the human use of language, but it's about more than just probabilities.

Norvig: Since people have to continually understand the uncertain, ambiguous, noisy speech of others, it seems they must be using something like probabilistic reasoning. Chomsky for some reason wants to avoid this, and therefore he must declare the actual facts of language use out of bounds and declare that true linguistics only exists in the mathematical realm.
Bobev: I don't think Chomsky wants to avoid the use of probabilistic methodology, but rather that he's "concerned with discovering a mental reality underlying actual behavior" in humans. He believes that this is the goal of linguistics, while you believe it's the creation of "statistical (or probabilistic) models, which while accurately modeling reality, do not make claims to correspond to the generative process used by nature".
Conclusion: So the whole thing is comparing apples and oranges -- you simply strive for different things.

I personally do not care what the goal of linguistics as a science is, but I tend to side with Chomsky in the view that efforts to build models which "make no claim to correspond to the generative process used by nature" are of no use (or are even detrimental) for discovering what underlies actual behavior in language use.

IMHO the only way to advance both aspects (accurate modeling and discovering the process used by nature) is to go hybrid.
