# Variance: regression, clustering, residual and variance

This is a translation of my recent post in Chinese. I tried to write the way a statistician would talk, after spending so much time around statistics people over the past few years.

-------------------------------------------Start----------------------------

Variance is an interesting word. In statistics it is defined as the "deviation from the center", which corresponds to the formula $\sum (x- \bar{x})^2 / (n-1)$ for a sample, or in matrix form $Var(X) = E(X^2)- E(X)^2=X'X/N-(X'1/N)^2$ for a population (here $1$ is an $N \times 1$ column vector of ones). By definition it is the second central moment, i.e. the average squared distance to the center. It measures how far the distribution spreads from its center: the larger the variance, the sparser the points; the smaller, the denser. This is how it works in the one-dimensional world. Many of you should be familiar with this.
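The two formulas above use different denominators ($n-1$ versus $N$), so they agree only up to a factor of $n/(n-1)$. Here is a minimal sketch of that relationship, using NumPy and a handful of made-up numbers (the data are an assumption for illustration, not from any real data set):

```python
import numpy as np

# Synthetic values, made up for illustration only.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = x.size

# Sample variance: squared deviations from the mean, divided by n - 1.
sample_var = np.sum((x - x.mean()) ** 2) / (n - 1)

# Moment form E(X^2) - E(X)^2 written as X'X/N - (X'1/N)^2,
# which divides by N (population variance).
ones = np.ones(n)
moment_var = x @ x / n - (x @ ones / n) ** 2

# The two differ only by the factor n / (n - 1).
print(sample_var, moment_var * n / (n - 1))
```

For large samples the factor is close to 1, which is why the two forms are often used interchangeably in casual discussion.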

Variance has a close relative called the standard deviation, which is simply the square root of the variance, denoted by $\sigma$. There is also something called six sigma, whose name comes from the 6-sigma coverage of a normal distribution.
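To make "sigma coverage" concrete, here is a small sketch that computes the probability that a normally distributed value falls within $k$ standard deviations of its mean, using the identity $P(|X-\mu| \le k\sigma) = \mathrm{erf}(k/\sqrt{2})$. The function name `normal_coverage` is my own, and whether "six sigma" refers to a band of $\pm 3\sigma$ or $\pm 6\sigma$ depends on the convention, so both are printed:

```python
import math

def normal_coverage(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3, 6):
    print(f"within +/-{k} sigma: {normal_coverage(k):.10f}")
```

The $\pm 3\sigma$ band already covers about 99.73% of the distribution.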

Okay, enough on the one-dimensional case. Let's look at two dimensions. Usually we can visualize the two-dimensional world with a scatter plot. Here is a famous one: Old Faithful.

Old Faithful is a "cone geyser located in Wyoming, in Yellowstone National Park in the United States (wiki)...It is one of the most predictable geographical features on Earth, erupting almost every 91 minutes." We can see there are about two hundred points in this plot. It is a very interesting graph that can tell you a lot about variance.

Here is the intuition. Try to describe this chart in natural language (rather than in statistical or mathematical terms), for example when you take your 6-year-old kid to Yellowstone and he is waiting for the next eruption. What would you tell him if you had this data set? Perhaps "I bet the longer you wait, the longer the next eruption lasts. Let's count the time!". Then the kid glances at your chart and says "No. It tells us that if we wait for more than one hour (70 minutes), then the next eruption will be a longer one (4-5 minutes)". Which way is more accurate?

Okay... stop playing with kids. We now consider the scientific way. Frankly, which model will give us a smaller variance after processing?

Well, always regression first. Such a strong positive relationship, right? (No causality... just correlation.)

Now we obtain a significantly positive slope, though the R-squared of the linear model is only 81% (could the fit be better?). Let's look at the residuals.
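For readers who want to reproduce this step, here is a minimal sketch of a straight-line fit and its R-squared. The data below are synthetic stand-ins that I made up to resemble a two-group scatter plot; they are not the real Old Faithful measurements. I am also assuming that x is the eruption duration and y is the waiting time, which is what the slope of about 10 and the "x>3.5" rule mentioned later suggest.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the scatter plot: a group of short eruptions and a
# group of long ones. These numbers are made up, not the real measurements.
eruption = np.concatenate([rng.normal(2.0, 0.3, 100), rng.normal(4.3, 0.4, 100)])
waiting = 33.0 + 10.0 * eruption + rng.normal(0.0, 6.0, eruption.size)

# Ordinary least squares fit of waiting time on eruption duration.
# np.polyfit returns coefficients from the highest degree down.
slope, intercept = np.polyfit(eruption, waiting, deg=1)
fitted = intercept + slope * eruption
residuals = waiting - fitted

# R-squared = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((waiting - waiting.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(f"slope={slope:.2f}, R^2={r_squared:.2f}")
```

Plotting `residuals` against `eruption` gives the kind of residual chart discussed next.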

It looks like the residuals are widely scattered... (ideal residuals are white noise, which carries no information). In this residual chart we can roughly identify two clusters, so why don't we try clustering?

Before running any program, let's quickly review the foundations of the K-means algorithm. In a 2-D world, we define the center as $(\bar{x}, \bar{y})$; the 2-D variance is then the sum of the squared distances from each point to the center.

The blue point is the center. No need to worry about the outlier's impact on the mean too much...it looks good for now. Wait... doesn't it feel like the starry sky at night? Just a quick trick and I promise I will go back to the key point.

For a linear regression model, we look at the sum of squared residuals: the smaller it is, the better the fit. For clustering methods, we can look at a similar measure: the sum of squared distances to the center within each cluster. K-means is computed by numerical iteration and its goal is to minimize this second central moment (refer to its loss function). We can try to cluster these stars into two galaxies here.
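The loss described above can be sketched with a bare-bones version of the usual K-means iteration (often called Lloyd's algorithm). This is an illustrative implementation of my own, not the program used for the charts in this post; the function names and the synthetic "stars" are assumptions.

```python
import numpy as np

def squared_distances(points, centers):
    """Squared Euclidean distance from every point to every center."""
    return ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; returns (centers, labels, within-cluster sum of squares)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = squared_distances(points, centers).argmin(axis=1)
        # Update step: each center moves to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment and loss against the final centers.
    dist2 = squared_distances(points, centers)
    labels = dist2.argmin(axis=1)
    wcss = dist2.min(axis=1).sum()
    return centers, labels, wcss

rng = np.random.default_rng(1)
# Synthetic "two galaxies", made up for illustration.
stars = np.vstack([
    rng.normal([2.0, 55.0], [0.3, 6.0], size=(100, 2)),
    rng.normal([4.3, 80.0], [0.4, 6.0], size=(100, 2)),
])
centers, labels, wcss = kmeans(stars, k=2)
total_ss = ((stars - stars.mean(axis=0)) ** 2).sum()
print(f"share of total variance explained by 2 clusters: {1 - wcss / total_ss:.2f}")
```

The ratio `1 - wcss / total_ss` plays the same role for clustering that R-squared plays for regression. Note that the two coordinates here are on different scales, so the larger one dominates the distance; standardizing the columns first is a common choice.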

After clustering, we can calculate the residuals in a similar way, as the distance to the center that represents each cluster's position. Then we plot the residual points.

The red ones are from K-means, while the blue ones come from the previous regression. They look similar, right? So back to the conversation with the kid: both of you are right, with about 80% accuracy.

Shall we do the regression again for each cluster?

Not much improvement. After clustering + regression the R-squared increases to 84% (+3 points). This is because within each cluster it is hard to find any linear pattern in the residuals: the regression line's slope drops from 10 to 6 and 4 respectively, and each sub-regression only delivers an R-squared of less than 10%, so not much information is left after clustering. Anyway, it is better than a simple regression for sure. (The reason we use k-means rather than a simple rule like x>3.5 is that k-means gives a clustering optimized against its loss function.)
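The "better for sure" claim holds in-sample because a single line is a special case of one line per cluster, so the pooled residual sum of squares cannot go up. Here is a rough sketch of the pooled R-squared. The helper `pooled_r_squared` is hypothetical, the data are synthetic, and the labels are simply the known synthetic groups standing in for a k-means output.

```python
import numpy as np

def pooled_r_squared(x, y, labels):
    """R-squared of a model that fits one straight line per cluster."""
    ss_res = 0.0
    for j in np.unique(labels):
        mask = labels == j
        slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
        ss_res += np.sum((y[mask] - (intercept + slope * x[mask])) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
# Synthetic two-group data, made up for illustration.
x = np.concatenate([rng.normal(2.0, 0.3, 100), rng.normal(4.3, 0.4, 100)])
y = 33.0 + 10.0 * x + rng.normal(0.0, 6.0, x.size)
# Known synthetic groups, standing in for cluster labels from k-means.
labels = np.repeat([0, 1], 100)

single = pooled_r_squared(x, y, np.zeros(x.size, dtype=int))
per_cluster = pooled_r_squared(x, y, labels)
print(f"one line: {single:.3f}, one line per cluster: {per_cluster:.3f}")
```

The in-sample gain says nothing about out-of-sample performance, which is where the overfitting concern comes in.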

Here is another question: why don't we use 3 or 5 clusters? It is mostly about overfitting, since there are only about 200 points here. If the sample size is big, then we can try more clusters.

Fair enough. Of course statisticians won't be satisfied with these findings. The residual chart carries an important piece of information: the residuals do not look like white noise, and in particular their spread is not constant. Statisticians call non-constant residual variance heteroscedasticity. There are many forms of heteroscedasticity. The simplest one is that the spread of the residuals increases as x increases. Other cases are shown in the following figure.
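A crude way to see the simplest form (spread growing with x) is to fit a line, split the residuals into a low-x half and a high-x half, and compare their variances. The sketch below does this on synthetic data that I generated with noise proportional to x; it is loosely in the spirit of the Goldfeld-Quandt test, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 5.0, 400)
# Synthetic data whose noise standard deviation grows with x.
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5 * x)

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Compare residual variance in the lower and upper halves of x.
order = np.argsort(x)
low, high = residuals[order[:200]], residuals[order[200:]]
ratio = high.var(ddof=1) / low.var(ddof=1)
print(f"variance ratio (high x / low x): {ratio:.2f}")
```

A ratio well above 1 suggests that the residual spread grows with x; a ratio near 1 is what constant variance would look like.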

The existence of heteroscedasticity makes our model (which is based on the training data set) less efficient. I'd like to say that statistical modelling is a process of fighting with the distribution of the residuals: if we can diagnose any pattern, then there is a way to improve the model. Econometricians like to call the residuals the "rubbish bin", but they are also a gold mine in some sense. Data is a limited resource... wasting it is a luxury.

Some additional notes...

# A few books I want to read

Fortunately or accidentally, I only have two classes this term. They are spread over four days, so I only have a two-hour class each day from Monday to Thursday. Compared to my previous schedule, it is very relaxed.

One advantage is that I now have enough time to read and think. Today I found Becker's book by chance while browsing the literature on "social economics", or socio-economics. It is quite exciting, and I have realized how deep the water might be. Before, I was only relying on a naive intuition that there was something I could contribute soon.

The book I'm talking about now is

Gary S. Becker and Kevin M. Murphy, 2001, Social Economics: Market Behavior in a Social Environment.

Before, I was paying attention mostly to network economics, and it turns out that the two are quite similar in most respects; however, socio-economics is certainly broader.

Moreover, I spent a few hours finishing another book,

Salsburg, D. (2002) The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century

From the title, you can see that this book is about basic statistics. To save time, I read its Chinese translation. It is not very long, but very exciting; maybe I have been, and always will be, attracted by mathematics and statistics. The latter especially, perhaps because I have so many friends in this area, is the field beyond economics that has influenced me the most, and more at the level of concepts and methodology than of techniques or actual methods.

@Roma. Things remain to be clear

Reading this book reminded me of another book I read before, about the famous economist Keynes,

Robert Skidelsky, 2005, John Maynard Keynes: 1883-1946: Economist, Philosopher, Statesman

What impressed me most at that time was not Keynes' contribution to economics, although nobody can neglect that, but his ideas on probability. I still hope that one day I will read Keynes' original book on probability.

I want to read Becker's book only because I need an idea for my history paper. One question I have been trying to answer for a while is: why do we need to care about the network structure? Before, I was only arguing that "summation is a naive way to capture a group's characteristics"; now it seems that I really need to rethink this argument. In addition to the sum or the mean, people have developed distributions to help understand the world; furthermore, by the central limit theorem, the normal distribution can be used in many scenarios. Therefore, under what particular circumstances will summation cause a severe problem?

Another thing I have been thinking about after reading "The Lady Tasting Tea" is a point that still needs a clearer explanation: the debate between the frequentist school and the Bayesian school over the definition of probability. On one hand, we are lucky today that following Bayes' idea is no longer regarded as heterodox; on the other hand, although the idea itself is very simple, making good use of it is still very tricky and should be dealt with carefully.

I'll stop here for now, and see whether I can gain some new insights soon. This year is too short; I need more time to make everything clear.

# Probability, Information and Economics

These days I have been busy reading the biography of John Maynard Keynes, the most famous economist of the past century. One point mentioned in that book attracted my attention: his ideas on probability.

Everyone who has studied macroeconomics must know the term "rational expectations". It is a big topic in its own right. Simply put, as Wikipedia says,

To assume rational expectations is to assume that agents' expectations are wrong at every one instance, but correct on average over long time periods. In other words, although the future is not fully predictable, agents' expectations are assumed not to be systematically biased and use all relevant information in forming expectations of economic variables.

Here I do not want to say much about it. I'd like to mention another area, information economics. Typically, information economics deals with situations where there is asymmetric information between principal and agent. Then, as we all know, there is moral hazard and adverse selection. With the application of game theory, the common issues can be solved. However, I seldom read papers that discuss the role of information in economic activities from other angles. Therefore, following Keynes' idea, I wonder what will happen if the spread of information is introduced into economic activities.

Simply put, probability reflects the fact that we do not know enough about how the real world functions. Therefore, we use probability to describe the combination of every possible result. There is an interesting question here: the normal distribution. I'll talk about it later on.

As the aim of science, we are pursuing the ability to predict. I know many people will have different ideas, but it does not matter much. At the least, we want to know the mechanisms in every particular field. That is, we are pursuing "certainty" instead of "uncertainty". From uncertainty to probability, then to certainty: in this way we come to know the real world much better. It is an old philosophical issue: is there a fixed point?

Then what will happen if knowledge spreads? I do not have a clear understanding yet. The disappearance of probability is too hard to imagine. We can use the "normal distribution" to describe some phenomena, such as people's height and weight. The result is a description of a group, but it is not that accurate for a particular person. To predict a person's height, for example, we would need enough information: if available, his genes, his nutrition, and what he did in the past... Maybe it is too hard to define what "enough" is. Anyway, probability can be replaced under special circumstances.

As a first step, I want to talk about how the spread of information influences social activities. I think we have underestimated the importance of information in economics, or we have no applicable models to explain it. I do not know whether more modern mathematical tools are needed for the explanation. At least, I need to read more about the history of probability, including the famous debate between the frequentist school and the Bayesian school. And maybe more knowledge of psychology and communication is essential. I want to talk about it later, after learning measure theory.