Tag Archives: A/B test

Several ways to solve a least square (regression)

As far as I know, there are several ways to solve a linear regression with computers. Here is a summary.

  • Native close form solution: just (X'X)^{-1}(X'Y). We can always solve that inverse matrix. It works fine for small dataset but to inverse a large matrix might be very computations expensive.
  • QR decomposition: this is the default method in R when you run lm(). In short, QR decomposition is to decompose the X matrix to X =¬†QR where Q is an¬†orthogonal matrix and R is an upper triangular matrix. Therefore,  X'X \beta = X'Y and then R'Q'QR \beta = R'Q'Y then R'R \beta = R'Q'Ythen 'R \beta = Q'Y. Because R is a upper triangular matrix then we can get beta directly after computing Q'Y. (More details)
  • Regression anatomy formula: my boss mentioned it (thanks man! I never notice that) and I read the book Mostly Harmless Econometrics again today, on page 36 footnote, there is the regression anatomy formula. Basically if you have solved a regression already and just want to add an additional control variable, then you can follow this approach to make the computation easy.
    Screen Shot 2015-04-17 at 5.20.12 PM
    Especially if you have already computed a simple A/B test (i.e. only one dummy variable on the right hand side), then you can obtain such residues directly without running a real regression and then compute the estimate for your additional control variables straightforwardly. The variance estimate also follows.
  • Bootstrap: in most case bootstrap is expensive because you need to re-draw repeatably from your sample. However, if it is a very large data and it is naturally distribution over a parallel file distribution system (e.g. Hadoop), then draw from each node could be the best map-reduce strategy you may adopt in this case. As far as I know, the rHadoop package accommodates such idea for their parallel lm() function (or map-reduce algorithm).

Any other ways?

Why I think p-value is less important (for business decisions)

This article only stands for my own opinion.

When the Internet companies start to embrace experimentation (e.g. A/B Testing), they are told that this is the scientific way to make a judgement and help business decisions. Then statisticians teach software engineers the basic t-test and p-values etc. I am not sure how many developers really understand those complicated statistical theories, but they use it as way -- as a practical method to verify if their product works.

What do they look at? Well, p-value. They have no time to think about what "Average Treatment Effect" is or what the hell "Standard error and Standard deviations" are. As long as they know p-value helps, that's enough. So do many applied researchers who follow this criteria in their specialized fields.

In the past years there are always professionals claiming that p-values is not interpreted in the way that people should have done; they should look at bla bla bla... but the problem is that there is no better substitute to p-value. Yes you should not rely on a single number, but will two numbers be much better? or three? or four? There must be someone making a binary Yes/No decision.

So why do I think that p-value is not as important as it is today for business decisions as what I wrote in the title? My concern is that how accurate you have to be. Data and statistical analysis will never give you a 100% answer that what should be done; any statistical analysis has some kind of flaws (metrics of interest; data cleaning; modelling, etc.) -- nothing is perfect. Therefore, yes p-value is informative, but more like a directional thing. People make decisions anyway without data, and there is no guarantee that this scientific approach provides you the optimal path of business development. Many great experts have enough experience to put less weight on a single experiment outcome. This is a long run game, not one time.

Plus the opportunity cost -- experiment is not free lunch. You lose the chance to improve user's experience in the control group; you invest in tracking all the data and conduct the analysis; it takes time to design and run a proper test, etc. If it is a fast-developing environment, experiments could barely tell anything about the future. Their predictive power is very limited. If you are in a heavy competition, go fast is more important than waiting. Therefore instead of making adaptive decisions based on each single experiment, just go and try everything. The first decision you should make is how much you want to invest in testing and waiting; if the cost of failure product is not high, then just go for it.

I love these cartoons from the book "How Google Works". You have to be innovative in the Internet industry because you are creating a new world that no body could imagine before.

Screen Shot 2015-02-06 at 12.48.06 AM Screen Shot 2015-02-06 at 12.48.13 AM Screen Shot 2015-02-06 at 12.48.45 AM Screen Shot 2015-02-06 at 12.49.05 AM