When considering a spread of points on a graph (for example, stock market results), it is sometimes useful to fit a curve to those points to indicate long-term trends or patterns. This kind of data is often very “noisy”; that is, it is affected by many random factors, or subtle systematic influences, which tend to scatter the points and obscure what we really want to know: is it going up or down, and how fast? What will it do in the future, and how soon? Does past behavior offer any clues to the future?
So we often “draw a line” through the points, eyeballing it so that it conveys the general trend of the data, to “smooth it out”. The human brain is actually quite good at this kind of analysis; we are programmed by evolution to detect subtle and anomalous patterns in nature, and sometimes our ability to pick them out is quite uncanny. But we also have mathematical procedures to generate these curves in order to eliminate subjectivity and bias as much as possible. Remember, I said “as much as possible”. We also have an uncanny ability to fool ourselves, to see patterns where none exist, and sometimes we have reasons to lie.
The simplest curve fit is the linear regression: a straight line drawn through the data points. It involves the fewest assumptions and is the easiest to generate. It often fails to properly model some of the really complex behavior we see in natural systems, but it is a good first approximation that gives us a rough idea of what is going on. There is a mathematical procedure, an algorithm, that, when fed all the x’s and y’s of the data points, comes up with the equation of a straight line (y = mx + b) that best fits those points.
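To make that concrete, here is a minimal sketch of the procedure in Python, using the standard closed-form least-squares formulas for the slope and intercept; the data points are invented purely for illustration.

```python
# Minimal sketch: fit y = m*x + b to a set of points by ordinary least squares.
# The x and y arrays are made-up, noisy, roughly linear data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# Closed-form least-squares solution for slope m and intercept b
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

print(f"best-fit line: y = {m:.3f}x + {b:.3f}")
```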
The term “best fit” is also rigorously quantifiable. The original data points will not all lie on the regression line; in fact, it’s unlikely any of them will. The vertical distance of each point from the line is called its “residual error”, and there is a mathematical procedure to make the residuals, taken together, as small as possible. It’s called minimizing the Root Mean Square error, or the RMS. You square all the residuals (to make them all positive numbers), add them up, divide by the number of points, and then take the square root. The resulting RMS error needs to be as small as possible. Soooooo, how do you make it smaller?
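Written out as code, that recipe looks something like this (again a sketch, with made-up points and a made-up candidate line rather than anyone’s real data):

```python
# Sketch of the RMS recipe: square the residuals, average them, take the square root.
import numpy as np

def rms_error(x, y, m, b):
    """Root-mean-square residual of the points about the line y = m*x + b."""
    residuals = y - (m * x + b)          # vertical offset of each point from the line
    return np.sqrt(np.mean(residuals ** 2))

# Illustrative points and an illustrative candidate line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 2.8, 4.1, 5.2, 5.8])
print(rms_error(x, y, m=0.95, b=1.1))
```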
You can find the points with the largest residuals, simply drop them from your analysis, recompute the regression, and come up with a new (hopefully smaller) RMS. This procedure is justifiable: in any collection of x,y points there will always be some that are outliers, points that have been inordinately affected by random or systematic error and can safely be left out of the calculation. As a rule, these points will be those with the largest residuals. But it also introduces a potential bias into the analysis; by removing the right points you can come up with any regression you like. Another factor to be considered is the limitations of the algorithm itself. If you fire a shotgun at a piece of graph paper, the regression will dutifully draw a line through the pellets and come up with the smallest possible RMS, although it will be huge. The resulting analysis may look convincing, but it is completely bogus. And of course, as in all statistical processes, the more data you have to start with, the more reliable the statistics.
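A sketch of that trim-and-refit step might look like the following; how many points to drop (the n_drop parameter here is my own invention) is a judgment call you have to make yourself, which is exactly where the bias can creep in.

```python
# Sketch: fit a line, drop the points with the largest residuals, refit.
import numpy as np

def fit_line(x, y):
    m, b = np.polyfit(x, y, 1)               # degree-1 least-squares fit
    return m, b

def trim_and_refit(x, y, n_drop=1):
    m, b = fit_line(x, y)
    residuals = np.abs(y - (m * x + b))
    keep = np.argsort(residuals)[:-n_drop]   # keep all but the n_drop worst points
    return fit_line(x[keep], y[keep])

# Made-up data with one deliberate outlier at x = 4
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 1.2, 1.9, 3.1, 9.0, 5.2])
print(trim_and_refit(x, y, n_drop=1))
```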
This regression procedure can also be extended from linear data points (like the stock market example) to higher dimensions. Regressions are often used to control ground surveys, aerial photography and astronomical imagery (to pick examples I am personally familiar with). In each case, you have a large number of points on the image or map, but a few have been measured or surveyed with ultraprecise methods and can be used to build a model that supplies a positional correction to the other points, calculated directly as a function of their location on the image. The regression is used to model distortions in optical systems or survey errors, and it gives you error bars which can be used to predict the precision of other measurements.
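As a rough illustration of the control-point idea (not the actual model used in any particular survey or imaging pipeline), one could fit a simple planar correction from a handful of precisely known points and then evaluate it anywhere on the image; the coordinates, offsets, and the planar model itself are all assumptions made up for this sketch.

```python
# Hedged sketch: fit a planar correction dx = a + b*x + c*y from a few
# control points, then predict the correction at any other image location.
import numpy as np

# Image coordinates of the control points and their measured x-offsets (invented values)
ctrl_xy = np.array([[10.0, 12.0], [85.0, 20.0], [40.0, 90.0], [95.0, 88.0]])
offset_x = np.array([0.12, 0.31, 0.05, 0.27])

# Least-squares fit of the plane's coefficients
A = np.column_stack([np.ones(len(ctrl_xy)), ctrl_xy[:, 0], ctrl_xy[:, 1]])
coeffs, *_ = np.linalg.lstsq(A, offset_x, rcond=None)

def predicted_correction(x, y):
    a, b, c = coeffs
    return a + b * x + c * y

print(predicted_correction(50.0, 50.0))   # correction at an arbitrary image point
```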
In some cases, you can just look at the data and surmise by visual inspection that a straight line cannot be fitted to it. The property you are graphing may not just be increasing or decreasing; the rate at which it is doing so may also be increasing or decreasing. It can also be oscillating, asymptotic, chaotic, or exhibiting all sorts of strange but nonetheless very real behavior. You need not be restricted to a linear regression: you can go to a polynomial fit, a power law, exponential or logarithmic curves, even trigonometric functions. You may be able to generate much smaller RMS values this way, but it’s hard not to suspect that there is a very arbitrary element to this. If you get to pick and choose which mathematical function best fits the data, you may very well be closing in on some underlying physical phenomenon which is driving the trend, but you might also be fudging the analysis to fit what you already believe.
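The pick-and-choose problem is easy to demonstrate with a quick sketch: on the same (invented) curved, noisy data, raising the polynomial degree keeps shrinking the RMS whether or not the higher-order terms mean anything physically.

```python
# Sketch: higher-degree polynomial fits almost always give smaller RMS,
# which by itself proves nothing about the underlying physics.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)
y = 0.5 * x**2 - 2.0 * x + rng.normal(0.0, 2.0, x.size)   # curved, noisy, invented data

for degree in (1, 2, 5):
    coeffs = np.polyfit(x, y, degree)
    rms = np.sqrt(np.mean((y - np.polyval(coeffs, x)) ** 2))
    print(f"degree {degree}: RMS = {rms:.3f}")
```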
There is a subjective element to science. It’s why we have peer review and consensus.