Saturday 24 October 2009

LOG-R using numpy (and why it's worth a code jam)

What

"In statistics, logistic regression (sometimes called the logistic model or logit model) is used for prediction of the probability of occurrence of an event by fitting data to a logistic curve". This is a pretty useless definition (from wikipedia) of a pretty useful concept. Logistic models are used extensively to model growth. Of what, you may well ask. Answer: anything. Business, market share, fish in a pond, if you have a growth process, and ESPECIALLY if the rates of growth are not constant through time (or space for that matter) logistic regression can potentially be applied to model the process.

How

Jeffrey Whitaker wrote a module for logistic regression (back in 2006) that uses numpy. You might think the implementation is quite complex, but at its heart is just the ubiquitous Newton-Raphson method, the same iteration that is commonly used to find square roots.
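To make the square-root comparison concrete, here is a minimal sketch of both uses of Newton-Raphson side by side. It is not Whitaker's code: the function names (newton_sqrt, sigmoid, fit_logistic) and the toy data are made up for illustration. The idea is the same in both cases: take the current guess, divide the "error" by the slope, and step. For the regression, the error is the gradient of the log-likelihood and the slope is its matrix of second derivatives.

```python
import numpy as np

def newton_sqrt(a, x=1.0, n_iter=20):
    """Solve x**2 - a = 0 by Newton-Raphson: x <- x - f(x) / f'(x)."""
    for _ in range(n_iter):
        x -= (x * x - a) / (2.0 * x)
    return x

def sigmoid(z):
    """The logistic curve, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(x, y, n_iter=25, tol=1e-8):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by Newton-Raphson.

    Returns the coefficients with the intercept first.
    """
    X = np.column_stack([np.ones(len(x)), x])   # add intercept column
    beta = np.zeros(X.shape[1])                 # start from all zeros
    for _ in range(n_iter):
        p = sigmoid(X @ beta)                   # current fitted probabilities
        gradient = X.T @ (y - p)                # slope of the log-likelihood
        weights = p * (1.0 - p)
        hessian = X.T @ (X * weights[:, None])  # negative second derivative
        step = np.linalg.solve(hessian, gradient)
        beta += step                            # the Newton-Raphson update
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Made-up toy data: an input value and a yes/no outcome for each case.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])

beta = fit_logistic(x, y)
print("sqrt(2) by Newton-Raphson:", newton_sqrt(2.0))
print("intercept, slope:", beta)
```

The toy outcomes overlap on purpose (a "yes" at 2.0, a "no" at 3.5). If the two groups could be split perfectly, the coefficients would grow without bound and the iteration would not settle.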

Why

Why produce logistic regressions using this convoluted programming method? Is there not software that makes this easier? The answer is yes and no. One can analyse data in OpenOffice and fit regression lines. However, some data sets that have "logistic shapes", such as learning curves, cannot be modelled correctly in OpenOffice. The regressions it supports are linear, logarithmic (a function of ln x), exponential (a function of e^x) and power (y = b*x^a), none of which will model logistic data accurately.
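A quick way to see the problem is to fit a straight line to an S-shaped curve and compare it with a logistic fit. The sketch below uses synthetic, noise-free data and assumes the ceiling K of the curve is already known, which is rarely true in practice. With K fixed, the transform ln(y / (K - y)) turns the logistic curve into a straight line in x, so plain np.polyfit is enough. The variable names and the numbers are invented for illustration.

```python
import numpy as np

# Synthetic "learning curve": a logistic shape with an assumed ceiling K.
K = 100.0
x = np.linspace(0.0, 10.0, 21)
y = K / (1.0 + np.exp(-(x - 5.0)))

# Ordinary straight-line fit, the kind a spreadsheet trend line gives.
slope, intercept = np.polyfit(x, y, 1)
linear_pred = slope * x + intercept

# Logistic fit via the logit transform: ln(y / (K - y)) = a*x + b.
a, b = np.polyfit(x, np.log(y / (K - y)), 1)
logistic_pred = K / (1.0 + np.exp(-(a * x + b)))

def rmse(pred):
    """Root-mean-square error of a prediction against y."""
    return np.sqrt(np.mean((y - pred) ** 2))

print("linear   RMSE:", rmse(linear_pred))
print("logistic RMSE:", rmse(logistic_pred))
```

Because the data here are generated from an exact logistic curve, the logistic fit recovers it almost perfectly. The straight line overshoots at both flat ends and undershoots the steep middle. Real data would be noisy, and K would have to be estimated too.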