""" quantile_regression_demo.py OVERVIEW Most of us are familiar with the charts that pediatricians use that show percentiles of weight and height as a function of age; generating such a chart from a small sample of data requires quantile regression or similar methods. (When working with a large enough sample of data, one can bin the data, i.e., divide the x-axis into intervals and calculate percentiles independently for each interval. But, this approach uses the data inefficiently and is unworkable when sample sizes are small). Quantiles and percentiles are the same except for a factor of 100, e.g., the 30th percentile is the 0.3 quantile. This Python script demonstrates that one can perform quantile regression using only Python, NumPy, and SciPy. The only other dependency is on matplotlib, which is used to plot the data and the quantile estimates. In detail, the script does the following: (1) Model parameters are assigned. (Currently, these are hardwired into the code). (2) The program generates an artificial bivariate sample of data (x, y) as follows: - x is generated by drawing from a distribution that is uniform on [x_min, x_max], where x_min and x_max are currently 0 and 1, respectively. - y is then generated according to a normal distribution having mean -0.5 + x and standard deviation 1.0 + 0.5 * x. (All of this can be changed, e.g., one could choose to make the mean of y quadratic in x). (3) The code defines an objective function based on the tilted absolute value function (see references for motivation). (4) The SciPy optimization package is then used to optimize (minimize) the objective function. (5) Using the matplotlib module, the code plots a scatter diagram of the data with an overlay of percentile lines. NOTES My initial approach to this problem was to use the rpy2 interface to the R statistical environment to invoke the R quantreg package, but the rpy2 module is severely buggy, and I was unable to make this work. REFERENCES 1. Journal article, 'Quantile Regression', http://www.econ.uiuc.edu/~roger/research/rq/QRJEP.pdf 2. Powerpoint slides, 'Robust Statistics and Quantile Regression', http://www.savbb.sk/~grendar/pdf/robust_quantreg.pdf 3. For an introduction to the Rpy2 interface between Python and R: http://rpy.sourceforge.net/rpy2/doc-2.1/html/introduction.html 4. Defining R formulas and the associated data is tricky. For discussion and examples, see the following: http://rpy.sourceforge.net/rpy2/doc-2.3/html/robjects_formulae.html AUTHOR Dr. Phillip M. Feldman """ # Section 1: Import from modules. from matplotlib import pyplot import numpy, pdb from numpy import array, pi from numpy.polynomial.polynomial import polyval from scipy.optimize import fmin # Section 2: Define tilted absolute value function. def tilted_abs(rho, x): """ OVERVIEW The tilted absolute value function is used in quantile regression. INPUTS rho: This parameter is a probability, and thus takes values between 0 and 1. x: This parameter represents a value of the independent variable, and in general takes any real value (float) or NumPy array of floats. """ return x * (rho - (x < 0)) # Section 3: Get user input parameters. Currently, these are hardwired into the # code. # `N` is the number of random data points to be generated: N= 3000 if N <= 20: symbol_size= 100 elif N <= 200: symbol_size= 50 else: symbol_size= 25 x_min= 0.0; x_max= 1.0 # Define mean and standard deviation of y as functions of x using lambda # notation: y_mean= lambda x: -0.5 + x y_std = lambda x: 1.0 + 0.5 * x # Define number of polynomial coefficients. (The degree of the model is one # more than the number of coefficients). Specify 2 for a linear model, 3 for a # quadratic model, and so on. N_coefficients= 3 fractions= [0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9] # Define seed (list of integers) for random number generator: seed= [1, 2, 4] # Section 4: Generate random datasets x and y. # Create a random number stream: RNG= numpy.random.RandomState(seed) x= RNG.uniform(low=x_min, high=x_max, size=N) y= y_mean(x) + y_std(x)*RNG.normal(loc=0.0, scale=1.0, size=N) y_min= y.min(); y_max= y.max() # Find indices that would sort x values into ascending order: indices= numpy.argsort(x) # Apply this sort order to both x and y: x= x[indices] y= y[indices] # Section 5 (deactivated): Use the rpy2 interface to the R statistical # environment to invoke the quantreg package to estimate quantiles of y as a # function of x. # # The R command-prompt syntax that one would use is as follows: # # library(quantreg) # # formula= ??? # # fit1 <- rq(formula, tau=fractions) # R.initr() # R.library(quantreg) # formula= R_objects.Formula('y ~ x') # env= formula.environment # env['x']= x # env['y']= y # fit1= R_objects.rq(formula, tau=fractions) # Section 6: Estimate quantiles via direct optimization. def model(x, beta): """ This example defines the model as a polynomial, where the coefficients of the polynomial are passed via `beta`. """ return polyval(x, beta) def objective(beta, rho): """ The objective function to be minimized is the sum of the tilted absolute values of the differences between the observations and the model. """ return tilted_abs(rho, y - model(x, beta)).sum() # Define starting point for optimization: beta_0= numpy.zeros(N_coefficients) if N_coefficients >= 2: beta_0[1]= 1.0 # `beta_hat[i]` will store the parameter estimates for the quantile # corresponding to `fractions[i]`: beta_hat= [] for i, fraction in enumerate(fractions): beta_hat.append( fmin(objective, x0=beta_0, args=(fraction,), xtol=1e-8, disp=True, maxiter=3000) ) # Section 7: Plot the data with overlays of estimated quantiles. # Create figure window: fig= pyplot.figure(figsize=[9,7], dpi=120, facecolor=[1,1,1]) # Plot (x,y) pairs on a scatter diagram. The argument `s` specifies the symbol # area in units of points squared. p1= pyplot.scatter(x, y, s=symbol_size) pyplot.xlim([x_min, x_max]) pyplot.ylim([y_min, y_max]) pyplot.title("Scatter Diagram with Quantiles Corresponding to\n" "the Fractions %s\n" % str(fractions)[1:-1], size=18) pyplot.xlabel("x", size=18) pyplot.ylabel("y", size=18) # Enable 'hold' so that lines will be plotted as overlays on scatter diagram # rather than in separate figure windows: pyplot.hold(True) # Draw a line for each quantile: for i, fraction in enumerate(fractions): pyplot.plot(x, polyval(x, beta_hat[i]), linewidth=2) pyplot.grid(True) # Nothing is displayed until we invoke `show`: pyplot.show()