
PRECONCEPTION AND CROSS VALIDATION

First, we look at the data $\bold d$. Then we think about a model $\bold m$, and an operator $\bold L$ to link the model and the data. Sometimes, the operator is merely the first term in a series expansion about $(\bold m_0,\bold d_0)$. Then, we fit $\bold d-\bold d_0 \approx \bold L ( \bold m-\bold m_0)$. To fit the model, we must reduce the fitting residuals. Realizing that the importance of a data residual is not always simply its size, but some function of it, we conjure up (a topic for later chapters) a weighting operator $\bold W$ (which could be a filter). With $\bold W$ we define our data residual:

\begin{displaymath}
\bold r_d \eq \bold W [ \bold L ( \bold m-\bold m_0) - ( \bold d-\bold d_0) ]
\end{displaymath} (19)
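To make the bookkeeping concrete, here is a minimal NumPy sketch of equation (19); the small random $\bold L$, the diagonal $\bold W$, and the zero expansion point are hypothetical stand-ins, not anything from this book's codes:

\begin{verbatim}
import numpy as np

# Hypothetical toy sizes: 4 data values, 3 model parameters.
rng = np.random.default_rng(0)
L  = rng.standard_normal((4, 3))       # linearized operator
W  = np.diag([1.0, 1.0, 0.5, 2.0])     # data weighting (a simple diagonal here)
m0, d0 = np.zeros(3), np.zeros(4)      # expansion point (m0, d0)
m  = rng.standard_normal(3)            # a trial model
d  = L @ m + 0.1 * rng.standard_normal(4)   # "observed" data

# Equation (19): the weighted data residual
r_d = W @ (L @ (m - m0) - (d - d0))
\end{verbatim}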

Next, we realize that the data might not be adequate to determine the model, perhaps because our comfortable dense sampling of the model ill fits our economical sparse sampling of data. Thus, we adopt a fitting goal that mathematicians call ``regularization,'' and that we might call a ``model styling'' goal or, more simply, a quantification of our preconception of the best model. We quantify our goals by choosing an operator $\bold A$, often simply a roughener like a gradient (the choice again a topic in this and later chapters). It defines our model residual as $\bold A \bold m$ or $\bold A ( \bold m-\bold m_0)$; say we choose:

\begin{displaymath}
\bold r_m \eq \bold A \bold m
\end{displaymath} (20)
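As one concrete (and assumed, not prescribed) choice of $\bold A$, the sketch below builds a first-difference roughener, the discrete analog of a gradient, and applies it to a toy model vector:

\begin{verbatim}
import numpy as np

def gradient_roughener(n):
    """First-difference matrix A of shape (n-1, n): (A m)[i] = m[i+1] - m[i]."""
    A = np.zeros((n - 1, n))
    for i in range(n - 1):
        A[i, i], A[i, i + 1] = -1.0, 1.0
    return A

m   = np.array([0.0, 1.0, 4.0, 9.0])   # a hypothetical model vector
A   = gradient_roughener(len(m))
r_m = A @ m                            # equation (20): the model residual
\end{verbatim}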

In an ideal world, our model preconception (prejudice?) would not conflict with measured data, but real life is much more interesting than that. The reason we pay for data acquisition is that conflicts between data and preconceived notions invariably arise. We need an adjustable parameter that measures our ``bullheadedness,'' how much we intend to stick to our preconceived notions in spite of contradictory data. This parameter is generally called epsilon $\epsilon$, because we like to imagine that our bullheadedness is small. (In mathematics, $\epsilon$ is often taken to be an infinitesimally small quantity.) Although any bullheadedness seems like a bad thing, it must be admitted that measurements are imperfect too. Thus, as a practical matter, we often find ourselves minimizing:

\begin{displaymath}
\min_{\bold m} \quad
\bold r_d \cdot \bold r_d  + \epsilon^2 \, \bold r_m \cdot \bold r_m
\end{displaymath} (21)

and wondering what to choose for $\epsilon$. I have two suggestions: My simplest suggestion is to choose $\epsilon$ so that the data-fitting term in equation (21) matches the model-styling term. Thus:
\begin{displaymath}
\epsilon \eq \sqrt{\frac{ \bold r_d \cdot \bold r_d }{ \bold r_m \cdot \bold r_m }}
\end{displaymath} (22)
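Gathered into code, equations (21) and (22) are only a few lines; the sketch below (not from the book's codes) evaluates the objective and then picks the $\epsilon$ that makes its two terms equal:

\begin{verbatim}
import numpy as np

def objective(r_d, r_m, eps):
    """Equation (21): data misfit plus eps^2 times the model-styling penalty."""
    return float(r_d @ r_d + eps**2 * (r_m @ r_m))

def eps_from_residuals(r_d, r_m):
    """Equation (22): the epsilon that equates the two terms of equation (21)."""
    return np.sqrt((r_d @ r_d) / (r_m @ r_m))
\end{verbatim}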

My second suggestion is to think of the force on our final solution. In physics, force is associated with a gradient. We have a gradient for the data fitting and another for the model styling:
\begin{displaymath}
\bold g_d \eq \bold L\T \bold W\T \bold r_d
\end{displaymath} (23)

\begin{displaymath}
\bold g_m \eq \bold A\T \bold r_m
\end{displaymath} (24)

We could balance these forces by the choice:
\begin{displaymath}
\epsilon \eq \sqrt{\frac{ \bold g_d \cdot \bold g_d }{ \bold g_m \cdot \bold g_m }}
\end{displaymath} (25)
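A sketch of this second suggestion follows: compute the two gradients of equations (23) and (24) and balance them as in equation (25). The operators $\bold L$, $\bold W$, and $\bold A$ here are placeholders for whatever a given application uses:

\begin{verbatim}
import numpy as np

def eps_from_gradients(L, W, A, r_d, r_m):
    """Equations (23)-(25): balance the data-fitting and model-styling forces."""
    g_d = L.T @ W.T @ r_d                        # equation (23)
    g_m = A.T @ r_m                              # equation (24)
    return np.sqrt((g_d @ g_d) / (g_m @ g_m))    # equation (25)
\end{verbatim}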

Although we often ignore $\epsilon$ when discussing the formulation of an application, when the time comes to solve the problem, reality intercedes. Generally, $\bold r_d$ has different physical units than $\bold r_m$ (likewise $\bold g_d$ and $\bold g_m$), and we cannot allow our solution to depend on the accidental choice of units in which we express the problem. I have had much experience choosing $\epsilon$, but only recently have I boiled it down to the suggestions of equations (22) and (25). Normally, I also try other values, such as double or half the previous choice, and I examine the solutions for subjective appearance. The epsilon story continues in Chapter [*].

Computationally, we could choose a new $\epsilon$ with each iteration, but it is more expeditious to freeze $\epsilon$, solve the problem, recompute $\epsilon$, and solve the problem again. I have never seen a case in which more than one repetition was necessary.
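That recipe might be coded as in the following sketch, which assumes a hypothetical solver routine solve(L, W, A, d, eps) (for instance, a stacked least-squares solver) together with the eps_from_residuals helper above:

\begin{verbatim}
def solve_twice(L, W, A, d, eps0=1.0):
    """Freeze eps, solve, recompute eps from the residuals, solve once more."""
    m = solve(L, W, A, d, eps0)             # hypothetical regularized solver
    r_d, r_m = W @ (L @ m - d), A @ m       # residuals at the first solution
    eps1 = eps_from_residuals(r_d, r_m)     # equation (22)
    return solve(L, W, A, d, eps1), eps1
\end{verbatim}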

People who work with small applications (fewer than about $10^3$ vector components) have access to an attractive theoretical approach called ``cross-validation.'' Simply speaking, we could solve the problem many times, each time omitting a different data value. Each solution would provide a model that could be used to predict the omitted data value. The quality of these predictions is a function of $\epsilon$, which provides a guide to choosing it. My objections to cross-validation are twofold: First, I do not know how to apply it to the large applications we solve in this book (I should think more about it); and second, people who worry much about $\epsilon$ should perhaps first think more carefully about their choice of the filters $\bold W$ and $\bold A$, which is the focus of this book. Notice that both $\bold W$ and $\bold A$ can be defined with a scaling factor like $\epsilon$. Often more important in practice, with $\bold W$ and $\bold A$ we have a scaling factor that need not be constant but can be a function of space or spatial frequency within the data space and/or model space.
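For a small dense problem, a leave-one-out scan over candidate $\epsilon$ values might look like the sketch below; it solves the stacked regularized least-squares system with NumPy and is an illustration only, not one of this book's programs:

\begin{verbatim}
import numpy as np

def loo_score(L, d, A, eps):
    """Leave-one-out prediction error of the regularized least-squares model."""
    err = 0.0
    for k in range(len(d)):
        keep = np.arange(len(d)) != k
        # min |L_keep m - d_keep|^2 + eps^2 |A m|^2, solved by stacking:
        G   = np.vstack([L[keep], eps * A])
        rhs = np.concatenate([d[keep], np.zeros(A.shape[0])])
        m, *_ = np.linalg.lstsq(G, rhs, rcond=None)
        err += (L[k] @ m - d[k])**2             # predict the omitted value
    return err

# Pick the eps with the smallest prediction error over a small grid, e.g.:
#   best = min([0.01, 0.1, 1.0, 10.0], key=lambda e: loo_score(L, d, A, e))
\end{verbatim}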

EXERCISES:

  1. Figures 1-4 seem to extrapolate to vanishing signals at the side boundaries. Why is that so, and what could be done to leave the sides unconstrained in that way?
  2. Show that the interpolation curve in Figure 2 is not parabolic as it appears, but cubic. (HINT: First show that $(\nabla^2)\T \nabla^2 u = \bold 0$.)
  3. Verify by a program example that the number of iterations required with simple constraints is the number of free parameters.
  4. A signal on a uniform mesh has missing values. How should we estimate the mean?
  5. It is desired to find a compromise between the Laplacian roughener and the gradient roughener. What is the size of the residual space?
  6. Like the seismic prospecting industry, you have solved a huge problem using binning. You have computer power left over to do a few iterations with linear interpolation. How much does the cost per iteration increase? Should you refine your model mesh, or can you use the same model mesh that you used when binning?
  7. Nuclear energy, having finally reached its potential, has dried up the prospecting industries, so you find yourself doing medical imaging (or earthquake seismology). You probe the human body from all sides on a dense, regular mesh in cylindrical coordinates. Unfortunately, you need to represent your data in Fourier space. There is no such thing as a fast Fourier transform in cylindrical coordinates, while slow Fourier transforms are pitifully slow. Your only hope of keeping up with your competitors is to somehow do your FTs in Cartesian coordinates. Write down the sequence of steps to achieve your goals using the methods of this chapter.

