.. default-domain:: cpp

.. cpp:namespace:: ceres

.. _chapter-numerical_derivatives:

===================
Numeric derivatives
===================

The other extreme from using analytic derivatives is to use numeric
derivatives. The key observation here is that the process of
differentiating a function :math:`f(x)` w.r.t :math:`x` can be written
as the limiting process:

.. math::
   Df(x) = \lim_{h \rightarrow 0} \frac{f(x + h) - f(x)}{h}


Forward Differences
===================

Now of course one cannot perform the limiting operation numerically on
a computer, so we do the next best thing, which is to choose a small
value of :math:`h` and approximate the derivative as

.. math::
   Df(x) \approx \frac{f(x + h) - f(x)}{h}


The above formula is the simplest, most basic form of numeric
differentiation. It is known as the *Forward Difference* formula.
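
As a concrete illustration (this snippet is not part of Ceres Solver;
the function and the step size are placeholders chosen for the
example), here is a minimal sketch of the Forward Difference formula
applied to a univariate function:

.. code-block:: c++

  #include <cmath>
  #include <cstdio>

  // Forward difference approximation of Df(x) for a univariate function f.
  template <typename Function>
  double ForwardDifference(const Function& f, double x, double h) {
    return (f(x + h) - f(x)) / h;
  }

  int main() {
    // Example function f(x) = exp(x); its exact derivative is also exp(x).
    auto f = [](double x) { return std::exp(x); };
    const double x = 1.0;
    const double h = 1e-6;  // A small, but not too small, step size.
    std::printf("approx = %.12f exact = %.12f\n",
                ForwardDifference(f, x, h), std::exp(x));
    return 0;
  }
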
So how would one go about constructing a numerically differentiated
version of ``Rat43Analytic`` (`Rat43
<http://www.itl.nist.gov/div898/strd/nls/data/ratkowsky3.shtml>`_) in
Ceres Solver? This is done in two steps:

1. Define a *Functor* that, given the parameter values, will evaluate
   the residual for a given :math:`(x,y)`.
2. Construct a :class:`CostFunction` by using
   :class:`NumericDiffCostFunction` to wrap an instance of
   ``Rat43CostFunctor``.

.. code-block:: c++

  struct Rat43CostFunctor {
    Rat43CostFunctor(const double x, const double y) : x_(x), y_(y) {}

    bool operator()(const double* parameters, double* residuals) const {
      const double b1 = parameters[0];
      const double b2 = parameters[1];
      const double b3 = parameters[2];
      const double b4 = parameters[3];
      // Residual of the Rat43 model y = b1 / (1 + exp(b2 - b3 x))^(1 / b4).
      residuals[0] = b1 * pow(1.0 + exp(b2 - b3 * x_), -1.0 / b4) - y_;
      return true;
    }

    const double x_;
    const double y_;
  };

  CostFunction* cost_function =
    new NumericDiffCostFunction<Rat43CostFunctor, FORWARD, 1, 4>(
      new Rat43CostFunctor(x, y));

This is about the minimum amount of work one can expect to do to
define the cost function. The only thing that the user needs to do is
to make sure that the evaluation of the residual is implemented
correctly and efficiently.
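
For context, here is a minimal sketch of how such a cost function
might be used to build a problem; the data arrays ``x_data`` and
``y_data``, the constant ``kNumObservations``, and the parameter block
``b`` are hypothetical names introduced for this example:

.. code-block:: c++

  double b[4] = {100.0, 10.0, 1.0, 1.0};  // Initial guess for b1..b4.
  Problem problem;
  for (int i = 0; i < kNumObservations; ++i) {
    CostFunction* cost_function =
        new NumericDiffCostFunction<Rat43CostFunctor, FORWARD, 1, 4>(
            new Rat43CostFunctor(x_data[i], y_data[i]));
    // One residual block per observation, all sharing the parameter block b.
    problem.AddResidualBlock(cost_function, nullptr, b);
  }
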

Before going further, it is instructive to get an estimate of the
error in the forward difference formula. We do this by considering the
`Taylor expansion <https://en.wikipedia.org/wiki/Taylor_series>`_ of
:math:`f` near :math:`x`.

.. math::
   \begin{align}
   f(x+h) &= f(x) + h Df(x) + \frac{h^2}{2!} D^2f(x) +
   \frac{h^3}{3!}D^3f(x) + \cdots \\
   Df(x) &= \frac{f(x + h) - f(x)}{h} - \left [\frac{h}{2!}D^2f(x) +
   \frac{h^2}{3!}D^3f(x) + \cdots \right]\\
   Df(x) &= \frac{f(x + h) - f(x)}{h} + O(h)
   \end{align}

i.e., the error in the forward difference formula is
:math:`O(h)` [#f4]_.

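As a quick sanity check of this error bound, consider the simple
example :math:`f(x) = x^2`, for which the forward difference error can
be computed exactly:

.. math::
   \frac{f(x + h) - f(x)}{h} = \frac{(x + h)^2 - x^2}{h} = 2x + h =
   Df(x) + h,

i.e., the error is exactly :math:`h`, in line with the :math:`O(h)`
estimate above.
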

Implementation Details
----------------------

:class:`NumericDiffCostFunction` implements a generic algorithm to
numerically differentiate a given functor. While the actual
implementation of :class:`NumericDiffCostFunction` is complicated, the
net result is a :class:`CostFunction` that roughly looks something
like the following:

.. code-block:: c++

  #include <algorithm>
  #include <cmath>
  #include <memory>

  class Rat43NumericDiffForward : public SizedCostFunction<1, 4> {
   public:
    Rat43NumericDiffForward(const Rat43CostFunctor* functor)
        : functor_(functor) {}
    virtual ~Rat43NumericDiffForward() {}
    virtual bool Evaluate(double const* const* parameters,
                          double* residuals,
                          double** jacobians) const {
      (*functor_)(parameters[0], residuals);
      if (!jacobians) return true;
      double* jacobian = jacobians[0];
      if (!jacobian) return true;

      const double f = residuals[0];
      double parameters_plus_h[4];
      for (int i = 0; i < 4; ++i) {
        // Perturb the i-th parameter by a relative step and use the
        // forward difference formula to estimate the i-th Jacobian entry.
        std::copy(parameters[0], parameters[0] + 4, parameters_plus_h);
        const double kRelativeStepSize = 1e-6;
        const double h = std::abs(parameters[0][i]) * kRelativeStepSize;
        parameters_plus_h[i] += h;
        double f_plus;
        (*functor_)(parameters_plus_h, &f_plus);
        jacobian[i] = (f_plus - f) / h;
      }
      return true;
    }

   private:
    std::unique_ptr<const Rat43CostFunctor> functor_;
  };


Note the choice of step size :math:`h` in the above code: instead of
an absolute step size which is the same for all parameters, we use a
relative step size of :math:`\text{kRelativeStepSize} = 10^{-6}`. This
gives better derivative estimates than an absolute step size [#f2]_
[#f3]_. This choice of step size only works for parameter values that
are not close to zero. So the actual implementation of
:class:`NumericDiffCostFunction` uses a more complex step size
selection logic, where, close to zero, it switches to a fixed step
size.
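
The precise rules live inside :class:`NumericDiffCostFunction`, but
the basic idea can be sketched as follows; ``kRelativeStepSize`` and
``kMinStepSize`` are illustrative values for this sketch, not the
constants or the exact logic used by Ceres Solver:

.. code-block:: c++

  #include <algorithm>
  #include <cmath>

  // Illustrative step size selection: relative to the parameter value,
  // with a fixed floor so that parameters at or near zero still get a
  // usable (non-zero) step.
  double StepSize(double parameter_value) {
    const double kRelativeStepSize = 1e-6;
    const double kMinStepSize = 1e-6;
    return std::max(std::abs(parameter_value) * kRelativeStepSize,
                    kMinStepSize);
  }
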

Central Differences
===================

The :math:`O(h)` error in the Forward Difference formula is okay but
not great. A better method is to use the *Central Difference* formula:

.. math::
   Df(x) \approx \frac{f(x + h) - f(x - h)}{2h}

Notice that if the value of :math:`f(x)` is known, the Forward
Difference formula only requires one extra evaluation, but the Central
Difference formula requires two evaluations, making it twice as
expensive. So is the extra evaluation worth it?

To answer this question, we again compute the error of approximation
in the central difference formula:

.. math::
   \begin{align}
   f(x + h) &= f(x) + h Df(x) + \frac{h^2}{2!}
   D^2f(x) + \frac{h^3}{3!} D^3f(x) + \frac{h^4}{4!} D^4f(x) + \cdots\\
   f(x - h) &= f(x) - h Df(x) + \frac{h^2}{2!}
   D^2f(x) - \frac{h^3}{3!} D^3f(x) + \frac{h^4}{4!} D^4f(x) +
   \cdots\\
   Df(x) & = \frac{f(x + h) - f(x - h)}{2h} + \frac{h^2}{3!}
   D^3f(x) + \frac{h^4}{5!}
   D^5f(x) + \cdots \\
   Df(x) & = \frac{f(x + h) - f(x - h)}{2h} + O(h^2)
   \end{align}

The error of the Central Difference formula is :math:`O(h^2)`, i.e.,
the error goes down quadratically, whereas the error in the Forward
Difference formula only goes down linearly.

Using central differences instead of forward differences in Ceres
Solver is a simple matter of changing a template argument to
:class:`NumericDiffCostFunction` as follows:

.. code-block:: c++

  CostFunction* cost_function =
    new NumericDiffCostFunction<Rat43CostFunctor, CENTRAL, 1, 4>(
      new Rat43CostFunctor(x, y));

But what do these differences in the error mean in practice? To see
this, consider the problem of evaluating the derivative of the
univariate function

.. math::
   f(x) = \frac{e^x}{\sin x - x^2},

at :math:`x = 1.0`.

It is easy to determine that :math:`Df(1.0) =
140.73773557129658`. Using this value as reference, we can now compute
the relative error in the forward and central difference formulae as a
function of the absolute step size and plot them.

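A minimal sketch of how such an error sweep can be computed (the
reference value is the one quoted above; plotting is assumed to be
done with a separate tool):

.. code-block:: c++

  #include <cmath>
  #include <cstdio>

  // f(x) = e^x / (sin(x) - x^2).
  double f(double x) { return std::exp(x) / (std::sin(x) - x * x); }

  int main() {
    const double x = 1.0;
    const double reference = 140.73773557129658;  // Df(1.0) from the text.
    // Sweep over absolute step sizes, printing the relative error of the
    // forward and central difference approximations for each step size.
    for (double h = 1e-2; h > 1e-15; h /= 2.0) {
      const double forward = (f(x + h) - f(x)) / h;
      const double central = (f(x + h) - f(x - h)) / (2.0 * h);
      std::printf("%e %e %e\n", h,
                  std::fabs((forward - reference) / reference),
                  std::fabs((central - reference) / reference));
    }
    return 0;
  }
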
.. figure:: forward_central_error.png
   :figwidth: 100%
   :align: center

Reading the graph from right to left, a number of things stand out:

1. The graphs for both formulae have two distinct regions. At first,
   starting from a large value of :math:`h`, the error goes down as
   the effect of truncating the Taylor series dominates, but as the
   value of :math:`h` continues to decrease, the error starts
   increasing again as roundoff error starts to dominate the
   computation. So we cannot just keep on reducing the value of
   :math:`h` to get better estimates of :math:`Df`. The fact that we
   are using finite precision arithmetic becomes a limiting factor.
2. The Forward Difference formula is not a great method for evaluating
   derivatives. The Central Difference formula converges much more
   quickly to a more accurate estimate of the derivative with
   decreasing step size. So unless the evaluation of :math:`f(x)` is
   so expensive that you absolutely cannot afford the extra
   evaluation required by central differences, **do not use the
   Forward Difference formula**.
3. Neither formula works well for a poorly chosen value of :math:`h`.


Ridders' Method
===============

So, can we get better estimates of :math:`Df` without requiring such
small values of :math:`h` that we start hitting floating point
roundoff errors?

One possible approach is to find a method whose error goes down faster
than :math:`O(h^2)`. This can be done by applying `Richardson
Extrapolation
<https://en.wikipedia.org/wiki/Richardson_extrapolation>`_ to the
problem of differentiation. This is also known as *Ridders' Method*
[Ridders]_.

Let us recall the error in the central differences formula:

.. math::
   \begin{align}
   Df(x) & = \frac{f(x + h) - f(x - h)}{2h} + \frac{h^2}{3!}
   D^3f(x) + \frac{h^4}{5!}
   D^5f(x) + \cdots\\
   & = \frac{f(x + h) - f(x - h)}{2h} + K_2 h^2 + K_4 h^4 + \cdots
   \end{align}

The key thing to note here is that the terms :math:`K_2, K_4, \dots`
are independent of :math:`h` and only depend on :math:`x`.

Let us now define:

.. math::

   A(1, m) = \frac{f(x + h/2^{m-1}) - f(x - h/2^{m-1})}{2h/2^{m-1}}.

Then observe that

.. math::

   Df(x) = A(1,1) + K_2 h^2 + K_4 h^4 + \cdots

and

.. math::

   Df(x) = A(1, 2) + K_2 (h/2)^2 + K_4 (h/2)^4 + \cdots

Here we have halved the step size to obtain a second central
differences estimate of :math:`Df(x)`. Combining these two estimates
(multiplying the second by four and subtracting the first eliminates
the :math:`K_2 h^2` term), we get:

.. math::

   Df(x) = \frac{4 A(1, 2) - A(1,1)}{4 - 1} + O(h^4)

which is an approximation of :math:`Df(x)` with truncation error that
goes down as :math:`O(h^4)`. But we do not have to stop here. We can
iterate this process to obtain even more accurate estimates as
follows:

.. math::

   A(n, m) = \begin{cases}
    \frac{\displaystyle f(x + h/2^{m-1}) - f(x -
    h/2^{m-1})}{\displaystyle 2h/2^{m-1}} & n = 1 \\
    \frac{\displaystyle 4^{n-1} A(n - 1, m + 1) - A(n - 1, m)}{\displaystyle 4^{n-1} - 1} & n > 1
   \end{cases}

It is straightforward to show that the approximation error in
:math:`A(n, 1)` is :math:`O(h^{2n})`. To see how the above formula can
be implemented in practice to compute :math:`A(n,1)` it is helpful to
structure the computation as the following tableau:

.. math::
   \begin{array}{ccccc}
   A(1,1) & A(1, 2) & A(1, 3) & A(1, 4) & \cdots\\
          & A(2, 1) & A(2, 2) & A(2, 3) & \cdots\\
          &         & A(3, 1) & A(3, 2) & \cdots\\
          &         &         & A(4, 1) & \cdots \\
          &         &         &         & \ddots
   \end{array}

So, to compute :math:`A(n, 1)` for increasing values of :math:`n` we
move from the left to the right, computing one column at a
time. Assuming that the primary cost here is the evaluation of the
function :math:`f(x)`, the cost of computing a new column of the above
tableau is two function evaluations: only the top entry :math:`A(1,
m)` requires evaluating the central difference formula, with step size
:math:`2^{1-m}h`; every other entry in the column is obtained from
previously computed values using the recursion above. A sketch of this
computation follows.

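Here is a minimal sketch of this column-by-column computation for a
univariate function. It is not the implementation used by
:class:`NumericDiffCostFunction`; in particular the number of columns
is fixed here, whereas the real method chooses it adaptively:

.. code-block:: c++

  #include <cmath>
  #include <cstdio>
  #include <vector>

  // f(x) = e^x / (sin(x) - x^2), the example function used in the text.
  double f(double x) { return std::exp(x) / (std::sin(x) - x * x); }

  int main() {
    const double x = 1.0;
    const double h = 0.01;      // Fairly large initial step size.
    const int kNumColumns = 5;  // Fixed depth; the real method adapts.
    // After processing column m, A[n - 1] holds A(n, m - n + 1).
    std::vector<double> A(kNumColumns, 0.0);
    for (int m = 1; m <= kNumColumns; ++m) {
      // Top of the new column: central differences with step h / 2^(m - 1).
      // This is the only place f is evaluated: two evaluations per column.
      const double step = h / std::pow(2.0, m - 1);
      double previous = A[0];  // A(1, m - 1) from the previous column.
      A[0] = (f(x + step) - f(x - step)) / (2.0 * step);
      // Fill the rest of the column with the extrapolation recursion
      // A(n, m) = (4^(n-1) A(n-1, m+1) - A(n-1, m)) / (4^(n-1) - 1).
      double factor = 4.0;
      for (int n = 2; n <= m; ++n) {
        const double current = (factor * A[n - 2] - previous) / (factor - 1.0);
        previous = A[n - 1];
        A[n - 1] = current;
        factor *= 4.0;
      }
      std::printf("A(%d, 1) = %.12f\n", m, A[m - 1]);
    }
    return 0;
  }
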
Applying this method to :math:`f(x) = \frac{e^x}{\sin x - x^2}`
starting with a fairly large step size :math:`h = 0.01`, we get:

.. math::
   \begin{array}{rrrrr}
   141.678097131 &140.971663667 &140.796145400 &140.752333523 &140.741384778\\
   &140.736185846 &140.737639311 &140.737729564 &140.737735196\\
   & &140.737736209 &140.737735581 &140.737735571\\
   & & &140.737735571 &140.737735571\\
   & & & &140.737735571\\
   \end{array}

Compared to the *correct* value :math:`Df(1.0) = 140.73773557129658`,
:math:`A(5, 1)` has a relative error of :math:`10^{-13}`. For
comparison, the relative error for the central difference formula with
the same step size (:math:`0.01/2^4 = 0.000625`) is :math:`10^{-5}`.

The above tableau is the basis of Ridders' method for numeric
differentiation. The full implementation is an adaptive scheme that
tracks its own estimation error and stops automatically when the
desired precision is reached. Of course it is more expensive than the
forward and central difference formulae, but is also significantly
more robust and accurate.

Using Ridders' method instead of forward or central differences in
Ceres is again a simple matter of changing a template argument to
:class:`NumericDiffCostFunction` as follows:

.. code-block:: c++

  CostFunction* cost_function =
    new NumericDiffCostFunction<Rat43CostFunctor, RIDDERS, 1, 4>(
      new Rat43CostFunctor(x, y));

The following graph shows the relative error of the three methods as a
function of the absolute step size. For Ridders' method we assume
that the step size used for evaluating :math:`A(n,1)` is
:math:`2^{1-n}h`.

.. figure:: forward_central_ridders_error.png
   :figwidth: 100%
   :align: center

Using the 10 function evaluations that are needed to compute
:math:`A(5,1)` we are able to approximate :math:`Df(1.0)` about 1000
times better than the best central differences estimate. To put these
numbers in perspective, machine epsilon for double precision
arithmetic is :math:`\approx 2.22 \times 10^{-16}`.

Going back to ``Rat43``, let us also look at the runtime cost of the
various methods for computing numeric derivatives.

==========================  =========
CostFunction                Time (ns)
==========================  =========
Rat43Analytic                     255
Rat43AnalyticOptimized             92
Rat43NumericDiffForward           262
Rat43NumericDiffCentral           517
Rat43NumericDiffRidders          3760
==========================  =========

As expected, Central Differences is about twice as expensive as
Forward Differences and the remarkable accuracy improvements of
Ridders' method cost an order of magnitude more runtime.

Recommendations
===============

Numeric differentiation should be used when you cannot compute the
derivatives either analytically or using automatic differentiation.
This is usually the case when you are calling an external library or
function whose analytic form you do not know, or, even if you do, you
are not in a position to rewrite it in the manner required to use
:ref:`chapter-automatic_derivatives`.


When using numeric differentiation, use at least Central Differences,
and if execution time is not a concern or the objective function is
such that determining a good static relative step size is hard,
Ridders' method is recommended.

.. rubric:: Footnotes

.. [#f2] `Numerical Differentiation
         <https://en.wikipedia.org/wiki/Numerical_differentiation#Practical_considerations_using_floating_point_arithmetic>`_
.. [#f3] [Press]_ Numerical Recipes, Section 5.7
.. [#f4] In asymptotic error analysis, an error of :math:`O(h^k)`
         means that the absolute value of the error is at most some
         constant times :math:`h^k` when :math:`h` is close enough to
         :math:`0`.