.. default-domain:: cpp

.. cpp:namespace:: ceres

.. _chapter-numerical_derivatives:

===================
Numeric derivatives
===================

The other extreme from using analytic derivatives is to use numeric
derivatives. The key observation here is that the process of
differentiating a function :math:`f(x)` w.r.t :math:`x` can be written
as the limiting process:

.. math::
   Df(x) = \lim_{h \rightarrow 0} \frac{f(x + h) - f(x)}{h}


Forward Differences
===================

Now of course one cannot perform the limiting operation numerically on
a computer, so we do the next best thing, which is to choose a small
value of :math:`h` and approximate the derivative as

.. math::
   Df(x) \approx \frac{f(x + h) - f(x)}{h}


The above formula is the simplest, most basic form of numeric
differentiation. It is known as the *Forward Difference* formula.
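
As a concrete illustration (this snippet is not part of Ceres Solver;
the function and the step size are placeholders chosen for the
example), here is a minimal sketch of the Forward Difference formula
applied to a univariate function:

.. code-block:: c++

  #include <cmath>
  #include <cstdio>

  // Forward difference approximation of Df(x) for a univariate function f.
  template <typename Function>
  double ForwardDifference(const Function& f, double x, double h) {
    return (f(x + h) - f(x)) / h;
  }

  int main() {
    // Example function f(x) = exp(x); its exact derivative is also exp(x).
    auto f = [](double x) { return std::exp(x); };
    const double x = 1.0;
    const double h = 1e-6;  // A small, but not too small, step size.
    std::printf("approx = %.12f exact = %.12f\n",
                ForwardDifference(f, x, h), std::exp(x));
    return 0;
  }
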
So how would one go about constructing a numerically differentiated
version of ``Rat43Analytic`` (`Rat43
<http://www.itl.nist.gov/div898/strd/nls/data/ratkowsky3.shtml>`_) in
Ceres Solver? This is done in two steps:

1. Define a *Functor* that, given the parameter values, will evaluate
   the residual for a given :math:`(x,y)`.
2. Construct a :class:`CostFunction` by using
   :class:`NumericDiffCostFunction` to wrap an instance of
   ``Rat43CostFunctor``.

.. code-block:: c++

  struct Rat43CostFunctor {
    Rat43CostFunctor(const double x, const double y) : x_(x), y_(y) {}

    bool operator()(const double* parameters, double* residuals) const {
      const double b1 = parameters[0];
      const double b2 = parameters[1];
      const double b3 = parameters[2];
      const double b4 = parameters[3];
      // Residual of the Rat43 model y = b1 / (1 + exp(b2 - b3 x))^(1 / b4).
      residuals[0] = b1 * pow(1.0 + exp(b2 - b3 * x_), -1.0 / b4) - y_;
      return true;
    }

    const double x_;
    const double y_;
  };

  CostFunction* cost_function =
    new NumericDiffCostFunction<Rat43CostFunctor, FORWARD, 1, 4>(
      new Rat43CostFunctor(x, y));

This is about the minimum amount of work one can expect to do to
define the cost function. The only thing that the user needs to do is
to make sure that the evaluation of the residual is implemented
correctly and efficiently.
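
For context, here is a minimal sketch of how such a cost function
might be used to build a problem; the data arrays ``x_data`` and
``y_data``, the constant ``kNumObservations``, and the parameter block
``b`` are hypothetical names introduced for this example:

.. code-block:: c++

  double b[4] = {100.0, 10.0, 1.0, 1.0};  // Initial guess for b1..b4.
  Problem problem;
  for (int i = 0; i < kNumObservations; ++i) {
    CostFunction* cost_function =
        new NumericDiffCostFunction<Rat43CostFunctor, FORWARD, 1, 4>(
            new Rat43CostFunctor(x_data[i], y_data[i]));
    // One residual block per observation, all sharing the parameter block b.
    problem.AddResidualBlock(cost_function, nullptr, b);
  }
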

Before going further, it is instructive to get an estimate of the
error in the forward difference formula. We do this by considering the
`Taylor expansion <https://en.wikipedia.org/wiki/Taylor_series>`_ of
:math:`f` near :math:`x`.

.. math::
   \begin{align}
   f(x+h) &= f(x) + h Df(x) + \frac{h^2}{2!} D^2f(x) +
   \frac{h^3}{3!}D^3f(x) + \cdots \\
   Df(x) &= \frac{f(x + h) - f(x)}{h} - \left [\frac{h}{2!}D^2f(x) +
   \frac{h^2}{3!}D^3f(x) + \cdots \right]\\
   Df(x) &= \frac{f(x + h) - f(x)}{h} + O(h)
   \end{align}

i.e., the error in the forward difference formula is
:math:`O(h)` [#f4]_.

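As a quick sanity check of this error bound, consider the simple
example :math:`f(x) = x^2`, for which the forward difference error can
be computed exactly:

.. math::
   \frac{f(x + h) - f(x)}{h} = \frac{(x + h)^2 - x^2}{h} = 2x + h =
   Df(x) + h,

i.e., the error is exactly :math:`h`, in line with the :math:`O(h)`
estimate above.
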

Implementation Details
----------------------

:class:`NumericDiffCostFunction` implements a generic algorithm to
numerically differentiate a given functor. While the actual
implementation of :class:`NumericDiffCostFunction` is complicated, the
net result is a :class:`CostFunction` that roughly looks something
like the following:

.. code-block:: c++

  #include <algorithm>
  #include <cmath>
  #include <memory>

  class Rat43NumericDiffForward : public SizedCostFunction<1, 4> {
   public:
    Rat43NumericDiffForward(const Rat43CostFunctor* functor)
        : functor_(functor) {}
    virtual ~Rat43NumericDiffForward() {}
    virtual bool Evaluate(double const* const* parameters,
                          double* residuals,
                          double** jacobians) const {
      (*functor_)(parameters[0], residuals);
      if (!jacobians) return true;
      double* jacobian = jacobians[0];
      if (!jacobian) return true;

      const double f = residuals[0];
      double parameters_plus_h[4];
      for (int i = 0; i < 4; ++i) {
        // Perturb the i-th parameter by a relative step and use the
        // forward difference formula to estimate the i-th Jacobian entry.
        std::copy(parameters[0], parameters[0] + 4, parameters_plus_h);
        const double kRelativeStepSize = 1e-6;
        const double h = std::abs(parameters[0][i]) * kRelativeStepSize;
        parameters_plus_h[i] += h;
        double f_plus;
        (*functor_)(parameters_plus_h, &f_plus);
        jacobian[i] = (f_plus - f) / h;
      }
      return true;
    }

   private:
    std::unique_ptr<const Rat43CostFunctor> functor_;
  };


Note the choice of step size :math:`h` in the above code: instead of
an absolute step size which is the same for all parameters, we use a
relative step size of :math:`\text{kRelativeStepSize} = 10^{-6}`. This
gives better derivative estimates than an absolute step size [#f2]_
[#f3]_. This choice of step size only works for parameter values that
are not close to zero. So the actual implementation of
:class:`NumericDiffCostFunction` uses a more complex step size
selection logic, where, close to zero, it switches to a fixed step
size.
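
The precise rules live inside :class:`NumericDiffCostFunction`, but
the basic idea can be sketched as follows; ``kRelativeStepSize`` and
``kMinStepSize`` are illustrative values for this sketch, not the
constants or the exact logic used by Ceres Solver:

.. code-block:: c++

  #include <algorithm>
  #include <cmath>

  // Illustrative step size selection: relative to the parameter value,
  // with a fixed floor so that parameters at or near zero still get a
  // usable (non-zero) step.
  double StepSize(double parameter_value) {
    const double kRelativeStepSize = 1e-6;
    const double kMinStepSize = 1e-6;
    return std::max(std::abs(parameter_value) * kRelativeStepSize,
                    kMinStepSize);
  }
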

Central Differences
===================

The :math:`O(h)` error in the Forward Difference formula is okay but
not great. A better method is to use the *Central Difference* formula:

.. math::
   Df(x) \approx \frac{f(x + h) - f(x - h)}{2h}

Notice that if the value of :math:`f(x)` is known, the Forward
Difference formula only requires one extra evaluation, but the Central
Difference formula requires two evaluations, making it twice as
expensive. So is the extra evaluation worth it?

To answer this question, we again compute the error of approximation
in the central difference formula:

.. math::
   \begin{align}
   f(x + h) &= f(x) + h Df(x) + \frac{h^2}{2!}
   D^2f(x) + \frac{h^3}{3!} D^3f(x) + \frac{h^4}{4!} D^4f(x) + \cdots\\
   f(x - h) &= f(x) - h Df(x) + \frac{h^2}{2!}
   D^2f(x) - \frac{h^3}{3!} D^3f(x) + \frac{h^4}{4!} D^4f(x) +
   \cdots\\
   Df(x) & = \frac{f(x + h) - f(x - h)}{2h} + \frac{h^2}{3!}
   D^3f(x) + \frac{h^4}{5!}
   D^5f(x) + \cdots \\
   Df(x) & = \frac{f(x + h) - f(x - h)}{2h} + O(h^2)
   \end{align}

The error of the Central Difference formula is :math:`O(h^2)`, i.e.,
the error goes down quadratically, whereas the error in the Forward
Difference formula only goes down linearly.

Using central differences instead of forward differences in Ceres
Solver is a simple matter of changing a template argument to
:class:`NumericDiffCostFunction` as follows:

.. code-block:: c++

  CostFunction* cost_function =
    new NumericDiffCostFunction<Rat43CostFunctor, CENTRAL, 1, 4>(
      new Rat43CostFunctor(x, y));

But what do these differences in the error mean in practice? To see
this, consider the problem of evaluating the derivative of the
univariate function

.. math::
   f(x) = \frac{e^x}{\sin x - x^2},

at :math:`x = 1.0`.

It is easy to determine that :math:`Df(1.0) =
140.73773557129658`. Using this value as reference, we can now compute
the relative error in the forward and central difference formulae as a
function of the absolute step size and plot them.

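A minimal sketch of how such an error sweep can be computed (the
reference value is the one quoted above; plotting is assumed to be
done with a separate tool):

.. code-block:: c++

  #include <cmath>
  #include <cstdio>

  // f(x) = e^x / (sin(x) - x^2).
  double f(double x) { return std::exp(x) / (std::sin(x) - x * x); }

  int main() {
    const double x = 1.0;
    const double reference = 140.73773557129658;  // Df(1.0) from the text.
    // Sweep over absolute step sizes, printing the relative error of the
    // forward and central difference approximations for each step size.
    for (double h = 1e-2; h > 1e-15; h /= 2.0) {
      const double forward = (f(x + h) - f(x)) / h;
      const double central = (f(x + h) - f(x - h)) / (2.0 * h);
      std::printf("%e %e %e\n", h,
                  std::fabs((forward - reference) / reference),
                  std::fabs((central - reference) / reference));
    }
    return 0;
  }
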
.. figure:: forward_central_error.png
   :figwidth: 100%
   :align: center

Reading the graph from right to left, a number of things stand out:

1. The graphs for both formulae have two distinct regions. At first,
   starting from a large value of :math:`h`, the error goes down as
   the effect of truncating the Taylor series dominates, but as the
   value of :math:`h` continues to decrease, the error starts
   increasing again as roundoff error starts to dominate the
   computation. So we cannot just keep on reducing the value of
   :math:`h` to get better estimates of :math:`Df`. The fact that we
   are using finite precision arithmetic becomes a limiting factor.
2. The Forward Difference formula is not a great method for evaluating
   derivatives. The Central Difference formula converges much more
   quickly to a more accurate estimate of the derivative with
   decreasing step size. So unless the evaluation of :math:`f(x)` is
   so expensive that you absolutely cannot afford the extra
   evaluation required by central differences, **do not use the
   Forward Difference formula**.
3. Neither formula works well for a poorly chosen value of :math:`h`.


Ridders' Method
===============

So, can we get better estimates of :math:`Df` without requiring such
small values of :math:`h` that we start hitting floating point
roundoff errors?

One possible approach is to find a method whose error goes down faster
than :math:`O(h^2)`. This can be done by applying `Richardson
Extrapolation
<https://en.wikipedia.org/wiki/Richardson_extrapolation>`_ to the
problem of differentiation. This is also known as *Ridders' Method*
[Ridders]_.

Let us recall the error in the central differences formula:

.. math::
   \begin{align}
   Df(x) & = \frac{f(x + h) - f(x - h)}{2h} + \frac{h^2}{3!}
   D^3f(x) + \frac{h^4}{5!}
   D^5f(x) + \cdots\\
   & = \frac{f(x + h) - f(x - h)}{2h} + K_2 h^2 + K_4 h^4 + \cdots
   \end{align}

The key thing to note here is that the terms :math:`K_2, K_4, \dots`
are independent of :math:`h` and only depend on :math:`x`.

Let us now define:

.. math::

   A(1, m) = \frac{f(x + h/2^{m-1}) - f(x - h/2^{m-1})}{2h/2^{m-1}}.

Then observe that

.. math::

   Df(x) = A(1,1) + K_2 h^2 + K_4 h^4 + \cdots

and

.. math::

   Df(x) = A(1, 2) + K_2 (h/2)^2 + K_4 (h/2)^4 + \cdots

Here we have halved the step size to obtain a second central
differences estimate of :math:`Df(x)`. Combining these two estimates
(multiplying the second by four and subtracting the first eliminates
the :math:`K_2 h^2` term), we get:

.. math::

   Df(x) = \frac{4 A(1, 2) - A(1,1)}{4 - 1} + O(h^4)

which is an approximation of :math:`Df(x)` with truncation error that
goes down as :math:`O(h^4)`. But we do not have to stop here. We can
iterate this process to obtain even more accurate estimates as
follows:

.. math::

   A(n, m) = \begin{cases}
    \frac{\displaystyle f(x + h/2^{m-1}) - f(x -
    h/2^{m-1})}{\displaystyle 2h/2^{m-1}} & n = 1 \\
    \frac{\displaystyle 4^{n-1} A(n - 1, m + 1) - A(n - 1, m)}{\displaystyle 4^{n-1} - 1} & n > 1
   \end{cases}

It is straightforward to show that the approximation error in
:math:`A(n, 1)` is :math:`O(h^{2n})`. To see how the above formula can
be implemented in practice to compute :math:`A(n,1)` it is helpful to
structure the computation as the following tableau:

.. math::
   \begin{array}{ccccc}
   A(1,1) & A(1, 2) & A(1, 3) & A(1, 4) & \cdots\\
          & A(2, 1) & A(2, 2) & A(2, 3) & \cdots\\
          &         & A(3, 1) & A(3, 2) & \cdots\\
          &         &         & A(4, 1) & \cdots \\
          &         &         &         & \ddots
   \end{array}

So, to compute :math:`A(n, 1)` for increasing values of :math:`n` we
move from the left to the right, computing one column at a
time. Assuming that the primary cost here is the evaluation of the
function :math:`f(x)`, the cost of computing a new column of the above
tableau is two function evaluations: only the top entry :math:`A(1,
m)` requires evaluating the central difference formula, with step size
:math:`2^{1-m}h`; every other entry in the column is obtained from
previously computed values using the recursion above. A sketch of this
computation follows.

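Here is a minimal sketch of this column-by-column computation for a
univariate function. It is not the implementation used by
:class:`NumericDiffCostFunction`; in particular the number of columns
is fixed here, whereas the real method chooses it adaptively:

.. code-block:: c++

  #include <cmath>
  #include <cstdio>
  #include <vector>

  // f(x) = e^x / (sin(x) - x^2), the example function used in the text.
  double f(double x) { return std::exp(x) / (std::sin(x) - x * x); }

  int main() {
    const double x = 1.0;
    const double h = 0.01;      // Fairly large initial step size.
    const int kNumColumns = 5;  // Fixed depth; the real method adapts.
    // After processing column m, A[n - 1] holds A(n, m - n + 1).
    std::vector<double> A(kNumColumns, 0.0);
    for (int m = 1; m <= kNumColumns; ++m) {
      // Top of the new column: central differences with step h / 2^(m - 1).
      // This is the only place f is evaluated: two evaluations per column.
      const double step = h / std::pow(2.0, m - 1);
      double previous = A[0];  // A(1, m - 1) from the previous column.
      A[0] = (f(x + step) - f(x - step)) / (2.0 * step);
      // Fill the rest of the column with the extrapolation recursion
      // A(n, m) = (4^(n-1) A(n-1, m+1) - A(n-1, m)) / (4^(n-1) - 1).
      double factor = 4.0;
      for (int n = 2; n <= m; ++n) {
        const double current = (factor * A[n - 2] - previous) / (factor - 1.0);
        previous = A[n - 1];
        A[n - 1] = current;
        factor *= 4.0;
      }
      std::printf("A(%d, 1) = %.12f\n", m, A[m - 1]);
    }
    return 0;
  }
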
Applying this method to :math:`f(x) = \frac{e^x}{\sin x - x^2}`
starting with a fairly large step size :math:`h = 0.01`, we get:

.. math::
   \begin{array}{rrrrr}
   141.678097131 &140.971663667 &140.796145400 &140.752333523 &140.741384778\\
   &140.736185846 &140.737639311 &140.737729564 &140.737735196\\
   & &140.737736209 &140.737735581 &140.737735571\\
   & & &140.737735571 &140.737735571\\
   & & & &140.737735571\\
   \end{array}

Compared to the *correct* value :math:`Df(1.0) = 140.73773557129658`,
:math:`A(5, 1)` has a relative error of :math:`10^{-13}`. For
comparison, the relative error for the central difference formula with
the same step size (:math:`0.01/2^4 = 0.000625`) is :math:`10^{-5}`.

The above tableau is the basis of Ridders' method for numeric
differentiation. The full implementation is an adaptive scheme that
tracks its own estimation error and stops automatically when the
desired precision is reached. Of course it is more expensive than the
forward and central difference formulae, but is also significantly
more robust and accurate.

Using Ridders' method instead of forward or central differences in
Ceres is again a simple matter of changing a template argument to
:class:`NumericDiffCostFunction` as follows:

.. code-block:: c++

  CostFunction* cost_function =
    new NumericDiffCostFunction<Rat43CostFunctor, RIDDERS, 1, 4>(
      new Rat43CostFunctor(x, y));

The following graph shows the relative error of the three methods as a
function of the absolute step size. For Ridders' method we assume
that the step size used for evaluating :math:`A(n,1)` is
:math:`2^{1-n}h`.

.. figure:: forward_central_ridders_error.png
   :figwidth: 100%
   :align: center

Using the 10 function evaluations that are needed to compute
:math:`A(5,1)` we are able to approximate :math:`Df(1.0)` about 1000
times better than the best central differences estimate. To put these
numbers in perspective, machine epsilon for double precision
arithmetic is :math:`\approx 2.22 \times 10^{-16}`.

Going back to ``Rat43``, let us also look at the runtime cost of the
various methods for computing numeric derivatives.

==========================  =========
CostFunction                Time (ns)
==========================  =========
Rat43Analytic                     255
Rat43AnalyticOptimized             92
Rat43NumericDiffForward           262
Rat43NumericDiffCentral           517
Rat43NumericDiffRidders          3760
==========================  =========

As expected, Central Differences is about twice as expensive as
Forward Differences and the remarkable accuracy improvements of
Ridders' method cost an order of magnitude more runtime.

Recommendations
===============

Numeric differentiation should be used when you cannot compute the
derivatives either analytically or using automatic differentiation.
This is usually the case when you are calling an external library or
function whose analytic form you do not know, or, even if you do, you
are not in a position to rewrite it in the manner required to use
:ref:`chapter-automatic_derivatives`.


When using numeric differentiation, use at least Central Differences,
and if execution time is not a concern or the objective function is
such that determining a good static relative step size is hard,
Ridders' method is recommended.

.. rubric:: Footnotes

.. [#f2] `Numerical Differentiation
         <https://en.wikipedia.org/wiki/Numerical_differentiation#Practical_considerations_using_floating_point_arithmetic>`_
.. [#f3] [Press]_ Numerical Recipes, Section 5.7
.. [#f4] In asymptotic error analysis, an error of :math:`O(h^k)`
         means that the absolute value of the error is at most some
         constant times :math:`h^k` when :math:`h` is close enough to
         :math:`0`.