<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <title>GMP Itemized Development Tasks</title>
  <link rel="shortcut icon" href="favicon.ico">
  <link rel="stylesheet" href="gmp.css">
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>

<center>
  <h1>
    GMP Itemized Development Tasks
  </h1>
</center>

<font size=-1>
<pre>
Copyright 2000-2004, 2006, 2008, 2009 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.
</pre>
</font>

<hr>
<!-- NB. timestamp updated automatically by emacs -->
  This file is current as of 29 Jan 2014.  An up-to-date version is available at
  <a href="https://gmplib.org/tasks.html">https://gmplib.org/tasks.html</a>.
  Please send comments about this page to gmp-devel<font>@</font>gmplib.org.

<p> These are itemized GMP development tasks.  Not all the tasks
    listed here are suitable for volunteers, but many of them are.
    Please see the <a href="projects.html">projects file</a> for more
    sizeable projects.

<p> CAUTION: This file needs updating.  Many of the tasks here have
either already been taken care of, or have become irrelevant.

<h4>Correctness and Completeness</h4>
<ul>
<li> <code>_LONG_LONG_LIMB</code> in gmp.h is not namespace clean.  Reported
     by Patrick Pelissier.
     <br>
     We sort of mentioned <code>_LONG_LONG_LIMB</code> in past releases, so
     need to be careful about changing it.  It used to be a define
     applications had to set for long long limb systems, but that in
     particular is no longer relevant now that it's established automatically.
<li> The various reuse.c tests need to force reallocation by calling
     <code>_mpz_realloc</code> with a small (1 limb) size.
<li> One reuse case is missing from mpX/tests/reuse.c:
     <code>mpz_XXX(a,a,a)</code>.
<li> Make the string reading functions allow the `0x' prefix when the base is
     explicitly 16.  They currently only allow that prefix when the base is
     unspecified (zero).
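     A minimal sketch of the suggested prefix handling, as standalone C (the
     function name is invented for illustration; it is not GMP's actual
     parser):

```c
#include <string.h>

/* Sketch only: accept an optional "0x"/"0X" prefix when the base is
   explicitly 16, as well as when the base is unspecified (0).  Returns a
   pointer past any recognised prefix and stores the base to use.  */
static const char *
accept_prefix (const char *s, int base, int *effective_base)
{
  if ((base == 0 || base == 16)
      && s[0] == '0' && (s[1] == 'x' || s[1] == 'X'))
    {
      *effective_base = 16;
      return s + 2;
    }
  /* For this sketch, an unspecified base falls back to decimal.  */
  *effective_base = (base == 0 ? 10 : base);
  return s;
}
```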
<li> <code>mpf_eq</code> is not always correct, when one operand is
     1000000000... and the other operand is 0111111111..., i.e., extremely
     close.  There is a special case in <code>mpf_sub</code> for this
     situation; put similar code in <code>mpf_eq</code>.  [In progress.]
<li> <code>mpf_eq</code> doesn't implement what gmp.texi specifies.  It should
     not use just whole limbs, but partial limbs.  [In progress.]
<li> <code>mpf_set_str</code> doesn't validate its exponent, for instance
     garbage 123.456eX789X is accepted (and an exponent 0 used), and overflow
     of a <code>long</code> is not detected.
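     The missing validation could look something like this sketch (plain C
     with <code>strtol</code>, not GMP's parser; the function name is
     invented here):

```c
#include <errno.h>
#include <stdlib.h>

/* Sketch: parse the digits after 'e', rejecting trailing garbage and
   overflow of a long.  Returns 0 on success, -1 on a malformed or
   overflowing exponent.  */
static int
parse_exponent (const char *s, long *exp)
{
  char *end;
  long e;

  errno = 0;
  e = strtol (s, &end, 10);
  if (end == s || *end != '\0')   /* no digits, or garbage like "789X" */
    return -1;
  if (errno == ERANGE)            /* exponent overflowed a long */
    return -1;
  *exp = e;
  return 0;
}
```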
<li> <code>mpf_add</code> doesn't check for a carry from truncated portions of
     the inputs, and in that respect doesn't implement the "infinite precision
     followed by truncate" specified in the manual.
<li> Windows DLLs: tests/mpz/reuse.c and tests/mpf/reuse.c initialize global
     variables with pointers to <code>mpz_add</code> etc, which doesn't work
     when those routines are coming from a DLL (because they're effectively
     function pointer global variables themselves).  Need to rearrange perhaps
     to a set of calls to a test function rather than iterating over an array.
<li> <code>mpz_pow_ui</code>: Detect when the result would be more memory than
     a <code>size_t</code> can represent and raise some suitable exception,
     probably an alloc call asking for <code>SIZE_T_MAX</code>, and if that
     somehow succeeds then an <code>abort</code>.  Various size overflows of
     this kind are not handled gracefully, probably resulting in segvs.
     <br>
     In <code>mpz_n_pow_ui</code>, detect when the count of low zero bits
     exceeds an <code>unsigned long</code>.  There's a (small) chance of this
     happening but still having enough memory to represent the value.
     Reported by Winfried Dreckmann in, for instance, <code>mpz_ui_pow_ui (x,
     4UL, 1431655766UL)</code>.
<li> <code>mpf</code>: Detect exponent overflow and raise some exception.
     It'd be nice to allow the full <code>mp_exp_t</code> range since that's
     how it's been in the past, but maybe dropping one bit would make it
     easier to test if e1+e2 goes out of bounds.
</ul>


<h4>Machine Independent Optimization</h4>
<ul>
<li> <code>mpf_cmp</code>: For better cache locality, don't test for low zero
     limbs until the high limbs fail to give an ordering.  Reduce code size by
     turning the three <code>mpn_cmp</code>'s into a single loop stopping when
     the end of one operand is reached (and then looking for a non-zero in the
     rest of the other).
<li> <code>mpf_mul_2exp</code>, <code>mpf_div_2exp</code>: The use of
     <code>mpn_lshift</code> for any size&lt;=prec means repeated
     <code>mul_2exp</code> and <code>div_2exp</code> calls accumulate low zero
     limbs until size==prec+1 is reached.  Those zeros will slow down
     subsequent operations, especially if the value is otherwise only small.
     If low bits of the low limb are zero, use <code>mpn_rshift</code> so as
     to not increase the size.
<li> <code>mpn_dc_sqrtrem</code>, <code>mpn_sqrtrem2</code>: Don't use
     <code>mpn_add_1</code> and <code>mpn_sub_1</code> for 1 limb operations;
     use <code>ADDC_LIMB</code> and <code>SUBC_LIMB</code> instead.
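     The shape of an <code>ADDC_LIMB</code>-style operation, sketched here as
     a plain C function rather than GMP's macro: a single-limb add with the
     carry returned separately, cheaper than a call to <code>mpn_add_1</code>
     for n == 1.

```c
#include <stdint.h>

/* Sketch: w = x + y on a 64-bit limb, returning the carry out.  The name
   and function form are illustrative; GMP's ADDC_LIMB is a macro.  */
static uint64_t
addc_limb (uint64_t *w, uint64_t x, uint64_t y)
{
  uint64_t sum = x + y;   /* wraps modulo 2^64 on overflow */
  *w = sum;
  return sum < x;         /* carry out: 1 iff the add wrapped */
}
```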
<li> <code>mpn_sqrtrem2</code>: Use plain variables for <code>sp[0]</code> and
     <code>rp[0]</code> calculations, so the compiler needn't worry about
     aliasing between <code>sp</code> and <code>rp</code>.
<li> <code>mpn_sqrtrem</code>: Some work can be saved in the last step when
     the remainder is not required, as noted in Paul's paper.
<li> <code>mpq_add</code>, <code>mpq_sub</code>: The gcd fits a single limb
     with high probability and in this case <code>binvert_limb</code> could
     be used to calculate the inverse just once for the two exact divisions
     "op1.den / gcd" and "op2.den / gcd", rather than letting
     <code>mpn_bdiv_q_1</code> do it each time.  This would require calling
     <code>mpn_pi1_bdiv_q_1</code>.
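     The inverse in question can be computed by Newton iteration, each step
     doubling the number of correct low bits.  A self-contained sketch for a
     64-bit limb (illustrative C, not GMP's <code>binvert_limb</code> macro):

```c
#include <stdint.h>

/* Sketch: for odd d, compute d^-1 mod 2^64.  Since d*d == 1 mod 8 for any
   odd d, inv = d is correct to 3 bits; five Newton steps then give
   3 -> 6 -> 12 -> 24 -> 48 -> 96 >= 64 correct bits.  Computing this once
   lets both exact divisions den1/gcd and den2/gcd be done by a single
   multiplication each: when d divides n exactly, n/d == n * inv mod 2^64. */
static uint64_t
binvert_u64 (uint64_t d)
{
  uint64_t inv = d;
  int i;
  for (i = 0; i < 5; i++)
    inv *= 2 - d * inv;
  return inv;
}
```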
<li> <code>mpn_gcdext</code>: Don't test <code>count_leading_zeros</code> for
     zero, instead check the high bit of the operand and avoid invoking
     <code>count_leading_zeros</code>.  This is an optimization on all
     machines, and significant on machines with slow
     <code>count_leading_zeros</code>, though it's possible an already
     normalized operand might not be encountered very often.
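     The suggested pattern, sketched with gcc's <code>__builtin_clzll</code>
     standing in for <code>count_leading_zeros</code> (an assumption; the
     function name here is invented):

```c
#include <stdint.h>

/* Sketch: when the operand is already normalized (high bit set) the count
   is known to be zero, so the possibly-slow count can be skipped entirely.
   Requires x != 0, as usual for clz.  */
static unsigned
norm_shift (uint64_t hi)
{
  if (hi >> 63)                            /* already normalized */
    return 0;
  return (unsigned) __builtin_clzll (hi);  /* gcc/clang builtin */
}
```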
<li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
     most significant limb (if <code>GMP_LIMB_BITS</code> &lt;= 52 bits).
     (Peter Montgomery has some ideas on this subject.)
<li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
     products with fewer operations.
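     For reference, the partial-product scheme in question looks like this
     when written for a 64-bit limb with 32-bit half-limbs (a sketch of the
     technique, not longlong.h's actual macro):

```c
#include <stdint.h>

/* Sketch: 128-bit product of two 64-bit operands from four 32x32->64
   partial products plus carry handling.  */
static void
umul_ppmm64 (uint64_t *hi, uint64_t *lo, uint64_t u, uint64_t v)
{
  uint64_t ul = u & 0xffffffffu, uh = u >> 32;
  uint64_t vl = v & 0xffffffffu, vh = v >> 32;

  uint64_t x0 = ul * vl;
  uint64_t x1 = ul * vh;
  uint64_t x2 = uh * vl;
  uint64_t x3 = uh * vh;

  x1 += x0 >> 32;                 /* cannot overflow */
  x1 += x2;                       /* may overflow: carry into x3 */
  if (x1 < x2)
    x3 += (uint64_t) 1 << 32;

  *hi = x3 + (x1 >> 32);
  *lo = (x1 << 32) + (x0 & 0xffffffffu);
}
```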
<li> Consider inlining <code>mpz_set_ui</code>.  This would be both small and
     fast, especially for compile-time constants, but would make application
     binaries depend on having 1 limb allocated to an <code>mpz_t</code>,
     preventing the "lazy" allocation scheme below.
<li> Consider inlining <code>mpz_[cft]div_ui</code> and maybe
     <code>mpz_[cft]div_r_ui</code>.  A <code>__gmp_divide_by_zero</code>
     would be needed for the divide by zero test, unless that could be left to
     <code>mpn_mod_1</code> (not sure currently whether all the risc chips
     provoke the right exception there if using mul-by-inverse).
<li> Consider inlining: <code>mpz_fits_s*_p</code>.  The setups for
     <code>LONG_MAX</code> etc would need to go into gmp.h, and on Cray it
     might, unfortunately, be necessary to forcibly include &lt;limits.h&gt;
     since there's no apparent way to get <code>SHRT_MAX</code> with an
     expression (since <code>short</code> and <code>unsigned short</code> can
     be different sizes).
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very fast on one
     or two limb moduli, due to a lot of function call overheads.  These could
     perhaps be handled as special cases.
<li> Make sure <code>mpz_powm_ui</code> is never slower than the corresponding
     computation using <code>mpz_powm</code>.
<li> <code>mpz_powm</code> REDC should do multiplications by <code>g[]</code>
     using the division method when they're small, since the REDC form of a
     small multiplier is normally a full size product.  Probably would need a
     new tuned parameter to say what size multiplier is "small", as a function
     of the size of the modulus.
<li> <code>mpn_gcd</code> might be able to be sped up on small to moderate
     sizes by improving <code>find_a</code>, possibly just by providing an
     alternate implementation for CPUs with slowish
     <code>count_leading_zeros</code>.
<li> <code>mpf_set_str</code> produces low zero limbs when a string has a
     fraction but is exactly representable, eg. 0.5 in decimal.  These could be
     stripped to save work in later operations.
<li> <code>mpz_and</code>, <code>mpz_ior</code> and <code>mpz_xor</code> should
     use <code>mpn_and_n</code> etc for the benefit of the small number of
     targets with native versions of those routines.  Need to be careful not to
     pass size==0.  Is some code sharing possible between the <code>mpz</code>
     routines?
<li> <code>mpf_add</code>: Don't do a copy to avoid overlapping operands
     unless it's really necessary (currently only sizes are tested, not
     whether r really is u or v).
<li> <code>mpf_add</code>: Under the check for v having no effect on the
     result, perhaps test for r==u and do nothing in that case, rather than
     the current behaviour, where it looks like an <code>MPN_COPY_INCR</code>
     will be done to reduce prec+1 limbs to prec.
<li> <code>mpf_div_ui</code>: Instead of padding with low zeros, call
     <code>mpn_divrem_1</code> asking for fractional quotient limbs.
<li> <code>mpf_div_ui</code>: Eliminate <code>TMP_ALLOC</code>.  When r!=u
     there's no overlap and the division can be called on those operands.
     When r==u and is prec+1 limbs, then it's an in-place division.  If r==u
     and not prec+1 limbs, then move the available limbs up to prec+1 and do
     an in-place there.
<li> <code>mpf_div_ui</code>: Whether the high quotient limb is zero can be
     determined by testing the dividend for high&lt;divisor.  When non-zero, the
     division can be done on prec dividend limbs instead of prec+1.  The result
     size is also known before the division, so that can be a tail call (once
     the <code>TMP_ALLOC</code> is eliminated).
<li> <code>mpn_divrem_2</code> could usefully accept unnormalized divisors and
     shift the dividend on-the-fly, since this should cost nothing on
     superscalar processors and avoid the need for temporary copying in
     <code>mpn_tdiv_qr</code>.
<li> <code>mpf_sqrt</code>: If r!=u, and if u doesn't need to be padded with
     zeros, then there's no need for the tp temporary.
<li> <code>mpq_cmp_ui</code> could form the <code>num1*den2</code> and
     <code>num2*den1</code> products limb-by-limb from high to low and look at
     each step for values differing by more than the possible carry bit from
     the uncalculated portion.
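     The underlying comparison, shown on word-size operands where the cross
     products fit a double-length word (the multi-limb version would form
     these products high-to-low and stop early, as described above; the
     function name is invented for illustration):

```c
#include <stdint.h>

/* Sketch: compare num1/den1 with num2/den2 via the cross products
   num1*den2 and num2*den1.  Denominators are assumed positive.  Returns
   <0, 0, or >0 like mpq_cmp.  */
static int
rat_cmp (uint32_t num1, uint32_t den1, uint32_t num2, uint32_t den2)
{
  uint64_t lhs = (uint64_t) num1 * den2;
  uint64_t rhs = (uint64_t) num2 * den1;
  return (lhs > rhs) - (lhs < rhs);
}
```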
<li> <code>mpq_cmp</code> could do the same high-to-low progressive multiply
     and compare.  The benefits of karatsuba and higher multiplication
     algorithms are lost, but if it's assumed only a few high limbs will be
     needed to determine an order then that's fine.
<li> <code>mpn_add_1</code>, <code>mpn_sub_1</code>, <code>mpn_add</code>,
     <code>mpn_sub</code>: Internally use <code>__GMPN_ADD_1</code> etc
     instead of the functions, so they get inlined on all compilers, not just
     gcc and others with <code>inline</code> recognised in gmp.h.
     <code>__GMPN_ADD_1</code> etc are meant mostly to support application
     inline <code>mpn_add_1</code> etc and if they don't come out good for
     internal uses then special forms can be introduced, for instance many
     internal uses are in-place.  Sometimes a block of code is executed based
     on the carry-out, rather than using it arithmetically, and those places
     might want to do their own loops entirely.
<li> <code>__gmp_extract_double</code> on 64-bit systems could use just one
     bitfield for the mantissa extraction, not two, when endianness permits.
     Might depend on the compiler allowing <code>long long</code> bit fields
     when that's the only actual 64-bit type.
<li> tal-notreent.c could keep a block of memory permanently allocated.
     Currently the last nested <code>TMP_FREE</code> releases all memory, so
     there's an allocate and free every time a top-level function using
     <code>TMP</code> is called.  Would need
     <code>mp_set_memory_functions</code> to tell tal-notreent.c to release
     any cached memory when changing allocation functions though.
<li> <code>__gmp_tmp_alloc</code> from tal-notreent.c could be partially
     inlined.  If the current chunk has enough room then a couple of pointers
     can be updated.  Only if more space is required then a call to some sort
     of <code>__gmp_tmp_increase</code> would be needed.  The requirement that
     <code>TMP_ALLOC</code> is an expression might make the implementation a
     bit ugly and/or a bit sub-optimal.
<pre>
#define TMP_ALLOC(n)
  ((ROUND_UP(n) &gt; current-&gt;end - current-&gt;point ?
    __gmp_tmp_increase (ROUND_UP (n)) : 0),
   current-&gt;point += ROUND_UP (n),
   current-&gt;point - ROUND_UP (n))
</pre>
<li> <code>__mp_bases</code> has a lot of data for bases which are pretty much
     never used.  Perhaps the table should just go up to base 16, and have
     code to generate data above that, if and when required.  Naturally this
     assumes the code would be smaller than the data saved.
<li> <code>__mp_bases</code> field <code>big_base_inverted</code> is only used
     if <code>USE_PREINV_DIVREM_1</code> is true, and could be omitted
     otherwise, to save space.
<li> <code>mpz_get_str</code>, <code>mtox</code>: For power-of-2 bases, which
     are of course fast, it seems a little silly to make a second pass over
     the <code>mpn_get_str</code> output to convert to ASCII.  Perhaps combine
     that with the bit extractions.
<li> <code>mpz_gcdext</code>: If the caller requests only the S cofactor (of
     A), and A&lt;B, then the code ends up generating the cofactor T (of B) and
     deriving S from that.  Perhaps it'd be possible to arrange to get S in
     the first place by calling <code>mpn_gcdext</code> with A+B,B.  This
     might only be an advantage if A and B are about the same size.
<li> <code>mpz_n_pow_ui</code> does a good job with small bases and stripping
     powers of 2, but it's perhaps a bit too complicated for what it gains.
     The simpler <code>mpn_pow_1</code> is a little faster on small exponents.
     (Note some of the ugliness in <code>mpz_n_pow_ui</code> is due to
     supporting <code>mpn_mul_2</code>.)
     <br>
     Perhaps the stripping of 2s in <code>mpz_n_pow_ui</code> should be
     confined to single limb operands for simplicity and since that's where
     the greatest gain would be.
     <br>
     Ideally <code>mpn_pow_1</code> and <code>mpz_n_pow_ui</code> would be
     merged.  The reason <code>mpz_n_pow_ui</code> writes to an
     <code>mpz_t</code> is that its callers leave it to make a good estimate
     of the result size.  Callers of <code>mpn_pow_1</code> already know the
     size by separate means (<code>mp_bases</code>).
<li> <code>mpz_invert</code> should call <code>mpn_gcdext</code> directly.
</ul>


<h4>Machine Dependent Optimization</h4>
<ul>
<li> <code>invert_limb</code> on various processors might benefit from the
     little Newton iteration done for alpha and ia64.
<li> Alpha 21264: <code>mpn_addlsh1_n</code> could be implemented with
     <code>mpn_addmul_1</code>, since that code at 3.5 is a touch faster than
     a separate <code>lshift</code> and <code>add_n</code> at
     1.75+2.125=3.875.  Or very likely some specific <code>addlsh1_n</code>
     code could beat both.
<li> Alpha 21264: Improve feed-in code for <code>mpn_mul_1</code>,
     <code>mpn_addmul_1</code>, and <code>mpn_submul_1</code>.
<li> Alpha 21164: Rewrite <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
     and <code>mpn_submul_1</code> for the 21164.  This should use both integer
     multiplies and floating-point multiplies.  For the floating-point
     operations, the single-limb multiplier should be split into three 21-bit
     chunks, or perhaps even better in four 16-bit chunks.  Probably possible
     to reach 9 cycles/limb.
<li> Alpha: GCC 3.4 will introduce <code>__builtin_ctzl</code>,
     <code>__builtin_clzl</code> and <code>__builtin_popcountl</code> using
     the corresponding CIX <code>ct</code> instructions, and
     <code>__builtin_alpha_cmpbge</code>.  These should give GCC more
     information about scheduling etc than the <code>asm</code> blocks
     currently used in longlong.h and gmp-impl.h.
<li> Alpha Unicos: Apparently there's no <code>alloca</code> on this system,
     making <code>configure</code> choose the slower
     <code>malloc-reentrant</code> allocation method.  Is there a better way?
     Maybe variable-length arrays per notes below.
<li> Alpha Unicos 21164, 21264: <code>.align</code> is not used since it pads
     with garbage.  Does the code get the intended slotting required for the
     claimed speeds?  <code>.align</code> at the start of a function would
     presumably be safe no matter how it pads.
<li> ARM V5: <code>count_leading_zeros</code> can use the <code>clz</code>
     instruction.  For GCC 3.4 and up, do this via <code>__builtin_clzl</code>
     since then gcc knows it's "predicable".
<li> Itanium: GCC 3.4 introduces <code>__builtin_popcount</code> which can be
     used instead of an <code>asm</code> block.  The builtin should give gcc
     more opportunities for scheduling, bundling and predication.
     <code>__builtin_ctz</code> similarly (it just uses popcount as per
     current longlong.h).
<li> UltraSPARC/64: Optimize <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
     for s2 &lt; 2^32 (or perhaps for any zero 16-bit s2 chunk).  Not sure how
     much this can improve the speed, though, since the symmetry that we rely
     on is lost.  Perhaps we can just gain cycles when s2 &lt; 2^16, or more
     accurately, when two 16-bit s2 chunks which are 16 bits apart are zero.
<li> UltraSPARC/64: Write native <code>mpn_submul_1</code>, analogous to
     <code>mpn_addmul_1</code>.
<li> UltraSPARC/64: Write <code>umul_ppmm</code>.  Using four
     "<code>mulx</code>"s either with an asm block or via the generic C code is
     about 90 cycles.  Try using fp operations, and also try using karatsuba
     for just three "<code>mulx</code>"s.
<li> UltraSPARC/32: Rewrite <code>mpn_lshift</code>, <code>mpn_rshift</code>.
     Will give 2 cycles/limb.  Trivial modifications of mpn/sparc64 should do.
<li> UltraSPARC/32: Write special mpn_Xmul_1 loops for s2 &lt; 2^16.
<li> UltraSPARC/32: Use <code>mulx</code> for <code>umul_ppmm</code> if
     possible (see commented out code in longlong.h).  This is unlikely to
     save more than a couple of cycles, so perhaps isn't worth bothering with.
<li> UltraSPARC/32: On Solaris gcc doesn't give us <code>__sparc_v9__</code>
     or anything to indicate V9 support when -mcpu=v9 is selected.  See
     gcc/config/sol2-sld-64.h.  Will need to pass something through from
     ./configure to select the right code in longlong.h.  (Currently nothing
     is lost because <code>mulx</code> for multiplying is commented out.)
<li> UltraSPARC/32: <code>mpn_divexact_1</code> and
     <code>mpn_modexact_1c_odd</code> can use a 64-bit inverse and take
     64-bits at a time from the dividend, as per the 32-bit divisor case in
     mpn/sparc64/mode1o.c.  This must be done in assembler, since the full
     64-bit registers (<code>%gN</code>) are not available from C.
<li> UltraSPARC/32: <code>mpn_divexact_by3c</code> can work 64-bits at a time
     using <code>mulx</code>, in assembler.  This would be the same as for
     sparc64.
<li> UltraSPARC: <code>binvert_limb</code> might save a few cycles from
     masking down to just the useful bits at each point in the calculation,
     since <code>mulx</code> speed depends on the highest bit set.  Either
     explicit masks or small types like <code>short</code> and
     <code>int</code> ought to work.
<li> Sparc64 HAL R1 <code>popc</code>: This chip reputedly implements
     <code>popc</code> properly (see gcc sparc.md).  Would need to recognise
     it as <code>sparchalr1</code> or something in configure / config.sub /
     config.guess.  <code>popc_limb</code> in gmp-impl.h could use this (per
     commented out code).  <code>count_trailing_zeros</code> could use it too.
<li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.  The current code runs at 11 cycles/limb.  It
     should be possible to saturate the cache, which will happen at 8
     cycles/limb (7.5 for mpn_mul_1).  Write special loops for s2 &lt; 2^32;
     it should be possible to make them run at about 5 cycles/limb.
<li> PPC601: See which of the power or powerpc32 code runs better.  Currently
     the powerpc32 is used, but only because it's the default for
     <code>powerpc*</code>.
<li> PPC630: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.  Use both integer and floating-point operations,
     possibly two floating-point and one integer limb per loop.  Split operands
     into four 16-bit chunks for fast fp operations.  Should easily reach 9
     cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb
     (using one int + two fp).
<li> PPC630: <code>mpn_rshift</code> could do the same sort of unrolled loop
     as <code>mpn_lshift</code>.  Some judicious use of m4 might let the two
     share source code, or with a register to control the loop direction
     perhaps even share object code.
<li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
     for important machines.  Helping the generic sqr_basecase.c with an
     <code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
<li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
     Will bring time from 1.75 to 1.25 cycles/limb.
<li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1.  (See
     Pentium code.)
<li> X86: Good authority has it that in the past an inline <code>rep
     movs</code> would upset GCC register allocation for the whole function.
     Is this still true in GCC 3?  It uses <code>rep movs</code> itself for
     <code>__builtin_memcpy</code>.  Examine the code for some simple and
     complex functions to find out.  Inlining <code>rep movs</code> would be
     desirable, it'd be both smaller and faster.
<li> Pentium P54: <code>mpn_lshift</code> and <code>mpn_rshift</code> can come
     down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after
     <code>shrdl</code> and <code>shldl</code>, see mpn/x86/pentium/README.
<li> Pentium P55 MMX: <code>mpn_lshift</code> and <code>mpn_rshift</code>
     might benefit from some destination prefetching.
<li> PentiumPro: <code>mpn_divrem_1</code> might be able to use a
     mul-by-inverse, hoping for maybe 30 c/l.
<li> K7: <code>mpn_lshift</code> and <code>mpn_rshift</code> might be able to
     do something branch-free for unaligned startups, and shaving one insn
     from the loop with alternative indexing might save a cycle.
<li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
     The pipeline is now extremely deep, perhaps unnecessarily deep.
<li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
<li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
     <code>mpn_sqr_basecase</code>.  This should use a "vertical multiplication
     method" to avoid carry propagation, splitting one of the operands into
     11-bit chunks.
<li> Pentium: <code>mpn_lshift</code> by 31 should use the special rshift
     by 1 code, and vice versa <code>mpn_rshift</code> by 31 should use the
     special lshift by 1.  This would be best as a jump across to the other
     routine, could let both live in lshift.asm and omit rshift.asm on finding
     <code>mpn_rshift</code> already provided.
<li> Cray T3E: Experiment with optimization options.  In particular,
     -hpipeline3 seems promising.  We should at least up -O to -O2 or -O3.
<li> Cray: <code>mpn_com</code> and <code>mpn_and_n</code> etc very probably
     want a pragma like <code>MPN_COPY_INCR</code>.
<li> Cray vector systems: <code>mpn_lshift</code>, <code>mpn_rshift</code>,
     <code>mpn_popcount</code> and <code>mpn_hamdist</code> are nice and small
     and could be inlined to avoid function calls.
<li> Cray: Variable length arrays seem to be faster than the tal-notreent.c
     scheme.  Not sure why, maybe they merely give the compiler more
     information about aliasing (or the lack thereof).  Would like to modify
     <code>TMP_ALLOC</code> to use them, or introduce a new scheme.  Memory
     blocks wanted unconditionally are easy enough, those wanted only
     sometimes are a problem.  Perhaps a special size calculation to ask for a
     dummy length 1 when unwanted, or perhaps an inlined subroutine
     duplicating code under each conditional.  Don't really want to turn
     everything into a dog's dinner just because Cray doesn't offer an
     <code>alloca</code>.
<li> Cray: <code>mpn_get_str</code> on power-of-2 bases ought to vectorize.
     Does it?  <code>bits_per_digit</code> and the inner loop over bits in a
     limb might prevent it.  Perhaps special cases for binary, octal and hex
     would be worthwhile (very possibly for all processors too).
<li> S390: <code>BSWAP_LIMB_FETCH</code> looks like it could be done with
     <code>lrvg</code>, as per glibc sysdeps/s390/s390-64/bits/byteswap.h.
     Presumably this is only for 64-bit mode, since 32-bit mode has other
     code?  Also, is it worth using for <code>BSWAP_LIMB</code> too, or
     would that mean a store and re-fetch?  Presumably that's what comes out
     in glibc.
<li> Improve <code>count_leading_zeros</code> for 64-bit machines:
     <pre>
     if ((x &gt;&gt; 32) == 0) { x &lt;&lt;= 32; cnt += 32; }
     if ((x &gt;&gt; 48) == 0) { x &lt;&lt;= 16; cnt += 16; }
     ... </pre>
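     A complete version of that binary-search scheme, written out as plain C
     for a 64-bit limb (each step halves the range still to be examined; the
     function name is invented here):

```c
#include <stdint.h>

/* Sketch: count leading zeros by binary search.  Behaviour for x == 0 is
   not defined, as usual for clz.  */
static unsigned
clz64 (uint64_t x)
{
  unsigned cnt = 0;
  if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
  if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
  if ((x >> 56) == 0) { x <<= 8;  cnt += 8; }
  if ((x >> 60) == 0) { x <<= 4;  cnt += 4; }
  if ((x >> 62) == 0) { x <<= 2;  cnt += 2; }
  if ((x >> 63) == 0) { cnt += 1; }
  return cnt;
}
```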
<li> IRIX 6 MIPSpro compiler has an <code>__inline</code> which could perhaps
     be used in <code>__GMP_EXTERN_INLINE</code>.  What would be the right way
     to identify suitable versions of that compiler?
<li> IRIX <code>cc</code> is rumoured to have an <code>_int_mult_upper</code>
     (in <code>&lt;intrinsics.h&gt;</code> like Cray), but it didn't seem to
     exist on some IRIX 6.5 systems tried.  If it does actually exist
     somewhere it would very likely be an improvement over a function call to
     umul.asm.
<li> <code>mpn_get_str</code> final divisions by the base with
     <code>udiv_qrnnd_unnorm</code> could use some sort of multiply-by-inverse
     on suitable machines.  This ends up happening for decimal by presenting
     the compiler with a run-time constant, but the same for other bases would
     be good.  Perhaps use could be made of the fact base&lt;256.
<li> <code>mpn_umul_ppmm</code>, <code>mpn_udiv_qrnnd</code>: Return a
     structure like <code>div_t</code> to avoid going through memory, in
     particular helping RISCs that don't do store-to-load forwarding.  Clearly
     this is only possible if the ABI returns a structure of two
     <code>mp_limb_t</code>s in registers.
     <br>
     On PowerPC, structures are returned in memory on AIX and Darwin.  In SVR4
     they're returned in registers, except that draft SVR4 had said memory, so
     it'd be prudent to check which is done.  We can jam the compiler into the
     right mode if we know how, since all this is purely internal to libgmp.
     (gcc has an option, though of course gcc doesn't matter since we use
     inline asm there.)
</ul>

<h4>New Functionality</h4>
<ul>
<li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
<li> Let `0b' and `0B' mean binary input everywhere.
<li> <code>mpz_init</code> and <code>mpq_init</code> could do lazy allocation.
     Set <code>ALLOC(var)</code> to 0 to indicate nothing allocated, and let
     <code>_mpz_realloc</code> do the initial alloc.  Set
     <code>z-&gt;_mp_d</code> to a dummy that <code>mpz_get_ui</code> and
     similar can unconditionally fetch from.  Niels Möller has had a go at
     this.
     <br>
     The advantages of the lazy scheme would be:
     <ul>
     <li> Initial allocate would be the size required for the first value
          stored, rather than getting 1 limb in <code>mpz_init</code> and then
          more or less immediately reallocating.
     <li> <code>mpz_init</code> would only store magic values in the
          <code>mpz_t</code> fields, and could be inlined.
     <li> A fixed initializer could even be used by applications, like
          <code>mpz_t z = MPZ_INITIALIZER;</code>, which might be convenient
          for globals.
     </ul>
     The advantages of the current scheme are:
     <ul>
     <li> <code>mpz_set_ui</code> and other similar routines needn't check the
          size allocated and can just store unconditionally.
     <li> <code>mpz_set_ui</code> and perhaps others like
          <code>mpz_tdiv_r_ui</code> and a prospective
          <code>mpz_set_ull</code> could be inlined.
     </ul>
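     A standalone sketch of the lazy scheme, with every name invented here
     (this is not GMP's internal layout): alloc==0 marks nothing allocated,
     and a shared dummy limb lets a get-style accessor fetch unconditionally.

```c
#include <stddef.h>

/* Hypothetical illustration only; lazy_int and LAZY_INT_INITIALIZER are
   invented names, not GMP's.  */
typedef struct { int alloc; int size; unsigned long *d; } lazy_int;

/* Shared dummy limb, so d can always be dereferenced even before any real
   allocation has happened.  */
static unsigned long lazy_dummy_limb = 0;

/* A fixed initializer usable for globals: nothing allocated yet.  */
#define LAZY_INT_INITIALIZER  { 0, 0, &lazy_dummy_limb }

/* With the dummy in place, a get can fetch unconditionally, no size or
   allocation check needed (the value is 0 while size is 0).  */
static unsigned long
lazy_int_get_ui (const lazy_int *z)
{
  return z->d[0];
}
```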
<li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>.  Make sure
     format is portable between 32-bit and 64-bit machines, and between
     little-endian and big-endian machines.  A format which MPFR can use too
     would be good.
<li> <code>mpn_and_n</code> ... <code>mpn_copyd</code>: Perhaps make the mpn
     logops and copys available in gmp.h, either as library functions or
     inlines, with the availability of library functions instantiated in the
     generated gmp.h at build time.
<li> <code>mpz_set_str</code> etc variants taking string lengths rather than
     null-terminators.
<li> <code>mpz_andn</code>, <code>mpz_iorn</code>, <code>mpz_nand</code>,
     <code>mpz_nior</code>, <code>mpz_xnor</code> might be useful additions,
     if they could share code with the current such functions (which should be
     possible).
<li> <code>mpz_and_ui</code> etc might be of use sometimes.  Suggested by
     Niels Möller.
<li> <code>mpf_set_str</code> and <code>mpf_inp_str</code> could usefully
     accept 0x, 0b etc when base==0.  Perhaps the exponent could default to
     decimal in this case, with a further 0x, 0b etc allowed there.
     Eg. 0xFFAA@0x5A.  A leading "0" for octal would match the integers, but
     probably something like "0.123" ought not mean octal.
<li> <code>GMP_LONG_LONG_LIMB</code> or some such could become a documented
     feature of gmp.h, so applications could know whether to
     <code>printf</code> a limb using <code>%lu</code> or <code>%Lu</code>.
<li> <code>GMP_PRIdMP_LIMB</code> and similar defines following C99
     &lt;inttypes.h&gt; might be of use to applications printing limbs.  But
     if <code>GMP_LONG_LONG_LIMB</code> or whatever is added then perhaps this
     can easily enough be left to applications.
<li> <code>gmp_printf</code> could accept <code>%b</code> for binary output.
     It'd be nice if it worked for plain <code>int</code> etc too, not just
     <code>mpz_t</code> etc.
543<li> <code>gmp_printf</code> in fact could usefully accept an arbitrary base,
544 for both integer and float conversions. A base either in the format
545 string or as a parameter with <code>*</code> should be allowed. Maybe
546 <code>&amp;13b</code> (b for base) or something like that.
547<li> <code>gmp_printf</code> could perhaps accept <code>mpq_t</code> for float
548 conversions, eg. <code>"%.4Qf"</code>. This would be merely for
549 convenience, but still might be useful. Rounding would be the same as
550 for an <code>mpf_t</code> (ie. currently round-to-nearest, but not
551 actually documented). Alternately, perhaps a separate
552 <code>mpq_get_str_point</code> or some such might be more use. Suggested
553 by Pedro Gimeno.
554<li> <code>mpz_rscan0</code> or <code>mpz_revscan0</code> or some such
555 searching towards the low end of an integer might match
556 <code>mpz_scan0</code> nicely. Likewise for <code>scan1</code>.
557 Suggested by Roberto Bagnara.
558<li> <code>mpz_bit_subset</code> or some such to test whether one integer is a
559 bitwise subset of another might be of use. Some sort of return value
560 indicating whether it's a proper or non-proper subset would be good and
561 wouldn't cost anything in the implementation. Suggested by Roberto
562 Bagnara.
563<li> <code>mpf_get_ld</code>, <code>mpf_set_ld</code>: Conversions between
564 <code>mpf_t</code> and <code>long double</code>, suggested by Dan
565 Christensen. Other <code>long double</code> routines might be desirable
566 too, but <code>mpf</code> would be a start.
567 <br>
568 <code>long double</code> is an ANSI-ism, so everything involving it would
569 need to be suppressed on a K&amp;R compiler.
570 <br>
     There'd be some work to be done by <code>configure</code> to recognise
     the format in use; MPFR has a start on this.  Often <code>long
     double</code> is the same as <code>double</code>, which is easy but
     pretty pointless.  A single float format detector macro could look at
     <code>double</code> then <code>long double</code>.
576 <br>
577 Sometimes there's a compiler option for the size of a <code>long
578 double</code>, eg. xlc on AIX can use either 64-bit or 128-bit. It's
579 probably simplest to regard this as a compiler compatibility issue, and
580 leave it to users or sysadmins to ensure application and library code is
581 built the same.
582<li> <code>mpz_sqrt_if_perfect_square</code>: When
583 <code>mpz_perfect_square_p</code> does its tests it calculates a square
584 root and then discards it. For some applications it might be useful to
585 return that root. Suggested by Jason Moxham.
586<li> <code>mpz_get_ull</code>, <code>mpz_set_ull</code>,
     <code>mpz_get_sll</code>, <code>mpz_set_sll</code>: Conversions for
588 <code>long long</code>. These would aid interoperability, though a
589 mixture of GMP and <code>long long</code> would probably not be too
590 common. Since <code>long long</code> is not always available (it's in
591 C99 and GCC though), disadvantages of using <code>long long</code> in
592 libgmp.a would be
593 <ul>
594 <li> Library contents vary according to the build compiler.
595 <li> gmp.h would need an ugly <code>#ifdef</code> block to decide if the
596 application compiler could take the <code>long long</code>
597 prototypes.
598 <li> Some sort of <code>LIBGMP_HAS_LONGLONG</code> might be wanted to
599 indicate whether the functions are available. (Applications using
600 autoconf could probe the library too.)
601 </ul>
602 It'd be possible to defer the need for <code>long long</code> to
603 application compile time, by having something like
604 <code>mpz_set_2ui</code> called with two halves of a <code>long
605 long</code>. Disadvantages of this would be,
606 <ul>
607 <li> Bigger code in the application, though perhaps not if a <code>long
608 long</code> is normally passed as two halves anyway.
609 <li> <code>mpz_get_ull</code> would be a rather big inline, or would have
610 to be two function calls.
611 <li> <code>mpz_get_sll</code> would be a worse inline, and would put the
612 treatment of <code>-0x10..00</code> into applications (see
613 <code>mpz_get_si</code> correctness above).
614 <li> Although having libgmp.a independent of the build compiler is nice,
615 it sort of sacrifices the capabilities of a good compiler to
616 uniformity with inferior ones.
617 </ul>
618 Plain use of <code>long long</code> is probably the lesser evil, if only
619 because it makes best use of gcc. In fact perhaps it would suffice to
620 guarantee <code>long long</code> conversions only when using GCC for both
621 application and library. That would cover free software, and we can
622 worry about selected vendor compilers later.
623 <br>
     In C++ the situation is probably clearer; we demand fairly recent C++ so
625 <code>long long</code> should be available always. We'd probably prefer
626 to have the C and C++ the same in respect of <code>long long</code>
627 support, but it would be possible to have it unconditionally in gmpxx.h,
628 by some means or another.
629<li> <code>mpz_strtoz</code> parsing the same as <code>strtol</code>.
630 Suggested by Alexander Kruppa.
631</ul>
632
633
634<h4>Configuration</h4>
635
636<ul>
637<li> Alpha ev7, ev79: Add code to config.guess to detect these. Believe ev7
638 will be "3-1307" in the current switch, but need to verify that. (On
     OSF, current configfsf.guess identifies ev7 using psrinfo; we need to do
640 it ourselves for other systems.)
641<li> Alpha OSF: Libtool (version 1.5) doesn't seem to recognise this system is
642 "pic always" and ends up running gcc twice with the same options. This
643 is wasteful, but harmless. Perhaps a newer libtool will be better.
644<li> ARM: <code>umul_ppmm</code> in longlong.h always uses <code>umull</code>,
645 but is that available only for M series chips or some such? Perhaps it
646 should be configured in some way.
647<li> HPPA: config.guess should recognize 7000, 7100, 7200, and 8x00.
648<li> HPPA: gcc 3.2 introduces a <code>-mschedule=7200</code> etc parameter,
649 which could be driven by an exact hppa cpu type.
650<li> Mips: config.guess should say mipsr3000, mipsr4000, mipsr10000, etc.
651 "hinv -c processor" gives lots of information on Irix. Standard
652 config.guess appends "el" to indicate endianness, but
653 <code>AC_C_BIGENDIAN</code> seems the best way to handle that for GMP.
654<li> PowerPC: The function descriptor nonsense for AIX is currently driven by
655 <code>*-*-aix*</code>. It might be more reliable to do some sort of
656 feature test, examining the compiler output perhaps. It might also be
657 nice to merge the aix.m4 files into powerpc-defs.m4.
<li> config.m4 is generated only by the configure script; it won't be
659 regenerated by config.status. Creating it as an <code>AC_OUTPUT</code>
660 would work, but it might upset "make" to have things like <code>L$</code>
661 get into the Makefiles through <code>AC_SUBST</code>.
662 <code>AC_CONFIG_COMMANDS</code> would be the alternative. With some
663 careful m4 quoting the <code>changequote</code> calls might not be
664 needed, which might free up the order in which things had to be output.
665<li> Automake: Latest automake has a <code>CCAS</code>, <code>CCASFLAGS</code>
666 scheme. Though we probably wouldn't be using its assembler support we
667 could try to use those variables in compatible ways.
668<li> <code>GMP_LDFLAGS</code> could probably be done with plain
669 <code>LDFLAGS</code> already used by automake for all linking. But with
670 a bit of luck the next libtool will pass pretty much all
671 <code>CFLAGS</code> through to the compiler when linking, making
672 <code>GMP_LDFLAGS</code> unnecessary.
673<li> mpn/Makeasm.am uses <code>-c</code> and <code>-o</code> together in the
674 .S and .asm rules, but apparently that isn't completely portable (there's
675 an autoconf <code>AC_PROG_CC_C_O</code> test for it). So far we've not
676 had problems, but perhaps the rules could be rewritten to use "foo.s" as
677 the temporary, or to do a suitable "mv" of the result. The only danger
678 from using foo.s would be if a compile failed and the temporary foo.s
679 then looked like the primary source. Hopefully if the
680 <code>SUFFIXES</code> are ordered to have .S and .asm ahead of .s that
681 wouldn't happen. Might need to check.
682</ul>
683
684
685<h4>Random Numbers</h4>
686<ul>
687<li> <code>_gmp_rand</code> is not particularly fast on the linear
688 congruential algorithm and could stand various improvements.
689 <ul>
690 <li> Make a second seed area within <code>gmp_randstate_t</code> (or
691 <code>_mp_algdata</code> rather) to save some copying.
692 <li> Make a special case for a single limb <code>2exp</code> modulus, to
693 avoid <code>mpn_mul</code> calls. Perhaps the same for two limbs.
694 <li> Inline the <code>lc</code> code, to avoid a function call and
695 <code>TMP_ALLOC</code> for every chunk.
696 <li> Perhaps the <code>2exp</code> and general LC cases should be split,
697 for clarity (if the general case is retained).
698 </ul>
699<li> <code>gmp_randstate_t</code> used for parameters perhaps should become
700 <code>gmp_randstate_ptr</code> the same as other types.
701<li> Some of the empirical randomness tests could be included in a "make
702 check". They ought to work everywhere, for a given seed at least.
703</ul>
704
705
706<h4>C++</h4>
707<ul>
708<li> <code>mpz_class(string)</code>, etc: Use the C++ global locale to
709 identify whitespace.
710 <br>
711 <code>mpf_class(string)</code>: Use the C++ global locale decimal point,
712 rather than the C one.
713 <br>
714 Consider making these variant <code>mpz_set_str</code> etc forms
715 available for <code>mpz_t</code> too, not just <code>mpz_class</code>
716 etc.
717<li> <code>mpq_class operator+=</code>: Don't emit an unnecessary
718 <code>mpq_set(q,q)</code> before <code>mpz_addmul</code> etc.
719<li> Put various bits of gmpxx.h into libgmpxx, to avoid excessive inlining.
720 Candidates for this would be,
721 <ul>
722 <li> <code>mpz_class(const char *)</code>, etc: since they're normally
723 not fast anyway, and we can hide the exception <code>throw</code>.
724 <li> <code>mpz_class(string)</code>, etc: to hide the <code>cstr</code>
725 needed to get to the C conversion function.
726 <li> <code>mpz_class string, char*</code> etc constructors: likewise to
727 hide the throws and conversions.
728 <li> <code>mpz_class::get_str</code>, etc: to hide the <code>char*</code>
729 to <code>string</code> conversion and free. Perhaps
730 <code>mpz_get_str</code> can write directly into a
731 <code>string</code>, to avoid copying.
732 <br>
733 Consider making such <code>string</code> returning variants
734 available for use with plain <code>mpz_t</code> etc too.
735 </ul>
736</ul>
737
738<h4>Miscellaneous</h4>
739<ul>
740<li> <code>mpz_gcdext</code> and <code>mpn_gcdext</code> ought to document
741 what range of values the generated cofactors can take, and preferably
742 ensure the definition uniquely specifies the cofactors for given inputs.
743 A basic extended Euclidean algorithm or multi-step variant leads to
     |x|&lt;|b| and |y|&lt;|a| or something like that, but there are probably
     two solutions under just those restrictions.
746<li> demos/factorize.c: use <code>mpz_divisible_ui_p</code> rather than
747 <code>mpz_tdiv_qr_ui</code>. (Of course dividing multiple primes at a
748 time would be better still.)
749<li> The various test programs use quite a bit of the main
750 <code>libgmp</code>. This establishes good cross-checks, but it might be
751 better to use simple reference routines where possible. Where it's not
752 possible some attention could be paid to the order of the tests, so a
753 <code>libgmp</code> routine is only used for tests once it seems to be
754 good.
755<li> <code>MUL_FFT_THRESHOLD</code> etc: the FFT thresholds should allow a
756 return to a previous k at certain sizes. This arises basically due to
757 the step effect caused by size multiples effectively used for each k.
758 Looking at a graph makes it fairly clear.
759<li> <code>__gmp_doprnt_mpf</code> does a rather unattractive round-to-nearest
760 on the string returned by <code>mpf_get_str</code>. Perhaps some variant
761 of <code>mpf_get_str</code> could be made which would better suit.
762</ul>
763
764
765<h4>Aids to Development</h4>
766<ul>
767<li> Add <code>ASSERT</code>s at the start of each user-visible mpz/mpq/mpf
768 function to check the validity of each <code>mp?_t</code> parameter, in
769 particular to check they've been <code>mp?_init</code>ed. This might
770 catch elementary mistakes in user programs. Care would need to be taken
771 over <code>MPZ_TMP_INIT</code>ed variables used internally. If nothing
772 else then consistency checks like size&lt;=alloc, ptr not
773 <code>NULL</code> and ptr+size not wrapping around the address space,
774 would be possible. A more sophisticated scheme could track
775 <code>_mp_d</code> pointers and ensure only a valid one is used. Such a
776 scheme probably wouldn't be reentrant, not without some help from the
777 system.
778<li> tune/time.c could try to determine at runtime whether
779 <code>getrusage</code> and <code>gettimeofday</code> are reliable.
780 Currently we pretend in configure that the dodgy m68k netbsd 1.4.1
781 <code>getrusage</code> doesn't exist. If a test might take a long time
782 to run then perhaps cache the result in a file somewhere.
783<li> tune/time.c could choose the default precision based on the
784 <code>speed_unittime</code> determined, independent of the method in use.
785<li> Cray vector systems: CPU frequency could be determined from
786 <code>sysconf(_SC_CLK_TCK)</code>, since it seems to be clock cycle
787 based. Is this true for all Cray systems? Would like some documentation
788 or something to confirm.
789</ul>
790
791
792<h4>Documentation</h4>
793<ul>
794<li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
795<li> <code>mpn_get_str</code> isn't terribly clear about how many digits it
796 produces. It'd probably be possible to say at most one leading zero,
797 which is what both it and <code>mpz_get_str</code> currently do. But
798 want to be careful not to bind ourselves to something that might not suit
799 another implementation.
800<li> <code>va_arg</code> doesn't do the right thing with <code>mpz_t</code>
801 etc directly, but instead needs a pointer type like <code>MP_INT*</code>.
802 It'd be good to show how to do this, but we'd either need to document
803 <code>mpz_ptr</code> and friends, or perhaps fallback on something
804 slightly nasty with <code>void*</code>.
805</ul>
806
807
808<h4>Bright Ideas</h4>
809
810<p> The following may or may not be feasible, and aren't likely to get done in the
811near future, but are at least worth thinking about.
812
813<ul>
814<li> Reorganize longlong.h so that we can inline the operations even for the
815 system compiler. When there is no such compiler feature, make calls to
816 stub functions. Write such stub functions for as many machines as
817 possible.
818<li> longlong.h could declare when it's using, or would like to use,
819 <code>mpn_umul_ppmm</code>, and the corresponding umul.asm file could be
820 included in libgmp only in that case, the same as is effectively done for
821 <code>__clz_tab</code>. Likewise udiv.asm and perhaps cntlz.asm. This
822 would only be a very small space saving, so perhaps not worth the
823 complexity.
824<li> longlong.h could be built at configure time by concatenating or
825 #including fragments from each directory in the mpn path. This would
826 select CPU specific macros the same way as CPU specific assembler code.
827 Code used would no longer depend on cpp predefines, and the current
828 nested conditionals could be flattened out.
829<li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000, whereas it's
830 sort of supposed to return the low 31 (or 63) bits. But this is
831 undocumented, and perhaps not too important.
832<li> <code>mpz_init_set*</code> and <code>mpz_realloc</code> could allocate
833 say an extra 16 limbs over what's needed, so as to reduce the chance of
834 having to do a reallocate if the <code>mpz_t</code> grows a bit more.
835 This could only be an option, since it'd badly bloat memory usage in
836 applications using many small values.
837<li> <code>mpq</code> functions could perhaps check for numerator or
838 denominator equal to 1, on the assumption that integers or
839 denominator-only values might be expected to occur reasonably often.
840<li> <code>count_trailing_zeros</code> is used on more or less uniformly
841 distributed numbers in a couple of places. For some CPUs
842 <code>count_trailing_zeros</code> is slow and it's probably worth handling
843 the frequently occurring 0 to 2 trailing zeros cases specially.
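The kind of special-casing meant, sketched on a plain <code>unsigned long</code>; the stated frequencies assume uniformly distributed non-zero input:

```c
#include <assert.h>

/* Count trailing zeros, settling the common cases from the low bits:
   half of uniform inputs have none, a quarter have one, an eighth
   have two.  The general loop (standing in for a slow CPU count
   instruction) is only reached for the remaining eighth.  x must be
   non-zero.  */
static int
ctz_fast (unsigned long x)
{
  if (x & 1)
    return 0;
  if (x & 2)
    return 1;
  if (x & 4)
    return 2;
  {
    int c = 3;
    x >>= 3;
    while ((x & 1) == 0)
      {
        x >>= 1;
        c++;
      }
    return c;
  }
}
```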
844<li> <code>mpf_t</code> might like to let the exponent be undefined when
845 size==0, instead of requiring it 0 as now. It should be possible to do
846 size==0 tests before paying attention to the exponent. The advantage is
847 not needing to set exp in the various places a zero result can arise,
848 which avoids some tedium but is otherwise perhaps not too important.
849 Currently <code>mpz_set_f</code> and <code>mpf_cmp_ui</code> depend on
850 exp==0, maybe elsewhere too.
851<li> <code>__gmp_allocate_func</code>: Could use GCC <code>__attribute__
852 ((malloc))</code> on this, though don't know if it'd do much. GCC 3.0
853 allows that attribute on functions, but not function pointers (see info
854 node "Attribute Syntax"), so would need a new autoconf test. This can
855 wait until there's a GCC that supports it.
856<li> <code>mpz_add_ui</code> contains two <code>__GMPN_COPY</code>s, one from
857 <code>mpn_add_1</code> and one from <code>mpn_sub_1</code>. If those two
858 routines were opened up a bit maybe that code could be shared. When a
859 copy needs to be done there's no carry to append for the add, and if the
860 copy is non-empty no high zero for the sub.
861</ul>
862
863
864<h4>Old and Obsolete Stuff</h4>
865
866<p> The following tasks apply to chips or systems that are old and/or obsolete.
867It's unlikely anything will be done about them unless anyone is actively using
868them.
869
870<ul>
871<li> Sparc32: The integer based udiv_nfp.asm used to be selected by
872 <code>configure --nfp</code> but that option is gone now that autoconf is
873 used. The file could go somewhere suitable in the mpn search if any
874 chips might benefit from it, though it's possible we don't currently
875 differentiate enough exact cpu types to do this properly.
876<li> VAX D and G format <code>double</code> floats are straightforward and
877 could perhaps be handled directly in <code>__gmp_extract_double</code>
878 and maybe in <code>mpn_get_d</code>, rather than falling back on the
879 generic code. (Both formats are detected by <code>configure</code>.)
880</ul>
881
882
883<hr>
884
885</body>
886</html>
887
888<!--
889Local variables:
890eval: (add-hook 'write-file-hooks 'time-stamp)
891time-stamp-start: "This file current as of "
892time-stamp-format: "%:d %3b %:y"
893time-stamp-end: "\\."
894time-stamp-line-limit: 50
895End:
896-->