<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <title>GMP Itemized Development Tasks</title>
  <link rel="shortcut icon" href="favicon.ico">
  <link rel="stylesheet" href="gmp.css">
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>

<center>
  <h1>
    GMP Itemized Development Tasks
  </h1>
</center>

<font size=-1>
<pre>
Copyright 2000-2004, 2006, 2008, 2009 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.
</pre>
</font>

<hr>
<!-- NB. timestamp updated automatically by emacs -->
This file current as of 29 Jan 2014.  An up-to-date version is available at
<a href="https://gmplib.org/tasks.html">https://gmplib.org/tasks.html</a>.
Please send comments about this page to gmp-devel<font>@</font>gmplib.org.

<p> These are itemized GMP development tasks.  Not all the tasks
listed here are suitable for volunteers, but many of them are.
Please see the <a href="projects.html">projects file</a> for more
sizeable projects.

<p> CAUTION: This file needs updating.  Many of the tasks here have
either already been taken care of, or have become irrelevant.

<h4>Correctness and Completeness</h4>
<ul>
<li> <code>_LONG_LONG_LIMB</code> in gmp.h is not namespace clean.  Reported
     by Patrick Pelissier.
     <br>
     We sort of mentioned <code>_LONG_LONG_LIMB</code> in past releases, so
     need to be careful about changing it.  It used to be a define
     applications had to set for long long limb systems, but that in
     particular is no longer relevant now that it's established automatically.
<li> The various reuse.c tests need to force reallocation by calling
     <code>_mpz_realloc</code> with a small (1 limb) size.
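     <br>
     A minimal sketch of the idea (<code>res</code> stands for the result
     operand of whatever reuse call is under test):
<pre>
	_mpz_realloc (res, 1);   /* shrink, so the call below must reallocate */
	mpz_add (res, a, b);     /* any mpz_XXX under test */
</pre>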
<li> One reuse case is missing from mpX/tests/reuse.c:
     <code>mpz_XXX(a,a,a)</code>.
<li> Make the string reading functions allow the `0x' prefix when the base is
     explicitly 16.  They currently only allow that prefix when the base is
     unspecified (zero).
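     <br>
     A hedged sketch of the desired behaviour (variable names are
     illustrative only):
<pre>
	if (str[0] == '0' && (str[1] == 'x' || str[1] == 'X')
	    && (base == 0 || base == 16))
	  {
	    base = 16;       /* accept the prefix for explicit base 16 too */
	    str += 2;
	  }
</pre>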
<li> <code>mpf_eq</code> is not always correct, when one operand is
     1000000000... and the other operand is 0111111111..., i.e., extremely
     close.  There is a special case in <code>mpf_sub</code> for this
     situation; put similar code in <code>mpf_eq</code>.  [In progress.]
<li> <code>mpf_eq</code> doesn't implement what gmp.texi specifies.  It should
     not use just whole limbs, but partial limbs.  [In progress.]
<li> <code>mpf_set_str</code> doesn't validate its exponent, for instance
     garbage 123.456eX789X is accepted (and an exponent 0 used), and overflow
     of a <code>long</code> is not detected.
<li> <code>mpf_add</code> doesn't check for a carry from truncated portions of
     the inputs, and in that respect doesn't implement the "infinite precision
     followed by truncate" specified in the manual.
<li> Windows DLLs: tests/mpz/reuse.c and tests/mpf/reuse.c initialize global
     variables with pointers to <code>mpz_add</code> etc, which doesn't work
     when those routines are coming from a DLL (because they're effectively
     function pointer global variables themselves).  Need to rearrange perhaps
     to a set of calls to a test function rather than iterating over an array.
<li> <code>mpz_pow_ui</code>: Detect when the result would be more memory than
     a <code>size_t</code> can represent and raise some suitable exception,
     probably an alloc call asking for <code>SIZE_T_MAX</code>, and if that
     somehow succeeds then an <code>abort</code>.  Various size overflows of
     this kind are not handled gracefully, probably resulting in segvs.
     <br>
     In <code>mpz_n_pow_ui</code>, detect when the count of low zero bits
     exceeds an <code>unsigned long</code>.  There's a (small) chance of this
     happening but still having enough memory to represent the value.
     Reported by Winfried Dreckmann; for instance <code>mpz_ui_pow_ui (x,
     4UL, 1431655766UL)</code> triggers it.
<li> <code>mpf</code>: Detect exponent overflow and raise some exception.
     It'd be nice to allow the full <code>mp_exp_t</code> range since that's
     how it's been in the past, but maybe dropping one bit would make it
     easier to test if e1+e2 goes out of bounds.
</ul>


<h4>Machine Independent Optimization</h4>
<ul>
<li> <code>mpf_cmp</code>: For better cache locality, don't test for low zero
     limbs until the high limbs fail to give an ordering.  Reduce code size by
     turning the three <code>mpn_cmp</code>'s into a single loop stopping when
     the end of one operand is reached (and then looking for a non-zero in the
     rest of the other).
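     <br>
     A rough shape for the merged loop, as a sketch only (assumes both
     operands positive, high limbs already aligned, and usize >= vsize):
<pre>
	/* compare from the most significant end downwards */
	for (i = 1; i <= vsize; i++)
	  if (up[usize - i] != vp[vsize - i])
	    return up[usize - i] > vp[vsize - i] ? 1 : -1;
	/* v exhausted: u is bigger iff any remaining limb is non-zero */
	for (i = usize - vsize; i > 0; i--)
	  if (up[i - 1] != 0)
	    return 1;
	return 0;
</pre>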
<li> <code>mpf_mul_2exp</code>, <code>mpf_div_2exp</code>: The use of
     <code>mpn_lshift</code> for any size<=prec means repeated
     <code>mul_2exp</code> and <code>div_2exp</code> calls accumulate low zero
     limbs until size==prec+1 is reached.  Those zeros will slow down
     subsequent operations, especially if the value is otherwise only small.
     If low bits of the low limb are zero, use <code>mpn_rshift</code> so as
     to not increase the size.
<li> <code>mpn_dc_sqrtrem</code>, <code>mpn_sqrtrem2</code>: Don't use
     <code>mpn_add_1</code> and <code>mpn_sub_1</code> for 1 limb operations;
     instead use <code>ADDC_LIMB</code> and <code>SUBC_LIMB</code>.
<li> <code>mpn_sqrtrem2</code>: Use plain variables for <code>sp[0]</code> and
     <code>rp[0]</code> calculations, so the compiler needn't worry about
     aliasing between <code>sp</code> and <code>rp</code>.
<li> <code>mpn_sqrtrem</code>: Some work can be saved in the last step when
     the remainder is not required, as noted in Paul's paper.
<li> <code>mpq_add</code>, <code>mpq_sub</code>: The gcd fits a single limb
     with high probability and in this case <code>binvert_limb</code> could
     be used to calculate the inverse just once for the two exact divisions
     "op1.den / gcd" and "op2.den / gcd", rather than letting
     <code>mpn_bdiv_q_1</code> do it each time.  This would require calling
     <code>mpn_pi1_bdiv_q_1</code>.
<li> <code>mpn_gcdext</code>: Don't test <code>count_leading_zeros</code> for
     zero, instead check the high bit of the operand and avoid invoking
     <code>count_leading_zeros</code>.  This is an optimization on all
     machines, and significant on machines with slow
     <code>count_leading_zeros</code>, though it's possible an already
     normalized operand might not be encountered very often.
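     <br>
     The test would look something like this (a sketch, using the
     gmp-impl.h <code>GMP_NUMB_HIGHBIT</code> mask):
<pre>
	if ((ulimb & GMP_NUMB_HIGHBIT) != 0)
	  cnt = 0;                           /* already normalized */
	else
	  count_leading_zeros (cnt, ulimb);
</pre>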
<li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
     most significant limb (if <code>GMP_LIMB_BITS</code> <= 52 bits).
     (Peter Montgomery has some ideas on this subject.)
<li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
     products with fewer operations.
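     <br>
     For reference, the classic half-limb scheme such code implements is
     roughly the following (a sketch; <code>LOWH</code>/<code>HIGHH</code>
     are illustrative names, not existing macros); the partial product
     additions are the part that might be tightened:
<pre>
	#define LOWH(x)   ((x) & (((mp_limb_t) 1 << (GMP_LIMB_BITS/2)) - 1))
	#define HIGHH(x)  ((x) >> (GMP_LIMB_BITS/2))

	mp_limb_t x0 = LOWH (u) * LOWH (v);
	mp_limb_t x1 = LOWH (u) * HIGHH (v);
	mp_limb_t x2 = HIGHH (u) * LOWH (v);
	mp_limb_t x3 = HIGHH (u) * HIGHH (v);

	x1 += HIGHH (x0);        /* can't carry */
	x1 += x2;                /* can carry into the top product */
	if (x1 < x2)
	  x3 += (mp_limb_t) 1 << (GMP_LIMB_BITS/2);

	hi = x3 + HIGHH (x1);
	lo = (x1 << (GMP_LIMB_BITS/2)) + LOWH (x0);
</pre>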
<li> Consider inlining <code>mpz_set_ui</code>.  This would be both small and
     fast, especially for compile-time constants, but would make application
     binaries depend on having 1 limb allocated to an <code>mpz_t</code>,
     preventing the "lazy" allocation scheme below.
<li> Consider inlining <code>mpz_[cft]div_ui</code> and maybe
     <code>mpz_[cft]div_r_ui</code>.  A <code>__gmp_divide_by_zero</code>
     would be needed for the divide by zero test, unless that could be left to
     <code>mpn_mod_1</code> (not sure currently whether all the risc chips
     provoke the right exception there if using mul-by-inverse).
<li> Consider inlining: <code>mpz_fits_s*_p</code>.  The setups for
     <code>LONG_MAX</code> etc would need to go into gmp.h, and on Cray it
     might, unfortunately, be necessary to forcibly include <limits.h>
     since there's no apparent way to get <code>SHRT_MAX</code> with an
     expression (since <code>short</code> and <code>unsigned short</code> can
     be different sizes).
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very fast on one
     or two limb moduli, due to a lot of function call overheads.  These could
     perhaps be handled as special cases.
<li> Make sure <code>mpz_powm_ui</code> is never slower than the corresponding
     computation using <code>mpz_powm</code>.
<li> <code>mpz_powm</code> REDC should do multiplications by <code>g[]</code>
     using the division method when they're small, since the REDC form of a
     small multiplier is normally a full size product.  Probably would need a
     new tuned parameter to say what size multiplier is "small", as a function
     of the size of the modulus.
<li> <code>mpn_gcd</code> might be able to be sped up on small to moderate
     sizes by improving <code>find_a</code>, possibly just by providing an
     alternate implementation for CPUs with slowish
     <code>count_leading_zeros</code>.
<li> <code>mpf_set_str</code> produces low zero limbs when a string has a
     fraction but is exactly representable, eg. 0.5 in decimal.  These could be
     stripped to save work in later operations.
<li> <code>mpz_and</code>, <code>mpz_ior</code> and <code>mpz_xor</code> should
     use <code>mpn_and_n</code> etc for the benefit of the small number of
     targets with native versions of those routines.  Need to be careful not to
     pass size==0.  Is some code sharing possible between the <code>mpz</code>
     routines?
<li> <code>mpf_add</code>: Don't do a copy to avoid overlapping operands
     unless it's really necessary (currently only sizes are tested, not
     whether r really is u or v).
<li> <code>mpf_add</code>: Under the check for v having no effect on the
     result, perhaps test for r==u and do nothing in that case, rather than
     currently it looks like an <code>MPN_COPY_INCR</code> will be done to
     reduce prec+1 limbs to prec.
<li> <code>mpf_div_ui</code>: Instead of padding with low zeros, call
     <code>mpn_divrem_1</code> asking for fractional quotient limbs.
<li> <code>mpf_div_ui</code>: Eliminate <code>TMP_ALLOC</code>.  When r!=u
     there's no overlap and the division can be called on those operands.
     When r==u and is prec+1 limbs, then it's an in-place division.  If r==u
     and not prec+1 limbs, then move the available limbs up to prec+1 and do
     an in-place there.
<li> <code>mpf_div_ui</code>: Whether the high quotient limb is zero can be
     determined by testing the dividend for high<divisor.  When non-zero, the
     division can be done on prec dividend limbs instead of prec+1.  The result
     size is also known before the division, so that can be a tail call (once
     the <code>TMP_ALLOC</code> is eliminated).
<li> <code>mpn_divrem_2</code> could usefully accept unnormalized divisors and
     shift the dividend on-the-fly, since this should cost nothing on
     superscalar processors and avoid the need for temporary copying in
     <code>mpn_tdiv_qr</code>.
<li> <code>mpf_sqrt</code>: If r!=u, and if u doesn't need to be padded with
     zeros, then there's no need for the tp temporary.
<li> <code>mpq_cmp_ui</code> could form the <code>num1*den2</code> and
     <code>num2*den1</code> products limb-by-limb from high to low and look at
     each step for values differing by more than the possible carry bit from
     the uncalculated portion.
<li> <code>mpq_cmp</code> could do the same high-to-low progressive multiply
     and compare.  The benefits of karatsuba and higher multiplication
     algorithms are lost, but if it's assumed only a few high limbs will be
     needed to determine an order then that's fine.
<li> <code>mpn_add_1</code>, <code>mpn_sub_1</code>, <code>mpn_add</code>,
     <code>mpn_sub</code>: Internally use <code>__GMPN_ADD_1</code> etc
     instead of the functions, so they get inlined on all compilers, not just
     gcc and others with <code>inline</code> recognised in gmp.h.
     <code>__GMPN_ADD_1</code> etc are meant mostly to support application
     inline <code>mpn_add_1</code> etc and if they don't come out good for
     internal uses then special forms can be introduced, for instance many
     internal uses are in-place.  Sometimes a block of code is executed based
     on the carry-out, rather than using it arithmetically, and those places
     might want to do their own loops entirely.
<li> <code>__gmp_extract_double</code> on 64-bit systems could use just one
     bitfield for the mantissa extraction, not two, when endianness permits.
     Might depend on the compiler allowing <code>long long</code> bit fields
     when that's the only actual 64-bit type.
<li> tal-notreent.c could keep a block of memory permanently allocated.
     Currently the last nested <code>TMP_FREE</code> releases all memory, so
     there's an allocate and free every time a top-level function using
     <code>TMP</code> is called.  Would need
     <code>mp_set_memory_functions</code> to tell tal-notreent.c to release
     any cached memory when changing allocation functions though.
<li> <code>__gmp_tmp_alloc</code> from tal-notreent.c could be partially
     inlined.  If the current chunk has enough room then a couple of pointers
     can be updated.  Only if more space is required then a call to some sort
     of <code>__gmp_tmp_increase</code> would be needed.  The requirement that
     <code>TMP_ALLOC</code> is an expression might make the implementation a
     bit ugly and/or a bit sub-optimal.
<pre>
#define TMP_ALLOC(n)                                    \
  ((ROUND_UP (n) > current->end - current->point        \
      ? __gmp_tmp_increase (ROUND_UP (n)) : 0),         \
   current->point += ROUND_UP (n),                      \
   current->point - ROUND_UP (n))
</pre>
<li> <code>__mp_bases</code> has a lot of data for bases which are pretty much
     never used.  Perhaps the table should just go up to base 16, and have
     code to generate data above that, if and when required.  Naturally this
     assumes the code would be smaller than the data saved.
<li> <code>__mp_bases</code> field <code>big_base_inverted</code> is only used
     if <code>USE_PREINV_DIVREM_1</code> is true, and could be omitted
     otherwise, to save space.
<li> <code>mpz_get_str</code>, <code>mtox</code>: For power-of-2 bases, which
     are of course fast, it seems a little silly to make a second pass over
     the <code>mpn_get_str</code> output to convert to ASCII.  Perhaps combine
     that with the bit extractions.
<li> <code>mpz_gcdext</code>: If the caller requests only the S cofactor (of
     A), and A<B, then the code ends up generating the cofactor T (of B) and
     deriving S from that.  Perhaps it'd be possible to arrange to get S in
     the first place by calling <code>mpn_gcdext</code> with A+B,B.  This
     might only be an advantage if A and B are about the same size.
<li> <code>mpz_n_pow_ui</code> does a good job with small bases and stripping
     powers of 2, but it's perhaps a bit too complicated for what it gains.
     The simpler <code>mpn_pow_1</code> is a little faster on small exponents.
     (Note some of the ugliness in <code>mpz_n_pow_ui</code> is due to
     supporting <code>mpn_mul_2</code>.)
     <br>
     Perhaps the stripping of 2s in <code>mpz_n_pow_ui</code> should be
     confined to single limb operands for simplicity and since that's where
     the greatest gain would be.
     <br>
     Ideally <code>mpn_pow_1</code> and <code>mpz_n_pow_ui</code> would be
     merged.  The reason <code>mpz_n_pow_ui</code> writes to an
     <code>mpz_t</code> is that its callers leave it to make a good estimate
     of the result size.  Callers of <code>mpn_pow_1</code> already know the
     size by separate means (<code>mp_bases</code>).
<li> <code>mpz_invert</code> should call <code>mpn_gcdext</code> directly.
</ul>


<h4>Machine Dependent Optimization</h4>
<ul>
<li> <code>invert_limb</code> on various processors might benefit from the
     little Newton iteration done for alpha and ia64.
<li> Alpha 21264: <code>mpn_addlsh1_n</code> could be implemented with
     <code>mpn_addmul_1</code>, since that code at 3.5 is a touch faster than
     a separate <code>lshift</code> and <code>add_n</code> at
     1.75+2.125=3.875.  Or very likely some specific <code>addlsh1_n</code>
     code could beat both.
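     <br>
     Where the destination doesn't already hold the first operand, the
     addmul_1 form would be essentially (a sketch using gmp-impl.h's
     <code>MPN_COPY</code>):
<pre>
	MPN_COPY (rp, up, n);                 /* rp = up */
	cy = mpn_addmul_1 (rp, vp, n, 2);     /* rp += 2*vp */
</pre>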
<li> Alpha 21264: Improve feed-in code for <code>mpn_mul_1</code>,
     <code>mpn_addmul_1</code>, and <code>mpn_submul_1</code>.
<li> Alpha 21164: Rewrite <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
     and <code>mpn_submul_1</code> for the 21164.  This should use both integer
     multiplies and floating-point multiplies.  For the floating-point
     operations, the single-limb multiplier should be split into three 21-bit
     chunks, or perhaps even better in four 16-bit chunks.  Probably possible
     to reach 9 cycles/limb.
<li> Alpha: GCC 3.4 will introduce <code>__builtin_ctzl</code>,
     <code>__builtin_clzl</code> and <code>__builtin_popcountl</code> using
     the corresponding CIX <code>ct</code> instructions, and
     <code>__builtin_alpha_cmpbge</code>.  These should give GCC more
     information about scheduling etc than the <code>asm</code> blocks
     currently used in longlong.h and gmp-impl.h.
<li> Alpha Unicos: Apparently there's no <code>alloca</code> on this system,
     making <code>configure</code> choose the slower
     <code>malloc-reentrant</code> allocation method.  Is there a better way?
     Maybe variable-length arrays per notes below.
<li> Alpha Unicos 21164, 21264: <code>.align</code> is not used since it pads
     with garbage.  Does the code get the intended slotting required for the
     claimed speeds?  <code>.align</code> at the start of a function would
     presumably be safe no matter how it pads.
<li> ARM V5: <code>count_leading_zeros</code> can use the <code>clz</code>
     instruction.  For GCC 3.4 and up, do this via <code>__builtin_clzl</code>
     since then gcc knows it's "predicable".
<li> Itanium: GCC 3.4 introduces <code>__builtin_popcount</code> which can be
     used instead of an <code>asm</code> block.  The builtin should give gcc
     more opportunities for scheduling, bundling and predication.
     <code>__builtin_ctz</code> similarly (it just uses popcount as per
     current longlong.h).
<li> UltraSPARC/64: Optimize <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
     for s2 < 2^32 (or perhaps for any zero 16-bit s2 chunk).  Not sure how
     much this can improve the speed, though, since the symmetry that we rely
     on is lost.  Perhaps we can just gain cycles when s2 < 2^16, or more
     accurately, when two 16-bit s2 chunks which are 16 bits apart are zero.
<li> UltraSPARC/64: Write native <code>mpn_submul_1</code>, analogous to
     <code>mpn_addmul_1</code>.
<li> UltraSPARC/64: Write <code>umul_ppmm</code>.  Using four
     "<code>mulx</code>"s either with an asm block or via the generic C code is
     about 90 cycles.  Try using fp operations, and also try using karatsuba
     for just three "<code>mulx</code>"s.
<li> UltraSPARC/32: Rewrite <code>mpn_lshift</code>, <code>mpn_rshift</code>.
     Will give 2 cycles/limb.  Trivial modifications of mpn/sparc64 should do.
<li> UltraSPARC/32: Write special mpn_Xmul_1 loops for s2 < 2^16.
<li> UltraSPARC/32: Use <code>mulx</code> for <code>umul_ppmm</code> if
     possible (see commented out code in longlong.h).  This is unlikely to
     save more than a couple of cycles, so perhaps isn't worth bothering with.
<li> UltraSPARC/32: On Solaris gcc doesn't give us <code>__sparc_v9__</code>
     or anything to indicate V9 support when -mcpu=v9 is selected.  See
     gcc/config/sol2-sld-64.h.  Will need to pass something through from
     ./configure to select the right code in longlong.h.  (Currently nothing
     is lost because <code>mulx</code> for multiplying is commented out.)
<li> UltraSPARC/32: <code>mpn_divexact_1</code> and
     <code>mpn_modexact_1c_odd</code> can use a 64-bit inverse and take
     64-bits at a time from the dividend, as per the 32-bit divisor case in
     mpn/sparc64/mode1o.c.  This must be done in assembler, since the full
     64-bit registers (<code>%gN</code>) are not available from C.
<li> UltraSPARC/32: <code>mpn_divexact_by3c</code> can work 64-bits at a time
     using <code>mulx</code>, in assembler.  This would be the same as for
     sparc64.
<li> UltraSPARC: <code>binvert_limb</code> might save a few cycles from
     masking down to just the useful bits at each point in the calculation,
     since <code>mulx</code> speed depends on the highest bit set.  Either
     explicit masks or small types like <code>short</code> and
     <code>int</code> ought to work.
<li> Sparc64 HAL R1 <code>popc</code>: This chip reputedly implements
     <code>popc</code> properly (see gcc sparc.md).  Would need to recognise
     it as <code>sparchalr1</code> or something in configure / config.sub /
     config.guess.  <code>popc_limb</code> in gmp-impl.h could use this (per
     commented out code).  <code>count_trailing_zeros</code> could use it too.
<li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.  The current code runs at 11 cycles/limb.  It
     should be possible to saturate the cache, which will happen at 8
     cycles/limb (7.5 for mpn_mul_1).  Write special loops for s2 < 2^32;
     it should be possible to make them run at about 5 cycles/limb.
<li> PPC601: See which of the power or powerpc32 code runs better.  Currently
     the powerpc32 is used, but only because it's the default for
     <code>powerpc*</code>.
<li> PPC630: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.  Use both integer and floating-point operations,
     possibly two floating-point and one integer limb per loop.  Split operands
     into four 16-bit chunks for fast fp operations.  Should easily reach 9
     cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb
     (using one int + two fp).
<li> PPC630: <code>mpn_rshift</code> could do the same sort of unrolled loop
     as <code>mpn_lshift</code>.  Some judicious use of m4 might let the two
     share source code, or with a register to control the loop direction
     perhaps even share object code.
<li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
     for important machines.  Helping the generic sqr_basecase.c with an
     <code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
<li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
     Will bring time from 1.75 to 1.25 cycles/limb.
<li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1.  (See
     Pentium code.)
<li> X86: Good authority has it that in the past an inline <code>rep
     movs</code> would upset GCC register allocation for the whole function.
     Is this still true in GCC 3?  It uses <code>rep movs</code> itself for
     <code>__builtin_memcpy</code>.  Examine the code for some simple and
     complex functions to find out.  Inlining <code>rep movs</code> would be
     desirable, it'd be both smaller and faster.
<li> Pentium P54: <code>mpn_lshift</code> and <code>mpn_rshift</code> can come
     down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after
     <code>shrdl</code> and <code>shldl</code>, see mpn/x86/pentium/README.
<li> Pentium P55 MMX: <code>mpn_lshift</code> and <code>mpn_rshift</code>
     might benefit from some destination prefetching.
<li> PentiumPro: <code>mpn_divrem_1</code> might be able to use a
     mul-by-inverse, hoping for maybe 30 c/l.
<li> K7: <code>mpn_lshift</code> and <code>mpn_rshift</code> might be able to
     do something branch-free for unaligned startups, and shaving one insn
     from the loop with alternative indexing might save a cycle.
<li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
     The pipeline is now extremely deep, perhaps unnecessarily deep.
<li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
<li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
     <code>mpn_sqr_basecase</code>.  This should use a "vertical multiplication
     method" to avoid carry propagation, splitting one of the operands into
     11-bit chunks.
<li> Pentium: <code>mpn_lshift</code> by 31 should use the special rshift
     by 1 code, and vice versa <code>mpn_rshift</code> by 31 should use the
     special lshift by 1.  This would be best as a jump across to the other
     routine, could let both live in lshift.asm and omit rshift.asm on finding
     <code>mpn_rshift</code> already provided.
<li> Cray T3E: Experiment with optimization options.  In particular,
     -hpipeline3 seems promising.  We should at least up -O to -O2 or -O3.
<li> Cray: <code>mpn_com</code> and <code>mpn_and_n</code> etc very probably
     want a pragma like <code>MPN_COPY_INCR</code>.
<li> Cray vector systems: <code>mpn_lshift</code>, <code>mpn_rshift</code>,
     <code>mpn_popcount</code> and <code>mpn_hamdist</code> are nice and small
     and could be inlined to avoid function calls.
<li> Cray: Variable length arrays seem to be faster than the tal-notreent.c
     scheme.  Not sure why, maybe they merely give the compiler more
     information about aliasing (or the lack thereof).  Would like to modify
     <code>TMP_ALLOC</code> to use them, or introduce a new scheme.  Memory
     blocks wanted unconditionally are easy enough, those wanted only
     sometimes are a problem.  Perhaps a special size calculation to ask for a
     dummy length 1 when unwanted, or perhaps an inlined subroutine
     duplicating code under each conditional.  Don't really want to turn
     everything into a dog's dinner just because Cray don't offer an
     <code>alloca</code>.
<li> Cray: <code>mpn_get_str</code> on power-of-2 bases ought to vectorize.
     Does it?  <code>bits_per_digit</code> and the inner loop over bits in a
     limb might prevent it.  Perhaps special cases for binary, octal and hex
     would be worthwhile (very possibly for all processors too).
<li> S390: <code>BSWAP_LIMB_FETCH</code> looks like it could be done with
     <code>lrvg</code>, as per glibc sysdeps/s390/s390-64/bits/byteswap.h.
     Is this only for 64-bit mode, since 32-bit mode has other code?  Also,
     is it worth using for <code>BSWAP_LIMB</code> too, or would that mean a
     store and re-fetch?  Presumably that's what comes out in glibc.
<li> Improve <code>count_leading_zeros</code> for 64-bit machines:
<pre>
	   if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
	   if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
	   ... </pre>
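     Spelled out in full for a 64-bit limb it would be (a sketch; x must be
     non-zero and cnt start at 0):
<pre>
	   if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
	   if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
	   if ((x >> 56) == 0) { x <<= 8;  cnt += 8;  }
	   if ((x >> 60) == 0) { x <<= 4;  cnt += 4;  }
	   if ((x >> 62) == 0) { x <<= 2;  cnt += 2;  }
	   cnt += (x >> 63) ^ 1;
</pre>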
<li> IRIX 6 MIPSpro compiler has an <code>__inline</code> which could perhaps
     be used in <code>__GMP_EXTERN_INLINE</code>.  What would be the right way
     to identify suitable versions of that compiler?
<li> IRIX <code>cc</code> is rumoured to have an <code>_int_mult_upper</code>
     (in <code><intrinsics.h></code> like Cray), but it didn't seem to
     exist on some IRIX 6.5 systems tried.  If it does actually exist
     somewhere it would very likely be an improvement over a function call to
     umul.asm.
<li> <code>mpn_get_str</code> final divisions by the base with
     <code>udiv_qrnd_unnorm</code> could use some sort of multiply-by-inverse
     on suitable machines.  This ends up happening for decimal by presenting
     the compiler with a run-time constant, but the same for other bases would
     be good.  Perhaps use could be made of the fact base<256.
<li> <code>mpn_umul_ppmm</code>, <code>mpn_udiv_qrnnd</code>: Return a
     structure like <code>div_t</code> to avoid going through memory, in
     particular helping RISCs that don't do store-to-load forwarding.  Clearly
     this is only possible if the ABI returns a structure of two
     <code>mp_limb_t</code>s in registers.
     <br>
     On PowerPC, structures are returned in memory on AIX and Darwin.  In SVR4
     they're returned in registers, except that draft SVR4 had said memory, so
     it'd be prudent to check which is done.  We can jam the compiler into the
     right mode if we know how, since all this is purely internal to libgmp.
     (gcc has an option, though of course gcc doesn't matter since we use
     inline asm there.)
</ul>

<h4>New Functionality</h4>
<ul>
<li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
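     <br>
     The two-modulus case can be built from documented <code>mpz</code> calls
     alone; a sketch (the name and interface are purely hypothetical, and
     <code>x</code> must not alias <code>r1</code>):
<pre>
	/* find x with x = r1 (mod m1) and x = r2 (mod m2);
	   returns non-zero iff m1 is invertible mod m2 */
	int
	mpz_crr2 (mpz_t x, const mpz_t r1, const mpz_t m1,
	          const mpz_t r2, const mpz_t m2)
	{
	  mpz_t t;
	  int ok;
	  mpz_init (t);
	  ok = mpz_invert (t, m1, m2);      /* t = m1^-1 mod m2 */
	  if (ok)
	    {
	      mpz_sub (x, r2, r1);
	      mpz_mul (x, x, t);
	      mpz_mod (x, x, m2);           /* k = (r2-r1)*t mod m2 */
	      mpz_mul (x, x, m1);
	      mpz_add (x, x, r1);           /* x = r1 + k*m1 */
	    }
	  mpz_clear (t);
	  return ok;
	}
</pre>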
<li> Let `0b' and `0B' mean binary input everywhere.
<li> <code>mpz_init</code> and <code>mpq_init</code> could do lazy allocation.
     Set <code>ALLOC(var)</code> to 0 to indicate nothing allocated, and let
     <code>_mpz_realloc</code> do the initial alloc.  Set
     <code>z->_mp_d</code> to a dummy that <code>mpz_get_ui</code> and
     similar can unconditionally fetch from.  Niels Möller has had a go at
     this.
     <br>
     The advantages of the lazy scheme would be:
     <ul>
     <li> Initial allocate would be the size required for the first value
          stored, rather than getting 1 limb in <code>mpz_init</code> and then
          more or less immediately reallocating.
     <li> <code>mpz_init</code> would only store magic values in the
          <code>mpz_t</code> fields, and could be inlined.
     <li> A fixed initializer could even be used by applications, like
          <code>mpz_t z = MPZ_INITIALIZER;</code>, which might be convenient
          for globals (see the sketch after this list).
     </ul>
     The advantages of the current scheme are:
     <ul>
     <li> <code>mpz_set_ui</code> and other similar routines needn't check the
          size allocated and can just store unconditionally.
     <li> <code>mpz_set_ui</code> and perhaps others like
          <code>mpz_tdiv_r_ui</code> and a prospective
          <code>mpz_set_ull</code> could be inlined.
     </ul>
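     For illustration, with the lazy scheme such an initializer might look
     like the following (purely hypothetical names, assuming the current
     field order alloc, size, pointer, and a reserved dummy limb):
<pre>
	extern mp_limb_t __gmp_dummy_limb;
	#define MPZ_INITIALIZER  {{ 0, 0, &__gmp_dummy_limb }}

	mpz_t z = MPZ_INITIALIZER;   /* usable without mpz_init */
</pre>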
<li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>.  Make sure
     format is portable between 32-bit and 64-bit machines, and between
     little-endian and big-endian machines.  A format which MPFR can use too
     would be good.
<li> <code>mpn_and_n</code> ... <code>mpn_copyd</code>: Perhaps make the mpn
     logops and copys available in gmp.h, either as library functions or
     inlines, with the availability of library functions instantiated in the
     generated gmp.h at build time.
<li> <code>mpz_set_str</code> etc variants taking string lengths rather than
     null-terminators.
<li> <code>mpz_andn</code>, <code>mpz_iorn</code>, <code>mpz_nand</code>,
     <code>mpz_nior</code>, <code>mpz_xnor</code> might be useful additions,
     if they could share code with the current such functions (which should be
     possible).
<li> <code>mpz_and_ui</code> etc might be of use sometimes.  Suggested by
     Niels Möller.
<li> <code>mpf_set_str</code> and <code>mpf_inp_str</code> could usefully
     accept 0x, 0b etc when base==0.  Perhaps the exponent could default to
     decimal in this case, with a further 0x, 0b etc allowed there.
     Eg. 0xFFAA@0x5A.  A leading "0" for octal would match the integers, but
     probably something like "0.123" ought not mean octal.
<li> <code>GMP_LONG_LONG_LIMB</code> or some such could become a documented
     feature of gmp.h, so applications could know whether to
     <code>printf</code> a limb using <code>%lu</code> or <code>%Lu</code>.
<li> <code>GMP_PRIdMP_LIMB</code> and similar defines following C99
     <inttypes.h> might be of use to applications printing limbs.  But
     if <code>GMP_LONG_LONG_LIMB</code> or whatever is added then perhaps this
     can easily enough be left to applications.
<li> <code>gmp_printf</code> could accept <code>%b</code> for binary output.
     It'd be nice if it worked for plain <code>int</code> etc too, not just
     <code>mpz_t</code> etc.
<li> <code>gmp_printf</code> in fact could usefully accept an arbitrary base,
     for both integer and float conversions.  A base either in the format
     string or as a parameter with <code>*</code> should be allowed.  Maybe
     <code>%13b</code> (b for base) or something like that.
<li> <code>gmp_printf</code> could perhaps accept <code>mpq_t</code> for float
     conversions, eg. <code>"%.4Qf"</code>.  This would be merely for
     convenience, but still might be useful.  Rounding would be the same as
     for an <code>mpf_t</code> (ie. currently round-to-nearest, but not
     actually documented).  Alternately, perhaps a separate
     <code>mpq_get_str_point</code> or some such might be more use.  Suggested
     by Pedro Gimeno.
<li> <code>mpz_rscan0</code> or <code>mpz_revscan0</code> or some such
     searching towards the low end of an integer might match
     <code>mpz_scan0</code> nicely.  Likewise for <code>scan1</code>.
     Suggested by Roberto Bagnara.
<li> <code>mpz_bit_subset</code> or some such to test whether one integer is a
     bitwise subset of another might be of use.  Some sort of return value
     indicating whether it's a proper or non-proper subset would be good and
     wouldn't cost anything in the implementation.  Suggested by Roberto
     Bagnara.
<li> <code>mpf_get_ld</code>, <code>mpf_set_ld</code>: Conversions between
     <code>mpf_t</code> and <code>long double</code>, suggested by Dan
     Christensen.  Other <code>long double</code> routines might be desirable
     too, but <code>mpf</code> would be a start.
     <br>
     <code>long double</code> is an ANSI-ism, so everything involving it would
     need to be suppressed on a K&R compiler.
     <br>
     There'd be some work to be done by <code>configure</code> to recognise
     the format in use, MPFR has a start on this.  Often <code>long
     double</code> is the same as <code>double</code>, which is easy but
     pretty pointless.  A single float format detector macro could look at
     <code>double</code> then <code>long double</code>.
     <br>
     Sometimes there's a compiler option for the size of a <code>long
     double</code>, eg. xlc on AIX can use either 64-bit or 128-bit.  It's
     probably simplest to regard this as a compiler compatibility issue, and
     leave it to users or sysadmins to ensure application and library code is
     built the same.
<li> <code>mpz_sqrt_if_perfect_square</code>: When
     <code>mpz_perfect_square_p</code> does its tests it calculates a square
     root and then discards it.  For some applications it might be useful to
     return that root.  Suggested by Jason Moxham.
<li> <code>mpz_get_ull</code>, <code>mpz_set_ull</code>,
     <code>mpz_get_sll</code>, <code>mpz_set_sll</code>: Conversions for
     <code>long long</code>.  These would aid interoperability, though a
     mixture of GMP and <code>long long</code> would probably not be too
     common.  Since <code>long long</code> is not always available (it's in
     C99 and GCC though), disadvantages of using <code>long long</code> in
     libgmp.a would be
     <ul>
     <li> Library contents vary according to the build compiler.
     <li> gmp.h would need an ugly <code>#ifdef</code> block to decide if the
          application compiler could take the <code>long long</code>
          prototypes.
     <li> Some sort of <code>LIBGMP_HAS_LONGLONG</code> might be wanted to
          indicate whether the functions are available.  (Applications using
          autoconf could probe the library too.)
     </ul>
     It'd be possible to defer the need for <code>long long</code> to
     application compile time, by having something like
     <code>mpz_set_2ui</code> called with two halves of a <code>long
     long</code>.  Disadvantages of this would be,
     <ul>
     <li> Bigger code in the application, though perhaps not if a <code>long
          long</code> is normally passed as two halves anyway.
     <li> <code>mpz_get_ull</code> would be a rather big inline, or would have
          to be two function calls.
     <li> <code>mpz_get_sll</code> would be a worse inline, and would put the
          treatment of <code>-0x10..00</code> into applications (see the
          <code>mpz_get_si</code> item under Bright Ideas below).
     <li> Although having libgmp.a independent of the build compiler is nice,
          it sort of sacrifices the capabilities of a good compiler to
          uniformity with inferior ones.
     </ul>
     Plain use of <code>long long</code> is probably the lesser evil, if only
     because it makes best use of gcc.  In fact perhaps it would suffice to
     guarantee <code>long long</code> conversions only when using GCC for both
     application and library.  That would cover free software, and we can
     worry about selected vendor compilers later.
     <br>
     In C++ the situation is probably clearer, we demand fairly recent C++ so
     <code>long long</code> should be available always.  We'd probably prefer
     to have the C and C++ the same in respect of <code>long long</code>
     support, but it would be possible to have it unconditionally in gmpxx.h,
     by some means or another.
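     <br>
     By way of example, the two-halves approach on the application side might
     look like this (a hypothetical helper; assumes z >= 0 and z < 2^64,
     with 32-bit halves):
<pre>
	unsigned long long
	my_mpz_get_ull (const mpz_t z)
	{
	  mpz_t t;
	  unsigned long long hi;
	  mpz_init (t);
	  mpz_tdiv_q_2exp (t, z, 32);     /* high half */
	  hi = mpz_get_ui (t);
	  mpz_clear (t);
	  return (hi << 32) | (mpz_get_ui (z) & 0xffffffffULL);
	}
</pre>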
<li> <code>mpz_strtoz</code> parsing the same as <code>strtol</code>.
     Suggested by Alexander Kruppa.
</ul>


<h4>Configuration</h4>

<ul>
<li> Alpha ev7, ev79: Add code to config.guess to detect these.  Believe ev7
     will be "3-1307" in the current switch, but need to verify that.  (On
     OSF, current configfsf.guess identifies ev7 using psrinfo, we need to do
     it ourselves for other systems.)
<li> Alpha OSF: Libtool (version 1.5) doesn't seem to recognise this system is
     "pic always" and ends up running gcc twice with the same options.  This
     is wasteful, but harmless.  Perhaps a newer libtool will be better.
<li> ARM: <code>umul_ppmm</code> in longlong.h always uses <code>umull</code>,
     but is that available only for M series chips or some such?  Perhaps it
     should be configured in some way.
<li> HPPA: config.guess should recognize 7000, 7100, 7200, and 8x00.
<li> HPPA: gcc 3.2 introduces a <code>-mschedule=7200</code> etc parameter,
     which could be driven by an exact hppa cpu type.
<li> Mips: config.guess should say mipsr3000, mipsr4000, mipsr10000, etc.
     "hinv -c processor" gives lots of information on Irix.  Standard
     config.guess appends "el" to indicate endianness, but
     <code>AC_C_BIGENDIAN</code> seems the best way to handle that for GMP.
<li> PowerPC: The function descriptor nonsense for AIX is currently driven by
     <code>*-*-aix*</code>.  It might be more reliable to do some sort of
     feature test, examining the compiler output perhaps.  It might also be
     nice to merge the aix.m4 files into powerpc-defs.m4.
<li> config.m4 is generated only by the configure script, it won't be
     regenerated by config.status.  Creating it as an <code>AC_OUTPUT</code>
     would work, but it might upset "make" to have things like <code>L$</code>
     get into the Makefiles through <code>AC_SUBST</code>.
     <code>AC_CONFIG_COMMANDS</code> would be the alternative.  With some
     careful m4 quoting the <code>changequote</code> calls might not be
     needed, which might free up the order in which things had to be output.
<li> Automake: Latest automake has a <code>CCAS</code>, <code>CCASFLAGS</code>
     scheme.  Though we probably wouldn't be using its assembler support we
     could try to use those variables in compatible ways.
<li> <code>GMP_LDFLAGS</code> could probably be done with plain
     <code>LDFLAGS</code> already used by automake for all linking.  But with
     a bit of luck the next libtool will pass pretty much all
     <code>CFLAGS</code> through to the compiler when linking, making
     <code>GMP_LDFLAGS</code> unnecessary.
<li> mpn/Makeasm.am uses <code>-c</code> and <code>-o</code> together in the
     .S and .asm rules, but apparently that isn't completely portable (there's
     an autoconf <code>AC_PROG_CC_C_O</code> test for it).  So far we've not
     had problems, but perhaps the rules could be rewritten to use "foo.s" as
     the temporary, or to do a suitable "mv" of the result.  The only danger
     from using foo.s would be if a compile failed and the temporary foo.s
     then looked like the primary source.  Hopefully if the
     <code>SUFFIXES</code> are ordered to have .S and .asm ahead of .s that
     wouldn't happen.  Might need to check.
</ul>


<h4>Random Numbers</h4>
<ul>
<li> <code>_gmp_rand</code> is not particularly fast on the linear
     congruential algorithm and could stand various improvements.
     <ul>
     <li> Make a second seed area within <code>gmp_randstate_t</code> (or
          <code>_mp_algdata</code> rather) to save some copying.
     <li> Make a special case for a single limb <code>2exp</code> modulus, to
          avoid <code>mpn_mul</code> calls.  Perhaps the same for two limbs.
     <li> Inline the <code>lc</code> code, to avoid a function call and
          <code>TMP_ALLOC</code> for every chunk.
     <li> Perhaps the <code>2exp</code> and general LC cases should be split,
          for clarity (if the general case is retained).
     </ul>
<li> <code>gmp_randstate_t</code> used for parameters perhaps should become
     <code>gmp_randstate_ptr</code> the same as other types.
<li> Some of the empirical randomness tests could be included in a "make
     check".  They ought to work everywhere, for a given seed at least.
</ul>


<h4>C++</h4>
<ul>
<li> <code>mpz_class(string)</code>, etc: Use the C++ global locale to
     identify whitespace.
     <br>
     <code>mpf_class(string)</code>: Use the C++ global locale decimal point,
     rather than the C one.
     <br>
     Consider making these variant <code>mpz_set_str</code> etc forms
     available for <code>mpz_t</code> too, not just <code>mpz_class</code>
     etc.
<li> <code>mpq_class operator+=</code>: Don't emit an unnecessary
     <code>mpq_set(q,q)</code> before <code>mpz_addmul</code> etc.
<li> Put various bits of gmpxx.h into libgmpxx, to avoid excessive inlining.
     Candidates for this would be,
     <ul>
     <li> <code>mpz_class(const char *)</code>, etc: since they're normally
          not fast anyway, and we can hide the exception <code>throw</code>.
     <li> <code>mpz_class(string)</code>, etc: to hide the <code>cstr</code>
          needed to get to the C conversion function.
     <li> <code>mpz_class string, char*</code> etc constructors: likewise to
          hide the throws and conversions.
     <li> <code>mpz_class::get_str</code>, etc: to hide the <code>char*</code>
          to <code>string</code> conversion and free.  Perhaps
          <code>mpz_get_str</code> can write directly into a
          <code>string</code>, to avoid copying.
          <br>
          Consider making such <code>string</code> returning variants
          available for use with plain <code>mpz_t</code> etc too.
     </ul>
</ul>

<h4>Miscellaneous</h4>
<ul>
<li> <code>mpz_gcdext</code> and <code>mpn_gcdext</code> ought to document
     what range of values the generated cofactors can take, and preferably
     ensure the definition uniquely specifies the cofactors for given inputs.
     A basic extended Euclidean algorithm or multi-step variant leads to
     |x|<|b| and |y|<|a| or something like that, but there are probably
     two solutions under just those restrictions.
<li> demos/factorize.c: use <code>mpz_divisible_ui_p</code> rather than
     <code>mpz_tdiv_qr_ui</code>.  (Of course dividing multiple primes at a
     time would be better still.)
<li> The various test programs use quite a bit of the main
     <code>libgmp</code>.  This establishes good cross-checks, but it might be
     better to use simple reference routines where possible.  Where it's not
     possible some attention could be paid to the order of the tests, so a
     <code>libgmp</code> routine is only used for tests once it seems to be
     good.
<li> <code>MUL_FFT_THRESHOLD</code> etc: the FFT thresholds should allow a
     return to a previous k at certain sizes.  This arises basically due to
     the step effect caused by size multiples effectively used for each k.
     Looking at a graph makes it fairly clear.
<li> <code>__gmp_doprnt_mpf</code> does a rather unattractive round-to-nearest
     on the string returned by <code>mpf_get_str</code>.  Perhaps some variant
     of <code>mpf_get_str</code> could be made which would better suit.
</ul>


<h4>Aids to Development</h4>
<ul>
<li> Add <code>ASSERT</code>s at the start of each user-visible mpz/mpq/mpf
     function to check the validity of each <code>mp?_t</code> parameter, in
     particular to check they've been <code>mp?_init</code>ed.  This might
     catch elementary mistakes in user programs.  Care would need to be taken
     over <code>MPZ_TMP_INIT</code>ed variables used internally.  If nothing
     else then consistency checks like size<=alloc, ptr not
     <code>NULL</code> and ptr+size not wrapping around the address space,
     would be possible.  A more sophisticated scheme could track
     <code>_mp_d</code> pointers and ensure only a valid one is used.  Such a
     scheme probably wouldn't be reentrant, not without some help from the
     system.
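     <br>
     The elementary checks might be packaged along these lines (a sketch
     using the gmp-impl.h accessor macros; a real version would also have to
     tolerate <code>MPZ_TMP_INIT</code>ed variables):
<pre>
	#define ASSERT_MPZ_OK(z)                   \
	  do {                                     \
	    ASSERT_ALWAYS (PTR(z) != NULL);        \
	    ASSERT_ALWAYS (ALLOC(z) >= 1);         \
	    ASSERT_ALWAYS (ABSIZ(z) <= ALLOC(z));  \
	  } while (0)
</pre>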
<li> tune/time.c could try to determine at runtime whether
     <code>getrusage</code> and <code>gettimeofday</code> are reliable.
     Currently we pretend in configure that the dodgy m68k netbsd 1.4.1
     <code>getrusage</code> doesn't exist.  If a test might take a long time
     to run then perhaps cache the result in a file somewhere.
<li> tune/time.c could choose the default precision based on the
     <code>speed_unittime</code> determined, independent of the method in use.
<li> Cray vector systems: CPU frequency could be determined from
     <code>sysconf(_SC_CLK_TCK)</code>, since it seems to be clock cycle
     based.  Is this true for all Cray systems?  Would like some documentation
     or something to confirm.
</ul>


<h4>Documentation</h4>
<ul>
<li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
<li> <code>mpn_get_str</code> isn't terribly clear about how many digits it
     produces.  It'd probably be possible to say at most one leading zero,
     which is what both it and <code>mpz_get_str</code> currently do.  But
     want to be careful not to bind ourselves to something that might not suit
     another implementation.
<li> <code>va_arg</code> doesn't do the right thing with <code>mpz_t</code>
     etc directly, but instead needs a pointer type like <code>MP_INT*</code>.
     It'd be good to show how to do this, but we'd either need to document
     <code>mpz_ptr</code> and friends, or perhaps fallback on something
     slightly nasty with <code>void*</code>.
</ul>


<h4>Bright Ideas</h4>

<p> The following may or may not be feasible, and aren't likely to get done in the
near future, but are at least worth thinking about.

<ul>
<li> Reorganize longlong.h so that we can inline the operations even for the
     system compiler.  When there is no such compiler feature, make calls to
     stub functions.  Write such stub functions for as many machines as
     possible.
<li> longlong.h could declare when it's using, or would like to use,
     <code>mpn_umul_ppmm</code>, and the corresponding umul.asm file could be
     included in libgmp only in that case, the same as is effectively done for
     <code>__clz_tab</code>.  Likewise udiv.asm and perhaps cntlz.asm.  This
     would only be a very small space saving, so perhaps not worth the
     complexity.
<li> longlong.h could be built at configure time by concatenating or
     #including fragments from each directory in the mpn path.  This would
     select CPU specific macros the same way as CPU specific assembler code.
     Code used would no longer depend on cpp predefines, and the current
     nested conditionals could be flattened out.
<li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000, whereas it's
     sort of supposed to return the low 31 (or 63) bits.  But this is
     undocumented, and perhaps not too important.
<li> <code>mpz_init_set*</code> and <code>mpz_realloc</code> could allocate
     say an extra 16 limbs over what's needed, so as to reduce the chance of
     having to do a reallocate if the <code>mpz_t</code> grows a bit more.
     This could only be an option, since it'd badly bloat memory usage in
     applications using many small values.
<li> <code>mpq</code> functions could perhaps check for numerator or
     denominator equal to 1, on the assumption that integers or
     denominator-only values might be expected to occur reasonably often.
<li> <code>count_trailing_zeros</code> is used on more or less uniformly
     distributed numbers in a couple of places.  For some CPUs
     <code>count_trailing_zeros</code> is slow and it's probably worth handling
     the frequently occurring 0 to 2 trailing zeros cases specially.
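     For instance (a sketch; <code>n</code> non-zero):
<pre>
	if (n & 1)
	  cnt = 0;
	else if (n & 2)
	  cnt = 1;
	else if (n & 4)
	  cnt = 2;
	else
	  count_trailing_zeros (cnt, n);
</pre>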
<li> <code>mpf_t</code> might like to let the exponent be undefined when
     size==0, instead of requiring it to be 0 as now.  It should be possible
     to do size==0 tests before paying attention to the exponent.  The
     advantage is not needing to set exp in the various places a zero result
     can arise, which avoids some tedium but is otherwise perhaps not too
     important.  Currently <code>mpz_set_f</code> and <code>mpf_cmp_ui</code>
     depend on exp==0, maybe elsewhere too.
<li> <code>__gmp_allocate_func</code>: Could use GCC <code>__attribute__
     ((malloc))</code> on this, though don't know if it'd do much.  GCC 3.0
     allows that attribute on functions, but not function pointers (see info
     node "Attribute Syntax"), so would need a new autoconf test.  This can
     wait until there's a GCC that supports it.
<li> <code>mpz_add_ui</code> contains two <code>__GMPN_COPY</code>s, one from
     <code>mpn_add_1</code> and one from <code>mpn_sub_1</code>.  If those two
     routines were opened up a bit maybe that code could be shared.  When a
     copy needs to be done there's no carry to append for the add, and if the
     copy is non-empty no high zero for the sub.
</ul>


<h4>Old and Obsolete Stuff</h4>

<p> The following tasks apply to chips or systems that are old and/or obsolete.
It's unlikely anything will be done about them unless anyone is actively using
them.

<ul>
<li> Sparc32: The integer based udiv_nfp.asm used to be selected by
     <code>configure --nfp</code> but that option is gone now that autoconf is
     used.  The file could go somewhere suitable in the mpn search if any
     chips might benefit from it, though it's possible we don't currently
     differentiate enough exact cpu types to do this properly.
<li> VAX D and G format <code>double</code> floats are straightforward and
     could perhaps be handled directly in <code>__gmp_extract_double</code>
     and maybe in <code>mpn_get_d</code>, rather than falling back on the
     generic code.  (Both formats are detected by <code>configure</code>.)
</ul>


<hr>

</body>
</html>

<!--
Local variables:
eval: (add-hook 'write-file-hooks 'time-stamp)
time-stamp-start: "This file current as of "
time-stamp-format: "%:d %3b %:y"
time-stamp-end: "\\."
time-stamp-line-limit: 50
End:
-->