<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <title>GMP Itemized Development Tasks</title>
  <link rel="shortcut icon" href="favicon.ico">
  <link rel="stylesheet" href="gmp.css">
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>

<center>
  <h1>
    GMP Itemized Development Tasks
  </h1>
</center>

<font size=-1>
<pre>
Copyright 2000-2004, 2006, 2008, 2009 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.
</pre>
</font>

<hr>
<!-- NB. timestamp updated automatically by emacs -->
This file current as of 29 Jan 2014.  An up-to-date version is available at
<a href="https://gmplib.org/tasks.html">https://gmplib.org/tasks.html</a>.
Please send comments about this page to gmp-devel<font>@</font>gmplib.org.

<p> These are itemized GMP development tasks.  Not all the tasks
listed here are suitable for volunteers, but many of them are.
Please see the <a href="projects.html">projects file</a> for more
sizeable projects.

<p> CAUTION: This file needs updating.  Many of the tasks here have
either already been taken care of, or have become irrelevant.

<h4>Correctness and Completeness</h4>
<ul>
<li> <code>_LONG_LONG_LIMB</code> in gmp.h is not namespace clean.  Reported
     by Patrick Pelissier.
     <br>
     We sort of mentioned <code>_LONG_LONG_LIMB</code> in past releases, so
     need to be careful about changing it.  It used to be a define
     applications had to set for long long limb systems, but that in
     particular is no longer relevant now that it's established automatically.
<li> The various reuse.c tests need to force reallocation by calling
     <code>_mpz_realloc</code> with a small (1 limb) size.
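     <br>
     A minimal sketch of the idea (<code>res</code> stands for the result
     operand of whatever reuse call is under test):
<pre>
	_mpz_realloc (res, 1);   /* shrink, so the call below must reallocate */
	mpz_add (res, a, b);     /* any mpz_XXX under test */
</pre>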
<li> One reuse case is missing from mpX/tests/reuse.c:
     <code>mpz_XXX(a,a,a)</code>.
<li> Make the string reading functions allow the `0x' prefix when the base is
     explicitly 16.  They currently only allow that prefix when the base is
     unspecified (zero).
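     <br>
     A hedged sketch of the desired behaviour (variable names are
     illustrative only):
<pre>
	if (str[0] == '0' && (str[1] == 'x' || str[1] == 'X')
	    && (base == 0 || base == 16))
	  {
	    base = 16;       /* accept the prefix for explicit base 16 too */
	    str += 2;
	  }
</pre>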
<li> <code>mpf_eq</code> is not always correct, when one operand is
     1000000000... and the other operand is 0111111111..., i.e., extremely
     close.  There is a special case in <code>mpf_sub</code> for this
     situation; put similar code in <code>mpf_eq</code>.  [In progress.]
<li> <code>mpf_eq</code> doesn't implement what gmp.texi specifies.  It should
     not use just whole limbs, but partial limbs.  [In progress.]
<li> <code>mpf_set_str</code> doesn't validate its exponent, for instance
     garbage 123.456eX789X is accepted (and an exponent 0 used), and overflow
     of a <code>long</code> is not detected.
<li> <code>mpf_add</code> doesn't check for a carry from truncated portions of
     the inputs, and in that respect doesn't implement the "infinite precision
     followed by truncate" specified in the manual.
<li> Windows DLLs: tests/mpz/reuse.c and tests/mpf/reuse.c initialize global
     variables with pointers to <code>mpz_add</code> etc, which doesn't work
     when those routines are coming from a DLL (because they're effectively
     function pointer global variables themselves).  Need to rearrange perhaps
     to a set of calls to a test function rather than iterating over an array.
<li> <code>mpz_pow_ui</code>: Detect when the result would be more memory than
     a <code>size_t</code> can represent and raise some suitable exception,
     probably an alloc call asking for <code>SIZE_T_MAX</code>, and if that
     somehow succeeds then an <code>abort</code>.  Various size overflows of
     this kind are not handled gracefully, probably resulting in segvs.
     <br>
     In <code>mpz_n_pow_ui</code>, detect when the count of low zero bits
     exceeds an <code>unsigned long</code>.  There's a (small) chance of this
     happening but still having enough memory to represent the value.
     Reported by Winfried Dreckmann; for instance <code>mpz_ui_pow_ui (x,
     4UL, 1431655766UL)</code> triggers it.
<li> <code>mpf</code>: Detect exponent overflow and raise some exception.
     It'd be nice to allow the full <code>mp_exp_t</code> range since that's
     how it's been in the past, but maybe dropping one bit would make it
     easier to test if e1+e2 goes out of bounds.
</ul>


<h4>Machine Independent Optimization</h4>
<ul>
<li> <code>mpf_cmp</code>: For better cache locality, don't test for low zero
     limbs until the high limbs fail to give an ordering.  Reduce code size by
     turning the three <code>mpn_cmp</code>'s into a single loop stopping when
     the end of one operand is reached (and then looking for a non-zero in the
     rest of the other).
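     <br>
     A rough shape for the merged loop, as a sketch only (assumes both
     operands positive, high limbs already aligned, and usize >= vsize):
<pre>
	/* compare from the most significant end downwards */
	for (i = 1; i <= vsize; i++)
	  if (up[usize - i] != vp[vsize - i])
	    return up[usize - i] > vp[vsize - i] ? 1 : -1;
	/* v exhausted: u is bigger iff any remaining limb is non-zero */
	for (i = usize - vsize; i > 0; i--)
	  if (up[i - 1] != 0)
	    return 1;
	return 0;
</pre>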
<li> <code>mpf_mul_2exp</code>, <code>mpf_div_2exp</code>: The use of
     <code>mpn_lshift</code> for any size<=prec means repeated
     <code>mul_2exp</code> and <code>div_2exp</code> calls accumulate low zero
     limbs until size==prec+1 is reached.  Those zeros will slow down
     subsequent operations, especially if the value is otherwise only small.
     If low bits of the low limb are zero, use <code>mpn_rshift</code> so as
     to not increase the size.
<li> <code>mpn_dc_sqrtrem</code>, <code>mpn_sqrtrem2</code>: Don't use
     <code>mpn_add_1</code> and <code>mpn_sub_1</code> for 1 limb operations;
     instead use <code>ADDC_LIMB</code> and <code>SUBC_LIMB</code>.
<li> <code>mpn_sqrtrem2</code>: Use plain variables for <code>sp[0]</code> and
     <code>rp[0]</code> calculations, so the compiler needn't worry about
     aliasing between <code>sp</code> and <code>rp</code>.
<li> <code>mpn_sqrtrem</code>: Some work can be saved in the last step when
     the remainder is not required, as noted in Paul's paper.
<li> <code>mpq_add</code>, <code>mpq_sub</code>: The gcd fits a single limb
     with high probability and in this case <code>binvert_limb</code> could
     be used to calculate the inverse just once for the two exact divisions
     "op1.den / gcd" and "op2.den / gcd", rather than letting
     <code>mpn_bdiv_q_1</code> do it each time.  This would require calling
     <code>mpn_pi1_bdiv_q_1</code>.
<li> <code>mpn_gcdext</code>: Don't test <code>count_leading_zeros</code> for
     zero, instead check the high bit of the operand and avoid invoking
     <code>count_leading_zeros</code>.  This is an optimization on all
     machines, and significant on machines with slow
     <code>count_leading_zeros</code>, though it's possible an already
     normalized operand might not be encountered very often.
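     <br>
     The test would look something like this (a sketch, using the
     gmp-impl.h <code>GMP_NUMB_HIGHBIT</code> mask):
<pre>
	if ((ulimb & GMP_NUMB_HIGHBIT) != 0)
	  cnt = 0;                           /* already normalized */
	else
	  count_leading_zeros (cnt, ulimb);
</pre>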
<li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
     most significant limb (if <code>GMP_LIMB_BITS</code> <= 52 bits).
     (Peter Montgomery has some ideas on this subject.)
<li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
     products with fewer operations.
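     <br>
     For reference, the classic half-limb scheme such code implements is
     roughly the following (a sketch; <code>LOWH</code>/<code>HIGHH</code>
     are illustrative names, not existing macros); the partial product
     additions are the part that might be tightened:
<pre>
	#define LOWH(x)   ((x) & (((mp_limb_t) 1 << (GMP_LIMB_BITS/2)) - 1))
	#define HIGHH(x)  ((x) >> (GMP_LIMB_BITS/2))

	mp_limb_t x0 = LOWH (u) * LOWH (v);
	mp_limb_t x1 = LOWH (u) * HIGHH (v);
	mp_limb_t x2 = HIGHH (u) * LOWH (v);
	mp_limb_t x3 = HIGHH (u) * HIGHH (v);

	x1 += HIGHH (x0);        /* can't carry */
	x1 += x2;                /* can carry into the top product */
	if (x1 < x2)
	  x3 += (mp_limb_t) 1 << (GMP_LIMB_BITS/2);

	hi = x3 + HIGHH (x1);
	lo = (x1 << (GMP_LIMB_BITS/2)) + LOWH (x0);
</pre>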
<li> Consider inlining <code>mpz_set_ui</code>.  This would be both small and
     fast, especially for compile-time constants, but would make application
     binaries depend on having 1 limb allocated to an <code>mpz_t</code>,
     preventing the "lazy" allocation scheme below.
<li> Consider inlining <code>mpz_[cft]div_ui</code> and maybe
     <code>mpz_[cft]div_r_ui</code>.  A <code>__gmp_divide_by_zero</code>
     would be needed for the divide by zero test, unless that could be left to
     <code>mpn_mod_1</code> (not sure currently whether all the risc chips
     provoke the right exception there if using mul-by-inverse).
<li> Consider inlining: <code>mpz_fits_s*_p</code>.  The setups for
     <code>LONG_MAX</code> etc would need to go into gmp.h, and on Cray it
     might, unfortunately, be necessary to forcibly include <limits.h>
     since there's no apparent way to get <code>SHRT_MAX</code> with an
     expression (since <code>short</code> and <code>unsigned short</code> can
     be different sizes).
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very fast on one
     or two limb moduli, due to a lot of function call overheads.  These could
     perhaps be handled as special cases.
<li> Make sure <code>mpz_powm_ui</code> is never slower than the corresponding
     computation using <code>mpz_powm</code>.
<li> <code>mpz_powm</code> REDC should do multiplications by <code>g[]</code>
     using the division method when they're small, since the REDC form of a
     small multiplier is normally a full size product.  Probably would need a
     new tuned parameter to say what size multiplier is "small", as a function
     of the size of the modulus.
<li> <code>mpn_gcd</code> might be able to be sped up on small to moderate
     sizes by improving <code>find_a</code>, possibly just by providing an
     alternate implementation for CPUs with slowish
     <code>count_leading_zeros</code>.
<li> <code>mpf_set_str</code> produces low zero limbs when a string has a
     fraction but is exactly representable, eg. 0.5 in decimal.  These could be
     stripped to save work in later operations.
<li> <code>mpz_and</code>, <code>mpz_ior</code> and <code>mpz_xor</code> should
     use <code>mpn_and_n</code> etc for the benefit of the small number of
     targets with native versions of those routines.  Need to be careful not to
     pass size==0.  Is some code sharing possible between the <code>mpz</code>
     routines?
<li> <code>mpf_add</code>: Don't do a copy to avoid overlapping operands
     unless it's really necessary (currently only sizes are tested, not
     whether r really is u or v).
<li> <code>mpf_add</code>: Under the check for v having no effect on the
     result, perhaps test for r==u and do nothing in that case, rather than
     currently it looks like an <code>MPN_COPY_INCR</code> will be done to
     reduce prec+1 limbs to prec.
<li> <code>mpf_div_ui</code>: Instead of padding with low zeros, call
     <code>mpn_divrem_1</code> asking for fractional quotient limbs.
<li> <code>mpf_div_ui</code>: Eliminate <code>TMP_ALLOC</code>.  When r!=u
     there's no overlap and the division can be called on those operands.
     When r==u and is prec+1 limbs, then it's an in-place division.  If r==u
     and not prec+1 limbs, then move the available limbs up to prec+1 and do
     an in-place there.
<li> <code>mpf_div_ui</code>: Whether the high quotient limb is zero can be
     determined by testing the dividend for high<divisor.  When non-zero, the
     division can be done on prec dividend limbs instead of prec+1.  The result
     size is also known before the division, so that can be a tail call (once
     the <code>TMP_ALLOC</code> is eliminated).
<li> <code>mpn_divrem_2</code> could usefully accept unnormalized divisors and
     shift the dividend on-the-fly, since this should cost nothing on
     superscalar processors and avoid the need for temporary copying in
     <code>mpn_tdiv_qr</code>.
<li> <code>mpf_sqrt</code>: If r!=u, and if u doesn't need to be padded with
     zeros, then there's no need for the tp temporary.
<li> <code>mpq_cmp_ui</code> could form the <code>num1*den2</code> and
     <code>num2*den1</code> products limb-by-limb from high to low and look at
     each step for values differing by more than the possible carry bit from
     the uncalculated portion.
<li> <code>mpq_cmp</code> could do the same high-to-low progressive multiply
     and compare.  The benefits of karatsuba and higher multiplication
     algorithms are lost, but if it's assumed only a few high limbs will be
     needed to determine an order then that's fine.
<li> <code>mpn_add_1</code>, <code>mpn_sub_1</code>, <code>mpn_add</code>,
     <code>mpn_sub</code>: Internally use <code>__GMPN_ADD_1</code> etc
     instead of the functions, so they get inlined on all compilers, not just
     gcc and others with <code>inline</code> recognised in gmp.h.
     <code>__GMPN_ADD_1</code> etc are meant mostly to support application
     inline <code>mpn_add_1</code> etc and if they don't come out good for
     internal uses then special forms can be introduced, for instance many
     internal uses are in-place.  Sometimes a block of code is executed based
     on the carry-out, rather than using it arithmetically, and those places
     might want to do their own loops entirely.
<li> <code>__gmp_extract_double</code> on 64-bit systems could use just one
     bitfield for the mantissa extraction, not two, when endianness permits.
     Might depend on the compiler allowing <code>long long</code> bit fields
     when that's the only actual 64-bit type.
<li> tal-notreent.c could keep a block of memory permanently allocated.
     Currently the last nested <code>TMP_FREE</code> releases all memory, so
     there's an allocate and free every time a top-level function using
     <code>TMP</code> is called.  Would need
     <code>mp_set_memory_functions</code> to tell tal-notreent.c to release
     any cached memory when changing allocation functions though.
<li> <code>__gmp_tmp_alloc</code> from tal-notreent.c could be partially
     inlined.  If the current chunk has enough room then a couple of pointers
     can be updated.  Only if more space is required then a call to some sort
     of <code>__gmp_tmp_increase</code> would be needed.  The requirement that
     <code>TMP_ALLOC</code> is an expression might make the implementation a
     bit ugly and/or a bit sub-optimal.
<pre>
#define TMP_ALLOC(n)                                    \
  ((ROUND_UP (n) > current->end - current->point        \
      ? __gmp_tmp_increase (ROUND_UP (n)) : 0),         \
   current->point += ROUND_UP (n),                      \
   current->point - ROUND_UP (n))
</pre>
<li> <code>__mp_bases</code> has a lot of data for bases which are pretty much
     never used.  Perhaps the table should just go up to base 16, and have
     code to generate data above that, if and when required.  Naturally this
     assumes the code would be smaller than the data saved.
<li> <code>__mp_bases</code> field <code>big_base_inverted</code> is only used
     if <code>USE_PREINV_DIVREM_1</code> is true, and could be omitted
     otherwise, to save space.
<li> <code>mpz_get_str</code>, <code>mtox</code>: For power-of-2 bases, which
     are of course fast, it seems a little silly to make a second pass over
     the <code>mpn_get_str</code> output to convert to ASCII.  Perhaps combine
     that with the bit extractions.
<li> <code>mpz_gcdext</code>: If the caller requests only the S cofactor (of
     A), and A<B, then the code ends up generating the cofactor T (of B) and
     deriving S from that.  Perhaps it'd be possible to arrange to get S in
     the first place by calling <code>mpn_gcdext</code> with A+B,B.  This
     might only be an advantage if A and B are about the same size.
<li> <code>mpz_n_pow_ui</code> does a good job with small bases and stripping
     powers of 2, but it's perhaps a bit too complicated for what it gains.
     The simpler <code>mpn_pow_1</code> is a little faster on small exponents.
     (Note some of the ugliness in <code>mpz_n_pow_ui</code> is due to
     supporting <code>mpn_mul_2</code>.)
     <br>
     Perhaps the stripping of 2s in <code>mpz_n_pow_ui</code> should be
     confined to single limb operands for simplicity and since that's where
     the greatest gain would be.
     <br>
     Ideally <code>mpn_pow_1</code> and <code>mpz_n_pow_ui</code> would be
     merged.  The reason <code>mpz_n_pow_ui</code> writes to an
     <code>mpz_t</code> is that its callers leave it to make a good estimate
     of the result size.  Callers of <code>mpn_pow_1</code> already know the
     size by separate means (<code>mp_bases</code>).
<li> <code>mpz_invert</code> should call <code>mpn_gcdext</code> directly.
</ul>


<h4>Machine Dependent Optimization</h4>
<ul>
<li> <code>invert_limb</code> on various processors might benefit from the
     little Newton iteration done for alpha and ia64.
<li> Alpha 21264: <code>mpn_addlsh1_n</code> could be implemented with
     <code>mpn_addmul_1</code>, since that code at 3.5 is a touch faster than
     a separate <code>lshift</code> and <code>add_n</code> at
     1.75+2.125=3.875.  Or very likely some specific <code>addlsh1_n</code>
     code could beat both.
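     <br>
     Where the destination doesn't already hold the first operand, the
     addmul_1 form would be essentially (a sketch using gmp-impl.h's
     <code>MPN_COPY</code>):
<pre>
	MPN_COPY (rp, up, n);                 /* rp = up */
	cy = mpn_addmul_1 (rp, vp, n, 2);     /* rp += 2*vp */
</pre>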
<li> Alpha 21264: Improve feed-in code for <code>mpn_mul_1</code>,
     <code>mpn_addmul_1</code>, and <code>mpn_submul_1</code>.
<li> Alpha 21164: Rewrite <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
     and <code>mpn_submul_1</code> for the 21164.  This should use both integer
     multiplies and floating-point multiplies.  For the floating-point
     operations, the single-limb multiplier should be split into three 21-bit
     chunks, or perhaps even better in four 16-bit chunks.  Probably possible
     to reach 9 cycles/limb.
<li> Alpha: GCC 3.4 will introduce <code>__builtin_ctzl</code>,
     <code>__builtin_clzl</code> and <code>__builtin_popcountl</code> using
     the corresponding CIX <code>ct</code> instructions, and
     <code>__builtin_alpha_cmpbge</code>.  These should give GCC more
     information about scheduling etc than the <code>asm</code> blocks
     currently used in longlong.h and gmp-impl.h.
<li> Alpha Unicos: Apparently there's no <code>alloca</code> on this system,
     making <code>configure</code> choose the slower
     <code>malloc-reentrant</code> allocation method.  Is there a better way?
     Maybe variable-length arrays per notes below.
<li> Alpha Unicos 21164, 21264: <code>.align</code> is not used since it pads
     with garbage.  Does the code get the intended slotting required for the
     claimed speeds?  <code>.align</code> at the start of a function would
     presumably be safe no matter how it pads.
<li> ARM V5: <code>count_leading_zeros</code> can use the <code>clz</code>
     instruction.  For GCC 3.4 and up, do this via <code>__builtin_clzl</code>
     since then gcc knows it's "predicable".
<li> Itanium: GCC 3.4 introduces <code>__builtin_popcount</code> which can be
     used instead of an <code>asm</code> block.  The builtin should give gcc
     more opportunities for scheduling, bundling and predication.
     <code>__builtin_ctz</code> similarly (it just uses popcount as per
     current longlong.h).
<li> UltraSPARC/64: Optimize <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
     for s2 < 2^32 (or perhaps for any zero 16-bit s2 chunk).  Not sure how
     much this can improve the speed, though, since the symmetry that we rely
     on is lost.  Perhaps we can just gain cycles when s2 < 2^16, or more
     accurately, when two 16-bit s2 chunks which are 16 bits apart are zero.
<li> UltraSPARC/64: Write native <code>mpn_submul_1</code>, analogous to
     <code>mpn_addmul_1</code>.
<li> UltraSPARC/64: Write <code>umul_ppmm</code>.  Using four
     "<code>mulx</code>"s either with an asm block or via the generic C code is
     about 90 cycles.  Try using fp operations, and also try using karatsuba
     for just three "<code>mulx</code>"s.
<li> UltraSPARC/32: Rewrite <code>mpn_lshift</code>, <code>mpn_rshift</code>.
     Will give 2 cycles/limb.  Trivial modifications of mpn/sparc64 should do.
<li> UltraSPARC/32: Write special mpn_Xmul_1 loops for s2 < 2^16.
<li> UltraSPARC/32: Use <code>mulx</code> for <code>umul_ppmm</code> if
     possible (see commented out code in longlong.h).  This is unlikely to
     save more than a couple of cycles, so perhaps isn't worth bothering with.
<li> UltraSPARC/32: On Solaris gcc doesn't give us <code>__sparc_v9__</code>
     or anything to indicate V9 support when -mcpu=v9 is selected.  See
     gcc/config/sol2-sld-64.h.  Will need to pass something through from
     ./configure to select the right code in longlong.h.  (Currently nothing
     is lost because <code>mulx</code> for multiplying is commented out.)
<li> UltraSPARC/32: <code>mpn_divexact_1</code> and
     <code>mpn_modexact_1c_odd</code> can use a 64-bit inverse and take
     64-bits at a time from the dividend, as per the 32-bit divisor case in
     mpn/sparc64/mode1o.c.  This must be done in assembler, since the full
     64-bit registers (<code>%gN</code>) are not available from C.
<li> UltraSPARC/32: <code>mpn_divexact_by3c</code> can work 64-bits at a time
     using <code>mulx</code>, in assembler.  This would be the same as for
     sparc64.
<li> UltraSPARC: <code>binvert_limb</code> might save a few cycles from
     masking down to just the useful bits at each point in the calculation,
     since <code>mulx</code> speed depends on the highest bit set.  Either
     explicit masks or small types like <code>short</code> and
     <code>int</code> ought to work.
<li> Sparc64 HAL R1 <code>popc</code>: This chip reputedly implements
     <code>popc</code> properly (see gcc sparc.md).  Would need to recognise
     it as <code>sparchalr1</code> or something in configure / config.sub /
     config.guess.  <code>popc_limb</code> in gmp-impl.h could use this (per
     commented out code).  <code>count_trailing_zeros</code> could use it too.
<li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.  The current code runs at 11 cycles/limb.  It
     should be possible to saturate the cache, which will happen at 8
     cycles/limb (7.5 for mpn_mul_1).  Write special loops for s2 < 2^32;
     it should be possible to make them run at about 5 cycles/limb.
<li> PPC601: See which of the power or powerpc32 code runs better.  Currently
     the powerpc32 is used, but only because it's the default for
     <code>powerpc*</code>.
<li> PPC630: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.  Use both integer and floating-point operations,
     possibly two floating-point and one integer limb per loop.  Split operands
     into four 16-bit chunks for fast fp operations.  Should easily reach 9
     cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb
     (using one int + two fp).
<li> PPC630: <code>mpn_rshift</code> could do the same sort of unrolled loop
     as <code>mpn_lshift</code>.  Some judicious use of m4 might let the two
     share source code, or with a register to control the loop direction
     perhaps even share object code.
<li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
     for important machines.  Helping the generic sqr_basecase.c with an
     <code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
<li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
     Will bring time from 1.75 to 1.25 cycles/limb.
<li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1.  (See
     Pentium code.)
<li> X86: Good authority has it that in the past an inline <code>rep
     movs</code> would upset GCC register allocation for the whole function.
     Is this still true in GCC 3?  It uses <code>rep movs</code> itself for
     <code>__builtin_memcpy</code>.  Examine the code for some simple and
     complex functions to find out.  Inlining <code>rep movs</code> would be
     desirable, it'd be both smaller and faster.
<li> Pentium P54: <code>mpn_lshift</code> and <code>mpn_rshift</code> can come
     down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after
     <code>shrdl</code> and <code>shldl</code>, see mpn/x86/pentium/README.
<li> Pentium P55 MMX: <code>mpn_lshift</code> and <code>mpn_rshift</code>
     might benefit from some destination prefetching.
<li> PentiumPro: <code>mpn_divrem_1</code> might be able to use a
     mul-by-inverse, hoping for maybe 30 c/l.
<li> K7: <code>mpn_lshift</code> and <code>mpn_rshift</code> might be able to
     do something branch-free for unaligned startups, and shaving one insn
     from the loop with alternative indexing might save a cycle.
<li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
     The pipeline is now extremely deep, perhaps unnecessarily deep.
<li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
<li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
     <code>mpn_sqr_basecase</code>.  This should use a "vertical multiplication
     method" to avoid carry propagation, splitting one of the operands into
     11-bit chunks.
<li> Pentium: <code>mpn_lshift</code> by 31 should use the special rshift
     by 1 code, and vice versa <code>mpn_rshift</code> by 31 should use the
     special lshift by 1.  This would be best as a jump across to the other
     routine, could let both live in lshift.asm and omit rshift.asm on finding
     <code>mpn_rshift</code> already provided.
<li> Cray T3E: Experiment with optimization options.  In particular,
     -hpipeline3 seems promising.  We should at least up -O to -O2 or -O3.
<li> Cray: <code>mpn_com</code> and <code>mpn_and_n</code> etc very probably
     want a pragma like <code>MPN_COPY_INCR</code>.
<li> Cray vector systems: <code>mpn_lshift</code>, <code>mpn_rshift</code>,
     <code>mpn_popcount</code> and <code>mpn_hamdist</code> are nice and small
     and could be inlined to avoid function calls.
<li> Cray: Variable length arrays seem to be faster than the tal-notreent.c
     scheme.  Not sure why, maybe they merely give the compiler more
     information about aliasing (or the lack thereof).  Would like to modify
     <code>TMP_ALLOC</code> to use them, or introduce a new scheme.  Memory
     blocks wanted unconditionally are easy enough, those wanted only
     sometimes are a problem.  Perhaps a special size calculation to ask for a
     dummy length 1 when unwanted, or perhaps an inlined subroutine
     duplicating code under each conditional.  Don't really want to turn
     everything into a dog's dinner just because Cray don't offer an
     <code>alloca</code>.
<li> Cray: <code>mpn_get_str</code> on power-of-2 bases ought to vectorize.
     Does it?  <code>bits_per_digit</code> and the inner loop over bits in a
     limb might prevent it.  Perhaps special cases for binary, octal and hex
     would be worthwhile (very possibly for all processors too).
<li> S390: <code>BSWAP_LIMB_FETCH</code> looks like it could be done with
     <code>lrvg</code>, as per glibc sysdeps/s390/s390-64/bits/byteswap.h.
     Is this only for 64-bit mode, since 32-bit mode has other code?  Also,
     is it worth using for <code>BSWAP_LIMB</code> too, or would that mean a
     store and re-fetch?  Presumably that's what comes out in glibc.
<li> Improve <code>count_leading_zeros</code> for 64-bit machines:
<pre>
	   if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
	   if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
	   ... </pre>
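     Spelled out in full for a 64-bit limb it would be (a sketch; x must be
     non-zero and cnt start at 0):
<pre>
	   if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
	   if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
	   if ((x >> 56) == 0) { x <<= 8;  cnt += 8;  }
	   if ((x >> 60) == 0) { x <<= 4;  cnt += 4;  }
	   if ((x >> 62) == 0) { x <<= 2;  cnt += 2;  }
	   cnt += (x >> 63) ^ 1;
</pre>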
<li> IRIX 6 MIPSpro compiler has an <code>__inline</code> which could perhaps
     be used in <code>__GMP_EXTERN_INLINE</code>.  What would be the right way
     to identify suitable versions of that compiler?
<li> IRIX <code>cc</code> is rumoured to have an <code>_int_mult_upper</code>
     (in <code><intrinsics.h></code> like Cray), but it didn't seem to
     exist on some IRIX 6.5 systems tried.  If it does actually exist
     somewhere it would very likely be an improvement over a function call to
     umul.asm.
<li> <code>mpn_get_str</code> final divisions by the base with
     <code>udiv_qrnd_unnorm</code> could use some sort of multiply-by-inverse
     on suitable machines.  This ends up happening for decimal by presenting
     the compiler with a run-time constant, but the same for other bases would
     be good.  Perhaps use could be made of the fact base<256.
<li> <code>mpn_umul_ppmm</code>, <code>mpn_udiv_qrnnd</code>: Return a
     structure like <code>div_t</code> to avoid going through memory, in
     particular helping RISCs that don't do store-to-load forwarding.  Clearly
     this is only possible if the ABI returns a structure of two
     <code>mp_limb_t</code>s in registers.
     <br>
     On PowerPC, structures are returned in memory on AIX and Darwin.  In SVR4
     they're returned in registers, except that draft SVR4 had said memory, so
     it'd be prudent to check which is done.  We can jam the compiler into the
     right mode if we know how, since all this is purely internal to libgmp.
     (gcc has an option, though of course gcc doesn't matter since we use
     inline asm there.)
</ul>

<h4>New Functionality</h4>
<ul>
<li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
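     <br>
     The two-modulus case can be built from documented <code>mpz</code> calls
     alone; a sketch (the name and interface are purely hypothetical, and
     <code>x</code> must not alias <code>r1</code>):
<pre>
	/* find x with x = r1 (mod m1) and x = r2 (mod m2);
	   returns non-zero iff m1 is invertible mod m2 */
	int
	mpz_crr2 (mpz_t x, const mpz_t r1, const mpz_t m1,
	          const mpz_t r2, const mpz_t m2)
	{
	  mpz_t t;
	  int ok;
	  mpz_init (t);
	  ok = mpz_invert (t, m1, m2);      /* t = m1^-1 mod m2 */
	  if (ok)
	    {
	      mpz_sub (x, r2, r1);
	      mpz_mul (x, x, t);
	      mpz_mod (x, x, m2);           /* k = (r2-r1)*t mod m2 */
	      mpz_mul (x, x, m1);
	      mpz_add (x, x, r1);           /* x = r1 + k*m1 */
	    }
	  mpz_clear (t);
	  return ok;
	}
</pre>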
<li> Let `0b' and `0B' mean binary input everywhere.
<li> <code>mpz_init</code> and <code>mpq_init</code> could do lazy allocation.
     Set <code>ALLOC(var)</code> to 0 to indicate nothing allocated, and let
     <code>_mpz_realloc</code> do the initial alloc.  Set
     <code>z->_mp_d</code> to a dummy that <code>mpz_get_ui</code> and
     similar can unconditionally fetch from.  Niels Möller has had a go at
     this.
     <br>
     The advantages of the lazy scheme would be:
     <ul>
     <li> Initial allocate would be the size required for the first value
          stored, rather than getting 1 limb in <code>mpz_init</code> and then
          more or less immediately reallocating.
     <li> <code>mpz_init</code> would only store magic values in the
          <code>mpz_t</code> fields, and could be inlined.
     <li> A fixed initializer could even be used by applications, like
          <code>mpz_t z = MPZ_INITIALIZER;</code>, which might be convenient
          for globals (see the sketch after this list).
     </ul>
     The advantages of the current scheme are:
     <ul>
     <li> <code>mpz_set_ui</code> and other similar routines needn't check the
          size allocated and can just store unconditionally.
     <li> <code>mpz_set_ui</code> and perhaps others like
          <code>mpz_tdiv_r_ui</code> and a prospective
          <code>mpz_set_ull</code> could be inlined.
     </ul>
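     For illustration, with the lazy scheme such an initializer might look
     like the following (purely hypothetical names, assuming the current
     field order alloc, size, pointer, and a reserved dummy limb):
<pre>
	extern mp_limb_t __gmp_dummy_limb;
	#define MPZ_INITIALIZER  {{ 0, 0, &__gmp_dummy_limb }}

	mpz_t z = MPZ_INITIALIZER;   /* usable without mpz_init */
</pre>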
<li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>.  Make sure
     format is portable between 32-bit and 64-bit machines, and between
     little-endian and big-endian machines.  A format which MPFR can use too
     would be good.
<li> <code>mpn_and_n</code> ... <code>mpn_copyd</code>: Perhaps make the mpn
     logops and copys available in gmp.h, either as library functions or
     inlines, with the availability of library functions instantiated in the
     generated gmp.h at build time.
<li> <code>mpz_set_str</code> etc variants taking string lengths rather than
     null-terminators.
<li> <code>mpz_andn</code>, <code>mpz_iorn</code>, <code>mpz_nand</code>,
     <code>mpz_nior</code>, <code>mpz_xnor</code> might be useful additions,
     if they could share code with the current such functions (which should be
     possible).
<li> <code>mpz_and_ui</code> etc might be of use sometimes.  Suggested by
     Niels Möller.
<li> <code>mpf_set_str</code> and <code>mpf_inp_str</code> could usefully
     accept 0x, 0b etc when base==0.  Perhaps the exponent could default to
     decimal in this case, with a further 0x, 0b etc allowed there.
     Eg. 0xFFAA@0x5A.  A leading "0" for octal would match the integers, but
     probably something like "0.123" ought not mean octal.
<li> <code>GMP_LONG_LONG_LIMB</code> or some such could become a documented
     feature of gmp.h, so applications could know whether to
     <code>printf</code> a limb using <code>%lu</code> or <code>%Lu</code>.
<li> <code>GMP_PRIdMP_LIMB</code> and similar defines following C99
     <inttypes.h> might be of use to applications printing limbs.  But
     if <code>GMP_LONG_LONG_LIMB</code> or whatever is added then perhaps this
     can easily enough be left to applications.
<li> <code>gmp_printf</code> could accept <code>%b</code> for binary output.
     It'd be nice if it worked for plain <code>int</code> etc too, not just
     <code>mpz_t</code> etc.
<li> <code>gmp_printf</code> in fact could usefully accept an arbitrary base,
     for both integer and float conversions.  A base either in the format
     string or as a parameter with <code>*</code> should be allowed.  Maybe
     <code>%13b</code> (b for base) or something like that.
<li> <code>gmp_printf</code> could perhaps accept <code>mpq_t</code> for float
     conversions, eg. <code>"%.4Qf"</code>.  This would be merely for
     convenience, but still might be useful.  Rounding would be the same as
     for an <code>mpf_t</code> (ie. currently round-to-nearest, but not
     actually documented).  Alternately, perhaps a separate
     <code>mpq_get_str_point</code> or some such might be more use.  Suggested
     by Pedro Gimeno.
<li> <code>mpz_rscan0</code> or <code>mpz_revscan0</code> or some such
     searching towards the low end of an integer might match
     <code>mpz_scan0</code> nicely.  Likewise for <code>scan1</code>.
     Suggested by Roberto Bagnara.
<li> <code>mpz_bit_subset</code> or some such to test whether one integer is a
     bitwise subset of another might be of use.  Some sort of return value
     indicating whether it's a proper or non-proper subset would be good and
     wouldn't cost anything in the implementation.  Suggested by Roberto
     Bagnara.
<li> <code>mpf_get_ld</code>, <code>mpf_set_ld</code>: Conversions between
     <code>mpf_t</code> and <code>long double</code>, suggested by Dan
     Christensen.  Other <code>long double</code> routines might be desirable
     too, but <code>mpf</code> would be a start.
     <br>
     <code>long double</code> is an ANSI-ism, so everything involving it would
     need to be suppressed on a K&R compiler.
     <br>
     There'd be some work to be done by <code>configure</code> to recognise
     the format in use, MPFR has a start on this.  Often <code>long
     double</code> is the same as <code>double</code>, which is easy but
     pretty pointless.  A single float format detector macro could look at
     <code>double</code> then <code>long double</code>.
     <br>
     Sometimes there's a compiler option for the size of a <code>long
     double</code>, eg. xlc on AIX can use either 64-bit or 128-bit.  It's
     probably simplest to regard this as a compiler compatibility issue, and
     leave it to users or sysadmins to ensure application and library code is
     built the same.
<li> <code>mpz_sqrt_if_perfect_square</code>: When
     <code>mpz_perfect_square_p</code> does its tests it calculates a square
     root and then discards it.  For some applications it might be useful to
     return that root.  Suggested by Jason Moxham.
<li> <code>mpz_get_ull</code>, <code>mpz_set_ull</code>,
     <code>mpz_get_sll</code>, <code>mpz_set_sll</code>: Conversions for
     <code>long long</code>.  These would aid interoperability, though a
     mixture of GMP and <code>long long</code> would probably not be too
     common.  Since <code>long long</code> is not always available (it's in
     C99 and GCC though), disadvantages of using <code>long long</code> in
     libgmp.a would be
     <ul>
     <li> Library contents vary according to the build compiler.
     <li> gmp.h would need an ugly <code>#ifdef</code> block to decide if the
          application compiler could take the <code>long long</code>
          prototypes.
     <li> Some sort of <code>LIBGMP_HAS_LONGLONG</code> might be wanted to
          indicate whether the functions are available.  (Applications using
          autoconf could probe the library too.)
     </ul>
     It'd be possible to defer the need for <code>long long</code> to
     application compile time, by having something like
     <code>mpz_set_2ui</code> called with two halves of a <code>long
     long</code>.  Disadvantages of this would be,
     <ul>
     <li> Bigger code in the application, though perhaps not if a <code>long
          long</code> is normally passed as two halves anyway.
     <li> <code>mpz_get_ull</code> would be a rather big inline, or would have
          to be two function calls.
     <li> <code>mpz_get_sll</code> would be a worse inline, and would put the
          treatment of <code>-0x10..00</code> into applications (see the
          <code>mpz_get_si</code> item under Bright Ideas below).
     <li> Although having libgmp.a independent of the build compiler is nice,
          it sort of sacrifices the capabilities of a good compiler to
          uniformity with inferior ones.
     </ul>
     Plain use of <code>long long</code> is probably the lesser evil, if only
     because it makes best use of gcc.  In fact perhaps it would suffice to
     guarantee <code>long long</code> conversions only when using GCC for both
     application and library.  That would cover free software, and we can
     worry about selected vendor compilers later.
     <br>
     In C++ the situation is probably clearer, we demand fairly recent C++ so
     <code>long long</code> should be available always.  We'd probably prefer
     to have the C and C++ the same in respect of <code>long long</code>
     support, but it would be possible to have it unconditionally in gmpxx.h,
     by some means or another.
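     <br>
     By way of example, the two-halves approach on the application side might
     look like this (a hypothetical helper; assumes z >= 0 and z < 2^64,
     with 32-bit halves):
<pre>
	unsigned long long
	my_mpz_get_ull (const mpz_t z)
	{
	  mpz_t t;
	  unsigned long long hi;
	  mpz_init (t);
	  mpz_tdiv_q_2exp (t, z, 32);     /* high half */
	  hi = mpz_get_ui (t);
	  mpz_clear (t);
	  return (hi << 32) | (mpz_get_ui (z) & 0xffffffffULL);
	}
</pre>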
<li> <code>mpz_strtoz</code> parsing the same as <code>strtol</code>.
     Suggested by Alexander Kruppa.
</ul>


<h4>Configuration</h4>

<ul>
<li> Alpha ev7, ev79: Add code to config.guess to detect these.  Believe ev7
     will be "3-1307" in the current switch, but need to verify that.  (On
     OSF, current configfsf.guess identifies ev7 using psrinfo, we need to do
     it ourselves for other systems.)
<li> Alpha OSF: Libtool (version 1.5) doesn't seem to recognise this system is
     "pic always" and ends up running gcc twice with the same options.  This
     is wasteful, but harmless.  Perhaps a newer libtool will be better.
<li> ARM: <code>umul_ppmm</code> in longlong.h always uses <code>umull</code>,
     but is that available only for M series chips or some such?  Perhaps it
     should be configured in some way.
<li> HPPA: config.guess should recognize 7000, 7100, 7200, and 8x00.
<li> HPPA: gcc 3.2 introduces a <code>-mschedule=7200</code> etc parameter,
     which could be driven by an exact hppa cpu type.
<li> Mips: config.guess should say mipsr3000, mipsr4000, mipsr10000, etc.
     "hinv -c processor" gives lots of information on Irix.  Standard
     config.guess appends "el" to indicate endianness, but
     <code>AC_C_BIGENDIAN</code> seems the best way to handle that for GMP.
<li> PowerPC: The function descriptor nonsense for AIX is currently driven by
     <code>*-*-aix*</code>.  It might be more reliable to do some sort of
     feature test, examining the compiler output perhaps.  It might also be
     nice to merge the aix.m4 files into powerpc-defs.m4.
<li> config.m4 is generated only by the configure script, it won't be
     regenerated by config.status.  Creating it as an <code>AC_OUTPUT</code>
     would work, but it might upset "make" to have things like <code>L$</code>
     get into the Makefiles through <code>AC_SUBST</code>.
     <code>AC_CONFIG_COMMANDS</code> would be the alternative.  With some
     careful m4 quoting the <code>changequote</code> calls might not be
     needed, which might free up the order in which things had to be output.
<li> Automake: Latest automake has a <code>CCAS</code>, <code>CCASFLAGS</code>
     scheme.  Though we probably wouldn't be using its assembler support we
     could try to use those variables in compatible ways.
<li> <code>GMP_LDFLAGS</code> could probably be done with plain
     <code>LDFLAGS</code> already used by automake for all linking.  But with
     a bit of luck the next libtool will pass pretty much all
     <code>CFLAGS</code> through to the compiler when linking, making
     <code>GMP_LDFLAGS</code> unnecessary.
<li> mpn/Makeasm.am uses <code>-c</code> and <code>-o</code> together in the
     .S and .asm rules, but apparently that isn't completely portable (there's
     an autoconf <code>AC_PROG_CC_C_O</code> test for it).  So far we've not
     had problems, but perhaps the rules could be rewritten to use "foo.s" as
     the temporary, or to do a suitable "mv" of the result.  The only danger
     from using foo.s would be if a compile failed and the temporary foo.s
     then looked like the primary source.  Hopefully if the
     <code>SUFFIXES</code> are ordered to have .S and .asm ahead of .s that
     wouldn't happen.  Might need to check.
</ul>


<h4>Random Numbers</h4>
<ul>
<li> <code>_gmp_rand</code> is not particularly fast on the linear
     congruential algorithm and could stand various improvements.
     <ul>
     <li> Make a second seed area within <code>gmp_randstate_t</code> (or
          <code>_mp_algdata</code> rather) to save some copying.
     <li> Make a special case for a single limb <code>2exp</code> modulus, to
          avoid <code>mpn_mul</code> calls.  Perhaps the same for two limbs.
     <li> Inline the <code>lc</code> code, to avoid a function call and
          <code>TMP_ALLOC</code> for every chunk.
     <li> Perhaps the <code>2exp</code> and general LC cases should be split,
          for clarity (if the general case is retained).
     </ul>
<li> <code>gmp_randstate_t</code> used for parameters perhaps should become
     <code>gmp_randstate_ptr</code> the same as other types.
<li> Some of the empirical randomness tests could be included in a "make
     check".  They ought to work everywhere, for a given seed at least.
</ul>


<h4>C++</h4>
<ul>
<li> <code>mpz_class(string)</code>, etc: Use the C++ global locale to
     identify whitespace.
     <br>
     <code>mpf_class(string)</code>: Use the C++ global locale decimal point,
     rather than the C one.
     <br>
     Consider making these variant <code>mpz_set_str</code> etc forms
     available for <code>mpz_t</code> too, not just <code>mpz_class</code>
     etc.
<li> <code>mpq_class operator+=</code>: Don't emit an unnecessary
     <code>mpq_set(q,q)</code> before <code>mpz_addmul</code> etc.
<li> Put various bits of gmpxx.h into libgmpxx, to avoid excessive inlining.
     Candidates for this would be,
     <ul>
     <li> <code>mpz_class(const char *)</code>, etc: since they're normally
          not fast anyway, and we can hide the exception <code>throw</code>.
     <li> <code>mpz_class(string)</code>, etc: to hide the <code>cstr</code>
          needed to get to the C conversion function.
     <li> <code>mpz_class string, char*</code> etc constructors: likewise to
          hide the throws and conversions.
     <li> <code>mpz_class::get_str</code>, etc: to hide the <code>char*</code>
          to <code>string</code> conversion and free.  Perhaps
          <code>mpz_get_str</code> can write directly into a
          <code>string</code>, to avoid copying.
          <br>
          Consider making such <code>string</code> returning variants
          available for use with plain <code>mpz_t</code> etc too.
     </ul>
</ul>

<h4>Miscellaneous</h4>
<ul>
<li> <code>mpz_gcdext</code> and <code>mpn_gcdext</code> ought to document
     what range of values the generated cofactors can take, and preferably
     ensure the definition uniquely specifies the cofactors for given inputs.
     A basic extended Euclidean algorithm or multi-step variant leads to
     |x|<|b| and |y|<|a| or something like that, but there are probably
     two solutions under just those restrictions.
<li> demos/factorize.c: use <code>mpz_divisible_ui_p</code> rather than
     <code>mpz_tdiv_qr_ui</code>.  (Of course dividing multiple primes at a
     time would be better still.)
<li> The various test programs use quite a bit of the main
     <code>libgmp</code>.  This establishes good cross-checks, but it might be
     better to use simple reference routines where possible.  Where it's not
     possible some attention could be paid to the order of the tests, so a
     <code>libgmp</code> routine is only used for tests once it seems to be
     good.
<li> <code>MUL_FFT_THRESHOLD</code> etc: the FFT thresholds should allow a
     return to a previous k at certain sizes.  This arises basically due to
     the step effect caused by size multiples effectively used for each k.
     Looking at a graph makes it fairly clear.
<li> <code>__gmp_doprnt_mpf</code> does a rather unattractive round-to-nearest
     on the string returned by <code>mpf_get_str</code>.  Perhaps some variant
     of <code>mpf_get_str</code> could be made which would better suit.
</ul>


<h4>Aids to Development</h4>
<ul>
<li> Add <code>ASSERT</code>s at the start of each user-visible mpz/mpq/mpf
     function to check the validity of each <code>mp?_t</code> parameter, in
     particular to check they've been <code>mp?_init</code>ed.  This might
     catch elementary mistakes in user programs.  Care would need to be taken
     over <code>MPZ_TMP_INIT</code>ed variables used internally.  If nothing
     else then consistency checks like size<=alloc, ptr not
     <code>NULL</code> and ptr+size not wrapping around the address space,
     would be possible.  A more sophisticated scheme could track
     <code>_mp_d</code> pointers and ensure only a valid one is used.  Such a
     scheme probably wouldn't be reentrant, not without some help from the
     system.
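     <br>
     The elementary checks might be packaged along these lines (a sketch
     using the gmp-impl.h accessor macros; a real version would also have to
     tolerate <code>MPZ_TMP_INIT</code>ed variables):
<pre>
	#define ASSERT_MPZ_OK(z)                   \
	  do {                                     \
	    ASSERT_ALWAYS (PTR(z) != NULL);        \
	    ASSERT_ALWAYS (ALLOC(z) >= 1);         \
	    ASSERT_ALWAYS (ABSIZ(z) <= ALLOC(z));  \
	  } while (0)
</pre>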
<li> tune/time.c could try to determine at runtime whether
     <code>getrusage</code> and <code>gettimeofday</code> are reliable.
     Currently we pretend in configure that the dodgy m68k netbsd 1.4.1
     <code>getrusage</code> doesn't exist.  If a test might take a long time
     to run then perhaps cache the result in a file somewhere.
<li> tune/time.c could choose the default precision based on the
     <code>speed_unittime</code> determined, independent of the method in use.
<li> Cray vector systems: CPU frequency could be determined from
     <code>sysconf(_SC_CLK_TCK)</code>, since it seems to be clock cycle
     based.  Is this true for all Cray systems?  Would like some documentation
     or something to confirm.
</ul>


<h4>Documentation</h4>
<ul>
<li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
<li> <code>mpn_get_str</code> isn't terribly clear about how many digits it
     produces.  It'd probably be possible to say at most one leading zero,
     which is what both it and <code>mpz_get_str</code> currently do.  But
     want to be careful not to bind ourselves to something that might not suit
     another implementation.
<li> <code>va_arg</code> doesn't do the right thing with <code>mpz_t</code>
     etc directly, but instead needs a pointer type like <code>MP_INT*</code>.
     It'd be good to show how to do this, but we'd either need to document
     <code>mpz_ptr</code> and friends, or perhaps fallback on something
     slightly nasty with <code>void*</code>.
</ul>


<h4>Bright Ideas</h4>

<p> The following may or may not be feasible, and aren't likely to get done in the
near future, but are at least worth thinking about.

<ul>
<li> Reorganize longlong.h so that we can inline the operations even for the
     system compiler.  When there is no such compiler feature, make calls to
     stub functions.  Write such stub functions for as many machines as
     possible.
<li> longlong.h could declare when it's using, or would like to use,
     <code>mpn_umul_ppmm</code>, and the corresponding umul.asm file could be
     included in libgmp only in that case, the same as is effectively done for
     <code>__clz_tab</code>.  Likewise udiv.asm and perhaps cntlz.asm.  This
     would only be a very small space saving, so perhaps not worth the
     complexity.
<li> longlong.h could be built at configure time by concatenating or
     #including fragments from each directory in the mpn path.  This would
     select CPU specific macros the same way as CPU specific assembler code.
     Code used would no longer depend on cpp predefines, and the current
     nested conditionals could be flattened out.
<li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000, whereas it's
     sort of supposed to return the low 31 (or 63) bits.  But this is
     undocumented, and perhaps not too important.
<li> <code>mpz_init_set*</code> and <code>mpz_realloc</code> could allocate
     say an extra 16 limbs over what's needed, so as to reduce the chance of
     having to do a reallocate if the <code>mpz_t</code> grows a bit more.
     This could only be an option, since it'd badly bloat memory usage in
     applications using many small values.
<li> <code>mpq</code> functions could perhaps check for numerator or
     denominator equal to 1, on the assumption that integers or
     denominator-only values might be expected to occur reasonably often.
<li> <code>count_trailing_zeros</code> is used on more or less uniformly
     distributed numbers in a couple of places.  For some CPUs
     <code>count_trailing_zeros</code> is slow and it's probably worth handling
     the frequently occurring 0 to 2 trailing zeros cases specially.
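     For instance (a sketch; <code>n</code> non-zero):
<pre>
	if (n & 1)
	  cnt = 0;
	else if (n & 2)
	  cnt = 1;
	else if (n & 4)
	  cnt = 2;
	else
	  count_trailing_zeros (cnt, n);
</pre>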
<li> <code>mpf_t</code> might like to let the exponent be undefined when
     size==0, instead of requiring it to be 0 as now.  It should be possible
     to do size==0 tests before paying attention to the exponent.  The
     advantage is not needing to set exp in the various places a zero result
     can arise, which avoids some tedium but is otherwise perhaps not too
     important.  Currently <code>mpz_set_f</code> and <code>mpf_cmp_ui</code>
     depend on exp==0, maybe elsewhere too.
<li> <code>__gmp_allocate_func</code>: Could use GCC <code>__attribute__
     ((malloc))</code> on this, though don't know if it'd do much.  GCC 3.0
     allows that attribute on functions, but not function pointers (see info
     node "Attribute Syntax"), so would need a new autoconf test.  This can
     wait until there's a GCC that supports it.
<li> <code>mpz_add_ui</code> contains two <code>__GMPN_COPY</code>s, one from
     <code>mpn_add_1</code> and one from <code>mpn_sub_1</code>.  If those two
     routines were opened up a bit maybe that code could be shared.  When a
     copy needs to be done there's no carry to append for the add, and if the
     copy is non-empty no high zero for the sub.
</ul>


<h4>Old and Obsolete Stuff</h4>

<p> The following tasks apply to chips or systems that are old and/or obsolete.
It's unlikely anything will be done about them unless anyone is actively using
them.

<ul>
<li> Sparc32: The integer based udiv_nfp.asm used to be selected by
     <code>configure --nfp</code> but that option is gone now that autoconf is
     used.  The file could go somewhere suitable in the mpn search if any
     chips might benefit from it, though it's possible we don't currently
     differentiate enough exact cpu types to do this properly.
<li> VAX D and G format <code>double</code> floats are straightforward and
     could perhaps be handled directly in <code>__gmp_extract_double</code>
     and maybe in <code>mpn_get_d</code>, rather than falling back on the
     generic code.  (Both formats are detected by <code>configure</code>.)
</ul>


<hr>

</body>
</html>

<!--
Local variables:
eval: (add-hook 'write-file-hooks 'time-stamp)
time-stamp-start: "This file current as of "
time-stamp-format: "%:d %3b %:y"
time-stamp-end: "\\."
time-stamp-line-limit: 50
End:
-->