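Suppose you have a variable of a floating-point type. Whether it is 32-bit or 64-bit does not matter.
You assign the largest representable finite value to the variable. Most programming languages provide a constant for this.
How do you determine the least value that must be added to the variable so that it "snaps" to infinity?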
I am aware of functions such as nextafter* and nexttoward* in C and next_up in Rust. These are related, but they do not give me the value I need.
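Here is a solution for float in C. It uses FLT_ROUNDS, whose value may change during program execution. If the program does change the rounding mode, this code should use #pragma STDC FENV_ACCESS ON to inform the compiler that it depends on the floating-point environment.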
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
#if !defined INFINITY
    printf("Infinity is not representable, so no value added to %a can produce infinity.\n", FLT_MAX);
#else
    /* a will be set to the largest value that can be added that will
       not produce infinity, and b will be set to the smallest value that
       will produce infinity.
    */
    float a, b;
    switch (FLT_ROUNDS)
    {
        case 0:  // Toward zero.
        case 3:  // Downward, toward negative infinity.
        {
            // With rounding downward or toward zero, no finite value will round to +infinity.
            a = FLT_MAX;
            b = INFINITY;
            break;
        }
        case 1:  // To nearest, ties to even.
        case 4:  // To nearest, ties away from zero.
        {
            // Determine the ULP at FLT_MAX.
            float u = FLT_MAX - nexttowardf(FLT_MAX, 0);
            // The smallest value that will produce infinity is half an ULP.
            b = u/2;
            a = nexttowardf(b, 0);
            break;
        }
        case 2:  // Upward, toward positive infinity.
        {
            // With rounding upward, adding any positive value will produce infinity.
            // Adding zero leaves FLT_MAX unchanged.
            a = 0;
            b = FLT_TRUE_MIN;
            break;
        }
        case -1:  // Indeterminable, or, rather, the implementation will not tell us.
        {
            // Check whether boundary is between FLT_MAX and INFINITY.
            if (FLT_MAX + FLT_MAX < INFINITY)
            {
                /* Adding FLT_MAX does not produce infinity, so infinity
                   is the smallest value that does.
                */
                a = FLT_MAX;
                b = INFINITY;
            }
            else
            {
                // Otherwise, do a binary search.
                a = 0;
                b = FLT_MAX;
                float middle;
                while (nexttowardf(a, b) != b)
                {
                    middle = (a + b) / 2;
                    if (FLT_MAX + middle < INFINITY)
                        a = middle;
                    else
                        b = middle;
                }
            }
            break;
        }
        default:
        {
            printf("FLT_ROUNDS is %d, which does not conform to the C 2024 standard.\n", FLT_ROUNDS);
            exit(EXIT_FAILURE);
        }
    }

    printf("The smallest value that will produce infinity is %a.\n", b);
    printf("Demonstration:\n");
    printf("\t%a + %a = %a.\n", FLT_MAX, a, FLT_MAX + a);
    printf("\t%a + %a = %a.\n", FLT_MAX, b, FLT_MAX + b);
#endif
}