The ANSI standard for C has a provision that allows expressions to be evaluated in single-precision arithmetic if there is no double (or long double) operand in the expression. The C compiler supports this provision.
Floating point constants are double-precision, unless explicitly stated to be float. For example, in the statements

    float a,b;
    ...
    a = b + 1.0;

because the constant 1.0 has type double, b is promoted to double before the addition and the result is converted back to float. However, the constant can be made explicitly a float:

    a = b + 1.0f;

or

    a = b + (float) 1.0;

In this case, the statement can potentially be compiled to a single instruction. Single-precision operations tend to be faster than double-precision operations.
Whether a computation can be done in single-precision is decided based on the operands of each operator. Consider the following:
    float s;
    double d;
    d = d + s * s;

Here, s * s is computed to produce a single-precision result, which is promoted to double-precision and added to d. Note that using single-precision (as opposed to double-precision) arithmetic can result in loss of precision, as illustrated in the following example.
    float f = 8191.f * 8191.f;   /* evaluate as a float */
    double d = 8191. * 8191.;    /* evaluate as a double */
    printf("As float: %f\nAs double: %f\n", f, d);

The result is:

    As float: 67092480.000000
    As double: 67092481.000000

Also, long int variables (same as int) have more precision than float variables. Consider the following example:
    int i,j;
    i = 0x7fffffff;
    j = i * 1.0;
    printf("j = %x\n", j);
    j = i * 1.0f;
    printf("j = %x\n", j);

The first printf() statement outputs 7fffffff, while the second prints 0. The second printf() prints 0 because the nearest float to 0x7fffffff has a value of 0x80000000. When the value is converted to an integer, the result is 0, and a floating point imprecise result exception occurs. A trap occurs if this exception is enabled.
A function that is declared to return a float may actually return either a float or a double. If the function declaration is a prototype declaration in which at least one of the parameters is float, the function returns a float. Otherwise, it returns a double with precision limited to that of a float. (All of this is transparent.) For example:
    float retflt(float);   /* actually returns a float */
    float retdbl1();       /* actually returns a double */
    float retdbl2(int);    /* actually returns a double */

Arguments work as follows:

    double takeflt(float x);      /* takes a float */
    double takedbl(x) float x;    /* takes a double */