Integer types

To further explain how the default C and C++ types are defined by their environment, and not by their size, let's look at the integer types. There are three main integer types: short int, int, and long int (setting aside long long int, which on Windows provides the 64-bit width that long int provides on most Unix-like systems).
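
Before looking at each type individually, the following minimal sketch shows how the size of each of these types can be queried with the sizeof operator. The sizes shown in the output are what you would typically see using g++ on a 64-bit Intel CPU running Ubuntu; other platforms may report different values:

#include <iostream>

int main(void)
{
    std::cout << "short int: " << sizeof(short int) << '\n';
    std::cout << "int: " << sizeof(int) << '\n';
    std::cout << "long int: " << sizeof(long int) << '\n';
    std::cout << "long long int: " << sizeof(long long int) << '\n';
}

// > g++ scratchpad.cpp; ./a.out
// short int: 2
// int: 4
// long int: 8
// long long int: 8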

A short int is typically smaller than an int, and on most platforms is 2 bytes in size. For example, consider the following code:

#include <iostream>
#include <limits>

int main(void)
{
    auto num_bytes_signed = sizeof(signed short int);
    auto min_signed = std::numeric_limits<signed short int>().min();
    auto max_signed = std::numeric_limits<signed short int>().max();

    auto num_bytes_unsigned = sizeof(unsigned short int);
    auto min_unsigned = std::numeric_limits<unsigned short int>().min();
    auto max_unsigned = std::numeric_limits<unsigned short int>().max();

    std::cout << "num bytes (signed): " << num_bytes_signed << '\n';
    std::cout << "min value (signed): " << min_signed << '\n';
    std::cout << "max value (signed): " << max_signed << '\n';

    std::cout << '\n';

    std::cout << "num bytes (unsigned): " << num_bytes_unsigned << '\n';
    std::cout << "min value (unsigned): " << min_unsigned << '\n';
    std::cout << "max value (unsigned): " << max_unsigned << '\n';
}

// > g++ scratchpad.cpp; ./a.out
// num bytes (signed): 2
// min value (signed): -32768
// max value (signed): 32767

// num bytes (unsigned): 2
// min value (unsigned): 0
// max value (unsigned): 65535

As shown in the previous example, the code gets the min, max, and size of both a signed short int and an unsigned short int. The results demonstrate that on a 64-bit Intel CPU running Ubuntu, a short int, whether signed or unsigned, is represented using 2 bytes.

Intel CPUs provide an interesting advantage over some other CPU architectures. An Intel CPU is known as a complex instruction set computer (CISC), meaning that the Intel instruction set architecture (ISA) provides a long list of complicated instructions, designed to provide both compilers and authors of hand-written Intel assembly with advanced features. Among these features is the ability of an Intel processor to perform arithmetic logic unit (ALU) operations (including memory-based operations) at the byte level, even though most Intel CPUs are either 32-bit or 64-bit. Not all CPU architectures provide this same level of granularity.
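
For example, the following minimal sketch performs arithmetic on a 1 byte type. On an Intel CPU, the compiler is free to carry out this addition using the byte-sized forms of the general-purpose registers (such as al), although the exact instructions chosen are compiler-dependent:

#include <iostream>

int main(void)
{
    unsigned char c = 0xFF;    // a 1 byte unsigned type holding 255

    c++;                       // wraps around to 0 at the byte level

    // the unary + promotes c to an int so that std::cout prints a
    // number instead of a character
    std::cout << +c << '\n';
}

// > g++ scratchpad.cpp; ./a.out
// 0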

To explain this better, let's look at another example, this time involving a short int:

#include <iostream>

int main(void)
{
    short int s = 42;

    std::cout << s << '\n';
    s++;
    std::cout << s << '\n';
}

// > g++ scratchpad.cpp; ./a.out
// 42
// 43

In the previous example, we take a short int, set it to the value 42, output this value to stdout using std::cout, increment the short int by 1, and then output the result to stdout again. This is a simple example, but under the hood, a lot is occurring. In this case, a 2 byte value, executing on a system that contains 8 byte (that is, 64-bit) registers, must be initialized to 42, stored in memory, incremented, and then stored in memory again before being output to stdout. All of these actions involve the CPU's registers.

On an Intel-based CPU (either 32-bit or 64-bit), these operations likely involve the use of the 2 byte versions of the CPU's registers. Specifically, Intel's CPUs might be 32-bit or 64-bit, but they provide registers that are 1, 2, 4, and 8 bytes in size (the 8 byte registers existing only on 64-bit CPUs). In the previous example, this means that the CPU loads a 2 byte register with 42, stores this value to memory (using a 2 byte memory operation), increments the 2 byte register by 1, and then stores the 2 byte register back into memory again.
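
As a rough illustration, the following is the kind of x86-64 assembly that g++ might generate for the increment above when compiling without optimizations (the exact output varies by compiler version and flags). Note that even though the addition itself may be performed in a wider register, the loads and stores touch only 2 bytes of memory through the 2 byte ax register:

// > g++ -S scratchpad.cpp (relevant excerpt; approximate)
// movw    $42, -2(%rbp)     # store 42 with a 2 byte memory operation
// movzwl  -2(%rbp), %eax    # load the 2 byte value, zero-extended
// addl    $1, %eax          # increment
// movw    %ax, -2(%rbp)     # store the 2 byte ax register back to memory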

On a reduced instruction set computer (RISC), this same operation might be far more complicated, as 2 byte registers might not exist. To load, store, increment, and store again only 2 bytes of data would require the use of additional instructions. Specifically, on a 32-bit RISC CPU, a full 32-bit value would have to be loaded into a register, and when this value is stored back to memory, the upper 16 bits (or lower, depending on alignment) of the surrounding 32-bit word would have to be preserved to ensure that only 2 bytes of memory were actually affected. The additional alignment checks, memory reads, masking, and stores would result in a substantial performance impact if a lot of such operations were taking place.
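
To make this extra work concrete, the following sketch expresses, in C++, the read-modify-write sequence that such a RISC core would effectively perform to update only 2 bytes within an aligned 32-bit word. The layout assumes the short value occupies the low 2 bytes of the word, as it would on a little-endian CPU:

#include <cstdint>
#include <iostream>

int main(void)
{
    // a 32-bit word whose lower 2 bytes hold our short value (42), and
    // whose upper 2 bytes hold unrelated data that must be preserved
    uint32_t word = 0xABCD0000U | 42U;

    // load the full 32-bit word (the only load width available)
    uint32_t reg = word;

    // extract the 2 byte value and increment it
    uint16_t value = static_cast<uint16_t>(reg & 0xFFFFU);
    value++;

    // mask off the old 2 bytes and merge in the new ones
    reg = (reg & 0xFFFF0000U) | value;

    // store the full 32-bit word back, leaving the upper 2 bytes intact
    word = reg;

    std::cout << (word & 0xFFFFU) << '\n';            // 43
    std::cout << std::hex << (word >> 16) << '\n';    // abcd
}

// > g++ scratchpad.cpp; ./a.out
// 43
// abcd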

For this reason, C and C++ provide the default int type, which typically represents a CPU register. That is, if the architecture is 32-bit, an int is 32 bits wide (with 64-bit architectures being the exception, which will be explained shortly). It should be noted that CISC architectures, such as Intel, are free to implement ALU operations with granularity smaller than the CPU's register size however they wish, which means that under the hood, the same alignment checks and masking operations could still be taking place. The take-home point is that unless you have a very specific reason to use a short int instead of an int (there are a few such reasons, a topic we will discuss at the end of this chapter), an int is, in most cases, more efficient than a smaller type, even if you don't need a full 4 or 8 bytes.

Let's look at the int type:

#include <iostream>
#include <limits>

int main(void)
{
    auto num_bytes_signed = sizeof(signed int);
    auto min_signed = std::numeric_limits<signed int>().min();
    auto max_signed = std::numeric_limits<signed int>().max();

    auto num_bytes_unsigned = sizeof(unsigned int);
    auto min_unsigned = std::numeric_limits<unsigned int>().min();
    auto max_unsigned = std::numeric_limits<unsigned int>().max();

    std::cout << "num bytes (signed): " << num_bytes_signed << '\n';
    std::cout << "min value (signed): " << min_signed << '\n';
    std::cout << "max value (signed): " << max_signed << '\n';

    std::cout << '\n';

    std::cout << "num bytes (unsigned): " << num_bytes_unsigned << '\n';
    std::cout << "min value (unsigned): " << min_unsigned << '\n';
    std::cout << "max value (unsigned): " << max_unsigned << '\n';
}

// > g++ scratchpad.cpp; ./a.out
// num bytes (signed): 4
// min value (signed): -2147483648
// max value (signed): 2147483647

// num bytes (unsigned): 4
// min value (unsigned): 0
// max value (unsigned): 4294967295

In the previous example, an int is shown to be 4 bytes even on a 64-bit Intel CPU. The reason for this is backward compatibility: a large body of existing software assumes that an int is 32 bits wide. As a result, on some architectures, the most efficient register-sized type might not be an int but rather a long int, and determining this at runtime is painful, as the instructions to execute are chosen at compile time. Let's look at the long int to explain this further:

#include <iostream>
#include <limits>

int main(void)
{
    auto num_bytes_signed = sizeof(signed long int);
    auto min_signed = std::numeric_limits<signed long int>().min();
    auto max_signed = std::numeric_limits<signed long int>().max();

    auto num_bytes_unsigned = sizeof(unsigned long int);
    auto min_unsigned = std::numeric_limits<unsigned long int>().min();
    auto max_unsigned = std::numeric_limits<unsigned long int>().max();

    std::cout << "num bytes (signed): " << num_bytes_signed << '\n';
    std::cout << "min value (signed): " << min_signed << '\n';
    std::cout << "max value (signed): " << max_signed << '\n';

    std::cout << '\n';

    std::cout << "num bytes (unsigned): " << num_bytes_unsigned << '\n';
    std::cout << "min value (unsigned): " << min_unsigned << '\n';
    std::cout << "max value (unsigned): " << max_unsigned << '\n';
}

// > g++ scratchpad.cpp; ./a.out
// num bytes (signed): 8
// min value (signed): -9223372036854775808
// max value (signed): 9223372036854775807

// num bytes (unsigned): 8
// min value (unsigned): 0
// max value (unsigned): 18446744073709551615

As shown in the preceding code, on a 64-bit Intel CPU running Ubuntu, a long int is an 8 byte value. This is not true on Windows, which represents a long int as 32 bits, with the long long int being 64 bits (once again for backward compatibility).
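
These two conventions have names: most 64-bit Unix-like systems use the LP64 data model (long ints and pointers are 64 bits), while 64-bit Windows uses the LLP64 data model (a long int stays at 32 bits, and a long long int is 64 bits). Assuming a 64-bit target, one way you might verify the data model you expect at compile time is with the predefined _WIN32 macro, as in the following sketch:

int main(void)
{
#ifdef _WIN32
    // LLP64 (64-bit Windows): long int is 4 bytes, long long int is 8
    static_assert(sizeof(long int) == 4, "expected the LLP64 data model");
#else
    // LP64 (64-bit Linux, macOS, and so on): long int is 8 bytes
    static_assert(sizeof(long int) == 8, "expected the LP64 data model");
#endif

    static_assert(sizeof(long long int) == 8, "expected a 64-bit long long int");
}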

When system programming, the size of the data you are working with is usually extremely important, and as shown in this section, unless you know exactly which CPU, operating system, and mode your application will run on, it's nearly impossible to know the size of your integer types when using the default types provided by C and C++. Most of these types should not be used when system programming, with the exception of int, which almost always represents a data type with the same bit width as the registers on your CPU or, at a minimum, a data type that doesn't require additional alignment checks and masking to perform simple arithmetic. In the next section, we will discuss additional types that overcome these size issues, along with their pros and cons.