- Mastering Node.js (Second Edition)
- Sandro Pasquali, Kevin Faaborg
Why use streams?
Presented with a fancy new language feature, design pattern, or software module, a novice developer may begin using it because it is new and fancy. An experienced developer, on the other hand, might ask, why is this required?
Streams are required because files are big. A few simple examples can demonstrate their necessity. To begin, let's say we want to copy a file. In Node, a naive implementation looks like this:
// First attempt
const fs = require('fs');

console.log('Copying...');
let block = fs.readFileSync("source.bin");
console.log('Size: ' + block.length);
fs.writeFileSync("destination.bin", block);
console.log('Done.');
It's very straightforward.
The call to readFileSync() blocks while Node copies the contents of source.bin, a file in the same folder as the script, into memory, returning a Buffer here named block.
Once we have block, we can check and print out its size. Then, the code hands block to writeFileSync(), which writes that memory block out to a newly created or overwritten file, destination.bin.
This code assumes the following things:
- It's OK to block the event loop (it's not!)
- We can read the whole file into memory (we can't!)
As you will recall from the previous chapter, Node processes one event after another, a single event at a time. Good asynchronous design allows a Node program to appear to be doing all sorts of things at once, to connected software systems and human users alike, while offering developers a straightforward presentation of logic that's easy to reason about and resistant to bugs. This is especially true when compared to multithreaded code that might be written to solve the same task. Your team may have even turned to Node to build an improved replacement for such a classically multithreaded system. Also, good asynchronous design never blocks the event loop.
Blocking the event loop is bad because Node can't do anything else while your one line of blocking code runs. The previous example, written as a rudimentary script that copies a file from one place to another, might work just fine. It would block the user's terminal while Node copies the file. The file might be small enough that there's little time to wait. If not, you could open another shell prompt while you're waiting. In this way, it's really no different from familiar commands like cp or curl.
From the computer's perspective, this is quite inefficient, however. Each file copy shouldn't require its own operating system process.
Additionally, incorporating the previous code into a larger Node project could destabilize the system as a whole.
Your server-side Node app might be simultaneously letting three users log in, while sending large files to another two. If that app executes the previous code as well, two downloads will stall, and three browser throbbers will spin.
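You can watch the stall on a small scale with a sketch along these lines (reusing source.bin as the large file): a repeating timer goes quiet for as long as readFileSync() holds the event loop.
// Watching a blocking call starve the event loop (sketch)
const fs = require('fs');

// A timer that should tick every 100 milliseconds.
const timer = setInterval(() => console.log('tick'), 100);

// After half a second, start the blocking read. No ticks appear
// until readFileSync() returns, because nothing else can run.
setTimeout(() => {
  const block = fs.readFileSync('source.bin');
  console.log('Read ' + block.length + ' bytes');
  clearInterval(timer);
}, 500);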
So, let's try to fix this, one step at a time:
// Attempt the second
const fs = require('fs');

console.log('Copying...');
fs.readFile('source.bin', null, (error1, block) => {
  if (error1) {
    throw error1;
  }
  console.log('Size: ' + block.length);
  fs.writeFile('destination.bin', block, (error2) => {
    if (error2) {
      throw error2;
    }
    console.log('Done.');
  });
});
At least now we're not using Node methods that have Sync in their names. The event loop can breathe freely again.
But still:
- How about big files? (Big explosions)
- That's quite a pyramid you've got there (of doom)
Try the previous code with a 2 GB (2 x 2^30, or 2,147,483,648 bytes) source file:
RangeError: "size" argument must not be larger than 2147483647
at Function.Buffer.allocUnsafe (buffer.js:209:3)
at tryCreateBuffer (fs.js:530:21)
at Object.fs.readFile (fs.js:569:14)
...
If you're watching a video on YouTube at 1080p, 2 GB will last you about an hour. The previous RangeError happens because 2,147,483,647 is 1111111111111111111111111111111 in binary, the largest signed 32-bit integer. Node uses that type internally to size and address the contents of a Buffer.
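The exact ceiling depends on your Node version and platform; you can ask your own build what it will allow via the buffer module's constants:
// How large a Buffer will this Node build allocate?
const { constants } = require('buffer');
console.log('Largest allowed Buffer: ' + constants.MAX_LENGTH + ' bytes');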
What happens if you hand our poor example a smaller, but still very large, file? The result is less deterministic. When it works, it does so because Node successfully gets the required amount of memory from the operating system. The memory footprint of the Node process grows by the file size during the copy operation. Mice may turn to hourglasses, and fans may noisily spin up. Would promises help?:
// Attempt, part III
const util = require('util');
const fs = require('fs');
// fs has no built-in readFileAsync/writeFileAsync; make
// promise-returning versions with util.promisify.
const readFileAsync = util.promisify(fs.readFile);
const writeFileAsync = util.promisify(fs.writeFile);

console.log('Copying...');
readFileAsync('source.bin').then((block) => {
  console.log('Size: ' + block.length);
  return writeFileAsync('destination.bin', block);
}).then(() => {
  console.log('Done.');
}).catch((e) => {
  // handle errors
});
No, essentially. We've flattened the pyramid, but the size limitation and memory issues remain in force.
What we really need is code that is both asynchronous and works piece by piece, grabbing a little part of the source file, shuttling it over to the destination file for writing, and repeating that cycle until we're done, like a bucket brigade from antique firefighting.
Such a design would let the event loop breathe freely the entire time.
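Without streams, you could wire up that bucket brigade yourself with the lower-level fs.open(), fs.read(), and fs.write() calls. A rough sketch (the 64 KB bucket size is an arbitrary choice) might look like this:
// A hand-rolled bucket brigade (sketch only)
const fs = require('fs');

const BUCKET_SIZE = 64 * 1024; // arbitrary chunk size
const bucket = Buffer.alloc(BUCKET_SIZE);

fs.open('source.bin', 'r', (err, fdIn) => {
  if (err) throw err;
  fs.open('destination.bin', 'w', (err, fdOut) => {
    if (err) throw err;
    const pass = () => {
      // Read the next chunk from the current position in the source...
      fs.read(fdIn, bucket, 0, BUCKET_SIZE, null, (err, bytesRead) => {
        if (err) throw err;
        if (bytesRead === 0) { // end of file
          fs.close(fdIn, () => {});
          fs.close(fdOut, () => console.log('Done.'));
          return;
        }
        // ...write it to the destination, then go back for another bucketful.
        fs.write(fdOut, bucket, 0, bytesRead, (err) => {
          if (err) throw err;
          pass();
        });
      });
    };
    pass();
  });
});
It works, and it never blocks the event loop, but all of the bookkeeping is ours to get right.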
This is exactly what streams are:
// Streams to the rescue
const fs = require('fs');

console.log('Copying...');
fs.createReadStream('source.bin')
  .pipe(fs.createWriteStream('destination.bin'))
  .on('close', () => { console.log('Done.'); });
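As an aside, newer Node versions (10 and later) also ship stream.pipeline(), which builds the same chain and funnels an error from any stream in it into a single callback; a sketch of the same copy:
// The same copy with stream.pipeline() (Node 10+)
const fs = require('fs');
const { pipeline } = require('stream');

pipeline(
  fs.createReadStream('source.bin'),
  fs.createWriteStream('destination.bin'),
  (err) => {
    if (err) {
      console.error('Copy failed:', err);
    } else {
      console.log('Done.');
    }
  }
);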
In practice, scaled network applications are typically spread across many instances, requiring that the processing of data streams be distributed across many processes and servers. Here, a streaming file is simply a stream of data partitioned into slices, where each slice can be viewed independently irrespective of the availability of others. You can write to a data stream, or listen on a data stream, free to dynamically allocate bytes, to ignore bytes, to reroute bytes. Streams of data can be chunked, many processes can share chunk handling, chunks can be transformed and reinserted, and data flows can be precisely emitted and creatively managed.
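As a small taste of that flexibility, a Transform stream can sit in the middle of such a chain and rewrite each chunk as it passes through. A minimal sketch, with uppercasing standing in for any per-chunk transformation:
// A Transform stream rewrites chunks in flight.
const { Transform } = require('stream');

const upperCaser = new Transform({
  transform(chunk, encoding, callback) {
    // Hand each chunk onward, transformed.
    callback(null, chunk.toString().toUpperCase());
  }
});

// Pipe stdin through the transform and on to stdout, for example:
// $ echo "hello streams" | node upper.js
process.stdin
  .pipe(upperCaser)
  .pipe(process.stdout);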
Recalling our discussion on modern software and the Rule of Modularity, we can see how streams facilitate the creation of independent share-nothing processes that do one task well, and in combination, can compose a predictable architecture whose complexity does not preclude an accurate appraisal of its behavior. If the interfaces to data are uncontroversial, the data map can be accurately modeled, independent of considerations about data volume or routing.
Managing I/O in Node involves managing data events bound to data streams. A Node Stream object is an instance of EventEmitter. This abstract interface is implemented in numerous Node modules and objects, as we saw in the previous chapter. Let's begin by understanding Node's Stream module, then move on to a discussion of how network I/O in Node is handled via various Stream implementations; in particular, the HTTP module.
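You can confirm that inheritance directly before diving in:
// Streams are EventEmitters.
const fs = require('fs');
const stream = require('stream');
const { EventEmitter } = require('events');

const readable = fs.createReadStream('source.bin');
console.log(readable instanceof stream.Readable); // true
console.log(readable instanceof EventEmitter);    // true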