Readers unfamiliar with how flash SSDs actually write data may be surprised at how different the process is from that of hard disk drives (HDDs).
Like all other magnetic storage media, HDDs store data as bits, whereas flash SSDs store data in blocks. When data is changed on an HDD, the bits are simply altered in place. When data changes on a flash SSD, however, what happens can seem bizarre to those not versed in the underlying NAND technology.
Unlike a magnetic HDD, NAND flash has to be in an erased state before data can be written to it. There is no "write-in-place" capability as there is with HDDs. If data has already been written to the NAND, that data must be erased before the NAND can receive new data. Each erasure degrades a thin layer of insulating material in the cells, which is why NAND tolerates only a finite number of erase cycles.
A brief explanation of how NAND is organized should clarify matters somewhat, although the process may still feel awkward.
NAND memory essentially comprises two types of structures, known as pages and blocks. A page is most commonly 4 KB or 2 KB (other sizes exist, but these are the most common) and is the unit of reading and writing. With 4 KB pages, pages are grouped into blocks of 32 pages (128 KB) or 128 pages (512 KB). NAND reads and writes are performed at the page level. In contrast, erases are performed at the block level.
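The page and block layout can be sketched in a few lines of Python. This is only an illustration: the 4 KB page, the 128 pages per block and the helper name locate are assumptions chosen for the example, not values from any particular device.

```python
PAGE_SIZE = 4 * 1024         # bytes per page (assumed: the common 4 KB size)
PAGES_PER_BLOCK = 128        # assumed: 128 pages, which gives a 512 KB block
BLOCK_SIZE = PAGE_SIZE * PAGES_PER_BLOCK


def locate(byte_offset):
    """Return (block index, page index inside that block) for a byte offset.

    Hypothetical helper: reads and writes address a page,
    erases address the whole block that contains it.
    """
    page = byte_offset // PAGE_SIZE
    return page // PAGES_PER_BLOCK, page % PAGES_PER_BLOCK
```

Any two offsets that share a block index are erased together, even if only one of their pages needs to change.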
Before new data can be written to a page that already holds data, the entire block containing that page must be erased. This is a very inefficient, lengthy operation. Because the block's still-valid pages have to be rewritten as well, the NAND ends up absorbing more writes than the host requested, an effect referred to as write amplification. Comparing the respective latencies makes the cost clear: a read takes about 55 μs, a write about 500 μs and an erase about 900 μs. For longevity, writes should also be spread evenly across the whole volume, which is known as wear leveling.
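To put rough numbers on that cost, the following sketch models a naive in-place update that saves the unchanged pages, erases the block and programs every valid page again. It uses the latencies quoted above and assumes a 128-page block and strictly serial operations; real controllers work in parallel and avoid this pattern, so the figures are illustrative only. The function name naive_update_cost is hypothetical.

```python
READ_US, WRITE_US, ERASE_US = 55, 500, 900   # per-operation latencies quoted above
PAGES_PER_BLOCK = 128                        # assumed geometry: 128 pages per block


def naive_update_cost(pages_changed, valid_pages):
    """Cost of changing `pages_changed` pages in a block holding `valid_pages`
    valid pages, using read / erase / rewrite of the whole block."""
    if not 0 < pages_changed <= valid_pages <= PAGES_PER_BLOCK:
        raise ValueError("need 0 < pages_changed <= valid_pages <= PAGES_PER_BLOCK")
    reads = valid_pages - pages_changed      # unchanged pages are saved first
    writes = valid_pages                     # every valid page is programmed again
    latency_us = reads * READ_US + ERASE_US + writes * WRITE_US
    return {
        # physical pages written per page the host actually changed
        "write_amplification": writes / pages_changed,
        # assumes the operations run one after another
        "latency_us": latency_us,
    }
```

Under these assumptions, changing a single page in a full block means programming 128 pages, while rewriting the whole block at once brings the ratio down to 1.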
SSD controllers must keep track of both written and unwritten blocks. They do so to level wear evenly across all of the blocks and to track which blocks hold invalid data. Blocks with invalid data are designated for erasure, a process known as garbage collection. These designated blocks are erased before they are needed so that the high erasure latency stays out of the write path.
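A toy version of that bookkeeping might look like the sketch below. It assumes a greedy policy (erase the block with the most invalid pages) and a simple wear-leveling rule (reuse the erased block with the fewest erase cycles). Actual controller firmware is proprietary and far more elaborate, and the names Block, collect and allocate are invented for this illustration.

```python
from dataclasses import dataclass, field

PAGES_PER_BLOCK = 4  # tiny on purpose so the example is easy to trace


@dataclass(eq=False)  # compare blocks by identity, not by contents
class Block:
    # Each slot holds a logical page number, None (erased) or "invalid".
    pages: list = field(default_factory=lambda: [None] * PAGES_PER_BLOCK)
    erase_count: int = 0

    def invalid_count(self):
        return sum(1 for p in self.pages if p == "invalid")

    def valid(self):
        return [p for p in self.pages if p is not None and p != "invalid"]


def collect(blocks, free_pool):
    """Garbage collection (assumed greedy policy): erase the block with the
    most invalid pages and return the logical pages that must be relocated."""
    victim = max(blocks, key=Block.invalid_count)
    survivors = victim.valid()               # still-valid data has to move first
    victim.pages = [None] * PAGES_PER_BLOCK  # the erase itself
    victim.erase_count += 1
    blocks.remove(victim)
    free_pool.append(victim)                 # back into the pool of writable blocks
    return survivors


def allocate(free_pool):
    """Wear leveling (assumed rule): hand out the least-worn erased block."""
    block = min(free_pool, key=lambda b: b.erase_count)
    free_pool.remove(block)
    return block
```

Running collect in the background, ahead of demand, is what keeps allocate from ever having to wait on an erase.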
Once the blocks are erased, they are put back into the pool of blocks available for writes. If garbage collection does not keep up with the volume of writes, performance degrades precipitously. This is one reason why most SSDs are overprovisioned, with roughly 20% of their NAND held in reserve. A 200 GB SSD will typically have 256 GB of NAND to avoid running out of available write blocks.
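The arithmetic behind that example is worth spelling out, because overprovisioning can be quoted either as a share of the raw NAND or relative to the usable capacity. The helper below is a hypothetical illustration of both conventions.

```python
def overprovisioning(raw_gb, usable_gb):
    """Return spare capacity as (share of raw NAND, ratio to usable capacity)."""
    spare = raw_gb - usable_gb
    return spare / raw_gb, spare / usable_gb


# The 200 GB drive with 256 GB of NAND from the text above:
share_of_raw, relative_to_usable = overprovisioning(256, 200)
# 56 GB spare: about 22% of the raw NAND, or 28% on top of the usable 200 GB
```

The "roughly 20%" figure matches the first convention, spare capacity as a share of the raw NAND.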
Page and block size are configurable in most systems by the storage administrator. Larger blocks tend to be used less efficiently than smaller blocks. The upside is lower write latency for random blocks, because the SSD controller has far fewer blocks to track. Small block sizes deliver more efficient utilization, but at the cost of higher latency. Large blocks are better for sequential writes. Small blocks are better for random writes.
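The utilization side of this trade-off can be illustrated with a deliberately simple model. It assumes each write is placed in its own block or blocks, which is a worst case for small random writes rather than a description of how any real system allocates space. The function name utilization is hypothetical.

```python
import math


def utilization(write_kb, block_kb):
    """Fraction of the allocated block space that one write actually fills,
    assuming the write does not share its blocks with any other data."""
    blocks_needed = math.ceil(write_kb / block_kb)
    return write_kb / (blocks_needed * block_kb)
```

Under that assumption an 8 KB random write fills 1/16 of a 128 KB block but only 1/64 of a 512 KB block, while a 1 MB sequential write fills 512 KB blocks completely.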
When SSDs are configured for large blocks and writes are highly randomized, the wear life of the SSD is shortened, because large amounts of unwritten NAND (the unutilized portion of each block) are erased more frequently. Thus, the storage administrator's challenge becomes balancing performance (latency) for random and sequential writes while optimizing wear life. This is a nontrivial, labor-intensive challenge.
Several hybrid- and SSD-only storage system vendors such as Nexsan, Nimble, Nimbus, Oracle, Tegile and others as well as some storage stack software such as NexentaStor have gone to great lengths to solve this challenge. (NOTE: the ones that do are not shy in noting it in their marketing materials.)
Using clever algorithms, these systems load up on DRAM or NVRAM, then leverage this memory to cache SSD writes. DRAM and NVRAM are much faster than flash, with latencies measured in nanoseconds versus microseconds. That high-speed cache enables the storage system to take what would have been random writes of various sizes, aggregate them and turn them into sequential writes. Sequentializing flash SSD writes changes the game. It enables the storage system to get the best of all worlds: very high block utilization on large pages and very low write amplification, combined with the lowest possible latency.
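The vendors' algorithms are proprietary, but the basic idea of aggregating cached writes can be sketched as follows. This minimal example assumes the cache is a dictionary keyed by page number and that a flush simply sorts the pending pages and merges neighbors into contiguous runs. The function name coalesce and the data layout are assumptions made for the illustration.

```python
def coalesce(pending):
    """Turn buffered {page_number: data} writes into sorted, contiguous runs.

    Returns a list of (first_page, [data, ...]) tuples, one per run.
    """
    runs = []
    for page in sorted(pending):
        if runs and page == runs[-1][0] + len(runs[-1][1]):
            runs[-1][1].append(pending[page])     # page extends the current run
        else:
            runs.append((page, [pending[page]]))  # gap, so a new run starts
    return runs


# Five writes that arrived in random order while sitting in the cache
pending = {7: "h", 2: "c", 3: "d", 8: "i", 4: "e"}
runs = coalesce(pending)   # two runs: pages 2-4 and pages 7-8
```

Because the cache is keyed by page number, repeated writes to the same page overwrite each other in memory and reach the flash only once, which is another way caching trims write amplification.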
It is crucial that systems with this functionality cannot lose cached data to an unexpected power loss before the writes are committed to the SSD or HDD media. This is not a problem with NVRAM, which is DRAM with a supercapacitor (supercap) or battery backup that provides enough power for the cache to complete its writes to the SSD or HDD.
Shrewd use of DRAM/NVRAM caching provides solid state drive storage systems with better performance, longer flash SSD wear-life and a better user experience.