Data Deduplication

Compression and deduplication technologies play an especially important role as data volumes grow exponentially.

Data deduplication continues to develop rapidly in storage systems for backups, archives, virtual machine images, and similar data.

Its use not only saves storage capacity but also speeds up saving and restoring data. Many storage systems will soon offer deduplication as a built-in feature.

First there was magnetic tape, which was the sole means of storing backups. Then capacious yet inexpensive hard disk drives appeared. Several years ago, disk systems enhanced with data deduplication mechanisms were introduced. Such systems avoid storing identical fragments more than once, and they now occupy an intermediate position between tape and hard disks.

Each of these technologies has its own advantages and drawbacks. Fortunately, the drawbacks of one can be offset by the advantages of the others if they are combined in a hierarchical D2D2T (disk-to-disk-to-tape) system enhanced with deduplication.

There are many intuitive approaches to implementing data deduplication. It can operate at the file level or the block level, and either in real time or during later processing of data that has already been copied, in any combination. Regardless of the particular choice, the essence is the same: before saving a new data fragment, the system computes and records its “fingerprint” using one of several algorithms. If the system later encounters a fragment whose “fingerprint” is already known, that fragment is not saved again; only a reference to its existing copy is stored.
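The fingerprint scheme described above can be illustrated with a minimal sketch. It is not any vendor's implementation: the class name DedupStore, the fixed block size, the in-memory dictionary, and the choice of SHA-256 as the fingerprint algorithm are all assumptions made for illustration.

```python
import hashlib


class DedupStore:
    """Toy in-memory block store that keeps one copy of each unique block.

    Hypothetical example: fixed-size blocks and SHA-256 fingerprints are
    assumptions, not the method of any particular product.
    """

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # fingerprint -> block bytes, stored only once

    def write(self, data):
        """Split data into blocks and return the list of fingerprints
        needed to rebuild it."""
        recipe = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            fingerprint = hashlib.sha256(block).hexdigest()
            # Store the block only if its fingerprint is not yet known.
            if fingerprint not in self.blocks:
                self.blocks[fingerprint] = block
            # Either way, record a reference to the stored copy.
            recipe.append(fingerprint)
        return recipe

    def read(self, recipe):
        """Rebuild the original data from its list of fingerprints."""
        return b"".join(self.blocks[fp] for fp in recipe)

    def stored_bytes(self):
        """Total size of the unique blocks actually kept."""
        return sum(len(block) for block in self.blocks.values())
```

Writing the same data twice adds new references but no new blocks, which is where the capacity saving comes from. A real system would also have to persist the index and decide how to handle fingerprint collisions.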

The reliability of the process obviously depends on how unique the fingerprints are, which in turn depends on the algorithm selected.
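To give a rough sense of how fingerprint length affects uniqueness, the sketch below uses the standard birthday-bound approximation. It assumes an idealized algorithm whose fingerprints are uniformly distributed; the function name collision_probability is hypothetical, and real algorithms may behave worse than this ideal.

```python
import math


def collision_probability(num_fragments, fingerprint_bits):
    """Approximate chance that at least two of num_fragments distinct
    fragments share a fingerprint, assuming uniformly distributed
    fingerprints of the given bit length (birthday approximation):

        p ~= 1 - exp(-n * (n - 1) / 2**(bits + 1))
    """
    pairs_term = num_fragments * (num_fragments - 1)
    exponent = -pairs_term / 2 ** (fingerprint_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)
```

Under this assumption, a 32-bit fingerprint is almost certain to collide among a million fragments, while a 256-bit fingerprint gives a negligible probability even for a trillion fragments.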

In terms of efficiency, conventional compression with well-known archiving algorithms halves the data volume on average. File-level deduplication, as used in CAS (content-addressed storage), reduces the volume by a factor of three to four, while moving to blocks or even smaller portions (called chunks) can potentially reduce it by a factor of up to 20.
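It can help to convert these reduction factors into the share of capacity saved. The short sketch below does only that arithmetic; the helper names space_saved and stored_size are hypothetical, and the factors are the approximate figures quoted above, not measurements.

```python
def space_saved(reduction_factor):
    """Fraction of the original capacity saved for a given reduction factor."""
    return 1.0 - 1.0 / reduction_factor


def stored_size(original_size, reduction_factor):
    """Capacity still required after reduction, in the same units as the input."""
    return original_size / reduction_factor


# Approximate factors quoted in the text above.
for label, factor in [("compression", 2), ("file level", 4), ("block level", 20)]:
    print(f"{label}: {space_saved(factor):.0%} saved, "
          f"{stored_size(10.0, factor):.2f} TB kept out of 10 TB")
```

The gains flatten as the factor grows: going from a factor of 2 to a factor of 4 saves another 25% of the original volume, while going from 4 to 20 saves a further 20%.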

Block-level data deduplication technologies are usually provided by companies specializing in virtual tape libraries, such as Avamar (acquired by EMC Corporation), Symantec PureDisk, Asigra, Data Domain, Diligent Technologies, FalconStor, Sepaton, and Quantum. Network Appliance offers proprietary solutions as well.

File-level data deduplication technologies are offered by EMC in its Centera product line, by Hitachi Data Systems following its acquisition of Archivas, and by Caringo.