Data storage needs continue to grow exponentially, driven for example by the IoT and increasing digitization. Deduplication enables data centers to use the available storage capacity more efficiently by eliminating redundant data. The key principle is to store identical blocks only once and to keep reference pointers when multiple objects or files contain the same data. However, efficient and scalable deduplication concepts are required to maintain high throughput and low latency in large-scale storage systems.
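The key principle can be illustrated with a minimal sketch. It is not the project's engine: the fixed 4096-byte block size, the SHA-256 fingerprints, the in-memory dictionaries, and the hypothetical names DedupStore, put, get and dedup_ratio are all assumptions chosen for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size in bytes (illustrative choice)


class DedupStore:
    """Toy in-memory store (hypothetical) that keeps each unique block once."""

    def __init__(self):
        self.blocks = {}    # fingerprint -> block bytes, stored only once
        self.refcount = {}  # fingerprint -> number of references to the block
        self.objects = {}   # object name -> ordered list of fingerprints

    def put(self, name, data):
        """Split data into blocks and store only blocks not seen before."""
        if name in self.objects:
            raise ValueError("overwriting objects is not handled in this sketch")
        fingerprints = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()
            if fp not in self.blocks:
                self.blocks[fp] = block  # first occurrence: store the data
            # every occurrence, new or duplicate, adds one reference
            self.refcount[fp] = self.refcount.get(fp, 0) + 1
            fingerprints.append(fp)
        self.objects[name] = fingerprints

    def get(self, name):
        """Reassemble an object by following its reference pointers."""
        return b"".join(self.blocks[fp] for fp in self.objects[name])

    def dedup_ratio(self):
        """Logical bytes written divided by physical bytes stored."""
        logical = sum(len(self.blocks[fp]) * n for fp, n in self.refcount.items())
        physical = sum(len(block) for block in self.blocks.values())
        return logical / physical if physical else 1.0
```

In this sketch, two objects that share one 4096-byte block and each contain one further distinct block occupy three stored blocks instead of four. A production system would additionally need persistent metadata, reference release on deletion, and a strategy for handling fingerprint lookups at scale.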
Building on previous research by our workgroup, we will integrate a deduplication engine for online workloads into CORTX, Seagate's recently released open-source object storage system. In contrast to existing object storage systems, deduplication will be partly performed on the client side so that processing performance scales with the number of accessing clients. The project will also empirically analyze the limiting factors that affect deduplication ratio and performance, such as inline vs. post-processing deduplication, the impact of distinct workload patterns with varying locality, and the capabilities of different storage technologies.
10/2020 - 10/2021