Deduplication for CORTX

Data storage needs continue to grow exponentially, driven for example by IoT and increasing digitization. Deduplication enables data centers to use the available storage capacity more efficiently by eliminating redundant data. The key principle is to store identical blocks only once and to keep reference pointers when multiple objects or files contain the same data. However, efficient and scalable deduplication concepts are required to maintain high throughput and low latency in large-scale storage systems.
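
The principle of storing identical blocks once and keeping reference pointers can be illustrated with a minimal in-memory sketch. It is not the project's engine: the class name `DedupStore`, fixed-size blocks, SHA-256 fingerprints, and the reference-counting scheme are all assumptions chosen for illustration.

```python
import hashlib


class DedupStore:
    """Toy in-memory block store (hypothetical) that keeps each distinct block once."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}    # fingerprint -> block bytes (stored once)
        self.refcount = {}  # fingerprint -> number of references to the block
        self.objects = {}   # object name -> ordered list of fingerprints

    def put(self, name, data):
        # Overwriting an object first releases the references it held.
        if name in self.objects:
            self.delete(name)
        recipe = []
        for off in range(0, len(data), self.block_size):
            block = data[off:off + self.block_size]
            fp = hashlib.sha256(block).hexdigest()
            if fp not in self.blocks:
                self.blocks[fp] = block  # first occurrence: store the data
            self.refcount[fp] = self.refcount.get(fp, 0) + 1
            recipe.append(fp)            # every occurrence: store a pointer
        self.objects[name] = recipe

    def get(self, name):
        # Rebuild the object by following its reference pointers.
        return b"".join(self.blocks[fp] for fp in self.objects[name])

    def delete(self, name):
        # Drop references; free a block once nothing points to it.
        for fp in self.objects.pop(name):
            self.refcount[fp] -= 1
            if self.refcount[fp] == 0:
                del self.refcount[fp]
                del self.blocks[fp]


store = DedupStore(block_size=4)
store.put("a", b"AAAABBBBAAAA")
store.put("b", b"BBBBCCCC")
print(len(store.blocks))  # number of distinct blocks held for both objects
```

In this example the two objects contain five blocks in total, but only the distinct ones are kept; deleting an object frees a block only when its reference count drops to zero.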

Building on previous research by our workgroup, we will integrate a deduplication engine for online workloads into Seagate's recently released open-source object storage system CORTX. In contrast to existing object storage systems, deduplication will be performed partly on the client side so that processing performance scales with the number of accessing clients. The project will also empirically analyze the limiting factors that affect deduplication ratio and performance, such as inline vs. post-processing deduplication, the impact of distinct workload patterns with varying locality, and the capabilities of different storage technologies.
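
One common way to move part of the deduplication work to the client is to let the client chunk and fingerprint the data and transfer only the blocks the server does not yet hold. The sketch below illustrates that general idea under stated assumptions; `Server`, `missing`, `upload`, and `client_put` are hypothetical names and do not reflect the CORTX API or the protocol used in this project.

```python
import hashlib


class Server:
    """Hypothetical storage server that indexes blocks by fingerprint."""

    def __init__(self):
        self.blocks = {}         # fingerprint -> block bytes
        self.objects = {}        # object name -> ordered list of fingerprints
        self.bytes_received = 0  # payload bytes actually transferred

    def missing(self, fingerprints):
        # Tell the client which fingerprints are not stored yet.
        return {fp for fp in fingerprints if fp not in self.blocks}

    def upload(self, name, recipe, new_blocks):
        # Assumption: a real server would re-verify each fingerprint here.
        for fp, block in new_blocks.items():
            self.blocks[fp] = block
            self.bytes_received += len(block)
        self.objects[name] = recipe


def client_put(server, name, data, block_size=4096):
    """Chunk and hash on the client, then send only the unknown blocks."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    recipe = [hashlib.sha256(c).hexdigest() for c in chunks]  # client-side CPU work
    needed = server.missing(recipe)
    new_blocks = {fp: c for fp, c in zip(recipe, chunks) if fp in needed}
    server.upload(name, recipe, new_blocks)
    return len(new_blocks)  # number of blocks that had to be transferred


server = Server()
client_put(server, "a", b"AAAABBBB", block_size=4)
client_put(server, "b", b"BBBBCCCC", block_size=4)
print(server.bytes_received)  # payload bytes sent, compared with 16 bytes written
```

Because hashing runs on each client, the fingerprinting throughput grows as clients are added, and duplicate blocks never cross the network.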

Project Partners

Seagate Technology

Funding Period

10/2020 - 10/2021

External Links

CORTX GitHub Repository

Contact

Prof. Dr.-Ing. André Brinkmann
Nicolas Krauter
Patrick Raaf

Publications

2024

Patrick Raaf, Andre Brinkmann, Eric Borba, Hossein Asadi, Sai Narasimhamurthy, John Bent, Mohamad El-batal, and Reza Salkhordeh. 2024. From SSDs Back to HDDs: Optimizing VDO to Support Inline Deduplication and Compression for HDDs as Primary Storage Media. ACM Transactions on Storage 20, 4.