“What data do I keep?”

The data explosion continues worldwide, with unintended consequences for everyone involved. The pressures of the last decade, namely minimizing the load on storage while complying with mandatory retention and hold requirements, are colliding with the AI-era impulse to keep any data that could feasibly be relevant in the future. Since keeping everything is not viable, the question has rapidly become “what data do I keep?”

Starting with the obvious

The floor, as it were, is still whatever the law requires. Mandatory retention sets a lower bound on what data must be maintained, as well as guidelines for how long. Even if no other data is to be maintained, this is the data that must be kept and must be either produced or deleted on request.

Most of this data ends up being so-called “low-touch,” meaning it is not accessed or used regularly and instead is maintained purely for compliance purposes until the retention period runs out, at which point it will generally be automatically deleted.

Beyond the basics

In the AI era, more and more organizations are retaining data indefinitely. This gives them a deep pool of data to feed to automation tools that may derive utility and insights from it, but it massively increases the complexity of any data management system.

As recently as last year, many experts in the field would have focused on tiering data and optimizing for cost. Archiving properly was a matter of “targeting files that haven’t been accessed or modified for a certain period or isolating data that an inactive user ID owns, perhaps tied to a departed employee,” and shuttling them into progressively cheaper, deeper storage tiers.
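To make that selection rule concrete, here is a minimal Python sketch of how such a check might look. The FileRecord type, the is_archive_candidate function, and the one-year idle threshold are all hypothetical and chosen for illustration only; they do not describe any particular archiving product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileRecord:
    """Hypothetical metadata an archive might track for each file."""
    path: str
    owner_id: str
    last_accessed: datetime
    last_modified: datetime


def is_archive_candidate(record, now, inactive_owners,
                         max_idle=timedelta(days=365)):
    """Return True if the file looks like a candidate for cheaper storage.

    A file qualifies when it has been neither accessed nor modified within
    max_idle, or when its owner ID is in the set of inactive users.
    The 365-day default is an assumed placeholder, not a recommendation.
    """
    last_touched = max(record.last_accessed, record.last_modified)
    idle_too_long = (now - last_touched) > max_idle
    owner_inactive = record.owner_id in inactive_owners
    return idle_too_long or owner_inactive
```

In practice the threshold and the list of inactive user IDs would come from the organization's own policies and directory, and a real system would likely apply several tiers rather than a single yes-or-no decision.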

Identifying and classifying the data on these measures alone was hard enough. But now, enter automation tools. Suddenly, the low-touch data you would once have deep-frozen might become relevant down the line. Any AI trying to pull insights from your trends is only as valuable as the statistical data you can feed it. But telling apart the data that holds valuable insights from the data that is just noise can be tricky.

Solutions and complications

Some organizations have tried to brute-force the problem by simply retaining nearly everything. With storage costs rapidly spiking, though, this is not necessarily a viable long-term approach. Furthermore, more data means more security risk: it not only exposes you to compliance issues but also increases the attack surface for data breaches.

Automated data discovery and classification is the way forward; beyond a certain scale, managing data retention and sorting by hand is not viable. Archives have to take on the brunt of this work, with an automated sorting system that applies the correct retention policies based on data classification and metadata, and that makes the data easier and more efficient to identify for future use.
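As a rough illustration of applying retention policies by classification, the Python sketch below maps a classification label to a retention period and checks whether an item is past its deadline. The class names, the periods in RETENTION_DAYS, and the Item fields are invented placeholders, not legal guidance and not a description of any specific archive's behavior.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumed example policy table: classification label -> retention in days.
# Real periods depend on jurisdiction and data type.
RETENTION_DAYS = {
    "financial_record": 3650,
    "hr_record": 2555,
    "general": 365,
}


@dataclass
class Item:
    """Hypothetical archived item with the metadata the policy needs."""
    name: str
    classification: str
    created: date
    legal_hold: bool = False


def retention_deadline(item: Item) -> Optional[date]:
    """Return the date after which the item may be deleted.

    Items under a legal hold have no deadline (None). Unknown
    classifications fall back to the "general" period.
    """
    if item.legal_hold:
        return None
    days = RETENTION_DAYS.get(item.classification, RETENTION_DAYS["general"])
    return item.created + timedelta(days=days)


def is_expired(item: Item, today: date) -> bool:
    """True if the retention period has run out and no hold applies."""
    deadline = retention_deadline(item)
    return deadline is not None and today > deadline
```

The classification step itself, deciding which label an item gets, is the hard part and is deliberately left out here; this sketch only shows what happens once a label and its metadata are available.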

This ability is what will define modern archives going forward. It’s up to organizations worldwide to adapt to the new conditions, and keep evolving as the AI transformation pushes on.

Your Data in Your hands – With TECH-ARROW

by Matúš Koronthály

Image generated by Canva