Are You Harboring Dark Data?

Dark Matters

We’ve all heard the refrain about poor data: garbage in, garbage out. A less well-recognized issue concerns data that is collected and stored but not used.

Many companies draw on only a fraction of the data they possess and often fail to derive anything useful from it. The explosion in data analytics will help redress this gap, enabling organizations to identify patterns, make predictions, and personalize products and services. The advanced analytics market is projected to grow to nearly $30B by 2019.

But data analytics relies on seeing the data that is being analyzed, and some shades of big data are difficult to discern. Most organizations retain vast quantities of this darker stuff. Some estimate that as much as 90% of big data is so-called “dark data.” Though not always shady, it is not always a valuable resource either.

Whether it is hidden in the cloud or in the dark matter of cyberspace, the dark data you harbor can be an unseen force, for better or worse.

What is Dark Data and Why Does It Matter?

Dark data refers to data that is collected and stored, then neglected. It may include data of minimal value or great potential.

  1. Sometimes the term refers to data that is undetected and therefore unusable. Often this is simply a matter of unstructured information contained in text-heavy documents or files that are not tagged or annotated in any systematic way (imagine an encyclopedia without an index).

Email is a prime example. Though often archived as a matter of policy, it is unlikely to be cataloged in a content-management system. Because there are privacy laws specific to email, knowing where it resides and what it contains is paramount.

Undetected dark data might also include personal files, like music and video, that employees store on company machines or, worse, in unsanctioned cloud apps on third-party servers. The storage costs accumulate quickly.

According to a study of companies in the UK, “A typical midsize company with 500 terabytes of data wastes nearly a million pounds [$1.5 million] each year maintaining trivial files, including … personal photos stored by 57 percent of employees, personal ID and legal documents by 53 percent, as well as music, games and videos, stored by 45 percent, 43 percent and 29 percent respectively.”

  2. Other times “dark” implies dangerous, meaning that it exposes information systems to significant risks. This includes data that is redundant, obsolete, or trivial, also known as ROT (an apt acronym). When retained beyond its useful life, it remains vulnerable to misuse.
  3. On a less sinister note, dark data can also refer to data that is simply inaccessible. In some cases it holds promise but requires transformation first, either from an outdated digital format or from a non-digital one.

There is a treasure trove of information locked up in libraries, museums, and research collections: e.g., objects, photographs, even metadata in card catalogs. These are unequivocally worth preserving in digital form, contributing as they do to innovation and scholarship.

Got ROT? Deal with Your Databerg in Four Steps

Whether it is perceived as a business risk or a potential asset, caring for all of this data is a Herculean task.

“Databergs” threaten to rip a hole in information systems. ROT alone is projected to cost organizations $891B by 2020 in storage, migration, and security.

The intangible costs of data protection are equally significant. Trust is considered the “cornerstone of the digital economy,” yet the reputational and financial risks of data breaches are too often recognized after a hacking incident, not before.

Minimizing these risks is essential. How?

  • First and foremost prevent unauthorized access.

Though no one wants to talk about it, most data breaches are the result of accidental or deliberate unauthorized access by employees. What to do? Train employees in data ethics, and implement and enforce clear information governance policies.

  • Second, address digital decay and excise the rotten bits.

Data is delicate, with a relatively short shelf life: it must be periodically accessed and migrated to ensure its integrity.

Storage media can be unstable and prone to corruption or defect, but they must be readable in the future. File formats, especially proprietary formats, quickly become outdated, as the applications needed to view them become incompatible with current operating systems and devices.
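One common way to catch this kind of decay early is a periodic "fixity check": record a checksum for each file once, then recompute it later and compare. The sketch below is a minimal illustration under our own assumptions, not a prescribed tool; the function names (`sha256_of`, `build_manifest`, `changed_files`) are hypothetical, and a real archive would also store the manifest somewhere safer than the disk it describes.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its current digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict) -> list:
    """List manifest entries that are now missing or no longer match."""
    problems = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems
```

Build the manifest when data is archived, then run `changed_files` on a schedule; anything it reports has been altered, corrupted, or lost since the manifest was taken.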

  • Third, eliminate data that isn’t needed rather than storing it indefinitely, which is wasteful and risky if no one is monitoring it.

One person’s ROT might be another’s loot, so it should be periodically purged. (Obsolescence and triviality make a strong case for temporary content, or “neverlasting” as we like to call it!)
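Before purging anything, it helps to see what has gone stale. Here is a rough sketch, assuming that "not modified for a given number of days" is an acceptable first signal of obsolescence; that threshold is our assumption, and retention rules should come from your own policy. The helper name `stale_files` is hypothetical. It uses modification time because access times are often not updated reliably by the file system.

```python
import time
from pathlib import Path
from typing import List, Optional


def stale_files(root: Path, max_age_days: float,
                now: Optional[float] = None) -> List[Path]:
    """Return files under root last modified more than max_age_days ago."""
    # `now` can be passed in to make the check repeatable; defaults to the clock.
    reference = time.time() if now is None else now
    cutoff = reference - max_age_days * 86400  # seconds per day
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    )
```

The result is a review list, not a delete list: someone still has to decide what is legally or culturally worth keeping.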

Notwithstanding the costs and risks of keeping data that holds no value, determining what to retain also has legal and cultural implications.

Culturally, we must ask: what will we commit to the digital record? Legally, we must comply with data protection and privacy regulations.

  • Finally, keep assets secure.

Cybersecurity today must protect not only the data itself, but also the data used to authenticate access to it (biometrics, both physical and behavioral, hold promise in some applications but can also be stolen for nefarious use).

Dark Data Checklist

In the simplest terms, any approach to caring for dark data will involve:

  • Identifying it—locating it, classifying it, etc.
  • Ensuring appropriate access, now and in the future, in terms of both authorization and integrity
  • Evaluating it and eliminating what isn’t needed, e.g., ROT, unrecoverable, and sensitive data
  • Protecting what is kept

And this will mean addressing some rotten habits:

  • Hoarding data
  • Misusing the corporate cloud and third-party storage apps
  • Failing to differentiate between valuable data and ROT
  • Failing to annotate data when it is captured or created
  • Racking up storage costs and leaving a huge environmental footprint
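The "redundant" part of ROT is the easiest to spot mechanically: files with byte-identical content are copies, whatever they are named. The following is a small sketch of that idea, with a hypothetical helper name (`find_duplicates`); it reads each file fully into memory, which is fine for a demonstration but would need chunked hashing for very large files.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import List


def find_duplicates(root: Path) -> List[List[Path]]:
    """Group files under root whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Identical content yields an identical SHA-256 digest.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # Keep only digests shared by more than one file.
    return [group for group in by_digest.values() if len(group) > 1]
```

Each returned group is a set of copies; keeping one and reviewing the rest is a modest first step toward a smaller databerg.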

Keeping Private Data in the Dark

With nearly every move we make online generating a steady stream of digital bits, dark data touches all of us.

We can support the use of data in research where it contributes to the common good while also holding data brokers accountable (especially when they sell health data without our specific consent).

And we can support the use of personal data to customize an offer when it benefits all parties involved.

But we must insist that possessing and using sensitive data conveys a big responsibility. To help ensure that it is well protected and treated ethically, organizations should focus on obtaining transparent consent, collecting only what is needed, selecting what’s valuable, and eliminating the rest.

We at bitpuf have chosen not to collect your data, because we believe in keeping your personal information in the dark, off the record: in a word, private.

If you believe privacy matters…

Sign up for bitpuf!


Image source: Pixabay