In biological ecosystems, organisms flourish in favorable environments and languish in unfavorable ones. Coral is suited for life in tropical waters, not the desert. Different kinds of data also require different conditions to thrive.
As the data-driven economy matures, the forces shaping it are differentiating data (and by extension content) into more refined morphological branches, where particular species occupy the habitats (or market niches) with which they have co-evolved.
Intrinsic to this evolution is the systematic classification of the data being collected, manipulated, analyzed, bought and sold: health data (clinical, genetic, pharmacological), open data, personal data, scientific data, student data, financial data, systems data, metadata, audit data, business-critical data, sensor data, polling data.
In spite of this rich lexicon, too often we refer to data in generalized terms, as a vaguely homogeneous mass. This is a mistake.
Data is anything but uniform.
Fish Need Water
Derived from a variety of sources, used in innumerable applications, data requires disparate protection and maintenance routines to maximize its utility, protect its integrity, and honor its value.
Recent polls demonstrate that the non-expert public also recognizes differences among data types in the digital biome, choosing, for example, to avoid interactions online that require disclosing personal information.
This is not the same, however, as fully grasping the logical next step.
Data that is not uniform—in terms of format, lifespan, value, vulnerability, etc.—should not be treated uniformly.
This is especially true where sensitive, private, personal data is concerned.
Plant, Animal, or Mineral?
Just as different species inhabit different ecosystem niches suited to their particular physical and behavioral traits, different data types will thrive or wither under different conditions.
Data itself takes multiple forms: words, numbers, pixels, coded content organized in a table or grid. Some data is highly complex; some is simple. It’s critical to be aware of all facets of a given body of data and to understand what constitutes favorable or unfavorable conditions.
This requires unraveling data attributes and evaluating them.
Consider, for example, a digital image whose geo-location is embedded in its metadata. Though seemingly innocuous, when digital images are compiled and overlaid on a map, they can reveal more than you might think. Publicly available images posted on social media have disclosed the precise addresses of pet owners and helped track down criminals.
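How little work this takes is worth seeing. EXIF metadata stores GPS coordinates as degree/minute/second values plus a hemisphere reference; converting them to the decimal degrees a mapping service expects is a one-line calculation. A minimal sketch in plain Python (the sample coordinates are illustrative):

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert EXIF-style GPS coordinates (degrees, minutes, seconds
    plus a hemisphere reference letter) to signed decimal degrees."""
    decimal = degrees + minutes / 60.0 + seconds / 3600.0
    # Southern and western hemispheres are negative in decimal notation.
    return -decimal if ref in ("S", "W") else decimal

# A photo tagged 48 deg 51' 29.6" N, 2 deg 17' 40.2" E resolves to
# roughly (48.8582, 2.2945): a street address in Paris.
lat = dms_to_decimal(48, 51, 29.6, "N")
lon = dms_to_decimal(2, 17, 40.2, "E")
```

Any script that walks a folder of photos and applies this conversion can plot a household's daily movements.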
When delivery to a recipient is all that’s needed or when time-sensitive data expires, archiving it doesn’t make sense. The distinction between permanent and impermanent content is of paramount importance given the volume of data we generate. Storing it might be cheap (in terms of dollars if not environmental impact), but managing and securing it all is not. Data that does need to be retained faces the challenges of digital preservation (such as long-term readability and bit rot).
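One way to act on the permanent/impermanent distinction is to tag each record with a retention class at creation time and let a scheduled job decide its fate. A minimal sketch, with hypothetical class names and retention periods (real periods would come from regulation and business need, not a hard-coded map):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per data class.
RETENTION = {
    "session_log": timedelta(days=30),    # time-sensitive: expire, then delete
    "transaction": timedelta(days=2555),  # ~7 years, then archive
    "record_of_title": None,              # permanent content
}

def disposition(data_class, created_at, now=None):
    """Return 'retain', 'expire', or 'preserve' for a record."""
    now = now or datetime.now(timezone.utc)
    period = RETENTION.get(data_class)
    if period is None:
        # Permanent data faces the digital-preservation problems
        # (format migration, bit rot) rather than deletion.
        return "preserve"
    return "expire" if now - created_at > period else "retain"
```

The point of the sketch is the branch structure: deletion, archiving, and preservation are three different maintenance routines, not one.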
Incompatible files (“not supported on this device”) are frustrating. Some file types are so specialized that the data they contain is accessible only to those with an exclusive key. Computational biologists, for example, need proprietary bioinformatics tools to visualize in 3-D the interactions of drugs and their genetic targets.
Source and Access
As with the shift from Linnaean classification to phylogenetic nomenclature, our classification of data types will be fluid, adapting to outside forces of science, business, and culture. General categories today outline both the origins of and access to different types of data, including open data, restricted data (e.g., data subject to government regulations like COPPA, HIPAA, and Privacy Shield), proprietary data, and personal/sensitive/confidential/private data (e.g., PII, ePHI, genetic data, student data).
Co-Evolution, Symbiosis, and Habitats in the Information Ecosystem
Data itself is but one element of the information ecosystem. The expertise needed to do something useful with it embodies one of the forces driving the co-evolution of data and its market, prompting the wild growth of analytics and data visualization, influencing our behavior online, shifting the fulcrum of the adtech-adblocking seesaw, and tightening our focus on data protection.
Consider for example how these factors interact within the information ecosystem:
- The proliferation of niche markets for analytics, defined, in part, by the questions being asked (SMEs, consumer products, biotech, medicine, travel, etc.)
- The ascendance of data science and data visualization
  - as academic disciplines
  - and as professions (e.g., data science, predictive analytics). In fact, an insufficient supply of qualified data scientists has led to the role itself becoming fragmented and outsourced, with startups jumping into the fray at every stage of the process, from data preparation to vertically oriented data visualization
- The rising use of AI and algorithms to curate and filter
  - The relationship of technology and people in the information ecosystem can be a symbiotic one. Software algorithms have not eclipsed the need for human cognition in filtering, pattern recognition, and analytics. Intelligence analysts still outperform machines when it comes to intuition and inference and are instrumental in improving machine learning.
  - And while algorithms make information more discoverable in an environment of overload and can help separate signal from noise, they necessarily reflect the bias of their embedded assumptions.
- The increased nuance in notions of privacy and the rise of data privacy expertise
  - As more and more sensitive data finds its way into the information ecosystem, and as regulations increasingly govern the use of that data, the demand for data privacy officers will reach 28,000 at a minimum in the coming years, according to the IAPP.
  - Distinctions between legitimate monitoring and surveillance have become subtler.
  - With consumers becoming less trusting and more reluctant to share their personal, browsing, and financial data, researchers continue to look for a balanced solution to the give-and-take of personalization and control, of public interest and privacy rights.
- The concomitant differentiation of security risks
  - As with any ecosystem, parasites find ways to exploit available resources and feed off poorly protected data. Although hackers are often motivated by criminal intent, human error can also poke holes in data defenses.
  - Experts’ approach to data protection is based on a variety of factors: whether data is likely to be captured via the network or an endpoint; by an insider or an outsider; whether an attack is designed to steal the data or to introduce ransomware; and whether there is an actual breach of cyber-defenses or a manipulation of vulnerabilities (e.g., breaches at the IRS where data was accessed “not through a forcible compromise of the computer systems, but by hackers who correctly answered security questions that should have only been answerable by the actual individual”).
What’s in a Name?
To more fully understand the next life stage of this evolution, we need a taxonomic language and generally accepted classification criteria.
Big data is patently imprecise. “Big” describes a certain volume, but it implies nothing about wide-ranging sources and uses.
- Software-analytics company SAS does refer to big data’s variability (its velocity, variety, and complexity) and to its applicability to business decision-making, as does IBM.
- Many claim that big data explains the “why” (though not always, as philosopher Michael Lynch points out in The Internet of Us), while small data explains the “what.”
Small data is an even more egregious misnomer: according to most definitions, there are vast quantities of it, yet there is no consensus on what it is.
- Some say it is generated by the connected devices in the Internet of Things (i.e., specific attributes derived from sensors detecting current states, like those in so-called smart cities), or that it derives from the digital breadcrumbs of our online lives used for both customer segmentation and personalized medicine, or that it is the “lean data” processed from data streams to eliminate all but the relevant elements.
- Others, like author Martin Lindstrom, say it is the subtle, detailed observations related to human behavior (such as what people around the world are eating) collected by people not machines.
If we imagine Data electronicum as the order, below this might be the families of big data and small data, which can themselves be divided into personal data and non-personal data, which in turn include myriad species, each with numerous facets that define them.
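The hypothetical hierarchy can be sketched as a simple nested mapping; the leaf-level species below are illustrative placeholders, not a settled taxonomy:

```python
# Illustrative only: one possible shape for the order described above.
DATA_ELECTRONICUM = {
    "big data": {
        "personal data": ["browsing history", "location trails"],
        "non-personal data": ["sensor data", "systems data"],
    },
    "small data": {
        "personal data": ["health records", "personal photos"],
        "non-personal data": ["retail coupons", "polling data"],
    },
}

def species(order):
    """Flatten the taxonomy into its leaf-level species names."""
    return [s for family in order.values()
              for genus in family.values()
              for s in genus]
```

Even this toy structure makes the essay’s point concrete: policies attached at the family level (say, to all personal data) can cascade to every species beneath it.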
Mapping out data types by classifying them might be construed merely as an exercise in content management. One might argue that nomenclature is nothing more than an abstraction. So why does data differentiation matter?
Bycatch or Targeted Breed?
It matters because we need to know what we are fishing for. Knowing what we are after affects the tools we use, the areas of the ocean we trawl, the season and time of day we cast our nets.
Imagine concentric rings expanding outward from the singular point of our identities.
Data furthest from us might be historical data or data over which we have little or no control like roadway images or government records.
Closer in are things like health information and records of financial transactions, student records and HR files: these are held by fewer institutions but still beyond our reach.
Closest are those things that we can or might wish to control like personal photos, email, and text messages.
All of this data is nominally “ours.” It is, after all, about us. And it is often generated by us. But much of it is collected, shared, or stored unnecessarily, without our consent, or without a fair exchange.
Have you ever wondered why a weather app needs to check your location every 10 minutes? Or whether a photo-sharing app really needs access to the physical addresses, birthdays, and other notes recorded in your digital rolodex?
All of the bycatch in data collectors’ nets crowds out legitimate uses, exposes latent data to misuse, and expends resources unnecessarily.
Defining data with greater precision by citing attributes relevant to a specific use will help us control more and waste less.
Issues of controlling and protecting data and using it efficiently pivot on the value we assign to it. When we make decisions about collecting, retaining, and securing data, we must first appraise it and consider its utility, usability, and ROI.
Some correlations are completely spurious. A lot of data is unstructured and difficult to decipher or inaccessible because of regulatory restrictions. And in some cases, its shelf life is too short to bother with.
Sometimes what we catch isn’t worth the bait.
No Fish in a Tree
You won’t find a tropical plant in the Arctic. Fish belong in water.
A retail coupon serves no purpose once it has expired. Genetic data doesn’t belong in the digital vault of an online bank.
We need to stop treating data as uniform. Instead, we should look at all of the characteristics that define a given type of data, and develop and implement tools and policies that allow us to act on those distinctions appropriately.
Differentiating data can be a win-win: fewer resources, greater security, better ROI.
Image: junko | Pixabay