In our quest to create fair, unbiased, and inclusive AI, or make informed data decisions, identifying appropriate data is crucial. Not all data is valid, and using the wrong data can lead to flawed insights and perpetuate biases. Let’s explore the characteristics of “good” data, how to question data effectively, and the ethical considerations that should guide our data use.
Data’s Moral Alignment
Let’s clear up a key point before we begin. Data is benign. Data is inert. Like specks of sand, data sits on the beach, waiting to be made into a sandcastle.
Any data or AI system is made up of three parts:
The data (the sand)
The model - rules and method of transforming the input of data to an intended outcome (the buckets and spades, the design, the moat and placement on the beach)
The human using it (me, as a 5-year-old determined to build a sandcastle so large I could live there)
Despite the attempts of Apple, OpenAI, Meta and Google to make us believe, through Scarlett Johansson-voiced[1] chatbots, that AI has a consciousness with its own thoughts, feelings, and intentions - it does not.
Data without context or purpose provides little to no value. As we work our way up the Data-Information-Knowledge-Wisdom (DIKW) pyramid, data gains its value from context, from the application of experience, and from distilling the data to best address the question at hand.
Source: DataCamp
While the path from data to wisdom may not always be as linear as it looks, data is the bedrock.
It’s at the second or third layers, where humans get involved, that the biases, morals and rogue-like actions start to appear, and that is where a lot of AI regulation focuses (including the newly released EU AI Act). It has less to do with the data itself, and more to do with intent: how and why the systems and humans plan to use it.
So as we discuss ‘good’ data, it’s less about the data itself, and more about its characteristics, its sources, and our intent.
Characteristics of “Good” Data
Imagine you’re developing a training program for your team. You have access to a vast amount of data on employee performance, but how do you ensure that the data you’re using is valid and unbiased?
Credibility is paramount. Data should come from reliable sources and be verifiable. Think of it like using a trusted industry report versus an unverified blog post. The former is more likely to provide accurate insights (except for this newsletter. I’m very credible… - Ed).
Unbiased data is essential to avoid perpetuating existing inequalities. For example, if historical performance data favours certain departments due to biased evaluation criteria, using this data without adjustments could continue this trend, unfairly disadvantaging other departments. The adjustment could mean either gathering more data for the other departments, or weighting the data differently in analysis. This is a great quandary for data analysts and data scientists, as sampling data correctly can be complex.
Once you have the right data, explainability is an important factor as well: being able to explain where the data came from and how it was used helps build trust with the audience.
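For the more technically inclined, here is a minimal sketch of what “weighting data differently” could look like. Everything in it is an assumption for illustration: the department names, the scores, and the helper functions `balanced_weights` and `weighted_mean` are hypothetical. It only rebalances how much each department counts in an average; it does not correct scores that were produced by biased evaluation criteria in the first place.

```python
from collections import Counter

# Hypothetical records: (department, performance score). Illustrative only.
records = [
    ("sales", 80), ("sales", 90), ("sales", 85), ("sales", 85),
    ("support", 60), ("support", 70),
]


def balanced_weights(rows):
    """Weight each row so every department carries the same total weight."""
    counts = Counter(dept for dept, _ in rows)
    n_departments = len(counts)
    return [1.0 / (n_departments * counts[dept]) for dept, _ in rows]


def weighted_mean(rows, weights):
    """Average the scores using the supplied per-row weights."""
    total = sum(w * score for (_, score), w in zip(rows, weights))
    return total / sum(weights)


weights = balanced_weights(records)

# Unweighted: sales dominates because it has twice as many rows.
naive_mean = sum(score for _, score in records) / len(records)

# Weighted: each department contributes equally, whatever its row count.
balanced_mean = weighted_mean(records, weights)

print(f"naive mean: {naive_mean:.2f}, balanced mean: {balanced_mean:.2f}")
```

With these made-up numbers the unweighted average leans towards the over-represented department, while the balanced average sits halfway between the two department averages. Which of the two is “right” depends on the question you are asking, which is exactly why sampling and weighting deserve care.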
Data should also be easy to interpret. Clear and concise data presentation helps stakeholders make informed decisions. Imagine trying to understand a complex chart with too many variables; it would be confusing and unhelpful.
Actionable data provides insights that can lead to meaningful actions. It’s like having a dashboard that not only shows current performance metrics but also highlights areas for improvement.
Lastly, data must be relevant to the problem at hand. Using outdated or irrelevant data is like trying to navigate a city with an old map; it won’t get you where you need to go.
How to Question Data
Critical thinking is essential when working with data. Start by asking, “Where is this data from?” Understanding the source helps you assess how reliable, and therefore how credible, the data is.
Consider whether the data is buried in too much irrelevant information. Filtering out noise ensures that your analysis is focused and accurate. Ask yourself, “Is this data being filtered in any way?” Knowing how data is processed can reveal potential biases or distortions.
Timeliness is another factor to consider. Is the data from an appropriate time? Outdated data can lead to inaccurate insights and decisions. Assessing the appropriateness of data for the specific context ensures relevance and accuracy.
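As a small illustration of the timeliness question, the sketch below separates records into “fresh” and “stale” based on when they were collected. The records, the `collected` field, the one-year threshold, and the `split_by_age` helper are all hypothetical; what counts as an appropriate time window depends entirely on your context.

```python
from datetime import date, timedelta

# Hypothetical records with a collection date. Illustrative only.
records = [
    {"id": 1, "collected": date(2019, 3, 1)},
    {"id": 2, "collected": date(2023, 11, 15)},
    {"id": 3, "collected": date(2024, 5, 20)},
]


def split_by_age(rows, as_of, max_age_days):
    """Split rows into (fresh, stale) using a cutoff of as_of - max_age_days."""
    cutoff = as_of - timedelta(days=max_age_days)
    fresh = [r for r in rows if r["collected"] >= cutoff]
    stale = [r for r in rows if r["collected"] < cutoff]
    return fresh, stale


# Assumed threshold: anything older than 365 days is treated as stale.
fresh, stale = split_by_age(records, as_of=date(2024, 6, 1), max_age_days=365)

print("fresh ids:", [r["id"] for r in fresh])
print("stale ids:", [r["id"] for r in stale])
```

Setting the stale records aside rather than silently deleting them keeps the decision visible, so someone can still question whether the threshold was the right one.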
Another question, both for spotting bias and for understanding the data’s value, is whether the data is influenced by external factors: political, economic, social, technological, legal, and environmental (PESTLE). Understanding these influences helps in grasping the broader context of your data. Invisible Women by Caroline Criado Perez is an excellent study of this phenomenon.
Identifying appropriate data is a foundational step in any data or AI project. By focusing on the characteristics of good data, questioning data critically, and adhering to ethical principles, we can ensure that our insights are accurate, fair, and actionable.
As we continue to explore the world of data and AI, let’s strive to use data responsibly and ethically, promoting a more inclusive and equitable future.
I’d love to continue this conversation with you in the comments.
What does identifying appropriate data mean to you?
What questions do you ask before using data - or have you never questioned data before?
Bookshelf:
In this article I reference Invisible Women by Caroline Criado Perez.
Invisible Women by Caroline Criado Perez reveals how data bias in a world designed for men systematically disadvantages women. The book highlights the pervasive gender data gap and its profound impact on women’s lives, from everyday products to public policies.
Do you have other books you’d like to recommend?
If you’d like to share this with your colleagues or friends, I would be eternally grateful, as I’d like to make this conversation as diverse and engaging as possible.
Thank you for joining me in this exploration of data & AI from what (I hope) is a slightly different perspective from the one you usually hear.
Yours in data,
Neil
[1] This has nothing to do with data or AI, but invoking Scarlett Johansson gives me an opportunity to mention that I recently watched a streamed recording of a great theatre piece, Chris Grace As Scarlett Johansson.