Garbage In, Garbage Out: Why Data Quality Is the Critical Foundation for AI-Driven Insights
A guide to using data to solve your business problems and satisfy your stakeholders
In my time as a data scientist and analyst, I’ve learned that data comes in all shapes, sizes, and forms. In 2023, we now have access to state-of-the-art generative AI and deep learning models that can uncover insights and patterns like never before. However, in our excitement over these new technologies, we sometimes forget that the data we use is still a pivotal factor in effectively solving real business problems.
No matter how advanced our algorithms become, the old computer science motto still applies: garbage in, garbage out. If the data going into our models is of poor quality, inaccurate, biased, or incomplete, the insights coming out will be fundamentally flawed. Fixing data quality issues must be the priority before applying any analytics or AI, no matter how sophisticated.
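To make this concrete, here is a minimal sketch of the kind of pre-modeling checks I mean, using pandas. The file name and the `revenue` and `order_date` columns are hypothetical stand-ins for your own schema:

```python
import pandas as pd

# Hypothetical sales extract; the file and column names are assumptions.
df = pd.read_csv("sales.csv")

# Basic quality checks to run before any analytics or modeling.
checks = {
    "missing values": df.isna().sum().sum(),
    "duplicate rows": df.duplicated().sum(),
    "negative revenue": (df["revenue"] < 0).sum(),
    "orders dated in the future": (
        pd.to_datetime(df["order_date"]) > pd.Timestamp.now()
    ).sum(),
}

for name, count in checks.items():
    status = "OK" if count == 0 else f"FAIL ({count} rows)"
    print(f"{name}: {status}")

# Gate the modeling step on the checks: garbage in, garbage out.
assert all(count == 0 for count in checks.values()), "Fix data quality first."
```

Nothing here is sophisticated, and that is the point: these checks catch the flaws that would otherwise flow silently into whatever model comes next.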
As this year comes to a close, it’s a good time for data scientists, analysts, data engineers, and software developers to refocus on the basics of our craft. That starts with obtaining, cleaning, and organizing data in ways that allow AI systems to genuinely extract meaningful, trustworthy insights that solve real-world business challenges. Mastering data quality and integrity has been and will always be the key to unlocking the full potential of AI.
Translating Data
Rather than only “speaking data” to your stakeholders, effective data professionals learn to “translate data” into the language of business impact.
🚨 Avoid at all costs letting generic data statements like the following, which often come from stakeholders, go unexamined:
- We need to implement master data management!
- Our models require more relevant, clean training data.
- We need to care about data quality.
When these statements are made without awareness of their technical implications, they create confusion and an uphill battle when it comes to convincing management of what exactly you are trying to solve.
💡 By contrast…
It is more important to address such statements by explaining data concepts in terms of how they directly affect business outcomes:
- Improving data quality will reduce errors in our sales forecasts by 20%, allowing us to better predict revenue swings.
- Master data management will reduce duplicates and errors that currently cost us $XXX per year.
- Better training data will improve the accuracy of our demand forecasting model by Z%.
The key is relating data and technical concepts to tangible business goals and outcomes. This gives stakeholders a clear understanding of the value you aim to unlock.
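For instance, the dollar figure in the master data management statement above does not have to be a guess. Here is a rough sketch that turns a duplicate count into an estimated annual cost; the customer file, matching keys, and per-duplicate cost are all assumptions to replace with your own numbers:

```python
import pandas as pd

# Hypothetical customer master table; file, columns, and cost are assumptions.
customers = pd.read_csv("customers.csv")
COST_PER_DUPLICATE = 25.0  # assumed yearly handling cost per duplicate record

# Rows that repeat an earlier record on the chosen matching keys.
duplicates = customers.duplicated(subset=["email", "postal_code"]).sum()
annual_cost = duplicates * COST_PER_DUPLICATE

print(f"{duplicates} duplicate customer records ≈ ${annual_cost:,.0f} per year")
```

Even a back-of-the-envelope figure like this gives management something concrete to weigh against the cost of a fix.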
The Data Translator’s Guide to Stakeholder Success
- Remember that data is abstract. Most people aren’t trained to visualize insights from raw data. Telling someone about a single row in a spreadsheet is easy. Summarizing thousands of rows is hard. This abstraction is what your stakeholders face when making requests.
- Discover the business context. Ask about the reasons behind each request and how the end product will be used. Focus questions on the business needs, not data details.
- Map business needs to data solutions. Use your expertise to translate business requirements into data requirements.
- Explain solutions in business terms. Without technical jargon, be able to clearly explain:
  - The original business challenge
  - Your proposed data solution
  - How stakeholders can immediately apply those insights
The goal is to make complex data concepts tangible and actionable for business stakeholders. By bridging the language gap between data and business, you enable data-driven decision-making. When you help others win, you win too.
Data Quality for Generative AI
The need for quality data is not limited to traditional analytics; it also applies to generative AI systems. Just as garbage inputs produce garbage outputs in data analysis, poorly framed prompts will lead to inaccurate or nonsensical AI-generated responses.
Users must carefully craft prompts to provide sufficient context and set the appropriate tone. This “prompt engineering” is similar to the data wrangling that analysts undertake — clean, comprehensive inputs are essential for extracting meaningful insights, whether using AI or traditional analytics. High-quality data remains the foundation, determining the success or failure of advanced algorithms. So while generative AI is transforming what’s possible with language models, it does not alter the timeless principle: garbage in, garbage out.
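As a small illustration of that analogy, compare an underspecified prompt with one that supplies context, audience, and constraints. Both prompts are invented examples, not tied to any particular model or API:

```python
# A vague prompt: the model has to guess the audience, scope, and format.
vague_prompt = "Summarize our sales data."

# An engineered prompt: context, constraints, and tone are explicit,
# just as clean, well-documented data constrains an analysis.
engineered_prompt = (
    "You are a data analyst reporting to a non-technical sales VP. "
    "Summarize Q3 sales in under 100 words: name the top three regions "
    "by revenue and call out any quarter-over-quarter declines."
)

for label, prompt in [("vague", vague_prompt), ("engineered", engineered_prompt)]:
    print(f"{label}:\n{prompt}\n")
```

The second prompt does for the model what data wrangling does for an analysis: it removes ambiguity before any computation happens.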
My Recommendation:
After reading Joe Reis and Matt Housley’s book Fundamentals of Data Engineering, I was inspired to write an article sharing key takeaways with my machine learning publication’s audience. Though I drafted the piece months ago, it sat unfinished, collecting e-dust. Now, given my latest work emphasizing quality data practices, it feels timely to revisit that draft and finally publish those insights here.
Here is a tiny excerpt from the book:
“The practicalities of getting value from data are typically poorly understood, but the desire exists. Reports or analyses lack formal structure, and most requests for data are ad hoc. While it’s tempting to jump headfirst into ML at this stage, we don’t recommend it.”
What matters here is understanding the steps needed to define data quality, set goals and desired business outcomes, and secure buy-in from key stakeholders.
While the book dives deep into technical data engineering, its insights matter for any data practitioner. Quality data remains the crucial puzzle piece for successful machine learning. I highly recommend reading this book to learn their end-to-end perspective on transforming data into value. No matter your role, understanding data engineering best practices is key to unlocking the full potential of AI.
_________________________________________________________________
Please share if you find this insightful. Until next time ✌️