For many enterprises, data lakes offer a promising opportunity to leverage new data sources, explore new types of data science, and run self-service analytics with the speed and freedom that the rigors of traditional database approaches might not allow.
It is tempting to think you can just set up a data lake and be off and running. The idea is appealing: take data as it sits in transactional systems and use it for analytics without transformation, modeling, cataloging, or curating. As with most things, though, if it seems too good to be true, it probably is.
The same goes for self-service analytics from a data lake. While it sounds appealing, the idea that every user who is given access to the data lake will be able to understand the data, model it, and produce meaningful insights has fallen short in practice. In reality, there’s a lot to get right before a data lake delivers real value.
Why most data lake projects fail, and how to avoid the pitfalls
Like any database initiative, getting value from a data lake involves aligning people, skills, technology, and governance. That alignment is what ensures quality in how data is created and managed. Let’s look at some common reasons that data lake projects fail to realize business value, and what you can do to ensure that your business succeeds. The areas to get right include:
- Alignment with the business
- Lack of skills
- Data quality and consistency
- Data curation and cataloging
- Governance, access, and control
- Operationalization of the data lake and its value
Alignment with the business
When it comes to data lakes, organizations need to answer an important question: Whose baby is this anyway? The data lake needs to align with the organization’s business needs and be supported by IT. This is not an IT project. It is an enterprise solution that requires IT to work with business stakeholders to understand their goals and objectives. At the same time, it is not a business project that can do without IT. IT is needed to understand the data, the systems, and what the data means, and IT will own the operationalization of the solution moving forward. So whose baby is it? It’s everyone’s.
Lack of skills
A lack of skills is the single largest impediment to data lake success. These aren’t just the architectural skills needed to construct or load the data lake. Enterprises also lack the skills to understand, manipulate, and gain insight from the data once it is in the lake. Half of all respondents in an IBM survey stated they are still not sure where to get the “best” data in their enterprise.
Data quality and consistency
Data lakes are built as projects with a defined scope, discrete resources, and a set of available skills. Standing up the lake, loading data, securing it, and providing access to users (and perhaps a tool to access the data) is the usual first step. But data in transactional systems is not clean. Placed next to data from another system, it lacks consistency. Even the same data from the same source will present different values over time as systems, people, and policies change. Users, meanwhile, expect good, clean (that is, “best”) data.
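To make the consistency problem concrete, here is a minimal Python sketch of one way to surface it: group records for the same entity across source systems and report the fields on which the sources disagree. The record layout, the `customer_id` key, and the `crm` and `billing` source names are assumptions made up for illustration, not part of any particular data lake product.

```python
from collections import defaultdict


def find_conflicts(records, key="customer_id"):
    """Group records by key and report fields whose values differ across sources.

    Assumes each record is a dict carrying a "source" field and the key field.
    """
    by_key = defaultdict(list)
    for rec in records:
        by_key[rec[key]].append(rec)

    conflicts = {}
    for k, recs in by_key.items():
        # Compare every field except the key itself and the source label.
        fields = set().union(*(r.keys() for r in recs)) - {key, "source"}
        for field_name in sorted(fields):
            values = {r["source"]: r.get(field_name) for r in recs}
            if len(set(values.values())) > 1:
                conflicts.setdefault(k, {})[field_name] = values
    return conflicts


# Hypothetical sample data: the same customer as seen by two systems.
records = [
    {"source": "crm", "customer_id": 1, "email": "ann@example.com", "country": "US"},
    {"source": "billing", "customer_id": 1, "email": "ann@example.com", "country": "USA"},
    {"source": "crm", "customer_id": 2, "email": "bob@example.com", "country": "CA"},
]

conflicts = find_conflicts(records)
print(conflicts)
```

A report like this does not decide which value is “best”; it only shows where a decision is needed, which is typically where business and IT owners have to agree on a rule.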
Data curation and cataloging
Data lakes should hold all the data in the enterprise and make it available to whoever needs it, in the format they need. Here, data curation is essential. Without it, employees who need data have no idea what data is available. Finding the right data often becomes a trial-and-error process of calling multiple people and hoping they know where it is. Typically, without curation, nobody has an overview of what data is where, what its quality is, and how to access it. Data is also replicated many times, each time in a unique form. With the right approach to data curation and cataloging, businesses can avoid these issues.
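As a rough illustration of what a catalog records, the sketch below keeps one entry per data set answering the three questions above (where it is, what its quality is, how to access it) and lets users search by keyword. The `CatalogEntry` and `Catalog` classes, their fields, and the sample data sets are all hypothetical; a real catalog tool would offer far more, but the core idea is the same.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str            # unique data set name
    location: str        # where the data lives in the lake
    owner: str           # who to ask about it
    quality: str         # e.g. "raw", "cleaned", "certified"
    how_to_access: str   # short instructions for requesting or querying it
    tags: list = field(default_factory=list)


class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Refuse duplicates so the same data set is not cataloged twice.
        if entry.name in self._entries:
            raise ValueError(f"{entry.name!r} is already cataloged")
        self._entries[entry.name] = entry

    def search(self, keyword):
        """Return entries whose name or tags contain the keyword (case-insensitive)."""
        kw = keyword.lower()
        return [
            e for e in self._entries.values()
            if kw in e.name.lower() or any(kw in t.lower() for t in e.tags)
        ]


# Hypothetical entries for illustration only.
catalog = Catalog()
catalog.register(CatalogEntry(
    name="customer_orders", location="/lake/curated/sales/orders",
    owner="sales-data-team", quality="certified",
    how_to_access="request read access to the curated zone", tags=["sales"],
))
catalog.register(CatalogEntry(
    name="web_clickstream", location="/lake/raw/web/clicks",
    owner="marketing-analytics", quality="raw",
    how_to_access="open to all analysts", tags=["marketing", "customer behavior"],
))

matches = [e.name for e in catalog.search("customer")]
print(matches)
```

Rejecting a duplicate name on registration is one small guard against the many-copies problem described above.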
Governance, access, and control
Data lakes tend to fall into one of two categories: 1) open to any and all in the enterprise, or 2) locked down so that all access must be requested, which typically eliminates the value for exploration and discovery. A better approach sits between the two: organize data into zones by security, retention, and quality requirements. Each zone should have data owners assigned, with two-part authorization from both a data business owner and a data system owner.
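One way to picture two-part authorization is the small Python sketch below, in which a user can read a zone only after both the business owner and the system owner have approved. The `Zone` class, its method names, and the people named in the example are assumptions for illustration; in practice this logic would live in the access-control features of the lake platform rather than in application code.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    name: str
    business_owner: str
    system_owner: str
    # Maps each user to the set of owner roles that have approved them.
    approvals: dict = field(default_factory=dict)

    def __post_init__(self):
        # Two-part authorization only means something with two distinct owners.
        if self.business_owner == self.system_owner:
            raise ValueError("business and system owner must be different people")

    def approve(self, approver, user):
        """Record an approval for user; only the zone's two owners may approve."""
        if approver == self.business_owner:
            role = "business"
        elif approver == self.system_owner:
            role = "system"
        else:
            raise PermissionError(f"{approver} does not own zone {self.name}")
        self.approvals.setdefault(user, set()).add(role)

    def can_access(self, user):
        """A user gets access only once both owner roles have approved."""
        return self.approvals.get(user, set()) == {"business", "system"}


# Hypothetical zone and people for illustration.
zone = Zone(name="curated", business_owner="alice", system_owner="bob")

zone.approve("alice", "carol")
after_one_approval = zone.can_access("carol")

zone.approve("bob", "carol")
after_both_approvals = zone.can_access("carol")

print(after_one_approval, after_both_approvals)
```

A more open zone could skip the approval step entirely, which is how zoning lets exploration stay easy while sensitive data remains controlled.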
Operationalization of the data lake and its value
The data lake is limited if it has all the data but no one can use it. It’s limited if data scientists find amazing insights about a policy, campaign, or process but those insights can’t be automated back into the source systems. It’s limited if everyone goes looking for data on a self-service basis and comes up with slightly different answers. On the other hand, a data lake that is part of an arsenal of data tools, from transactional sources to warehouses to marts, allows for right-sized solutions to each problem. It is a key part of the enterprise data estate. The data lake that provides value to business users is easy to understand, manage, maintain, secure, extend, and gain insights from. Everybody wins!
Create a clean data lake with help from Lighthouse
There is a lot more to consider than just picking a data lake platform and throwing your data into it. Here at Lighthouse, we have the experience to help you find the value in your data lake project. We assist with data strategy, health assessments of current lakes, tool selection, training, and value workshops. We reinforce people, process, and technology as pillars for the data lake.
Avoid the data swamp. Let’s work together toward a crystal-clear data lake that creates real value!