Data warehousing and the analytics tools it supports have come a long way since I started my career about 25 years ago. From the early days of relational database management systems (RDBMS) to the launch of high-performance analytics appliances and the emergence of cloud-first systems, each era has created new opportunities, solved old challenges, and produced a few challenges of its own.
No era has fully replaced the one before it. Instead, evolution has happened on a continuum. While many businesses have moved to the cloud, many others are still running analytics appliances, RDBMS-based warehouses, and the like. With this in mind, let’s take a look at the past, present, and future of data warehousing. This is a good opportunity to benchmark your own progress, to see where you are and think about the next best steps to meet the unique data and analytics needs of your enterprise. Let’s get started.
In the beginning, we had transactional systems
I started my career programming mainframes. These were self-contained systems with data, compute, and an application all in one monolithic system. Data wasn’t coming from everywhere in different formats. Instead, it all had to fit into one structure. Life was simpler, and it was good. But change was on the horizon.
Then along came relational database management systems. This was data warehousing 1.0. These early rounds of data warehousing in the 1990s sought to take transactional databases (like Oracle, SQL Server, MySQL, and PostgreSQL) and bend them to the will of an aggregate-aware data warehouse. Because this took a lot of effort, the database administrator (DBA) became a critical role for organizations: it took advanced skills and techniques like partitioning, indexing, striping, and clustering to turn an RDBMS into a data warehouse.
This effort solved a lot of the problems of monolithic systems. Companies began to bring in other sets of data from different counterparties and transactional systems and build out data warehouses. And again life was good. Companies had a single view of their world. But these databases were expensive. Only the biggest companies could afford them, and they were complex and rigid in their approach. Over the years, more and more was bolted onto these systems. The first year they ran really well, but they became slower and more complex with each passing year.
The early 2000s and the age of appliances
In the early 2000s, vendors started offering purpose-built data warehouse appliances, and the two rock stars were Teradata and Netezza. Instead of buying a piece of hardware that required a lot of configuration and tuning to extract data warehouse benefits, customers got appliances that came preconfigured. They were aggregate aware, they used massively parallel processing (MPP), and some, such as Netezza, used FPGA accelerators to make accessing information much faster for users. This reduced the need for extensive DBA skills. And the appliance could be up and running in days versus the weeks or months it took to enable older infrastructure.
And, again, life was good. People began using these appliances for analytics purposes, to rapidly query structured data. Unlike RDBMS approaches, analytics appliances introduced the concept of ELT: extract data, load it raw, and let the system do the transformation. For example, Netezza allowed companies to load large sets of data, and over time Netezza would self-tune the database, identifying usage patterns and delivering relevant reporting to users.
This approach limited the need for database administrators, customizations, tuning, and logical integrations. The appliance covered a lot of sins. It allowed people to do things faster, get to business solutions faster, and deal with IT a lot less.
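To make the ELT idea concrete, here is a minimal sketch in Python. It uses the standard-library sqlite3 module as a stand-in for a warehouse engine; it is not how Netezza or any other appliance works internally. The table names and sample records are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical raw order records, as they might arrive from a source system.
raw_orders = [
    ("2024-01-03", "north", "120.50"),
    ("2024-01-03", "south", "80.00"),
    ("2024-01-04", "north", "42.25"),
]

conn = sqlite3.connect(":memory:")

# Extract + Load: land the data as-is, with every column stored as text.
conn.execute("CREATE TABLE raw_orders (order_date TEXT, region TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)

# Transform: the database engine casts and aggregates after the load.
conn.execute("""
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(CAST(amount AS REAL)) AS total_amount
    FROM raw_orders
    GROUP BY region
""")

sales_by_region = dict(
    conn.execute("SELECT region, total_amount FROM sales_by_region")
)
print(sales_by_region)
```

In an ETL flow, the casting and aggregation would happen in a separate tool before the load. Here the raw table stays available, so new transformations can be added later without re-extracting from the source.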
The rise of the one-size-fits-all data lake
The appliance approach worked well on structured data. But around 2008, companies began to realize that there was huge value in unstructured and semi-structured data, things like images, PDFs, and large JSON and XML data sets. Along with this came data lakes and related platforms, such as Hadoop and MongoDB. This era represented the idea that if you put data into a large container in one place, users would be able to access any information that the business had (structured or unstructured), create linkages, and drive business value.
There were many advancements in this era, and many organizations created a lot of value for technical people, such as data scientists, who could work with this data. However, the idea that an ungroomed data lake could be used effectively in the hands of business end users was premature. Imagine going into a library and seeing all of the books on the floor. It would be very difficult to find what you need. In this situation, many business users began to build out their own sets of data using desktop tools such as Tableau, Qlik, and Excel as a workaround.
Answering the data lake challenge with the cloud
As the industry progressed, the market started building the first cloud databases to gain the benefits of structured transactional systems, simulate the performance of appliances, and make it possible to integrate data. These systems had the advantage of being in the cloud, which provided variable scale for performance loads and enabled people to experiment, explore, and derive the benefits of these systems. Drawing on a couple of decades’ worth of knowledge about how people actually consume data, cloud database systems were built to give people the data they need, with the packaging and guardrails that business users need to understand and act on it.
In addition, cloud data warehouse services (like Microsoft’s Azure offerings, Amazon Redshift, and Google BigQuery) started providing metadata on the data warehouse. This gave users a glimpse into where the data was coming from (data lineage), what the data meant (data governance), what the data should be (master data management), and what the possible uses for the data could be.
Where data warehousing on the cloud is headed now
Today’s data warehousing solutions, such as Snowflake and Netezza Performance Server, were born in the cloud and designed for structured and unstructured data. These solutions offer some key benefits. They are easy to set up and install. Someone else is building out the infrastructure since it’s cloud-based. Someone else is creating the database engine, compute, and storage, as is the case with Snowflake and IBM.
A lot of the performance benefits of appliances are built into these systems, and they retain the ability to make use of unstructured data sets. Capabilities that were part of the initial cloud offerings, such as governance, control, and master data management, have also been built in from the start. Now, these systems offer even more.
For example, Snowflake enables users to easily create a snapshot (a clone) of data without consuming additional disk space up front and allows them to use it for their own purposes, such as data science, reporting, or data warehousing. In addition, Snowflake offers the benefit of Time Travel: being able to query or restore your data as it existed at an earlier point in time, within a configurable retention period. Data sets can also be purchased and sold on the Snowflake marketplace. Add up all these benefits, and you now have the ability to design a data warehouse not just for today but for 2030.
Netezza Performance Server has similar functions, but its greatest benefit is that it still has the legacy option. If customers aren’t 100 percent cloud-committed, Netezza offers on-premises and hybrid solution options. Netezza is built on the OpenShift containerized platform and is part of the IBM Cloud Pak for Data solution, making it very portable. It can be placed on different sets of hardware, placed in any cloud, and even moved between clouds.
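The snapshot and point-in-time ideas described above for Snowflake can be illustrated with a toy model in Python. This is a conceptual sketch only: the VersionedTable class and its methods are hypothetical, and it does not represent Snowflake's actual implementation or API. The assumption is simply that committed data is immutable, so a clone can share it by reference and older versions remain readable.

```python
class VersionedTable:
    """Toy table that keeps every committed version as an immutable tuple of rows."""

    def __init__(self, versions=None):
        # Each version is an immutable tuple; a clone shares these objects
        # instead of copying the row data.
        self.versions = list(versions) if versions is not None else [()]

    def insert(self, row):
        # A write appends a new version; earlier versions stay untouched.
        self.versions.append(self.versions[-1] + (row,))

    def rows(self, version=-1):
        # Point-in-time read: return the table as of a given version.
        return self.versions[version]

    def clone(self):
        # Clone by copying the list of references, not the rows themselves.
        return VersionedTable(self.versions)


orders = VersionedTable()
orders.insert(("north", 120.5))

sandbox = orders.clone()          # shares existing versions with `orders`
sandbox.insert(("south", 80.0))   # changes the clone only

print(orders.rows())     # latest version of the original table
print(sandbox.rows())    # latest version of the clone
print(orders.rows(0))    # the original table before any insert
```

Because existing versions are never modified, the clone costs almost nothing until it diverges, and reading an earlier state is just a lookup of an older version.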
Taking the best next step for your business
Data warehousing evolution exists on a continuum. When choosing IT solutions, it’s helpful to understand the benefits and drawbacks of each approach. From there, understanding your organization’s maturity and investment priorities can go a long way in helping you find the best solution for your data and analytics needs.