Will the cloud make Hadoop obsolete?

IT strategy

In times of big data, MapReduce, and NoSQL ("not only SQL"), the data warehouse, a concept from the 1980s, seems outdated. As a proven means of data integration and provision, it has been established in companies for years. In view of ever-increasing data volumes and ever-higher demands on analyses, however, it seems forced to capitulate.

State of the art from the 1980s

But let's go back in time a little. About 20 years ago, the data warehouse was introduced as a new concept to integrate information from a company's operational sources and to provide analyses and reports for managing the business. Not much has changed in this respect to this day: data warehouses combine and process data automatically using extraction, transformation, and loading (ETL) processes. For special applications or business areas, copies of the data warehouse can also be set up as data marts. Data marts are usually modeled multidimensionally and can therefore be used optimally by analytical applications.
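
The ETL idea can be sketched in a few lines. This is a toy illustration only: the source rows and the table and column names (`fact_orders` and so on) are hypothetical and not taken from any particular product.

```python
import sqlite3

# Minimal ETL sketch (illustrative only): source data and the
# warehouse schema below are made up for this example.

# Extract: rows as they might arrive from two operational systems
crm_rows = [(1, "Alice", "DE"), (2, "Bob", "US")]
order_rows = [(101, 1, 250.0), (102, 1, 99.5), (103, 2, 400.0)]

# Load target: a simple relational warehouse table
dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE fact_orders "
            "(order_id INT, customer TEXT, country TEXT, amount REAL)")

# Transform: join customer master data onto the order records
customers = {cid: (name, country) for cid, name, country in crm_rows}
for order_id, cid, amount in order_rows:
    name, country = customers[cid]
    dwh.execute("INSERT INTO fact_orders VALUES (?, ?, ?, ?)",
                (order_id, name, country, amount))

# Reports now run against the integrated table
total_by_country = dwh.execute(
    "SELECT country, SUM(amount) FROM fact_orders "
    "GROUP BY country ORDER BY country"
).fetchall()
print(total_by_country)  # → [('DE', 349.5), ('US', 400.0)]
```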

Relational database systems are mature

Data warehouses and data marts form the central data basis for all of a company's analyses and reports. Although this has not changed even in the age of big data, they have a decisive disadvantage. They are based on relational database management systems (RDBMS), which offer mature, highly developed database software, a powerful query language (SQL), many available front ends, and high reliability and consistency. However, scaling an RDBMS can become problematic at extremely high data volumes: commercial database software quickly becomes expensive (license costs), and performance drops, which slows down both ETL processes and queries. With large models, extending the database schema becomes difficult, and the sheer frequency at which data is generated and stored can also be a challenge. If unstructured data is to be processed as well, relational databases quickly reach their limits, because they were not designed for it.

For this reason, different technological approaches have emerged to circumvent these limitations. I consider some of them below.

New players: analytical databases

Analytical databases extend the data warehouse in a simple way. Although they, too, are based on relational database systems, they are optimized for fast queries and are therefore particularly suitable for large data volumes and analytical applications. Analyses are carried out with SQL or BI software, so controllers and business analysts can work with them as well.

Analytical databases rely on techniques such as column orientation, massively parallel processing (MPP), data compression, and in-memory storage. That means with them you can:

  • use data volumes that would otherwise be too slow to process and too expensive to store

  • run queries with SQL

  • connect BI front ends

  • implement quickly with little administration effort

  • do without a complex hardware architecture.
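
The effect of column orientation can be shown with a toy sketch in plain Python (this is not a real analytical database, and the attribute names are invented): each attribute lives in its own array, so aggregating one measure touches only that column rather than every full record.

```python
# Toy illustration of column orientation (not a real column store):
# a row store keeps whole records together, a column store keeps each
# attribute in its own array.

rows = [
    ("2024-01-01", "DE", 120.0),
    ("2024-01-01", "US", 80.0),
    ("2024-01-02", "DE", 200.0),
]

# Column store: one list per attribute
columns = {
    "date":    [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "revenue": [r[2] for r in rows],
}

# Aggregating revenue reads only the revenue column,
# not the dates and countries stored alongside it
total_revenue = sum(columns["revenue"])
print(total_revenue)  # → 400.0
```

In a real analytical database this layout also compresses far better, since each column holds values of one type.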

With the help of data marts, analytical databases can easily be combined with a data warehouse. One alternative is to load the data marts from the data warehouse so that they access its integration layer; that, however, means considerable modeling and development effort.

Extension to NoSQL databases

NoSQL databases such as MongoDB or Cassandra are open source: schema-free, distributed, horizontally scalable, and built on a non-relational model. This avoids weaknesses of relational databases, for example in scalability and performance. By simply adding more servers, large data volumes can be processed comparatively inexpensively, and resilience can be improved by replicating across multiple servers. Their simpler, flexible schemas also make extensions easier and offer more agility in adapting and extending applications.
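
The schema-free idea can be sketched with plain Python structures (no MongoDB or Cassandra driver is used here, and the field names are made up): documents of different shapes coexist in one collection without any schema migration.

```python
# Sketch of the schema-free idea behind document stores such as
# MongoDB, using plain Python structures as a stand-in.

collection = []  # stands in for a NoSQL collection

# Two documents with different shapes coexist without a schema change
collection.append({"_id": 1, "user": "alice", "page": "/home"})
collection.append({"_id": 2, "user": "bob", "page": "/shop",
                   "referrer": "search"})  # extra field, no migration

# A simple query: all documents that carry a referrer
with_referrer = [doc for doc in collection if "referrer" in doc]
print([d["_id"] for d in with_referrer])  # → [2]
```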

Even though NoSQL systems solve many big data problems, their query languages cannot compete with the possibilities of SQL. BI applications use SQL as their query language, but as a rule NoSQL databases do not offer direct SQL access. Some of them already make it possible to create reports directly from the database via special interfaces; compared to SQL, however, the range of functions is significantly limited. NoSQL databases can nevertheless be used sensibly in combination with the data warehouse: the application databases benefit from the advantages of NoSQL, while for analyses the data is loaded into the data warehouse, where relational databases show their strengths.

1 + 1 = 3: Hadoop and data warehouse

The two basic problems with conventional data warehouse technologies are the large share of unstructured data and the sheer data volumes, which cause operating costs to rise rapidly. This is why Hadoop was developed. Thanks to its architecture, it scales extremely well and processes very large amounts of data with high performance, which makes it ideal for batch-oriented processing. Standard hardware is completely sufficient, so Hadoop's price-performance ratio is very good.

Hadoop is not a database; based on HDFS and MapReduce, it is a data archive and at the same time a platform for data analysis and processing. The system offers basic data warehouse functions such as aggregation, summation, and averaging. Combined with a data warehouse, however, the benefits of Hadoop increase even further: the results of Hadoop processing can be stored in the data warehouse or in data marts, where they can be used with all the advantages of a data warehouse platform, while the raw data itself remains only in the Hadoop system. Tools for high-performance SQL access to Hadoop data are currently being developed, but with the current state of the art, the combination with the classic data warehouse approach makes more sense.
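
The MapReduce model behind Hadoop can be sketched on a single machine. This is only an illustration of the three phases that Hadoop would distribute across a cluster, counting page hits in made-up log lines:

```python
from collections import defaultdict

# Single-machine sketch of MapReduce: map, shuffle, and reduce,
# which Hadoop would run distributed across many nodes.

log_lines = [
    "GET /home 200",
    "GET /shop 200",
    "GET /home 404",
]

# Map: emit (key, 1) pairs — here the requested page
mapped = [(line.split()[1], 1) for line in log_lines]

# Shuffle: group emitted values by key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: sum the counts per key
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # → {'/home': 2, '/shop': 1}
```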

Big data technologies in interaction

How can all these technologies work together profitably? Let's take an example: a company runs a website that generates a lot of traffic. It wants to collect the data in order to analyze visitor behavior and better understand its customers. A combined approach is needed here. The raw data for the analyses comes from the web server logs. Because NoSQL databases can handle the huge amounts of log data efficiently, the web application's data is best stored there directly. That way, new data objects can also be added easily.

In our example, hundreds of millions of new records are created every day. Traditional data warehouse techniques take too much time to process such volumes, so the data is placed in Hadoop, which enables powerful batch processing for data preparation. During this step, the log data is condensed to the level of pages, hosts, or hours and enriched with information from third-party sources.
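
The condensation step can be sketched roughly as follows; the timestamp and page fields are hypothetical, and a real Hadoop job would of course run this distributed over the raw logs:

```python
from collections import Counter

# Sketch of the batch condensation step: raw hits are rolled up to
# (hour, page) granularity before loading. The field layout is
# invented for this example.

raw_hits = [
    ("2024-01-01T10:05", "/home"),
    ("2024-01-01T10:40", "/home"),
    ("2024-01-01T11:02", "/shop"),
]

# Condense: truncate timestamps to the hour, count hits per page
hourly = Counter((ts[:13], page) for ts, page in raw_hits)
print(dict(hourly))
# → {('2024-01-01T10', '/home'): 2, ('2024-01-01T11', '/shop'): 1}
```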

In the next step, the data is aggregated and loaded into the data warehouse or analytical database. Here, SQL and related technologies enable high-performance queries, e.g. joins, groupings, complex filters, or online analytical processing (OLAP).
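
What this final analysis layer looks like can be sketched with SQLite standing in for the warehouse; the table and column names (`page_hits`, `dim_page`) are invented for illustration:

```python
import sqlite3

# Sketch of the analysis layer: aggregated hits joined with a page
# dimension in SQL. Schema and data are made up for this example.

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE page_hits (page TEXT, hour TEXT, hits INT)")
db.execute("CREATE TABLE dim_page (page TEXT, category TEXT)")
db.executemany("INSERT INTO page_hits VALUES (?, ?, ?)", [
    ("/home", "10", 2), ("/shop", "10", 5), ("/shop", "11", 3),
])
db.executemany("INSERT INTO dim_page VALUES (?, ?)", [
    ("/home", "landing"), ("/shop", "commerce"),
])

# Join the fact table with the dimension and group by category
result = db.execute("""
    SELECT d.category, SUM(h.hits)
    FROM page_hits h JOIN dim_page d ON h.page = d.page
    GROUP BY d.category ORDER BY d.category
""").fetchall()
print(result)  # → [('commerce', 8), ('landing', 2)]
```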

Coexistence of the data warehouse with the big data stores

The example shows: even in times of Hadoop, MapReduce, and co., the data warehouse concept is more relevant than ever. The challenge is merely to complement and extend the concept in such a way that its weaknesses are compensated for. It would therefore be wrong to say goodbye to the tried and tested data warehouse approach. Rather, it is the combination of old and new technologies that enables better and faster analyses on a solid data basis. (bw)