ELI5: what is a data lake?

The world runs on data. Most modern technology couldn’t function without it, and businesses rely on it to operate.

But data cannot be used if it isn’t collected and stored somewhere. Indeed, as we collect ever more data, we need somewhere suitably centralised and flexible to store it.

And that’s where data lakes come in.

What is a data lake?

A data lake is a centralised platform for storing all types of data in raw format. (That is, before any transformation.) This includes structured, semi-structured, and unstructured data.

Essentially, a data lake is a comprehensive repository for data of any type, which can then be used and manipulated as needed.

Data lakes ingest data rapidly. They allow collection to occur from any number of locations and support huge volumes of incoming data.
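To make the idea concrete, here is a minimal sketch of ingestion into a toy "lake". It assumes the lake is just a local folder; the `ingest` function, the source names, and the sample payloads are all made up for illustration. A real lake would normally sit on large-scale object storage, but the principle is the same: whatever arrives is stored exactly as it arrived.

```python
import tempfile
from pathlib import Path


def ingest(lake_root: Path, source: str, filename: str, payload: bytes) -> Path:
    """Store the payload exactly as received, grouped by where it came from."""
    target = lake_root / source / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # no parsing, cleaning, or reshaping
    return target


# Hypothetical lake location: a temporary local folder.
lake_root = Path(tempfile.mkdtemp(prefix="toy_lake_"))

# Three invented payloads: structured, semi-structured, and unstructured.
csv_payload = b"id,amount\n1,19.99\n"
json_payload = b'{"device": "thermostat-7", "temp_c": 21.5}'
text_payload = b"Great service, but delivery was slow."

stored = [
    ingest(lake_root, "sales", "orders.csv", csv_payload),
    ingest(lake_root, "iot", "reading.json", json_payload),
    ingest(lake_root, "feedback", "form-001.txt", text_payload),
]

for path in stored:
    print(path.relative_to(lake_root))
```

Note that `ingest` never looks inside the payload. That is what lets the toy lake accept a spreadsheet export, a sensor reading, and a free-text comment through the same door.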

“If you think of a Data Mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

James Dixon, founder of Pentaho Corp, who coined the term “Data Lake” in 2010

N.B.: A data mart is a small, structured subset of data, typically drawn from a data warehouse or a handful of sources, that serves a single line of the business.

What is an alternative to a data lake?

The main alternative to a data lake is a data warehouse.

Unlike the data lake approach, the data warehouse method requires businesses to prepare and transform data before they store it. (I.e., format it to match the warehouse’s predefined structure, or schema.)

Data warehouses store relational data, and typically serve as a core part of a company’s business intelligence practices. The data is curated and ready to use by those with access to it.
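The warehouse approach can be sketched in a few lines. The example below uses Python's built-in `sqlite3` module as a stand-in for a warehouse; the `orders` table, the sample records, and the `transform` function are assumptions invented for illustration. The point is the order of events: each record is cleaned and shaped to fit the table before it is stored.

```python
import sqlite3

# Invented raw records, as they might arrive from a source system.
RAW_ORDERS = [
    {"id": "1", "amount": "19.99", "country": " gb "},
    {"id": "2", "amount": "5.00", "country": "US"},
]


def transform(raw: dict) -> tuple:
    """Shape one raw record to fit the table's columns and types."""
    return (int(raw["id"]), float(raw["amount"]), raw["country"].strip().upper())


def load_into_warehouse(conn: sqlite3.Connection, raw_records: list) -> None:
    """Create the fixed table structure, then store only transformed rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "id INTEGER PRIMARY KEY, amount REAL NOT NULL, country TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [transform(record) for record in raw_records],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
load_into_warehouse(conn, RAW_ORDERS)
rows = conn.execute("SELECT id, amount, country FROM orders ORDER BY id").fetchall()
print(rows)
```

Everything in the table is tidy and ready to query, but anything that does not fit the three columns (a free-text comment, say) has nowhere to go.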

However, it’s important to note that it’s not a question of whether a data lake is ‘better’ than the warehouse alternative. Both bring benefits to a business. A data lake can serve to enhance the data warehouse.

What is a data lake good for?

So, why might you opt to use a data lake method of storage?

Data lakes can store data at any scale. Additionally, there’s no need to transform data straight after ingesting it. This saves time and effort, because only the data that you actually read and use gets transformed. It also makes a data lake a lower-cost way to store large volumes of data.
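Here is a rough sketch of that "transform only what you read" idea. The raw lines and the `read_amounts` function are invented for illustration. The lines sit in the lake untouched; parsing and type conversion happen only at the moment someone asks a question, and only for the field that question needs.

```python
import json

# Invented raw lines, stored as-is. One of them is not JSON at all.
RAW_EVENTS = [
    '{"user": "ana", "amount": "12.50", "note": "first order"}',
    '{"user": "ben", "amount": "7.25"}',
    "free-text feedback: love the new app!",
]


def read_amounts(raw_lines):
    """Parse and convert only the lines and fields this analysis needs."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON: skip it here, but it stays in the lake
        if isinstance(record, dict) and "amount" in record:
            yield float(record["amount"])


total = sum(read_amounts(RAW_EVENTS))
print(total)
```

The free-text line costs nothing to keep and is simply skipped by this particular analysis. A different analysis, such as one looking at customer sentiment, could read that same line later without anyone having prepared it in advance.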

Plus, data lakes are flexible. The fact that they store data in its raw format enables the use of newer analytical methods, like machine learning and exploratory data science. For this reason, lakes are often used by data analysts and data developers. (And increasingly by business analysts.)

Finally, because they’re centralised, data lakes can help to prevent siloed data. Your organisational data is all there, in its raw format, for any department in the business to use as needed. (In line with data security and protection procedures, of course.)

TL;DR: What is a data lake?

To summarise: a data lake is a big place to store all sorts of data. It’s fed by lots of different ‘streams’ of data, such as social media, ‘smart’ IoT devices, feedback forms, and so on.

It’s like a big data soup that anyone with access can then take a bit of, and use as they see fit.
