Data Lakes: The Future of Data Warehousing?

In this special guest feature, Adwait Joshi, CEO of DataSeers, sees data lakes as a modern take on big data: like a lake, they take whatever shape the data gives them, and that element of randomness helps when the future is unpredictable. Adwait Joshi is a Chief Seer and an expert in big data analytics. His firm DataSeers provides a big data appliance for banks that uses the concept of a data lake. FinanSeer is a big data appliance that ingests multiple data sources and creates powerful analytics that help drive reconciliation processes, BSA/AML Regulatory Compliance Monitoring, Complex Fraud Detection and a full 360-degree view of consumer and business data.

The term Big Data has been around since 2005, but what does it actually mean? Exactly how big is big? We are creating data every second. It’s generated across all industries and by myriad devices, from computers to industrial sensors to weather balloons and countless other sources. According to a recent study conducted by Data Never Sleeps, there are a quintillion bytes of data generated each minute, and the forecast is that our data will only keep growing at an unprecedented rate.

We have also come to realize just how important data really is. Some liken its value to something as precious to our existence as water or oil, although those aren’t really valid comparisons. Water supplies can fall and petroleum stores can be depleted, but data isn’t going anywhere. It only continues to grow—not just in volume, but in variety and velocity. Thankfully, over the past decade, data storage has become cheaper, faster and more easily available, and as a result, where to store all this information isn’t the biggest concern anymore. Industries that work in the IoT and faster-payments spaces are now starting to push data through at very high speed, and that data is constantly changing shape.

In essence, all this gives rise to a “data demon.” Our data has become so complex that normal techniques for harnessing it often fail, keeping us from realizing data’s full potential.

Most organizations currently treat data as a cost center. Each time a data project is spun off, there is an “expense” attached to it. It’s a contradiction: on one hand, we proclaim that data is our most valuable asset, but on the other we perceive it as a liability. It’s time to change that perception, especially when it comes to banks. The volumes of data financial institutions hold can be used to create tremendous value. Note that I’m not talking about “selling the data,” but leveraging it more effectively to provide crisp analytics that deliver knowledge and drive better business decisions.

What’s stopping people from converting data from an expense to an asset, then? The technology and talent exist, but the thought process is lacking.

Data warehouses have been around for a long time and traditionally were the only way to store large amounts of data that’s used for analytical and reporting purposes. However, a warehouse, as the name suggests, immediately makes one think of a rigid structure that’s limited. In a physical warehouse, you can store products in three dimensions: length, breadth and height. These dimensions, though, are limited by your warehouse’s architecture. If you want to add more products, you must go through a massive upgrade process. Technically, it’s doable, but not ideal. Similarly, data warehouses present a bit of rigidity when handling constantly changing data elements.

Data lakes are a modern take on big data. When you think of a lake, you cannot define its shape and size, nor can you define what lives in it and how. Lakes just form—even if they are man-made, there is still an element of randomness to them and it’s this randomness that helps us in situations where the future is, well, sort of unpredictable. Lakes expand and contract, they change over periods of time, and they have an ecosystem that’s home to various types of animals and organisms. This lake can be a source of food (such as fish) or fresh water and can even be the locale for water-based adventures. Similarly, a data lake contains a vast body of data and is able to handle that data’s volume, velocity and variety.
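
To make the warehouse-versus-lake contrast concrete, here is a minimal sketch of what “landing” differently shaped feeds in a lake can look like, with structure applied only when the data is read. It assumes PySpark and invented paths such as s3://my-lake/...; it illustrates the schema-on-read idea rather than describing any particular product.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("raw-zone-ingest").getOrCreate()

    # Land each feed exactly as it arrives -- no upfront schema design.
    # JSON from a payments API, CSV from a core-banking export, plain-text logs.
    payments = spark.read.json("s3://my-lake/landing/payments/")
    accounts = spark.read.option("header", True).csv("s3://my-lake/landing/accounts/")
    logs = spark.read.text("s3://my-lake/landing/switch-logs/")

    # Keep a raw zone in a columnar format, partitioned by arrival date.
    # Because the schema is inferred on read, a new field in tomorrow's JSON
    # does not force a warehouse-style migration first.
    (payments
        .withColumn("ingest_date", F.current_date())
        .write.mode("append")
        .partitionBy("ingest_date")
        .parquet("s3://my-lake/raw/payments/"))

    # The accounts and logs feeds would be written to their own raw-zone folders the same way.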

When mammoth data organizations like Yahoo, Google, Facebook and LinkedIn started to realize that their data and the way they used it were drastically different, and that it was almost impossible to analyze with traditional methods, they had to innovate. This in turn gave rise to document-based databases and to big data engines such as Hadoop, Spark, HPCC Systems and others. These technologies were designed to allow the flexibility one needs when handling unpredictable data inputs.

“If you’re at the earliest stage of maturity, you’re used to asking questions of a SQL or NoSQL database or data warehouse in the form of reports,” said Flavio Villanustre, VP of Technology for HPCC Systems and CISO at LexisNexis Risk Solutions. “In a modern data lake that has a deep learning capability with anomaly detection, you also get new insights that could have a profound effect on your company or customers, such as the discovery of a security breach or other crimes in progress, the early warning signs of a disease outbreak or fraud.”

Jeff Lewis is SVP of Payments at Sutton Bank, a small community bank that’s challenging the status quo for other banks in the payments space. “Banks have to learn to move on from data warehouses to data lakes. The speed, accuracy and flexibility of information coming out of a data lake is crucial to the increased operational efficiency of employees and to provide a better regulatory oversight,” said Lewis. “Bankers are no longer old school and are ready to innovate with the FinTechs of the world. A data centric thought process and approach is crucial for success.”

Data lakes are a natural choice to handle the complexity of such data, and the application of machine learning and AI is becoming more common as well. From using AI to clean and augment incoming data to running complex algorithms that correlate different sources of information to detect complex fraud, there is an algorithm for just about everything. And now, with the help of distributed processing, these algorithms can be run on multiple clusters and the workload can be spread across nodes.
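
As a rough illustration of how such a workload spreads across nodes, the sketch below computes a toy, rule-based fraud signal with PySpark over data already sitting in the lake. The column names, thresholds and s3://my-lake/... paths are invented for the example; a real detection pipeline would be considerably more sophisticated.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("fraud-signal").getOrCreate()

    # Spark splits the raw files into partitions and distributes them across
    # the cluster's worker nodes, so the aggregation below runs in parallel.
    txns = spark.read.parquet("s3://my-lake/raw/payments/")

    daily = (txns
        .groupBy("card_id", F.to_date("txn_ts").alias("txn_day"))
        .agg(F.count("*").alias("txn_count"),
             F.sum("amount").alias("total_amount")))

    # A deliberately simple rule: flag cards with unusually heavy daily activity.
    flagged = daily.filter((F.col("txn_count") > 50) | (F.col("total_amount") > 10000))

    # Only the flagged rows are materialized for analysts or a downstream model.
    flagged.write.mode("overwrite").parquet("s3://my-lake/curated/fraud-flags/")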

One thing to remember is that you should be building a data lake and not a data swamp. It’s hard to control a swamp. You cannot drink from it, nor can you navigate it easily. So, when you look at creating a data lake, think about what the ecosystem looks like and who your consumers are. Then, embark on a journey to build a lake on your own.

Comments

  1. Frank Quintana says

    Data lakes are extensions of the corporate data warehouse. It is a mistake to think that the data lake and the DW are two separate things; that mindset is the reason for data silos and data swamps. By definition, a data warehouse is a place where you store corporate data of any kind: relational, NoSQL, structured, semi-structured, unstructured, distributed or monolithic.
    Big Data has been defined by the 3 Vs: Velocity, Volume, and Variety. But there are at least three other Vs of great importance: Value, Veracity and Visibility. These are the big challenges of the data lake environment if the data lake is not to become a data swamp.

    • With all due respect, your “by definition” of a data warehouse is actually what leads to a data swamp. Methodologies like Kimball, with their structure, were created to take raw data from the sources you think an EDW should hold. The data lake is a reasonable concept for keeping all ‘first creation’ data from any source, and companies like Snowflake are making it work. The sequence should be that all ‘in action’, transactional, ‘event driven’ data is backed up into the data lake; from there, the business must work with IT to decide what data is absolutely useful to the bottom line and future growth and work on those priorities, with the data warehouse flowing into data marts. The name of the game isn’t how much data I can store. It has always been how much data do I need to sell, create or invent “X”.

  2. Snowflake is trying to be the target of all data sources, building bridges for ingestion. But they are not a data lake; they are part of the data lake. Sure, you can ingest everything, even JSON as-is, but at the expense of compute, which can get really expensive.
