Data lakes: Is it really worth it for organizations to sink information into them?

Data Science | Published April 3, 2020

Perhaps the single biggest problem with data lakes is that nobody seems able to clearly define what they’re best used for. Computer scientists will say that a data lake is nothing more than a repository of data stored in its natural, raw format, but that definition sounds awfully close to a file system. Server administrators have little reason to explore an exotic solution when a Btrfs volume or an Apache Kafka deployment would do the job.
That being said, there are a number of reasons you might genuinely want to build a data lake. An analytical store geared toward retrieval by technicians, coupled with an area for offloading cold data, is one ideal use case.
Other use cases are springing up all the time as a result of the push for decentralization.

Decentralized Data Storage in Lakes

Different divisions or departments of a company may end up running separate data lakes, which can work better than a traditional silo structure. As query and analytical workloads continue to grow, it often makes sense to split data stores apart; data lakes in particular can be divided into multiple, potentially overlapping volumes that are easier to maintain and provide better coverage.
Organizations that want to take advantage of decentralized storage face a big decision, however: the cloud versus on-premises data lake question, a debate that at times can get quite heated.
While an on-premises deployment can be more secure than a cloud-based one, that doesn’t mean it’s always the best option. Given mature 256-bit encryption such as AES-256, a cloud-based option can provide nearly the same level of security. There’s always the risk, however, that a remote server might go down at an inopportune time, which has kept on-premises deployment popular among administrators who manage ample local resources.
Best of all, data lakes have become even easier to maintain over time.

Keeping Existing Data Lakes Flowing

Although they may present real deployment hazards, data lakes are remarkably easy to maintain once they’re in place. A shell script, or perhaps a simple loop in R or Objective-C, can parse unstructured data for you and place it into organized containers.
Since data lakes are completely neutral about the information stored in them, these routines can run on almost any type of stored material. Completely unstructured data, such as discrete documents, can be sorted into several different categories with a few simple tools such as awk and grep, as in the sketch below.
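As a minimal sketch of that idea, a routine like the following could route raw text documents into category folders based on grep keyword matches. The directory layout and keyword lists here are invented for illustration, and awk could slot in the same way when the rule depends on a field inside the document rather than a simple pattern:

```sh
#!/bin/sh
# Hypothetical sketch: route raw text files from a lake's landing area
# into category folders based on keyword matches. All paths and keyword
# lists below are placeholders, not a prescribed layout.
LAKE=./lake
mkdir -p "$LAKE/finance" "$LAKE/hr" "$LAKE/uncategorized"

for f in "$LAKE"/raw/*.txt; do
  if grep -qiE 'invoice|payment|balance' "$f"; then
    mv "$f" "$LAKE/finance/"
  elif grep -qiE 'resume|candidate|interview' "$f"; then
    mv "$f" "$LAKE/hr/"
  else
    mv "$f" "$LAKE/uncategorized/"
  fi
done
```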
Semi-structured data, such as XML files and JSON logs, is also an ideal candidate for this kind of treatment. Companies that manage financial data might want to store it in CSV format and then use an automated conversion script to turn that information into editable spreadsheets that end-users can work with like any other document.
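A conversion pipeline of that sort might look like the sketch below, which normalizes semicolon-delimited exports with awk and hands the result to LibreOffice’s headless converter. The paths are placeholders, and the script assumes the libreoffice binary is installed:

```sh
#!/bin/sh
# Hypothetical sketch: normalize raw CSV exports and convert them into
# .xlsx spreadsheets that end-users can edit. Paths are placeholders;
# assumes libreoffice is available for headless conversion.
SRC=./lake/finance
OUT=./reports
mkdir -p "$OUT"

for csv in "$SRC"/*.csv; do
  clean="$OUT/$(basename "$csv")"
  # Drop blank lines and rewrite semicolon-delimited rows as comma-delimited.
  awk -F';' 'NF > 0 { OFS=","; $1=$1; print }' "$csv" > "$clean"
  libreoffice --headless --convert-to xlsx --outdir "$OUT" "$clean"
done
```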
Over time, however, adding multiple distributed data lakes to an array of existing resources will start to create more silos. Each new silo makes it a bit harder for companies to derive value from their data, which is why individuals tasked with maintaining these repositories are always on the lookout for new tools.

Dealing with Data Silos

A few data analysts have suggested integrating all of these silos through a data science tool or a virtualization scheme. In those cases, however, technicians have to think about per-query performance and latency. Time-to-solution agility has also been a major barrier, which is perhaps one of the biggest reasons data silos have become such a sticking point in so many organizations.
Fortunately, the same shell scripting tools that are changing the way people manage lakes can also help them tame silos. In a virtualized environment, people could dynamically provision cloud-based storage to cache the most commonly accessed pieces of data, making it easier for the underlying infrastructure to process queries in a timely manner.
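As one hedged illustration of that idea, a scheduled job could mine an access log for the most frequently requested objects and push them to a faster cache tier. The log path, the assumption that field 7 holds the object path, and the bucket name below are all invented for the sketch:

```sh
#!/bin/sh
# Hypothetical sketch: promote the hottest objects in a lake to a cloud
# cache tier. The log format, field positions, local lake root, and
# bucket name are placeholders.
LOG=/var/log/lake-access.log
CACHE_BUCKET=s3://example-hot-cache

# Tally requests per object, then copy the 20 most-requested objects
# from the local lake root to the cache bucket.
awk '{ hits[$7]++ } END { for (p in hits) print hits[p], p }' "$LOG" \
  | sort -rn | head -20 \
  | while read -r count path; do
      aws s3 cp "./lake$path" "$CACHE_BUCKET$path"
    done
```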
Once again, security becomes an issue here. With more servers maintaining the loads that make up a data lake, there are more places where something could eventually leak. Cybersecurity specialists are starting to look into some fairly innovative ways to tackle the problem.
End-to-end encryption ensures that data remains secure as long as it stays in a private lake, but once it passes to a client machine it can be exposed to attack vectors. One possible solution is a routine that verifies certain security systems are in place on the client device before loading the requested data. Such a check can be streamlined so that the inspection adds only a fraction of a second to the resolution of any given query.
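A bare-bones version of that pre-flight check, written here as a purely hypothetical shell routine, might refuse to release an object unless the requesting endpoint can complete a TLS 1.2 handshake with a valid certificate. The host argument, port, and TLS policy are illustrative choices, and a real deployment would check far more than this:

```sh
#!/bin/sh
# Hypothetical sketch: verify a client endpoint before releasing data.
# The client host is passed as the first argument; port 443 and the
# TLS 1.2 requirement are assumptions made for this example.
CLIENT="$1"

if echo | openssl s_client -connect "$CLIENT:443" -tls1_2 2>/dev/null \
     | grep -q 'Verify return code: 0 (ok)'; then
  echo "client check passed; releasing requested data"
  # ... hand the object to the transfer routine here ...
else
  echo "client failed security check; refusing request" >&2
  exit 1
fi
```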
A few GitHub projects have done this with OpenSSL in Ruby, though it’s theoretically possible to write such a check for almost any platform. With security and performance issues largely sorted out, most of the remaining problems around data lakes have more to do with helping users find the information they’re looking for than anything else.

Connecting People & Information

An overwhelming majority of criticism heaped on data lake technology has to do with the way that some people just dump huge amounts of information into a framework like Apache Hadoop and call it a day. Some of the worst offenders have been called managers of digital graveyards, since they soon lose track of what’s there.
However, these kinds of complaints could be made about almost any software project. Moving forward, the real challenges won’t be related to implementation but to making full use of the opportunities data lakes provide. Considering how innovative many programmers in the field are, there’s little doubt that data lakes will soon connect people and the information they seek like never before.