Recently, I had an opportunity to interview Oren Eini, CEO and founder of Hibernating Rhinos, which provides RavenDB, an open source document-oriented NoSQL database designed especially for the .NET/Windows platform.
Oren has more than 20 years of experience in the development world with a strong focus on the Microsoft and .NET ecosystem. Recognized as one of Microsoft’s Most Valuable Professionals since 2007, Oren is also the author of “DSLs in Boo: Domain Specific Languages in .NET.” He frequently speaks at industry conferences such as DevTeach, JAOO, QCon, Oredev, NDC, Yow! and Progressive.NET.
You can read the complete interview below:
1. In this digitalized world, data has become one of the most valuable assets, and therefore the way data is stored, organized, and processed is critical to a business’s success. As companies are bombarded with more and more data, data storage and analytics are growing more complex. Can you tell us some of the common database management challenges businesses face today?
The primary issue, I believe, is just the sheer size of the data. I’m not necessarily talking about Big Data and the complexities of managing a data set measured in hundreds of terabytes. I’m talking about the number of databases and data silos that you have in an organization. Since everything is digital, you have business-critical functionality that resides in an Excel spreadsheet on a shared drive and historical data of customer purchases on a server that no one wants to go near for fear of having to take ownership of it.
Just figuring out what the organization as a whole knows can be a complex task. Data slipping through the cracks is sadly common.
Attempts to create a centralized repository for the entire company are also doomed to fail. Different portions of the company have very different ideas about what seemingly obvious things are. For example, the Billing department has a very different notion of what a Customer is than the Marketing department. Trying to make the data fit a common mold does everyone a disservice.
2. How do we go about overcoming these challenges? Do you think choosing an effective database management solution is the first step? And why?
The first step is to define, at the organizational level, data ownership and responsibility rules. At the most basic level, Billing owns the concept of whether a Customer is in an OverduePayment status and Marketing owns the Interests of a Customer. The idea is not to create silos of information in the organization but to have an explicit acknowledgment of the different needs. Once that is done, you can define proper data flow in the organization.
The Billing department will make its view of a Customer available to the rest of the organization while retaining the freedom to change how it is shaped inside the department.
I use the Billing and Marketing departments and the notion of a Customer as the example here so that I can talk about the business first, which is important. To put it in slightly more technical terms, we are talking about services and data flow contracts. I’ll refer you to Bezos’ Mandate and how it transformed Amazon. The idea is simple: instead of treating the entire organization as a single whole, which is nearly impossible past a certain size, treat it as a bunch of cooperating organizations that have very clear boundaries between them.
Once you have those boundaries, and you have a good idea of the flow of data in the organization, you can have your plumbers come in and do stuff to it like redirecting the data flow to a central location for analysis.
Having such published interfaces helps a lot when the time comes to change how some things behave. As long as the external behavior is the same, we are free to change how we process it.
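To make the idea of a published interface concrete, here is a minimal sketch in Python. Every name in it (PublishedCustomer, BillingService, the ledger fields, the 30-day rule) is a hypothetical illustration and does not come from the interview. It only shows Billing exposing a stable view of a Customer while keeping its internal records private.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublishedCustomer:
    """The view of a Customer that Billing publishes (hypothetical contract)."""
    customer_id: str
    overdue_payment: bool


class BillingService:
    """Owns its internal records; only PublishedCustomer leaves the department."""

    def __init__(self) -> None:
        # Internal shape: Billing is free to change this without telling anyone.
        self._ledger = {
            "c-1": {"balance_cents": -1250, "days_late": 42},
            "c-2": {"balance_cents": 0, "days_late": 0},
        }

    def get_customer(self, customer_id: str) -> PublishedCustomer:
        record = self._ledger[customer_id]
        # Assumed rule for illustration: overdue means a debt older than 30 days.
        overdue = record["balance_cents"] < 0 and record["days_late"] > 30
        return PublishedCustomer(customer_id, overdue)


def marketing_can_send_offer(billing: BillingService, customer_id: str) -> bool:
    """Marketing depends only on the published view, never on the ledger."""
    return not billing.get_customer(customer_id).overdue_payment
```

If Billing later replaces the ledger with something else entirely, marketing_can_send_offer keeps working as long as get_customer still returns the same published shape.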
3. In recent years, enterprises have adopted various types of NoSQL databases. With increasingly sensitive data being stored in NoSQL databases, security issues have become growing concerns. What is your take on this?
By and large, the most common reason for lack of security in NoSQL databases is operator negligence. I want to clearly separate two distinct issues here. We have NoSQL databases such as Redis, whose security model is explicitly about running in a trusted environment. There are some rudimentary security features for Redis, but the general assumption is that they are meant to serve only as the third or fourth line of defense.
Other NoSQL databases, such as MongoDB, are expected to run on hostile networks (i.e., the Internet). However, it is easy to set up MongoDB with no security whatsoever. On the face of it, MongoDB comes in a secured configuration, listening only on the local machine.
The very first thing that you’ll find when trying to connect to MongoDB remotely is a guide that explains how to enable remote access to MongoDB, without any security whatsoever.
To a certain degree, this is operator negligence. But given the sheer number of MongoDB databases that are left wide open, I believe that this is splitting hairs. In China, an open MongoDB database exposed over 200 million CVs just waiting for someone to snoop; a carelessly set up database has exposed Russia’s backdoors into over 2,000 companies.
With security, you don’t get a second chance.
RavenDB, in contrast, will simply refuse to run in a vulnerable configuration. You can run RavenDB with no security on the local machine, but if you try to expose the database to the Internet without the proper safeguards, the database will return an error explaining how you should properly set it up.
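The "refuse to run in a vulnerable configuration" behavior can be illustrated with a small startup check. This is a sketch in Python under stated assumptions, not RavenDB's actual code or settings: ServerConfig, certificate_path, and InsecureConfigurationError are hypothetical names, and the sketch treats "secured" as simply "a certificate is configured".

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


class InsecureConfigurationError(Exception):
    """Raised when the server would be reachable remotely without security."""


@dataclass
class ServerConfig:
    bind_address: str                       # literal IP address, e.g. "127.0.0.1"
    certificate_path: Optional[str] = None  # hypothetical security setting


def validate_startup(config: ServerConfig) -> None:
    """Allow unsecured access on loopback only; fail fast everywhere else."""
    address = ipaddress.ip_address(config.bind_address)
    if address.is_loopback:
        return  # local-only access: running without security is acceptable
    if config.certificate_path is None:
        raise InsecureConfigurationError(
            f"Refusing to listen on {config.bind_address} without a certificate. "
            "Configure a certificate or bind to the local machine only."
        )
```

The design choice is that the unsafe path fails loudly at startup with instructions, instead of relying on the operator to read a hardening guide afterwards.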
We assume that most people will do the minimum amount of work required, and we make sure that when they do, the final state is still good. We fill in as many of the gaps as we can, so we’ve got you covered.
4. Talking about RavenDB, can you name some of the most important features that add more value to the customers? How does RavenDB stand out among other vendors in terms of features and performance?
RavenDB has been running in production systems for over a decade. Some of the most powerful features we have date back to our original version. The ability to react dynamically to the operational environment is the most obvious one. RavenDB continuously analyses what is going on (incoming queries, server load, etc.) and reacts to that by changing resource allocation, internal structures, etc. The idea is that instead of having a full-time DBA babysit your database, your database can manage its own affairs.
When we started working on RavenDB, we wanted a database that had all the advantages of a relational database (fast, ACID, reliable) but none of the disadvantages (rigid schema, ongoing maintenance, high complexity).
When we started, I had no idea how big a task this was. Over the past 10 years, we have gained a lot of experience in building a database that can just work, without requiring you to pay much attention to it. We designed RavenDB to make it easier for us to implement things with a focus on performance. A recent benchmark on a Raspberry Pi ($25, 1 GHz, 1 GB RAM) machine clocked us at over 5,000 writes a second. On commodity hardware, we can get to over 100,000 writes per second and over 1,000,000 reads per second.
All of that is on a single node, but RavenDB has been a distributed database from the get-go. This means that you can set up a cluster in a few minutes (and do so in a secure manner, of course) and have a highly available and robust system.
We offer some unique features that aren’t available elsewhere. ETL is built into RavenDB and is heavily utilized by our customers to enable rich data flow. You don’t need to stitch together a solution from disparate pieces; it is all right there in the box and it Just Works.
The Subscription feature is one that I’m particularly proud of. It allows you to perform an ongoing query. The database will initially give you all the results that match your query. Since you remain subscribed to this query, the database will then send over any new documents that match it, as they are created or updated to fit the query. This allows you to build robust business processes and backend systems with ease.
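The semantics of such an ongoing query can be sketched with a toy in-memory store. This Python example is purely illustrative: TinyStore, subscribe, and put are hypothetical names, not the RavenDB client API, and the sketch ignores persistence, batching, and delivery guarantees.

```python
from typing import Callable, Dict, List, Tuple

Document = Dict[str, object]
Predicate = Callable[[Document], bool]
Handler = Callable[[str, Document], None]


class TinyStore:
    """Toy document store that supports ongoing (subscription) queries."""

    def __init__(self) -> None:
        self._docs: Dict[str, Document] = {}
        self._subscriptions: List[Tuple[Predicate, Handler]] = []

    def subscribe(self, predicate: Predicate, handler: Handler) -> None:
        # Phase 1: deliver every document that already matches the query.
        for doc_id, doc in list(self._docs.items()):
            if predicate(doc):
                handler(doc_id, doc)
        # Phase 2: stay registered so future writes are delivered too.
        self._subscriptions.append((predicate, handler))

    def put(self, doc_id: str, doc: Document) -> None:
        # Store the new or updated document, then notify matching subscribers.
        self._docs[doc_id] = doc
        for predicate, handler in self._subscriptions:
            if predicate(doc):
                handler(doc_id, doc)


store = TinyStore()
store.put("orders/1", {"status": "overdue"})
store.put("orders/2", {"status": "paid"})

received: List[str] = []
store.subscribe(
    lambda doc: doc["status"] == "overdue",
    lambda doc_id, doc: received.append(doc_id),
)
store.put("orders/3", {"status": "overdue"})  # new document that matches
store.put("orders/2", {"status": "overdue"})  # update that now fits the query
```

After these calls, received holds "orders/1" from the initial results, followed by "orders/3" and "orders/2" as they are written.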
We have put a lot of effort into making RavenDB a multi-model database capable of handling documents, key-value, binary data, distributed counters and graph queries.
5. And finally, what is the future of database management systems? How is it going to change in the next 3-4 years?
You are going to see a lot more focus on multi-model databases. Instead of having to deploy a dedicated solution for each type of data you want and dealing with the complex integration between each of the pieces, the market is moving to an integrated solution that can offer a full suite of options in a single box.
The cloud will continue to be more important, but I wouldn’t be hasty to say goodbye to on-premise and distributed systems. We are seeing a lot of our customers do processing on the edge and on occasionally connected systems. I think you’ll see a shift of focus, where the data centers of the past would move to the cloud, but a lot of the actual processing would be distributed at the edge and on mobile devices. That requires a different manner of thinking about data distribution and how to push data to the cloud and pull data from the cloud.
There is going to be a lot more emphasis on the kind of distributed data processing that once was the exclusive domain of high-end systems.
It is certainly going to be very interesting to see how the landscape changes and how we build the tools and methodologies to handle ever-growing complexity and functionality.