5 common mistakes to avoid when de-duping your data

Published February 14, 2019

Data is power, and with that power comes great responsibility. One of the biggest obstacles in data management is identifying duplicate records and removing them.

The aim of data deduplication is to eliminate redundant data in your business. Duplicates are created in every area of the business: a sales rep enters a new record without checking the database first, a marketer uploads a list of potential buyers without checking whether the records already exist, or a customer registers again because they forgot they already have an account with you.

Data deduplication supports proper management of such records, reduced data storage, more effective marketing communications, and better predictive analysis. Duplicate records can have a huge impact on machine learning and data science work: a customer who appears twice is effectively counted twice, which gives that customer double the weight and biases the outputs.
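To make the bias concrete, here is a minimal sketch with made-up numbers. The customer IDs and order values are hypothetical, and it assumes that two identical rows really are the same purchase entered twice.

```python
from statistics import mean

# Hypothetical (customer, order value) rows; "c3" was entered twice.
orders = [
    ("c1", 20.0),
    ("c2", 30.0),
    ("c3", 100.0),
    ("c3", 100.0),  # duplicate record of the same purchase
]

# Average order value with the duplicate still in place.
mean_with_duplicate = mean(value for _, value in orders)

# Keep one row per identical (customer, value) pair, preserving order.
unique_orders = list(dict.fromkeys(orders))
mean_deduplicated = mean(value for _, value in unique_orders)

print(mean_with_duplicate)  # 62.5
print(mean_deduplicated)    # 50.0
```

One duplicated row is enough to move the average from 50.0 to 62.5, and any model trained on these rows would see the same distortion.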

However, every great idea comes with risks. Because a de-duplication strategy usually involves deleting data, mistakes are easy to make and can be costly.

In-line or Post Processing

Inline deduplication de-dupes the data as it is processed, while post-processing deduplication writes the data first and removes duplicates afterwards. The inline approach reduces the amount of data immediately, which is great, but it often causes performance issues because of the resources required to run such a strategy. On the other hand, you need far less raw disk space, because the deduplication is carried out on the front end and the duplicate data is never written in the first place.

Make sure you have the processing power for inline deduplication so that it doesn’t impact performance. The other mistake is to assume that there is never a case for keeping duplicates. There are legitimate reasons to have duplicates in your system, for example for billing, customer service, sales, or marketing purposes. It is therefore a good idea to consult every department that touches the data before implementing inline processing.
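As a rough illustration of the inline approach, the sketch below filters records as they stream in, before anything is stored. The function name `inline_dedupe`, the `allow_duplicates` hook, and the record fields are assumptions made for this example, not part of any particular product.

```python
from typing import Callable, Hashable, Iterable, Iterator


def inline_dedupe(
    records: Iterable[dict],
    key: Callable[[dict], Hashable],
    allow_duplicates: Callable[[dict], bool] = lambda record: False,
) -> Iterator[dict]:
    """Yield records as they arrive, dropping any whose key was already seen."""
    seen = set()  # grows with the number of distinct keys: the resource cost
    for record in records:
        if allow_duplicates(record):
            yield record  # legitimate duplicate, e.g. agreed with billing
            continue
        k = key(record)
        if k in seen:
            continue  # the duplicate never reaches storage
        seen.add(k)
        yield record


# Hypothetical incoming stream: two identical contacts, two invoices.
incoming = [
    {"email": "a@example.com", "type": "contact"},
    {"email": "a@example.com", "type": "contact"},
    {"email": "a@example.com", "type": "invoice"},
    {"email": "a@example.com", "type": "invoice"},
]

stored = list(
    inline_dedupe(
        incoming,
        key=lambda r: r["email"],
        allow_duplicates=lambda r: r["type"] == "invoice",
    )
)
print(len(stored))  # 3
```

The `seen` set is where the processing and memory cost comes from, and the `allow_duplicates` hook is one possible place to encode the exceptions each department asks for.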

Algorithms

Deduplication is only as good as the algorithms behind it, i.e. how are duplicate records discovered in the first place? Let’s assume we have 100 copies of a file on our systems because each employee has their own version. Instead of storing multiple copies, good practice tells you to store only one and have all the employees point to it. But what if one of the employees changes their own file so that it is slightly different from the others? If the rules still treat it as a duplicate, you run the risk of losing data. Make sure that any rules you set make sense and don’t start removing unique data by mistake.

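A few techniques are commonly used for data deduplication, such as hash functions like SHA-1 or MD5, which fingerprint content so that identical copies can be detected, and binary search tree structures, which can index those fingerprints for fast lookup. They are worth reviewing to find what is most appropriate for you.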
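The 100-copies example can be sketched with a hash-based store, as below. This is a minimal illustration rather than a production design: the class name `SingleInstanceStore` and the file contents are hypothetical, and SHA-1 is used only because it is named above. Both SHA-1 and MD5 have known collision weaknesses, so a stronger digest such as SHA-256 is a common substitute.

```python
import hashlib


class SingleInstanceStore:
    """Store each distinct file body once; owners hold pointers (digests)."""

    def __init__(self):
        self.blobs = {}     # digest -> file content
        self.pointers = {}  # owner -> digest

    def save(self, owner: str, content: bytes) -> None:
        digest = hashlib.sha1(content).hexdigest()
        # Only the first copy of a given content is actually stored.
        self.blobs.setdefault(digest, content)
        self.pointers[owner] = digest

    def load(self, owner: str) -> bytes:
        return self.blobs[self.pointers[owner]]


store = SingleInstanceStore()
report = b"Q1 sales report"

# 100 employees save identical copies: one blob, 100 pointers.
for n in range(100):
    store.save(f"employee{n}", report)
print(len(store.blobs))  # 1

# One employee edits their copy: it hashes differently and is kept separately.
store.save("employee7", report + b" (edited)")
print(len(store.blobs))  # 2
print(store.load("employee7") == report + b" (edited)")  # True
print(store.load("employee8") == report)  # True
```

Because the fingerprint covers the whole content, the edited file is never mistaken for a duplicate. A looser rule, such as matching on file name alone, would have discarded the change.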

While de-duping files as in the example above can be easily addressed by data scientists, sales and marketing records are more difficult. Different businesses define duplicates differently, so this is no longer a task for the data scientist alone but for the heads of the different departments. The first step is therefore to identify what makes a duplicate. Take a retail giant like Walmart. For a distribution company, each Walmart location would be a unique record. A software company selling into Walmart, however, would consider all the locations duplicates, as it only wants to sell to the head office. The same applies to selling into P&G, where some businesses sell to each brand individually. They want to keep the brands separate and use parent/child linking, rather than de-duping, to relate them. So before de-duping, make sure you have all the rules defined, and only then choose the algorithm to de-dupe the data.
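One way to express this is to keep the de-duplication routine generic and pass in the business rule as a key function. The sketch below is illustrative only: the field names, the placeholder locations ("Store A", "Brand A"), and the helper names are all assumptions.

```python
from collections import defaultdict

# Hypothetical account records; locations and brands are placeholders.
accounts = [
    {"company": "Walmart", "unit": "Store A"},
    {"company": "Walmart", "unit": "Store B"},
    {"company": "Walmart", "unit": "Store B"},
    {"company": "Walmart", "unit": "Head Office"},
    {"company": "P&G", "unit": "Brand A"},
    {"company": "P&G", "unit": "Brand B"},
]


def dedupe(records, key):
    """Keep the first record seen for each key; the key encodes the business rule."""
    unique = {}
    for record in records:
        unique.setdefault(key(record), record)
    return list(unique.values())


def by_unit(record):
    # Distribution company: every location or brand is its own record.
    return (record["company"], record["unit"])


def by_company(record):
    # Software company selling to head office: one record per company.
    return record["company"]


def link_children(records):
    """Parent/child linking: group distinct units under their parent company."""
    children = defaultdict(list)
    for record in records:
        if record["unit"] not in children[record["company"]]:
            children[record["company"]].append(record["unit"])
    return dict(children)


print(len(dedupe(accounts, by_unit)))     # 5
print(len(dedupe(accounts, by_company)))  # 2
print(link_children(accounts)["P&G"])     # ['Brand A', 'Brand B']
```

The same six rows yield five records under one rule and two under the other, which is why the rule has to be agreed with each department before any algorithm is chosen.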

Encryption

With data protection, security teams will often have data encrypted as it comes into the business. This can make it effectively impossible to dedupe, because once encrypted, every piece of data looks unique. If you are using replication and encryption products alongside deduplication software, there is a very high chance that files will be stored more than once, as the software simply cannot recognize them as matching storage blocks.

Data protection products are sometimes deduplication-aware, but it is vital that you consider how everything fits together.
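The effect can be demonstrated with a toy example. The `toy_encrypt` function below is a deliberately simplified stand-in for a real encryption product and is not secure. It only mimics one property many real schemes share: a fresh random nonce per message, so identical inputs produce different outputs. The key and block contents are hypothetical.

```python
import hashlib
import os


def fingerprint(block: bytes) -> str:
    """Fingerprint a block the way a hash-based deduplicator might."""
    return hashlib.sha256(block).hexdigest()


def toy_encrypt(block: bytes, key: bytes) -> bytes:
    """Illustrative only, NOT secure: XOR with a keystream tied to a random nonce."""
    nonce = os.urandom(16)  # fresh randomness for every call
    keystream = b""
    counter = 0
    while len(keystream) < len(block):
        keystream += hashlib.sha256(
            key + nonce + counter.to_bytes(4, "big")
        ).digest()
        counter += 1
    cipher = bytes(b ^ k for b, k in zip(block, keystream))
    return nonce + cipher


key = b"hypothetical-key"
blocks = [b"same customer record"] * 3

# Before encryption the three blocks share one fingerprint.
plain_fingerprints = {fingerprint(b) for b in blocks}
# After encryption each copy looks unique to the deduplicator.
cipher_fingerprints = {fingerprint(toy_encrypt(b, key)) for b in blocks}

print(len(plain_fingerprints))   # 1
print(len(cipher_fingerprints))  # 3
```

This is why the order of operations matters: a deduplicator that only ever sees the encrypted blocks has nothing to match on.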

Manual deduplication

Many businesses try to dedupe their database manually, which takes up a huge amount of time and resources and carries a large risk of human error. Beyond that, with vast data sets it is virtually impossible for manual processes to pick up on everything.

For example, what if John Smith buys a pair of shoes on your website today? He comes back tomorrow but registers as J Smith because he forgot his login details. Next week, he signs up again with a different email address. I’ve only mentioned a couple of data fields here, and it already starts to get complicated. Imagine you have 200 fields of customer data: how do you ensure each customer is kept unique?

If you are going about this manually, it is important either to construct full matching algorithms yourself or to acquire data cleansing tools that do it for you, saving all that time and effort.
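A very small matching rule for the John Smith case might look like the sketch below. The rule (same email, or same surname plus same first initial), the customer records, and the function names are assumptions chosen for illustration. Real data cleansing tools use far richer matching, and the pairs found here are candidates for review, not records to delete automatically.

```python
from itertools import combinations

# Hypothetical customer records based on the John Smith example.
customers = [
    {"id": 1, "name": "John Smith", "email": "john.smith@example.com"},
    {"id": 2, "name": "J Smith", "email": "john.smith@example.com"},
    {"id": 3, "name": "John Smith", "email": "jsmith@example.org"},
    {"id": 4, "name": "Jane Doe", "email": "jane@example.com"},
]


def name_key(name: str):
    """Reduce a name to (surname, first initial), ignoring case."""
    parts = name.lower().split()
    return (parts[-1], parts[0][0])


def possible_duplicate(a: dict, b: dict) -> bool:
    """Flag a pair if the emails match or the reduced names match."""
    same_email = a["email"].strip().lower() == b["email"].strip().lower()
    return same_email or name_key(a["name"]) == name_key(b["name"])


# Compare every pair; fine for a sketch, too slow for very large tables.
candidates = [
    (a["id"], b["id"])
    for a, b in combinations(customers, 2)
    if possible_duplicate(a, b)
]
print(candidates)  # [(1, 2), (1, 3), (2, 3)]
```

Even this two-field rule needs a human decision at the end, since a different J Smith would be flagged too. With hundreds of fields, the rules and the review queue both grow quickly.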

Backups

Deduplication can go wrong! Before removing duplicates, make sure that everything is backed up and that you can resolve any issues quickly. Going back to our earlier example, what if we discover that John Smith and J Smith are in fact different people and we need to get the second account back? You need a process that can do just that. In the EU, the GDPR also expects you to be able to restore access to personal data and to correct inaccurate records.
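One possible shape for such a process is to snapshot both records before every merge so that the merge can be undone. The sketch below is an assumption-laden illustration: the `CustomerTable` class, its methods, and the sample records are hypothetical, and a real system would also keep proper database backups.

```python
import copy


class CustomerTable:
    """In-memory table that archives records before merging them."""

    def __init__(self, records):
        self.records = {r["id"]: r for r in records}
        # removed id -> (surviving id, removed snapshot, survivor snapshot)
        self.archive = {}

    def merge(self, keep_id, remove_id):
        survivor = self.records[keep_id]
        removed = self.records[remove_id]
        # Snapshot both records before changing anything.
        self.archive[remove_id] = (
            keep_id,
            copy.deepcopy(removed),
            copy.deepcopy(survivor),
        )
        # Fill gaps in the survivor with fields only the duplicate had.
        for field, value in removed.items():
            survivor.setdefault(field, value)
        del self.records[remove_id]

    def undo_merge(self, removed_id):
        # Restores both records to their pre-merge state; changes made to
        # the survivor after the merge are discarded in this simple sketch.
        keep_id, removed, survivor_before = self.archive.pop(removed_id)
        self.records[removed_id] = removed
        self.records[keep_id] = survivor_before


table = CustomerTable([
    {"id": 1, "name": "John Smith"},
    {"id": 2, "name": "J Smith", "phone": "555-0100"},
])

table.merge(keep_id=1, remove_id=2)
print(len(table.records))  # 1

# They turn out to be different people: bring the second account back.
table.undo_merge(2)
print(len(table.records))  # 2
print(table.records[1])    # {'id': 1, 'name': 'John Smith'}
```

Keeping the pre-merge snapshots means a wrong match costs a lookup rather than a lost customer.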

A data deduplication strategy becomes more important as businesses grow their digital footprint. With so many channels of communication, even a single duplicate record can create bias and potentially lead to the wrong decisions. That said, deduplication must be done properly to avoid removing the wrong records, feeding algorithms incorrect data, or slowing the business down. Make sure that data deduplication is fully covered by your data governance strategy.