Issues with data duplication and formatting still hurting data quality in 2019

Data Mining  |  Published January 25, 2019

I recently came across an interesting white paper from Observe Point titled Data Quality and the Digital World: A Web Analytics Demystified White Paper. The paper makes some bold claims, including that 80% of all web data is wrong.

That is a startling figure that deserves a fact-check of its own, but the paper does raise some legitimate concerns about data quality.

Since big data became a household term, organizational decision makers, as well as data and analytics experts, have been preoccupied with data scalability. They have leaned too heavily on the philosophy that more data is always better in the digital era.

This has created some valid concerns about data quality. Some of these concerns were first raised a few years ago, but have still not been adequately addressed. Here are some of the biggest reasons that data quality remains an issue in 2019.

Data duplication

Data duplication occurs when an organization stores multiple copies of the same data. To a layperson, it sounds like an elementary problem that any competent data scientist or seasoned network administrator would be able to avoid. Unfortunately, data duplication is actually very common. According to a 2013 report, 92% of organizations acknowledge that they are storing duplicate data.
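Even the simplest case, exact copies of the same records, is worth catching programmatically. Below is a minimal Python sketch of one common approach: hashing each record's fields so identical rows collapse to a single fingerprint. The field names and sample records are hypothetical.

```python
# Minimal sketch of exact-duplicate detection; record fields are hypothetical.
import hashlib
import json

records = [
    {"name": "Acme Corp", "city": "Denver", "phone": "303-555-0100"},
    {"name": "Acme Corp", "city": "Denver", "phone": "303-555-0100"},  # exact duplicate
    {"name": "Globex", "city": "Boston", "phone": "617-555-0199"},
]

def record_fingerprint(record):
    """Hash a record's sorted fields so identical rows map to the same key."""
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

seen = set()
unique_records = []
for rec in records:
    key = record_fingerprint(rec)
    if key not in seen:
        seen.add(key)
        unique_records.append(rec)

print(f"Kept {len(unique_records)} of {len(records)} records")
```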

Why does duplicate data happen?

One of the most common causes is human error on the part of your employees. You probably rely heavily on employees to enter data for your organization, and fatigue sets in after hours of tedious data entry tasks. That often leads them to enter the same records more than once by mistake.

Data duplication can also occur when you aggregate data from various sources. This is a common issue for organizations that use web scraping tools to amass data from many websites, and it is a familiar problem in data mining. You can easily see how it becomes a problem if you are trying to collect data on competing businesses. You might scrape listings from the Better Business Bureau, Yelp and other listing services, and those listings may have been tweaked slightly to be unique for search engine optimization purposes. Since the copies of the data aren’t identical, you may not detect the overlap unless your querying tool applies fuzzy or semantic matching to flag near-duplicate records.
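For near-duplicates like the tweaked listings described above, exact matching is not enough. The sketch below illustrates the idea with plain string similarity from Python's standard library rather than a Hadoop-based or semantic-matching tool; the listing texts and the 0.7 threshold are illustrative only.

```python
# Rough sketch of near-duplicate detection for scraped business listings,
# using simple string similarity. Threshold and sample texts are illustrative.
from difflib import SequenceMatcher
from itertools import combinations

listings = [
    "Joe's Plumbing - 24/7 emergency plumbing in Springfield",
    "Joes Plumbing | emergency plumbing services, Springfield (24/7)",
    "Downtown Bakery - fresh bread and pastries daily",
]

def similarity(a, b):
    """Return a 0..1 ratio of how similar two listing strings are."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for first, second in combinations(listings, 2):
    score = similarity(first, second)
    if score > 0.7:  # illustrative cutoff; tune against your own data
        print(f"Possible duplicate ({score:.2f}):\n  {first}\n  {second}")
```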

Data duplication is also common when you request input from your users. They are just as prone to mistakes as your employees, although for different reasons. Instead of accidentally submitting the same data twice, they might intentionally submit slightly different data sets, for example to earn loyalty rewards that you offer in exchange for feedback. Duplicates can also appear when users run into technical issues and mistakenly believe they need to resubmit, such as thinking the system crashed while they were creating a new account.

Inconsistent data formatting

Data formatting problems are another common reason data quality suffers. When formatting is not uniform across your data pools, the data becomes very time-consuming to process. Even a robust Hadoop data mining tool can take far longer to complete queries, and if the formatting problems are severe enough, it may be impossible to mine the data or process queries at all.

The most common cause is a lack of data homogeneity. In the age of big data, companies aggregate data from many different sources, and those sources often use very different formats.

The good news is that newer data standardization tools minimize the risk of incompatibility. The downside is that organizations without reliable data scientists often don’t realize they need to invest in them, or that incompatible formatting is the reason their queries take so long to run.
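To make the standardization idea concrete, here is a small sketch that normalizes dates arriving in several formats into a single ISO 8601 representation. The list of accepted formats is an assumption; production pipelines handle far more variants or rely on a dedicated library.

```python
# Small sketch of format standardization: normalize dates from mixed sources.
# The format list is illustrative, not exhaustive.
from datetime import datetime

KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y"]

def to_iso_date(raw):
    """Try each known format and return the date as ISO 8601, or None."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # flag for manual review instead of guessing

for value in ["2019-01-25", "01/25/2019", "25 Jan 2019", "January 25, 2019", "next week"]:
    print(f"{value!r:22} -> {to_iso_date(value)}")
```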

Incomplete data sets

Incomplete data is another problem that still plagues organizations in 2019. There are many possible reasons an organization’s data may not be complete, including:

  • Over-reliance on human input. The people entering the data may fail to provide all necessary fields. In many cases you can control for this with form controls that require every field before submission, but some data still has to be entered manually, and existing data quality tools may not recognize that certain text fields are incomplete. Machine learning should make this easier in the future, but there will probably still be problems that need to be purged during manual data audits.
  • Technical problems. Data sets may be incomplete because the data was compromised by malware, incomplete system upgrades or server crashes. Previous versions can sometimes be restored from data archives, but restoring older versions may erase newer data unless the newest copies are backed up to a separate server first.
  • Administrative error. Network administrators may unwittingly delete certain data sets, or modify them beyond repair, while trying to access them.

Some of these problems can be addressed by ensuring that your data is properly backed up. Data validation is another important measure for minimizing the risk of incomplete data sets.
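As a concrete example of that kind of validation, the sketch below checks incoming records for required fields before they are stored. The required field names and sample records are hypothetical.

```python
# Minimal validation sketch: reject records with missing or blank required fields.
# Required fields and sample records are hypothetical.
REQUIRED_FIELDS = ["customer_id", "email", "signup_date"]

def find_missing_fields(record):
    """Return the required fields that are absent or blank in a record."""
    return [
        field for field in REQUIRED_FIELDS
        if field not in record or str(record[field]).strip() == ""
    ]

records = [
    {"customer_id": "C-1001", "email": "ana@example.com", "signup_date": "2019-01-10"},
    {"customer_id": "C-1002", "email": "", "signup_date": "2019-01-12"},
    {"customer_id": "C-1003", "email": "lee@example.com"},
]

for rec in records:
    missing = find_missing_fields(rec)
    if missing:
        print(f"{rec.get('customer_id', '<unknown>')}: missing {missing}")
```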

Data obsolescence

Data obsolescence is another common problem that doesn’t receive nearly as much attention as it warrants. Obsolete data is often overlooked because it was perfectly accurate at the time it was first stored. The problem is that it no longer serves a purpose and should be purged from the system.

This may occur when companies retain records for clients they no longer work with. It can also occur when data was entered accurately at the time, but important details have since changed.

This is another problem that needs to be diagnosed during data audits. Data scientists need to work with other professionals to do this, because on their own they may have no way of knowing whether certain data is still accurate. A practical approach is to set an expected lifetime for each dataset and make sure it is reviewed and updated once that lifetime lapses.
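One lightweight way to implement that kind of lifetime check is sketched below: each dataset carries a last-updated date and an agreed lifetime, and anything past its expiry is flagged for review. The dataset names and lifetimes are illustrative, not a prescribed policy.

```python
# Simple sketch of flagging stale datasets by expected lifetime.
# Dataset names, dates and lifetimes are illustrative only.
from datetime import date, timedelta

datasets = [
    {"name": "client_contacts", "last_updated": date(2018, 3, 1), "lifetime_days": 180},
    {"name": "product_catalog", "last_updated": date(2018, 12, 15), "lifetime_days": 90},
]

def is_stale(dataset, today=None):
    """Return True if the dataset's expected lifetime has lapsed."""
    today = today or date.today()
    expiry = dataset["last_updated"] + timedelta(days=dataset["lifetime_days"])
    return today > expiry

for ds in datasets:
    status = "stale - needs review" if is_stale(ds, today=date(2019, 1, 25)) else "current"
    print(f"{ds['name']}: {status}")
```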