Cleaning Big Data Is Key, but What Makes It “Dirty”?

Previously in this series we looked at some of the keys to solving big data issues. In this installment, we explore one of the most overlooked aspects of crunching the numbers: cleaning big data.

Working behind the scenes, scrubbing and verifying data might not be glamorous, but it’s essential for building a solid foundation when working with large volumes of information.

The more time a business or marketer can spend cleaning big data, the bigger the benefits will be down the road. Purging bad data doesn’t just make the volume of information more manageable; it also helps identify the key data points you’ll want to collect going forward.

Bad data can have a cascading effect, especially when it comes to customer information. One mistake in a customer’s contact info, shipping address or purchase order can wreak havoc on how your business meets their demands. Even small speed bumps can turn into major roadblocks that slow down order fulfillment and disrupt the entire customer service process.

Before we get into the nitty-gritty of cleaning big data, we have to take a look at what makes it dirty in the first place. At Right On Interactive, our clients are often dealing with some combination of these common types of dirty data issues.

The first stems from user entry errors. This is just what it sounds like: data has been manually entered incorrectly somewhere in the collection process. These issues could be as simple as an incorrect phone number or address. In many cases, fixing the issue involves the tedious process of verifying the data. That could mean physically calling customers or employees to verify their information. It could also involve cross-checking the information against a clean data set.
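To make the cross-checking idea concrete, here’s a minimal Python sketch that flags manually entered records whose details disagree with a trusted reference set. The field names and sample values are hypothetical stand-ins, not a description of any particular system.

```python
# A minimal sketch of cross-checking entered records against a clean
# reference set. Field names ("email", "phone") and sample values are
# hypothetical; swap in whatever your trusted data set actually contains.

def find_mismatches(entered_records, reference_by_email):
    """Return (entered, reference) pairs whose phone numbers disagree."""
    mismatches = []
    for record in entered_records:
        reference = reference_by_email.get(record["email"])
        if reference and reference["phone"] != record["phone"]:
            mismatches.append((record, reference))
    return mismatches

reference_by_email = {
    "jane@example.com": {"email": "jane@example.com", "phone": "317-555-0100"},
}
entered = [{"email": "jane@example.com", "phone": "317-555-0199"}]

for record, reference in find_mismatches(entered, reference_by_email):
    print(f"Verify {record['email']}: entered {record['phone']}, "
          f"reference has {reference['phone']}")
```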

Many times big data can’t be easily verified. That’s why it’s vitally important to avoid making user entry errors in the first place. Taking the time to ensure data is being entered correctly is well worth the effort, because it will save enormous amounts of time in the long run. Fixing this problem after the fact requires data search and discovery tools along with the ability to make mass updates.
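Here’s one way that kind of up-front checking might look in practice. This is only a sketch: the phone and email patterns below are illustrative assumptions, not production-grade validation rules.

```python
import re

# A minimal sketch of validating contact fields at entry time, before bad
# values reach the database. The patterns are illustrative assumptions
# (US-style dashed phone numbers, a rough email shape), not real rules.

PHONE_PATTERN = re.compile(r"^\d{3}-\d{3}-\d{4}$")
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_entry(record):
    """Return a list of problems found in a manually entered record."""
    problems = []
    if not PHONE_PATTERN.match(record.get("phone", "")):
        problems.append("phone number is not in the expected format")
    if not EMAIL_PATTERN.match(record.get("email", "")):
        problems.append("email address looks malformed")
    return problems

print(validate_entry({"phone": "3175550100", "email": "jane@example"}))
# ['phone number is not in the expected format', 'email address looks malformed']
```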

Another common dirty data problem stems from redundancy. This is often the case when multiple systems are tracking or collecting the same data sets but labeling them differently. Fixing this issue requires extensive data mapping, which can help eliminate redundancies in the collection process while also identifying gaps that need to be filled in the future.
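As a rough illustration of data mapping, the sketch below renames differently labeled fields from two hypothetical systems onto one shared schema and then collapses duplicate records. The system names and field labels are made up for the example.

```python
# A minimal sketch of mapping differently labeled fields onto one canonical
# schema, then merging duplicate records. Source names ("crm", "ecommerce")
# and field labels are hypothetical.

FIELD_MAP = {
    "crm":       {"EmailAddress": "email", "PhoneNum": "phone"},
    "ecommerce": {"e-mail": "email", "telephone": "phone"},
}

def normalize(record, source):
    """Rename a source system's fields to the canonical labels."""
    mapping = FIELD_MAP[source]
    return {mapping.get(key, key): value for key, value in record.items()}

def deduplicate(records, key="email"):
    """Keep one merged record per key; later records fill in or overwrite fields."""
    merged = {}
    for record in records:
        merged.setdefault(record[key], {}).update(record)
    return list(merged.values())

crm_rows = [{"EmailAddress": "jane@example.com", "PhoneNum": "317-555-0100"}]
shop_rows = [{"e-mail": "jane@example.com", "telephone": "317-555-0100"}]

combined = [normalize(r, "crm") for r in crm_rows]
combined += [normalize(r, "ecommerce") for r in shop_rows]
print(deduplicate(combined))
# [{'email': 'jane@example.com', 'phone': '317-555-0100'}]
```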

One common cause of dirty data is often the toughest to manage: data corruption. Corruption happens when errors are introduced while data is being written, read, stored, processed, or transmitted.

If data is corrupted, it can become inaccessible or incomplete, which can lead to error messages and disruptions in the applications that depend on it. In many cases, custom solutions or manual field edits are required to intelligently correct the data, and that can take a lot of time and effort.
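One rough way to catch corruption early is to store a checksum alongside each record and recompute it before the data is used. The sketch below assumes that setup purely for illustration; it isn’t a description of any particular product or process.

```python
import hashlib

# A minimal sketch of flagging corrupted records by comparing a stored
# checksum with one recomputed from the record's contents. Storing a
# checksum per record is an assumption made for this illustration.

def checksum(record):
    """Compute a stable hash of a record's fields."""
    payload = "|".join(f"{key}={record[key]}" for key in sorted(record))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def flag_corrupted(rows_with_checksums):
    """Yield records whose contents no longer match their stored checksum."""
    for record, stored in rows_with_checksums:
        if checksum(record) != stored:
            yield record

clean = {"email": "jane@example.com", "phone": "317-555-0100"}
stored_sum = checksum(clean)
damaged = {"email": "jane@example.com", "phone": "317-555-0###"}  # garbled field

print(list(flag_corrupted([(clean, stored_sum), (damaged, stored_sum)])))
# [{'email': 'jane@example.com', 'phone': '317-555-0###'}]
```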

Right On Interactive specializes in advanced data collection, mining and cleanup. In the next installment of the series, we’ll explore some of the specific processes involved in cleaning and mining data.