Jumping the 3 Big Hurdles to Predictive Modeling, Part 1: Data Prep

As a retail marketer looking to compete in today’s customer-driven market, acknowledging the need to invest in predictive modeling is a no-brainer. Everything after that takes more brains than a zombie buffet.

No sane and decent retail marketer currently holds the opinion that data and analysis are unimportant. Or at least, we think that there’s no one out there like that. Maybe we’re in a bubble.

We’ve written extensively on the importance of being data-driven and customer-centric if you’re going to scale your business in The Age of Amazon.

But if you’re here, you probably already know that data and analysis hold the keys to success, and what you really want is for us to tell you how to get there. So let’s do that.

Oftentimes when we speak with organizations that are early on in the maturity curve—i.e., they’ve just started hiring data scientists or building their own in-house data science capabilities—there's an incredibly rosy picture of what the endgame of data science will look like in their organization.

They think they'll spend the vast majority of their time on exciting R&D projects, experimenting with models! neural networks! logistic regressions! and machine learning! Then they’ll spend some smaller percentage of time improving and maintaining the performance of those models.


But our (admittedly unscientific) observation of how teams actually spend their time tells a different story.

In reality, we find data science teams wearing many, many hats and spreading themselves thin across a number of activities involved in going from ideation—like, What is the business problem that we want to solve?—through having a fully functioning model that's producing output at scale.

Organizations that don't have experience in building models tend to be surprised after they've hired their first few data scientists. It’s only then that they realize there's actually a rather complicated supply chain of activities that have to take place to go from the raw data to action built on those insights.

Taking a 30,000-foot perspective, the three big hurdles are:

  1. Preparing the data

  2. Building the models

  3. Making the outputs useful

These are the fundamental hurdles that we’re going to unpack, so that whether you choose to work with Custora or hire your own data science team, you understand what’s required to get to a useful output.


The First Challenge: Preparing Your Data for Predictive Modeling

At a high level, our advice is simple: preempt the rookie mistakes of data science by adhering to a clear, carefully crafted plan.

This must include a roadmap for quality control over the “raw materials” fueling your marketing analytics (by which we mean the oodles and oodles of data pouring in from online and offline channels). Even the most sophisticated tool will fail to deliver meaningful insights if it’s swallowing disorganized, incomplete, or otherwise subpar data.

The first hurdle you must clear when developing your brand’s predictive capabilities appears far before you dive into modeling proper: you must prepare your data for use.


1. Prioritize a Business Challenge

You need to start with a clear-eyed view of the business challenge you’re going to conquer. Without this in mind, you’re just spinning your wheels.

Many brands throw money at computation for computation’s sake and end up with the right models for the wrong questions. It’s a little like buying a bunch of hammers, then looking for some nails.

Investing in a team of in-house data scientists to bulldoze every hurdle in your path—so you don’t have to think too hard about things—might seem like a good idea, but this kind of overzealousness creates more work than necessary and drains resources.

As a retail marketing team, you’re approaching things backward if you find yourself asking, “How can we use this model?” or “Where can we make this work?” The ideation and creation of a data-driven predictive model should always follow the identification of an urgent problem — one whose solution the proposed model is tailor-made to produce.

The shape of these problems will vary depending on your specific circumstances, but one thing is for sure: gone are the days when overall retail success can be measured using entirely product- or channel-centric metrics. In the age of customer-centricity, you’re better off focusing on metrics that quantify the efficacy with which you attract and nurture a contingent of loyal repeat buyers.


2. Cleanse Your Data

Once you’ve zeroed in on the question(s) you’d like your model to answer, you can begin consolidating the data from which the model will draw.

In layman's terms, unifying and cleansing inputs means you are gathering and centralizing data from disparate sources and stitching it together into a cohesive unit.

Data cleansing has three major categories:

  1. Data standardization

  2. Data validation

  3. Data deduplication and consolidation

Data standardization creates uniformity by grouping like values in a set together. In the world of e-commerce, there are countless instances where slight variations in data will carry the same operational value.

For example, when inputting a shipping address, Jayne Dough might write out “street” for her first purchase but then, for her second purchase, she abbreviates it as “st,” thus creating two records. Recognizing that these data points represent the same person ensures data is organized based on standardized criteria rather than meaningless differences.
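As a minimal sketch, standardization can be as simple as normalizing case, punctuation, and common street-type abbreviations before comparing records (the abbreviation map below is illustrative; a real pipeline would use a fuller dictionary or a dedicated address-parsing library):

```python
import re

# Hypothetical abbreviation map; a production pipeline would use a
# much fuller dictionary or a dedicated address-parsing library.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "boulevard": "blvd"}

def standardize_address(address: str) -> str:
    """Lowercase, strip punctuation, and collapse common street-type
    variants so equivalent addresses compare equal."""
    tokens = re.sub(r"[^\w\s]", "", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

# Jayne Dough's two shipping addresses now standardize to one value:
a1 = standardize_address("123 Main Street")
a2 = standardize_address("123 Main St.")
assert a1 == a2 == "123 main st"
```

With both records reduced to the same canonical string, downstream grouping treats them as one address rather than two.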

Next, thorough validation processes guarantee that data makes sense against all governing business rules. An obvious glitch in data might appear if, for instance, the assigned date of return is actually earlier than the date of purchase. These are the kinds of things that need to be fixed before modeling can begin.  
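A validation check for the purchase/return rule above might look like the following sketch (the function and rule names are illustrative, not a real API):

```python
from datetime import date

def validate_return(purchase_date: date, return_date: date) -> list[str]:
    """Check one return record against basic business rules and report
    any violations. Rule names here are illustrative."""
    errors = []
    if return_date < purchase_date:
        errors.append("return precedes purchase")
    if return_date > date.today():
        errors.append("return is in the future")
    return errors

# A return dated before its purchase is flagged for review:
assert validate_return(date(2019, 3, 1), date(2019, 2, 27)) == ["return precedes purchase"]
```

Records that fail such checks are routed to a review or repair step rather than fed straight into the model.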

Finally, data deduplication and consolidation eliminates redundant pieces of information and provides a retailer with a single definitive set of records. Should the same customer check out as a guest three times and input slightly different variations of her name or address each time, the model should recognize these variable inputs as coming from the same person and consolidate them into a single user profile.
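A toy illustration of consolidation, assuming email is the identity key (a real system would also fuzzy-match on name and address, as described above):

```python
# Group guest checkouts under one profile by keying on a standardized
# identity. Email is the key here for simplicity; real consolidation
# would use fuzzier matching across name and address as well.
orders = [
    {"email": "Jayne.Dough@example.com", "name": "Jayne Dough", "total": 40.0},
    {"email": "jayne.dough@example.com", "name": "J. Dough", "total": 25.0},
    {"email": " jayne.dough@example.com", "name": "Jayne D.", "total": 60.0},
]

profiles: dict[str, dict] = {}
for order in orders:
    key = order["email"].strip().lower()  # standardized identity key
    profile = profiles.setdefault(key, {"orders": 0, "lifetime_value": 0.0})
    profile["orders"] += 1
    profile["lifetime_value"] += order["total"]

# Three guest checkouts collapse into a single customer record:
assert len(profiles) == 1
assert profiles["jayne.dough@example.com"]["orders"] == 3
```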


3. Transform and Restructure Data

The final step in prepping your data for a predictive model that can produce valid insights involves strategically restructuring that data. This step is essentially where you enhance the data to make it more useful for a predictive model.

For example, take an online wine store that, by law, must gather date of birth data on each customer. When modeling behavior, it wouldn’t be very useful to bucket customers by exact DOB. Pulling back a bit, bucketing customers by birth year is probably still too granular. What would help you, the wine vendor, predict the lifetime value of a given set of customers is bucketing them by age range. Establishing this range is a very simple example of data transformation.
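A minimal Python sketch of that transformation, with illustrative (not prescriptive) bucket edges:

```python
from datetime import date

def age_bucket(dob: date, today: date) -> str:
    """Transform an exact date of birth into a coarse age range.
    Bucket edges are illustrative, not prescriptive."""
    # Subtract one if the birthday hasn't occurred yet this year.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    if age < 30:
        return "21-29"
    if age < 45:
        return "30-44"
    if age < 60:
        return "45-59"
    return "60+"

assert age_bucket(date(1985, 6, 1), date(2019, 2, 27)) == "30-44"
```

The model then sees a handful of meaningful age segments instead of thousands of nearly unique birth dates.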

Consider this: the difference in engagement between a user who has logged in to your app once and a user who has logged in 10 times is more notable than the difference between a user who has logged in 10 times and a user who has logged in 20 times.

The discrepancy between a 1- and a 10-session user is incredibly informative, as it indicates that the 1-time user was curious about your brand but ultimately didn’t find what they were looking for, whereas the 10-time user likely remains a valuable customer.

Conversely, the jump from 10 to 20 logins doesn’t carry nearly as many implications, as such increases typically point more to variations in customers’ phone usage or shopping patterns than their interest in or commitment to your brand.

In such a scenario, you can transform your datasets with a logarithmic function, ensuring analyses of the data highlight only compelling behavioral deviations—treating the jump from 1 to 10 visits as comparable in significance to the jump from 10 to 100.
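A quick illustration using the standard library’s `math.log1p` (log of 1 + x, which handles zero counts gracefully):

```python
import math

# A log transform compresses high counts: the 1 -> 10 jump stays
# large while the 10 -> 20 jump shrinks, matching the behavioral
# significance of each gap described above.
logins = [1, 10, 20]
transformed = [math.log1p(n) for n in logins]

gap_low = transformed[1] - transformed[0]   # 1 vs. 10 logins
gap_high = transformed[2] - transformed[1]  # 10 vs. 20 logins
assert gap_low > gap_high
```

On the raw scale the two gaps are 9 and 10 logins—nearly equal—but after the transform the 1-to-10 gap dominates, which is exactly the behavior the model should weight more heavily.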

Likewise, restructuring your data can alchemize opaque information into meaningful, actionable insights. On the surface, a SKU and order date might not appear particularly valuable to a human observer, but a model equipped to glean information from cross-referencing these data points can convert an otherwise purely transactional insight into a customer-centric insight—a certain code on a certain day linked to a certain user becomes a “first denim purchase,” a useful and personal fact that you can draw on to inform your marketing strategy.
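A hypothetical sketch of that restructuring, assuming a SKU-to-category lookup drawn from a product catalog (all SKUs, customer IDs, and the helper function here are made up for illustration):

```python
from datetime import date

# Hypothetical SKU-to-category lookup; in practice this comes from
# the retailer's product catalog.
SKU_CATEGORIES = {"DNM-001": "denim", "TEE-042": "tops"}

orders = [
    {"customer": "c123", "sku": "TEE-042", "order_date": date(2019, 1, 5)},
    {"customer": "c123", "sku": "DNM-001", "order_date": date(2019, 2, 10)},
    {"customer": "c123", "sku": "DNM-001", "order_date": date(2019, 3, 1)},
]

def first_category_purchase(orders, category):
    """Earliest order date per customer within a product category—
    e.g., each customer's 'first denim purchase'."""
    firsts = {}
    for o in sorted(orders, key=lambda o: o["order_date"]):
        if SKU_CATEGORIES.get(o["sku"]) == category:
            firsts.setdefault(o["customer"], o["order_date"])
    return firsts

# A raw SKU + order date becomes a customer-centric fact:
assert first_category_purchase(orders, "denim") == {"c123": date(2019, 2, 10)}
```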

Once you’ve standardized, validated, deduplicated, transformed, and restructured your data, you’re ready to tackle the second challenge of predictive modeling: building the actual models.

This is Part One of a three-part series. If you want to skip around, here are links for Part Two and Part Three. If you’d like to learn about this topic in more detail, check out our webinar of the same name, Jumping the 3 Big Hurdles to Predictive Modeling.
