I had a call recently from someone asking if I knew any good data scientists, as his company had a data problem and apparently hiring a data scientist was the answer. My response was so unprintably sarcastic and cutting that I felt sorry for him afterwards as he tried to prise the melted phone off his ear. Most companies have a data problem. Some companies have worked out it is in fact a data opportunity. A few have monetised their (or in many cases our personal) data and are smugly looking down at the rest of us from the top of their money pile.
As I discussed in ‘The Real-Time Business’ blog, most companies already capture most of the data that could be valuable to them. Doing something meaningful with that data is where the majority flunk it. But it’s hardly rocket science. Let’s break the problem into its component parts. The standard characteristics for big data are normally stated as the 5 Vs: Volume, Variety, Velocity, Variability and Veracity. I prefer to think of it as the 5 Ds: Dump, Deviance, Diarrhoea, Dirty and Disappointment. Apologies if you were eating while reading this.
Data Dump. We all know that the amount of data being created these days is enormous, and has been increasing about 40% per year, or doubling every 2 years as my mainly nerdy readers will delightedly inform you. When I was struggling to change the large and very heavy 10Mb disk packs on the first S/360 I worked on, there was less than a terabyte of computer-stored data in the world. Now there is estimated to be around 10 ZB (which disappointingly isn’t Zebedees, but Zettabytes), or 10,000,000,000 Terabytes; not all of which, surprisingly, is porn and cat videos. Not that I look at either, of course. I’ll leave you clever folk to work out when we hit a Yottabyte, but unlike Moore’s Law, it’s unclear if and when this growth will ever slow down.
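For anyone who would rather not do the sum in their head, here is a rough sketch in Python. It assumes the figures quoted above (about 10 ZB today, growing about 40% a year) and, rather heroically, that the growth rate never changes. The function name is mine, purely for illustration.

```python
import math

ZETTABYTES_PER_YOTTABYTE = 1000  # 1 YB = 1,000 ZB (decimal SI prefixes)


def years_to_reach(target_zb, current_zb=10.0, annual_growth=0.40):
    """Years of compound growth needed to get from current_zb to target_zb.

    Assumes a constant annual growth rate, which is a simplification.
    """
    return math.log(target_zb / current_zb) / math.log(1.0 + annual_growth)


# How long 40% a year takes to double the pile.
doubling_time = math.log(2) / math.log(1.40)

# How long until the pile reaches one Yottabyte (1,000 ZB).
years_to_yottabyte = years_to_reach(ZETTABYTES_PER_YOTTABYTE)

print(f"Doubling time at 40% a year: {doubling_time:.1f} years")
print(f"Years until a Yottabyte: {years_to_yottabyte:.1f}")
```

On those assumptions the doubling time comes out at just over two years, and the Yottabyte arrives a little under 14 years from now. Change the growth rate and the answer moves a lot, which is why nobody should bet the house on it.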
Deviant Behaviour. As well as having lots more data, we have more sources of it, so besides standard customer and product details, we have enriched feeds from the digital world, social media, devices, and the Internet of Things, containing not only traditional text/numerical data but also audio, images, video, Pokémon, etc. These data can come with time and geo-based tagging, allowing trends and journeys to be traced, visualised and replayed. A lot of the relationships between data are obvious, but some are not. Various tools and techniques are appearing to help with this – data mining, auto-discovery, machine learning, etc. The best ones are visual, as the smartest pattern recognition tools still seem to be our eyes.
Digital Diarrhoea. Not only is there lots of data, but it is being spewed forth with increasing speed. Velocity implies it is going in one direction, but my experience is more one of Brownian motion, with data flying randomly in all directions. It is no longer feasible to print off a report and read what is going on. We have been reduced to scrolling and swiping to try to keep up with an onslaught of emails, sales data, invitations, Kardashians, etc., so relentless that few people get anything useful done if they plug themselves into these floods of data. It’s got so bad that seriously stressed people are paying for a digital detox – getting left in a field with no phone or tablet until the shakes and finger twitches subside and they notice their other senses kicking in: hearing, taste, and particularly smell from the field (or sometimes person) next to them.
Dirty Data. In a perfect digital world, all datasets would be complete, correct and consistent – as ‘Just A Minute’ insists: no hesitation, deviation, or repetition. Unfortunately, since mankind started collecting data, we have so far failed to eradicate the chaotic element that leads to poor quality data – people. Obviously you and I don’t make mistakes, or inadvertently put next door down to receive our junk mail instead of us. So let’s just blame everyone else. Whatever the truth, it just takes a thick finger or individual to muck up data collection, data entry, or the piece of JavaScript that corrupts the information we collect and process. According to many of the companies I have dealt with, my name is actually Mo, Moët, Mee or Mog. How careless of me(e) to call myself Moe. And I don’t have a ‘Nob Title’ of ‘TIT Director’. The amount of trouble caused by dirty data ranges from irritating to fatal. A whole industry has sprung up to offer data cleansing or Scrub ‘n’ Match services to minimise or fix quality problems.
Disappointing Data. So you have collected your Zebedee of data, with Terrapins more coming in every day. Your suits have been convinced by the pseuds that there’s gold in them there data lakes, and the future of their company is in monetising their data. So you pull your CIO (Chief Insight Officer) or CDO (Chief Data Officer) into your office and tell them to make it so. (BTW, consider yourself lucky if these roles haven’t yet appeared in your organisation – their appearance typically heralds a new false dawn when your company has forgotten what it does and your leaders have lost it.) So the CIO/CDO assembles their team of self-styled data scientists, data gurus and assorted data lackeys, buys the latest data mining/machine learning/predictive analytics tool, rents half of Amazon’s cloud storage estate and gets cracking on digging for gold. Six months of panning, sluicing and dredging later they discover what those of us who’ve either worked in mining, or even just watched Poldark, could have told them up front. Viable gold mines work on 1 part per million of gold to ore, platinum about 0.5ppm. And that is if you start with a rich vein of the stuff. Everything else is waste. So it is with data.
Deliverance. Amongst the facts and information of dubious quality, you may have some nuggets that you could use to increase some sales figures, or reduce churn, or to sell to someone else who might gratefully use your data to abuse your customers. What you really need to be wary of is some of the more dubious claims by this increasingly desperate money pit. With expressions such as ‘causal inference’ and ‘propensity model’, they will attempt to convince you that your core cohort is actually left-handed 30-49yo single white females from Bradford who have just read The Girl on the Train. They will have some selective data to support their claims presented in shiny graphs on PissPoorPoint by their data pseudo-scientists. Most of these findings are glaringly obvious (umbrellas sell best when it’s raining), or against common sense (drop umbrellas, and invest in head turbines to blow the rain off people). It would take a brave, or more likely desperate, exec to back many of these claims. What normally isn’t explained are the constraints of their method. If this was a true science:
· The method would be published and verified by some standards body
· The provenance and transformation of the data would be fully explained
· The confidence limits and probability ranges would be shown, and caveated
· The unknowns, gaps, and non-conforming data would be highlighted
· The results would be peer reviewed by independent experts
· Assumptions, constraints and dependencies would be documented
Unfortunately, my experience is that the CDO/CIO is likely to go running to the CMO or Sales VP as soon as they have a sniff of some potential find, without disclosing any of the gotchas, creating pressure to start building the head turbines. Of course, these desperate data disaster dispensers need to justify the high costs they are incurring, but they shouldn’t have been allowed to blow a wad without better commercial controls in the first place.
If you have an effective data management and assurance function delivering timely, accurate data, any competent line manager should be able to interrogate their data marts with the many self-serve visualisation and query tools available, spot the real opportunities and threats to their business, and act accordingly. If you do want some science to validate the findings, hire a statistician who can give you the range and peaks of probabilities from the sparse, crappy data you have, which you can then feed into your business planning.
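To give a flavour of what that statistician might hand back, here is a minimal Python sketch of a Wilson score interval for a proportion, such as the share of a small sample of customers who bought the umbrella. The function name and the numbers (12 buyers out of 40) are made up for illustration, and a real analysis would also have to worry about how the sample was collected in the first place.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Approximate confidence interval for a proportion (Wilson score).

    z=1.96 corresponds to roughly 95% confidence under a normal approximation.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    return centre - half_width, centre + half_width


# Hypothetical sample: 12 of 40 customers bought.
low, high = wilson_interval(12, 40)
print(f"Observed rate: {12 / 40:.0%}, plausible range: {low:.0%} to {high:.0%}")
```

With a sample that small the interval is wide, stretching from under 20% to over 40%, which is rather the point: the headline 30% is a range, not a fact, and the business plan should be fed the range.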
John ‘Mining the Merciless’ Moe
