For IBM, a world leader in AI as its Watson project has demonstrated, applying intelligence to storage is a natural fit. We’re facing a data onslaught like never before: once IoT gets rolling, we’ll be generating more data than we have capacity to store.
Just as any software problem can be solved by adding a layer of indirection, any analytics problem can be solved by adding a layer of intelligence. Of course, we know a lot more about indirection than we do about intelligence.
Wheat vs chaff
IBM researchers are demoing an intelligent storage system that works something like your brain: It’s easier to remember something important, like a beautiful sunset over the Grand Canyon, than the last time you waited for a traffic light.
In Cognitive Storage for Big Data (paywall), Giovanni Cherubini, Jens Jelitto, and Vinodh Venkatesan of IBM Research-Zurich describe their prototype system. The key is using machine learning to determine data value.
If you’re processing IoT data sets, the storage system’s AI would "know" what was important about prior data sets and apply those criteria – access frequency, protection level, divergence from norms, time value, etc. – to incoming data. As the system watches human interaction with the data set, it learns what is important to users, then tiers, protects, and stores data according to their needs.
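To make the learn-then-apply loop concrete, here is a minimal sketch. It is not IBM's prototype: the majority-vote "learning", the `POLICY` table, the tier names, and the copy counts are all hypothetical stand-ins chosen for illustration. It only shows the shape of the idea: observe how users valued earlier data with a given metadata value, then place incoming data accordingly.

```python
from collections import Counter, defaultdict

# Hypothetical policy table: relevance class -> (storage tier, number of copies).
POLICY = {"high": ("flash", 3), "medium": ("disk", 2), "low": ("tape", 1)}


def learn(history):
    """Build a lookup from metadata value to its most frequently observed relevance class.

    history: iterable of (metadata_value, relevance_class) pairs, e.g. derived
    from how users treated earlier data sets.
    """
    votes = defaultdict(Counter)
    for metadata_value, relevance in history:
        votes[metadata_value][relevance] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in votes.items()}


def place(model, metadata_value, default="medium"):
    """Return (tier, copies) for incoming data; unseen metadata gets the default class."""
    return POLICY[model.get(metadata_value, default)]


# Illustrative history: anomaly records were mostly valued, heartbeats were not.
history = [
    ("anomaly", "high"), ("anomaly", "high"), ("anomaly", "medium"),
    ("heartbeat", "low"), ("heartbeat", "low"),
]
model = learn(history)
print(place(model, "anomaly"))
print(place(model, "heartbeat"))
```

A real system would weigh many metadata fields at once and keep learning as usage changes. The point here is only that the storage policy is driven by learned value rather than by a fixed rule.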
The researchers used a learning algorithm known as the "Information Bottleneck" (IB):
. . . a supervised learning technique that has been used in the closely related context of document classification, where it has been shown to have lower complexity and higher robustness than other learning methods.
IB, essentially, correlates the information’s metadata values to cognitive relevance values with the goal of preserving the mutual information between the two. The greater the mutual information, the more valuable the data and, hence, the higher the level of protection, access, and so on.
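For readers who want a feel for the quantity involved, the sketch below estimates the mutual information between one discrete metadata field and a relevance label from paired samples. This is only the standard mutual information formula, not the Information Bottleneck algorithm and not the authors' code. The sample values ("a", "b", "hi", "lo") are made up.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Estimate I(X;Y) in bits from two equal-length sequences of discrete samples."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of (x, y) pairs
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x,y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y)
        mi += p_xy * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Metadata that fully determines relevance carries 1 bit about a two-way label.
informative = mutual_information(["a", "a", "b", "b"], ["hi", "hi", "lo", "lo"])
# Metadata that is independent of relevance carries none.
uninformative = mutual_information(["a", "a", "b", "b"], ["hi", "lo", "hi", "lo"])
print(informative, uninformative)
```

A metadata field with high mutual information is a good predictor of how users value the data. A field with none tells the storage system nothing useful.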
The Storage Bits take
Enabling machine intelligence to delete less valuable data is an essential feature. And it’s the capability most likely to frighten users. Establishing human trust in machine intelligence is a major domain problem – see Will Smith’s character in I, Robot.
Sure, you can schlep unlikely-to-be-needed data off to low cost tape – IBM is a leading tape drive vendor – but the "store everything forever" algorithm doesn’t scale – and if something can’t go on forever, it won’t.
Another issue – which is beyond the scope of the paper – is also scale-related: how large will the storage system need to be to justify the cost and overhead of cognition? Enterprise scale or purely web-scale?
There have been many attempts to add intelligence to storage systems. They’ve failed because the intelligence cost more than additional storage. Storage costs continue to fall faster than computational costs, creating a difficult economic dynamic for cognitive storage. Time for some algorithmic magic!
Nonetheless, the IBM team is doing important work. While the applications of machine intelligence are many, they aren’t infinite. Understanding its limits with respect to the foundation of any digital civilization – storage – is critical to our cultural legacy.