A small but growing number of IT teams have big plans to improve their businesses with data repositories in the public cloud.
Forward-leaning IT shops — and the vendors that push cheap storage — see huge potential in the next wave of business intelligence, which would use a range of cloud-based services to tap into an ever-expanding cache of structured and unstructured data. Still, the greatest payoffs for this model remain largely hypothetical, as most enterprises are in the exploratory phase at best — both architecturally and culturally.
There are a range of managed products for IT shops that want to go this route, including Amazon Redshift, Google Cloud Platform’s BigQuery and Microsoft Azure SQL Data Warehouse. These data warehouses continue to simplify data exploration through more abstraction and integration with related services — in some cases, without the need to spin up instances. The ultimate goal is for companies of all sizes to emulate the successes of web-scale corporations that emphasize automation and squeeze more out of data collection.
Descartes Labs Inc., a satellite imagery company based in Los Alamos, N.M., had a mix of big data tools on premises and in the cloud before it moved primarily to Google Cloud Platform. The company stores large quantities of raw data because it doesn’t always know what questions its customers will ask. Now, it leans heavily on Google BigQuery, Bigtable and object storage to meet those demands.
The shift is part of a more developer-centric approach in which employees pick the best tool for the problem they’re trying to solve, said Tim Kelton, Descartes Labs’ co-founder.
“The biggest change, along with maybe microservices, is being able to have lots of different individual teams just in two minutes start something and say, ‘Does this work for the scenario we’re trying to work for?’ And not just say, ‘The adopted solution is Oracle SQL, and everything has to work toward that,'” Kelton said.
Enterprises have begun to incorporate data lakes, or massive pools of raw data, alongside more traditional data warehousing. At the same time, the cloud has emerged as a viable place to host data and as a space to experiment with advanced analytics on poly-structured data without heavy capital investments.
“It reflects a greater desire for the ability to accommodate types of data that we couldn’t really get our heads around before, or didn’t have the technology or capabilities to utilize,” said Adam Ronthal, a research director at Gartner.
Cloud data warehouse services are a boon to both providers and customers. The so-called hyperscale platforms — Amazon Web Services (AWS), Microsoft Azure and Google Cloud Platform — offer relatively cheap storage to take advantage of the workload gravity that comes with databases and other critical systems. From there, they’re positioned to sell a host of higher-level services seen as the future of cloud computing — services intended to belie the notion that these platforms are little more than commoditized VM hosting.
AOL replaced a Cloudera Hadoop environment with Amazon EMR and found significant savings compared to what it cost on premises; it now stores its data in Amazon Simple Storage Service (S3) and uses EC2 Spot Instances to spin up and tear down nodes as needed. The next step will be to integrate with other AWS tools, such as Lambda for trigger-based functions and Kinesis Firehose for streaming data. With this, AOL hopes to achieve even greater efficiencies and inventory control.
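Trigger-based functions of the kind AOL describes are typically invoked by events such as new objects landing in an S3 bucket. A minimal sketch of that pattern, assuming a hypothetical handler name and a placeholder processing step (the event shape follows the standard S3 notification format):

```python
import json

def handle_s3_event(event, context=None):
    """Lambda-style handler invoked when new objects land in S3.

    Parses the standard S3 event payload and returns the object paths
    that would be processed; the processing step itself is a placeholder
    for illustration.
    """
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Placeholder: a real handler might transform the object or
        # forward it to Kinesis Firehose for downstream streaming.
        processed.append(f"s3://{bucket}/{key}")
    return {"statusCode": 200, "body": json.dumps(processed)}
```

Wired to a bucket notification, a function like this runs only when new data arrives, so there is no standing cluster to pay for between events.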
“We’ve seen another revolution in what can be done because of building higher-level services,” said James LaPlaine, CIO at AOL.
Enterprise road to a cloud data warehouse not without potholes
Those higher-level services can be a rather sticky proposition, however. Egress costs can be prohibitive, and customers should have as much of their data on the cloud as possible to maximize benefits from those proprietary services. So, while a cloud data warehouse works great for startups that can start anew on their platform of choice, it can create myriad challenges for enterprises in the midst of a transition.
In-house structured data often must be cleaned or rewritten before it can move. For this reason, AOL, like many other companies in the same situation, opted to keep much of its historical data on premises. Companies in transition also must contend with massive, older data sets that traditionally reside on premises; in those scenarios, IT pros must consider not only the costs of compute and storage, but also of networking, as scaling access to that storage can get extremely expensive.
Other enterprises, such as The New York Times Co., have workloads in different public clouds.
“It is a lot simpler if everything lives in the same place, so we don’t have to have a Redshift cluster and also have data in BigQuery,” said Matt Digan, executive director of data engineering at The Times. “It’s not easy to join those two data sets.”
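Joining data sets split across warehouses, as Digan describes, typically means exporting from one system and loading into the other, or pulling both extracts into a neutral store. A minimal sketch of the local-join fallback, assuming two hypothetical CSV exports — one from Redshift, one from BigQuery — that share a `user_id` key and hypothetical `pageviews` and `sessions` columns:

```python
import csv
import io
import sqlite3

def join_exports(redshift_csv, bigquery_csv):
    """Join two warehouse CSV exports on user_id via in-memory SQLite.

    Each argument is a file-like object holding a CSV export whose
    header row includes a user_id column.
    """
    conn = sqlite3.connect(":memory:")
    for name, fh in (("redshift", redshift_csv), ("bigquery", bigquery_csv)):
        rows = list(csv.reader(fh))
        header, data = rows[0], rows[1:]
        cols = ", ".join(f'"{c}"' for c in header)
        conn.execute(f"CREATE TABLE {name} ({cols})")
        placeholders = ", ".join("?" for _ in header)
        conn.executemany(f"INSERT INTO {name} VALUES ({placeholders})", data)
    # Hypothetical schema: metrics columns exported from each warehouse.
    return conn.execute(
        "SELECT r.user_id, r.pageviews, b.sessions "
        "FROM redshift r JOIN bigquery b ON r.user_id = b.user_id"
    ).fetchall()
```

The sketch shows why the split is painful: the join happens outside either warehouse, on whatever machine holds both extracts, losing the scale and optimization of the warehouses themselves.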
Enterprise IT shops also need different skill sets and must be ready for cultural changes. Beyond the most elite companies working on a planetary scale, it’s not realistic for anyone accustomed to traditional infrastructure to make that jump just yet, said Ted Chamberlin, an analyst with Gartner.
“For the average enterprise, it’s great to aspire to that, but most are going to have two to three to four years of transforming their enterprise and cutting out what they don’t want and moving to stateless,” Chamberlin said.
Future of cloud tied to data warehouses
Despite these challenges, enterprises that have started moving data warehouses to public cloud see a big payoff down the road.
In the past, The New York Times built its own Hadoop cluster and used a host of vendors for data warehouses, including Informatica, Oracle and AWS. Part of the problem with this approach was that the data was too siloed or too technical for many employees to use. The Times is in the midst of migrating to Google Cloud Platform, which it ultimately hopes will serve as a single receptacle for that data. This also would make it simpler for a range of employees to use the analytics tools.
“Our goal is to get the data to our users, whether that’s a data analyst or data scientist, or people who just need to make sense of something as quickly and accurately as possible,” Digan said.
The Times plans to roll the entire enterprise into the system and put everything into BigQuery to gain a unified view of its readers. Further down the road, Digan said he envisions the use of data services, machine learning models and APIs to build products — internal and external-facing — that will enable the company to glean more insights about its readers and sales and, in turn, provide a more personalized experience for readers.
It’s that potential that most excites Digan, but it will be a learning experience that won’t happen overnight, as The Times determines the right questions to ask of its data.
“That’s something we’re going to learn as we go along,” Digan said. “Exploration is a lot easier now, so when analysts have a query, they’re enabled to look up those queries themselves without handholding, but we don’t quite know everything we’re going to get into.”
Trevor Jones is a news writer with SearchCloudComputing and SearchAWS. Contact him at firstname.lastname@example.org.