HDInsight Succinctly®
by James Beresford

CHAPTER 1

Platform Overview


Microsoft’s Big Data Platforms

The world of data is changing in a big way, and expectations about how to interact with and analyze that data are changing as a result. Microsoft offers a broad and scalable portfolio of data storage capabilities for structured, unstructured, and streaming data, both on-premises and in the cloud.

Microsoft has been present in the traditional BI space through the SQL Server platform, which scales comfortably into the hundreds of gigabytes without much need for specialist hardware or clever configuration. Since approximately 2010, Microsoft has also offered two specialist appliances to scale higher: the SQL Server Fast Track Data Warehouse for anything up to 100 terabytes, and the SQL Server Parallel Data Warehouse (PDW) for anything entering the petabyte range.

However, these platforms deal only with relational data, and the open-source movement overtook Microsoft (and indeed many other vendors) with the emergence of Hadoop. Microsoft did have a similar internal platform called Dryad but, shortly before Dryad was expected to go live, it was dropped in favor of creating a distribution of Hadoop in conjunction with Hortonworks.[1] [2]

From that decision point, various previews of the platform were made available as on-premises or cloud versions. Early in 2013, the HDInsight name was adopted for the preview (replacing the original “Hadoop on Azure” name) and the cloud platform became generally available in October 2013. The on-premises version is, at the time of this writing, still in preview with no firm release date.

Aspects of these technologies are working their way back into the relational world: the 2.0 version of the Parallel Data Warehouse features support for Hadoop, including a query technology called PolyBase that allows a single statement to query both relational and nonrelational data.[3]

Data Management and Storage

Data management needs have evolved from traditional relational storage to both relational and nonrelational storage, and a full-spectrum information management platform needs to support all types of data. Delivering insight on any data requires a platform that provides a complete set of data management capabilities across relational, nonrelational, and streaming data. The platform needs to move data seamlessly from one type of store to another, and to monitor and manage all data regardless of its type or structure. This has to occur without the application having to worry about scale, performance, security, and availability.

Moving data between a nonrelational store (such as Hadoop) and a relational data warehouse is one of the key Big Data customer usage patterns. To support this pattern, Microsoft provides connectors for high-speed data movement between Hadoop and existing SQL Server data warehousing environments, including SQL Server Parallel Data Warehouse.

There is a lot of debate in the market today over relational versus nonrelational technologies. “Should I use relational or nonrelational technologies for my application requirements?” is the wrong question. Both are storage mechanisms designed to meet very different needs, and the two should be considered complementary.

Relational stores are good for structured data where the schema is known; programming against them requires an understanding of declarative query languages like SQL. These platforms deliver a store with high consistency and transaction isolation.

In contrast, nonrelational stores are good for unstructured data where a schema does not exist or is expensive to apply, and where querying is more programmatic. These platforms give greater flexibility and scalability, with the tradeoff of losing the ability to easily work with the data in an ACID manner; however, not all NoSQL databases make this tradeoff (RavenDB, for example).
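
To make the contrast between the two querying styles concrete, here is a minimal sketch in Python. It is illustrative only and is not HDInsight or Hadoop code: the sample data, the `views` table, and the function names are all assumptions invented for this example. The first function declares a schema and lets SQL state the desired result; the second imposes structure on raw text lines while reading them, which is closer to how nonrelational data is typically processed.

```python
import sqlite3
from collections import Counter

# Hypothetical sample data (illustrative only): page hits per country.
ROWS = [("AU", 3), ("US", 5), ("AU", 4)]

# The same facts as loosely structured text, including a line that fits no schema.
RAW_LINES = [
    "2013-10-28 country=AU hits=3",
    "malformed line with no fields",
    "2013-10-28 hits=5 country=US",
    "2013-10-29 country=AU hits=4 agent=IE",
]


def relational_totals(rows):
    """Declarative style: the schema is fixed up front and SQL says what is wanted."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE views (country TEXT NOT NULL, hits INTEGER NOT NULL)")
    with conn:  # the inserts are committed together, or rolled back together
        conn.executemany("INSERT INTO views VALUES (?, ?)", rows)
    result = dict(conn.execute("SELECT country, SUM(hits) FROM views GROUP BY country"))
    conn.close()
    return result


def programmatic_totals(lines):
    """Programmatic style: structure is applied while reading; unusable lines are skipped."""
    totals = Counter()
    for line in lines:
        # Keep only key=value tokens; field order and extra fields do not matter.
        fields = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
        if "country" in fields and fields.get("hits", "").isdigit():
            totals[fields["country"]] += int(fields["hits"])
    return dict(totals)


if __name__ == "__main__":
    print(relational_totals(ROWS))
    print(programmatic_totals(RAW_LINES))
```

Both functions arrive at the same totals. The relational version gets consistency and transactional inserts from the store, while the programmatic version tolerates reordered, extra, or malformed input at the cost of writing the parsing logic by hand.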

As the requirements for both of these types of stores evolve, the key point to remember is that a modern data platform must support both types of data equally well, provide unified monitoring and management of data across both, and be able to easily move and transform data across all types of stores.

HDInsight and Hadoop

Microsoft’s Hadoop distribution is intended to bring the robustness, manageability, and simplicity of Windows to the Hadoop environment.

For the on-premises version, that means a focus on hardening security through integration with Active Directory, simplifying manageability through integration with System Center, and dramatically reducing time to set up and deploy via simplified packaging and configuration. These improvements will enable IT to apply consistent security policies across Hadoop clusters and manage them from a single pane of glass on System Center.

For the service on Windows Azure, Microsoft will further lower the barrier to deployment by enabling the seamless setup and configuration of Hadoop clusters through easy-to-use components of the Azure management portal.

Finally, Microsoft is not only shipping an open source-based distribution of Hadoop but is also committed to contributing its updates back to the Hadoop community. Microsoft is committed to delivering 100 percent compatibility with Apache Hadoop application programming interfaces (APIs) so that applications written for Apache Hadoop should work on Windows.

Working closely with Hortonworks, Microsoft has submitted a formal proposal to contribute the Hadoop-based distribution on Windows Azure and Windows Server as changes to the Apache code base.[4] The two companies are also collaborating on additional capabilities, such as Hive connectivity and an innovative JavaScript library, to be proposed as contributions to the Apache Software Foundation.

Hortonworks is focused on accelerating the development and adoption of Apache Hadoop. Together with the Apache community, they are making Hadoop more robust and easier to use for enterprises, and more open and extensible for solution providers.

As the preview has progressed, various features have come and gone. An original feature was the Console, a friendly web user interface that offered job submission, access to Hive, and a JavaScript console for querying the file system and submitting Pig jobs. This functionality has gone but is expected to migrate into the main Azure Portal at some point (though what this means for the on-premises version is unclear). In its place, however, a fully featured set of PowerShell cmdlets has appeared that allows remote submission of jobs and even creation of clusters.

One feature that has remained is the ability to access Hive directly from Excel through an Open Database Connectivity (ODBC) driver. This lets users consume the output of Hadoop processes through a familiar interface, and connects Hadoop with the data mashup capabilities of PowerPivot and the rich visualizations of Power View.

The platform continues to evolve and features are constantly arriving (and occasionally going). This book will do its best to capture the current state but, even as it was being written, content needed to be updated to deal with the ongoing changes.
