This section of the guide explores two aspects of Hadoop-based big data systems such as HDInsight: what they are (and why you should care), and how Microsoft is embracing open source technologies as part of its big data roadmap. It will help you to understand the core concepts of a big data solution, the technologies such solutions typically use, and the advantages they offer in terms of managing huge volumes of data and gaining insights into the information it contains.
Big data is not a stand-alone technology, or just a new type of data querying mechanism. It is a significant part of the Microsoft Business Intelligence (BI) and Analytics product range, and a vital component of the Microsoft data platform. Figure 1 shows an overview of the Microsoft data platform and enterprise BI product range, and the roles big data and HDInsight play within this.
The figure does not include all of Microsoft’s data-related products, and it doesn’t attempt to show physical data flows. For example, data can be ingested into HDInsight without going through an integration process, and a data store could be the data source for another process. Instead, the figure illustrates as layers the applications, services, tools, and frameworks that work together to allow you to capture data, store it, process it, and visualize the information it contains. Notice that the big data technologies span both the Integration and Data stores layers.
Microsoft implements Hadoop-based big data solutions using the Hortonworks Data Platform (HDP), which is built on open source components in conjunction with Hortonworks. The HDP is 100% compatible with Apache Hadoop, and is compatible with open source community distributions. All of the components are tested in typical scenarios to ensure that they work together correctly, and that there are no versioning or compatibility issues. Developments are fed back into the community through Hortonworks to maintain compatibility and to support the open source effort.
Microsoft and Hortonworks offer three distinct solutions based on HDP:
- HDInsight. This is a cloud-hosted service available to Azure subscribers that uses Azure clusters to run HDP, and integrates with Azure storage. For more information about HDInsight see What is Microsoft HDInsight? and the HDInsight page on the Azure website.
- Hortonworks Data Platform (HDP) for Windows. This is a complete package that you can install on Windows Server to build your own fully-configurable big data clusters based on Hadoop. It can be installed on physical on-premises hardware, or in virtual machines in the cloud. For more information see Microsoft Server and Cloud Platform on the Microsoft website and Hortonworks Data Platform.
- Microsoft Analytics Platform System. This is a combination of the massively parallel processing (MPP) engine in Microsoft Parallel Data Warehouse (PDW) with Hadoop-based big data technologies. It uses the HDP to provide an on-premises solution that contains a region for Hadoop-based processing, together with PolyBase—a connectivity mechanism that integrates the MPP engine with HDP, Cloudera, and remote Hadoop-based services such as HDInsight. It allows data in Hadoop to be queried and combined with on-premises relational data, and data to be moved into and out of Hadoop. For more information see Microsoft Analytics Platform System.
A single-node local development environment for Hadoop-based solutions is available from Hortonworks. This is useful for initial development, proof of concept, and testing. For more details, see Hortonworks Sandbox.
In Figure 1, data typically flows upward from data sources, through data stores such as SQL Server and HDInsight, to reporting and analysis tools such as Excel, Office 365, and SQL Server Reporting Services (SSRS). Note that the data does not necessarily need to flow through every layer shown in Figure 1. In some scenarios, operations such as extract-transform-load (ETL) data integration and data validation may be carried out within HDInsight so that use of a separate ETL service such as Data Quality Services is not required. In addition, if the data is not being incorporated into a BI system but just passed directly to reporting and analysis tools, it will not be exposed through a corporate data model.
As an example of how Microsoft big data tools, and specifically HDInsight, integrate with other tools and frameworks, consider the following typical use cases:
- Simple iterative querying and visualization. You may simply want to load some unstructured data into HDInsight, combine it with data from external sources such as Azure Marketplace, and then analyze and visualize the results using Microsoft Excel and Power View. In this case, data from the data source will flow into HDInsight where queries and transformations generate the required result. This result flows through an ODBC connector or directly from Azure blob storage into a visualization tool such as Excel, where it is combined with data loaded directly by Excel from Azure Marketplace.
- Handling streaming data and exposing it through SharePoint. In this case streaming data collected from device sensors is fed through Microsoft StreamInsight or Azure Intelligent Systems Service for categorization and filtering, and can be used to display real-time values on a dashboard or to trigger changes in a process. The data is then transferred into an Azure HDInsight cluster for use in historical analysis. The output from queries that are run as periodic batch jobs in HDInsight is integrated at the corporate data model level with a data warehouse, and ultimately delivered to users through SharePoint libraries and web parts—making it available for use in reports, and in data analysis and visualization tools such as Excel.
- Exposing data as a business data source for an existing data warehouse system. This might be to produce a specific set of management reports on a regular basis. Semi-structured or unstructured data is loaded into HDInsight, queried and transformed within HDInsight, validated and cleansed using Data Quality Services, and stored in your data warehouse tables ready for use in reports. You may also use Master Data Services to ensure consistency between data representations of business elements across your organization.
These are just three examples of the countless permutations and capabilities of the Microsoft data platform and HDInsight. Your own requirements will differ, but the combination of services and tools makes it possible to implement almost any kind of big data solution using the elements of the platform. You will see many examples of the way that these applications, tools, and services work together in this guide.
The background to big data
Hadoop-based big data solutions provide a mechanism for storing vast quantities of structured, semi-structured, and unstructured data. They also deal with the issue of variable data formats by allowing you to store the data in its native form, and then apply a schema to it later when you need to query it. This means that you don’t inadvertently lose any information by forcing the data into a format that may later prove to be too restrictive. The topics What is big data? and Why should I care about big data? provide more details.
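The idea of storing data in its native form and applying a schema only at query time can be illustrated with a short Python sketch. This is not HDInsight or Hadoop code; the sample records, the `SCHEMA` mapping, and the `apply_schema` and `query` helpers are hypothetical names chosen only for illustration.

```python
import json

# Raw records kept exactly as they arrived; fields vary from record to record.
RAW_LINES = [
    '{"device": "sensor-1", "temp_c": 21.5, "ts": "2014-03-01T10:00:00"}',
    '{"device": "sensor-2", "temp_c": "19.0"}',
    '{"device": "sensor-3", "humidity": 40}',
]

# Hypothetical schema chosen at query time: column name -> type converter.
SCHEMA = {"device": str, "temp_c": float}


def apply_schema(line, schema):
    """Project one raw JSON line onto the schema; missing fields become None."""
    record = json.loads(line)
    row = {}
    for column, convert in schema.items():
        value = record.get(column)
        row[column] = convert(value) if value is not None else None
    return row


def query(lines, schema):
    """Apply the schema to every stored line, leaving the stored data unchanged."""
    return [apply_schema(line, schema) for line in lines]


rows = query(RAW_LINES, SCHEMA)
```

Because the raw lines are never modified, a different schema (for example, one that includes `humidity`) could be applied to the same stored data later without any information having been lost.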
Big data solutions also provide a framework for efficiently executing distributed queries across these huge volumes of data, often multiple terabytes or petabytes in size. This means that you can simply store the data now (even if you don’t know how, when, or even whether it will be useful), safe in the knowledge that, should the need arise in the future, you can extract any useful information it contains. The topic How do big data solutions work? explores the mechanisms that Hadoop-based solutions can use to analyze data.
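As a rough, single-process illustration of how a distributed query can be split into independent pieces of work, the following Python sketch counts log levels across partitions of data and then merges the partial results. It only mimics the general map-and-reduce pattern; it does not use Hadoop, and the sample `PARTITIONS` data and the `map_partition` and `merge_counts` helpers are assumptions made for this example.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical data set, already split into partitions as it might be across nodes.
PARTITIONS = [
    ["error timeout", "info ok"],
    ["error disk", "info ok", "error timeout"],
]


def map_partition(lines):
    """Count log levels within one partition, independently of all others."""
    counts = defaultdict(int)
    for line in lines:
        level = line.split()[0]
        counts[level] += 1
    return dict(counts)


def merge_counts(left, right):
    """Combine two partial results into one without mutating either input."""
    merged = dict(left)
    for key, value in right.items():
        merged[key] = merged.get(key, 0) + value
    return merged


# Each partition could be processed on a different node; here they run in sequence.
partials = [map_partition(partition) for partition in PARTITIONS]
totals = reduce(merge_counts, partials, {})
```

The key design point is that `map_partition` needs no knowledge of other partitions, so the work can be spread across as many nodes as there are partitions, with only the small partial results being moved for the final merge.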
Big data solutions can help you to discover information that you didn’t know existed, complement your existing knowledge about your business and your customers, and boost competitiveness. By using the cloud as the data store and HDInsight as the query mechanism you benefit from very affordable storage costs (at the time of writing, 1TB of Azure storage costs less than $40 per month), and the flexibility and elasticity of the “pay-as-you-go” model where you only pay for the resources you use.
You may choose to use a big data solution simply as an experimental platform for investigating data, or you may want to build a more comprehensive solution that integrates with your existing data management and BI systems. While there is no formal set of steps for designing and implementing big data solutions, there are several points that you should consider before you start. Thinking about these will help you to achieve the results you require more quickly, and can save considerable time and effort. For details of the typical planning considerations for big data solutions, see Planning a big data solution.
For an overview and description of Microsoft big data see Microsoft Server and Cloud Platform.
For more information about HDInsight see the HDInsight page on the Azure website.
Documentation for HDInsight is available on the Tutorials and Guides page.
To sign up for Azure services go to the HDInsight Service page.
The page Get started using Azure HDInsight will help you begin working with HDInsight.
The official site for the Apache Hadoop framework and tools is the Apache Hadoop website.
You can download the free eBook “Introducing Microsoft Azure HDInsight” from the Microsoft Press Blog.
There are also many popular blogs that cover big data and HDInsight topics:
- Alexei Khalyako: http://alexeikh.wordpress.com/category/bigdata/
- Benjamin Guinebertière: http://blogs.msdn.com/benjguin
- Brian Mitchell: http://brianwmitchell.com/
- Brian Swan: http://blogs.msdn.com/brian_swan
- Carl Nolan: http://blogs.msdn.com/b/carlnol/
- Cindy Gross: http://blogs.msdn.com/b/cindygross/archive/tags/hadoop/
- Denny Lee: http://dennyglee.com/
- Lara Rubbelke: http://sqlblog.com/blogs/lara_rubbelke/default.aspx
- Matt Winkler: http://blogs.msdn.com/b/mwinkle/
- Murshed Zaman: http://murshedsqlcat.wordpress.com
- Teo Lachev: http://prologika.com/CS/blogs/blog/archive/tags/Hadoop/default.aspx
- Microsoft Support for HDInsight: http://blogs.msdn.com/b/bigdatasupport/
- Hortonworks: http://hortonworks.com/blog/