Hadoop for Windows Succinctly®
by Dave Vickers


CHAPTER 5

When Data Scales, Does BI Fail?


Preparing the Dataset

We need to see how BI tools for Hadoop fare when faced with a larger data load in Windows. For this reason, we'll write the code that creates the .csv file format and the external table for the 4-GB Land Registry database.

Code Listing 23: Code for creating csv file format and Land Registry table in SQL Server 2019

-- CREATE EXTERNAL FILE FORMAT FOR CSV FILE IN SQL SERVER 2019

USE Hadoop4windows
GO

CREATE EXTERNAL FILE FORMAT CSVformat WITH (
          FORMAT_TYPE = DELIMITEDTEXT,
          FORMAT_OPTIONS
          (
                    FIELD_TERMINATOR = ',',
                    DATE_FORMAT = 'MM/dd/yyyy',
                    STRING_DELIMITER = '"',
                    USE_TYPE_DEFAULT = TRUE));

-- CREATE EXTERNAL TABLE FOR UK LAND REGISTRY DATA

USE Hadoop4windows
GO

CREATE EXTERNAL TABLE [dbo].[landregistrydata] (
          [transactionuid] nvarchar(100) NOT NULL,
          [price] int NULL,
          [dateoftransfer] nvarchar(100) NOT NULL,
          [postcode] nvarchar(100) NOT NULL,
          [propertytype] nvarchar(100) NOT NULL,
          [newlybuilt] nvarchar(100) NOT NULL,
          [tenure] nvarchar(100) NOT NULL,
          [housenumorname] nvarchar(100) NOT NULL,
          [2ndhousenumorname] nvarchar(100) NOT NULL,
          [street] nvarchar(100) NOT NULL,
          [locality] nvarchar(100) NOT NULL,
          [towncity] nvarchar(100) NOT NULL,
          [district] nvarchar(100) NOT NULL,
          [county] nvarchar(100) NOT NULL,
          [transactiontype] nvarchar(100) NOT NULL,
          [recordstatus] nvarchar(100) NOT NULL)

    WITH (LOCATION = '/ukproperty/',
          DATA_SOURCE = hadoop_4_windows,
          FILE_FORMAT = CSVformat
          );
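Once the listing above has run, a quick sanity check can confirm that SQL Server can actually read the file from HDFS. The following queries are illustrative, not part of the original listing; they assume the table and database names from Code Listing 23.

```sql
-- Illustrative sanity checks against the external table from Code Listing 23
USE Hadoop4windows
GO

-- Preview a handful of rows to confirm the columns line up with the CSV
SELECT TOP 25 [transactionuid], [price], [postcode], [county]
FROM [dbo].[landregistrydata];

-- Count the rows; on a 4-GB file this pushes the work down to Hadoop
SELECT COUNT(*) AS totalrows
FROM [dbo].[landregistrydata];
```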

For this code to work, please make sure the 4-GB Land Registry file is in a folder called ukproperty. To give the BI tools we're going to use a fair chance, we'll also create the table in Hive.

Code Listing 24: Code to create Land Registry table in Hive

-- Creating external table from Land Registry file in ukproperty folder

CREATE EXTERNAL TABLE IF NOT EXISTS landregistrydata02 (
          transactionuid STRING,
          price STRING,
          dateoftransfer STRING,
          postcode STRING,
          propertytype STRING,
          newlybuilt STRING,
          tenure STRING,
          housenumorname STRING,
          `2ndhousenumorname` STRING,
          street STRING,
          locality STRING,
          towncity STRING,
          district STRING,
          county STRING,
          transactiontype STRING,
          recordstatus STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

LOAD DATA INPATH '/ukproperty/pp-complete1.csv' INTO TABLE landregistrydata02;

SELECT * FROM landregistrydata02 LIMIT 25;
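Before pointing a BI tool at the Hive table, it can be worth running a simple aggregate to check that the data behaves as expected. The query below is a sketch, not from the book; note that price was declared as a string in Code Listing 24, so it needs a cast before averaging.

```sql
-- Illustrative HiveQL aggregate: average sale price per county
-- (price is a string column in landregistrydata02, so cast it first)
SELECT county,
       ROUND(AVG(CAST(price AS DOUBLE)), 0) AS avgprice
FROM landregistrydata02
GROUP BY county
ORDER BY avgprice DESC
LIMIT 10;
```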

The tools we are going to use are Tableau, Azure Data Studio, Arcadia Data, and Syncfusion Dashboard Designer.

We already know the limitations of QlikView Direct Discovery, and with Azure Data Studio and SQL Server, Microsoft products are already represented.

Using Tableau with large data sets in Windows Hadoop

I used the Hortonworks Hive connection for Tableau after installing the Hive ODBC 2.1.16 driver.

Figure 169: Tableau connecting live to 4-GB Land Registry file in Hive

The Tableau live connection was immediate and fast, automatically loading 10,000 rows into a preview window with ease. I certainly don't need a data preview that big, but it was impressive to see. I was able to use any visualization I wished without adverse effects on performance. Overall, it's an easy pass for Tableau, though it would be good to see one or two new visualizations: as you work with larger amounts of data, standard visualizations don't always provide the best presentation choices. The Tableau visualization is shown in the following figure.

Figure 170: Tableau visualization using Land Registry data

Using Azure Data Studio with large data sets in Windows Hadoop

Azure Data Studio pulled back the data in almost effortless fashion. The highlight was how easily IntelliSense worked on a file that big. A query with a WHERE clause, as shown in Figure 171, works with IntelliSense: there is no delay in predicting the next word you type or in identifying names known to the system. As before, the connection is made directly to HDFS from SQL Server rather than via Hive. I can't fault the performance of this tool, and the application is compatible with the Syncfusion platform. As good as Azure Data Studio is, though, it's only as solid and reliable as the Hadoop system it's connected to.

Figure 171: Azure Data Studio making a fast connection to Land Registry data file in Hadoop
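The book doesn't reproduce the query text, but a filtered query of the kind shown in Figure 171 might look like the following. The filter values are made up for illustration; the table and column names come from Code Listing 23.

```sql
-- A sketch of a filtered query run in Azure Data Studio against the
-- external table from Code Listing 23; the WHERE values are illustrative
USE Hadoop4windows
GO

SELECT [postcode], [price], [dateoftransfer], [street], [towncity]
FROM [dbo].[landregistrydata]
WHERE [towncity] = 'LIVERPOOL'
  AND [price] > 250000;
```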

Perhaps because of the small size of the app, I was expecting some kind of decrease or degradation of performance. That didn't happen though, and it's clear that Microsoft has laid the foundations for much smoother big data experiences. The word I would probably choose to describe the performance of this tool overall is “effortless.” Windows Server never felt like it was under any strain at all.

Using Arcadia Data with large data sets in Windows Hadoop

As stated earlier, we are now using Arcadia Data 5.0. This version of the software has a feature called Direct Access, as shown in Figure 172, which enables you to query a Hadoop data source live. In this case, Hive is being queried with SQL from the Arcadia Direct Access window, and the data at the bottom of the screen is what the query returned. This is without question the fastest Hadoop query tool I've seen; it's as if you're working inside Hadoop itself, and you may even find it returns results faster than queries run within Hadoop. With the data returned, you can either create a dataset for building visualizations or download it as a .csv file.

Figure 172: Direct Access in Arcadia 5.0
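As a sketch, the kind of SQL you might type into the Direct Access window is ordinary HiveQL against the table from Code Listing 24; this example is illustrative rather than taken from the screenshot.

```sql
-- Illustrative Direct Access query: sales volume by town or city
SELECT towncity, COUNT(*) AS sales
FROM landregistrydata02
GROUP BY towncity
ORDER BY sales DESC
LIMIT 20;
```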

Arcadia has a number of features that enable you to sustain long sessions with large amounts of data. The following figure shows the "Clear result cache" facility, something you don't see often in BI tools.

Figure 173: Clear result cache feature in Arcadia

Another feature is sample mode, which lets you control, in percentage terms, how much data you work with. You can also limit the number of rows returned in the preview. This prevents you from "flooding" the application with data, and ensures the screen redraws faster.

Figure 174: Sample Mode in Arcadia Data
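Sample mode is a GUI feature, but the same idea can be approximated directly in HiveQL with TABLESAMPLE. This is a hedged sketch of the equivalent query, not something Arcadia generates; on a plain text-file table the percentage is an approximation based on input size.

```sql
-- Roughly equivalent HiveQL: work with about 10 percent of the input
-- and cap the preview at 100 rows
SELECT *
FROM landregistrydata02 TABLESAMPLE(10 PERCENT) s
LIMIT 100;
```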

In order to analyze data and create visuals, you create datasets from your tables. This just involves clicking New dataset in the row where your table is shown, as seen in the following figure.

Figure 175: Creating a new dataset from a table

You can now create a visual from your dataset by clicking the New Visual link.

Figure 176: Creating visuals from datasets

We can create visualizations by dragging the Dimensions and Measures fields highlighted in the red box in Figure 177 onto the Dimensions and Measures areas highlighted in the green rectangles.

Figure 177: Choosing dimensions and measures for our visualizations

We can filter the field data and store the choice to create segments.

Figure 178: Filtering data for the city of Liverpool

Arcadia Data has a wide variety of visualizations, as seen in the following two figures.

Figure 179: Dashboard design with Arcadia Data

Figure 180: A second dashboard created using Arcadia Data

Using Syncfusion Dashboard Designer with large data sets in Windows Hadoop

When you start the Syncfusion Dashboard Designer, you see what is effectively a blank canvas.

Figure 181: The blank canvas of Syncfusion Dashboard Designer

In Figure 181, the icon highlighted in red is the button for creating a data source. To connect to SQL Server 2019, you simply fill out the form shown in the following figure, then click Connect.

Figure 182: Connection to SQL Server 2019

If the tables you want to work with are in Hive, you can use a Hive connection or a faster Spark SQL connection. This will pick up the tables in the "default" database when you log in.

Figure 183: Spark SQL login

You instantly connect to a blank canvas, and on the left side, you see the tables within SQL Server. You follow the onscreen message to drag and drop tables to create a virtual table.

Figure 184: Invitation to drag and drop tables to create a virtual table

When you drag your table to the canvas, you can add columns as you wish. The application does not overload the table with data from the data source.

Figure 185: Selecting columns to add to the virtual table

You have the ability to select, deselect, and search records, as shown in the following figure.

Figure 186: Filtering records you do or don't wish to select

You can change column types, rename columns, and carry out aggregation functions on the fly.

Figure 187: Change of column types, names and aggregation functions

You are now ready to create a visualization. Click a graph icon on the Dashboard tab.

Figure 188: Bar and Column graph icons on the Dashboard tab

You create graphs by dragging Measures and Dimensions fields to the values, columns, and rows fields. If large amounts of data are being loaded, a warning appears on screen inviting you to filter the data; this warning is shown in the following figure.

Figure 189: Adding measures and dimensions to create visualizations

You can then filter the data based on your selections.

Figure 190: Filtering the records in your dataset

If you wish, you can use the Rank feature; this example ranks the top 10 average house prices.

Figure 191: Top 10 average house prices by rank
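Expressed as SQL rather than through the Rank feature, the same top-10 result could be computed as follows. The grouping column is an assumption for illustration; the figure doesn't state which field was ranked.

```sql
-- Hedged T-SQL equivalent of the Rank result in Figure 191:
-- top 10 average house prices, grouped by county for illustration
-- (cast to float so the average isn't truncated to an integer)
SELECT TOP 10 [county], AVG(CAST([price] AS float)) AS avghouseprice
FROM [dbo].[landregistrydata]
GROUP BY [county]
ORDER BY avghouseprice DESC;
```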

We can now create visualizations with our filtered dataset. Dashboard speed is maintained by preventing unnecessary data from overwhelming the dashboard.

Figure 192: Creating visualizations from our filtered data

We can put together several visualizations to create dashboards, including the pivot table visualization shown in Figure 193. I often avoid pivot table visualizations because dashboard creation tools can't manage the data, but as you have seen throughout, this tool has been designed from the ground up to handle large amounts of data. That said, Syncfusion is also the designer of the Big Data Platform, so the excellent performance of the tool here isn't surprising.

Figure 193: Syncfusion Dashboard Designer dashboard including pivot table visualization
