CHAPTER 5
We need to see how BI tools for Hadoop fare when faced with a larger data load in Windows. For this reason, we'll create the code for the 4-GB Land Registry database and its .csv file format.
Code Listing 23: Code for creating csv file format and Land Registry table in SQL Server 2019
-- CREATE EXTERNAL FILE FORMAT FOR CSV FILE IN SQLSERVER2019 USE Hadoop4windows GO CREATE EXTERNAL FILE FORMAT CSVformat WITH ( FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS
( FIELD_TERMINATOR =',', DATE_FORMAT = 'MM/dd/yyyy', STRING_DELIMITER = '"', USE_TYPE_DEFAULT = TRUE)); -- CREATE EXTERNAL TABLE FOR UK LAND REGISTRY DATA USE Hadoop4windows GO CREATE EXTERNAL TABLE [dbo].[landregistrydata] ( [transactionuid] nvarchar(100) NOT NULL, [price] int NULL, [dateoftransfer] nvarchar(100) NOT NULL, [postcode] nvarchar(100) NOT NULL, [properttype] nvarchar(100) NOT NULL, [newlybuilt] nvarchar(100) NOT NULL, [tenure] nvarchar(100) NOT NULL, [housenumorname] nvarchar(100) NOT NULL, [2ndhousenumorname] nvarchar(100) NOT NULL, [street] nvarchar(100) NOT NULL, [locality] nvarchar(100) NOT NULL, [towncity] nvarchar(100) NOT NULL, [district] nvarchar(100) NOT NULL, [county] nvarchar(100) NOT NULL, [transactiontype] nvarchar(100) NOT NULL, [recordstatus] nvarchar(100) NOT NULL)
WITH (LOCATION='/ukproperty/', DATA_SOURCE = hadoop_4_windows, FILE_FORMAT = CSVformat ); |
For this code to work, please make sure the 4-GB Land Registry file is in a folder called ukproperty. To give the BI tools we're going to use a fair chance, we'll also create the table in Hive.
Code Listing 24: Code to create Land Registry table in Hive
-- Creating external table from Land Registry file in ukproperty folder CREATE EXTERNAL TABLE IF NOT EXISTS landregistrydata02(transactionuid string,price string,dateoftransfer string,postcode string,properttype string,newlybuilt string,tenure string, housenumorname string,2ndhousenumorname string,street string,locality string,towncity STRING,district STRING,county STRING,transactiontype STRING,recordstatus STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED by '\n' stored AS textfile; Load data INPATH '/ukproperty/pp-complete1.csv' into table landregistrydata02; select * from landregistrydata02 LIMIT 25; |
The tools we are going to use are: Tableau, Azure Data Studio, Arcadia Data, and Syncfusion Dashboard Designer.
We already know the limitations of QlikView Direct Discovery, and with Azure Data Studio and SQL Server, Microsoft products are already represented.
I used the Hortonworks Hive Connection for Tableau after installing Hive OBDC 2.1.16 driver.

Figure 169: Tableau connecting live to 4-GB Land Registry file in Hive
The Tableau live connection was immediate and fast, automatically previewing 10,000 rows in a preview window with ease. I certainly don't need a data preview for anything that big, but it was impressive to see. I was able to use any visualization I wished to without adverse effects on performance. Overall, it's an easy pass for Tableau, though it would be good to see one or two new visualizations. As you work with larger amounts of data, standard visualizations don't always provide the best presentation choices. The Tableau visualization is shown in the following figure.

Figure 170: Tableau visualization using Land Registry data
Azure Data Studio pulled back the data in almost effortless fashion. The highlight was the easy use of IntelliSense on a file that big. A query with a where clause, as shown in Figure 171, works with IntelliSense. There is no delay in predicting the next word you type, or identifying words in the system. As before, there is no connection via Hive, but directly to HDFS from SQL Server. I can't find fault with the performance of this tool, and the application is compatible with the Syncfusion platform. As good as Azure Data Studio is, it's only as solid and reliable as the Hadoop system it's connected to.

Figure 171: Azure Data Studio making a fast connection to Land Registry data file in Hadoop
Perhaps because of the small size of the app, I was expecting some kind of decrease or degradation of performance. That didn't happen though, and it's clear that Microsoft has laid the foundations for much smoother big data experiences. The word I would probably choose to describe the performance of this tool overall is “effortless.” Windows Server never felt like it was under any strain at all.
As stated earlier, we are now using Arcadia Data 5.0. This version of the software has a feature called Direct Access, as shown in Figure 172. Direct Access enables you to query a Hadoop data source live. In this case, Hive is being queried with SQL from the Arcadia Direct Access window. The data at the bottom of the screen is the data returned by the query. This is without question the fastest Hadoop query tool I've seen—it's as if you're in Hadoop itself. You may find that it returns queries faster than queries written inside Hadoop. With the data returned you can either create a dataset to create visualizations, or download it as a .csv file.

Figure 172: Direct Access in Arcadia 5.0
Arcadia has a number of features that enable you to sustain long sessions with large amounts of data. The following figure shows the "Clear result cache" facility, something you don't see often in BI tools.

Figure 173: Clear result cache feature in Arcadia
Another feature is the sample mode, which lets you control how much data you work with in percentage terms. You can also limit the number of rows retuned in the preview. This prevents you from "flooding" the application with data, and ensures the screen redraws faster.

Figure 174: Sample Mode in Arcadia Data
In order to analyze data and create visuals, you create datasets from your tables. This just involves clicking New dataset in the same row that your table is shown in. This is portrayed in the following figure.

Figure 175: Creating a new dataset from a table
You can now create a dataset from your visual by clicking the New Visual link.

Figure 176: Creating visuals from datasets
We can create visualizations by dragging the Dimension and Measures fields highlighted in the red box shown in Figure 177 onto the Dimensions and Measures fields highlighted in the green rectangles.

Figure 177: Choosing dimensions and measures for our visualizations
We can filter the field data and store the choice to create segments.

Figure 178: Filtering data for the city of Liverpool
Arcadia Data has a wide variety of visualizations, as seen in the following two figures.
.
Figure 179: Dashboard design with Arcadia Data

Figure 180: A second dashboard created using Arcadia Data
When you start the Syncfusion Dashboard Designer, you see what is effectively a blank canvas.

Figure 181: The blank canvas of Syncfusion Dashboard Designer
In Figure 181, the icon highlighted in red highlights the button for creating a data source. To connect to SQL Server 2019, you simply fill out the form shown in the following figure, then click Connect.

Figure 182: Connection to SQL Server 2019
If the tables you want to work with are in Hive, you can use a Hive connection or a faster Spark SQL connection. This will pick up the tables in the "default" database when you log in.

Figure 183:Spark SQL login
You instantly connect to a blank canvas, and on the left side, you see the tables within SQL Server. You follow the onscreen message to drag and drop tables to create a virtual table.

Figure 184: Invitation to drag and drop tables to create a virtual table
When you drag your table to the canvas, you can add columns as you wish. The application does not overload the table with data from the data source.

Figure 185: Selecting columns to add to the virtual table
You have the ability to select, deselect, and search records, as shown in the following figure.

Figure 186: Filtering records you do or don't wish to select
You can change column types, rename columns, and carry out aggregation functions on the fly.

Figure 187: Change of column types, names and aggregation functions
You are now ready to create a visualization. Click a graph icon on the Dashboard tab.
![]()
Figure 188: Bar and Column graph icons on the Dashboard tab
You create graphs by dragging Measures and Dimension fields to values, columns, and row fields. If large amounts of data are being loaded, a warning appears on screen, inviting you to filter the data. This warning is shown in the following figure.

Figure 189: Adding measures and dimensions to create visualizations
You can then filter the data based on your selections.

Figure 190: Filtering the records in your dataset
If you wish, you can use the Rank feature; this example ranks the top 10 average house prices.

Figure 191: Top 10 average house prices by rank
We can now create visualizations with our filtered dataset. Dashboard speed is maintained by preventing unnecessary data from overwhelming the dashboard.

Figure 192: Creating visualizations from our filtered data
We can put together several visualizations to create dashboards; this includes pivot table visualizations shown in Figure 193. I often don't use pivot table visualizations, as the dashboard creation tools can't manage the data. You have seen throughout the use of this tool that it has been designed from the ground up to manage large amounts of data. That said, Syncfusion is the designer of the Big Data Platform, so the excellent performance of the tool here isn't surprising.

Figure 193:Syncfusion Dashboard Designer Dashboard including pivot table visualization