SQL Queries Succinctly^®
by Nick Harrison

CHAPTER 4

Slicing and Dicing Data

Aggregate functions

After we have applied the appropriate joins to get related data and filters to filter out the data that we are not interested in, we may want to run some calculations to summarize the relevant data. Aggregate functions perform their calculations against a set of values, returning a single value.

The most common aggregate functions you are likely to use include:

Table 4-1: List of aggregate functions

AVG	Returns the average value for the column specified.
COUNT	Returns the count of the number of records in the identified set.
MAX	Returns the largest value for the column specified.
MIN	Returns the smallest value for the column specified.
SUM	Returns the sum of the values for the column specified.
StdDev	Returns the statistical standard deviation for the column specified.
VAR	Returns the statistical variance for the column specified.

Unless you are a statistician, you will probably mainly use the top five from the list, but these functions can combine to provide some interesting insights into your data. Here we will focus on the nonstatistical functions.

Note: Null values are ignored when evaluating any aggregate function.

Group By

The GROUP BY statement is used with the aggregate functions to group the result set by one or more columns. Any column from the SELECT clause that is not used in an aggregate function must be included in the GROUP BY clause. You can include any additional columns that you might need, but it rarely makes sense to group by a column that will not be displayed: the results would not make sense.

A simple example of a SQL statement using aggregate functions and a GROUP BY statement is:

Code Listing 4-1: General syntax for an aggregate function

SELECT DepartmentId ,

MIN(HireDate) ,

MAX(HireDate)

FROM dbo.Employee

GROUP BY DepartmentId;

This query will give a list of DepartmentIds with the first and last HireDate for each department.

Sum and Count

The SUM aggregate function only makes sense for numeric data types. It doesn’t make sense to add up dates or strings, but any numeric data type can be summed.

COUNT is applicable regardless of the data type, except for text, image, or ntext, because columns of these types are not stored with the rest of the record. In general, we don’t care about which column is specified for the COUNT function; we just count the number of records.

Both of these functions can also take a distinct modifier, which means that instead of operating on every record in the result set, they will operate on the unique values in the result set. If you do not specify a modifier, then ALL will be assumed.

Code Listing 4-2: Query showing the difference between COUNT and COUNT DISTINCT

SELECT COUNT(1) NumberOfEmployees ,

COUNT(DISTINCT DepartmentId) NumberOfDepartments

FROM dbo.Employee;

Even though SUM requires numeric input, we can get useful information against nonnumeric data. We can use SQL Server’s CASE statement to give us some numeric values to add up.

Code Listing 4-3: Using the CASE statement to provide a numeric value to add up

SELECT SUM(CASE WHEN HireDate < '1/1/1970' THEN 1

ELSE 0

END) HiredBeforeThe70s ,

SUM(CASE WHEN HireDate BETWEEN '1/1/1970' AND '12/31/1979' THEN 1

ELSE 0

END) HiredDuringThe70s ,

SUM(CASE WHEN HireDate BETWEEN '1/1/1980' AND '12/31/1989' THEN 1

ELSE 0

END) HiredBeforeThe80s ,

SUM(CASE WHEN HireDate BETWEEN '1/1/1990' AND '12/31/1999' THEN 1

ELSE 0

END) HiredBeforeThe80s

FROM dbo.Employee;

We use the multiple case statements to compare the HireDate and return either a 0 or a 1. When we add 0 it will have no effect, but when we add a 1 we will increase the count for that date range.

This will return a single record with four counts:

Hired Before the 70s	Hired in the 70s	Hired in the 80s	Hired in the 90sT
312	185	169	185

Result Set 4-1: Breaking hiring data into bands based on hire date

This is a handy way to break the data up into intervals.

To get a better feel for how this works, let’s drop the aggregate functions and look at some sample rows.

Code Listing 4-4: Using the CASE statement without SUM

SELECT (CASE WHEN HireDate < '1/1/1970' THEN 1

ELSE 0

END) HiredBeforeThe70s ,

(CASE WHEN HireDate BETWEEN '1/1/1970' AND '12/31/1979' THEN 1

ELSE 0

END) HiredDuringThe70s ,

(CASE WHEN HireDate BETWEEN '1/1/1980' AND '12/31/1989' THEN 1

ELSE 0

END) HiredBeforeThe80s ,

(CASE WHEN HireDate BETWEEN '1/1/1990' AND '12/31/1999' THEN 1

ELSE 0

END) HiredBeforeThe80s

FROM dbo.Employee;

Hired Before the 70s	Hired in the 70s	Hired in the 90sT
0	0	1
1	0	0
0	0	1
0	1	0
1	0	0
0	0	1
0	1	0
1	0	0

Result Set 4-2: Only one column in each record will have a value

Each row returned will have at most one column with a value of 1. The other columns will be 0. If all of the columns are 0, then that particular employee was hired after the 90s. This provides a powerful way to summarize data quickly.

We can also combine a Group By to get this summarized by department:

Code Listing 4-5: Summarizing the metrics by department

SELECT DepartmentName ,

SUM(CASE WHEN HireDate < '1/1/1970' THEN 1

ELSE 0

END) HiredBeforeThe70s ,

SUM(CASE WHEN HireDate BETWEEN '1/1/1970' AND '12/31/1979' THEN 1

ELSE 0

END) HiredDuringThe70s ,

SUM(CASE WHEN HireDate BETWEEN '1/1/1980' AND '12/31/1989' THEN 1

ELSE 0

END) HiredBeforeThe80s ,

SUM(CASE WHEN HireDate BETWEEN '1/1/1990' AND '12/31/1999' THEN 1

ELSE 0

END) HiredBeforeThe80s

FROM dbo.Employee

INNER JOIN dbo.Department ON Department.DepartmentId = Employee.DepartmentId

GROUP BY DepartmentName;

Department Name	Hired Before the 70s	Hired in the 70s	Hired in the 80s	Hired in the 90sT
Accounting	50	36	29	31
BusinessSales	7	3	4	3
ConsumerSales	5	2	1	2
Corporate Care	44	25	24	25
CorporateSales	6	4	2	2
Customer	23	7	13	18
InternationalSales	6	8	2	3
Marketing	29	19	20	16
NationalSales	5	2	5	4
Prepaid Customer	46	17	21	26
Sales	6	6	3	4
Service	15	17	10	8
Technical	27	18	14	13
TechnicalSales	13	6	6	7
Web	27	11	13	22

Result Set 4-3: Summarizing the hiring statistics by department

Min and Max

The MIN and MAX aggregate functions work on a wide range of data types. You can find the MIN and MAX for varchars, DateTime, and any numeric types. MIN and MAX do not make sense against a bit field or image. These functions can be used to find the oldest and youngest employees in each department. When applied to Varchars or NVarchar, these functions will do an alphabetic comparison, which can also provide useful insight into your data.

We can also combine MIN and MAX with the other aggregate functions to get more insight into your data, provided that the other aggregate functions are in a nested query or a common table expression.

Note: You cannot perform an aggregate function on an expression containing an aggregate or a subquery.

We can use this technique to pivot data in the result set. This allows us to convert the rows in the result set to columns in a new result set. Let’s see how we might construct a summary for a monthly sales report.

Going back to the Order tables that we have been working with, let’s step through what a sales report might look like.

Code Listing 4-6: Building a sales report

SELECT MAX(CASE OrderSummary.OrderDate

WHEN 1 THEN ( OrderSummary.OrderTotal )

ELSE 0

END) AS JanuaryOrders ,

MAX(CASE OrderSummary.OrderDate

WHEN 2 THEN ( OrderSummary.OrderTotal )

ELSE 0

END) AS FeburaryOrders ,

MAX(CASE OrderSummary.OrderDate

WHEN 3 THEN ( OrderSummary.OrderTotal )

ELSE 0

END) AS MarchOrders

FROM ( SELECT DATEPART(MONTH, OrderDate) AS OrderDate ,

SUM(OrderTotal) AS OrderTotal

FROM dbo.[Order]

GROUP BY DATEPART(MONTH, OrderDate) ) OrderSummary;

Note: Don’t worry about the DatePart function yet. We will cover all of the date and time functions in Chapter 7.

The results from this query may look like this, depending on how the company did that quarter:

January Orders	February Orders	March Orders
18206	5212	14586

Result Set 4-4: Sales report rollup and pivoted

To see the impact of the MAX aggregate functions, let’s review the results of the query without them. Without the MAX aggregate functions, we get a result set with 12 records.

January Orders	February Orders	March Orders
18206	0	0
0	5212	0
0	0	14586
0	0	0
0	0	0
0	0	0
0	0	0
0	0	0
0	0	0
0	0	0
0	0	0
0	0	0

Result Set 4-5: Sales report before the rollup

Each record will have a valid value in one column at most. The MAX aggregates will collapse all of these records into a single record, showing the valid record for each column.

Tip: We could add a Group By clause to get a summary breakdown for any level in the Sales hierarchy (such as Group, Region, Territory, etc.) that we are interested in.

Having

The HAVING clause comes after any GROUP BY statements your query may include. The HAVING clause allows you to provide filters based on the results on the aggregate function calls.

A simple example of a SQL statement including a HAVING clause is:

Code Listing 4-7: General syntax for a HAVING clause

SELECT CustomerId ,

COUNT(1)

FROM dbo.[Order]

GROUP BY CustomerId

HAVING COUNT(1) > 20;

With this query we can easily identify frequent customers.

The HAVING clause is not often used and is often overlooked, partly because we generally need to report on the results of the aggregate functions, and not filter based on their output. We can also get the same results without the HAVING clause.

Code Listing 4-8: Simulating the effects of a HAVING clause

SELECT OrderCounts.CustomerId ,

OrderCounts.NumberOfOrders

FROM ( SELECT CustomerId ,

COUNT(1) NumberOfOrders

FROM dbo.[Order]

GROUP BY CustomerId ) OrderCounts

WHERE OrderCounts.NumberOfOrders > 20;

Tip: Even though it’s easily overlooked, keep the HAVING clause in mind. When needed, this clause will simplify the query syntax.

Sorting

Once we have identified the records we are interested in, we often need to sort the data to put it into context and help ensure that we do not overlook relevant data by keeping the related records near each other in the result set. If you are looking at a detail sales records for a division, you want all of the detail records to be together.

There are just a few rules for when and how you can sort. Columns of type ntext, text, image, geography, geometry, and xml cannot be used in an ORDER BY clause. This is primarily because there is no clear sequence strategy for these data types.

The ORDER BY clause cannot be included in views, inline functions, derived tables, subqueries, etc., because these structures are used in larger queries that would generally handle sorting that would contradict the sorting defined elsewhere.

Note: The exception to this restriction is if a TOP, OFFSET, or FETCH clause is specified. We will discuss OFFSET and FETCH shortly. The TOP clause is straightforward—you simply specify the number of records to return. With any of these clauses included, the sort order will influence which records will be returned.

Changing directions

If you do not specify the sort direction, the direction will default to ascending.

Code Listing 4-9: Sorting by the number of orders ascending

SELECT CustomerId ,

COUNT(1) NumberOfOrders

FROM dbo.[Order]

GROUP BY CustomerId

HAVING COUNT(1) > 20

ORDER BY NumberOfOrders;

This query will show the customer with the highest number of orders last, while the query in Code Listing 4-10 will show the customer with the most number of orders first:

Code Listing 4-10: Sorting by the number of orders descending

SELECT CustomerId ,

COUNT(1) NumberOfOrders

FROM dbo.[Order]

GROUP BY CustomerId

HAVING COUNT(1) > 20

ORDER BY NumberOfOrders DESC;

Based on your needs, you will generally know the direction that you want to sort, but sometimes it may not be so clear cut. We can make the details of the ORDER BY more dynamic. We can specify the conditional case statement evaluated for each record returned to determine how that record should be sorted.

Let’s look at a slightly contrived scenario. Revisiting the Department table, we may want to sort by the DepartmentName unless the LocationId is less than 5, in which case we want to sort by the CostCenter.

Code Listing 4-11: Deciding how to sort record by record

SELECT DepartmentId ,

DepartmentName ,

CostCenter ,

LocationId

FROM dbo.Department

ORDER BY CASE WHEN LocationId < 5 THEN DepartmentName

ELSE CostCenter

END;

DepartmentId	Department Number	Cost Center	LocationId
8	ConsumerSales	1R8JASDAFAD	5
3	Prepaid Customer	3SXY6JLYJM1	5
7	Database	80O3SAS8SP6	2
12	Prepaid Customer	DNM1OUVCS	5
14	Express Marketing	1SNXK5DFDSF	1
5	Inside Accounting	VVCELPQURP	3
2	InternationalSales	IDAP8Q32CB	4
6	Outside Accounting	NI04JEDVP8RE	1
10	Phone Marketing	I8J5XNIYOE9KF	2
15	Service	K4DFDFSGGH	1
4	Service Hardware	FWR9NEF5X	1
13	Service Software	WGFB3HWMD	4
9	TechnicalSales	SU476Y3GFGF	2
11	Web	DZYFWH4JXIJJ	3
1	Web Marketing	E5Z3BSG80R1	2

Result Set 4-6: Sorting differently based on the location

Note: If you want to specify the direction for a conditional sort, you specify it after the end in the case statement. A conditional order by cannot change the direction of the order by, only which column to use.

Multiple sorts

You can sort by multiple columns; each column can have a different direction, and each column could potentially be a conditional ORDER BY. The individual columns specified must be unique. It does not make sense to order by the same column twice. The sequence of the columns specified matter. The result set retrieved is sorted by the first column specified, and then this result is sorted by the second column specified, etc.

Code Listing 4-12: Sorting by multiple columns

SELECT DepartmentId ,

DepartmentName ,

CostCenter ,

LocationId

FROM dbo.Department

ORDER BY LocationId ,

DepartmentName ,

CostCenter DESC;

In this query, we sort by the LocationId, then the DepartmentName, and then the CostCenter. The impact of the last column specified in the order by can be seen in the last two records in the following result set.

DepartmentId	Department Name	Cost Center	LocationId
14	Express Marketing	1SNXK5DFDSF	1
6	Outside Accounting	NI04JEDVP8RE	1
15	Service	K4DFDFSGGH	1
4	Service Hardware	FWR9NEF5X	1
7	Database	80O3SAS8SP6	2
10	Phone Marketing	I8J5XNIYOE9KF	2
9	TechnicalSales	SU476Y3GFGF	2
1	Web Marketing	E5Z3BSG80R1	2
5	Inside Accounting	VVCELPQURP	3
11	Web	DZYFWH4JXIJJ	3
2	InternationalSales	IDAP8Q32CB	4
13	Service Software	WGFB3HWMD	4
8	ConsumerSales	1R8JASDAFAD	5
12	Prepaid Customer	DNM1OUVCS	5
3	Prepaid Customer	3SXY6JLYJM1	5

Result Set 4-7: Sorting by multiple columns

Offset and Fetch

OFFSET and FETCH clauses can be used to implement a paging strategy. This provides a filter on the data based on the sorting sequence. An example SQL SELECT including OFFSET and FETCH in your ORDER BY follows:

Code Listing 4-13: A simple example using OFFSET and FETCH

SELECT EmployeeId ,

FirstName ,

LastName ,

ReportsTo ,

DepartmentId

FROM dbo.Employee

ORDER BY DepartmentId ,

LastName ,

FirstName

OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY;

Note: OFFSET and FETCH were not added until SQL Server 2012.

The values for the OFFSET and FETCH can be numeric constants like you see in Code Listing 4-13, or could be locally declared variables, or even subqueries. You can query the number of records to be returned from a configuration table for configuration settings.

Code Listing 4-14: Pulling the parameters for FETCH from a configuration table

SELECT EmployeeId ,

FirstName ,

LastName ,

ReportsTo ,

DepartmentId

FROM dbo.Employee

ORDER BY DepartmentId ,

LastName ,

FirstName

OFFSET 20 ROWS FETCH NEXT ( SELECT PageSize

FROM dbo.AppSettings

WHERE AppSettingID = 1 ) ROWS ONLY;

Summary

In this chapter we have seen various ways to slice and summarize data. The aggregate functions open up a lot of possibilities to get fresh insights into your data. We have also seen various ways that sorting can be used to draw the focus to relevant data.

Now we will turn our attention to a new type of table join from what we saw in Chapter 2. Next we will explore the implications of tables selecting from themselves.

Uncover hidden insights in your SQL data

Direct integration with SQL databases for real-time data access
Interactive dashboards for deeper data exploration
Advanced analytics to uncover hidden patterns and trends
Enhanced decision-making with data-driven insights

Get Your Free Trial Now

Slicing and Dicing Data

Aggregate functions

Group By

Sum and Count

Min and Max

Having

Sorting

Changing directions

Multiple sorts

Offset and Fetch

Summary

DISCLAIMER: Web reader is currently in beta. Please report any issues through our support system. PDF and Kindle format files are also available for download.