left-icon

SQL Queries Succinctly®
by Nick Harrison

Previous
Chapter

of
A
A
A

CHAPTER 4

Slicing and Dicing Data

Slicing and Dicing Data


Aggregate functions

After we have applied the appropriate joins to get related data and filters to filter out the data that we are not interested in, we may want to run some calculations to summarize the relevant data. Aggregate functions perform their calculations against a set of values, returning a single value.

The most common aggregate functions you are likely to use include:

Table 4-1: List of aggregate functions

AVG

Returns the average value for the column specified.

COUNT

Returns the count of the number of records in the identified set.

MAX

Returns the largest value for the column specified.

MIN

Returns the smallest value for the column specified.

SUM

Returns the sum of the values for the column specified.

StdDev

Returns the statistical standard deviation for the column specified.

VAR

Returns the statistical variance for the column specified.

Unless you are a statistician, you will probably mainly use the top five from the list, but these functions can combine to provide some interesting insights into your data. Here we will focus on the nonstatistical functions.

Note: Null values are ignored when evaluating any aggregate function.

Group By

The GROUP BY statement is used with the aggregate functions to group the result set by one or more columns. Any column from the SELECT clause that is not used in an aggregate function must be included in the GROUP BY clause. You can include any additional columns that you might need, but it rarely makes sense to group by a column that will not be displayed: the results would not make sense.

A simple example of a SQL statement using aggregate functions and a GROUP BY statement is:

Code Listing 4-1: General syntax for an aggregate function

SELECT  DepartmentId ,

        MIN(HireDate) ,

        MAX(HireDate)

FROM    dbo.Employee

GROUP BY DepartmentId;

This query will give a list of DepartmentIds with the first and last HireDate for each department.

Sum and Count

The SUM aggregate function only makes sense for numeric data types. It doesn’t make sense to add up dates or strings, but any numeric data type can be summed.

COUNT is applicable regardless of the data type, except for text, image, or ntext, because columns of these types are not stored with the rest of the record. In general, we don’t care about which column is specified for the COUNT function; we just count the number of records.

Both of these functions can also take a distinct modifier, which means that instead of operating on every record in the result set, they will operate on the unique values in the result set. If you do not specify a modifier, then ALL will be assumed.

Code Listing 4-2: Query showing the difference between COUNT and COUNT DISTINCT

SELECT  COUNT(1) NumberOfEmployees ,

        COUNT(DISTINCT DepartmentId) NumberOfDepartments

FROM    dbo.Employee;

Even though SUM requires numeric input, we can get useful information against nonnumeric data. We can use SQL Server’s CASE statement to give us some numeric values to add up.

Code Listing 4-3: Using the CASE statement to provide a numeric value to add up

SELECT  SUM(CASE WHEN HireDate < '1/1/1970' THEN 1

                 ELSE 0

            END) HiredBeforeThe70s ,

        SUM(CASE WHEN HireDate BETWEEN '1/1/1970' AND '12/31/1979' THEN 1

                 ELSE 0

            END) HiredDuringThe70s ,

        SUM(CASE WHEN HireDate BETWEEN '1/1/1980' AND '12/31/1989' THEN 1

                 ELSE 0

            END) HiredBeforeThe80s ,

        SUM(CASE WHEN HireDate BETWEEN '1/1/1990' AND '12/31/1999' THEN 1

                 ELSE 0

            END) HiredBeforeThe80s

FROM    dbo.Employee;

We use the multiple case statements to compare the HireDate and return either a 0 or a 1. When we add 0 it will have no effect, but when we add a 1 we will increase the count for that date range.

This will return a single record with four counts:

Hired Before the 70s

Hired in the 70s

Hired in the 80s

Hired in the 90sT

312

185

169

185

Result Set 4-1: Breaking hiring data into bands based on hire date

This is a handy way to break the data up into intervals.

To get a better feel for how this works, let’s drop the aggregate functions and look at some sample rows.

Code Listing 4-4: Using the CASE statement without SUM

SELECT  (CASE WHEN HireDate < '1/1/1970' THEN 1

              ELSE 0

         END) HiredBeforeThe70s ,

        (CASE WHEN HireDate BETWEEN '1/1/1970' AND '12/31/1979' THEN 1

              ELSE 0

         END) HiredDuringThe70s ,

        (CASE WHEN HireDate BETWEEN '1/1/1980' AND '12/31/1989' THEN 1

              ELSE 0

         END) HiredBeforeThe80s ,

        (CASE WHEN HireDate BETWEEN '1/1/1990' AND '12/31/1999' THEN 1

              ELSE 0

         END) HiredBeforeThe80s

FROM    dbo.Employee;

Hired Before the 70s

Hired in the 70s

Hired in the 80s

Hired in the 90sT

0

0

0

1

1

0

0

0

0

0

0

1

0

1

0

0

1

0

0

0

0

0

0

1

0

1

0

0

1

0

0

0

Result Set 4-2: Only one column in each record will have a value

Each row returned will have at most one column with a value of 1. The other columns will be 0. If all of the columns are 0, then that particular employee was hired after the 90s. This provides a powerful way to summarize data quickly.

We can also combine a Group By to get this summarized by department:

Code Listing 4-5: Summarizing the metrics by department

SELECT  DepartmentName ,

        SUM(CASE WHEN HireDate < '1/1/1970' THEN 1

                 ELSE 0

            END) HiredBeforeThe70s ,

        SUM(CASE WHEN HireDate BETWEEN '1/1/1970' AND '12/31/1979' THEN 1

                 ELSE 0

            END) HiredDuringThe70s ,

        SUM(CASE WHEN HireDate BETWEEN '1/1/1980' AND '12/31/1989' THEN 1

                 ELSE 0

            END) HiredBeforeThe80s ,

        SUM(CASE WHEN HireDate BETWEEN '1/1/1990' AND '12/31/1999' THEN 1

                 ELSE 0

            END) HiredBeforeThe80s

FROM    dbo.Employee

        INNER JOIN dbo.Department ON Department.DepartmentId = Employee.DepartmentId

GROUP BY DepartmentName;

Department

Name

Hired Before the 70s

Hired in the 70s

Hired in the 80s

Hired in the 90sT

Accounting

50

36

29

31

BusinessSales

7

3

4

3

ConsumerSales

5

2

1

2

Corporate Care

44

25

24

25

CorporateSales

6

4

2

2

Customer

23

7

13

18

InternationalSales

6

8

2

3

Marketing

29

19

20

16

NationalSales

5

2

5

4

Prepaid Customer

46

17

21

26

Sales

6

6

3

4

Service

15

17

10

8

Technical

27

18

14

13

TechnicalSales

13

6

6

7

Web

27

11

13

22

Result Set 4-3: Summarizing the hiring statistics by department

Min and Max

The MIN and MAX aggregate functions work on a wide range of data types. You can find the MIN and MAX for varchars, DateTime, and any numeric types. MIN and MAX do not make sense against a bit field or image. These functions can be used to find the oldest and youngest employees in each department. When applied to Varchars or NVarchar, these functions will do an alphabetic comparison, which can also provide useful insight into your data.

We can also combine MIN and MAX with the other aggregate functions to get more insight into your data, provided that the other aggregate functions are in a nested query or a common table expression.

Note: You cannot perform an aggregate function on an expression containing an aggregate or a subquery.

We can use this technique to pivot data in the result set. This allows us to convert the rows in the result set to columns in a new result set. Let’s see how we might construct a summary for a monthly sales report.

Going back to the Order tables that we have been working with, let’s step through what a sales report might look like.

Code Listing 4-6: Building a sales report

SELECT  MAX(CASE OrderSummary.OrderDate

              WHEN 1 THEN ( OrderSummary.OrderTotal )

              ELSE 0

            END) AS JanuaryOrders ,

        MAX(CASE OrderSummary.OrderDate

              WHEN 2 THEN ( OrderSummary.OrderTotal )

              ELSE 0

            END) AS FeburaryOrders ,

        MAX(CASE OrderSummary.OrderDate

              WHEN 3 THEN ( OrderSummary.OrderTotal )

              ELSE 0

            END) AS MarchOrders

FROM    ( SELECT    DATEPART(MONTH, OrderDate) AS OrderDate ,

                    SUM(OrderTotal) AS OrderTotal

          FROM      dbo.[Order]

          GROUP BY  DATEPART(MONTH, OrderDate) ) OrderSummary;

Note: Don’t worry about the DatePart function yet. We will cover all of the date and time functions in Chapter 7.

The results from this query may look like this, depending on how the company did that quarter:

January Orders

February Orders

March Orders

18206

5212

14586

Result Set 4-4: Sales report rollup and pivoted

To see the impact of the MAX aggregate functions, let’s review the results of the query without them. Without the MAX aggregate functions, we get a result set with 12 records.

January Orders

February Orders

March Orders

18206

0

0

0

5212

0

0

0

14586

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

Result Set 4-5: Sales report before the rollup

Each record will have a valid value in one column at most. The MAX aggregates will collapse all of these records into a single record, showing the valid record for each column.

Tip: We could add a Group By clause to get a summary breakdown for any level in the Sales hierarchy (such as Group, Region, Territory, etc.) that we are interested in.

Having

The HAVING clause comes after any GROUP BY statements your query may include. The HAVING clause allows you to provide filters based on the results on the aggregate function calls.

A simple example of a SQL statement including a HAVING clause is:

Code Listing 4-7: General syntax for a HAVING clause

SELECT  CustomerId ,

        COUNT(1)

FROM    dbo.[Order]

GROUP BY CustomerId

HAVING  COUNT(1) > 20;

With this query we can easily identify frequent customers.

The HAVING clause is not often used and is often overlooked, partly because we generally need to report on the results of the aggregate functions, and not filter based on their output. We can also get the same results without the HAVING clause.

Code Listing 4-8: Simulating the effects of a HAVING clause

SELECT  OrderCounts.CustomerId ,

        OrderCounts.NumberOfOrders

FROM    ( SELECT    CustomerId ,

                    COUNT(1) NumberOfOrders

          FROM      dbo.[Order]

          GROUP BY  CustomerId ) OrderCounts

WHERE   OrderCounts.NumberOfOrders > 20;

Tip: Even though it’s easily overlooked, keep the HAVING clause in mind. When needed, this clause will simplify the query syntax.

Sorting

Once we have identified the records we are interested in, we often need to sort the data to put it into context and help ensure that we do not overlook relevant data by keeping the related records near each other in the result set. If you are looking at a detail sales records for a division, you want all of the detail records to be together.

There are just a few rules for when and how you can sort. Columns of type ntext, text, image, geography, geometry, and xml cannot be used in an ORDER BY clause. This is primarily because there is no clear sequence strategy for these data types.

The ORDER BY clause cannot be included in views, inline functions, derived tables, subqueries, etc., because these structures are used in larger queries that would generally handle sorting that would contradict the sorting defined elsewhere.

Note: The exception to this restriction is if a TOP, OFFSET, or FETCH clause is specified. We will discuss OFFSET and FETCH shortly. The TOP clause is straightforward—you simply specify the number of records to return. With any of these clauses included, the sort order will influence which records will be returned.

Changing directions

If you do not specify the sort direction, the direction will default to ascending.

Code Listing 4-9: Sorting by the number of orders ascending

SELECT  CustomerId ,

        COUNT(1) NumberOfOrders

FROM    dbo.[Order]

GROUP BY CustomerId

HAVING  COUNT(1) > 20

ORDER BY NumberOfOrders;

This query will show the customer with the highest number of orders last, while the query in Code Listing 4-10 will show the customer with the most number of orders first:

Code Listing 4-10: Sorting by the number of orders descending

SELECT  CustomerId ,

        COUNT(1) NumberOfOrders

FROM    dbo.[Order]

GROUP BY CustomerId

HAVING  COUNT(1) > 20

ORDER BY NumberOfOrders DESC;

Based on your needs, you will generally know the direction that you want to sort, but sometimes it may not be so clear cut. We can make the details of the ORDER BY more dynamic. We can specify the conditional case statement evaluated for each record returned to determine how that record should be sorted.

Let’s look at a slightly contrived scenario. Revisiting the Department table, we may want to sort by the DepartmentName unless the LocationId is less than 5, in which case we want to sort by the CostCenter.

Code Listing 4-11: Deciding how to sort record by record

SELECT  DepartmentId ,

        DepartmentName ,

        CostCenter ,

        LocationId

FROM    dbo.Department

ORDER BY CASE WHEN LocationId < 5 THEN DepartmentName

              ELSE CostCenter

         END;

DepartmentId

Department Number

Cost Center

LocationId

8

ConsumerSales

1R8JASDAFAD

5

3

Prepaid Customer

3SXY6JLYJM1

5

7

Database

80O3SAS8SP6

2

12

Prepaid Customer

DNM1OUVCS

5

14

Express Marketing

1SNXK5DFDSF

1

5

Inside Accounting

VVCELPQURP

3

2

InternationalSales

IDAP8Q32CB

4

6

Outside Accounting

NI04JEDVP8RE

1

10

Phone Marketing

I8J5XNIYOE9KF

2

15

Service

K4DFDFSGGH

1

4

Service Hardware

FWR9NEF5X

1

13

Service Software

WGFB3HWMD

4

9

TechnicalSales

SU476Y3GFGF

2

11

Web

DZYFWH4JXIJJ

3

1

Web Marketing

E5Z3BSG80R1

2

Result Set 4-6: Sorting differently based on the location

Note: If you want to specify the direction for a conditional sort, you specify it after the end in the case statement. A conditional order by cannot change the direction of the order by, only which column to use.

Multiple sorts

You can sort by multiple columns; each column can have a different direction, and each column could potentially be a conditional ORDER BY. The individual columns specified must be unique. It does not make sense to order by the same column twice. The sequence of the columns specified matter. The result set retrieved is sorted by the first column specified, and then this result is sorted by the second column specified, etc.

Code Listing 4-12: Sorting by multiple columns

    SELECT  DepartmentId ,

            DepartmentName ,

            CostCenter ,

            LocationId

    FROM    dbo.Department

    ORDER BY LocationId ,

            DepartmentName ,

            CostCenter DESC;

In this query, we sort by the LocationId, then the DepartmentName, and then the CostCenter. The impact of the last column specified in the order by can be seen in the last two records in the following result set.

DepartmentId

Department Name

Cost Center

LocationId

14

Express Marketing

1SNXK5DFDSF

1

6

Outside Accounting

NI04JEDVP8RE

1

15

Service

K4DFDFSGGH

1

4

Service Hardware

FWR9NEF5X

1

7

Database

80O3SAS8SP6

2

10

Phone Marketing

I8J5XNIYOE9KF

2

9

TechnicalSales

SU476Y3GFGF

2

1

Web Marketing

E5Z3BSG80R1

2

5

Inside Accounting

VVCELPQURP

3

11

Web

DZYFWH4JXIJJ

3

2

InternationalSales

IDAP8Q32CB

4

13

Service Software

WGFB3HWMD

4

8

ConsumerSales

1R8JASDAFAD

5

12

Prepaid Customer

DNM1OUVCS

5

3

Prepaid Customer

3SXY6JLYJM1

5

Result Set 4-7: Sorting by multiple columns

Offset and Fetch

OFFSET and FETCH clauses can be used to implement a paging strategy. This provides a filter on the data based on the sorting sequence. An example SQL SELECT including OFFSET and FETCH in your ORDER BY follows:

Code Listing 4-13: A simple example using OFFSET and FETCH

SELECT  EmployeeId ,

        FirstName ,

        LastName ,

        ReportsTo ,

        DepartmentId

FROM    dbo.Employee

ORDER BY DepartmentId ,

        LastName ,

        FirstName

        OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY;

Note: OFFSET and FETCH were not added until SQL Server 2012.

The values for the OFFSET and FETCH can be numeric constants like you see in Code Listing 4-13, or could be locally declared variables, or even subqueries. You can query the number of records to be returned from a configuration table for configuration settings.

Code Listing 4-14: Pulling the parameters for FETCH from a configuration table

SELECT  EmployeeId ,

        FirstName ,

        LastName ,

        ReportsTo ,

        DepartmentId

FROM    dbo.Employee

ORDER BY DepartmentId ,

        LastName ,

        FirstName

        OFFSET 20 ROWS FETCH NEXT ( SELECT  PageSize

                                    FROM    dbo.AppSettings

                                    WHERE   AppSettingID = 1 ) ROWS ONLY;

Summary

In this chapter we have seen various ways to slice and summarize data. The aggregate functions open up a lot of possibilities to get fresh insights into your data. We have also seen various ways that sorting can be used to draw the focus to relevant data.

Now we will turn our attention to a new type of table join from what we saw in Chapter 2. Next we will explore the implications of tables selecting from themselves.

Scroll To Top
Disclaimer
DISCLAIMER: Web reader is currently in beta. Please report any issues through our support system. PDF and Kindle format files are also available for download.

Previous

Next



You are one step away from downloading ebooks from the Succinctly® series premier collection!
A confirmation has been sent to your email address. Please check and confirm your email subscription to complete the download.