CHAPTER 4
After we have applied the appropriate joins to get related data and filters to filter out the data that we are not interested in, we may want to run some calculations to summarize the relevant data. Aggregate functions perform their calculations against a set of values, returning a single value.
The most common aggregate functions you are likely to use include:
Table 4-1: List of aggregate functions
AVG | Returns the average value for the column specified. |
COUNT | Returns the count of the number of records in the identified set. |
MAX | Returns the largest value for the column specified. |
MIN | Returns the smallest value for the column specified. |
SUM | Returns the sum of the values for the column specified. |
StdDev | Returns the statistical standard deviation for the column specified. |
VAR | Returns the statistical variance for the column specified. |
Unless you are a statistician, you will probably mainly use the top five from the list, but these functions can combine to provide some interesting insights into your data. Here we will focus on the nonstatistical functions.
Note: Null values are ignored when evaluating any aggregate function.
The GROUP BY statement is used with the aggregate functions to group the result set by one or more columns. Any column from the SELECT clause that is not used in an aggregate function must be included in the GROUP BY clause. You can include any additional columns that you might need, but it rarely makes sense to group by a column that will not be displayed: the results would not make sense.
A simple example of a SQL statement using aggregate functions and a GROUP BY statement is:
Code Listing 4-1: General syntax for an aggregate function
SELECT DepartmentId , MIN(HireDate) , MAX(HireDate) FROM dbo.Employee GROUP BY DepartmentId; |
This query will give a list of DepartmentIds with the first and last HireDate for each department.
The SUM aggregate function only makes sense for numeric data types. It doesn’t make sense to add up dates or strings, but any numeric data type can be summed.
COUNT is applicable regardless of the data type, except for text, image, or ntext, because columns of these types are not stored with the rest of the record. In general, we don’t care about which column is specified for the COUNT function; we just count the number of records.
Both of these functions can also take a distinct modifier, which means that instead of operating on every record in the result set, they will operate on the unique values in the result set. If you do not specify a modifier, then ALL will be assumed.
Code Listing 4-2: Query showing the difference between COUNT and COUNT DISTINCT
SELECT COUNT(1) NumberOfEmployees , COUNT(DISTINCT DepartmentId) NumberOfDepartments FROM dbo.Employee; |
Even though SUM requires numeric input, we can get useful information against nonnumeric data. We can use SQL Server’s CASE statement to give us some numeric values to add up.
Code Listing 4-3: Using the CASE statement to provide a numeric value to add up
SELECT SUM(CASE WHEN HireDate < '1/1/1970' THEN 1 ELSE 0 END) HiredBeforeThe70s , SUM(CASE WHEN HireDate BETWEEN '1/1/1970' AND '12/31/1979' THEN 1 ELSE 0 END) HiredDuringThe70s , SUM(CASE WHEN HireDate BETWEEN '1/1/1980' AND '12/31/1989' THEN 1 ELSE 0 END) HiredBeforeThe80s , SUM(CASE WHEN HireDate BETWEEN '1/1/1990' AND '12/31/1999' THEN 1 ELSE 0 END) HiredBeforeThe80s FROM dbo.Employee; |
We use the multiple case statements to compare the HireDate and return either a 0 or a 1. When we add 0 it will have no effect, but when we add a 1 we will increase the count for that date range.
This will return a single record with four counts:
Hired Before the 70s | Hired in the 70s | Hired in the 80s | Hired in the 90sT |
312 | 185 | 169 | 185 |
Result Set 4-1: Breaking hiring data into bands based on hire date
This is a handy way to break the data up into intervals.
To get a better feel for how this works, let’s drop the aggregate functions and look at some sample rows.
Code Listing 4-4: Using the CASE statement without SUM
SELECT (CASE WHEN HireDate < '1/1/1970' THEN 1 ELSE 0 END) HiredBeforeThe70s , (CASE WHEN HireDate BETWEEN '1/1/1970' AND '12/31/1979' THEN 1 ELSE 0 END) HiredDuringThe70s , (CASE WHEN HireDate BETWEEN '1/1/1980' AND '12/31/1989' THEN 1 ELSE 0 END) HiredBeforeThe80s , (CASE WHEN HireDate BETWEEN '1/1/1990' AND '12/31/1999' THEN 1 ELSE 0 END) HiredBeforeThe80s FROM dbo.Employee; |
Hired Before the 70s | Hired in the 70s | Hired in the 80s | Hired in the 90sT |
|---|---|---|---|
0 | 0 | 0 | 1 |
1 | 0 | 0 | 0 |
0 | 0 | 0 | 1 |
0 | 1 | 0 | 0 |
1 | 0 | 0 | 0 |
0 | 0 | 0 | 1 |
0 | 1 | 0 | 0 |
1 | 0 | 0 | 0 |
Result Set 4-2: Only one column in each record will have a value
Each row returned will have at most one column with a value of 1. The other columns will be 0. If all of the columns are 0, then that particular employee was hired after the 90s. This provides a powerful way to summarize data quickly.
We can also combine a Group By to get this summarized by department:
Code Listing 4-5: Summarizing the metrics by department
SELECT DepartmentName , SUM(CASE WHEN HireDate < '1/1/1970' THEN 1 ELSE 0 END) HiredBeforeThe70s , SUM(CASE WHEN HireDate BETWEEN '1/1/1970' AND '12/31/1979' THEN 1 ELSE 0 END) HiredDuringThe70s , SUM(CASE WHEN HireDate BETWEEN '1/1/1980' AND '12/31/1989' THEN 1 ELSE 0 END) HiredBeforeThe80s , SUM(CASE WHEN HireDate BETWEEN '1/1/1990' AND '12/31/1999' THEN 1 ELSE 0 END) HiredBeforeThe80s FROM dbo.Employee INNER JOIN dbo.Department ON Department.DepartmentId = Employee.DepartmentId GROUP BY DepartmentName; |
Department Name | Hired Before the 70s | Hired in the 70s | Hired in the 80s | Hired in the 90sT |
|---|---|---|---|---|
Accounting | 50 | 36 | 29 | 31 |
BusinessSales | 7 | 3 | 4 | 3 |
ConsumerSales | 5 | 2 | 1 | 2 |
Corporate Care | 44 | 25 | 24 | 25 |
CorporateSales | 6 | 4 | 2 | 2 |
Customer | 23 | 7 | 13 | 18 |
InternationalSales | 6 | 8 | 2 | 3 |
Marketing | 29 | 19 | 20 | 16 |
NationalSales | 5 | 2 | 5 | 4 |
Prepaid Customer | 46 | 17 | 21 | 26 |
Sales | 6 | 6 | 3 | 4 |
Service | 15 | 17 | 10 | 8 |
Technical | 27 | 18 | 14 | 13 |
TechnicalSales | 13 | 6 | 6 | 7 |
Web | 27 | 11 | 13 | 22 |
Result Set 4-3: Summarizing the hiring statistics by department
The MIN and MAX aggregate functions work on a wide range of data types. You can find the MIN and MAX for varchars, DateTime, and any numeric types. MIN and MAX do not make sense against a bit field or image. These functions can be used to find the oldest and youngest employees in each department. When applied to Varchars or NVarchar, these functions will do an alphabetic comparison, which can also provide useful insight into your data.
We can also combine MIN and MAX with the other aggregate functions to get more insight into your data, provided that the other aggregate functions are in a nested query or a common table expression.
Note: You cannot perform an aggregate function on an expression containing an aggregate or a subquery.
We can use this technique to pivot data in the result set. This allows us to convert the rows in the result set to columns in a new result set. Let’s see how we might construct a summary for a monthly sales report.
Going back to the Order tables that we have been working with, let’s step through what a sales report might look like.
Code Listing 4-6: Building a sales report
SELECT MAX(CASE OrderSummary.OrderDate WHEN 1 THEN ( OrderSummary.OrderTotal ) ELSE 0 END) AS JanuaryOrders , MAX(CASE OrderSummary.OrderDate WHEN 2 THEN ( OrderSummary.OrderTotal ) ELSE 0 END) AS FeburaryOrders , MAX(CASE OrderSummary.OrderDate WHEN 3 THEN ( OrderSummary.OrderTotal ) ELSE 0 END) AS MarchOrders FROM ( SELECT DATEPART(MONTH, OrderDate) AS OrderDate , SUM(OrderTotal) AS OrderTotal FROM dbo.[Order] GROUP BY DATEPART(MONTH, OrderDate) ) OrderSummary; |
Note: Don’t worry about the DatePart function yet. We will cover all of the date and time functions in Chapter 7.
The results from this query may look like this, depending on how the company did that quarter:
January Orders | February Orders | March Orders |
|---|---|---|
18206 | 5212 | 14586 |
Result Set 4-4: Sales report rollup and pivoted
To see the impact of the MAX aggregate functions, let’s review the results of the query without them. Without the MAX aggregate functions, we get a result set with 12 records.
January Orders | February Orders | March Orders |
18206 | 0 | 0 |
0 | 5212 | 0 |
0 | 0 | 14586 |
0 | 0 | 0 |
0 | 0 | 0 |
0 | 0 | 0 |
0 | 0 | 0 |
0 | 0 | 0 |
0 | 0 | 0 |
0 | 0 | 0 |
0 | 0 | 0 |
0 | 0 | 0 |
Result Set 4-5: Sales report before the rollup
Each record will have a valid value in one column at most. The MAX aggregates will collapse all of these records into a single record, showing the valid record for each column.
Tip: We could add a Group By clause to get a summary breakdown for any level in the Sales hierarchy (such as Group, Region, Territory, etc.) that we are interested in.
The HAVING clause comes after any GROUP BY statements your query may include. The HAVING clause allows you to provide filters based on the results on the aggregate function calls.
A simple example of a SQL statement including a HAVING clause is:
Code Listing 4-7: General syntax for a HAVING clause
SELECT CustomerId , COUNT(1) FROM dbo.[Order] GROUP BY CustomerId HAVING COUNT(1) > 20; |
With this query we can easily identify frequent customers.
The HAVING clause is not often used and is often overlooked, partly because we generally need to report on the results of the aggregate functions, and not filter based on their output. We can also get the same results without the HAVING clause.
Code Listing 4-8: Simulating the effects of a HAVING clause
SELECT OrderCounts.CustomerId , OrderCounts.NumberOfOrders FROM ( SELECT CustomerId , COUNT(1) NumberOfOrders FROM dbo.[Order] GROUP BY CustomerId ) OrderCounts WHERE OrderCounts.NumberOfOrders > 20; |
Tip: Even though it’s easily overlooked, keep the HAVING clause in mind. When needed, this clause will simplify the query syntax.
Once we have identified the records we are interested in, we often need to sort the data to put it into context and help ensure that we do not overlook relevant data by keeping the related records near each other in the result set. If you are looking at a detail sales records for a division, you want all of the detail records to be together.
There are just a few rules for when and how you can sort. Columns of type ntext, text, image, geography, geometry, and xml cannot be used in an ORDER BY clause. This is primarily because there is no clear sequence strategy for these data types.
The ORDER BY clause cannot be included in views, inline functions, derived tables, subqueries, etc., because these structures are used in larger queries that would generally handle sorting that would contradict the sorting defined elsewhere.
Note: The exception to this restriction is if a TOP, OFFSET, or FETCH clause is specified. We will discuss OFFSET and FETCH shortly. The TOP clause is straightforward—you simply specify the number of records to return. With any of these clauses included, the sort order will influence which records will be returned.
If you do not specify the sort direction, the direction will default to ascending.
Code Listing 4-9: Sorting by the number of orders ascending
SELECT CustomerId , COUNT(1) NumberOfOrders FROM dbo.[Order] GROUP BY CustomerId HAVING COUNT(1) > 20 ORDER BY NumberOfOrders; |
This query will show the customer with the highest number of orders last, while the query in Code Listing 4-10 will show the customer with the most number of orders first:
Code Listing 4-10: Sorting by the number of orders descending
SELECT CustomerId , COUNT(1) NumberOfOrders FROM dbo.[Order] GROUP BY CustomerId HAVING COUNT(1) > 20 ORDER BY NumberOfOrders DESC; |
Based on your needs, you will generally know the direction that you want to sort, but sometimes it may not be so clear cut. We can make the details of the ORDER BY more dynamic. We can specify the conditional case statement evaluated for each record returned to determine how that record should be sorted.
Let’s look at a slightly contrived scenario. Revisiting the Department table, we may want to sort by the DepartmentName unless the LocationId is less than 5, in which case we want to sort by the CostCenter.
Code Listing 4-11: Deciding how to sort record by record
SELECT DepartmentId , DepartmentName , CostCenter , LocationId FROM dbo.Department ORDER BY CASE WHEN LocationId < 5 THEN DepartmentName ELSE CostCenter END; |
DepartmentId | Department Number | Cost Center | LocationId |
|---|---|---|---|
8 | ConsumerSales | 1R8JASDAFAD | 5 |
3 | Prepaid Customer | 3SXY6JLYJM1 | 5 |
7 | Database | 80O3SAS8SP6 | 2 |
12 | Prepaid Customer | DNM1OUVCS | 5 |
14 | Express Marketing | 1SNXK5DFDSF | 1 |
5 | Inside Accounting | VVCELPQURP | 3 |
2 | InternationalSales | IDAP8Q32CB | 4 |
6 | Outside Accounting | NI04JEDVP8RE | 1 |
10 | Phone Marketing | I8J5XNIYOE9KF | 2 |
15 | Service | K4DFDFSGGH | 1 |
4 | Service Hardware | FWR9NEF5X | 1 |
13 | Service Software | WGFB3HWMD | 4 |
9 | TechnicalSales | SU476Y3GFGF | 2 |
11 | Web | DZYFWH4JXIJJ | 3 |
1 | Web Marketing | E5Z3BSG80R1 | 2 |
Result Set 4-6: Sorting differently based on the location
Note: If you want to specify the direction for a conditional sort, you specify it after the end in the case statement. A conditional order by cannot change the direction of the order by, only which column to use.
You can sort by multiple columns; each column can have a different direction, and each column could potentially be a conditional ORDER BY. The individual columns specified must be unique. It does not make sense to order by the same column twice. The sequence of the columns specified matter. The result set retrieved is sorted by the first column specified, and then this result is sorted by the second column specified, etc.
Code Listing 4-12: Sorting by multiple columns
SELECT DepartmentId , DepartmentName , CostCenter , LocationId FROM dbo.Department ORDER BY LocationId , DepartmentName , CostCenter DESC; |
In this query, we sort by the LocationId, then the DepartmentName, and then the CostCenter. The impact of the last column specified in the order by can be seen in the last two records in the following result set.
DepartmentId | Department Name | Cost Center | LocationId |
|---|---|---|---|
14 | Express Marketing | 1SNXK5DFDSF | 1 |
6 | Outside Accounting | NI04JEDVP8RE | 1 |
15 | Service | K4DFDFSGGH | 1 |
4 | Service Hardware | FWR9NEF5X | 1 |
7 | Database | 80O3SAS8SP6 | 2 |
10 | Phone Marketing | I8J5XNIYOE9KF | 2 |
9 | TechnicalSales | SU476Y3GFGF | 2 |
1 | Web Marketing | E5Z3BSG80R1 | 2 |
5 | Inside Accounting | VVCELPQURP | 3 |
11 | Web | DZYFWH4JXIJJ | 3 |
2 | InternationalSales | IDAP8Q32CB | 4 |
13 | Service Software | WGFB3HWMD | 4 |
8 | ConsumerSales | 1R8JASDAFAD | 5 |
12 | Prepaid Customer | DNM1OUVCS | 5 |
3 | Prepaid Customer | 3SXY6JLYJM1 | 5 |
Result Set 4-7: Sorting by multiple columns
OFFSET and FETCH clauses can be used to implement a paging strategy. This provides a filter on the data based on the sorting sequence. An example SQL SELECT including OFFSET and FETCH in your ORDER BY follows:
Code Listing 4-13: A simple example using OFFSET and FETCH
SELECT EmployeeId , FirstName , LastName , ReportsTo , DepartmentId FROM dbo.Employee ORDER BY DepartmentId , LastName , FirstName OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY; |
Note: OFFSET and FETCH were not added until SQL Server 2012.
The values for the OFFSET and FETCH can be numeric constants like you see in Code Listing 4-13, or could be locally declared variables, or even subqueries. You can query the number of records to be returned from a configuration table for configuration settings.
Code Listing 4-14: Pulling the parameters for FETCH from a configuration table
SELECT EmployeeId , FirstName , LastName , ReportsTo , DepartmentId FROM dbo.Employee ORDER BY DepartmentId , LastName , FirstName OFFSET 20 ROWS FETCH NEXT ( SELECT PageSize FROM dbo.AppSettings WHERE AppSettingID = 1 ) ROWS ONLY; |
In this chapter we have seen various ways to slice and summarize data. The aggregate functions open up a lot of possibilities to get fresh insights into your data. We have also seen various ways that sorting can be used to draw the focus to relevant data.
Now we will turn our attention to a new type of table join from what we saw in Chapter 2. Next we will explore the implications of tables selecting from themselves.