SQL Queries Succinctly®
by Nick Harrison

CHAPTER 7

Importance of the Data Dictionary



Learning the data dictionary is one of the most important things that you can do to make it easier to learn a new data model. Learning a new data model is often the hardest part of writing SQL queries. Quite often there is little to no documentation for the database, and sometimes there are remnants from old requirements and defunct systems. If the database has been in use for a while, there is a good chance that many of the relationships have been dropped to accommodate dirty data or changing data requirements.

All of this makes it harder to learn the data model.

The data dictionary can help. No matter how frequently the database changes, and no matter how out-of-date every other piece of documentation may be, the data dictionary stays current because the database itself maintains it.

The data dictionary is a collection of tables and views that can track literally millions of pieces of information about the database. Internally, the database uses this information for performance tuning, tracking maintenance, building execution plans, etc. We can use it to learn where the data is stored and ferret out key information and relationships.

Learning the data model

When I talk about the data dictionary here, I am specifically talking about the set of system views, stored in each database, that make up the Information_Schema. There are other views that could be used: each database vendor has system views that expose proprietary details of its implementation, and these may change from one version to the next to accommodate new features.

Tip: While you could get to the information you want by referencing system tables directly, stick with the Information_Schema views instead. Their structure is an ANSI standard and will be consistent across most database systems and versions. The same cannot be said about the raw system tables.

If you need to track vendor and version-specific metadata, you will need to use the system tables directly, but as you will see, we can get a lot of useful details while limiting ourselves to the Information_Schema, and this will help future-proof our processes. The queries we are about to go over will be portable across vendors and across database versions with the same vendor.

Note: While these views are an ANSI standard, not all database vendors follow them. The biggest missing player is Oracle, but there are open source projects devoted to filling this gap.

Our discussion will focus on the following views:

  • Information_Schema.Tables
  • Information_Schema.Columns
  • Information_Schema.Views
  • Information_Schema.View_Table_Usage
  • Information_Schema.View_Column_Usage
  • Information_Schema.Routines
  • Information_Schema.Parameters

Information_Schema.Tables

The Tables view has data for every table in the database. This is where we would go to get a list of the tables in the database or, more often, a list of the tables that satisfy a more specific question, such as containing a particular column. Once we have explored all of these views, we will explore some creative ways to use this information.
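For example, a query along these lines lists every base table while skipping views; 'BASE TABLE' is the standard Table_Type value for ordinary tables:

```sql
-- List every base table, excluding views.
SELECT  TABLE_SCHEMA ,
        TABLE_NAME
FROM    INFORMATION_SCHEMA.TABLES
WHERE   TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_SCHEMA , TABLE_NAME;
```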

Tables has the following columns:

  • Table_Catalog: The name of the database.
  • Table_Schema:  The name of the table’s owner.
  • Table_Name:  The name of the table.
  • Table_Type: An indicator for whether it’s a base table or a view.

Information_Schema.Columns

The Columns view has data for every column in the database. There are quite a few columns in this view. In some circumstances, you may care about each of these columns, but for our purposes, only a handful of these columns are relevant.

  • Table_Catalog: The name of the database.
  • Table_Schema: The name of the table’s owner.
  • Table_Name:  The name of the table.
  • Column_Name:  The name of the column.
  • Ordinal_Position: The sequence for where the column is defined in the table.
  • Is_Nullable: An indicator for whether or not this column can have NULL values.
  • Data_Type: A string for the data type for the column.

The rest of the columns in this view provide more detail about the data type, such as the length, precision, scale, etc.
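For example, to describe a single table in column-definition order, a query like the following should work on any system exposing the Information_Schema; Policy is a hypothetical table name:

```sql
-- Describe one table's columns in definition order.
-- 'Policy' is a hypothetical table name; substitute your own.
SELECT  COLUMN_NAME ,
        DATA_TYPE ,
        IS_NULLABLE
FROM    INFORMATION_SCHEMA.COLUMNS
WHERE   TABLE_NAME = 'Policy'
ORDER BY ORDINAL_POSITION;
```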

Information_Schema.Views

The Views view has data for every view defined in the database. It is very similar to the Tables view, with one notable addition: a View_Definition column, which contains the SQL used to create the view.
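For example, a query along these lines retrieves the definition of a single view; vw_PolicySummary is a hypothetical view name, and be aware that some systems truncate very long definitions in this column:

```sql
-- Pull the SQL behind one view.
-- 'vw_PolicySummary' is a hypothetical view name.
SELECT  TABLE_SCHEMA ,
        TABLE_NAME ,
        VIEW_DEFINITION
FROM    INFORMATION_SCHEMA.VIEWS
WHERE   TABLE_NAME = 'vw_PolicySummary';
```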

Information_Schema.View_Table_Usage

The View_Table_Usage view has a record for every table used in any view definition. This can make it easier to query and get a list of tables involved in any view to track down data mapping.
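For example, to list the tables a particular view draws from, something like this should work; vw_PolicySummary is again a hypothetical view name:

```sql
-- Which tables feed a given view?
SELECT  TABLE_SCHEMA ,
        TABLE_NAME
FROM    INFORMATION_SCHEMA.VIEW_TABLE_USAGE
WHERE   VIEW_NAME = 'vw_PolicySummary'
ORDER BY TABLE_NAME;
```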

Information_Schema.View_Column_Usage

The Information_Schema.View_Column_Usage view has a record for every column exposed in a view, including the Table_Name and Column_Name that the column comes from. This makes it easy to see where the data in a view originates.
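For example, a sketch of a query that finds every view exposing a particular column; PolicyKey is a hypothetical column name:

```sql
-- Which views expose a given column, and from which table?
SELECT  VIEW_NAME ,
        TABLE_NAME
FROM    INFORMATION_SCHEMA.VIEW_COLUMN_USAGE
WHERE   COLUMN_NAME = 'PolicyKey'
ORDER BY VIEW_NAME;
```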

Information_Schema.Routines

The Information_Schema.Routines view has a record for every stored procedure and stored function defined in the database. There are lots of columns here to give more information about the programmability objects defined. For the purposes of learning the data model, we don’t really care about most of these columns. The columns that we do care about include:

  • Specific_Name: The name of the routine being defined.
  • Routine_Type: An indicator to identify the type of routine being defined.
  • Routine_Definition: This column holds the SQL used to define the routine.

Often all we really care about is identifying the stored procedures that are in the database.
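For example, a query like this lists just the stored procedures, filtering out functions via the standard Routine_Type values:

```sql
-- List stored procedures only; ROUTINE_TYPE is 'PROCEDURE' or 'FUNCTION'.
SELECT  SPECIFIC_NAME
FROM    INFORMATION_SCHEMA.ROUTINES
WHERE   ROUTINE_TYPE = 'PROCEDURE'
ORDER BY SPECIFIC_NAME;
```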

Information_Schema.Parameters

The Information_Schema.Parameters view has a record for every parameter used in any routine.

  • Specific_Name: The name of the routine using the parameter.
  • Parameter_Name: The name of the parameter being defined.
  • Ordinal_Position: The order in which the parameter is expected.

We also have a slew of columns to define the data type for the parameter.

We might use this to identify all of the stored procedures that include a specific parameter or list all of the parameters used by a specific stored procedure.
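For example, here is a sketch of the first of those uses: finding every routine that takes a particular parameter. The parameter name is hypothetical, and note that SQL Server stores parameter names with the leading @:

```sql
-- Which routines take a @PolicyKey parameter?
-- '@PolicyKey' is a hypothetical name; SQL Server includes the @.
SELECT  SPECIFIC_NAME ,
        ORDINAL_POSITION
FROM    INFORMATION_SCHEMA.PARAMETERS
WHERE   PARAMETER_NAME = '@PolicyKey'
ORDER BY SPECIFIC_NAME;
```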

Common queries

There are some queries that you might use over and over to get a feel for what you have when you meet a new database for the first time.

Find out who the owners are

Code Listing 7-1: Who are the owners?

SELECT  TABLE_SCHEMA ,
        COUNT(1)
FROM    INFORMATION_SCHEMA.TABLES
GROUP BY TABLE_SCHEMA;

This will tell you who the owners are and how many tables they each own.

Tip: In many databases, you may find that everything is owned by dbo (the database owner), but sometimes tables are grouped into functional areas by ownership.

Which tables are owned by a specific owner?

Code Listing 7-2: What does Finance own?

SELECT  TABLE_NAME
FROM    INFORMATION_SCHEMA.TABLES
WHERE   TABLE_SCHEMA = 'Finance'
ORDER BY TABLE_NAME;

This will identify all of the tables owned by Finance. If each functional area has its own table owner, this gives you a list of the tables used in the Finance module of your system.

Which tables include a specific column?

Sometimes you might have a column name used throughout the data model. For example, in an insurance policy management system, you might find that every policy-level table will include a PolicyKey column. In a mortgage loan origination system, every loan level table may include a LoanKey column. This can help identify or filter out tables that are at that level.

Code Listing 7-3: Where are the PolicyKey columns?

SELECT  t.TABLE_SCHEMA ,
        t.TABLE_NAME
FROM    INFORMATION_SCHEMA.TABLES t
        INNER JOIN INFORMATION_SCHEMA.COLUMNS c
                ON c.TABLE_NAME = t.TABLE_NAME
               AND c.TABLE_SCHEMA = t.TABLE_SCHEMA
WHERE   c.COLUMN_NAME = 'PolicyKey'
ORDER BY t.TABLE_NAME;

This query should identify all of the policy-level tables in your database.

Sometimes you want to do the opposite. In this case, for example, we want to find all of the tables that do not have a column named PolicyKey:

Code Listing 7-4: Which tables do not have a PolicyKey column?

SELECT  t.TABLE_SCHEMA ,
        t.TABLE_NAME
FROM    INFORMATION_SCHEMA.TABLES t
        LEFT OUTER JOIN INFORMATION_SCHEMA.COLUMNS c
                ON c.TABLE_NAME = t.TABLE_NAME
               AND c.TABLE_SCHEMA = t.TABLE_SCHEMA
               AND c.COLUMN_NAME = 'PolicyKey'
WHERE   c.COLUMN_NAME IS NULL
ORDER BY t.TABLE_NAME;

This will give us a list of all of the tables that are not at the policy level. Often this might mean that the tables are for configuration, administration, or something else not at the policy level.

Now we can extend this further to find the tables that have a combination of columns that we are interested in:

Code Listing 7-5: Do any tables have a LoanKey and Term column combination?

SELECT  t.TABLE_SCHEMA ,
        t.TABLE_NAME
FROM    INFORMATION_SCHEMA.TABLES t
        INNER JOIN INFORMATION_SCHEMA.COLUMNS P
                ON P.TABLE_NAME = t.TABLE_NAME
               AND P.TABLE_SCHEMA = t.TABLE_SCHEMA
        INNER JOIN INFORMATION_SCHEMA.COLUMNS Term
                ON Term.TABLE_NAME = t.TABLE_NAME
               AND Term.TABLE_SCHEMA = t.TABLE_SCHEMA
WHERE   P.COLUMN_NAME = 'LoanKey'
        AND Term.COLUMN_NAME = 'Term'
ORDER BY t.TABLE_NAME;

Here we get a list of the tables that have both a LoanKey column and a Term column. This could be used to filter out any tables used to configure term options or term restrictions, and instead only report the references to Term that are at the loan level.

Search for common patterns that might point to a problem

I once worked on an application where the city, state, and zip fields for an address were combined into a single CSZ field; various tables had all three values in one column. This is ultimately problematic for various reasons, so I started explicitly looking for tables with any columns ending in CSZ.

Code Listing 7-6: Did anyone put city, state, and zip in a single column?

SELECT  TABLE_NAME ,
        COLUMN_NAME
FROM    INFORMATION_SCHEMA.COLUMNS
WHERE   COLUMN_NAME LIKE '%CSZ';

Another problem may be dealing with common or inconsistent abbreviations. I see this a lot with date fields; people often get creative when naming date columns. For example, I have seen ClosingDt, FirstPaymentDueDte, and RecissionDate all in the same table. My preference is to always spell out the components of the name, but not everyone agrees with me. We can easily find all the variations used in a database:

Code Listing 7-7: How did we name the date columns?

SELECT  TABLE_NAME ,
        COLUMN_NAME
FROM    INFORMATION_SCHEMA.COLUMNS
WHERE   (DATA_TYPE = 'DateTime' OR DATA_TYPE = 'Date')
        AND COLUMN_NAME NOT LIKE '%Date';

Tip: Never include a NOT LIKE clause in a condition that you plan to run in production, because it will generally involve a full table scan to retrieve the data. Fortunately, you can easily get the data you want from the development or training environments.

Another common problem you are likely to run into is inconsistent data types for columns that should be consistent. This can complicate integrating with other systems or cause unexpected results when reporting within a single application.

Code Listing 7-8: Did we standardize the length of a City column?

SELECT  TABLE_NAME ,
        COLUMN_NAME ,
        CHARACTER_MAXIMUM_LENGTH
FROM    INFORMATION_SCHEMA.COLUMNS
WHERE   COLUMN_NAME LIKE '%City';

You probably want to run a similar query to confirm consistent lengths for all state columns, zip code columns, phone number columns, etc. Anytime you see a common component for multiple field names, review these columns and ensure that they have a consistent data type, or that you understand why they are not consistent.
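For example, here is the same length check sketched for state columns, under the same naming assumptions as the City query:

```sql
-- Consistency check for state columns; length outliers sort first.
SELECT  TABLE_NAME ,
        COLUMN_NAME ,
        CHARACTER_MAXIMUM_LENGTH
FROM    INFORMATION_SCHEMA.COLUMNS
WHERE   COLUMN_NAME LIKE '%State'
ORDER BY CHARACTER_MAXIMUM_LENGTH , TABLE_NAME;
```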

Note: Sometimes they may need to be different to accommodate a third-party interface or similar integration concerns.

Are we using any obsolete data types?

Several data types have been identified as obsolete, and we have been warned that they may be dropped in a future release. It is a good idea to track the columns that use them and make sure that new ones don’t show up in your list. Depending on your change control process, you may even be able to migrate them to a supported replacement.

Code Listing 7-9: Finding obsolete types

SELECT  TABLE_NAME ,
        COLUMN_NAME
FROM    INFORMATION_SCHEMA.COLUMNS
WHERE   DATA_TYPE IN ( 'text', 'ntext', 'image', 'datetime' );

You may identify other data types that you want to avoid, so you may want to track their usage.

Code Listing 7-10: Finding types to avoid

SELECT  TABLE_NAME ,
        COLUMN_NAME
FROM    INFORMATION_SCHEMA.COLUMNS
WHERE   DATA_TYPE IN ( 'uniqueidentifier', 'ntext' );

What data types are being used?

It’s good to know what data types are being used across the system. The first time you run such a query, you may find a lot of inconsistencies.

Code Listing 7-11: What data types are being used?

SELECT  DATA_TYPE ,
        COUNT(1)
FROM    INFORMATION_SCHEMA.COLUMNS
GROUP BY DATA_TYPE
ORDER BY DATA_TYPE;

You may find results like this:

Data Type     Count
---------     -----
bit              17
char              2
datetime         34
datetime2         9
decimal         124
float             1
int              25
nchar             2
nvarchar          5
smallint          2
varbinary         1
varchar         270

Result Set 7-1: Data types being used

This is good information to have. You can quickly tell that you have not taken advantage of the newer Date data type. You can also quickly tell that someone has probably gotten confused about the various character data types, such as NChar and NVarChar.

Tip: Unless you need support for internationalization, stick with Char and VarChar.

This will help you track the inconsistencies and progress towards standardization.

Summary

We have seen in this chapter that the data dictionary provides a good source of documentation for the database, even when all other documentation has long since been outdated, because the database itself is keeping this documentation current. We have seen how to use this information to quickly get an overview of what’s in the database by finding out who owns the structures in the database, and getting a list of every table in the database and the columns associated with each of these tables. We can easily identify which tables are reference data and which have transaction data by searching for key columns being included in the table.

We have also seen some examples of how to detect patterns in the database structure that may cause problems. We looked at finding a pattern where city, state, and zip code data was combined in a single field, where naming conventions for Date columns were inconsistent, where the lengths of common columns such as City were not consistent, and where we had not standardized on data types.

There is a wealth of information available that can help you quickly get up to speed on a data model when you face one and wonder where to begin.
