Friday, January 2, 2015

Database Query Log (DBQL) in Teradata

DBQL provides a series of pre-defined tables that can store, based on rules you specify, historical records of queries and their duration, performance, and target activity.

DBQL is flexible enough to log information on the variety of SQL requests that run on the Teradata Database, from short transactions to longer-running analysis and mining queries. You can begin and end collection for a user, a group of users, an account, or a list of accounts.

Collection options include:
  • Default logging reports, for each query, at least the leading SQL characters, the time of receipt, the number of processing steps completed, the time the first step was dispatched, and the times the packets were returned to the host.
  • Summary logging reports, at each logging interval, the count of all queries that completed within the specified elapsed-time, I/O, or CPU-usage intervals.
  • Threshold logging can log a combination of default and summary data:
    • Default data for each query that ran beyond the threshold limit
    • Summary counts of all queries that ran within the threshold time
    • Use I/O or CPU time as a criterion for detail or summary logging
  • Detail logging, which includes:
    • Default data
    • Any or all of the following:
      • Step-level activity, including parallel steps
      • Object usage per query
      • Full SQL text
      • EXPLAIN text


Note: Any DBC database tables and columns used by the system while processing a query are not reflected in the DBQL object rows for that query. This means, for example, that statements like CREATE TABLE or SELECT FROM DBC.xxx will not have objects logged through DBQL because they deal with DBC tables and columns.

Also, DBQL is limited to logging information about base tables and direct SQL statements. Macros, views, and triggers are not reported.

The DBQL Components
The DBQL logs are a series of system tables created in database DBC during the Teradata Database installation process. The suite of DBQL components includes a security macro and a view for each table, which are created in database DBC by the DIP utility.

The DBQL tables that can be populated include:
  • DBQLogTbl stores the default rows and is the foundation of the DBQL feature.
  • DBQLObjTbl stores information on the target objects of the query being logged. One row is logged for each object referenced in the query.
  • DBQLSQLTbl logs the full SQL statement text, no matter how large. Multiple rows are generated if necessary.
  • DBQLStepTbl logs the AMP step-level information of the query being logged.
  • DBQLExplainTbl logs the full unformatted EXPLAIN text.

DBQL Data Capture

Like other system tables, the predefined DBQL logs are created as relational tables in database DBC during normal Teradata Database installation. However, while most system tables are populated automatically, you can choose whether you want to populate the DBQL tables.
If you choose not to use the feature, the tables remain empty. If you want to use the feature, simply submit a BEGIN QUERY LOGGING statement, with or without options.
The options enable you to control the volume and detail of the logged data. You can define rules, for instance, that log the first 5,000 characters of any query that runs during a session invoked by a specific user under a specific account. Or you could define a rule to log queries that take longer to complete than a specified time threshold, or queries whose CPU usage exceeds a specified number of CPU hundredths of a second.
DBQL is controlled by the Teradata SQL statements BEGIN QUERY LOGGING and END QUERY LOGGING. Only a user with EXECUTE privilege on DBC.DBQLAccessMacro can invoke the statements.
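
For illustration, a few example statements are sketched below; the user and account names are hypothetical placeholders, so adjust the options to your own rules:

-- Hypothetical user names; options shown are only examples.
-- Log a default row plus the first 5,000 characters of SQL for one user:
BEGIN QUERY LOGGING LIMIT SQLTEXT=5000 ON etl_user;

-- Log only summary counts, bucketed at 5, 10, and 15 elapsed seconds:
BEGIN QUERY LOGGING LIMIT SUMMARY=5,10,15 ON report_user;

-- Log a default row only for queries using more than 100 CPU hundredths of a second:
BEGIN QUERY LOGGING LIMIT THRESHOLD=100 CPUTIME ON adhoc_user;

-- Log detail data (objects, full SQL, steps, and EXPLAIN text) for one user:
BEGIN QUERY LOGGING WITH SQL, OBJECTS, STEPINFO, EXPLAIN ON dba_user;

-- Stop logging for a user:
END QUERY LOGGING ON etl_user;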

DBQL Log Table
Default rows are stored in DBQLogTbl, the foundation of the DBQL feature.
When you do not specify a LIMIT SUMMARY or LIMIT THRESHOLD option, one default row of query-level information is logged in DBQLogTbl for each query processed during a session that is initiated by any user for whom a query logging rule exists.
If you specify options that result in more detailed information, a default row is still generated in DBQLogTbl (except with the SUMMARY option or a query that completes within the limit specified with the THRESHOLD option), plus one or more additional logs are populated with one or more additional rows.
Note: DBQL is limited to logging information about base tables and logging direct SQL statements. Macros, views, and triggers do not result in complete logging information.

The Default Row
The fields of the default row provide general query information that is usually adequate for investigating a query that is interfering with performance. When no options are specified, a default row includes at least the following (a sample query over these fields appears after the list):
  • User ID and user name under which the session being logged was initiated
  • Unique ID for the process, session, and host (client) connection
  • Account string, expanded as appropriate, that was current when the query completed
  • First 200 characters of the query SQL statement
  • Start and last response times
  • CPU and I/O statistics
  • Default Database name that was current when the query completed
  • Numerous other performance metrics
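
As a sketch of how the default rows can be queried, the SELECT below reads DBC.DBQLogTbl directly; the column names shown (UserName, AcctString, StartTime, FirstRespTime, AMPCPUTime, TotalIOCount, DefaultDatabase, QueryText) are the commonly documented ones, so verify them against your release:

-- Sketch only; verify column names against your Teradata release.
SELECT UserName,
       AcctString,
       StartTime,
       FirstRespTime,
       AMPCPUTime,
       TotalIOCount,
       DefaultDatabase,
       QueryText
FROM DBC.DBQLogTbl
WHERE CAST(StartTime AS DATE) = CURRENT_DATE
ORDER BY AMPCPUTime DESC;
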
DBQL Object Table
With the OBJECTS option, DBQL logs one row in DBQLObjTbl for each data object referenced by the query. An object can be a database, data table, column, secondary index, join index, or journal table. (If the object is a secondary index, its number is logged rather than a name.) Also logged is a count of the number of times the object was accessed by the optimizer.
Note: Any DBC database tables and columns used by the system while processing a query are not reflected in the DBQL object rows for that query. This means, for example, that statements like CREATE TABLE or SELECT FROM DBC.xxx will not have objects logged through DBQL because they deal with DBC tables and columns.
Only objects from base tables are logged. Objects from views and macros are not logged.

Scenario 1
User runs a SELECT statement that joins two tables owned by the same database:
  • One row for the query is logged in DBQLogTbl
  • Rows are logged in DBQLObjTbl as follows:
    • A row for the database
    • One row for each table
    • One row for each selected column
Scenario 2
User runs a query that causes the Optimizer to reference the same object twice:
  • One row for the object is logged in DBQLObjTbl
  • The value in the FreqofUse field is incremented to 2.
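
To see how the scenarios above show up in the object log, a query such as the following sketch can be used; the QueryID literal is a placeholder and the column names (ObjectDatabaseName, ObjectTableName, ObjectColumnName, ObjectType, FreqofUse) are assumed from the standard DBQLObjTbl layout:

-- Sketch; placeholder QueryID, column names assumed from the standard layout.
SELECT ObjectDatabaseName,
       ObjectTableName,
       ObjectColumnName,
       ObjectType,
       FreqofUse
FROM DBC.DBQLObjTbl
WHERE QueryID = 123456789012345   -- hypothetical query ID taken from DBQLogTbl
ORDER BY ObjectDatabaseName, ObjectTableName;
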
DBQL SQL Table
The DBQLSQLTbl is populated if you specify the SQL option in addition to default information. The SQL option logs the full statement text, no matter how large, into DBQLSqlTbl. Multiple rows are generated as necessary.
Note: The full SQL text of a CREATE/ALTER/REPLACE PROCEDURE/FUNCTION is not logged in DBQLSQLTbl when the statement is submitted with the option to not save the source in the database.
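
To reassemble the full text of a logged query, the SQL rows can be joined back to the default rows. This is only a sketch and assumes the usual DBQLSqlTbl columns (SqlRowNo, SqlTextInfo):

-- Sketch; assumes standard DBQLSqlTbl columns (QueryID, SqlRowNo, SqlTextInfo).
SELECT l.UserName,
       l.StartTime,
       s.SqlRowNo,
       s.SqlTextInfo
FROM DBC.DBQLogTbl l
JOIN DBC.DBQLSqlTbl s
  ON l.ProcID = s.ProcID
 AND l.QueryID = s.QueryID
WHERE CAST(l.StartTime AS DATE) = CURRENT_DATE
ORDER BY l.QueryID, s.SqlRowNo;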

DBQL STEP Table
DBQLStepTbl is populated if you specify the STEPINFO option. When the query completes, the system logs one row for each query step, including parallel steps.
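
A sketch for inspecting the step rows of a single query; the QueryID literal is a placeholder and the step columns shown are assumed from the standard DBQLStepTbl layout:

-- Sketch; step column names assumed, QueryID is a placeholder.
SELECT StepLev1Num,
       StepLev2Num,
       StepName,
       StepStartTime,
       StepStopTime
FROM DBC.DBQLStepTbl
WHERE QueryID = 123456789012345   -- hypothetical query ID taken from DBQLogTbl
ORDER BY StepLev1Num, StepLev2Num;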

DBQL EXPLAIN Table
This table is populated if the WITH EXPLAIN option is requested in addition to the default information row. If the EXPLAIN text is larger than 64 KB, multiple rows are generated. DBQLExplainTbl is populated with the unformatted EXPLAIN text.
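
A minimal sketch for pulling the EXPLAIN rows of one query; ExpRowNo and ExplainText are the assumed DBQLExplainTbl columns and the QueryID literal is a placeholder:

-- Sketch; assumes the ExpRowNo/ExplainText columns of DBQLExplainTbl.
SELECT ExpRowNo,
       ExplainText
FROM DBC.DBQLExplainTbl
WHERE QueryID = 123456789012345   -- hypothetical query ID taken from DBQLogTbl
ORDER BY ExpRowNo;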






Thursday, January 1, 2015

Collect Statistics Recommendations in Teradata TD14


Statistical information is vital for the optimizer when it builds query plans. But collecting statistics can involve time and resources. By understanding and combining several different statistics gathering techniques, users of Teradata can find the correct balance between good query plans and the time required to ensure adequate statistical information is always available.

Collect Full Statistics
• PI of small tables (less than 100 rows per AMP)
• Non-indexed columns used in WHERE or join constraints
• Composite columns that often appear together in equality conditions
• All NUSIs with an uneven distribution of values (highly biased values)
• NUSIs used in join conditions
• UPIs and USIs used in range constraints
• NUPIs that are noticeably more or less than 75% unique (with no statistics, the optimizer assumes 75% uniqueness)
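
A few hedged examples of full-statistics collection for the cases listed above; the database, table, and column names are hypothetical:

-- Hypothetical database/table/column names; TD14 syntax.
-- Full statistics on the PI of a small table:
COLLECT STATISTICS COLUMN (cust_id) ON demo_db.small_dim;

-- Full statistics on a non-indexed column used in WHERE/join constraints:
COLLECT STATISTICS COLUMN (order_status) ON demo_db.orders;

-- Multi-column statistics on columns that appear together in equality conditions:
COLLECT STATISTICS COLUMN (store_id, sales_date) ON demo_db.daily_sales;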

Can rely on Random AMP Sample
• UPIs or USIs only used in equality constraints
• NUSIs with an even distribution of values
• NUPIs with an even distribution

Option to use USING SAMPLE
• Unique index columns
• Nearly unique columns or indexes (columns or indexes that are over 95% unique)
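
A hedged example of sampled statistics on a nearly unique column (names are hypothetical):

-- Hypothetical names; sampled statistics on a nearly unique column.
COLLECT STATISTICS USING SAMPLE COLUMN (order_id) ON demo_db.orders;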

Other Considerations

1. Optimizations such as nested join, partial GROUP BY, and dynamic partition elimination will not be chosen unless statistics have been collected on the relevant columns.


NUPIs that are used in join steps in the absence of collected statistics are assumed to be 75% unique, and the number of distinct values in the table is derived from that. A NUPI that is far from being 75% unique (for example, it is 90% unique, or on the other side, 60% unique or less) will benefit from having statistics collected. However, if it is close to 75% unique, then random AMP samples are adequate. To determine the uniqueness of a NUPI before collecting statistics, you can issue this SQL statement:
EXPLAIN SELECT DISTINCT nupi-column FROM table;

2. For a partitioned primary index table, it is recommended that you always collect statistics on:
• PARTITION. This tells the optimizer how many partitions are empty, and how many rows are in each partition. This statistic is used for optimizer costing.
• The partitioning column. This provides cardinality estimates to the optimizer when the partitioning column is part of a query’s selection criteria.
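
For example (table and column names are hypothetical):

-- Hypothetical PPI table; partition statistics recommended for every PPI table.
COLLECT STATISTICS COLUMN (PARTITION) ON demo_db.sales_ppi;
COLLECT STATISTICS COLUMN (sales_date) ON demo_db.sales_ppi;   -- the partitioning column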

3. For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table’s primary index (PI):
• (PARTITION, PI). This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions. It helps in costing the sliding-window and rowkey-based merge join, as well as dynamic partition elimination.
• (PARTITION, PI, partitioning column). This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination. It is used in rowkey join costing.
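
When the partitioning column is not part of the PI, the multi-column variants might look like the following sketch (names are hypothetical):

-- Hypothetical PPI table with PI (store_id) and partitioning column sales_date.
COLLECT STATISTICS COLUMN (PARTITION, store_id) ON demo_db.sales_ppi;
COLLECT STATISTICS COLUMN (PARTITION, store_id, sales_date) ON demo_db.sales_ppi;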

4. Random AMP sampling has the option of pulling samples from all AMPs, rather than from a single AMP (the default). All-AMP random AMP sampling has these particular advantages:
• It provides a more accurate row count estimate for a table with a NUPI. This benefit becomes important when NUPI statistics have not been collected (as might be the case if the table is extraordinarily large), and the NUPI has an uneven distribution of values.
• Statistics extrapolation for any column in a table will not be attempted for small tables or tables whose primary index is skewed (based on full statistics having been collected on the PI), unless all-AMP random AMP sampling is turned on. Because a random AMP sample is compared against the table row count in the histogram as the first step in the extrapolation process, an accurate random AMP sample row count is critical for determining if collected statistics are stale, or not.
(Set the DBSControl setting for random AMP samples to all AMPs instead of the default of a single AMP to leverage extrapolation on small tables or skewed tables.)