Statistical information is vital for the optimizer when it builds query plans. But collecting statistics can involve time and resources. By understanding and combining several different statistics gathering techniques, users of Teradata can find the correct balance between good query plans and the time required to ensure adequate statistical information is always available.
Collect Full Statistics
•PI of small tables (less than 100 rows per AMP)
•Non-indexed columns used in where or join constraints
•Composite columns that often appear together in conditions with equality
•All NUSIs with an uneven distribution (High Bias Values)
•NUSIs used in Join conditions
•UPIs and USIs used in range constraints
•NUPIs that are more than or less than 75% unique (with no statistics, optimizer assumes 75% unqiueness)
Can rely on Random AMP Sample
•UPIs or USIs only used in equality constraints
•NUSIs with even distribution of values
•NUPIs with even distribution
Option to use USING SAMPLE
• Unique index columns
• Nearly-unique columns or indexes (Columns or indexes which are over 95% unique)
Other Considerations
1. Optimizations such as nested join, partial GROUP BY, and dynamic partition elimination will not be chosen unless statistics have been collected on the relevant columns.
Statistics must be collected on relevant columns to use optimizations such as:
•Nested Join
•Partial Group By
•Dynamic partition elimination
NUPIs that are used in join steps in the absence of collected statistics are assumed to
be 75% unique, and the number of distinct values in the table is derived from that. A
NUPI that is far off from being 75% unique (for example, it’s 90% unique, or on the other
side, it’s 60% unique or less) will benefit from having statistics collected. However, if it
is close to being 75% unique, then random AMP samples are adequate. To determine
what the uniqueness of a NUPI is before collecting statistics, you can issue this SQL
statement:
EXPLAIN SELECT DISTINCT nupi-column FROM table;
2. For a partitioned primary index table, it is recommended that you always collect statistics on:
• PARTITION. This tells the optimizer how many partitions are empty, and how many rows are in each partition. This statistic is used for optimizer costing.
• The partitioning column. This provides cardinality estimates to the optimizer when the partitioning column is part of a query’s selection criteria.
3. For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table’s primary index (PI):
• (PARTITION, PI). This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions. It helps in costing the sliding-window and rowkey-based merge join, as well as dynamic partition elimination.
• (PARTITION, PI, partitioning column). This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination. It is used in rowkey join costing.
4. Random AMP sampling has the option of pulling samples from all AMPs, rather than from a single AMP (the default). All-AMP random AMP sampling has these particular advantages:
• It provides a more accurate row count estimate for a table with a NUPI. This benefit becomes important when NUPI statistics have not been collected (as might be the case if the table is extraordinarily large), and the NUPI has an uneven distribution of values.
• Statistics extrapolation for any column in a table will not be attempted for small tables or tables whose primary index is skewed (based on full statistics having been collected on the PI), unless all-AMP random AMP sampling is turned on. Because a random AMP sample is compared against the table row count in the histogram as the first step in the extrapolation process, an accurate random AMP sample row count is critical for determining if collected statistics are stale, or not.
(Set the DBSControl setting for Random AMP samples to All AMPs instead of the default of single AMP to leverage extrapolation on small tables or skewed tables)
No comments:
Post a Comment