Friday, September 26, 2025

Handling Teradata SKEW with RANDOM function

Data skew in Teradata occurs when data is unevenly distributed across the AMPs (Access Module Processors), leading to performance issues, particularly during joins or aggregations. The RANDOM() function can be employed as a salting technique to mitigate this skew.

Using RANDOM() to Handle Skew:
  • Identify the Skewed Column: 
    Determine the column in a table that exhibits significant data skew, meaning a few values occur much more frequently than others.
  • Generate Random Values: 
    Create a new column or a temporary column populated with random numbers using the RANDOM() function. The range of random numbers should be chosen based on the desired level of distribution. For example, RANDOM(1, 100) generates random integers between 1 and 100.
  • Combine with Skewed Column: 
    Concatenate the skewed column with the newly generated random value. This creates a composite key.
    SELECT 
        PrimaryKeyColumn,
        SkewedColumn,
        TRIM(CAST(RANDOM(1,100) AS VARCHAR(3))) AS RandomSalt,
        SkewedColumn || TRIM(CAST(RANDOM(1,100) AS VARCHAR(3))) AS SaltedJoinColumn
    FROM 
        YourSkewedTable;

  • Perform Join/Aggregation on Composite Key: When performing joins or aggregations, use this composite key instead of the original skewed column. This distributes the frequently occurring values across different AMPs, as the random component ensures unique hash values for each instance of the skewed value.
    CREATE VOLATILE TABLE NotSkewingTable AS
    (
        SELECT 
            PrimaryKey, 
            SkewedColumn,
            SkewedColumn || TRIM(CAST(RANDOM(1,100) AS VARCHAR(3))) AS SaltedJoinColumn
        FROM 
            YourSkewedTable
    ) WITH DATA
    ON COMMIT PRESERVE ROWS;

    SELECT 
        T1.ColumnA, 
        T2.ColumnB
    FROM 
        AnotherTable T1
    JOIN 
        NotSkewingTable T2
    ON 
        T1.JoinColumn = T2.SaltedJoinColumn;

Benefits:
  • Improved Parallelism: Distributes the workload more evenly across AMPs, leading to faster query execution.
  • Reduced Spool Space: Minimizes the need for large intermediate spool files, which can consume significant resources.
  • Enhanced Join Performance: Addresses the bottleneck caused by skewed join keys.
Considerations:
  • Overhead: 
    Generating and concatenating random values adds a small overhead, but the performance gains typically outweigh this.
  • Cardinality: 
    The range of random numbers should be sufficient to effectively distribute the skewed data.
  • Data Types: 
    Ensure compatible data types when concatenating the skewed column and the random number.
  • Temporary Nature: 
    This technique is often applied in temporary or volatile tables for specific operations to avoid altering the base table structure.

Finding Skewed Tables by Querying the TableSizeV View

Finding SKEW

In this example, the SELECT request looks for poorly distributed tables by displaying the average, minimum, and maximum of the CurrentPerm figures allocated on each AMP to every table in the USER database. Each table is reported separately, ordered by name.

SELECT
  TableName (FORMAT 'X(20)'),
  MIN(CurrentPerm) AS "AMP Minimum",
  AVG(CurrentPerm) AS "AMP Average",
  MAX(CurrentPerm) AS "AMP Maximum"
FROM DBC.TableSizeV
WHERE DatabaseName = 'USER'
GROUP BY TableName
ORDER BY TableName; 

Result:

TableName              AMP Minimum   AMP Average   AMP Maximum
--------------------   -----------   -----------   -----------
employee_nupi_ondept   4,096         15,488        30,208
employee_upi_onempid   18,944        19,200        19,968

The result displays two tables. Notice the results show that:

  • The table Employee_upi_onempid is evenly distributed. CurrentPerm is similar across all vprocs (the minimum and maximum are close to the average). Permanent space is relatively evenly distributed across all AMPs in the system.
  • The table Employee_nupi_ondept is poorly distributed. The CurrentPerm figures range from a minimum of 4,096 bytes to a maximum of 30,208 bytes on different AMPs, indicating a wide variance from the average.