Tuesday, May 13, 2014

How to use partitioned primary index (PPI) Part 1

This post is about row partitioning and will not discuss columnar.

What is partitioning?

To explain it correctly, let's get back to the basics...
Each Teradata table (except NoPI tables) has a "Primary Index", aka. PI, which is not a physical index but rather a logical construct: one or more columns of the table whose values are the input of the hashing algorithm. The hash value determines two things:
  • Which AMP will store the record
  • Storing order of the records within the AMPs
If the PI is non-partitioned, the records are stored in the order of the hash value of the PI.

If you use (row) partitioning, you define it at the Primary Index.
In this case Teradata associates a 2-byte (or, from V14.10, 2- or 8-byte) "partition code" with each record*, and the storing order becomes <partition code>,<hash value> (aka. the RowKey).
That way partitions are not sub-tables or other physical objects; they only influence the record storing order.

* This implies that no more than ~64k (2-byte) or ~9.2 quintillion (8-byte) partitions can exist. For details, read the documentation of the appropriate Teradata version.
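To see this in practice, here is a minimal query sketch (it uses the tablePPI table defined below in the "How to define?" section; HASHROW/HASHBUCKET/HASHAMP and the PARTITION keyword are standard Teradata features):

select
  Trx_id
, Trx_dt
, HASHAMP(HASHBUCKET(HASHROW(Trx_id, Trx_dt))) as Owning_AMP      -- which AMP stores the row
, PARTITION                                    as Partition_code  -- the stored partition code
from tablePPI
order by Owning_AMP, Partition_code;

Within one AMP the rows are stored in <Partition_code>,<hash value> (RowKey) order.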

What is the difference between PPI and NUSI?

NUSI (Non Unique Secondary Index) can serve similar purposes, but it is a completely different construct.
A NUSI is a separate physical subtable, distributed in line with the base table, but with a different (value-based) ordering.
For details please read Teradata documentation.

How to define?

Non partitioned table:
create table tablePI
(
  Trx_id Integer
, Trx_dt Date
)
PRIMARY INDEX (Trx_id)


Partitioned table:
create table tablePPI
(
  Trx_id Integer
, Trx_dt Date
)
PRIMARY INDEX (Trx_id
, Trx_dt**)
PARTITION BY RANGE_N(Trx_dt BETWEEN DATE '2010-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' DAY , NO RANGE, UNKNOWN)


Highlights
  • **Partitioning key (Trx_dt here) can be part of the PI or not. This is very important, see below.
  • Partitioning can be single-level or multi-level (MLPPI)***
  • RANGE_N or CASE_N functions can be used for determining partition code
  • RANGE_N requires constant (literal) interval endpoints and interval lengths.
  • The NO RANGE and UNKNOWN partitions store the out-of-range and NULL-valued records, respectively
***MLPPI is a technique where multiple, nested partitioning levels are defined on the table. Logically it looks like sub-partitions, but in practice it only influences the calculation of the partition code, which remains a single linear 2/8-byte value over the whole table (see the sketch below).
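For illustration, here is a minimal MLPPI sketch (the tableMLPPI table and the Store_id ranges are hypothetical):

create table tableMLPPI
(
  Trx_id Integer
, Trx_dt Date
, Store_id Integer
)
PRIMARY INDEX (Trx_id)
PARTITION BY (
  RANGE_N(Trx_dt BETWEEN DATE '2010-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' DAY , NO RANGE, UNKNOWN)  -- level 1: 1461 day ranges + NO RANGE + UNKNOWN
, RANGE_N(Store_id BETWEEN 1 AND 100 EACH 10 , NO RANGE, UNKNOWN)                                            -- level 2: 10 ranges + NO RANGE + UNKNOWN
);

The combined partition code is still a single linear value: 1463 * 12 = 17556 combinations, well below the 2-byte limit of 65535.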

Pros - Cons of using PPI

PPI is a very useful feature, but it is not a silver bullet to be used everywhere. Consider the trade-offs:
  • (+) Partition elimination
    Only the relevant partitions are scanned while accessing data
  • (+) Interval filtering is supported
  • (+) Accelerates INSERTs
    When we load incremental data into a populated table, typically fewer data blocks are affected, since only a few partitions are involved (if date is the partitioning basis)
  • (-) 2 or 8 bytes extra space allocation per record
  • (-) Compression is not allowed on PartKey column
  • (-) PartKey inclusion problem (see below)
  • (-) Partition elimination works only with literals
    Sub-queries may cause full table scans (but see the Use cases section below)

Design aspects

RANGE_N or CASE_N

These functions are used to define the partitioning. RANGE_N maps date (or integer) intervals onto partitions, while CASE_N is like a CASE-WHEN-THEN expression, where the outcome determines the partition.

Typically RANGE_N is used when we partition a transaction table by its date or timestamp, while CASE_N is popular in special cases like categorization. You can use multiple columns in the logical expression, but take care: all of them must appear in the filter condition to enable partition elimination.
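For illustration, a minimal CASE_N sketch (the customerPPI table and its columns are hypothetical):

create table customerPPI
(
  Cust_id Integer
, Balance Decimal(12,2)
)
PRIMARY INDEX (Cust_id)
PARTITION BY CASE_N(
  Balance <  0          -- negative balances
, Balance <  100000     -- 0 .. 99999.99 (conditions are evaluated in order)
, Balance >= 100000     -- large balances
, NO CASE, UNKNOWN);    -- everything else / NULL balance

An equality or range filter on Balance can then allow the optimizer to restrict the scan to the relevant partition(s).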

RANGE_N: what interval size?

It depends on the granularity of the data, the granularity of the filtering, and how long an interval must be stored in the table. Usually daily partitioning is ideal.

RANGE_N: interval extension or intervals in advance?

If we load transactional data into our partitioned table, the date column we use as the partitioning key keeps receiving later and later dates, while our partition range definition is finite.
Partition ranges can be added to the RANGE_N definition periodically (depending on the version), or we can define partitions far in advance (365 partitions are needed for a year, so 65k partitions cover ~180 years, which is more than enough). Note that empty partitions do not allocate space.

One of the methods above should be applied, otherwise the NO RANGE partition will grow excessively, which causes performance degradation due to less effective partition elimination.
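A sketch of the periodic extension approach (it assumes the tablePPI table from above; the exact restrictions of adding ranges depend on the Teradata version and on whether the NO RANGE partition already contains rows):

ALTER TABLE tablePPI
MODIFY PRIMARY INDEX
ADD RANGE BETWEEN DATE '2014-01-01' AND DATE '2014-12-31' EACH INTERVAL '1' DAY;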

Partitioning Key: include in PI or not?

This is the tricky point.
The partitioning key is the column (or columns) that determines the partition, i.e. the one used in the RANGE_N/CASE_N definition. We can decide whether to include it in the Primary Index or not.

Let's take an example. We have a master-detail pair of tables, nicely "equi-PI"-ed for effective join:

CREATE TABLE ORDER_HEAD
(
  ORDER_NO INTEGER
, ORDER_DT DATE
) UNIQUE PRIMARY INDEX (ORDER_NO);

CREATE TABLE ORDER_ITEM
(
  ORDER_NO INTEGER
, ORDER_ITEM_NO INTEGER
, PROD_NO INTEGER
) PRIMARY INDEX (ORDER_NO);


We modify ORDER_HEAD's PI:
UNIQUE PRIMARY INDEX (ORDER_NO, ORDER_DT)
PARTITION BY RANGE_N(ORDER_DT BETWEEN DATE '2010-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' DAY , NO RANGE, UNKNOWN)

Should we include ORDER_DT or not? Which is better, what is the difference?
  • Not include
    ORDER_HEAD and ORDER_ITEM tables will have the same AMP distribution, but a different physical order within the AMPs.
    Each join operation then requires either sorting the selected ORDER_HEAD records in spool, or merge joining the ORDER_ITEM table sequentially against each selected non-empty partition of ORDER_HEAD (a so-called sliding-window merge join)
  • Include
    ORDER_HEAD and ORDER_ITEM tables will have different AMP distributions, so each join operation requires redistribution. Why do we not use the same PI on ORDER_ITEM? Because we do not have that column there.
Neither of the above is acceptable in many cases. What should we do? In this case I would copy ORDER_DT into the ORDER_ITEM table as well, and use the same "included" version of the PI. It requires some more space and load-time logic, but gives a great gain when accessing the data (see the sketch below).
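A sketch of what the matched pair could look like, following the recommendation above (the ORDER_DT column of ORDER_ITEM is the one we copy over at load time):

CREATE TABLE ORDER_HEAD
(
  ORDER_NO INTEGER
, ORDER_DT DATE
) UNIQUE PRIMARY INDEX (ORDER_NO, ORDER_DT)
PARTITION BY RANGE_N(ORDER_DT BETWEEN DATE '2010-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' DAY , NO RANGE, UNKNOWN);

CREATE TABLE ORDER_ITEM
(
  ORDER_NO INTEGER
, ORDER_ITEM_NO INTEGER
, PROD_NO INTEGER
, ORDER_DT DATE          -- copied from ORDER_HEAD at load time
) PRIMARY INDEX (ORDER_NO, ORDER_DT)
PARTITION BY RANGE_N(ORDER_DT BETWEEN DATE '2010-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' DAY , NO RANGE, UNKNOWN);

This way the two tables have the same AMP distribution and the same RowKey ordering, so they can be joined with a RowKey based merge join when both ORDER_NO and ORDER_DT are in the join condition.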

Use cases

Filtering

This select will eliminate all partitions except those three:
select * from ORDER_HEAD where order_dt between '2013-12-12' (date) and '2013-12-14' (date);

This select will generate a full-table (all-rows) scan:
select * from ORDER_HEAD where cast( order_dt as char(7)) = '2013-12';

This select will also generate a full-table scan* (sub-query):
select * from ORDER_HEAD  where order_dt in (select max(calendar_date) from sys_calendar.calendar  where year_of_calendar=2013 and month_of_year=5);
Why? The optimizer has to determine which partitions to access at the time it generates the execution plan, and at that time it cannot know the result of the sub-query. That is it.

* I got a helpful comment on this point, asking to double check. Indeed, this information is out of date. With recent versions of Teradata (V13.10..V14.10) I experienced 3 different results:
  • Full scan
    Eg. sub-query contains a "group by"
  • Dynamic partition elimination
    Sub-query is simple, indicates "enhanced by dynamic partition elimination" section in the plan
  • Plan-time partition elimination
    Literal condition or a very simple sub-query. Parse-time evaluation enables the optimizer to determine which partitions have to be scanned. Plan: "...We do an all-AMPs ... step from 3 partitions of...". I do not know exactly what decides between full scan, dynamic or plan-time elimination... Explanations are welcome.

Join

We join two tables, T1 and T2, where a is the (common) PI column and b is the partitioning key. The overview below shows the required join preparation depending on whether the tables are partitioned and whether the partitioning key is included in the PI (the combinations not listed are the mirror images of these, with the roles of T1 and T2 swapped):

  • T1 PI:(a) joined to T2 PI:(a), on T1.a=T2.a:
    RowHash match
  • T1 PI:(a) PART(b) joined to T2 PI:(a), on T1.a=T2.a:
    T1 sorted by hash(a), or sliding-window MJ
  • T1 PI:(a) PART(b) joined to T2 PI:(a) PART(b), on T1.a=T2.a:
    T1 & T2 sorted by hash(a), or sliding-window MJ (N x M combinations)
  • T1 PI:(a) PART(b) joined to T2 PI:(a) PART(b), on T1.a=T2.a and T1.b=T2.b:
    T1 & T2 are in RowKey order: RowKey based MJ
  • T1 PI:(a,b) PART(b) joined to T2 PI:(a), on T1.a=T2.a:
    T1 redistributed & sorted by hash(a)
  • T1 PI:(a,b) PART(b) joined to T2 PI:(a) PART(b), on T1.a=T2.a:
    T1 redistributed by hash(a), T2 sorted by hash(a), then MJ
  • T1 PI:(a,b) PART(b) joined to T2 PI:(a) PART(b), on T1.a=T2.a and T1.b=T2.b:
    T2 redistributed and sorted by RowKey: RowKey based MJ
  • T1 PI:(a,b) PART(b) joined to T2 PI:(a,b) PART(b), on T1.a=T2.a and T1.b=T2.b:
    RowKey based MJ


Insert

Let's take a transaction table like ORDERS. In practice we load it periodically (e.g. daily) with the new increment, which is typically concentrated in a short interval of the transaction date/time. If the ORDERS table is not partitioned, the excellent hashing algorithm will spread the new rows evenly over the data blocks of the whole table, therefore Teradata has to modify far more data blocks than the increment itself would occupy.

But if the ORDERS table is partitioned, the physical order of the records is primarily determined by the partitioning key. This means that the increment will reside in very few partitions, close together, and the insert operation writes approximately only as many blocks as the increment itself occupies.
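For illustration, a daily-increment load sketch (the ORDERS_STG staging table and the date literal are hypothetical):

INSERT INTO ORDER_HEAD (ORDER_NO, ORDER_DT)
SELECT ORDER_NO, ORDER_DT
FROM ORDERS_STG
WHERE ORDER_DT = DATE '2013-12-15';   -- the whole increment falls into one partition, so relatively few data blocks have to be rewritten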
For more details on PPI please refer to the documentation of the appropriate Teradata version.

To be continued...

How to choose partitioned primary index (PPI)


What is the difference between NPPI and PPI?

  • NPPI: Non partitioned primary index
    The good old regular PI. The rows are distributed by HASHAMP(HASHBUCKET(HASHROW(PI))), and ordered by HASHROW(PI), nothing special
  • PPI: Partitioned primary index
    Distribution is the same, but the ordering is different: <PartitionID><HASHROW(PI)>. The <PartitionID> is a value stored in each row, allocating 2 or 8 bytes (see below).
The only difference is the storing order of the records (and the 2/8 bytes overhead).

What is PPI good for?

The partitioning feature - like in many other databases - usually addresses performance issues by eliminating needless work in specific situations.
  • SELECT
    Eligible "where conditions" result serious partition-elimination, which means that usually only a small fraction of the table should be scanned instead of the whole one.
  • INSERT
    Consider the storing order of an NPPI table: the records are in "hash" order, that is, if I insert a series of records into the table, they will reside in widely scattered data blocks. If the table is big enough, my new, e.g. 1000, records will get into ~1000 different data blocks, which means 1000 expensive "random writes". However, if my 1000 records go into a PPI table and share the same PartitionID, they will very likely get into far fewer than 1000 data blocks. In real-life situations we often write into contiguous data blocks with much cheaper "sequential writes".
  • DELETE
    Same as INSERT
  • BACKUP
    Teradata allows archiving only one or a few partitions, saving a lot of time and tape. Older data in transaction tables usually does not change, therefore it is unnecessary to back it up every time.

"Good-to-know"s

Costs of partitioning

Like all good things in the world, partitioning has trade-offs also:
  • Extra 2/8 bytes per record allocated storage space
    Depending on the maximal number of partitions. See the "Number of partitions" chapter
  • Slower "SAMPLE" scans
    Proper random sampling is more complex, since the physical storing order correlates with the partitioning value
  • Extra sort operations / Sliding window joins
    Joining to a table that has an NPPI, or a PPI with a not exactly matching definition, results in a preparation "sort" step or leads to a "sliding-window merge join", which is technically N x M merge joins between the partitions of TableA and TableB.

Number of partitions

How many partitions should I have?
How many partitions do I have?
What does an empty partition look like?
These are all interesting questions; let's analyze the Teradata implementation.

A partition is not an object; it is just a calculated (and stored) value in the record, which determines the physical storing order of the record. A partition does not allocate space; an "empty partition" simply means that no record exists with that partition's PartitionID, nothing else.
How many partitions do I have in the table? As many as there are distinct PartitionID values in the existing records, which depends on the occurring values of the partitioning column.
How many partitions can I have in the table? It depends on the table definition. One must use the RANGE_N or CASE_N function to define the PartitionID calculation, and that definition unambiguously determines how many different PartitionID values may occur. In versions up to V13.10, 65535 partitions are allowed; from V14.00 we can have as many as 9.2 quintillion (8-byte PartitionID). The table definition cannot be altered to switch between the 2- and 8-byte layouts.
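A quick way to check the actual number of used partitions (the PARTITION keyword works on any table; ORDER_HEAD is just the example table of the previous post):

SELECT COUNT(DISTINCT PARTITION) AS used_partitions FROM ORDER_HEAD;

SELECT PARTITION AS partition_id, COUNT(*) AS row_cnt
FROM ORDER_HEAD
GROUP BY 1
ORDER BY 1;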

What is the drawback of having many partitions? The sliding-window merge join. Consider including the partitioning column in the PI if possible (otherwise PI-based filtering causes as many accesses as there are partitions, see the sketch below).
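For illustration, assume a variant of the ORDER_HEAD table where the PI is (ORDER_NO) only and ORDER_DT is the partitioning key:

-- PI-only filter: the hash value is known, but the row may sit in any partition,
-- so each non-empty partition has to be probed:
SELECT * FROM ORDER_HEAD WHERE ORDER_NO = 123456;

-- Including the partitioning key narrows the access to a single partition (RowKey access):
SELECT * FROM ORDER_HEAD WHERE ORDER_NO = 123456 AND ORDER_DT = DATE '2013-12-12';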

What happens with the out-of-range records?

The PPI definition can contain the NO RANGE and NO CASE clauses. They define a partition ID for the records that fall outside the defined ranges or cases; those records go into this partition. It can be a hidden trap: if you forget to maintain the date partition definition of a transaction table, then from one moment on all new records fall into this partition. The partition keeps fattening, and the queries somehow keep slowing down...

Multi level partitioning

This is a good trick. One can define partitioning "hierarchically", which is simply a "Cartesian product" of the partitions at each level; the result is a single PartitionID. In the case of 2-byte partitioning, the "Cartesian product" must stay below 65535.

What is remarkable in the Teradata implementation of multi-level PPI? You can filter on the lower-level partitioning key(s) only, and partition elimination will still happen. How? The optimizer calculates all possible combinations and produces the list of PartitionIDs to be scanned. Excellent.
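For illustration, assume a hypothetical table partitioned at two levels (daily ranges on Trx_dt at level 1, Store_id ranges at level 2), like the tableMLPPI sketch in the previous post:

SELECT *
FROM tableMLPPI
WHERE Store_id = 42;
-- Although the level-1 key (Trx_dt) is not filtered at all, the EXPLAIN can still report
-- a step "from n partitions of tableMLPPI" instead of an all-rows scan.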

Partitioning granularity

The next good question is: how fine should I define partitioning?
It depends. Basically I'd split it into two main cases:
  • "Temporal" (date) partitioning
    The best partition size is the day. Most of the filtering is on day level, and we have ~365 days a year, not too many partitions for a lifetime. If we partition in monthly units, the partition elimination ranges become coarser, and we still have 12 partitions a year, which is still too many in the case of a PPI-NPPI join.
  • All others
    It really depends on the goal and on the value demographics. It is good to align it with the filtering pattern (what the frequent, relevant 'where' conditions look like).
I hope this helped; please ask if something is missing or confusing.

How to use query banding in Teradata?

What is queryband?

Teradata is a diligent RDBMS that sometimes runs millions of SQL statements a day. You will see them in DBQL (the DataBase Query Logging area) - if it is switched on - but it is a hard job to find your way around in that mass of queries. How can I find a specific query? What was that query run by? If I want to analyze or modify something, I need to find the source of the execution as exactly as possible.
Queryband is a labelling facility for flagging queries so that their source job/report/etc. can be found easily.

Who defines the queryband?

Setting the queryband is usually the responsibility of the query runner:

  • ETL software or solution that executes it
  • OLAP tool that issues it
  • Person, who runs it ad-hoc

How to set the queryband?

Technically it is quite simple: Teradata provides a command to set it:

SET QUERY_BAND = {'<variable1>=<value1>;<variable2>=<value2>;...' / NONE} [UPDATE] for SESSION/TRANSACTION;

, where:
<variable1>=<value1>;
The queryband can consist of an arbitrary number of "variable"-"value" pairs. Both are string values. Do not forget to put the semicolon after each variable-value pair!

NONE: clears the queryband 

UPDATE: if specified, those variables that have been previously defined are updated with the new value, and the others are added with the given value. An empty value string is valid content and will not remove the variable. Please note that deleting a variable is only possible by redefining the queryband without that specific variable.


SESSION/TRANSACTION: what it says...

Where can I check queryband?

The values are reflected in the dbc.SessionInfoX.QueryBand and the dbc.DBQLogtbl.QueryBand columns. The following examples show their content:

SET QUERY_BAND='PROJECT=TeraTuningBlog;TASK=QB_example;' for session;

(For the logged in session)
SELECT queryband FROM dbc.sessioninfoX WHERE sessionNo=session;
----------------------------------------------------
PROJECT=TeraTuningBlog;TASK=QB_example;

(For the formerly ran queries)
SELECT queryband FROM dbc.dbqlogtbl WHERE Queryid=...;
----------------------------------------------------
=S> PROJECT=TeraTuningBlog;TASK=QB_example;

(For a specific variable, eg. "PROJECT")
SELECT QB_PROJECT FROM
(
   SELECT CAST((case when index(queryband,'PROJECT=') >0 then substr(queryband,index(queryband,'PROJECT=') ) else '' end) AS VARCHAR(2050)) tmp_PROJECT
     ,CAST( (substr(tmp_PROJECT,characters('PROJECT=')+1, nullifzero(index(tmp_PROJECT,';'))-characters('PROJECT=')-1)) AS VARCHAR(2050)) QB_PROJECT
   FROM dbc.sessioninfoX 
WHERE sessionNo=session
) x ;

----------------------------------------------------
TeraTuningBlog

(Which queries have been run by the "LoadCustomers" project?)
   SELECT a.*, CAST((case when index(queryband,'PROJECT=') >0 then substr(queryband,index(queryband,'PROJECT=') ) else '' end) AS VARCHAR(2050)) tmp_PROJECT
     ,CAST( (substr(tmp_PROJECT,characters('PROJECT=')+1, nullifzero(index(tmp_PROJECT,';'))-characters('PROJECT=')-1)) AS VARCHAR(2050)) QB_PROJECT
   FROM dbc.dbqlogtbl a 
WHERE QB_PROJECT='LoadCustomers'
;

Designing querybanding

We know how to set the queryband, and it is quite easy to build into / configure in the ETL tool, OLAP software and other query-running applications. But what variables should we define, and how should we populate them? I give a best practice here, but it is just a recommendation and can be modified according to your taste.

First of all, some things to mind:
  • Use short variable names and values, since they will be logged in each DBQL record
  • Define a consistent structure across all source systems to make analysis easy
  • Record information as detailed as you need, no more, no less. Define unique values for those items you later want to differentiate. Using a lookup/hierarchy table you can easily merge what you need, but you can never drill down into what has already been aggregated.
I recommend defining these variables (an example follows after the list):
  • SYS: ID of the system that ran the query, maximum 3 characters, like INF (Informatica), MST (Microstrategy), SLJ (SLJM), BO (Business Objects), AH (ad-hoc query tool)
  • ENV: P (Production) / Tx (Test x) / Dx (Development x), the identifier of the environment; x may be omitted if it does not matter
  • JOB: the job or report that contains the specific query (its name)
  • STP: (Step) the SQL script or other sub-structure the query belongs to (its name)
  • VER: the version of the JOB; this determines the version of the script (if available)
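An illustrative example combining the variables above (the values are hypothetical):

SET QUERY_BAND = 'SYS=INF;ENV=P;JOB=LoadCustomers;STP=s030_upsert_customer;VER=2;' FOR SESSION;

With such values the dbc.dbqlogtbl records can be filtered down to a specific system, environment, job, step and version, using queries like the ones shown above.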