Thursday, June 27, 2013

What’s a DBA to do?

http://www.teradatamagazine.com/v09n02/tech2tech/applied-solutions-3-whats-a-dba-to-do/

A DATA WAREHOUSE FROM TERADATA ELIMINATES MANY MANUAL TASKS TYPICAL OF OTHER SYSTEMS.
What do experienced DBAs with an Oracle, IBM or Microsoft background need to know about managing a Teradata system? Basically, much less than they need to know about the others.
The Teradata Database is a shared-nothing massively parallel processing (MPP) relational database management system (RDBMS), making it the only commercially available RDBMS designed from the ground up for data warehousing.
Parallel processing and the automation of many typical DBA functions were created in the DNA of the Teradata Database. Because of that architecture, many functions that are manual or enhanced by wizards in other vendors’ systems are managed automatically. Consequently, the roles and responsibilities of the DBA are significantly different. Fewer tasks are required, making the system much easier to manage. Understanding those differences and how to exploit them with the RDBMS is key to driving the success of the organization.
The next sections break down how a data warehouse from Teradata differs from other systems, enabling the DBA to focus on more productive work and less on manually maintaining the data warehouse.

SYSTEM INSTALLATION

Data warehouse performance is achieved in a parallel database architecture by the “divide and conquer” method. First, the data is divided into small, equal units. Then independent software modules process those units simultaneously (i.e., in parallel) to conquer the problems (i.e., answer the queries).
The units of work and the hardware resources allocated to each parallel software module must be as equal as possible. Like a chain that is only as strong as its weakest link, the overall job cannot conclude until every unit has completed its processing. This “balanced processing” of workloads in the Teradata Database enables superior query performance.
The DBA’s objective, therefore, is to install a balanced processing platform, operating system (OS), database management software and disk subsystems. In a typical environment, these steps require careful planning with time-consuming analysis of user data and targeted queries. Anxious for a quick return on investment (ROI), however, management too often applies pressure to take shortcuts—which can have disastrous consequences.
With a Teradata purpose-built platform, the OS, database and disk subsystems are installed before system delivery. All that is needed is the size of the raw user data, the number of concurrent users and some targeted queries. This information is input into a system-sizing calculator, and a platform configuration is recommended. The result is a balanced parallel-processing platform that is tailored to the organization's application requirements.
A pre-configured, purpose-built Teradata platform relieves DBAs of the responsibility of understanding and setting the myriad OS and database initialization parameters and installation options. No longer do DBAs have to spend hours writing programs to analyze the data and determine partitioning and data placement.
Those runtime database control parameter settings are critical for performance tuning of transaction processing applications where queries are predetermined and must be tuned. However, a data warehouse is built on the concept that users are able to ask any question of any data at any time. There is no opportunity to know or tune the queries beforehand. This is best left to the Teradata Database to optimize and tune dynamically.
In short, once the Teradata platform is installed, the DBA can immediately define the databases, users and tables and load data. Users can then leverage the data warehouse as it is intended—to run queries that answer their business intelligence (BI) questions.

STORAGE MANAGEMENT

Because the Teradata storage subsystem is installed and balanced before delivery, management of the disk subsystem is greatly simplified. The DBA familiar with managing items such as disk groups, logical volumes, node groups, file system, files and tablespaces will find that those entities and concepts are nonexistent. (See table.)
All disk organization is entirely logical, as opposed to physical. (See figure.) Initially, all space in the system is allocated to a predefined system database called DBC. Using the CREATE DDL statement, the DBA defines DATABASE and USER entities. The space parameter on the CREATE statement is not a physical allocation but simply a size quota. If a database is allocated 5TB of space, that is the maximum amount of space the database is allowed to use. Anytime that database attempts to use more than 5TB, an "out of space" message results. However, the system is not out of space; the database has simply exceeded its quota.
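For illustration only (marketing_db and the sizes below are hypothetical, not from the article), a space quota is set with a single DDL statement, and the figure acts as a logical ceiling rather than a physical allocation:

CREATE DATABASE marketing_db FROM DBC AS
  PERM  = 5000000000000,   -- 5 TB quota; a logical limit, not a physical allocation
  SPOOL = 1000000000000;   -- quota for temporary query work space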

DISK FILE SYSTEM

A database management system (DBMS) is designed for storing and retrieving data. Typical file-system architectures fragment, and performance degrades over time, as inserts, updates and deletes are applied to the data.
Teradata broke all the rules with its file system design. Data is not stored in B-Tree indexes based on data values; rather, the file system is built on raw disk slices. There are no pages, BufferPools, tablespaces, extents, etc. to manage. Row lengths are variable and stored in blocks that can grow or shrink on demand. Maximum block size is configurable with row or block placement managed by the DBMS. Rows can dynamically be moved, space reclaimed or the file system defragmented on the fly. This is a transparent background process that runs continuously using available system resources. Because of this process, the DBA never has to do a re-organization, and performance is optimized on a continual basis and does not degrade with file updates.

USER MANAGEMENT

Defining users is easy with a Teradata system. There are two types: query users, who have no workspace of their own; and power users, who can create and manipulate tables within their own workspace, subject to whatever limits the DBA has placed on their space usage.
The DBA first adds the users with a CREATE USER statement, then grants them security rights to database entities. Role-based security is supported for ease of maintenance.
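A minimal sketch of that sequence follows; the user, role and database names are hypothetical:

CREATE USER analyst01 FROM marketing_db AS
  PERM     = 0,            -- a pure query user with no workspace of its own
  SPOOL    = 500000000,
  PASSWORD = temppwd123;

CREATE ROLE marketing_readers;
GRANT SELECT ON marketing_db TO marketing_readers;
GRANT marketing_readers TO analyst01;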

INDEX MANAGEMENT

DBAs need to resist the temptation to over-index the tables. Because of the powerful parallel architecture of the Teradata platform, it is unnecessary to avoid full-table scans. Therefore, far fewer indexes are needed than in other RDBMSs. In fact, as experienced in several organizations, tables having more than 80 billion rows each can be scanned in less than five minutes.
These recommended steps will help determine the number of indexes needed in the Teradata Database:
  • The DBA defines a primary index (PI) and a secondary index on any column that will participate as a foreign key in join operations. The PI is for data distribution and keyed access.
  • Once the indexes are defined, the query workload is run and the query capture facility logs the query activity.
  • The Teradata Index Wizard uses this information to recommend the addition or removal of indexes based on actual query usage.
This process saves the DBA from having to manually analyze the number of indexes.
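As a hedged sketch of the first step (the table and columns are illustrative, not from the article), the PI is declared on the table and a secondary index is added on the join column:

CREATE TABLE orders
( order_id    INTEGER NOT NULL,
  customer_id INTEGER NOT NULL,   -- foreign key used in join operations
  order_date  DATE,
  amount      DECIMAL(10,2)
)
UNIQUE PRIMARY INDEX (order_id);      -- distribution and keyed access

CREATE INDEX (customer_id) ON orders; -- secondary index on the foreign-key column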
Special indexes are available for specific performance needs. The partitioned primary index (PPI), for one, can deliver dramatic results. A large transaction file that is accessed with date parameter queries is a good candidate for creating a PPI on transaction date. The Optimizer then can eliminate partitions on any date-sensitive queries with dramatic response-time reductions.
The multi-level PPI, join, aggregate and aggregate join indexes are tools that can turbo-charge certain applications.
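A minimal sketch of the transaction-date example (table, columns and date range are illustrative):

CREATE TABLE sales_txn
( txn_id   INTEGER NOT NULL,
  txn_date DATE    NOT NULL,
  amount   DECIMAL(12,2)
)
PRIMARY INDEX (txn_id)
PARTITION BY RANGE_N (txn_date BETWEEN DATE '2012-01-01'
                               AND     DATE '2013-12-31'
                               EACH INTERVAL '1' MONTH);
-- Queries that constrain txn_date let the Optimizer touch only the qualifying partitions.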

WORKLOAD MANAGEMENT

Teradata Active System Management provides the necessary tools for comprehensive workload management; therefore, no outside tools or resources are needed. The product has three components:
Dynamic Query Manager enables the DBA to classify and govern the query before its execution in the database.
Priority Scheduler defines resource partitions where varying workloads can be controlled and monitored.
Database Query Log provides post-execution performance analysis.
Because Teradata tools administer all performance management and tuning needs, the DBA no longer has to be an expert in the OS, database and third-party tools.

SYSTEM EXPANSION

One common characteristic of successful data warehouses is growth. No matter how much detail is put into planning a data warehouse, expansion needs often arise unexpectedly and in unanticipated areas. The scalable Teradata system makes growth an easy process. In fact, Teradata's Customer Support Team performs the expansion; the DBA's role is simply that of an observer.
Expanding the Teradata system is similar to ordering the initial platform. With assistance from Teradata Professional Services, the DBA determines the amount of raw data on the existing Teradata system and the number of concurrent users, as well as a few critical queries. Then those numbers are added to the anticipated growth in each area. These values are input into a system-sizing calculator, which produces the additional platform requirements.
As with the original Teradata system, the additional platform and software will be pre-configured, delivered, installed and connected to the existing platform. The data redistribution utility is then run, which automatically rebalances the data on the system. The tool relocates any data that would have belonged on the new nodes had they been installed when the data was originally loaded. Once that data is relocated, the utility removes it from the old nodes. (No data movement occurs between the original nodes.) Redistributing the data requires downtime on the system, but the process normally takes less than a shift to complete.

A BETTER VALUE

The features of the Teradata system make it ideal for data warehouse applications. The balanced, purpose-built platform arrives ready to deliver the first application generally in days, instead of weeks or months.
With the automated data management features, DBAs are freed from having to micro-manage the file system and can, therefore, engage in other tasks and responsibilities. For instance, instead of the DBA constantly writing and tuning queries, the query optimizer allows the user to ask any question, anytime. The support and freedom provided by a data warehouse from Teradata empowers DBAs to concentrate on working with the user community to deliver greater business value to their organization.

WHAT TERADATA DBAS DON’T DO:

With the automatic features included in a Teradata Database, DBAs have fewer tasks and responsibilities for implementing and maintaining the system. As identified in this partial list of duties, Teradata DBAs have never been required to:
  • Install an operating system (OS)
  • Understand and set extensive OS tuning parameters
  • Install the Teradata Database
  • Understand and set extensive Teradata Database parameters
  • Write programs/execute utilities that determine how to divide data into partitions
  • Determine size and physical location of each table and index partition or simple tablespace
  • Code/allocate/format partitions or underlying file structures
  • Embed partition assignment into CREATE TABLE statements
  • Determine level/degrees of parallelism to be assigned to tables/partitions/databases
  • Assign and manage special buffer pools for parallel processing
  • Associate tables/queries with parallel degrees
  • Code/allocate/format temporary work space

Sunday, March 24, 2013

Join Indexes


A join index joins two tables together and keeps the result set in Teradata's permanent (PERM) space.
The join index holds the result set of the two tables, and at join time the Parsing Engine (PE) decides whether it is faster to build the result set from the actual base tables or to read it from the join index.
Users never query the join index directly.
In essence, a join index is the pre-built result of joining two tables, which the Parsing Engine can use instead of performing the join on the base tables at query time.
Types of JOIN index-
1. Multi-table join index - Suppose we have two base tables, Employee and Dept, which hold employee and department data respectively. A join index on these two tables would look something like this:
Create Join Index emp_dept as
Select e.empno, e.empname, e.emp_dept, e.emp_sal, e.emp_mgr
From employee e inner join dept d
On e.emp_dept = d.deptno
Primary Index (empno);   -- a join index cannot have a UNIQUE primary index
This way the join index EMP_DEPT holds the result set of the two base tables, and at join time the PE decides whether it is faster to join the actual tables or to take the result set from this join index. So always choose the list of columns and tables for a join index wisely.

2. Single-table join index - A single-table join index duplicates a single table but changes the primary index. Users still query only the base table, and it is the PE that decides which access path is faster: the join index or the actual base table. The reason to create a single-table join index is so joins can be performed faster, because no redistribution or duplication of rows needs to occur.
Create Join Index emp_snap as
Select empno, empname, emp_dept
From employee
Primary Index (emp_dept);

3. Aggregate join index - An aggregate join index allows SUM and COUNT aggregates (and, through them, averages) to be maintained on a table. This join index is used when we regularly need to perform aggregate functions on the table's data.
Create Join Index agg_table as
Select empno, sum(emp_sal)
From emp_salary
Group by 1;

The main fundamentals of join indexes are:
1. A join index is not a pointer to data; it actually stores data in PERM space.
2. Users never query a join index directly; the PE decides which result set to use.
3. A join index is updated automatically when its base tables are changed.
4. Base tables with a join index defined on them cannot be loaded with FastLoad or MultiLoad.


Thursday, March 21, 2013

Performance Issues With Data Maintenance


The very mention of changing data on disk implies that space must be managed by the AMP(s) owning the row(s) being modified. Data cannot be changed unless it is first read from disk.
For INSERT operations, a new block might be written or an existing block might be modified to contain the new data row. The choice of which to use depends on whether or not there is sufficient space on the disk to contain the original block plus the number of bytes in the new row.
If the new row causes the block to increase beyond the current number of sectors, the AMP must locate an empty slot with enough contiguous sectors to hold the larger block. Then, it can allocate this new area for the larger block.
A DELETE is going to make one or more blocks shorter. Therefore, it should never have to find a larger slot in which to write the block back to disk. However, it still has to read the existing block, remove the appropriate rows and re-write the smaller block.
The UPDATE is more unpredictable than either the DELETE or the INSERT. This is because an UPDATE might increase the size of the block like the INSERT, decrease the size like the DELETE or not change the size at all.
A larger block might occur because of one of the following conditions:
·       A NULL value was compressed and now must be expanded to contain a value. This is the most likely situation.
·       A longer character literal is stored into a VARCHAR column.
A smaller block might occur because of one of these conditions:
·       A data value is changed to a NULL value with compression. This is the most likely situation.
·       A smaller character literal is stored into a VARCHAR column.
A block size does not change when:
·       The column is a fixed-length CHAR; regardless of the length of the actual character data value, the length stays at the maximum defined.
·       All numeric columns are stored in their maximum number of bytes.
There are many reasons for performance gains or losses. Another consideration, which was previously mentioned, is the journal entries for the Transient Journal for recovery and rollback processing. The Transient Journal is mandatory and cannot be disabled. Without it, data integrity cannot be guaranteed.

Impact of FALLBACK on Row Modification

When using FALLBACK on tables, it negatively impacts the processing time when changing rows within a table. This is due to the fact that the same change must also be made on the AMP storing the FALLBACK copy of the row(s) involved. These changes involve additional disk I/O operations and the use of two AMPs instead of one for each row INSERT, UPDATE, or DELETE. That equates to twice as much I/O activity.

Impact of PERMANENT JOURNAL Logging on Row Modification

When using PERMANENT JOURNAL logging on tables, it will negatively impact the processing time when changing rows within a table. This is due to the fact that the UPDATE processing also inserts a copy of the row into the journal table. If BEFORE journals are used, a copy of the row as it existed before a change is placed into the log table. When AFTER images are requested, a copy of the row is inserted into the journal table that looks exactly like the changed row.
There is another issue to consider for journaling, based on SINGLE or DUAL journaling. DUAL asks for a second (mirror) copy of each journal row to be inserted. It is the journal's way of providing FALLBACK protection without requiring the table itself to use FALLBACK. The caution here is that if the table is FALLBACK protected, so are the journals. This will further impact the performance of the row modification.
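Both protections are declared as table-level (and database-level) options. The sketch below uses hypothetical names and assumes a default journal table is defined for the database:

CREATE DATABASE sales_db FROM DBC AS
  PERM = 1000000000000,
  DEFAULT JOURNAL TABLE = sales_db.journals;    -- permanent journal rows are written here

CREATE TABLE sales_db.orders ,FALLBACK ,DUAL AFTER JOURNAL
( order_id INTEGER NOT NULL,
  amount   DECIMAL(10,2)
)
PRIMARY INDEX (order_id);
-- FALLBACK and DUAL AFTER JOURNAL each add extra writes to every INSERT, UPDATE and DELETE.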

Impact of Primary Index on Row Modification

In Teradata, all tables must have a Primary Index (PI). It is a normal and very important part of the storage and retrieval of rows for all tables. Therefore, there is no additional overhead processing involved in an INSERT or DELETE operation for Primary Indices.
However, when using an UPDATE and the data value of a PI is changed, there is more processing required than when changing the content of any other column. This is due to the fact that the original row must be read, literally deleted from the current AMP and rehashed, redistributed and inserted on another AMP based on the new data value.
Remember that Primary Key values do not allow changes, but Primary Index values do. Since the PI may be defined on a column that is not the Primary Key, the no-change rule does not apply. However, be aware that it will take more processing, and therefore more time, to successfully complete the operation when a PI column is being modified.
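For example, a PI-column update such as the hedged sketch below (reusing the illustrative employee/empno names from the join index post above) forces the row to be rehashed and relocated:

-- empno is the primary index; changing it deletes the row on its current AMP,
-- rehashes the new value and re-inserts the row on whichever AMP now owns it
UPDATE employee
SET    empno = 10990
WHERE  empno = 10001;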

Impact of Secondary Indices on Row Modification

In Teradata, a Secondary Index is optional. Currently, a table may have up to 32 secondary indices, and each index may be a combination of up to 16 columns within a table. Every index has its own subtable, and every unique data value in a defined index has a row in that subtable, potentially one on each AMP for a NUSI (Non-Unique Secondary Index).
When using secondary indices on tables, it may also negatively impact the processing time when changing rows within a table. This is because when a column is part of an index and its data value is changed in the base table, the index value must also be changed in the subtable. This normally requires that a row be read, deleted and inserted into a subtable when the column is involved in a USI (Unique Secondary Index). Remember that the delete and the insert will probably be on different AMPs.
For a NUSI, the processing all takes place on the same AMP. This is referred to as AMP local. At first glance this sounds like a good thing. However, the processing requires a read of the old NUSI subtable row, a modification and a rewrite. Then, most likely, it will be necessary to insert an index row into the subtable. However, if the NUSI value already exists, Teradata needs to read the existing NUSI row, append the new entry to it and re-write it back into the subtable. This is why it is important not to create a Primary Index or a Secondary Index on data that changes often.
The point of this discussion is simple. If secondary indices are used, additional processing is involved when the data value of the index is changed. This is true on an INSERT, a DELETE and an UPDATE. So, if a secondary index is defined, make sure that the SQL is using it to receive the potential access speed benefit. An EXPLAIN can provide this information. If it is not being used, drop the index.
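Checking is a one-word prefix on the query; a minimal sketch follows (table and column names are illustrative):

EXPLAIN
SELECT *
FROM   orders
WHERE  customer_id = 12345;
-- A plan step that reads along the lines of "by way of index # n" indicates the secondary
-- index is being used; an all-rows scan on this predicate suggests the index is not helping.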
As an added note to consider, when using composite secondary indices, the same column can be included in multiple indices. When this is the case, any data value change requires multiple subtable changes. The result is that the number of indices in which it is defined multiplies the previous AMP and subtable-processing overhead. Therefore, it becomes more important to choose columns with a low probability of change.

TD SQL

Determining the Release of Your Teradata System:

SELECT * FROM DBC.DBCINFO;

Fundamental Structured Query Language (SQL)

The access language for all modern relational database systems (RDBMS) is Structured Query Language (SQL). It has evolved over time to be the standard. The ANSI SQL group defines which commands and functionality all vendors should provide within their RDBMS.
There are three levels of compliance within the standard: Entry, Intermediate and Full. The three level definitions are based on specific commands, data types and functionalities. So, it is not that a vendor has incorporated some percentage of the commands; it is more that each command is categorized as belonging to one of the three levels. For instance, most data types are Entry level compliant. Yet, there are some that fall into the Intermediate and Full definitions.
Since the standard continues to grow with more options being added, it is difficult to stay fully ANSI compliant. Additionally, all RDBMS vendors provide extra functionality and options that are not part of the standard. These extra functions are called extensions because they extend or offer a benefit beyond those in the standard definition.
At the writing of this book, Teradata was fully ANSI Entry level compliant based on the 1992 Standards document. NCR also provides much of the Intermediate and some of the Full capabilities. This book indicates feature by feature which SQL capabilities are ANSI and which are Teradata specific, or extensions. It is to NCR’s benefit to be as compliant as possible in order to make it easier for customers of other RDBMS vendors to port their data warehouse to Teradata.
As indicated earlier, SQL is used to access, store, remove and modify data stored within a relational database, like Teradata. The SQL is actually comprised of three types of statements. They are: Data Definition Language (DDL), Data Control Language (DCL) and Data Manipulation Language (DML). The primary focus of this book is on DML and DDL. Both DDL and DCL are, for the most part, used for administering an RDBMS. Since the SELECT statement is used the vast majority of the time, we are concentrating on its functionality, variations and capabilities.
Everything in the first part of this chapter describes ANSI standard capabilities of the SELECT command. As the statements become more involved, each capability will be designated as either ANSI or a Teradata Extension.

Basic SELECT Command

Using the SELECT has been described as being like playing the game Jeopardy. The answer is there; all you have to do is come up with the correct question.
The basic structure of the SELECT statement indicates which column values are desired and the tables that contain them. To aid in the learning of SQL, this book will capitalize the SQL keywords. However, when SQL is written for Teradata, the case of the statement is not important. The SQL statements can be written using all uppercase, lowercase or a combination; it does not matter to the Teradata PE.
The SELECT is used to return the data value(s) stored in the columns named within the SELECT command. The requested columns must be valid names defined in the table(s) listed in the FROM portion of the SELECT.
The following shows the format of a basic SELECT statement. In this book, the syntax uses expressions like: <column-name> (see Figure 1-1) to represent the location of one or more names required to construct a valid SQL statement:
SELECT    <column-name>
FROM      <table-name> ;
The structure of the above command places all keywords on the left in uppercase and the variable information such as column and table names to the right. Like using capital letters, this positioning is to aid in learning SQL. Lastly, although the use of SEL is acceptable in Teradata, with [ECT] in square brackets being optional, it is not ANSI standard.
When multiple column names are requested in the SELECT, a comma must separate them. Without the separator, the optimizer cannot determine where one name ends and the next begins.
The following syntax format is also acceptable:
SEL[ECT] <column-name> FROM <table-name> ;
Both of these SELECT statements produce the same output report, but the stacked format shown first is easier to read and debug for complex queries. The output display might appear as:
3 Rows Returned
<column-name>
aaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbb
cccccccccccccccccc
In the output, the column name becomes the default heading for the report. Then, the data contained in the selected column is displayed once for each row returned.
The next variation of the SELECT statement returns all of the columns defined in the table indicated in the FROM portion of the SELECT.
SELECT    *
FROM      <table-name> ;
The output of the above request uses each column name as the heading, and the columns are displayed in the same sequence as they are defined in the table. Care should be taken depending on the tool used to submit the request: if the returned display is wider than the output medium (e.g., 80 characters for a terminal or 133 for paper), it may be truncated.
At times, it is desirable to select the same column twice. This is permitted and to accomplish it, the column name is simply listed in the SELECT column list more than once. This technique might often be used when doing aggregations or calculating a value, both are covered in later chapters.
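A minimal sketch of that technique, using the Student table defined just below (the calculation shown is purely illustrative):

SELECT Grade_Pt
      ,Grade_Pt * 1.10   -- the same column selected twice, the second copy inside a calculation
FROM   Student_Table ;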
The table below is used to demonstrate the results of various requests. It is a small table with a total of ten rows for easy comparison.
Student Table - contains 10 students
Student_ID   Last_Name    First_name   Class_code   Grade_Pt
PK                                     FK
UPI          NUSI                      NUSI

123250       Phillips     Martin       SR           3.00
125634       Hanson       Henry        FR           2.88
234121       Thomas       Wendy        FR           4.00
231222       Wilson       Susie        SO           3.80
260000       Johnson      Stanley      ?            ?
280023       McRoberts    Richard      JR           1.90
322133       Bond         Jimmy        JR           3.95
324652       Delaney      Danny        SR           3.35
333450       Smith        Andy         SO           2.00
423400       Larkins      Michael      FR           0.00
Figure 2-1
For example, the next SELECT might be used with Figure 2-1 to display the student number, the last name, first name, the class code and grade point for all of the students in the Student table:
SELECT *
FROM Student_Table ;
10 Rows returned
Student_ID   Last_Name    First_Name   Class_Code   Grade_Pt
423400       Larkins      Michael      FR           0.00
125634       Hanson       Henry        FR           2.88
280023       McRoberts    Richard      JR           1.90
260000       Johnson      Stanley      ?            ?
231222       Wilson       Susie        SO           3.80
234121       Thomas       Wendy        FR           4.00
324652       Delaney      Danny        SR           3.35
123250       Phillips     Martin       SR           3.00
322133       Bond         Jimmy        JR           3.95
333450       Smith        Andy         SO           2.00
Notice that Johnson has question marks in the grade point and class code columns. Most client software uses the question mark to represent missing data or an unknown value (NULL). More discussion of this condition will appear throughout this book. The other thing to note is that character data is aligned from left to right, the same as we read it, and numeric data is aligned from right to left, from the decimal point.
This SELECT returns all of the columns except the Student ID from the Student table:
SELECT First_Name, Last_Name, Class_Code, Grade_Pt
FROM   Student_Table ;
10 Rows returned
First_Name   Last_Name    Class_Code   Grade_Pt
Michael      Larkins      FR           0.00
Henry        Hanson       FR           2.88
Richard      McRoberts    JR           1.90
Stanley      Johnson      ?            ?
Susie        Wilson       SO           3.80
Wendy        Thomas       FR           4.00
Danny        Delaney      SR           3.35
Martin       Phillips     SR           3.00
Jimmy        Bond         JR           3.95
Andy         Smith        SO           2.00
There is no short cut for selecting all columns except one or two. Also, notice that the columns are displayed in the output in the same sequence they are requested in the SELECT statement.

WHERE Clause

The previous “unconstrained” SELECT statement returned every row from the table. Since the Teradata database is most often used as a data warehouse, a table might contain millions of rows. So, it is wise to request only certain types of rows for return.
By adding a WHERE clause to the SELECT, a constraint is established to potentially limit which rows are returned based on a TRUE comparison to specific criteria or set of conditions.
SELECT <column-name>
FROM   <table-name>
WHERE  <condition> ;
The conditional check in the WHERE can use the ANSI comparison operators (symbols are ANSI / alphabetic is Teradata Extension):
Comparison                ANSI symbol   Teradata extension
Equal                     =             EQ
Not Equal                 <>            NE
Less Than                 <             LT
Greater Than              >             GT
Less Than or Equal        <=            LE
Greater Than or Equal     >=            GE
Figure 2-2
The following SELECT can be used to return the students with a B (3.0) average or better from the Student table:
SELECT Student_ID, Last_Name, Grade_Pt
FROM   Student_Table
WHERE  Grade_Pt >= 3.0 ;
5 Rows returned
Student_ID   Last_Name   Grade_Pt
231222       Wilson      3.80
234121       Thomas      4.00
324652       Delaney     3.35
123250       Phillips    3.00
322133       Bond        3.95
Without the WHERE clause, the AMPs return all of the rows in the table to the user. More and more Teradata user systems are getting to the point where they are storing billions of rows in a single table. There must be a very good reason for needing to see all of them. More simply put, you will always use a WHERE clause whenever you want to see only a portion of the rows in a table.

Compound Comparisons ( AND / OR )

Many times a single comparison is not sufficient to specify the desired rows. To add more functionality to the WHERE, it is common to use more than one comparison. Unlike column names in the select list, multiple condition checks are not separated by commas. Instead, they must be connected using a logical operator.
The following is the syntax for using the AND logical operator:
SELECT <column-name>
FROM   <table-name>
WHERE  <condition> AND <condition> ;
Notice that the column name is listed for each comparison separated by a logical operator; this will be true even when it is the same column being compared twice. The AND signifies that each individual comparison on both sides of the AND must be true. The final result of the comparison must be TRUE for a row to be returned.
This Truth Table illustrates this point using AND.
First Test Result   Second Test Result   Final Result
True                True                 True
True                False                False
False               True                 False
False               False                False
Figure 2-3
When using AND, different columns must be used because a single column can never contain more than a single data value.
Therefore, it does not make good sense to issue the next SELECT using an AND on the same column because no rows will ever be returned.
SELECT Student_ID, Last_Name, Grade_Pt
FROM   Student_Table
WHERE  Grade_Pt = 3.0 AND Grade_Pt = 4.0 ;
No rows found
The above SELECT will never return any rows. It is impossible for a single column in one row to contain more than one value: no student has both a 3.0 grade average AND a 4.0 average. They might have one or the other, but never both at the same time. The AND operator requires both comparisons to be TRUE and should never be used between two comparisons on the same column.
By substituting an OR logical operator for the previous AND, rows will now be returned.
The following is the syntax for using OR:
SELECT Student_ID, Last_Name, First_Name, Grade_Pt
FROM   Student_Table
WHERE  Grade_Pt = 3.0 OR Grade_Pt = 4.0 ;
2 Rows returned
Student_ID   Last_Name   First_Name   Grade_Pt
234121       Thomas      Wendy        4.00
123250       Phillips    Martin       3.00
The OR signifies that only one of the comparisons on each side of the OR needs to be true for the entire test to result in a true and the row to be selected.
This Truth Table illustrates the results for the OR:
First Test Result   Second Test Result   Final Result
True                True                 True
True                False                True
False               True                 True
False               False                False
Figure 2-4
When using the OR, the same column or different column names may be used. In this case, it makes sense to use the same column because a row is returned when a column contains either of the specified values as opposed to both values as seen with AND.
It is perfectly legal and common practice to combine the AND with the OR in a single SELECT statement.
The next SELECT contains both an AND as well as an OR:
SELECT Student_ID, Last_Name, First_Name, Class_Code, Grade_Pt
FROM   Student_Table
WHERE  Grade_Pt = 3.0 OR Grade_Pt = 4.0 AND Class_Code = 'FR' ;
2 Rows returned
Student_ID   Last_Name   First_Name   Class_Code   Grade_Pt
234121       Thomas      Wendy        FR           4.00
123250       Phillips    Martin       SR           3.00
At first glance, it appears that the comparison worked correctly. However, upon closer evaluation it is incorrect because Phillips is a senior and not a freshman.
When mixing AND with OR in the same WHERE clause, it is important to know that the AND is evaluated first. The previous SELECT actually returns all rows with a grade point of 3.0. Hence, Phillips was returned. The second comparison returned Thomas with a grade point of 4.0 and a class code of ‘FR’.
When it is necessary for the OR to be evaluated before the AND, parentheses are used to change the priority of evaluation. A different result is seen when the OR is done first. Here is how the statement should be written:
SELECT Last_Name, Class_Code, Grade_Pt
FROM   Student_Table
WHERE  (Grade_Pt = 3.0 OR Grade_Pt = 4.0) AND Class_Code = 'FR' ;
1 Row returned
Last_Name   Class_Code   Grade_Pt
Thomas      FR           4.00
Now, only Thomas is returned and the output is correct.

Impact of NULL on Compound Comparisons

NULL is an SQL reserved word. It represents missing or unknown data in a column. Since NULL is an unknown value, a normal comparison cannot be used to determine whether it is true or false. All comparisons of any value to a NULL result in an unknown; it is neither true nor false. The only valid test for a null uses the keyword NULL without the normal comparison symbols and is explained in this chapter.
When a table is created in Teradata, the default for a column is for it to allow a NULL value to be stored. So, unless the default is over-ridden and NULL values are not allowed, it is a good idea to understand how they work.
A SHOW TABLE command (chapter 3) can be used to determine whether a NULL is allowed. If the column contains a NOT NULL constraint, you need not be concerned about the presence of a NULL because it is disallowed.
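As a quick sketch, the check is a single statement; any column defined with a NOT NULL constraint shows it in the returned DDL text:

SHOW TABLE Student_Table ;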
This AND Truth Table must now be used for compound tests when NULL values are allowed:
First Test Result   Second Test Result   Final Result
True                Unknown              Unknown
Unknown             True                 Unknown
False               Unknown              False
Unknown             False                False
Unknown             Unknown              Unknown
Figure 2-5
This OR Truth Table must now be used for compound tests when NULL values are allowed:
First Test Result   Second Test Result   Final Result
True                Unknown              True
Unknown             True                 True
False               Unknown              Unknown
Unknown             False                Unknown
Unknown             Unknown              Unknown
Figure 2-6
For most comparisons, an unknown (null) is functionally equivalent to a false because it is not a true. Therefore, when using any comparison symbol a row is not returned when it contains a NULL.
At the same time, the next SELECT does not return Johnson because all comparisons against a NULL are unknown:
SELECT *
FROM   Student_Table
WHERE  Grade_Pt = NULL ;
No rows found
V2R5: *** Failure 3731 The user must use IS NULL or IS NOT NULL to test for NULL values.
As seen in the above Truth tables, a comparison test cannot be used to find a NULL.
To find a NULL, it becomes necessary to make a slight change in the syntax of the conditional comparison. The coding necessary to find a NULL is seen in the next section.
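As a preview of that coding, a minimal sketch against the Student table replaces the comparison operator with the IS NULL keywords:

SELECT *
FROM   Student_Table
WHERE  Grade_Pt IS NULL ;
-- returns the Johnson row, whose class code and grade point are NULL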

Using NOT in SQL Comparisons

It can be fairly straightforward to request exactly which rows are needed. However, sometimes rows are needed that contain any value other than a specific value. When this is the case, it might be easier to write the SELECT to find what is not needed instead of what is needed. Then convert it to return everything else. This might be the situation when there are 100 potential values stored in the database table and 99 of them are needed. So, it is easier to eliminate the one value than it is to specifically list the desired 99 different values individually.
Either of the next two SELECT formats can be used to accomplish the elimination of the one value:
SELECT <column-name>
FROM   <table-name>
WHERE  <column-name> NOT = <value> ;

SELECT <column-name>
FROM   <table-name>
WHERE  NOT (<condition>) ;
This second version of the SELECT is normally used when compound conditions are required. This is because it is usually easier to code the SELECT to get what is not wanted and then to enclose the entire set of comparisons in parentheses and put one NOT in front of it. Otherwise, with a single comparison, it is easier to put NOT in front of the comparison operator without requiring the use of parentheses.
The next SELECT uses the NOT with an AND comparison to display seniors and lower classmen with grade points less than 3.0:
SELECT Last_Name, First_Name, Class_Code, Grade_Pt
FROM   Student_Table
WHERE  NOT (Class_Code < 'SR' AND Grade_Pt >= 3.0) ;
6 Rows returned
Last_Name    First_Name   Class_Code   Grade_Pt
McRoberts    Richard      JR           1.90
Hanson       Henry        FR           2.88
Delaney      Danny        SR           3.35
Larkins      Michael      FR           0.00
Phillips     Martin       SR           3.00
Smith        Andy         SO           2.00
Without using the above technique of a single NOT, it is necessary to change every individual comparison. The following SELECT shows this approach; notice the other change required below: the NOT of an AND is an OR.
Since you cannot have conditions like NOT >= and NOT <>, they must be converted to their opposites: < (for NOT >=) and = (for NOT <>). It returns the same 6 rows, but also notice that the AND is now an OR:
SELECT Last_Name, First_Name, Class_Code, Grade_Pt
FROM   Student_Table
WHERE  Class_Code >= 'SR' OR Grade_Pt < 3.0 ;
6 Rows returned
Last_Name    First_Name   Class_Code   Grade_Pt
McRoberts    Richard      JR           1.90
Hanson       Henry        FR           2.88
Delaney      Danny        SR           3.35
Phillips     Martin       SR           3.00
Larkins      Michael      FR           0.00
Smith        Andy         SO           2.00
Chart of individual conditions and NOT:
Condition   Opposite condition   NOT condition
>=          <                    NOT >=
<>          =                    NOT <>
AND         OR                   OR
OR          AND                  AND
Figure 2-7
To maintain the integrity of the statement, all portions of the WHERE must be changed, including AND, as well as OR. The following two SELECT statements illustrate the same concept when using an OR:
SELECT Last_Name
FROM   Student_Table
WHERE  NOT (Grade_Pt <= 2.0 OR Grade_Pt >= 3.0) ;

SELECT Last_Name
FROM   Student_Table
WHERE  Grade_Pt > 2.0 AND Grade_Pt < 3.0 ;
1 Row returned
Last_Name
Hanson
In the earlier Truth table, the NULL value returned an unknown when checked with a comparison operator. When looking for specific conditions, an unknown was functionally equivalent to a false, but really it is an unknown.
These two Truth tables can be used together as a tool when mixing AND and OR together in the WHERE clause along with NOT.
This Truth Table helps to gauge returned rows when using NOT with AND:
First Test Result          Second Test Result         Result
NOT(True) = False          NOT(Unknown) = Unknown     False
NOT(Unknown) = Unknown     NOT(True) = False          False
NOT(False) = True          NOT(Unknown) = Unknown     Unknown
NOT(Unknown) = Unknown     NOT(False) = True          Unknown
NOT(Unknown) = Unknown     NOT(Unknown) = Unknown     Unknown
Figure 2-8
This Truth Table can be used to gauge returned rows when using NOT with OR:
First Test Result          Second Test Result         Result
NOT(True) = False          NOT(Unknown) = Unknown     Unknown
NOT(Unknown) = Unknown     NOT(True) = False          Unknown
NOT(False) = True          NOT(Unknown) = Unknown     True
NOT(Unknown) = Unknown     NOT(False) = True          True
NOT(Unknown) = Unknown     NOT(Unknown) = Unknown     Unknown
Figure 2-9
There is an issue associated with using NOT. When a NOT is done on a true condition, the result is a false. Likewise, the NOT of a false is a true. However, when a NOT is done with an unknown, the result is still an unknown. Whenever a NULL appears in the data for any of the columns being compared, the row will never be returned and the answer set will not be what is expected.
Another area where care must be taken is when allowing NULL values to be stored in one or both of the columns. As mentioned earlier, previous versions of Teradata had no concept of “unknown” and if a compare didn’t result in a true, it was false. With the emphasis on ANSI compatibility the unknown was introduced.
If NULL values are allowed and there is potential for the NULL to impact the final outcome of compound tests, additional tests are required to eliminate them. One way to eliminate this concern is to never allow a NULL value in any columns. However, this may not be appropriate and it will require more storage space because a NULL can be compressed. Therefore, when a NULL is allowed, the SQL needs to simply check for a NULL.
Therefore, using the expression IS NOT NULL is a good technique when NULL is allowed in a column and the NOT is used with a single or a compound comparison. This does require another comparison and could be written as:
SELECT Last_Name, First_Name, Class_Code, Grade_Pt
FROM   Student_Table
WHERE  NOT (Class_Code < 'SR' AND Grade_Pt >= 3.0 AND Grade_Pt IS NOT NULL) ;
7 Rows returned
Last_Name    First_Name   Class_Code   Grade_Pt
Larkins      Michael      FR           0.00
Hanson       Henry        FR           2.88
McRoberts    Richard      JR           1.90
Johnson      Stanley      ?            ?
Delaney      Danny        SR           3.35
Phillips     Martin       SR           3.00
Smith        Andy         SO           2.00
Notice that Johnson came back this time and did not appear previously because of the NULL values.
Later in this book, the COALESCE will be explored as another way to eliminate NULL values directly in the SQL instead of in the database.