sunnuntai 23. elokuuta 2015

Let’s SELECT errors...

Data problems are often hard to notice when operating with SQL. With other languages like C, C#, Java etc. information is typically investigated and modified item-by-item. In SQL all the operations are set based, meaning that a single statement often deals with several rows. For example a simple SELECT statement investigates all the rows in involved tables, one way or another, and returns the desired results, hopefully...

Syntactical problems are easy to catch, you get an error. But what if your conditions are false, you don’t get an error but the results aren’t what were expected. When creating a program or modifying it, part of the testing is to ensure that the returned results are correct so this way we can deal with the initial situation. However, things get more complicated when the world around changes. Let’s take an example.

The following script creates an OrderItem table. It’s not a gem what comes to database design but sufficient to pinpoint the problem we’re discussing of:

CREATE TABLE OrderItem (
   OrderNumber int,
   Product     varchar(100),
   RowType     int CONSTRAINT ck_orderitem_rowtype 
                   CHECK (RowType IN (1,2)),   
                   -- 1=Included, 2=Excluded
   Amount      int,
   ListPrice   decimal
);

Product is the name of the item to be sold, Amount defines the quantity of items to be purchased and ListPrice is the price of a single item. RowType defines if this item is included in the order (value is one) or was it selected but then removed from the order (value of two).

Now let’s add some data

INSERT INTO OrderItem 
(OrderNumber, Product, RowType, Amount, ListPrice)
VALUES (1, 'Squared tyre', 1, 4, 100.0);

INSERT INTO OrderItem 
(OrderNumber, Product, RowType, Amount, ListPrice)
VALUES (1, 'Oil scent, 100 oz', 2, 1, 50.0);

INSERT INTO OrderItem 
(OrderNumber, Product, RowType, Amount, ListPrice)
VALUES (1, 'Haircut', 1, 1, 20.0);

Now let’s imagine that the application fetches a list of items in the order. In order to show the prices correctly, the SELECT statement needs to take the RowType into account, something like the following:

SELECT CASE oi.RowType 
          WHEN 1 THEN 'Included into order'
          WHEN 2 THEN 'Excluded from order'
       END As Status,
       oi.Product, 
       oi.Amount,
       CASE oi.RowType 
          WHEN 1 THEN oi.Amount * oi.ListPrice 
                      -- Calculate the price
          WHEN 2 THEN 0  -- Excluded so no price
       END As Price
FROM OrderItem oi
WHERE oi.OrderNumber = 1
ORDER BY oi.RowType,
         oi.Product;

The results would be

Status                Product            Amount   Price
-------------------   -----------------  ------   -----
Included into order   Haircut            1        20
Included into order   Squared tyre       4        400
Excluded from order   Oil scent, 100 oz  1        0

Also the application could fetch the items to be delivered, with the following query:

SELECT oi.Product, 
       oi.Amount,
       oi.Amount * oi.ListPrice  AS Price
FROM  OrderItem oi
WHERE oi.OrderNumber = 1
AND   oi.RowType = 1
ORDER BY oi.RowType,
         oi.Product;

And the results would be

Product        Amount   Price
------------   ------   -----
Haircut        1        20
Squared tyre   4        400

So everything is perfect, until…
One day an extra row type is defined and added into the database. It was decided that customers buying lot of items were rewarded with a gift included in the purchase, naturally free of charge. The gift is a bunch of flowers which is also sold separately. Because of this a new RowType 3 was introduced to distinguish free gift items from other items.
The following change was made to the database

ALTER TABLE OrderItem 
DROP CONSTRAINT ck_orderitem_rowtype;

ALTER TABLE OrderItem 
ADD CONSTRAINT ck_orderitem_rowtype 
CHECK (RowType IN (1,2,3));   
-- 1=Included, 2=Excluded, 3=Gift

The order 1 we used as an example had such a big value that it would get a gift so let’s add it

INSERT INTO OrderItem 
(OrderNumber, Product, RowType, Amount, ListPrice)
VALUES (1, 'Bunch of flowers', 3, 1, 5.0);

Now let’s see what happens with our original queries. The first one would return

Status                Product             Amount   Price
-------------------   -----------------   ------   -----
Included into order   Haircut             1        20
Included into order   Squared tyre        4        400
Excluded from order   Oil scent, 100 oz   1        0
NULL                  Bunch of flowers    1        NULL

And the second one

Product        Amount   Price
------------   ------   -----
Haircut        1        20
Squared tyre   4        400

The first one is more obvious you get NULL values in the result set in places where they are not expected so if someone looks carefully this may be noticed.

The second one is more dangerous, since the WHERE clause restricted the rows only to known RowTypes the gift isn’t included in the returned set of rows. Just by looking at the data nothing looks suspicious so you need to know the data behind in order to spot the problem.

If you think about other languages they have different kinds of mechanisms to handle known and unknown values. For example a switch structure has a default block that is fired for all values not listed in case statements. This same idea can be applied to some extent to queries. However, in order to ensure that false data doesn’t pass through we won’t do any default actions but deliberately generate errors.

Consider the following statement. It has a small modification in the CASE structure. RowTypes one and two are handled correctly but if an unknown RowType is spotted the ELSE branch is executed and a division by zero error is thrown.

SELECT CASE oi.RowType 
          WHEN 1 THEN 'Included into order'
          WHEN 2 THEN 'Excluded from order'
       END As Status,
       oi.Product, 
       oi.Amount,
       CASE oi.RowType 
          WHEN 1 THEN oi.Amount * oi.ListPrice
          WHEN 2 THEN 0
          ELSE  0/0
       END As Price
FROM OrderItem oi
WHERE oi.OrderNumber = 1
ORDER BY oi.RowType,
         oi.Product;

When the statement is run you get a message like following

Msg 8134, Level 16, State 1, Line 86
Divide by zero error encountered.

That would prevent a situation where an unknown RowType could cause false data to be returned. But what about the second one, it didn’t have any CASE structure in the SELECT. And even if a similar CASE would be added  to the statement it wouldn’t work since the WHERE clause takes care that no other values but 1 is returned.

Well, why not add the CASE into the WHERE clause:

SELECT oi.Product, 
       oi.Amount,
       oi.Amount * oi.ListPrice  AS Price
FROM OrderItem oi
WHERE oi.OrderNumber = 1
AND   1 = CASE oi.RowType 
             WHEN 1 THEN 1
             WHEN 2 THEN 0
             ELSE  0/0
          END
ORDER BY oi.RowType,
         oi.Product;

When the statement above is run, again, a division by zero error is returned.

So after adding the new RowType and fixing the statements the final versions could look something like these:

SELECT CASE oi.RowType 
          WHEN 1 THEN 'Included into order'
          WHEN 2 THEN 'Excluded from order'
          WHEN 3 THEN 'Gift'
       END As Status,
       oi.Product, 
       oi.Amount,
       CASE oi.RowType 
          WHEN 1 THEN oi.Amount * oi.ListPrice
          WHEN 2 THEN 0
          WHEN 3 THEN 0
          ELSE  0/0
       END As Price
FROM OrderItem oi
WHERE oi.OrderNumber = 1
ORDER BY oi.RowType,
         oi.Product;

And

SELECT oi.Product, 
       oi.Amount,
       CASE oi.RowType
          WHEN 1 THEN oi.Amount * oi.ListPrice  
          WHEN 3 THEN 0
          ELSE 0/0
       END AS Price
FROM OrderItem oi
WHERE oi.OrderNumber = 1
AND   1 = CASE oi.RowType 
             WHEN 1 THEN 1
             WHEN 2 THEN 0
             WHEN 3 THEN 1
             ELSE  0/0
          END
ORDER BY oi.RowType,
         oi.Product;

Next time a new RowType is added both of the statements will react by causing and error and will hopefully get caught already in testing.


So is this the way to always write SELECT statements? No, it isn’t. The modification in the SELECT clause is quite safe and can be used more widely but the second one, a CASE in the WHERE clause is something to be careful with. Complex structures in conditions may lead into a situation where perfectly valid indexes aren’t used because of an expression. So it’s important to always check how the statement is behaving by investigating the execution plan.

Also it’s not feasible to use this kind of structure in every place. But in critical statements it may be justified to add deliberate errors in case of unknown values.