2017-09-12

Database crackdown

It was a bright sunny day, or maybe not. Then, suddenly, a system started to misbehave.

After a while we realized it was because a select was stuck giving the same results over and over and over again — each time it ran (every few minutes), but indeed we expected it to return different rows, since in fact the data in the table were changed in the meantime…

What the heck was the database doing? (Talking about Oracle 11g release 2).

Select, update, select, update…

There's a process (A) — and no more — that every few minutes opens a cursor and starts fetching rows according to a select that looks like this (by heart and stripping away “internals”):

select
col1, col2
from
a_table
where
taken = 'N' and
active = 'Y' and
order by
datefield1 desc nulls last, datefield2;

Each fetched col1 and col2 are then elaborated (other queries involved, but not on a_table). At the end, before fetching the next row, the taken column gets updated, altogether with the datefield1, according to a certain value v; like this:

update a_table
set
taken = 'Y',
datefield1 = decode(:v, 0, datefield1, sysdate)
where
col1 = :col1;

(The column col1 is a key, so it's unique).

When there are no other rows, the transaction is committed.

At the end of that process, how many rows do you expect to have taken equal to N? Of course, you expect that count(·) to be 0.

Now, let us suppose the process (A) runs again and no changes are done in the table in the meantime. Do you think you are able to forecast what the process will do?

I would answer yes, I do think I can forecast no rows will be elaborated, because every row has taken equal to Y.

However… Usually the table a_table changes between different runs of the process (A).

There's another process, (B), that according to specific external events updates the column taken for a specific value of col1 (it depends on an input). The update is similar to the one seen above, except it updates taken with N.

What do you think the process (A) will do when it runs again?

I would say that (A) will see all the committed changes to the table a_table and it will elaborate the rows matching the where conditions, viz. (A) will process all the new rows with taken = 'N'.

Indeed, you don't simply think it is just what you believe it could happen: you think it is what must happen and you rely on it. I mean, the system is built around this certainty, and it isn't just this system, but (almost?) every system using a database.

Result cache

It happens it exists this Oracle feature, the result cache. It's a cache that caches a result set of a select. When the DB sees that you are running a query that it's already executed, it checks if its results are in a cache. If it finds the query in the cache, the DB can skip slow I/O operations (the cache is in memory) and return the cached rows.

Now, even if you are not a DB professional, you surely wonder how it happens that the same query can return different rows, if there's a cache hit (i.e. if a query is already cached). It's obvious that there must exist a condition that, when it holds, makes the cache invalid (all or part of it, but here I assume all the rows become invalid).

It means that the DB knows the results in the cache aren't valid anymore: it “drops” them and works the usual way, fetching data from datafiles or wherever. Likely it stores the new results in the cache for later use.

When do you expect the cache to become invalid?

Surely the rows fetched by the select depend on the values of the fields the “where conditions” check. (Updates to the selected fields could be done on the cached results and written back on the storage under specific conditions — but let assume that every change in those data would invalidate the cache instead: it's easier.)

This is a starting point and we don't need to go further: if the DB sees a change in those data, it must mark the result cache for that query as invalid. This way it can guarantee the basic assumption we are relying on: if data are changed (atomic updates or updates in a committed transaction), next time the process (A) runs the query, it must see the current data on the table and not an old snapshot.

How can the DB notice those changes? I imagine it tracks the dependencies of the query; the easiest dependency to track is the table itself, so that the easiest thing to do would be to invalidate the cache of every query referencing a_table whenever a change occurs on it (updates, deletions, insertions).

This sounds excessive. In fact, what if there exists no queries using col5 of a_table and someone updates that column? Of course, in this case, there's no reason to make the specific result cache invalid.

So you can imagine the Oracle's brain works hard to minimize pointless cache invalidations, since invalidating a cache (and hence forcing to re-read data with expensive I/O), when the cache could be indeed still valid, would degrade performance — and you want a cache to make performances better.

In fact, performance (speed) is the reason why you can enable this feature: whoever tunes the DB can decide to enable (serverside) this feature. There are different modes you can enable it with. Among these modes, there's the “try to cache every query you can” mode — also known as AUTO or FORCE. Otherwise (and it's the suggested setting), you use the MANUAL mode. This means that there are (as far as I can understand), two ways a query can use the result cache: adding a hint (per query) to say “use the result cache”; adding annotations on tables, to say that it's worth using the result cache for queries using that table.

There are quirks as you can imagine, but a fact must hold true: you won't give the same result set over and over again when rows on a_table were changed (in any committed transaction).

It'd break the basic of database properties — here we're beyond defeating the meaning of any of the letters in ACID1 — and it'd make the database unreliable.

There can be only a senseful explanation: a bug (which isn't excluded when you activate experimental features or you use officially not documented values for a specific configuration.)

Notes

This isn't a fantasy tale: it happened for real and the cheap explanation of the savvies wasn't “clearly it's a bug, let's see how we can fix it or let's conceive a workaround in your application”. Instead it sounded more like this: your application is doing nasty queries… you must change them2.

1. What happened here isn't just obtaining a result set which is one “time quantum” old with respect to a “changeset” happened in a now-committed transaction the result of which hasn't be seen in the moment the process (A) has run its select. Here we've a situation where the “changeset” is never seen and the select keeps giving the same result set again and again hours later the last changes were committed.

2. That is close to “you have a bug in your application (which uses our “perfect” DB), that's the reason why you obtain the same result set over and over again.”