Riddle me this.
Let's say you want to perform a simple select statement from a table in a database. And let's say that table has a primary key, meaning that it has a clustered index on some column. let's say, the ID column. And let's say you do a select of some or all of the rows in that table.
Can you say for sure that the resulting data set will have one and only one entry for each primary key in the source table?
The answer is: it depends. :)
Most people would think, and I can understand why, that if you have a clustered index on a table, and you do something simple like SELECT * FROM myTable, that you'll get one result for each row in that table. But that's not always the case.
Imagine that the table you're selecting from has an index, and that index is on the column "Lastname". And imagine that your select statement is something like: SELECT ID, Lastname FROM myTable. It's fairly likely that your query will take advantage of the index since it "covers" all the data in your query.
So what's the issue? Well, let's say when you start your query, the index data looks like this:
So now your query starts reading the rows out of the index. You've read "Allen" and "Downey" when all of a sudden, somebody changes the last name "Allen" to "Kelly". Since "Kelly" comes after "J" and before "R", the new index data looks like:
But wait, what happens? You just finished reading through Downey, but now Allen (who I guess got married to Mr. Kelly) is going to be read again. In other words, the output from your select statement will look like:
We have a duplicate primary key value in our output, and 1 more record than currently exists in the original table! What gives?
Well, it's just the way things work. When the database engine starts reading data, it reads it in the order of the underlying data source. If that happens to be a source whose order can change, all bets are off.
If you don't want this to happen, you need to use a more strict isolation level for your query. In SQL Server, the minimum isolation level required to prevent this stuff is Repeatable Read. If you don't want to change isolation levels, you can essentially accomplish the same thing by using a TABLOCK hint on your select, although SQL Server might decide to ignore this hint.
So why don't we see these issues all the time? Well, part of it is a timing issue. You need to be performing a query on the exact set of data that happens to include a row that's being changed, and your query must be against a data set that is ordered on one or more columns with changing values. (That's why I used an index in my previous example, since it's very rare for a primary key on a table to change.)
Duplicate rows aren't the only issue. Using our previous example, what happens when Jacobs is renamed to "Clark"? That row would be moved to before Downey, and if I had already read past Downey, I would miss it. In other words, your results would be entirely missing the record with the ID of 4.
So watch out people!
Oh, and for the love of baby jesus, don't use the NOLOCK hint if you care at ALL about the consistency of the resulting data. Seriously.