0 votes
208 views
in Technique by (71.8m points)

sql server - Performance of different approaches to time-based data

I'm asking this in the context of PerformanceDBA's claim in this answer to another question that this query:

SELECT  ProductId,
        Description
    FROM  Product       p,
          ProductStatus ps
    WHERE p.ProductId = ps.ProductId  -- Join
    AND   StatusCode  = 2             -- Request
    AND   DateTime    = (             -- Current Status on the left ...
        SELECT MAX(DateTime)          -- Current Status row for outer Product
            FROM  ProductStatus ps_inner
            WHERE p.ProductId = ps_inner.ProductId
            )

using a ProductStatus table that holds only an effective (start) date for a status that changes over time, will outperform this query:

SELECT  ProductId,
        Description
    FROM  Product       p,
          ProductStatus ps
    WHERE p.ProductId = ps.ProductId  -- Join
    AND   StatusCode  = 2             -- Request
    AND   getdate() BETWEEN DateFrom AND Dateto

using a ProductStatus table that holds both a start and an end date for the status.
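
For concreteness, here is roughly how I picture the two ProductStatus designs. The column types and the _Effective/_Duration suffixes are just my guesses for illustration; the real definitions are in the linked answer:

CREATE TABLE ProductStatus_Effective      -- first approach: effective (start) date only
    (
    ProductId   int       NOT NULL,
    DateTime    datetime  NOT NULL,       -- date the status takes effect
    StatusCode  tinyint   NOT NULL,
    CONSTRAINT  PK_ProductStatus_Effective PRIMARY KEY (ProductId, DateTime)
    )

CREATE TABLE ProductStatus_Duration       -- second approach: start plus end date
    (
    ProductId   int       NOT NULL,
    DateFrom    datetime  NOT NULL,
    DateTo      datetime  NOT NULL,       -- derivable from the next row's DateFrom
    StatusCode  tinyint   NOT NULL,
    CONSTRAINT  PK_ProductStatus_Duration PRIMARY KEY (ProductId, DateFrom)
    )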

While I accept the other claims made for the first approach being better than the second, I would nevertheless expect the second approach to be faster (based on my experience with Oracle only), because it merely filters the data rather than executing an additional correlated subquery and comparing against its result.

I'd like to know how Sybase or SQL Server would process these queries, and what the relative performance is in some simple tests.
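
If it helps frame the answer, the kind of simple test I have in mind is just to run both forms with the engine's work counters switched on and compare the reads and timings (SQL Server syntax shown; Sybase has equivalent SET options):

SET STATISTICS IO   ON   -- logical and physical reads per table
SET STATISTICS TIME ON   -- CPU and elapsed time
GO
-- run the subquery form and the BETWEEN form in turn, then compare the counters;
-- SET SHOWPLAN_TEXT ON (in a batch of its own) shows the chosen plans without executing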


1 Answer

0 votes
by (71.8m points)

On the one hand, it is good that you have opened a new question. On the other hand, by extracting one query and asking whether it performs faster, you lose the context of the previous question; the new question is too isolated. As I am sure you know, administering a database, managing resources (memory/cache, disk, CPU cycles), and managing the code (good or poor) that uses those resources are all part of the whole picture. Performance is a trading game; nothing is free.

  1. The foremost issue I had was the duplication of the EndDate column, which is easily derived. Duplicated columns equal Update Anomalies. Smirkingman has provided the classic example: some queries will get one result and other queries will get the other. That is simply not acceptable in large organisations, or in banks (at least in developed countries) where the data is audited and protected. You've broken a basic Normalisation rule, and there are penalties to be paid.

    • Update Anomalies: two versions (already detailed); a small data sketch follows at the end of this list. Auditors may not pass the system.

    • Table Size
      In any large table it is a problem, and especially in time series or temporal data, where the number of columns is small and the number of rows is huge. So what, some will say, disk space is cheap. Yeah, so are STDs. What matters is what it is used for, and how well one takes care of it.

      • Disk space
        May be cheap on a PC, but on a production server it is not. Basically you have added 62% to the row size (13 plus 8 equals 21 bytes) and therefore to the table size. At the bank where I am currently assigned, each department that owns the data is charged for storage as follows (SAN-based storage is all there is). Figures are per GB per month (this is not a high-end Aussie bank):

        $1.05 for RAID5 Unmirrored
        (we know it is slow, but it is cheap, just do not put important info on it, cause if it breaks, after the new disk is hot or cold-swapped in, it takes days for it to re-synch itself.)

        $2.10 for RAID5 Mirrored
        In the SAN, that is.

        $4.40 for RAID1+0
        Minimum for Production data, transaction logs backed up, and nightly database dumps.

        $9.80 for RAID1+0 Replicated
        To an identical SAN Layout at another, bomb proof, site. Production cut-over in minutes; almost zero transaction loss.

      • Memory/Cache
        Ok, Oracle does not have it, but the serious banking dbs do have caches, and they are managed. Given any specific cache size, only about 62% as many rows will fit into the same cache.

      • Logical & Physical I/O
        Which means roughly 62% more I/O to read the table, both streaming into cache and in physical disk reads.
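
    Here is the data sketch promised above. The table and column names follow the queries in the question, and the data values are invented purely to illustrate the anomaly. Once DateTo is stored redundantly, nothing in the schema forces it to agree with the next row's DateFrom, so the two query forms can disagree:

    -- Duplicated-column design, after a new status row was inserted but the
    -- transaction forgot to close off the previous row's DateTo:
    --
    --   ProductId  DateFrom    DateTo      StatusCode
    --   1          2010-01-01  9999-12-31  2            -- old row, never closed off
    --   1          2010-07-01  9999-12-31  3            -- new row
    --
    -- The duration form now reports two "current" statuses:
    SELECT  StatusCode
        FROM  ProductStatus
        WHERE ProductId = 1
        AND   getdate() BETWEEN DateFrom AND DateTo    -- returns 2 and 3

    -- The single-date design (rows (1, 2010-01-01, 2) and (1, 2010-07-01, 3))
    -- cannot disagree with itself:
    SELECT  StatusCode
        FROM  ProductStatus ps
        WHERE ProductId = 1
        AND   DateTime  = (
            SELECT MAX(DateTime)
                FROM  ProductStatus ps_inner
                WHERE ps_inner.ProductId = ps.ProductId
                )                                      -- returns 3 only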

  2. Therefore, whether the query performs better or worse in isolation is an academic issue. In the context of the above, the table is slow, performing 62% worse, all the time, on every access, and it is affecting every other user on the server. Most DBAs will not care (I certainly would not) if the subquery form performs at half the speed, because their bonus is tied to audit acceptance, not just code performance.

    • Besides, there is the added benefit of never having to revisit code, and fix up transactions due to Update Anomalies.

    • And the transactions have fewer points to update, so they are smaller; fewer blocking locks, etc.

  3. Agreed, discussion in the Comments is difficult. In my Answer, I detailed and explained two subqueries. There was a misunderstanding: you were talking about this subquery (in the WHERE clause, a table subquery) and I was talking about the other subquery (in the column list, a scalar subquery) when I said it performs as fast or faster; a sketch of the two forms follows at the end of this item. Now that that has been cleared up, I cannot say that the first query above (subquery in the WHERE clause, a table subquery) will perform as fast as the second query (with the duplicated column); the first has to perform 3 scans, where the second performs only 2. (I dare say the second will table scan, though.)

    The point is that, in addition to the isolation issue, it is not a fair comparison: I made the comment about scalar subqueries. I would not suggest that a 3-scan query is as fast as, or faster than, a 2-scan query.

    The statement I made about the 3-scan table subquery (which I quote here) needs to be taken in the full context (either that post in toto, or the above). I am not backing away from it.

    This is ordinary SQL, a subquery, using the power of the SQL engine, Relational set processing. It is the one correct method, there is nothing faster, and any other method would be slower. Any report tool will produce this code with a few clicks, no typing.

    I spend half my life removing illegal alternatives such as duplicated columns, which are predicated on the issue of performance, with their creators chanting the mantra that the table is slow, so they have "denormalised for performance". The result, predictable before I start, is a table half the size, which performs twice as fast overall. The Time Series is the most common question here (the link leads to another question, which links to another), but imagine the problem in a banking database: daily OpeningExposure and ClosingExposure per Security per Holding per UnitTrust per Portfolio.
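
    For anyone who has not read the earlier thread, the distinction is roughly the following. This is an illustrative sketch, not the exact query from the previous Answer:

    -- Scalar subquery, in the column list: returns one value per outer row.
    -- This is the form the "as fast or faster" comment referred to.
    SELECT  ProductId,
            Description,
            (SELECT MAX(DateTime)
                FROM  ProductStatus ps
                WHERE ps.ProductId = p.ProductId
                ) AS CurrentStatusDate
        FROM  Product p

    -- Table subquery, in the WHERE clause: the first query in the question,
    -- which restricts the join to the current ProductStatus row (the 3-scan form).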

  4. But let me answer a question that has not been asked. This sort of interaction is normal when working with in-house development teams; it comes up at least once a month. A crash-hot developer has already written and tested his code, using a table with a duplicated column; it flies; and now it is stalled because I won't put it in the db.

    No, I will test it within the context of the whole system and:

    • Half the time, the table goes in without the EndDate column, because there is no big deal about a half-second query now performing in one second.

    • The other half of the time, the [table subquery] performance is not acceptable, so I implement a boolean (bit) indicator to identify IsCurrent (sketched at the end of this item). That is much better than a duplicated column, and it provides 2-scan speeds.

    • Not in a million years will you get me duplicating a column; adding 62% to the table size; slowing the table down in the full multi-user context by 62%; and risking failure of an Audit. And I am not an employee; I do not get a bonus.

    Now that would be worth testing: a query with a duplicated column vs a query with an IsCurrent indicator, in the full context of overall resource use.
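
    A rough sketch of that IsCurrent alternative follows; the column name and the exact shape of the maintenance transaction are mine for illustration. The point is that only a 1-byte flag is added (not an 8-byte duplicated date), and it is maintained in the same transaction that inserts the new status row:

    -- Single-date ProductStatus plus a bit flag (no duplicated DateTo)
    SELECT  ProductId,
            Description
        FROM  Product       p,
              ProductStatus ps
        WHERE p.ProductId   = ps.ProductId
        AND   ps.StatusCode = 2
        AND   ps.IsCurrent  = 1          -- 2 scans, no correlated subquery

    -- Inside the stored proc that records a status change
    -- (@ProductId, @NewStatusCode are the proc's parameters):
    BEGIN TRAN
        UPDATE ProductStatus
            SET   IsCurrent = 0
            WHERE ProductId = @ProductId
            AND   IsCurrent = 1

        INSERT ProductStatus (ProductId, DateTime, StatusCode, IsCurrent)
            VALUES (@ProductId, getdate(), @NewStatusCode, 1)
    COMMIT TRAN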

  5. Smirkingman has brought up a good point. And I will restate it clearly, so that it does not get fragmented and then one or the other fragment gets attacked. Please do not break this up:

    A Relational Database,
    Normalised by an experienced Relational modeller, to true Fifth Normal Form
    (no Update Anomalies; no duplicated columns),
    with full Relational Compliance
    (IDEF1X, particularly relating to minimisation of Id Primary Keys; and thus not crippling the power of the Relational engine),
    will result in more, smaller tables, a smaller database,
    with fewer Indices,
    requiring fewer joins
    (that's right, more tables but fewer joins),
    and it will out-perform anything that breaks any of those rules,
    on the same hardware and enterprise db platform
    (excludes freeware, MS, Oracle; but don't let that stop you),
    in the full context of Production OLTP use,
    by at least one order of magnitude,
    and it will be much easier to use
    and to change
    (never needing "refactoring").

    I have done this at least 80 times. Two orders of magnitude is not uncommon, if I do it myself, rather than providing the framework for someone else to do it.

Neither I, nor the people I work with or who pay me, care what one query will do in isolation.

