Cookie Notice

As far as I know, and as far as I remember, nothing in this page does anything with Cookies.

2012/07/09

Petdance says if you have SELECT * in your code, you have a bug

Or, he retweeted it. I think. He knows where to yell at me if I got that wrong.

I don't really agree, except maybe the memory hit when you get tons more data than you expected or need, but the case I'm talking about here is a structured database dump. I'm proud of this query, and I'll expand on it.

        SELECT *
        FROM  sequence_sample
        LEFT JOIN accessions
            ON  sequence_sample.accession_id = accessions.accession_id
        LEFT JOIN requests
            ON  accessions.request_id        = requests.request_id 
        WHERE run_id = ?

A sample is an accession, except that a sample is an accession associated with a run, and that accession can be associated with many runs but only one request. I could see adding several joins to bring in controlled vocabularies, where, for example, the library types come in.

I've worked with SQL for a while, but only now am I integrating the joy of joins. I can thank Coding Horror for his excellent explanation on what the different joins mean, and Learning SQL for the use of ON to specify  the JOIN joins to distinguish it from WHERE which brings in the unique constraints

6 comments:

  1. My point about "SELECT *" being a bug is more that it's a bug waiting to happen. It's sloppy. I'm all about being specific in what you're fetching from the database. "SELECT *" feels like leaving off "use strict".

    For instance, if I expect that I'm always going to have column Foo returned to me in my SELECT *, but column Foo gets dropped from the table, the SQL will still execute, and my client program will probably not notice and just think that $row['Foo'] is always returning NULL.

    Here's another example: You have a JOIN with two tables and it's just a SELECT *. Table A has Foo, and B has Bar and Bat. So you get back A.Foo, B.Bar and B.Bat as columns Foo, Bar and Bat. But later on, someone adds a Foo to table B. Now your SELECT * returns Foo, Bar, Bat and Foo. And A.Foo might be different from B.Foo. Again, no warnings. However, if you'd specified "SELECT foo, bar, bat FROM A JOIN B", then you'd get a warning about ambiguous column Foo. Or if you'd specified "SELECT A.foo, b.bar, b.bat from A JOIN B", then you can add whatever you want to B and your SELECT won't care.

    There are speed issues, of course. It is never slower to specify columns. It can ONLY be faster. See http://stackoverflow.com/questions/65512/which-is-faster-best-select-or-select-column1-colum2-column3-etc for much discussion.

    But for me, my main point is that it's all about making your intentions explicit, now and for the future.

    ReplyDelete
  2. I just turned all that into a blog post on petdance.com. Thanks for the kick in the butt. :-)

    ReplyDelete
  3. Glad to help, Andy. I mostly mentioned you because of my quest for an interesting title....

    For me, I'm much more worried about columns getting added, and having a firehose query in my db access function means I won't have to change it when I change the table.

    And, an increasingly large amount of my code is taking the hashref from the DB, serializing it to JSON and handling it to my Javascript, and many JS libraries might pull from the same source. (Not as many as should. I have reinvented wheels unnecessarily and repent.)

    My joy is the discovery of the mojo behind joins, which is preferable to my previous method of making an array of hashrefs by repeated queries, and I clearly could have selected accession.accession_id accession_id , accession.request_id request_id and so on without hindering the point of the post.

    ReplyDelete
  4. Here's my blog post on the topic.

    Take a look at table aliases, Dave. Specifying "a.request_id" isn't nearly as odious as "accession.request_id".

    ReplyDelete
  5. Also consider the situation where you get a ton of data from outside that you have to process periodically. Every once in a while, the person generating the the data sets decides that some variables, even though they the exact same definition as before, should be renamed, say from ClientNo to CLIENT_ID. This change is not documented anywhere or propagated to anyone by any other means. With select *, you'd notice this much later than if you had specified column names.

    ReplyDelete
  6. Andy: I know labels. I use labels. It struck me as going into the weeds in context of this conversation.

    A.: In this work, I handle the application from the SQL to the CSS, so changes in the data structures are all me, so nobody's going to change variable names without me knowing. A generally valid concern, though.

    ReplyDelete