/var/log/rant: This is only kinda like thinking in algorithms

2015/09/24

This is only kinda like thinking in algorithms

I have a program. It has two parts: get the data and use the data.

"Get the data" involves several queries to the database to gather the data, then I munge it into the form I need. Specifically, it's about people who generate samples of DNA data (called "primary investigator" or PI for those not involved in research), a little about the samples themselves, and those that the data are shared with.

"Use the data" involves seeing how closely the reality of the file system ACLs is aligned with the ideal as expressed by the database.

I expected that I'd spend 5% of my time, at worst, in "get the data" and 95% of my time in "use the data". So much so, I found a way to parallelize that part so I could do it n projects at a time.

In reality, it's running 50-50.

It might have something to do with the lag I've added, trying to throw in debugging code. That might've made it worse.

It might have something to do with database access. For this, I think we take a step back.

We have several database tables, and while each one rarely changes, they might. So, instead of having queries all over the place, we write dump_that_table() or the like. That way, instead of digging all over the code base for SELECT * FROM that_table (which, in and of itself, is a bug waiting to happen) (also), you go to one function and get it from one place.

So, I have get_all_pi_ids() and get_pi(), which could not be pulled into a single function until I rewrote the DB handling code, which now allows me to make [ 1 : { id:i, name:'Aaron A. Aaronson", ... }, ... ] to put it in JSON terms. Right now, though, this means I make 1 + 475 database calls to get that list.

Then I get all that PI's share info. This is done in two forms: when a PI shares a project and when a PI shares everything. I start with get_own_projects() and get_other_pi_projects(), which get both cases (a project is owned by PI and a project is shared with PI). That makes it 1 + ( 3 * 475) database calls.

I think I'll stop now, because the amount of shame I feel is still (barely) surmountable, and I'm now trying to look at the solutions.

A solution is to start with the projects themselves. Many projects are on an old system and we cannot do this mess with, and there's a nice boolean where we can say AND project.is_old_system = 0 and just ignore them. Each project has an owner, and so, if we add PI to the query, we lose having to get it special. Come to think of it, if we make each PI share with herself, we say goodbye to special cases altogether.

I'm suspecting that we cannot meaningfully handle both the "share all" and the "share one" parts in one query. I'm beginning to want to add joins to MongoDB or something, which might just be possible, but my data is in MySQL. Anyway, if we get this down to 2 queries instead of the nearly 1500, that should fix a lot of the issues with DB access lag.

As, of course, will making sure the script keeps DB handles alive, which I think I did with my first interface but removed due to a forgotten bug.

So, the first step in fixing this mess is to make better "get this" interfaces, which will allow me to get it all with as few steps as possible.

(As an aside, I'll say I wish Blogger had a "code" button along with the "B" "I" and "U" buttons.)

/var/log/rant

Cookie Notice

2015/09/24

This is only kinda like thinking in algorithms

No comments:

Post a Comment