Cookie Notice

As far as I know, and as far as I remember, nothing in this page does anything with Cookies.



I've been working on a tool. I discussed a lot of it yesterday.

I had a model to get the information based on PI, and I wanted to get to what I considered the interesting bit, so it was only after the performance went from suck to SUCK that I dove back, which is what I did yesterday. Starting with the project instead of the PI made the whole thing easier. But now I'm stuck.

The only differences between these queries are on lines 14, 19 and 20. Which gets to the problem. I know that I don't need half of what I get in lines 1-11, but when I pull stuff out, I now have two places to pull it.

I have a great 90-line majesty of a query that includes six left joins, and I append different endings depending on whether I want to get everything, or a segment defined by A or B or something. I could probably tighten it up so I have SELECT FROM, the different blocks, then AND AND ORDER BY. But there we're adding complexity, and we're butting Don't Repeat Yourself (DRY) against Keep It Simple, Stupid (KISS).

I'm happy to keep it as-is, as a pair of multi-line variables inside my program. I think I'd rather have the two like this than gin up a way to give me both, so KISS over DRY, in part because I cannot imagine a third way I'd want to access this data, so we hit You Ain't Gonna Need It (YAGNI).

But if there's strong reasons why I should make the change, rather than package it up and put it in cron, feel free to tell me.


This is only kinda like thinking in algorithms

I have a program. It has two parts: get the data and use the data.

"Get the data" involves several queries to the database to gather the data, then I munge it into the form I need. Specifically, it's about people who generate samples of DNA data (called "primary investigator" or PI for those not involved in research), a little about the samples themselves, and those that the data are shared with.

"Use the data" involves seeing how closely the reality of the file system ACLs is aligned with the ideal as expressed by the database.

I expected that I'd spend 5% of my time, at worst, in "get the data" and 95% of my time in "use the data". So much so, I found a way to parallelize that part so I could do it n projects at a time.

In reality, it's running 50-50.

It might have something to do with the lag I've added, trying to throw in debugging code. That might've made it worse.

It might have something to do with database access. For this, I think we take a step back.

We have several database tables, and while each one rarely changes, they might. So, instead of having queries all over the place, we write dump_that_table() or the like. That way, instead of digging all over the code base for SELECT * FROM that_table (which, in and of itself, is a bug waiting to happen) (also), you go to one function and get it from one place.

So, I have get_all_pi_ids() and get_pi(), which could not be pulled into a single function until I rewrote the DB handling code, which now allows me to make [ 1 : { id:i, name:'Aaron A. Aaronson", ... }, ... ] to put it in JSON terms. Right now, though, this means I make 1 + 475 database calls to get that list.

Then I get all that PI's share info. This is done in two forms: when a PI shares a project and when a PI shares everything. I start with get_own_projects() and get_other_pi_projects(), which get both cases (a project is owned by PI and a project is shared with PI). That makes it 1 + ( 3 * 475) database calls.

I think I'll stop now, because the amount of shame I feel is still (barely) surmountable, and I'm now trying to look at the solutions.

A solution is to start with the projects themselves. Many projects are on an old system and we cannot do this mess with, and there's a nice boolean where we can say AND project.is_old_system = 0 and just ignore them. Each project has an owner, and so, if we add PI to the query, we lose having to get it special. Come to think of it, if we make each PI share with herself, we say goodbye to special cases altogether.

I'm suspecting that we cannot meaningfully handle both the "share all" and the "share one" parts in one query. I'm beginning to want to add joins to MongoDB or something, which might just be possible, but my data is in MySQL. Anyway, if we get this down to 2 queries instead of the nearly 1500, that should fix a lot of the issues with DB access lag.

As, of course, will making sure the script keeps DB handles alive, which I think I did with my first interface but removed due to a forgotten bug.

So, the first step in fixing this mess is to make better "get this" interfaces, which will allow me to get it all with as few steps as possible.

(As an aside, I'll say I wish Blogger had a "code" button along with the "B" "I" and "U" buttons.)


Not Done, But Done For Now

I spent some more time on it, and I figured something out.

I looked at the data, and instead of getting 1 2 3 4 NULL NULL 5 6 7, I was getting 1 2 3 4 NULL NULL 7 1 2, starting at the beginning again. So, I figured out how to do loops and made a series of vectors, containing the dates in one, and load averages per VM.

Lev suggested that this is not how a real R person would do it. True. But this works, and I know how to plot vectors but not data tables. So, a few more changes (having the date in the title is good) and I can finish it up and put it into my workflow. Yay me.


Logging, Plotting and Shoshin: A Developer's Journey

I heard about Log::Log4perl and decided that this would be a good thing to learn and to integrate into the lab's workflow.

We were having problems with our VMs and it was suggested I start logging performance metrics, so when we go to our support people, we have something more than "this sucks" and "everything's slow".

So, I had a reason and I had an excuse, so I started logging. But logs without graphs are boring. I mean, some logs can tell you "here, right here, is the line number where you are an idiot", but this log is just performance metrics, so if you don't get a graph, you're not seeing it.

That tells a story. That tells us that there was something goofy going on with genomics-test (worse, I can tell you, because we had nothing going on with genomics-test, because the software we want to test is not working yet. There was a kernel bug and a few other things that had fixed the other VMs, but not that one, so our admin took it down and started from scratch.

Look at that graph. Do you see the downtime? No?

That's the problem. This shows the last 100 records for each VM, but for hours where is no record, there should be -1 or a discontinuity or something.

I generally use R as a plotting library, because all the preformatting is something I know how to do in Perl, my language of choice, but I've been trying to do more and more in R, and I'm hitting the limits of my knowledge. My code, including commented-out bad trails, follows.

My thought was, hey, we have all the dates in the column final$datetime, and I can make a unique array of them. My next step would be to go through each entry for dates, and, if there was no genomicstest$datetime that equalled that date, I would throw in a null or a -1 or something. That's what the ifelse() stuff is all about.

But, I found, this removes the association between the datetime and the load average, and the plots I was getting were not the above with gaps, as it should be, but one where I'm still getting high loads today.

Clearly, I am looking at R as an experienced Perl programmer. You can write FORTRAN in any language, they say, but you cannot write Perl in R. The disjunct between a Perl coder codes Perl and an R coder codes R is significant. As a Perl person, I want to create a system that's repeatable and packaged, to get the data I know is there and put it into the form I want it to be. The lineage of Perl is from shell tools like sed and awk, but it has aspirations toward being a systems programming language.

R users are about the opposite, I think. R is usable as a scripting language but the general case is as an interactive environment. Like a data shell. You start munging and plotting data in order to discover what it tells you. In this case, I can tell you that I was expecting a general high (but not terribly high; we have had load averages into the hundreds from these VMs), and that the periodicity of the load came as a complete surprise to me.

(There are other other differences, of course. Perl thinks of everything as a scalar, meaning either a string or a number, or an array of scalars, or a special fast array of scalars. R thinks of everything as data structures and vectors and such. Things I need to integrate into my head, but not things I wish to blog on right now.)

The difference between making a tool to give you expected results and using a tool to identify unexpected aspects is the difference, I believe, between a computer programmer and a data scientist. And I'm finding the latter is more where I want to be.

So, I want to try to learn R as she is spoke, to allow myself to think in it's terms. Certainly there's a simple transform that gives me what I want, but I do not know it yet. As I do it, I will have to let go of my feelings of mastery of Perl and allow myself to become a beginner in R.

But seriously, if anyone has the answer, I'll take that, too.


Thoughts on Machine Learning and Twitter Tools

I have a lot of Twitter data.

I decided a few months ago to get my list of those I follow and those following me, and then find that same information about each of them.

This took quite some time, as Twitter stops you from getting too much at a time.

I found a few interesting things. One was, of all the people I follow and all following me, I had the highest number of bi-directional connections (I follow them, they follow me; let's say "friends") of any, at something like 150.

But I'm thinking about writing the next step, and I gotta say, I don't wanna.

In part, that data's dead. It's Spring me, and my Spring followers. It's Fall now, and I don't know that the data is valid anymore.

So I'm thinking about the next step.

If "Big Data" is more than a buzzword, it is working with data that never ends. I mean, my lab has nodes in a huge research cluster, that contain like a quarter-TB of RAM. That's still a mind-boggling thing when I think about it. And it's not like we let that sit empty; we use it. Well, I don't. My part of the workflow is when the users bring their data and, increasingly, when it's done and being sent on. It clearly is large data sets being handled, but it isn't "Big Data", because we finish it, package it, and give it to our users.

"Big Data", it seems to me, means you find out more now than you did before, that you're always adding to it, because it never ends. "The Torture Never Stops", as Zappa said.

In the case of Twitter, I can get every tweet in my feed and every tweet I favorite. I could write that thing and have it run regularly, starting now. Questions: What do I do? How do I do it?

Here's the Why:

There's tools out there that I don't understand, and they are powerful tools. I want to understand them. I want to write things with these things that interest me, because it's an uncommon day that I'm interested in what I code most of the time.

Then there's Twitter. You can say many many things about Twitter, and it's largely true. But, there are interesting people talking about the interesting things they do, and and other people retweeting it. With luck, I can draw from that, create connections, learn things and do interesting things.

So, while there is the "make a thing because a thing" part, I'm hoping to use this.

So, the What:

A first step is to take my Favorites data and use it as "ham" in a Bayesian way to tell what kind of thing I like. I might need to come up with a mechanism beyond unfollowing to indicate what I don't want to see. Maybe do a search on "GamerGate", "Football" and "Cats"?

Anyway, once that's going, let it loose on my feed and have it bubble up tweets that it says I might favorite, that "fit the profile", so to speak. I'm thinking my next step is installing a Bayesian library like Algorithm::Bayesian or Algorithm::NaiveBayes, running my year+ worth of Twitter favorites as ham, and have it tell me once a day the things I should've favorited but didn't. Once I have that, I'll reorient and go from there.


So You Think You CAN REST

This is notes to self more than anything. Highly cribbed from

First step toward making RESTful APIs is using the path info. If you have a program api.cgi, you can post to it, use get and api.cgi?foo=bar, or you can use path info and api.cgi/foo/bar instead.

You can still use parameters, but if you're dealing with a foo named bar, working with $api.cgi/foo/bar is shorter, because you're overloading it.

Generally, we're tossing things around as JSON, which, as object notation, is easier to convert to objects on either side of the client/server divide than XML.

You're overloading it by using request method. Generally using POST, GET, PUT and DELETE as the basic CRUD entities. You can browse to api.cgi/foo/bar and find out all about bar, but that's going to be a GET command. You can use curl or javascript or other things where you can force the request method to add and update.

This means that, in the part that handles 'foo', we handle the cases.

For read/GET, api.cgi/foo likely means you want a list of all foos, maybe with full info and maybe not, and api.cgi/foo/bar means you want all the information specific to the foo called bar.

For the rest of CRUD, api.cgi/foo is likely not defined, and should return an error as such.

So, in a sense, sub foo should be a bit like this:
sub foo {

my $id  = undef ;
if ( scalar @pathinfo ) { $id = $pathinfo[-1] }

if ( $method eq 'GET' ) {
    if ( defined $id ) {
        my $info = get_info($id);
        status_200($info) if $info;
    my $list = get_list();
    status_200($list) if $list;

if ( $method eq 'POST' ) {
    if( $param ) {
        my $response = create($param);
        status_201() if $response; 
        status_409() if $response == -409 ; # foo exists
        status_400() if $response < 0 ;

if ( $method eq 'PUT' ) {
    my $id  = undef ;
    if ( scalar @pathinfo ) { $id = $pathinfo[-1] }
    if ( $id && $param ) {
        my $response = update( $id, $param ) ;
        status_201() if $response; 
        status_400() if $response < 0 ;

if ( $method eq 'DELETE' ) {
    if ( defined $id ) {
        my $response = delete($id);
        status_400() if $response < 0 ;
        status_204(); # or 201

And those are standard HTTP status codes. Here's the Top Ten:
  • Success
    • 200 OK
    • 201 Created
    • 204 No Content
  • Redirection
    • 304 Not Modified
  • Client Error
    • 400 Bad Request
    • 401 Unauthorized
    • 403 Forbidden
    • 404 Not Found
    • 409 Conflict
  • Server Error
    • 500 Internal Server Error
Right now, I have the pathinfo stuff mostly handled in an elegant way. I see no good way of creating a big thing without using params, and the APIs generally use them too, I think.

My failing right now is that I'm not varying on request method and I'm basically sending 200s for everything, fail or not, and my first pass will likely be specific to individual modules, not pulled out to reusable code. Will have to work on that.