/var/log/rant: March 2010

2010/03/19

Working through a Module Problem

I have a module. This module is too big to be managed. So I've broken it up. It was Dumb.pm and I've broken it into Dumb/Database.pm, Dumb/HTML.pm, Dumb/This.pm, Dumb/That.pm, and Dumb/TheOtherThing.pm.

Yes, I am convinced of the low quality of this code. Yes, I wrote every dumb line of it myself. No, I will not post it for the amusement of the masses.

Under all this, there are two databases: one for test, one for production. If you want to test things, test against data that isn't vital, right? With Dumb.pm, I took something from Data::Dumper, specifically how you set indentation, so that you put in $Dumb::Database = 'test' ; if you wanted to use the test DB. Usually this is connected to an if statement, like $Dumb::Database = 'test' if defined $cgi->param('test') ; It's a nice, compact solution. But if you balkanize the code base, there's no longer one bit you can set to say "look to the DB".

My first pass semi-solved this by having hash refs for getting variables into subroutines. Calls look like Dumb::That::this_function( $hashref ), which works as long as $hashref->{database} is set, or the underlying subroutine knows that an unset var means production, not test. But I have been asked to make it more like the previous interface. More like Dumb::That::this_function( $value ).

Let me say that, if you're going to put several dozen variables through, enough so that ensuring order is a major concern, sending hashes or hash refs is a good idea. There are points where I hit that threshold, and I've kept it as the general case, even when I'm just passing one or two variables.

As far as I can figure it, this means having a reborn Dumb.pm, call it Dumber.pm, as a central module which holds the $Database var. Then, every time I want Dumb::This::subroutine(), I go for Dumber::this_subroutine(), which is a simple wrapper that handles setting $hash->{database} and passes it on to Dumb/This.pm.
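Roughly, that wrapper arrangement might look like the sketch below. Everything is squeezed into one file so it runs on its own, and the bodies are made up for illustration: Dumb::This::subroutine here is a hypothetical stand-in, not the real code, and the value key is an assumption.

```perl
#!/usr/bin/perl
use 5.010 ;
use strict ;
use warnings ;

# Hypothetical stand-in for Dumb/This.pm: takes a hash ref and
# treats an unset database key as production.
package Dumb::This ;

sub subroutine {
    my ( $args ) = @_ ;
    my $db = $args->{database} // 'production' ;
    return "ran against $db with value $args->{value}" ;
    }

# Hypothetical central module: holds the one switch and wraps the call.
package Dumber ;

our $Database = 'production' ;

sub this_subroutine {
    my ( $value ) = @_ ;
    return Dumb::This::subroutine(
        { database => $Database , value => $value } ) ;
    }

package main ;

# Callers set the switch once, Data::Dumper style, and pass plain values.
$Dumber::Database = 'test' ;
say Dumber::this_subroutine( 42 ) ;
```

The cost is one thin wrapper in Dumber.pm for every subroutine that cares about the database.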

I see this as a tree issue. Child nodes can see what's in the parent node, but parallel children can't see each other. I don't see how biting the bullet and finally getting into Object Orientation would help, and I'm resistant anyway because that's one more dang thing to learn, which would push this back. If there were another obvious solution, I wish somebody would give it to me.

As I wrap this up and resign myself to diving into this, let me mention the usefulness of passing hashes or hash refs if you're moving massive amounts of data into a subroutine. $result = my_subroutine( $a , $b, $c, $d , $e , $f , $g , $h , $j , $i , $k , $l ) will make you lose track of your variables quick, and passing them with a hash means you know what's what. CGI pushed me that way, too. Not useful for everything, but it has points.
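A minimal sketch of the hash ref style, to show why order stops mattering. The key names (database, limit, user) and the defaults are invented for the example.

```perl
#!/usr/bin/perl
use 5.010 ;
use strict ;
use warnings ;

# Hypothetical subroutine taking named arguments in a hash ref.
# Anything the caller leaves out falls back to a default.
sub my_subroutine {
    my ( $args ) = @_ ;
    my $database = $args->{database} // 'production' ;
    my $limit    = $args->{limit}    // 10 ;
    my $user     = $args->{user}     // 'nobody' ;
    return "db=$database limit=$limit user=$user" ;
    }

# Keys can come in any order, and missing ones are not a problem.
say my_subroutine( { user => 'jacoby' , database => 'test' } ) ;
say my_subroutine( { limit => 5 } ) ;
```

The first call prints db=test limit=10 user=jacoby and the second prints db=production limit=5 user=nobody.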

Higher Order MP3 Directory Organization, A First Step

I have a huge number of MP3s. I am sure I haven't heard all of them. Some have weird tags. Some have no tags. Some are not really MP3s, but "you can't download this" HTML files, or just zero-sized files. When I bump into them, I can fix these things (read: delete the bad files) but it can take some time.

I had wanted to use the power of Perl to help with this, but while there are great numbers of modules to help with just about anything, I didn't have a directory walker I liked.

Then I started trying to go through Higher Order Perl by Mark Jason Dominus. And one of the first examples is a directory walker which takes anonymous subroutines. Exactly!

First step was to make a script that counts my MP3s.

#!/usr/bin/perl
use 5.010 ;
use strict ;
use warnings ;
use Carp ;
use Data::Dumper ;
use MP3::Tag ;
use MP3::Info ;
use Digest::SHA1 ;
use lib '/home/jacoby/lib' ;
use HOP ':all' ;

my $x   = 0 ;   # start at zero so the total is not off by one

# dirwalk home directory , file handling sub , directory handling sub
dir_walk(
    '/home/jacoby/Music',
    sub {
        my $file = $_[ 0 ] ;
        return if $file !~ m/mp3$/imx ;
        $x++ ;
        } ,
    sub { },
        ) ;

say $x . ' MP3 files' ;
exit ;

Everything that claims to be an MP3 gets counted. Yay! (Just so you know, the current count is 37604.) There's lots of included modules that I don't use yet. HOP.pm simply puts MJD's directory walker into a module where I can get it on demand, so I don't have to copy and paste. Having a command-line set for the directory would be good, but not today.
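For anyone without the book handy, a walker of this shape can be sketched as below. This is my approximation in the style of the Higher Order Perl example, not the contents of HOP.pm; the name dir_walk_sketch is made up, and unlike the script above it takes the directory from the command line and skips symlinked directories to avoid loops.

```perl
#!/usr/bin/perl
use 5.010 ;
use strict ;
use warnings ;

# Calls $file_sub on every non-directory and $dir_sub on every
# directory, after that directory's contents have been visited.
# Symlinks are handed to $file_sub rather than followed.
sub dir_walk_sketch {
    my ( $top , $file_sub , $dir_sub ) = @_ ;
    return $file_sub->( $top ) unless -d $top && !-l $top ;
    opendir my $dh , $top or do {
        warn "Could not open $top: $!\n" ;
        return ;
        } ;
    my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh ;
    closedir $dh ;
    dir_walk_sketch( "$top/$_" , $file_sub , $dir_sub ) for @entries ;
    return $dir_sub->( $top ) ;
    }

# Count MP3-looking files under the directory named on the command
# line, defaulting to the current directory.
my $dir   = $ARGV[0] // '.' ;
my $count = 0 ;
dir_walk_sketch(
    $dir ,
    sub { $count++ if $_[0] =~ m/\.mp3$/i } ,
    sub { } ,
    ) ;
say "$count MP3 files" ;
```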

And needless to say, you can adjust this to do a lot of other things. Check file sizes. Find file names without track numbers. Stuff like that. There are three downsides so far: you don't have hashes to find repeated songs, you don't have MP3 tag information, and you have to run it again for every new question (with the associated lag of running a directory walker on 30,000+ MP3s).

But there are solutions.

Digest::SHA1. MP3::Info and/or MP3::Tag. DBI.

I run Linux. sudo apt-get install mysql-server gets me a DB. Run once, save the data and query until you're sick. I started out with this schema.

CREATE TABLE music (
    id              int(20) NOT NULL auto_increment primary key ,
    album           VARCHAR(255),
    artist          VARCHAR(255),
    filename        VARCHAR(255),
    filepath        VARCHAR(255),
    filesize        int(32),
    length          int(32),
    release_year    VARCHAR(4),
    run_length      VARCHAR(32),
    sha1_hash       VARCHAR(255),
    title           VARCHAR(255)
    ) ;

length is song length in seconds. run_length is song length in HH:MM:SS format, and yeah, I have some MP3s that push that, if not exceed it. Or that's the theory, at least.
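Going from the seconds in length to the HH:MM:SS string in run_length is plain arithmetic. A small sketch, with hh_mm_ss being a name I made up rather than anything from MP3::Tag or MP3::Info:

```perl
#!/usr/bin/perl
use 5.010 ;
use strict ;
use warnings ;

# Turn a whole number of seconds into an HH:MM:SS string.
sub hh_mm_ss {
    my ( $secs ) = @_ ;
    my $h = int( $secs / 3600 ) ;
    my $m = int( ( $secs % 3600 ) / 60 ) ;
    my $s = $secs % 60 ;
    return sprintf '%02d:%02d:%02d' , $h , $m , $s ;
    }

say hh_mm_ss( 225 ) ;     # 00:03:45
say hh_mm_ss( 3725 ) ;    # 01:02:05
```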

And some would say it's bad schema design, but I'm not so much worried about grouping by artist or album or year. Those tell me if the file has ID3 tags or not. I'm focused on the MP3 file itself here.

#!/usr/bin/perl
use 5.010 ;
use strict ;
use warnings ;
use Carp ;
use Data::Dumper ;
use MP3::Tag ;
use MP3::Info ;
use Digest::SHA1 ;
use lib '/home/jacoby/lib' ;
use HOP ':all' ;
use MusicDB 'db_connect' ;

$Data::Dumper::Indent = 1 ;
$MP3::Info::try_harder = 1 ;

my $sql = <<"SQL" ;
INSERT INTO music
    (
    album       , artist , filename     , filepath      ,
    filesize    , length , release_year , run_length    ,
    sha1_hash   , title
    )
    VALUES
    (
    ? , ? , ? , ? ,
    ? , ? , ? , ? ,
    ? , ?
    )
SQL

my $dbh = MusicDB::db_connect() ;
my $sth = $dbh->prepare( $sql ) ;
my $count = 1;

dir_walk( '/home/jacoby/Music' , \&mp3_check, sub { } ) ;
exit ;

sub mp3_check {
    my $file = $_[ 0 ] ;
    return if $file !~ m/mp3$/imx ;
    my $filename = ( split m{/}mx , $file )[-1] ;  # just the file name
    open my $fh, '<', $file or return ;    # for SHA1 HASH
    my $hash = Digest::SHA1->new ;         # for SHA1 HASH
    $hash->addfile( $fh ) ;                # for SHA1 HASH
    my $digest = $hash->hexdigest ;        # for SHA1 HASH
    my $mp3    = MP3::Tag->new( $file ) ;  # for MP3 tags
    my $size   = -s $file ;                # file size in bytes
    my ($title, $track, $artist, $album,   # for MP3 tags
        $comment, $year, $genre )
      = $mp3->autoinfo() ;                 # for MP3 tags
    my $total_secs = $mp3->total_secs_int() ;
    my $time       = $mp3->time_mm_ss() ;
    $sth->execute(
        $album       ,
        $artist ,
        $filename     ,
        $file ,
        $size    ,
        $time ,
        $year ,
        $total_secs    ,
        $digest ,
        $title
        ) ;
    say $count if $count % 1000 == 0 ; #to keep track of progress
    $count++ ;
    }

This is still a work in progress. I don't use Carp here, but I generally include it when I should. As I'm debugging, I always have Data::Dumper floating around so I can see what the data structures are. I could probably just use MP3::Info instead of MP3::Tag. Haven't decided yet. Digest::SHA1 gives a SHA-1 digest of the file's contents, so byte-for-byte identical files get the same hash, and that should detect duplicates (though not the same song saved with different tags). HOP was mentioned earlier, and MusicDB is a wrapper module that allows me to have my DB passwords in one convenient place, so I just have to worry about the actual SQL. There are some bugs: length doesn't give the right info yet, most likely because the execute call binds $time where the column list expects length and $total_secs where it expects run_length. Still, I have all the info on any discrete MP3 file.
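The duplicate check itself can be tried without the database. The sketch below groups files by digest and reports any group with more than one member. It swaps in Digest::SHA, which ships with Perl, for Digest::SHA1 so that it runs on a stock install; file_digest and find_duplicates are names invented for the example.

```perl
#!/usr/bin/perl
use 5.010 ;
use strict ;
use warnings ;
use Digest::SHA ;

# SHA-1 hex digest of a file's contents, or undef if it can't be read.
sub file_digest {
    my ( $file ) = @_ ;
    open my $fh , '<' , $file or return ;
    binmode $fh ;
    return Digest::SHA->new( 1 )->addfile( $fh )->hexdigest ;
    }

# Returns a list of array refs, one per set of byte-identical files.
sub find_duplicates {
    my @files = @_ ;
    my %by_digest ;
    for my $file ( @files ) {
        my $digest = file_digest( $file ) ;
        next unless defined $digest ;
        push @{ $by_digest{ $digest } } , $file ;
        }
    return grep { @$_ > 1 } values %by_digest ;
    }

# Files to compare come from the command line.
for my $group ( find_duplicates( @ARGV ) ) {
    say join ' == ' , sort @$group ;
    }
```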

Notice, though, that the function has become sufficiently big and complicated that I've pulled it out and given it a name. Also notice how I'm starting to use placeholders: the statement is prepared once and executed for each file, which should make my DB interface more efficient, and DBI handles quoting the values for me.

A good thing to add would be to see if a file has already been put into the DB, and if so, to get the unique index, file size and hash to check for changes, then update only if there are changes, rather than inserting it again.
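The decision part of that could be as small as the sketch below. It assumes the existing row arrives as a hash ref with the filesize and sha1_hash columns from the schema above (for instance from DBI's selectrow_hashref on filepath), or undef when the file isn't in the table yet. sync_action is a made-up name, and the actual INSERT and UPDATE statements are left out.

```perl
#!/usr/bin/perl
use 5.010 ;
use strict ;
use warnings ;

# Decide what to do with a file, given its current size and digest
# and the row already stored for its path (undef if there is none).
sub sync_action {
    my ( $row , $size , $digest ) = @_ ;
    return 'insert' unless defined $row ;
    return 'skip'
        if $row->{filesize} == $size && $row->{sha1_hash} eq $digest ;
    return 'update' ;
    }

my $stored = { id => 1 , filesize => 100 , sha1_hash => 'abc' } ;
say sync_action( undef ,   100 , 'abc' ) ;    # insert
say sync_action( $stored , 100 , 'abc' ) ;    # skip
say sync_action( $stored , 120 , 'def' ) ;    # update
```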

2010/03/16

John Hodgman on Net Neutrality

2010/03/15

Better Than Chocolate Milk

They say that the browser serves as the command-line for the internet. Quix wants to improve the syntax. And from what little I have tried, I like it.

2010/03/09

Heavy Boots of Lead

I've joined the Perl Iron Man competition. Or whatever it is. Which means I have to write about Perl.

I use it all the time, so that should be no problem. Just have to make it interesting.

Which I can't right now.

2010/03/06

Am I Dreaming? No...

Max Headroom is coming to DVD!

So great. Max Headroom pointed to the future in which we now live.

2010/03/04

Desktop Dead in Three Years?

Silicon Republic (via Gizmodo)
Google believes that in three years or so desktops will give way to mobile as the primary screen from which most people will consume information and entertainment. That’s according to Google Europe boss John Herlihy who said that smart phones enhance Google’s mission to make information universal.

I have to agree with that. Maybe not that specific timetable, but it certainly is reasonable.
