
2011/03/16

The Schwartzian Mindset

"Yeah, but your scientists were so preoccupied with whether or not they could, they didn't stop to think if they should." - Dr. Ian Malcolm

I have recently come to appreciate the power of map and grep when dealing with hashes and arrays of data.

The key to entering this mindset is the Schwartzian Transform. Here is a canonical example.
@sorted = map  { $_->[0] }
          sort { $a->[1] cmp $b->[1] }
          map  { [$_, foo($_)] }
               @unsorted;
For years, I looked at that and said "Huh?", then wrote for my $i ( @unsorted ) { ... } or something like that. In fact, let's look at a for-loop implementation that would get similar results.
my @sorted ;
my %hash ;
for my $i ( @unsorted ) {
    my $j = foo( $i ) ;
    $hash{ $j } = $i ;    # note: two elements with the same foo() value will clobber each other here
    }
for my $k ( sort { $a cmp $b } keys %hash ) {
    push @sorted , $hash{ $k } ;
    }

That's a lot more code, and it would not surprise me at all if it ran far slower than the Wikipedia example. But I've understood for loops since my first year of programming, so I can look at the second example and immediately understand it, and I'd still understand it if I came back to it a month from now.

Of course, you can comment this sort of thing to explain it to your later self.
@sorted = map  { $_->[0] } # taking away the sorting field
          sort { $a->[1] cmp $b->[1] } #sorting by the 2nd field in the anonymous arrays
          map  { [$_, foo($_)] } # an array of 2-element anonymous arrays
               @unsorted; # the original unsorted array 
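
And once the shape is in your head, the same chaining works with grep in the middle instead of sort. Here is a made-up example, not from any real script of mine: find the big log files without leaving temporary arrays lying around.
my @big_logs = map  { $_->[0] }               # strip back down to just the name
               grep { $_->[1] > 10_000_000 }  # keep only files over ~10 MB
               map  { [ $_ , -s $_ ] }        # pair each name with its size
               glob '*.log' ;                 # whatever log files are in this directory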

I find myself using the map and grep method a lot lately. It keeps a lot of extra arrays from floating around. But I looked at the Schwartzian Transform and boggled for years, so I can see myself, or the next person holding this seat, looking at code like this for hours and failing to get it. But at least, now that I can code like that, I can wonder if I should...

2010/06/11

Absolutely Awesome Perl Modules: IO::Interactive

I'm reinstalling on another box and have a few moments to knock this out.

Say you have a batch job that does lots and lots and lots of stuff overnight, or even several times an hour. You want to know what's going on with it when you're working on it, so you add print or say statements all over.

say &do( $this ) ;
say &do( $that ) ;
say &do( $the_other_thing ) ;

And it emails you all the time, giving you status updates you don't need. So, you change it.

$debug and say &do( $this ) ;
$debug and say &do( $that ) ;
$debug and say &do( $the_other_thing ) ;

And you have to remember to set the $debug flag for testing.

Along comes IO::Interactive.

use IO::Interactive qw{ interactive } ;
say { interactive } &do( $this ) ;
say { interactive } &do( $that ) ;
say { interactive } &do( $the_other_thing ) ;

Now it can tell whether you're running interactively or not, and it suppresses your output (by handing say a file handle that goes nowhere) when you're not. Absolutely awesome.
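
If you'd rather keep an explicit flag around after all, the module also exports is_interactive, which as far as I can tell is just the boolean version of the same check:
use 5.010 ;
use IO::Interactive qw{ is_interactive } ;

my $debug = is_interactive() ;
$debug and say 'running by hand, so say away' ;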

2010/06/01

Absolutely Awesome Perl Modules: Cwd

It's the libraries that make a language. It's nice to have cool ways to do loops and such, but if the tool does the task you need it to do, that's what pays the bills.

This is one I've found very useful. Cwd gives you your current working directory with getcwd, and it also takes a relative path and gives you back the absolute path with abs_path.
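
So, roughly (the paths here are just examples from my own box):
use 5.010 ;
use Cwd qw{ getcwd abs_path } ;

say getcwd() ;                        # /home/jacoby, or wherever you happen to be
say abs_path( 'Pictures/foo.png' ) ;  # /home/jacoby/Pictures/foo.png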

This is a small script that sends an image path to gconftool to set it as the background image. I call it set_background.pl. gconftool is a stickler for absolute paths ('/home/user/Pictures/foo.png', for example), and I could either hack something up to build absolute paths myself, or I could just use Cwd 'abs_path' ;

#!/usr/bin/perl

use 5.010 ;
use strict ;
use warnings ;
use Cwd 'abs_path' ;
use IO::Interactive qw{ interactive } ;

my $command = join ' ' ,
    qw{gconftool -t string -s /desktop/gnome/background/picture_filename} ;

my $img = shift @ARGV ;
my $bg  = abs_path $img ;

say { interactive } $bg ;
say qx{$command $bg} ;

IO::Interactive is similarly cool, but we'll get into that later.

2010/05/03

So, 1000 words each?

I have a webcam. I run it via camorama and it takes a shot every 20 minutes. I start it and forget it, mostly using it as a rear-view mirror to tell me if someone is coming up behind me. (I have a real mirror too. I hate people sneaking up on me.)
So, I end up with lots of shots of the same view: the wall behind me with the lights off. They're all the same shot, except they're not. They are different pictures of the same view, and thus have different sizes and different hashes, which meant that all the standard measures of uniqueness I knew how to code were useless.
So, I did some looking and some coding, and I found a means of using ImageMagick to pull some data out, which I could then use to tell ... well, it's not really motion-sensing so much as telling whether the light is on or not. Which is enough.


#!/usr/bin/perl

use 5.010 ;
use strict ;
use warnings ;
use lib '/home/jacoby/lib' ;
use Carp ;
use Cwd 'abs_path' ;
use Data::Dumper ;
use Digest::SHA1 ;
use File::Copy ;
use HOP ':all' ;
use subs qw{ main all_files image_check } ;

my %digest ;
main '/home/jacoby/Webcam/' ;
exit ;

sub main {
    my ( $dir ) = @_ ;
    my $first = '' ;
    my $prev = '' ;
    for my $curr ( all_files $dir ) {
        my $del = 0 ;
        if ( $prev ne '' ) {
            my $check = image_check( $prev , $curr ) ;
            say join "\t" , $check , $prev , $curr ;
            $del = 1 if $check < 1 ;
            }
        if ( $del ) {
            unlink $curr ;
            next ; 
            }
        $first = $curr if $first eq '' ;
        $prev = $curr ;
        }
    } ;

sub image_check {
    my ( $pic1 , $pic2 ) = @_ ;
    # mean absolute error between the two images, via ImageMagick's compare
    my $out = qx{ compare -verbose -metric MAE $pic1 $pic2 /dev/null 2>&1 } ;
    my ( $all ) = grep m{all}mx , split m{\n} , $out ;   # the "all channels" line
    $all = ( split m{\s+}mx , $all )[2] ;                # third field is the error value
    $all = int $all ;
    return $all < 800 ? 0 : 1 ;    # under 800, they're close enough to be the same shot
    }

sub all_files {
    my $dir = shift ;
    # glob within the directory we were handed ($dir already ends in a slash)
    return sort { $a cmp $b } glob "${dir}webcam*.png" ;
    }
There are a few modules in there that I don't need, or at least don't need at this point in the process. There's also a qx() where I know I should be using Image::Magick, but I just could not get that working: I tried both CPAN (which has had networking troubles for me) and apt-get (which is kinda hammered due to Lucid coming out), and neither made me happy. I figure that's a separate issue and I can make it better later. And better it is, because after running it, I get this:


Well, I think it's an improvement....

2010/03/19

Higher Order MP3 Directory Organization, A First Step

I have a huge number of MP3s. I am sure I haven't heard all of them. Some have weird tags. Some have no tags. Some are not really MP3s, but "you can't download this" HTML files, or just zero-sized files. When I bump into them, I can fix these things (read: delete the bad files) but it can take some time.

I had wanted to use the power of Perl to help with this, but while there are great numbers of modules to help with just about anything, I didn't have a directory walker I liked.

Then I started trying to go through Higher Order Perl by Mark Jason Dominus. And one of the first examples is a directory walker which takes anonymous subroutines. Exactly!

First step was to make a script that counts my MP3s.
#!/usr/bin/perl
use 5.010 ;
use strict ;
use warnings ;
use Carp ;
use Data::Dumper ;
use MP3::Tag ;
use MP3::Info ;
use Digest::SHA1 ;
use lib '/home/jacoby/lib' ;
use HOP ':all' ;

my $x = 0 ;

# dir_walk( directory , file handling sub , directory handling sub )
dir_walk(
    '/home/jacoby/Music',
    sub {
        my $file = $_[ 0 ] ;
        return if $file !~ m/mp3$/imx ;
        $x++ ;
        } ,
    sub { },
        ) ;

say $x . ' MP3 files' ;
exit ;
Everything that claims to be an MP3 gets counted. Yay! (Just so you know, the current count is 37604.) There are lots of included modules that I don't use yet. HOP.pm simply puts MJD's directory walker into a module where I can get at it on demand, so I don't have to copy and paste. Being able to set the directory on the command line would be good, but not today.

And needless to say, you can adjust this to do a lot of other things. Check file sizes. Find file names without track numbers. Stuff like that. There are three downsides so far: you don't have hashes to find repeated songs, you don't have MP3 tag information, and you have to run it all over again each time (with the associated lag of running a directory walker over 30,000+ MP3s).

But there are solutions.

Digest::SHA1. MP3::Info and/or MP3::Tag. DBI.

I run Linux. sudo apt-get install mysql-server gets me a DB. Run once, save the data and query until you're sick. I started out with this schema.

CREATE TABLE music (
    id              int(20) NOT NULL auto_increment primary key ,
    album           VARCHAR(255),
    artist          VARCHAR(255),
    filename        VARCHAR(255),
    filepath        VARCHAR(255),
    filesize        int(32),
    length          int(32),
    release_year    VARCHAR(4),
    run_length      VARCHAR(32),
    sha1_hash       VARCHAR(255),
    title           VARCHAR(255)
    ) ;
length is song length in seconds. run_length is song length in HH:MM:SS format, and yeah, I have some MP3s that push that, if not exceed it. Or that's the theory, at least.

And some would say it's bad schema design, but I'm not so much worried about grouping by artist or album or year. Those tell me if the file has ID3 tags or not. I'm focused on the MP3 file itself here.
#!/usr/bin/perl
use 5.010 ;
use strict ;
use warnings ;
use Carp ;
use Data::Dumper ;
use MP3::Tag ;
use MP3::Info ;
use Digest::SHA1 ;
use lib '/home/jacoby/lib' ;
use HOP ':all' ;
use MusicDB 'db_connect' ;

$Data::Dumper::Indent = 1 ;
$MP3::Info::try_harder = 1 ;

my $sql = <<"SQL" ;
INSERT INTO music
    (
    album       , artist , filename     , filepath      ,
    filesize    , length , release_year , run_length    ,
    sha1_hash   , title
    )
    VALUES
    (
    ? , ? , ? , ? ,
    ? , ? , ? , ? ,
    ? , ?
    )
SQL

my $dbh = MusicDB::db_connect() ;
my $sth = $dbh->prepare( $sql ) ;
my $count = 1;

dir_walk( '/home/jacoby/Music' , \&mp3_check, sub { } ) ;
exit ;

sub mp3_check {
    my $file = $_[ 0 ] ;
    return if $file !~ m/mp3$/imx ;
    my $filename = ( split m{/}mx , $file )[-1] ;  # just the file name
    open my $fh, '<', $file or return ;    # for SHA1 HASH
    my $hash = Digest::SHA1->new ;         # for SHA1 HASH
    $hash->addfile( $fh ) ;                # for SHA1 HASH
    my $digest = $hash->hexdigest ;        # for SHA1 HASH
    my $mp3    = MP3::Tag->new( $file ) ;  # for MP3 tags
    my $size   = -s $file ;                # for MP3 tags
    my ($title, $track, $artist, $album,   # for MP3 tags
        $comment, $year, $genre )
      = $mp3->autoinfo() ;                 # for MP3 tags
    my $total_secs = $mp3->total_secs_int() ;
    my $time       = $mp3->time_mm_ss() ;
    $sth->execute(
        $album       ,    # album
        $artist      ,    # artist
        $filename    ,    # filename
        $file        ,    # filepath
        $size        ,    # filesize
        $total_secs  ,    # length, in seconds
        $year        ,    # release_year
        $time        ,    # run_length, HH:MM:SS
        $digest      ,    # sha1_hash
        $title            # title
        ) ;
    say $count if $count % 1000 == 0 ; #to keep track of progress
    $count++ ;
    }
This is still a work in progress. I don't use Carp here, but I generally include it when I should. As I'm debugging, I always have Data::Dumper floating around so I can see what the data structures are. I could probably just use MP3::Info instead of MP3::Tag; I haven't decided yet. Digest::SHA1 gives a cryptographically-secure hash of the MP3, so that should detect duplicates. HOP was mentioned earlier, and MusicDB is a wrapper module that lets me keep my DB passwords in one convenient place, so I just have to worry about the actual SQL. One bug that bit me: length wasn't getting the right info, because the placeholder order in execute has to match the column order in the INSERT. With that straightened out, I have all the info on any discrete MP3 file.
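
For what it's worth, db_connect is nothing fancy, just DBI->connect with the credentials tucked away. Something like this, where the DSN, user, and password are made-up placeholders:
package MusicDB ;
use strict ;
use warnings ;
use DBI ;
use Exporter 'import' ;
our @EXPORT_OK = qw{ db_connect } ;

sub db_connect {
    # the real DSN and credentials live here, and nowhere else
    return DBI->connect( 'DBI:mysql:database=music;host=localhost' ,
        'username' , 'password' , { RaiseError => 1 } ) ;
    }

1 ;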

Notice, though, that the function has become sufficiently big and complicated that I've pulled it out and given it a name. Also notice how I'm starting to use placeholders, which should make my DB interface more efficient.

A good thing to add would be to check whether a file is already in the DB, and if so, to grab its unique index, file size, and hash, compare them against the file on disk, and update only if something has changed, rather than inserting it all over again.
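
That check shouldn't take much code. Here's a rough sketch of what I have in mind, reusing $dbh, $file, $size, and $digest from the script above (none of this is written or tested yet):
my $check = $dbh->prepare(
    'SELECT id, filesize, sha1_hash FROM music WHERE filepath = ?' ) ;
$check->execute( $file ) ;
my ( $id , $old_size , $old_hash ) = $check->fetchrow_array ;

if ( !defined $id ) {
    # never seen this file before, so insert it as usual
    }
elsif ( $old_size != $size or $old_hash ne $digest ) {
    # the file changed on disk, so update the existing row instead of adding another
    $dbh->do(
        'UPDATE music SET filesize = ?, sha1_hash = ? WHERE id = ?' ,
        undef , $size , $digest , $id ) ;
    }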

2010/03/09

Heavy Boots of Lead

I've joined the Perl Iron Man competition. Or whatever it is. Which means I have to write about Perl.

I use it all the time, so that should be no problem. Just have to make it interesting.

Which I can't right now.