I had wanted to use the power of Perl to help with this, but while there are great numbers of modules to help with just about anything, I didn't have a directory walker I liked.
Then I started working through Higher Order Perl by Mark Jason Dominus, and one of the first examples is a directory walker that takes anonymous subroutines. Exactly!
First step was to make a script that counts my MP3s.
```perl
#!/usr/bin/perl

use 5.010 ;
use strict ;
use warnings ;
use Carp ;
use Data::Dumper ;
use MP3::Tag ;
use MP3::Info ;
use Digest::SHA1 ;
use lib '/home/jacoby/lib' ;
use HOP ':all' ;

my $x = 0 ;

# dir_walk: home directory , file handling sub , directory handling sub
dir_walk(
    '/home/jacoby/Music' ,
    sub {
        my $file = $_[ 0 ] ;
        return if $file !~ m/mp3$/imx ;
        $x++ ;
        } ,
    sub { } ,
    ) ;
say $x . ' MP3 files' ;
exit ;
```

Everything that claims to be an MP3 gets counted. Yay! (Just so you know, the current count is 37604.) There are lots of included modules that I don't use yet.
HOP.pm simply puts MJD's directory walker into a module where I can get it on demand, so I don't have to copy and paste. Having a command-line option for the directory would be good, but not today.

And needless to say, you can adjust this to do a lot of other things. Check file sizes. Find file names without track numbers. Stuff like that. There are three downsides so far: you don't have hashes to find repeated songs, you don't have MP3 tag information, and you have to run it again every time (with the associated lag of running a directory walker on 30,000+ MP3s).
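For anyone who hasn't read the book: HOP.pm itself isn't shown here, but the `dir_walk` it exports is presumably close to the version in Higher Order Perl. A sketch from memory, not the module's actual contents:

```perl
#!/usr/bin/perl
use strict ;
use warnings ;

# Recursive directory walker in the style of Higher Order Perl:
# call $filefunc on every plain file, $dirfunc on every directory.
sub dir_walk {
    my ( $top , $filefunc , $dirfunc ) = @_ ;
    my $DIR ;
    if ( -d $top ) {
        my $file ;
        unless ( opendir $DIR , $top ) {
            warn "Couldn't open directory $top: $!; skipping.\n" ;
            return ;
            }
        my @results ;
        while ( defined( $file = readdir $DIR ) ) {
            next if $file eq '.' or $file eq '..' ;
            push @results , dir_walk( "$top/$file" , $filefunc , $dirfunc ) ;
            }
        return $dirfunc ? $dirfunc->( $top , @results ) : () ;
        }
    else {
        return $filefunc ? $filefunc->( $top ) : () ;
        }
    }
```

The nice part is that all the policy lives in the two coderefs, so the walker never needs to change.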
But there are solutions.
Digest::SHA1. MP3::Info and/or MP3::Tag. DBI.
I run Linux, so

```shell
sudo apt-get install mysql-server
```

gets me a DB. Run once, save the data, and query until you're sick. I started out with this schema:

```sql
CREATE TABLE music (
    id           INT(20) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
    album        VARCHAR(255) ,
    artist       VARCHAR(255) ,
    filename     VARCHAR(255) ,
    filepath     VARCHAR(255) ,
    filesize     INT(32) ,
    length       INT(32) ,
    release_year VARCHAR(4) ,
    run_length   VARCHAR(32) ,
    sha1_hash    VARCHAR(255) ,
    title        VARCHAR(255)
    ) ;
```
length is song length in seconds, and run_length is song length in HH:MM:SS format, and yeah, I have some MP3s that push that, if not exceed it. Or that's the theory, at least.

And some would say it's bad schema design, but I'm not so much worried about grouping by artist or album or year. Those columns mostly tell me whether the file has ID3 tags or not. I'm focused on the MP3 file itself here.
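Once the table is populated, the sha1_hash column makes finding exact duplicates a single query. Something like this (a sketch, not a query from the original post; MySQL's GROUP_CONCAT collects all the paths for one hash onto one row):

```sql
SELECT   sha1_hash ,
         COUNT(*) AS copies ,
         GROUP_CONCAT( filepath SEPARATOR ' | ' ) AS files
FROM     music
GROUP BY sha1_hash
HAVING   copies > 1
ORDER BY copies DESC ;
```

That only catches bit-identical files, of course, which is why the tag columns are still worth having.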
```perl
#!/usr/bin/perl

use 5.010 ;
use strict ;
use warnings ;
use Carp ;
use Data::Dumper ;
use MP3::Tag ;
use MP3::Info ;
use Digest::SHA1 ;
use lib '/home/jacoby/lib' ;
use HOP ':all' ;
use MusicDB 'db_connect' ;

$Data::Dumper::Indent  = 1 ;
$MP3::Info::try_harder = 1 ;

my $sql = <<"SQL" ;
INSERT INTO music (
    album , artist , filename , filepath , filesize ,
    length , release_year , run_length , sha1_hash , title )
VALUES ( ? , ? , ? , ? , ? , ? , ? , ? , ? , ? )
SQL

my $dbh   = MusicDB::db_connect() ;
my $sth   = $dbh->prepare( $sql ) ;
my $count = 1 ;

dir_walk( '/home/jacoby/Music' , \&mp3_check , sub { } ) ;
exit ;

sub mp3_check {
    my $file = $_[ 0 ] ;
    return if $file !~ m/mp3$/imx ;
    my $filename = ( split m{/}mx , $file )[ -1 ] ;    # just the file name

    open my $fh , '<' , $file or return ;              # for SHA1 hash
    my $hash = Digest::SHA1->new ;
    $hash->addfile( $fh ) ;
    my $digest = $hash->hexdigest ;

    my $mp3  = MP3::Tag->new( $file ) ;                # for MP3 tags
    my $size = -s $file ;
    my ( $title , $track , $artist , $album ,
         $comment , $year , $genre ) = $mp3->autoinfo() ;
    my $total_secs = $mp3->total_secs_int() ;
    my $time       = $mp3->time_mm_ss() ;

    # length gets seconds ( $total_secs ) and run_length gets the
    # formatted time ( $time ) -- my first pass had them swapped
    $sth->execute( $album , $artist , $filename , $file , $size ,
        $total_secs , $year , $time , $digest , $title ) ;

    say $count if $count % 1000 == 0 ;    # to keep track of progress
    $count++ ;
    return ;
    }
```

This is still a work in progress. I don't use Carp here, but I generally include it when I should. As I'm debugging, I always have Data::Dumper floating around so I can see what the data structures are. I could probably just use MP3::Info instead of MP3::Tag; I haven't decided yet. Digest::SHA1 gives a cryptographically-secure hash of the MP3, so that should detect duplicates. HOP was mentioned earlier, and MusicDB is a wrapper module that allows me to have my DB passwords in one convenient place, so I just have to worry about the actual SQL. There was a bug in my first pass (I had the length and run_length placeholders swapped, so length didn't give the right info), but I have all the info on any discrete MP3 file.

Notice, though, that the function has become sufficiently big and complicated that I've pulled it out and given it a name. Also notice how I'm starting to use placeholders, which should make my DB interface more efficient.
A good thing to add would be to check whether a file has already been put into the DB, and if so, to get the unique index, file size, and hash to check for changes, then update only if there are changes, rather than inserting it all over again.
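What that could look like: a sketch, not the actual script. To make the logic runnable on its own I'm using an in-memory SQLite database via DBD::SQLite (the real table is MySQL), and the `store_mp3` name, the trimmed-down table, and the UNIQUE filepath column are my own assumptions:

```perl
#!/usr/bin/perl
use 5.010 ;
use strict ;
use warnings ;
use DBI ;    # assumes DBD::SQLite is installed, just for the demo DB

my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:' , '' , '' ,
    { RaiseError => 1 } ) ;
$dbh->do( 'CREATE TABLE music (
    id        INTEGER PRIMARY KEY ,
    filepath  VARCHAR(255) UNIQUE ,
    filesize  INT ,
    sha1_hash VARCHAR(255) )' ) ;

# prepare once, outside the walker callback
my $check_sth  = $dbh->prepare(
    'SELECT id , filesize , sha1_hash FROM music WHERE filepath = ?' ) ;
my $insert_sth = $dbh->prepare(
    'INSERT INTO music ( filepath , filesize , sha1_hash ) VALUES ( ? , ? , ? )' ) ;
my $update_sth = $dbh->prepare(
    'UPDATE music SET filesize = ? , sha1_hash = ? WHERE id = ?' ) ;

# insert new files, refresh changed ones, skip the rest;
# returns what it did so the caller can keep counts
sub store_mp3 {
    my ( $file , $size , $digest ) = @_ ;
    $check_sth->execute( $file ) ;
    my $row = $check_sth->fetchrow_hashref ;
    if ( !defined $row ) {
        $insert_sth->execute( $file , $size , $digest ) ;
        return 'inserted' ;
        }
    if ( $row->{filesize} != $size or $row->{sha1_hash} ne $digest ) {
        $update_sth->execute( $size , $digest , $row->{id} ) ;
        return 'updated' ;
        }
    return 'unchanged' ;
    }

say store_mp3( '/tmp/a.mp3' , 100 , 'aaaa' ) ;    # inserted
say store_mp3( '/tmp/a.mp3' , 100 , 'aaaa' ) ;    # unchanged
say store_mp3( '/tmp/a.mp3' , 200 , 'bbbb' ) ;    # updated
```

With that in place, the expensive SHA1 work could also be skipped entirely when the file size and mtime haven't changed, but that's a further refinement.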
I wrote File::Next based on MJD's HOP examples, so that we wouldn't have to re-invent the wheel. I think you'll find it useful.
Shameless plug here: I'd like to suggest trying Audio::Scan instead of MP3::Info. It is much faster and more accurate for reading MP3 file info and ID3 tags, and if you have other types of audio files in addition to MP3s, it'll handle those as well.
Andy(1): Will look into File::Next. Thanks.
Andy(2): Same with Audio::Scan. I try to avoid non-MP3 audio files, as OGGs tend not to work with my phone and/or MP3 players, and WMAs and ... whatever the Apple one is ... are proprietary and sometimes don't work on my preferred desktop, but I do have some of those deprecated formats, and that sounds like a win.
Besides File::Next, there is also my File-Find-Object (originally by Nanardon), which also has File-Find-Object-Rule.
I propose a bitrate column in your table. If, for instance, ten years ago you ripped Muskrat Love with the original Fraunhofer encoder at 128kbps fixed rate (which we all did, because Lame hadn't been invented yet and hard drives were still small) and then later happened to rerip it at 256kbps VBR, you'll end up with different file sizes and different SHA1 hashes, even though everything else might be the same.
ReplyDeleteYou could also implement some heuristics for detecting possible duplicates based on artist/year/title match or partial match, and so forth.
Bug report: crashes when finding a zero-sized file. Should find and take care of such things.