Cookie Notice

As far as I know, and as far as I remember, nothing in this page does anything with Cookies.

2016/11/19

Graphs are not that Scary!



As with most things I blog about, this starts with Twitter. I follow a lot of people on Twitter, and I use Lists. I want to be able to group people more-or-less on community, because there's the community where they talk about programming, for example, and the community where they talk about music, or the town I live in.

I can begin to break things up myself, but curation is a hard thing, so I wanted to do it automatically. And I spent a long time not knowing what to do. I imagined myself traversing trees in what looks like linked lists reimagined by Cthulhu, and that doesn't sound like much fun at all.

Eventually, I decided to search on "graphs and Perl". Of course, I probably should've done it earlier, but oh well. I found Graph. I had used GD::Graph before, which is a plotting library. (There has to be some index of how overloaded words are.) And once I installed it, I figured it out: As a programmer, all you're dealing with are arrays and hashes. Nothing scary.

Word Ladder


We'll take a problem invented by Lewis Carroll called a "word ladder", where you find your way from one word (for example, "cold") to another ("warm") by changing one letter at a time:

    cold
    coRd
    cArd
    Ward
    warM

Clearly, this can and is often done by hand, but if you're looking to automate it, there are three basic problems: what are the available words, how do you determine when words are one change away, and how do you do this to get the provable shortest path?

First, I went to CERIAS years ago and downloaded word lists. Computer security researchers use them because real words are bad passwords, so, lists of real words can be used to create rainbow tables and the like. My lists are years old, so there may be new words I don't account for, but unlike Lewis Carroll, I can get from APE to MAN in five words, not six.

    ape
    apS
    aAs
    Mas
    maN

Not sure that Lewis Carroll would've accepted AAS, but there you go

There is a term for the number of changes it takes to go from one word to another, and it's called the Levenshtein Distance. I first learned about this from perlbrew, which is how, if you type "perlbrew isntall", it guesses that you meant to type "perlbrew install". It's hardcoded there because perlbrew can't assume you have anything but perl and core modules. I use the function from perlbrew instead of Text::Levenshtein but it is a module worth looking into.

And the final answer is "Put it into a graph and use Dijkstra's Algorithm!"

Perhaps not with the exclamation point.

Showing Code


Here's making a graph of it:

  1. #!/usr/bin/env perl  
  2.   
  3. use feature qw{say} ;  
  4. use strict ;  
  5. use warnings ;  
  6.   
  7. use Data::Dumper ;  
  8. use Graph ;  
  9. use List::Util qw{min} ;  
  10. use Storable ;  
  11.   
  12. for my $l ( 3 .. 16 ) {  
  13.     create_word_graph($l) ;  
  14.     }  
  15. exit ;  
  16.   
  17. # -------------------------------------------------------------------  
  18. # we're creating a word graph of all words that are of length $length  
  19. # where the nodes are all words and the edges are unweighted, because  
  20. # they're all weighted 1. No connection between "foo" and "bar" because   
  21. # the distance is "3".  
  22.   
  23. sub create_word_graph {  
  24.     my $length = shift ;  
  25.     my %dict = get_words($length) ;  
  26.     my @dict = sort keys %dict ; # sorting probably is unnecessary  
  27.     my $g    = Graph->new() ;  
  28.   
  29.     # compare each word to each word. If the distance is 1, put it  
  30.     # into the graph. This implementation is O(N**2) but probably  
  31.     # could be redone as O(NlogN), but I didn't care to.  
  32.   
  33.     for my $i ( @dict ) {  
  34.         for my $j ( @dict ) {  
  35.             my $dist = editdist( $i$j ) ;  
  36.             if ( $dist == 1 ) {  
  37.                 $g->add_edge( $i$j ) ;  
  38.                 }  
  39.             }  
  40.         }  
  41.   
  42.     # Because I'm using Storable to store the Graph object for use  
  43.     # later, I only use this once. But, I found there's an endian  
  44.     # issue if you try to open Linux-generated Storable files in  
  45.     # Strawberry Perl.  
  46.   
  47.     store $g , "/home/jacoby/.word_$length.store" ;  
  48.     }  
  49.   
  50. # -------------------------------------------------------------------  
  51. # this is where we get the words and only get words of the correct  
  52. # length. I have a number of dictionary files, and I put them in  
  53. # a hash to de-duplicate them.  
  54.   
  55. sub get_words {  
  56.     my $length = shift ;  
  57.     my %output ;  
  58.     for my $d ( glob'/home/jacoby/bin/Toys/Dict/*' ) ) {  
  59.         if ( open my $fh'<'$d ) {  
  60.             for my $l ( <$fh> ) {  
  61.                 chomp $l ;  
  62.                 $l =~ s/\s//g ;  
  63.                 next if length $l != $length ;  
  64.                 next if $l =~ /\W/ ;  
  65.                 next if $l =~ /\d/ ;  
  66.                 $outputuc $l }++ ;  
  67.                 }  
  68.             }  
  69.         }  
  70.     return %output ;  
  71.     }  
  72.   
  73. # -------------------------------------------------------------------  
  74. # straight copy of Wikipedia's "Levenshtein Distance", straight taken  
  75. # from perlbrew. If I didn't have this, I'd probably use   
  76. # Text::Levenshtein.  
  77.   
  78. sub editdist {  
  79.     my ( $f$g ) = @_ ;  
  80.     my @a = split //, $f ;  
  81.     my @b = split //, $g ;  
  82.   
  83.     # There is an extra row and column in the matrix. This is the  
  84.     # distance from the empty string to a substring of the target.  
  85.     my @d ;  
  86.     $d$_ ][ 0 ] = $_ for ( 0 .. @a ) ;  
  87.     $d[ 0 ][ $_ ] = $_ for ( 0 .. @b ) ;  
  88.   
  89.     for my $i ( 1 .. @a ) {  
  90.         for my $j ( 1 .. @b ) {  
  91.             $d$i ][ $j ] = (  
  92.                   $a$i - 1 ] eq $b$j - 1 ]  
  93.                 ? $d$i - 1 ][ $j - 1 ]  
  94.                 : 1 + min( $d$i - 1 ][ $j ], $d$i ][ $j - 1 ], $d$i - 1 ][ $j - 1 ] )  
  95.                 ) ;  
  96.             }  
  97.         }  
  98.   
  99.     return $d@a ][ @b ] ;  
  100.     }  

Following are what my wordlists can do. Something tells me that, when we get to 16-letter words, it's more a bunch of disconnected nodes than a graph.

1718 3-letter words
6404 4-letter words
13409 5-letter words
20490 6-letter words
24483 7-letter words
24295 8-letter words
19594 9-letter words
13781 10-letter words
8792 11-letter words
5622 12-letter words
3349 13-letter words
1851 14-letter words
999 15-letter words
514 16-letter words

My solver isn't perfect, and the first thing I'd want to add is ensuring that both the starting and ending words are actually in the word list. Without that, your code goes on forever.

So, I won't show off the whole program below, but it does use Storable, Graph and feature qw{say}.

  1. dijkstra( $graph , 'foo' , 'bar' ) ;  
  2.   
  3. # -------------------------------------------------------------------  
  4. # context-specific perl implementation of Dijkstra's Algorithm for  
  5. # shortest-path  
  6.   
  7. sub dijkstra {  
  8.     my ( $graph$source$target, ) = @_ ;  
  9.   
  10.     # the graph pre-exists and is passed in   
  11.     # $source is 'foo', the word we're starting from  
  12.     # $target is 'bar', the word we're trying to get to  
  13.   
  14.     my @q ; # will be the list of all words  
  15.     my %dist ; # distance from source. $dist{$source} will be zero   
  16.     my %prev ; # this holds our work being every edge of the tree  
  17.                # we're pulling from the graph.   
  18.   
  19.     # we set the the distance for every node to basically infinite, then   
  20.     # for the starting point to zero  
  21.   
  22.     for my $v ( $graph->unique_vertices ) {  
  23.         $dist{$v} = 1_000_000_000 ;    # per Wikipeia, infinity  
  24.         push @q$v ;  
  25.         }  
  26.     $dist{$source} = 0 ;  
  27.   
  28. LOOP: while (@q) {  
  29.   
  30.         # resort, putting words with short distances first  
  31.         # first pass being $source , LONG WAY AWAY  
  32.   
  33.         @q = sort { $dist{$a} <=> $dist{$b} } @q ;  
  34.         my $u = shift @q ;  
  35.   
  36.         # say STDERR join "\t", $u, $dist{$u} ;  
  37.   
  38.         # here, we end the first time we see the target.  
  39.         # we COULD get a list of every path that's the shortest length,  
  40.         # but that's not what we're doing here  
  41.   
  42.         last LOOP if $u eq $target ;  
  43.   
  44.         # this is a complex and unreadable way of ensuring that  
  45.         # we're only getting edges that contain $u, which is the   
  46.         # word we're working on right now  
  47.   
  48.         for my $e (  
  49.             grep {  
  50.                 my @a = @$_ ;  
  51.                 grep {/^${u}$/} @a  
  52.             } $graph->unique_edges  
  53.             ) {  
  54.   
  55.             # $v is the word on the other end of the edge  
  56.             # $w is the distance, which is 1 because of the problem  
  57.             # $alt is the new distance between $source and $v,   
  58.             # replacing the absurdly high number set before  
  59.   
  60.             my ($v) = grep { $_ ne $u } @$e ;  
  61.             my $w   = 1 ;  
  62.             my $alt = $dist{$u} + $w ;  
  63.             if ( $alt < $dist{$v} ) {  
  64.                 $dist{$v} = $alt ;  
  65.                 $prev{$v} = $u ;  
  66.                 }  
  67.             }  
  68.         }  
  69.   
  70.     my @nodes = $graph->unique_vertices ;  
  71.     my @edges = $graph->unique_edges ;  
  72.     return {  
  73.         distances => \%dist,  
  74.         previous  => \%prev,  
  75.         nodes     => \@nodes,  
  76.         edges     => \@edges,  
  77.         } ;  
  78.     }  

I return lots of stuff, but the part really necessary is %prev, because that, $source and $target are everything you need. Assuming we're trying to go from FOR to FAR, a number of words will satisfy $prev{FOR}, but it's the one we're wanting. In the expanded case of FOO to BAR, $prev->{BAR} = 'FAR', $prev->{FAR} is 'FOR', and $prev->{FOR} is 'FOO'.

And nothing in there is complex. It's all really hashes or arrays or values. Nothing a programmer should have any problem with.

CPAN has a number of other modules of use: Graph::Dijkstra has that algorithm already written, and Graph::D3 allows you to create a graph in such a way that you can use it in D3.js. Plus, there are a number of modules in Algorithm::* that do good and useful things. So go in, start playing with it. It's deep, there are weeds, but it isn't scary.

No comments:

Post a Comment