Friday, February 03, 2006

Simple fuzzy search in Perl

I couldn't find a module on CPAN that did an n-pass fuzzy search over increasingly detailed data from a text file, so I cooked this up for GoodnessDirect.


package fuzzysearch;

use strict;
use vars qw($VERSION);
$VERSION='0.9';

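sub search ($$$$) {

my ($strings, $term, $regexp, $threshfn)=@_;

# Pad the term so word boundaries count, then take every overlapping 3-character fragment
$term=" $term ";
my @frags=$term=~/(?=(...))/g;

my %scores;
foreach (@{$strings}) {
# Extract the key and the text to score; skip lines the pattern does not match
my ($stock_code, $text)=/$regexp/ or next;
$text=" $text ";
my $score=0;
# \Q...\E stops characters such as ( or * in a fragment being treated as regexp syntax
$score+=()=$text=~/\Q$_\E/ig foreach @frags;
# Normalise by length so long texts do not win on size alone
$scores{$stock_code}=$score/length($text);
}

# Best score first
my @stock_codes=sort {$scores{$b}<=>$scores{$a}} keys %scores;

return grep {$threshfn->($scores{$_},$scores{$stock_codes[0]})} @stock_codes;

}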

1;

__END__
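
The scoring idea in the module can be tried on its own. The following is only a minimal sketch: the helper names fragments and score are made up for illustration and are not part of fuzzysearch, and each fragment is quoted with \Q...\E so that punctuation in a term is matched literally.

```perl
use strict;
use warnings;

# Pad the term with spaces so word boundaries contribute fragments too,
# then capture every overlapping three-character window via a lookahead.
sub fragments {
    my ($term) = @_;
    $term = " $term ";
    return $term =~ /(?=(...))/g;
}

# Count case-insensitive occurrences of each fragment in the padded text
# and normalise by the padded text length, as the module does.
sub score {
    my ($term, $text) = @_;
    $text = " $text ";
    my $score = 0;
    for my $frag (fragments($term)) {
        # List assignment in scalar context yields the number of matches
        $score += () = $text =~ /\Q$frag\E/ig;
    }
    return $score / length($text);
}

print join('|', fragments('Soya')), "\n";
printf "%.3f\n", score('Soya', 'Pure Soya Spread 500g');
```

Because the fragments overlap, a misspelt or partial term still shares most of its fragments with the right text, which is where the fuzziness comes from.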



The search function takes four arguments and returns the keys of the matching strings, best score first:

\@strings - array reference containing the strings to search
$term - string containing the search term
$regexp - compiled match pattern that captures the key and the text to score from each string
$threshfn - subroutine reference called as f(string_score,max_score), returning true for results to keep
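
To make the $threshfn contract a bit more concrete, here is a rough sketch of two threshold callbacks applied to a ranked score table. The scores are invented, apply_threshold is a hypothetical helper that mirrors the final sort and grep in the module, and the absolute-floor callback is an assumed alternative rather than something the module ships with.

```perl
use strict;
use warnings;

# Hypothetical scores keyed by stock code (illustrative values only).
my %scores = (A => 0.40, B => 0.25, C => 0.10, D => 0.00);

# Keep results scoring more than half of the best score.
my $relative = sub { $_[0] > $_[1] / 2 };

# Assumed alternative: an absolute floor that ignores the best score.
my $absolute = sub { $_[0] >= 0.05 };

# Rank keys by descending score, then filter with the callback,
# passing each score together with the maximum score.
sub apply_threshold {
    my ($scores, $threshfn) = @_;
    my @ranked = sort { $scores->{$b} <=> $scores->{$a} || $a cmp $b } keys %$scores;
    return () unless @ranked;
    my $max = $scores->{ $ranked[0] };
    return grep { $threshfn->($scores->{$_}, $max) } @ranked;
}

print join(',', apply_threshold(\%scores, $relative)), "\n";
print join(',', apply_threshold(\%scores, $absolute)), "\n";
```

A relative threshold adapts to how good the best match is, whereas a fixed floor may return nothing at all for a weak query.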

It can be called as follows:


#!/usr/bin/perl -w

use lib '.';

use strict;
use fuzzysearch;

my $query="Vegan Spread";

open(FIN,'<','stock.txt') or die "Cannot open stock.txt: $!";
my @lines=<FIN>;
close(FIN);

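# Pass 1: score against the third tab-separated field only
my @items=fuzzysearch::search(\@lines,$query,qr/^(.*?)\t.*?\t(.*?)\t/,sub {$_[0]>$_[1]/2});
# Pass 2: score against everything after the key, appending any further matches
push(@items,fuzzysearch::search(\@lines,$query,qr/^(.*?)\t(.*)$/,sub {$_[0]>$_[1]/2}));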

my %temp=(); # Remove duplicates
@items=grep ++$temp{$_}<2,@items;



The following is an extract from the stock.txt data file:


402125 CDAB Biona Toscana Olive, Tomato and Basil Creamy Soft Spread 125g Non-dairy Spreads Rich in Soluble fibre, cholesterol free-creamy spread with olives, tomato and basil. Less than 0.005% Cholesterol Organic Vegan Soya beans*, extra virgin olive oil*, green spelt*, fresh bell peppers* 7%, tomatoes*3%, black kalamata olives 2%, fresh lemons*, basil*0.26%, italians herbs and spices*, crystal minerals unrefined rock salt, natural nutritional yeast.
402679 CDAB Biona Organic Country-Wild Garlic Creamy Soft Spread 125g Non-dairy Spreads With Green Spelt Less than 0.005% Cholesterol Organic Vegan Soya beans*, unrefined sunflower oil*, green spelt* 4%, garden fresh carrots*, sunflower seeds*, wild garlic* 0.7%, fresh lemon juice*, chives*, herbs and spices*, crystal mineral unrefined rock salt, natural nutritional yeast. * = organically grown ingredients
403180 CDAB Biona Organic Non-Hydrogenated Vegetable Margarine 250g Non-dairy Spreads Organic Vegan Sunflower oil, palm oil, coconut oil, water, carrot juice, emulsifier: lecithin, lemon juice, natural flavouring.
403210 CDAB Biona Organic Non-Hydrogenated Vegetable Margarine 500g Non-dairy Spreads Dairy Free Organic Vegan Sunflower oil, palm oil, coconut oil, water, carrot juice, emulsifier: lecithin, lemon juice, natural flavouring.
408509 CDAB Pure Soya Spread 500g Non-dairy Spreads Dairy free margarine Dairy Free Gluten Free Lactose Free Vegan Soya oil (45%), water, palm oil, salt (0.75%), emulsifier (mono and diglycerides of vegetable fatty acids), Vitamin E, natural flavouring, Vitamin A, colour (natural carotenes), Vitamin D as D2, Vitamin B12.
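
To see what the two patterns in the calling script pull out of a line, here is a small sketch. The sample line is hypothetical: it assumes the fields in stock.txt are separated by tabs (as the \t in both patterns suggests) and guesses where the field boundaries fall, since the tabs are not visible in the extract above.

```perl
use strict;
use warnings;

# Hypothetical record; the split into fields is an assumption.
my $line = join("\t", '408509', 'CDAB', 'Pure Soya Spread 500g',
                      'Non-dairy Spreads', 'Dairy free margarine') . "\n";

my $third_field = qr/^(.*?)\t.*?\t(.*?)\t/;   # key + third field only
my $whole_line  = qr/^(.*?)\t(.*)$/;          # key + everything after it

# A match in list context returns the two captures.
my ($code1, $text1) = $line =~ $third_field;
my ($code2, $text2) = $line =~ $whole_line;

print "$code1 => $text1\n";
print "$code2 => $text2\n";
```

The first pattern scores a query against a short field, so a hit there is a strong signal; the second widens the search to the full description for anything the first pass missed.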
