Search tweaks

Written at evening time in English

Some time ago I added some UTF-8 pages to this site. I promptly started receiving one error message per page from the Swish-e search engine I am using. Of course, the search results also showed “garbage”, since Swish-e only handles single-byte characters.

Today I got tired of the errors arriving in my mailbox, so I dug into the code and fixed things.

Turns out the errors came from HTML::LinkExtor and had to do with changes in Perl’s internal character handling. Fortunately the HTTP::Response object had also grown a decoded_content method that returns the “right” data.

--- /usr/pkg/lib/swish-e/spider.pl      2006-02-08 00:11:29.000000000 +0200
+++ spider.pl   2008-06-01 19:56:16.000000000 +0300
@@ -1138,7 +1138,8 @@

     # Extract out links (if not too deep)

-    my $links_extracted = extract_links( $server, \$content, $response )
+    my $dcontent = $response->decoded_content;
+    my $links_extracted = extract_links( $server, \$dcontent, $response )
         unless defined $server->{max_depth} && $depth >= $server->{max_depth};

I kept the changes minimal, although I have some concerns about, for example, compressed content from the web server. It works for me, though, so I will not be spending any more time on that.
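
To make the bytes-versus-characters distinction concrete, here is a small sketch using only the core Encode module. It is not part of the spider: the sample string and variable names are made up for illustration, and it only approximates the charset-decoding half of what decoded_content does.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical page body as it arrives off the wire: UTF-8 encoded bytes.
my $raw_bytes = "na\xC3\xAFve caf\xC3\xA9";

# Decoding turns the byte string into a string of Perl characters,
# which is what code that expects text (such as a link extractor) wants.
my $characters = Encode::decode('UTF-8', $raw_bytes);

# The two accented letters take two bytes each but count as one character each.
printf "bytes: %d, characters: %d\n", length $raw_bytes, length $characters;
```

The byte string is two longer than the character string because each accented letter occupies two bytes in UTF-8.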

The contents still needed to get indexed more intelligently. The “fix” I picked for that was to recode the content in ISO-8859-1. This code in the filter_content callback did the trick:

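    # Make the content always iso-8859-1
    # (Encode must be loaded, e.g. "use Encode;" near the top of the config)
    for ($response->header('content-type')) {
        # List assignment instead of "my ... if ...", whose behaviour
        # is undefined when the condition is false
        my ($charset) = /\bcharset=([^;]+)/;
        if (defined $charset && lc $charset ne 'iso-8859-1') {
            # decoded_content yields characters; re-encode them as Latin-1 bytes
            $$content = Encode::encode(
                'iso-8859-1',
                $response->decoded_content
            );
        }
    }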

Text that cannot be represented in ISO-8859-1 is lost, but fortunately that is all just supplemental information on my site.
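
What “lost” means here can be seen in a tiny standalone sketch. The sample text is hypothetical. It relies on Encode's default behaviour, which, as far as I know, substitutes a question mark for each character the target encoding cannot represent.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sample: "é" exists in ISO-8859-1, the euro sign does not.
my $text = "caf\x{E9} 5\x{20AC}";

# With the default CHECK argument, encode() does not die on an
# unrepresentable character; it emits a substitution character instead.
my $latin1 = Encode::encode('iso-8859-1', $text);

my $expected = "caf\xE9 5?";
print "euro sign replaced by '?': ", ($latin1 eq $expected ? "yes" : "no"), "\n";
```

Passing Encode::FB_CROAK as a third argument would make encode() die instead, which could help when hunting for pages that lose text.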

Staring at the search results for so long reminded me of another long-standing issue: the “fancy quotes” were not handled well and were simply stripped away. As a quick and dirty fix I added replacements for the most common ones at the end of the filter_content callback.

    $$content =~ s,\&times;,x,gs;
    $$content =~ s,\&#(8211|8212);,-,gs;
    # 39 is the ASCII apostrophe, 34 the ASCII double quote
    $$content =~ s,\&#(0?39|8216|8217|8218|8242);,',gs;
    $$content =~ s,\&#(0?34|8220|8221|8222|8243);,",gs;
    $$content =~ s,\&hellip;,...,gs;
    $$content =~ s,\&trade;,(tm),gs;

    $$content =~ s,\&#0?38;,&,gs;
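
If the list of substitutions keeps growing, a lookup table may be easier to maintain than one regex per group. The following is only a sketch of that idea, not what the site runs: the %ascii_for table and the asciify_entities helper are hypothetical names, and the table covers just a few numeric references.

```perl
use strict;
use warnings;

# Hypothetical lookup table: numeric character reference => ASCII stand-in.
my %ascii_for = (
    8211 => '-',   # en dash
    8212 => '-',   # em dash
    8216 => "'",   # left single quote
    8217 => "'",   # right single quote
    8220 => '"',   # left double quote
    8221 => '"',   # right double quote
    8230 => '...', # horizontal ellipsis
);

# Hypothetical helper, not part of spider.pl.
sub asciify_entities {
    my ($html) = @_;
    # $1 is the whole reference, $2 its number without leading zeros.
    # References missing from the table are put back unchanged.
    $html =~ s/(&#0*(\d+);)/exists $ascii_for{$2} ? $ascii_for{$2} : $1/ge;
    return $html;
}

print asciify_entities('&#8220;Hi&#8221; &#8211; it&#8217;s &#169;'), "\n";
```

Named entities could be handled the same way with a second table keyed by entity name.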

Wrapped up by updating the list of sections to ignore for search purposes and by fixing a couple more files on the web server to be served correctly as UTF-8.