Some time ago I added some UTF-8 pages to this site. I promptly started receiving one error message per page from the Swish-e search engine I’m using. Of course, the search results also showed “garbage”, since Swish-e only handles single-byte characters.
Today I got tired of the errors arriving in my mailbox, so I dug into the code and fixed things.
It turns out the errors came from HTML::LinkExtor and had to do with changes in Perl’s internal character handling. Fortunately, the HTTP::Response object had also grown a decoded_content method that returns the “right” data.
--- /usr/pkg/lib/swish-e/spider.pl      2006-02-08 00:11:29.000000000 +0200
+++ spider.pl   2008-06-01 19:56:16.000000000 +0300
@@ -1138,7 +1138,8 @@
     # Extract out links (if not too deep)
-    my $links_extracted = extract_links( $server, \$content, $response )
+    my $dcontent = $response->decoded_content;
+    my $links_extracted = extract_links( $server, \$dcontent, $response )
         unless defined $server->{max_depth} && $depth >= $server->{max_depth};
I kept the changes minimal, although I have some concerns about, for example, compressed content from the web server. It works for me, though, so I will not be spending any more time on that.
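To illustrate why the patch passes decoded_content on to the link extractor, here is a minimal sketch of the difference between the two accessors. It assumes HTTP::Response (from libwww-perl / HTTP::Message) is installed; the response is built by hand, and the variable names are made up for the example rather than taken from spider.pl.

```perl
use strict;
use warnings;
use HTTP::Response;

# UTF-8 bytes for "café": the é takes two bytes (0xC3 0xA9).
my $raw = "caf\xC3\xA9";

# Hand-built response standing in for what the spider would fetch.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    $raw,
);

my $bytes = $response->content;          # undecoded bytes as received
my $chars = $response->decoded_content;  # Perl character string

printf "content:         %d bytes\n",      length $bytes;
printf "decoded_content: %d characters\n", length $chars;
```

The byte string is one element longer than the character string, because the é is a single character once decoded.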
The contents still needed to get indexed more intelligently. The “fix” I picked for that was to recode the content to ISO-8859-1. This code in the filter_content callback did the trick:
    # Make the content always iso-8859-1
    for ($response->header('content-type')) {
        # Capture in list context; "my ... if" has undefined behavior in Perl
        my ($charset) = /\bcharset="?([^";\s]+)/i;
        if ($charset && lc($charset) ne 'iso-8859-1') {
            $$content = Encode::encode(
                'iso-8859-1',
                $response->decoded_content
            );
        }
    }
Text that cannot be represented in ISO-8859-1 is lost, but fortunately that is all just supplemental information on my site.
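As a rough illustration of what “lost” means here, the sketch below uses only the core Encode module. The helper charset_of is hypothetical (it is not part of Swish-e or spider.pl), and the sample string is invented. With Encode's default CHECK setting, characters that have no ISO-8859-1 equivalent are replaced by a question mark.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: pull the charset parameter out of a Content-Type value.
sub charset_of {
    my ($content_type) = @_;
    return unless defined $content_type;
    my ($charset) = $content_type =~ /\bcharset\s*=\s*"?([^\s";]+)/i;
    return defined $charset ? lc $charset : undef;
}

# Characters: "café", a space, an em dash, a space, and a euro sign.
my $text = "caf\x{e9} \x{2014} \x{20ac}";

# The é survives; the em dash and euro sign have no Latin-1 code point
# and are substituted with "?" under the default CHECK value.
my $latin1 = Encode::encode('iso-8859-1', $text);

print charset_of('text/html; charset=UTF-8'), "\n";
print length($latin1), " bytes after recoding\n";
```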
Staring at the search results a lot reminded me of another long-standing issue: the “fancy quotes” were not handled well; they were simply stripped away. As a quick and dirty fix I added substitutions for the most common ones at the end of the filter_content callback.
    $$content =~ s,\&(?:times|#215);,x,gs;
    $$content =~ s,\&#(8211|8212);,-,gs;
    $$content =~ s,\&#(0?34|8216|8217|8218|8242);,',gs;
    $$content =~ s,\&#(8220|8221|8222|8243);,",gs;
    $$content =~ s,\&(?:hellip|#8230);,...,gs;
    $$content =~ s,\&(?:trade|#8482);,(tm),gs;
    $$content =~ s,\&#0?38;,&,gs;
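The same idea can be sketched in a table-driven form, which may be easier to extend than one substitution per group. This is an alternative illustration, not the code running on this site: the %ascii_for table, the asciify_entities function, and the sample string are all made up for the example.

```perl
use strict;
use warnings;

# Hypothetical table: numeric character references mapped to ASCII stand-ins.
my %ascii_for = (
    8211 => '-',    8212 => '-',    # en dash, em dash
    8216 => "'",    8217 => "'",    # single curly quotes
    8220 => '"',    8221 => '"',    # double curly quotes
    8230 => '...',                  # horizontal ellipsis
    8482 => '(tm)',                 # trade mark sign
);

sub asciify_entities {
    my ($html) = @_;
    # Replace only the references listed in the table; leave all others alone.
    $html =~ s/&#(\d+);/exists $ascii_for{$1} ? $ascii_for{$1} : "&#$1;"/ge;
    return $html;
}

my $sample = 'He said &#8220;wait&#8230;&#8221; &#8211; then left. &#169;';
my $clean  = asciify_entities($sample);
print "$clean\n";
```

Unlike the substitutions above, this sketch does not handle named entities or zero-padded forms such as `&#034;`.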
Wrapped up by updating the list of sections to ignore for search purposes and by fixing a couple more files on the web server to be served correctly as UTF-8.