Some time ago I added some UTF-8 pages to this site. I promptly started receiving one error message per page from the Swish-e search engine I’m using. Of course, the search results also showed “garbage”, since Swish-e only handles single-byte characters.
Today I got tired of the errors arriving in my mailbox, so I dug into the code and fixed things.
It turns out the errors came from HTML::LinkExtor and had to do with changes in Perl’s internal character handling. Fortunately, the HTTP::Response object had also grown a decoded_content method that returns the “right” data.
--- /usr/pkg/lib/swish-e/spider.pl 2006-02-08 00:11:29.000000000 +0200
+++ spider.pl 2008-06-01 19:56:16.000000000 +0300
@@ -1138,7 +1138,8 @@
     # Extract out links (if not too deep)
-    my $links_extracted = extract_links( $server, \$content, $response )
+    my $dcontent = $response->decoded_content;
+    my $links_extracted = extract_links( $server, \$dcontent, $response )
         unless defined $server->{max_depth} && $depth >= $server->{max_depth};
I kept the changes minimal, although I have some concerns about, e.g., compressed content from the web server. It works for me, though, so I will not be spending any more time on that.
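For context, the difference between the raw body and the decoded one comes down to bytes versus characters. Here is a minimal sketch using only the core Encode module. It is not the spider's code, and the sample string and variable names are made up for illustration:

```perl
use strict;
use warnings;
use Encode ();

# Raw UTF-8 bytes as a web server might send them: "Päivää"
my $bytes = "P\xC3\xA4iv\xC3\xA4\xC3\xA4";

# Decoding turns the byte string into a Perl character string, which is
# roughly what decoded_content does based on the response's charset.
my $chars = Encode::decode('UTF-8', $bytes);

# prints: bytes: 9, characters: 6
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
```

A parser that receives the byte string sees nine separate "characters", which is one way such mismatches can surface as warnings.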
The contents still needed to get indexed more intelligently. The “fix” I picked for that was to recode the content in ISO-8859-1. This code in the filter_content callback did the trick:
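# Make the content always iso-8859-1
require Encode;    # harmless if the module is already loaded
for ($response->header('content-type')) {
    # A list assignment avoids "my $x = ... if ...", whose behaviour
    # is undefined when the condition is false
    my ($charset) = /\bcharset=([^;]+)/i;
    if ($charset && lc($charset) ne 'iso-8859-1') {
        $$content = Encode::encode(
            'iso-8859-1',
            $response->decoded_content
        );
    }
}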
Text that cannot be represented in ISO-8859-1 is lost, but fortunately that is all just supplemental information on my site.
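To illustrate what “lost” means here, the following sketch recodes a short UTF-8 string with the core Encode module. The helper name to_latin1 and the sample input are hypothetical. It relies on Encode's default behaviour of substituting a question mark for characters the target encoding cannot represent:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: recode a byte string from $charset to ISO-8859-1.
sub to_latin1 {
    my ($bytes, $charset) = @_;
    my $chars = Encode::decode($charset, $bytes);
    # With the default CHECK argument, unmappable characters become "?".
    return Encode::encode('iso-8859-1', $chars);
}

# "ä €" in UTF-8: the ä fits in ISO-8859-1, the euro sign does not.
my $latin1 = to_latin1("\xC3\xA4 \xE2\x82\xAC", 'UTF-8');

# prints: E4 20 3F
print join(' ', map { sprintf '%02X', ord($_) } split //, $latin1), "\n";
```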
Staring at the search results a lot reminded me of another long-standing issue: the “fancy quotes” were not handled well; they were just stripped away. As a quick and dirty fix I added substitutions for the most common ones at the end of the filter_content callback.
$$content =~ s,\&times;,x,gs;
$$content =~ s,\&#(8211|8212);,-,gs;
$$content =~ s,\&#(8216|8217|8218|8242);,',gs;
$$content =~ s,\&#(0?34|8220|8221|8222|8243);,",gs;
$$content =~ s,\&hellip;,...,gs;
$$content =~ s,\&trade;,(tm),gs;
$$content =~ s,\&#0?38;,&,gs;
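If the list keeps growing, a lookup table may be easier to maintain than one substitution per group. A minimal sketch of that alternative, not what the callback actually uses: the names %ascii_for and asciify_entities are made up, and only numeric character references are covered.

```perl
use strict;
use warnings;

# Numeric character references mapped to ASCII stand-ins.
my %ascii_for = (
    8211 => '-',   8212 => '-',    # en dash, em dash
    8216 => "'",   8217 => "'",    # single curly quotes
    8220 => '"',   8221 => '"',    # double curly quotes
    8230 => '...',                 # ellipsis
);

# Hypothetical helper: replace known references, leave the rest alone.
sub asciify_entities {
    my ($text) = @_;
    $text =~ s{(&#0*(\d+);)}{ exists $ascii_for{$2} ? $ascii_for{$2} : $1 }ge;
    return $text;
}

# prints: "Hi" - it's &#169;
print asciify_entities('&#8220;Hi&#8221; &#8211; it&#8217;s &#169;'), "\n";
```

Unknown references such as &#169; pass through unchanged, so the indexer still sees whatever it saw before.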
Wrapped up by updating the list of sections to ignore for search purposes and by fixing a couple more files on the web server to be served correctly as UTF-8.