Summary.Net Archives
 

Re: [Summary-Talk] unicode / multibyte



On 10/17/03 12:50 PM Cameron Knowlton (cameronk@igods.com) wrote:

>I have a client who gets a fair amount of Japanese traffic. I see in 
>their search phrases what obviously started as Kanji.
>
>Unfortunately, we're not able to reverse what we have back to the 
>original multibyte characters.
>
>Is this a shortcoming of the client's web server / logger, or are we 
>pooched even if they do capture the multibyte strings? Do we have a 
>snowball's chance of analyzing these with Summary?

As far as I can tell, this is impractical, if not hopeless. In order to 
display the characters correctly you need to know which particular 
character set/encoding they come from. There are a couple of different 
character sets possible (though Unicode is by far the most common), and 
several different character encodings that might have been used. The 
logged search string appears to carry only the bytes, with nothing to 
say how they were encoded, so there doesn't seem to be any reasonable 
way to know which encoding applies to a particular search string.

If you happen to know the character set/encoding used with a particular 
string, you should (in theory) be able to transfer the string from 
Summary into a document in that set/encoding and get it to display. 
However, this will be tricky in practice, as many programs you might use 
to do this will force the text into 8-bit characters during the paste.
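
One way around the paste problem is to do the conversion in a script 
rather than by hand. The following is only a rough sketch in Python, not 
anything Summary does. It assumes the raw log line still holds the search 
phrase as percent-escaped bytes (which may not be true of every logger), 
and decode_phrase is a made-up name for illustration.

```python
from urllib.parse import unquote_to_bytes


def decode_phrase(raw, encoding):
    """Decode a percent-escaped search phrase using a known encoding.

    Assumption: `raw` is the phrase exactly as it appeared in the
    referrer query string, e.g. "%E6%97%A5%E6%9C%AC%E8%AA%9E".
    """
    # In a query string "+" stands for a space; a literal plus is %2B.
    data = unquote_to_bytes(raw.replace("+", " "))
    # Bytes that are invalid in this encoding become U+FFFD, not an error.
    return data.decode(encoding, errors="replace")


if __name__ == "__main__":
    # UTF-8 bytes for a three-character Japanese word.
    print(decode_phrase("%E6%97%A5%E6%9C%AC%E8%AA%9E", "utf-8"))  # 日本語
```

Decoding the same bytes with the wrong encoding name gives garbage or 
replacement characters, which is the underlying problem: the script only 
helps once you know which encoding to pass in.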

I suspect that displaying the strings correctly is possible in theory. 
The search engine they came from presumably knows how to display them. So 
it might be possible to build a database of search engines and how they 
each display characters, tag the search phrases with that information as 
they come in, convert everything to a uniform character set/encoding, and 
then display that.
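
To make the idea concrete, here is a minimal sketch in Python of what 
such a lookup might look like. Everything in it is an assumption for 
illustration: the host names are placeholders, the parameter names and 
encodings in the table are invented rather than taken from any real 
search engine, and search_phrase is a hypothetical helper, not part of 
Summary.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical table: referrer host -> (query parameter, encoding).
# A real table would have to be built and maintained by hand.
ENGINES = {
    "search.example.jp": ("q", "shift_jis"),
    "find.example.com": ("p", "euc_jp"),
    "www.example.org": ("q", "utf-8"),
}


def search_phrase(referrer):
    """Return the search phrase as Unicode text, or None if unknown."""
    parts = urlsplit(referrer)
    entry = ENGINES.get(parts.hostname or "")
    if entry is None:
        return None  # engine not in the table, so the encoding is unknown
    param, encoding = entry
    # parse_qs undoes the percent-escapes and decodes the bytes for us.
    values = parse_qs(parts.query, encoding=encoding, errors="replace")
    found = values.get(param)
    return found[0] if found else None
```

Once every phrase is a Unicode string it can be written out in one 
uniform encoding such as UTF-8. The weak point is the table itself: an 
engine that accepts more than one encoding, or changes its default, 
would silently produce wrong output.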

If anyone knows of a simpler way to do this I would love to hear about it.

Jason

-----------------
Jason@Summary.Net
-----------------
Dr. Seuss books . . . can be read and enjoyed on several levels. For
example, 'One Fish Two Fish, Red Fish Blue Fish' can be deconstructed
as a searing indictment of the narrow-minded binary counting system.
  -- Peter van der Linden, Expert C Programming, Deep C Secrets
-------------
Go to <http://summary.net/list.html> to update subscription info.