Re: [Summary-Talk] unicode / multibyte
On 10/17/03 12:50 PM Cameron Knowlton (cameronk@igods.com) wrote:

>I have a client who gets a fair amount of Japanese traffic. I see in
>their search phrases what obviously started as Kanji.
>
>Unfortunately, we're not able to reverse what we have back to the
>original multibyte characters.
>
>Is this a shortcoming of the client's web server / logger, or are we
>pooched even if they do capture the multibyte strings? Do we have a
>snowball's chance of analyzing these with Summary?

As far as I can tell, this is hopeless/impractical. In order to display the characters correctly you need to know which particular character set/encoding they come from. There are a couple of different character sets possible (though Unicode is by far the most common), and several different character encodings that might have been used. There doesn't seem to be any reasonable way to know which one applies to a particular search string.

If you happen to know the character set/encoding used with a particular string, you should (in theory) be able to transfer the string from Summary into a document in that set/encoding and get it to display. However, this will be tricky in practice, as many programs you might use to do this will force any such transfer into 8-bit characters during the paste.

I suspect that displaying the strings correctly is possible in theory. The search engine they came from presumably knows how to display them. So it might be possible to build a database of search engines and how they each display characters, tag the search phrases with that information as they come in, convert everything to a uniform character set/encoding, and then display that.

If anyone knows of a simpler way to do this I would love to hear about it.

Jason

-----------------
Jason@Summary.Net
-----------------

Dr. Seuss books . . . can be read and enjoyed on several levels. For example, 'One Fish Two Fish, Red Fish Blue Fish' can be deconstructed as a searing indictment of the narrow-minded binary counting system.
-- Peter van der Linden, Expert C Programming, Deep C Secrets

-------------
Go to <http://summary.net/list.html> to update subscription info.
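
For anyone who wants to experiment with the "database of search engines" idea from the message above, here is a rough Python sketch. It is not something Summary does. The host names, the ENGINE_ENCODINGS table, and the decode_phrase helper are all made up for illustration, and it assumes the log still holds the phrase in percent-encoded form. The fallback guessing is only a heuristic: bytes in one Japanese encoding will sometimes decode without error as another and come out as the wrong characters.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical lookup table: the encoding each search engine is assumed to use.
# These hosts and values are placeholders, not real data.
ENGINE_ENCODINGS = {
    "search.example.jp": "shift_jis",
    "search.example.com": "utf-8",
}

# Encodings to try, in order, when the engine is unknown (an assumption).
FALLBACK_ENCODINGS = ("utf-8", "shift_jis", "euc_jp")


def decode_phrase(raw_query, engine_host=None):
    """Return (text, encoding_used) for a percent-encoded search phrase."""
    # Query strings use '+' for spaces; a literal plus arrives as %2B.
    data = unquote_to_bytes(raw_query.replace("+", " "))

    candidates = []
    if engine_host in ENGINE_ENCODINGS:
        candidates.append(ENGINE_ENCODINGS[engine_host])
    candidates.extend(FALLBACK_ENCODINGS)

    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue

    # Last resort: latin-1 never fails, but the result is only a byte dump.
    return data.decode("latin-1"), "latin-1"
```

For example, decode_phrase("%E6%97%A5%E6%9C%AC%E8%AA%9E") returns the phrase 日本語 along with "utf-8". Once every phrase has been decoded to a Python string it can be written back out in one uniform encoding such as UTF-8.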