[bedevtalk] BFile::Read and foreign characters
Pete Goodeve
pete at jwgibbs.cchem.berkeley.edu
Fri Jan 20 17:09:28 BRST 2006
On Fri, Jan 20, 2006 at 08:00:00AM -0800, Marco Nelissen wrote:
> Note that with the code you posted, there is an edge-case where things
> could potentially go wrong, namely if there is a UTF-8 multibyte
> character right on a buffer-boundary. In that case convert_to_utf8()
> won't be able to convert that last character, because it doesn't have
> all the bytes for it. You will have to take into account how many bytes
> convert_to_utf8() actually used when doing the next Read().
I had to solve this problem for one of my apps ('matt'). I did so by
using a buffer that is actually two bytes longer than the 'specified'
length. I first attempt to fill the specified amount (using fread
rather than a BFile in my case), and if I *have* filled the buffer
(otherwise I assume EOF) I check the last couple of bytes to see if
they are part of a multibyte UTF-8 character; if so, I read the appropriate
extra bytes into the 'overflow' section. Here's the approximate code:
    int n = fread(buff, 1, BUFLEN, f);
    bufend = buff + n;
    if (strip->useutf && n == BUFLEN) {          // otherwise assume EOF
        // is it UTF? -- max of 3 bytes (4-byte groups are not handled)
        int extra = 0;
        if ((*(bufend-2) & 0xf0) == 0xe0)        // incomplete 3-byte group
            extra = 1;                           // one more needed
        else if ((*(bufend-1) & 0xf0) == 0xe0)   // start of 3-byte group
            extra = 2;                           // get 2 more
        else if ((*(bufend-1) & 0xe0) == 0xc0)   // start of 2-byte group
            extra = 1;
        if (extra) {
            int got = fread(bufend, 1, extra, f);  // goes in the 'overflow' bytes
            n += got;
            bufend += got;
        }
    }
[Yeah, I know... it isn't very error-proof -- especially if it gets
8-bit non-UTF. Improve as you wish...]
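
For anyone who does want to improve it, here is a rough sketch of a slightly
more defensive version of the same idea in plain C. It is an illustration
under stated assumptions, not the code from 'matt': the helper names
utf8_missing_tail() and read_utf8_chunk() are made up for this example, and
because it also recognises 4-byte sequences, the buffer is assumed to have
three spare bytes past the nominal length rather than two. Stray 8-bit
non-UTF bytes are simply left alone.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: how many more bytes are needed to complete a UTF-8
 * sequence that is cut off at the end of buf[0..len)?  Returns 0 if the
 * buffer ends on a character boundary or the tail does not look like UTF-8. */
static size_t utf8_missing_tail(const unsigned char *buf, size_t len)
{
    size_t back;

    /* Walk back over at most 3 trailing bytes, looking for a lead byte. */
    for (back = 1; back <= 3 && back <= len; back++) {
        unsigned char c = buf[len - back];
        size_t need;

        if ((c & 0xC0) == 0x80)
            continue;                           /* continuation byte */
        if ((c & 0x80) == 0x00)      need = 1;  /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) need = 2;  /* 2-byte lead */
        else if ((c & 0xF0) == 0xE0) need = 3;  /* 3-byte lead */
        else if ((c & 0xF8) == 0xF0) need = 4;  /* 4-byte lead */
        else return 0;                          /* not a valid lead byte */

        return need > back ? need - back : 0;
    }
    return 0;   /* empty buffer, or nothing but continuation bytes */
}

/* Hypothetical wrapper: fill buf with up to buflen bytes, then pull in the
 * rest of a split character.  buf must have room for buflen + 3 bytes. */
static size_t read_utf8_chunk(FILE *f, unsigned char *buf, size_t buflen)
{
    size_t n = fread(buf, 1, buflen, f);

    if (n == buflen)                            /* otherwise assume EOF */
        n += fread(buf + n, 1, utf8_missing_tail(buf, n), f);
    return n;
}
```

The return value of read_utf8_chunk() is the number of bytes actually in the
buffer, so the caller can pass exactly that many on to the conversion step.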
-- Pete --