[bedevtalk] BFile::Read and foreign characters

Pete Goodeve pete at jwgibbs.cchem.berkeley.edu
Fri Jan 20 17:09:28 BRST 2006


On Fri, Jan 20, 2006 at 08:00:00AM -0800, Marco Nelissen wrote:
> Note that with the code you posted, there is an edge-case where things
> could potentially go wrong, namely if there is a UTF-8 multibyte
> character right on a buffer-boundary. In that case convert_to_utf8()
> won't be able to convert that last character, because it doesn't have
> all the bytes for it. You will have to take into account how many bytes
> convert_to_utf8() actually used when doing the next Read().

I had to solve this problem for one of my apps ('matt').  I did so by
using a buffer that is actually two bytes longer than the 'specified'
length.  I first attempt to fill the specified amount (using fread
rather than a BFile in my case), and if I *have* filled the buffer
(otherwise I assume EOF) I check the last couple of bytes to see if
they are part of an incomplete multibyte UTF-8 sequence; if so, I read
the missing bytes into the 'overflow' section.  Here's the approximate code:

 int n = fread(buff, 1, BUFLEN, f);
 bufend = buff+n;
 if (strip->useutf && n == BUFLEN) {	// otherwise assume EOF
	// is it UTF? -- max of 3 bytes
	int extra = 0;			// bytes still needed
	if ((*(bufend-2) & 0xe0) == 0xe0)	// incomplete 3-byte group
		extra = 1;			// one more needed
	else if ((*(bufend-1) & 0xe0) == 0xe0)	// start of 3-byte group
		extra = 2;			// get 2 more
	else if ((*(bufend-1) & 0xe0) == 0xc0)	// 2-byte group
		extra = 1;
	if (extra) {			// advance only by what was actually read
		int got = fread(bufend, 1, extra, f);
		n += got;
		bufend += got;
	}
 }

[Yeah, I know... it isn't very error-proof -- especially if it gets
8-bit non-UTF.  Improve as you wish...]
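
For anyone who wants a starting point for that improvement, here is a
rough sketch of the same idea that also covers 4-byte sequences.  It is
not what 'matt' actually does: the names utf8_bytes_missing and
read_utf8_chunk are made up for illustration, and it assumes the caller
supplies a buffer with three spare bytes rather than two.  It only looks
at the bit patterns of the last few bytes, so it still does not
validate the input.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: how many more bytes are needed to complete the
 * UTF-8 sequence that is cut off at the end of buf[0..len)?
 * Returns 0 if the buffer appears to end on a character boundary. */
static size_t utf8_bytes_missing(const unsigned char *buf, size_t len)
{
	size_t back;

	/* Look back over at most 3 bytes for the most recent lead byte. */
	for (back = 1; back <= 3 && back <= len; back++) {
		unsigned char c = buf[len - back];
		size_t need;

		if ((c & 0xc0) == 0x80)
			continue;	/* continuation byte, keep looking */
		if ((c & 0xe0) == 0xc0)
			need = 2;	/* 110xxxxx: 2-byte group */
		else if ((c & 0xf0) == 0xe0)
			need = 3;	/* 1110xxxx: 3-byte group */
		else if ((c & 0xf8) == 0xf0)
			need = 4;	/* 11110xxx: 4-byte group */
		else
			return 0;	/* ASCII, or not a valid lead byte */

		return need > back ? need - back : 0;
	}
	/* Three continuation bytes in a row: either the tail of a
	 * complete 4-byte character or invalid input; read nothing extra. */
	return 0;
}

/* Hypothetical wrapper: read up to buflen bytes, then top up so the
 * chunk ends on a character boundary.  buf must have room for
 * buflen + 3 bytes.  Returns the number of bytes actually stored. */
static size_t read_utf8_chunk(FILE *f, unsigned char *buf, size_t buflen)
{
	size_t n = fread(buf, 1, buflen, f);

	if (n == buflen) {		/* otherwise assume EOF */
		size_t missing = utf8_bytes_missing(buf, n);
		if (missing)
			n += fread(buf + n, 1, missing, f);
	}
	return n;
}
```

The same helper would also work for the approach Marco describes: rather
than reading extra bytes, note how many bytes at the end of the chunk
belong to the unfinished sequence and carry them over to the front of
the buffer before the next read.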
					-- Pete --


