Wstring ? international chars ?

binbinhfr
Posts: 78
Joined: May 9th, 2019, 10:57 pm

Re: Wstring ? international chars ?

Post by binbinhfr » April 5th, 2020, 4:13 pm

Slidy wrote:
April 5th, 2020, 3:28 am
For simple concatenation, you don't need anything complicated, at least with UTF8. Concatenation, finding substrings, and single ASCII char operations all work on UTF8 strings as they would on dumb ASCII char arrays.
By UTF8 string, you mean std::u32string that can handle anything, and will work on a true unicode basis ?
Slidy wrote:
April 5th, 2020, 3:28 am
Overall I'd recommend you just stick to char/std::string with UTF8 and make sure you convert to UTF16 wide string for any WinAPI stuff.
I am using the SDL2 lib, so I do not use a lot of WinAPI stuff.
albinopapa wrote:
April 5th, 2020, 7:19 am
If you're still looking for information and example, I just so happened to come across this on stackoverflow
https://stackoverflow.com/questions/257 ... or-wchar-t
Well, it's still a little cryptic to me... What I did understand is that in UTF8 every byte of a non-ASCII character has its upper bit set. That's a good way to filter them. ;-)

Slidy
Posts: 80
Joined: September 9th, 2017, 1:19 pm

Re: Wstring ? international chars ?

Post by Slidy » April 6th, 2020, 4:38 am

No, std::u32string is (as the name implies) UTF32 and not UTF8. UTF32 has its own uses, but I don't think you really need it.

By UTF8 string I mean just a regular char array/std::string that uses UTF8 encoding. Here's an example of what concatenating 2 UTF8 strings would look like:

Code:

std::string utf8_str1 = GetSomeUtf8String();
std::string utf8_str2 = GetAnotherUtf8String();
std::string concatenated = utf8_str1 + utf8_str2;
Here's an example of what finding a UTF8 substring looks like:

Code:

std::string utf8_str1 = GetSomeUtf8String();
std::string utf8_str2 = GetAnotherUtf8String();
size_t found = utf8_str1.find(utf8_str2);
if (found != std::string::npos)
{
    // Found: 'found' is the byte offset of the match
}
Here's an example for looking through a string for an ASCII character:

Code:

std::string utf8_str = GetSomeUtf8String();
for( size_t i = 0; i < utf8_str.size(); i++ )
{
  if( utf8_str[i] == '9' )
  {
    // Found the character '9'
  }
}
And finally here's an example of passing a UTF8 encoded string into a function that expects an ASCII encoded string:

Code:

void DoSomethingWithAsciiStr( const char* ascii_str );

std::string utf8_str = GetSomeUtf8String();
DoSomethingWithAsciiStr( utf8_str.c_str() );
Notice how everything above works exactly as it would if you had a normal ASCII encoded string. UTF8 encoding was specifically designed in this way so the above stuff "just works".
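
To make the "single ASCII char" case concrete, here is a minimal sketch. The helper name FindAsciiChar is made up for this example and is not from any library. It relies only on the fact that every byte of a multi-byte UTF8 sequence has its high bit set, so an ASCII byte can never match in the middle of a non-ASCII character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte index of the first occurrence of an
// ASCII character in a UTF-8 string, or std::string::npos if it is absent.
std::size_t FindAsciiChar( const std::string& utf8_str, char ascii_char )
{
    // Precondition: the needle really is ASCII (0x00..0x7F).
    assert( ( static_cast<unsigned char>( ascii_char ) & 0x80 ) == 0 );

    // A plain byte search is enough: lead and continuation bytes of
    // multi-byte sequences are all >= 0x80, so they cannot equal the needle.
    return utf8_str.find( ascii_char );
}
```

Keep in mind that the result is a byte offset, not a character index. For "caf\xC3\xA9 9" (the é takes two bytes) the '9' sits at offset 6, even though it is the sixth visible character.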

binbinhfr
Posts: 78
Joined: May 9th, 2019, 10:57 pm

Re: Wstring ? international chars ?

Post by binbinhfr » April 6th, 2020, 10:10 pm

ok thanks.

The problem comes more when you want to deal with the number of true characters. std::string::length returns the number of bytes, not the number of real chars, and substr won't necessarily cut at real-char edges. But maybe I can handle that.
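
One possible way to handle both cases, as a rough sketch: count code points by skipping UTF8 continuation bytes (bit pattern 10xxxxxx), and cut prefixes only where a new code point starts. The names IsContinuationByte, CountCodePoints and PrefixByCodePoints are invented for this example. It assumes the input is already valid UTF8, and it counts code points, which is not always the same as what a user sees as one character (combining accents, for instance).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
inline bool IsContinuationByte( char ch )
{
    return ( static_cast<unsigned char>( ch ) & 0xC0 ) == 0x80;
}

// Counts code points by counting every byte that is NOT a continuation byte.
// Assumes valid UTF-8; no validation is performed.
std::size_t CountCodePoints( const std::string& utf8_str )
{
    std::size_t count = 0;
    for( char ch : utf8_str )
    {
        if( !IsContinuationByte( ch ) )
        {
            ++count;
        }
    }
    assert( count <= utf8_str.size() ); // never more code points than bytes
    return count;
}

// Returns at most max_code_points code points from the start of utf8_str,
// cutting only at a code point boundary.
std::string PrefixByCodePoints( const std::string& utf8_str, std::size_t max_code_points )
{
    std::size_t seen = 0;
    std::size_t i = 0;
    for( ; i < utf8_str.size(); i++ )
    {
        if( !IsContinuationByte( utf8_str[i] ) )
        {
            if( seen == max_code_points )
            {
                break; // i is the first byte of the code point we must drop
            }
            ++seen;
        }
    }
    return utf8_str.substr( 0, i );
}
```

With "h\xC3\xA9llo" (héllo), size() reports 6 bytes while CountCodePoints reports 5, and PrefixByCodePoints(..., 2) keeps the whole two-byte é instead of splitting it.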

I just wonder what you mean in your last example with c_str: if you treat a UTF8 string (converted to char* via c_str) as an ASCII string, what happens when you encounter a char with the upper bit set? Those chars are not ASCII and belong to a multi-byte Unicode character... So I suppose that DoSomethingWithAsciiStr must deal with that and ignore chars with the upper bit set?

Slidy
Posts: 80
Joined: September 9th, 2017, 1:19 pm

Re: Wstring ? international chars ?

Post by Slidy » April 7th, 2020, 3:41 am

binbinhfr wrote:
April 6th, 2020, 10:10 pm
The problem comes more when you want to deal with the number of true characters. std::string::length returns the number of bytes, not the number of real chars, and substr won't necessarily cut at real-char edges. But maybe I can handle that.
As I mentioned before, the concept of "true character" is kind of iffy in Unicode. It can mean different things in different contexts. Refer to the link I posted earlier: http://utf8everywhere.org/#characters. Most of the time you don't even really need those operations anyway.
binbinhfr wrote:
April 6th, 2020, 10:10 pm
I just wonder what you mean in your last example with c_str: if you treat a UTF8 string (converted to char* via c_str) as an ASCII string, what happens when you encounter a char with the upper bit set? Those chars are not ASCII and belong to a multi-byte Unicode character... So I suppose that DoSomethingWithAsciiStr must deal with that and ignore chars with the upper bit set?
Obviously, if the UTF8 string includes characters that aren't part of the ASCII character set, the function won't know what to do with them. What I meant was that ASCII characters in UTF8 are encoded in exactly the same way as they are in an ASCII string, so you can double up and use the same container for both ASCII-only strings and UTF8 strings. With UTF16, by contrast, even if you only have characters from the ASCII character set, you still have to do some kind of conversion to go from a wide string (wchar_t) to a narrow string (char).
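
If the function on the receiving end really is ASCII-only, one option is to check or sanitize the string before handing it over. The sketch below is only an illustration: IsAsciiOnly and ToAsciiLossy are hypothetical helpers, and replacing each non-ASCII code point with '?' is just one possible policy. It assumes the input is valid UTF8.

```cpp
#include <cassert>
#include <string>

// True if no byte has its high bit set, i.e. the UTF-8 string is also a
// valid ASCII string and can be passed to ASCII-only code unchanged.
bool IsAsciiOnly( const std::string& utf8_str )
{
    for( char ch : utf8_str )
    {
        if( static_cast<unsigned char>( ch ) & 0x80 )
        {
            return false;
        }
    }
    return true;
}

// Lossy fallback: replaces every non-ASCII code point with one placeholder.
std::string ToAsciiLossy( const std::string& utf8_str, char placeholder = '?' )
{
    assert( ( static_cast<unsigned char>( placeholder ) & 0x80 ) == 0 );

    std::string out;
    for( char ch : utf8_str )
    {
        const unsigned char byte = static_cast<unsigned char>( ch );
        if( byte < 0x80 )
        {
            out += ch;          // ASCII byte: copied as-is
        }
        else if( ( byte & 0xC0 ) != 0x80 )
        {
            out += placeholder; // lead byte: one placeholder per code point
        }
        // Continuation bytes (10xxxxxx) are dropped.
    }
    return out;
}
```

A caller could then write something like DoSomethingWithAsciiStr( ToAsciiLossy( utf8_str ).c_str() ) when IsAsciiOnly( utf8_str ) is false.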

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Wstring ? international chars ?

Post by albinopapa » April 7th, 2020, 7:24 am

Sounds like what you really need to do is make a class that sets a locale or a character set so you can determine how to handle your std::strings. Your program won't have to do a lot of conversions between character sets since that isn't the domain of your program. You just need enough of a class for switching character sets based on a chosen language.
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Wstring ? international chars ?

Post by albinopapa » April 7th, 2020, 5:26 pm

I was looking through a project called Crispy Doom and came across something related to what you are doing @binbinhfr

Code:

unsigned int TXT_DecodeUTF8(const char **ptr)
{
    const char *p = *ptr;
    unsigned int c;

    // UTF-8 decode.

    if ((*p & 0x80) == 0)                     // 1 byte (ASCII):
    {
        c = *p;
        *ptr += 1;
    }
    else if ((p[0] & 0xe0) == 0xc0            // 2 bytes:
          && (p[1] & 0xc0) == 0x80)
    {
        c = ((p[0] & 0x1f) << 6)
          |  (p[1] & 0x3f);
        *ptr += 2;
    }
    else if ((p[0] & 0xf0) == 0xe0            // 3 bytes:
          && (p[1] & 0xc0) == 0x80
          && (p[2] & 0xc0) == 0x80)
    {
        c = ((p[0] & 0x0f) << 12)
          | ((p[1] & 0x3f) << 6)
          |  (p[2] & 0x3f);
        *ptr += 3;
    }
    else if ((p[0] & 0xf8) == 0xf0            // 4 bytes:
          && (p[1] & 0xc0) == 0x80
          && (p[2] & 0xc0) == 0x80
          && (p[3] & 0xc0) == 0x80)
    {
        c = ((p[0] & 0x07) << 18)
          | ((p[1] & 0x3f) << 12)
          | ((p[2] & 0x3f) << 6)
          |  (p[3] & 0x3f);
        *ptr += 4;
    }
    else
    {
        // Decode failure.
        // Don't bother with 5/6 byte sequences.

        c = 0;
    }

    return c;
}

// Count the number of characters in a UTF-8 string.

unsigned int TXT_UTF8_Strlen(const char *s)
{
    const char *p;
    unsigned int result = 0;
    unsigned int c;

    for (p = s; *p != '\0';)
    {
        c = TXT_DecodeUTF8(&p);

        if (c == 0)
        {
            break;
        }

        ++result;
    }

    return result;
}
Just a snippet, but I thought it would be useful for creating a string class wrapper or something that handles utf8.
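
As a starting point for such a wrapper, here is a rough C++ sketch of the same idea that decodes a whole std::string into a std::u32string. DecodeUtf8 is a name made up for this example and is not part of Crispy Doom. Like the snippet above, it stops at the first malformed sequence, and it does not reject overlong encodings or surrogate code points, so treat it as an illustration and not as a full validator.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into code points.
// Decoding stops at the first malformed or truncated sequence.
std::u32string DecodeUtf8( const std::string& utf8_str )
{
    std::u32string out;
    const std::size_t n = utf8_str.size();
    std::size_t i = 0;

    while( i < n )
    {
        const unsigned char b0 = static_cast<unsigned char>( utf8_str[i] );
        std::size_t len;
        char32_t cp;

        // The lead byte tells us the sequence length and holds the top bits.
        if( b0 < 0x80 )                  { len = 1; cp = b0; }
        else if( ( b0 & 0xE0 ) == 0xC0 ) { len = 2; cp = b0 & 0x1F; }
        else if( ( b0 & 0xF0 ) == 0xE0 ) { len = 3; cp = b0 & 0x0F; }
        else if( ( b0 & 0xF8 ) == 0xF0 ) { len = 4; cp = b0 & 0x07; }
        else break; // stray continuation byte or invalid lead byte

        assert( len >= 1 && len <= 4 );
        if( i + len > n ) break; // sequence is cut off at the end of the string

        // Each continuation byte (10xxxxxx) contributes 6 more bits.
        bool ok = true;
        for( std::size_t k = 1; k < len; k++ )
        {
            const unsigned char b = static_cast<unsigned char>( utf8_str[i + k] );
            if( ( b & 0xC0 ) != 0x80 ) { ok = false; break; }
            cp = ( cp << 6 ) | ( b & 0x3F );
        }
        if( !ok ) break;

        out.push_back( cp );
        i += len;
    }
    return out;
}
```

Once the text is in a std::u32string, size() is the number of code points and substr cuts on code point boundaries, at the cost of four bytes per code point.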

binbinhfr
Posts: 78
Joined: May 9th, 2019, 10:57 pm

Re: Wstring ? international chars ?

Post by binbinhfr » April 8th, 2020, 5:52 pm

Thx man.
If you wonder whether your comments are helpful, here is a little screenshot to show you that my game is advancing. ;-)

https://imgur.com/a/RVygcv2

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Wstring ? international chars ?

Post by albinopapa » April 8th, 2020, 11:03 pm

Looks good. The scripting language looks clean, if that's what is in the left window.

Slidy
Posts: 80
Joined: September 9th, 2017, 1:19 pm

Re: Wstring ? international chars ?

Post by Slidy » April 9th, 2020, 3:26 pm

Game looks sweet :O
You should definitely post a demo once it's playable

Your scripting language has some interesting syntactic choices with everything being function-like, kind of reminds me of Lisp.

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Wstring ? international chars ?

Post by albinopapa » April 9th, 2020, 5:19 pm

Slidy wrote:
April 9th, 2020, 3:26 pm
... kind of reminds me of Lisp.
Yeah, thought it looked familiar. I don't know Lisp, just snippets from CppCon videos and around the net.
