Wstring ? international chars ?

binbinhfr
Posts: 78
Joined: May 9th, 2019, 10:57 pm

Post by binbinhfr » April 3rd, 2020, 6:57 am

Hi there,

For the moment I am developing a game using classical std::string and char. But I would like to be able to handle special characters coming from input (from std::cin, or on-the-fly keyboard input) and from files, which could be typed by foreign users (Chinese, Arabic, etc.). I wonder if there is a simple way to do this?

These characters are accessible through the Windows accessory "Character Map".
For example:

abdcefgh
àéèïöôâ
ÆÇÑÐßæñƎ
傛僀僖傕傭傯僃
ﯣﯦﺡﺸﺵﻆ

I noticed that UTF8 files created with Notepad++ can hold these extended chars.

But I wonder how to:
- read them from a file, one by one,
- compare them with classical ASCII chars,
- do string operations on them (operator[], concat, substr...),
- output them one by one on screen (cout).

Any ideas, or any useful tutorial around? (I searched on chili but did not find anything.)
I suppose that wstring can do the job, but I can't make it work as I would like to.

thanks and have a nice c++ day !

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Wstring ? international chars ?

Post by albinopapa » April 3rd, 2020, 9:20 am

You can try looking through this:

https://en.cppreference.com/w/cpp/locale/locale
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

binbinhfr
Posts: 78
Joined: May 9th, 2019, 10:57 pm

Re: Wstring ? international chars ?

Post by binbinhfr » April 3rd, 2020, 9:30 am

Wow, that's strong English for me... What is a facet?
Don't you have a C++ example that would read such UTF8 files, store them in some String, work on the String (get char, concat, etc.) and store them back into a UTF8 file?
It's for my robot programming game: I'm in the middle of writing a code editor, and I would like users to be able to use their own "letters" in the code (for comments, literal strings, etc.).

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Wstring ? international chars ?

Post by albinopapa » April 3rd, 2020, 5:13 pm

No, I don't have that information. Localization is one thing I never planned on doing.

The SFML library has a custom String class and Utf<> template classes that deal with representing those characters, but I'm not sure about reading/writing from files with different Utf encodings.

Slidy
Posts: 80
Joined: September 9th, 2017, 1:19 pm

Re: Wstring ? international chars ?

Post by Slidy » April 3rd, 2020, 5:55 pm

There are 2 main approaches with Windows.

One is to use what's known as the Unicode charset, which means instead of using a single byte for each char (std::string) you use a 2 byte wchar_t (std::wstring). This is known as UTF16. In Windows terminology whenever you see Unicode mentioned it very likely means UTF16.

The other option is to use the Multi-Byte charset (MBCS). This option still uses chars but each letter is not necessarily a single char. Depending on the letter, its length in chars varies. This can complicate things since getting the length of a string in letters isn't as simple as calling std::string::size(). UTF8 is a MBCS encoding, but be careful when using WinAPI functions that use MBCS (i.e. the ones that end in A) because by default they don't support UTF8 encoding and instead use some other legacy MBCS encodings (relevant: https://stackoverflow.com/questions/329 ... on-windows).
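To illustrate why size() is not the number of letters, here is a minimal sketch that counts Unicode code points in a UTF8 std::string. The function name count_code_points is made up for this example, and it assumes the input is valid UTF8. It relies on the fact that every byte after the first one of a multi-byte sequence has the bit pattern 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string.
// Assumes valid UTF-8. Continuation bytes look like 10xxxxxx,
// so every byte that does NOT match that pattern starts a new code point.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the two-character text "é傛" (bytes C3 A9 E5 82 9B), size() reports 5 while count_code_points reports 2. Note that a code point is still not always what a user would call a letter.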

If you want to use UTF8 with WinAPI functions you'll have to convert to UTF16 first then call the W version of things then convert back to UTF8.
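On Windows the usual tool for that conversion is MultiByteToWideChar with the CP_UTF8 code page (and WideCharToMultiByte for the way back). To show what the conversion actually does, here is a portable hand-written sketch that produces a std::u16string. The names decode_one and utf8_to_utf16 are made up for this example, and the decoder is deliberately simplified: it replaces truncated or badly formed sequences with U+FFFD but does not reject overlong encodings.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes one code point starting at utf8[i] and
// advances i past it. Returns U+FFFD for malformed or truncated input.
char32_t decode_one(const std::string& utf8, std::size_t& i)
{
    const unsigned char lead = static_cast<unsigned char>(utf8[i++]);
    char32_t cp = 0;
    int extra = 0;                        // number of continuation bytes expected
    if (lead < 0x80)              { cp = lead;        extra = 0; }
    else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; }
    else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; }
    else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; }
    else return 0xFFFD;                   // stray continuation or invalid lead byte

    for (; extra > 0; --extra) {
        if (i >= utf8.size()) return 0xFFFD;
        const unsigned char next = static_cast<unsigned char>(utf8[i]);
        if ((next & 0xC0) != 0x80) return 0xFFFD;
        cp = (cp << 6) | (next & 0x3F);
        ++i;
    }
    return cp;
}

// Converts UTF-8 to UTF-16. Code points above U+FFFF become a surrogate pair.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        char32_t cp = decode_one(utf8, i);
        if (cp >= 0x10000) {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

In a real Windows program you would probably prefer the WinAPI functions over a hand-rolled decoder. The sketch is only meant to make the byte-level mechanics visible.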

Windows just recommends you use the first Unicode (UTF16) option, more info here: https://docs.microsoft.com/en-us/cpp/te ... ew=vs-2019

That being said, I believe you can store UTF8 strings in a normal std::string and print them out as normal, you just have to be careful with certain operations & your string manipulations.
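As a rough sketch of that idea, the two functions below load a UTF8 file into a std::string and write it back unchanged. The names read_utf8_file and write_utf8_file are made up for this example. The files are opened in binary mode so the bytes are not altered, and a leading byte order mark (EF BB BF), which some editors add to UTF8 files, is skipped on read.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads a whole file as raw bytes, treated as UTF-8.
std::string read_utf8_file(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    // Drop the optional UTF-8 byte order mark if the file starts with one.
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        bytes.erase(0, 3);
    }
    return bytes;
}

// Hypothetical helper: writes the bytes back without any translation.
void write_utf8_file(const std::string& path, const std::string& utf8)
{
    std::ofstream out(path, std::ios::binary);
    if (!out) {
        throw std::runtime_error("cannot create " + path);
    }
    out.write(utf8.data(), static_cast<std::streamsize>(utf8.size()));
}
```

Printing such a string with std::cout only sends the bytes along. Whether they show up correctly depends on the console being set to UTF8 and on its font.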

binbinhfr
Posts: 78
Joined: May 9th, 2019, 10:57 pm

Re: Wstring ? international chars ?

Post by binbinhfr » April 4th, 2020, 6:05 am

So UTF16 is 2 bytes/16 bits, and UTF8 is not 1 byte/8 bits but multi-byte. Very annoying. That's why my attached test file can store a lot of weird chars. And it's 96 bytes long for 40 characters recorded...

I'll try the UTF8 <> UTF16 conversion with wstring manipulations.

Thx.
Attachment: file.txt (96 bytes)

Slidy
Posts: 80
Joined: September 9th, 2017, 1:19 pm

Re: Wstring ? international chars ?

Post by Slidy » April 4th, 2020, 9:05 am

To be clear, UTF16 is (or can be) multi-unit as well. Depending on the letter it may need more than 1 wchar_t to store it. It originated as a fixed-width 16 bit format (UCS-2) which can store up to 65536 letters, but they later realized that they needed more than that and it was extended into a variable length encoding: letters outside the first 65536 are stored as a pair of 16 bit units (a "surrogate pair"). UTF8 was designed as a variable length encoding from the start and uses 1 to 4 bytes per letter, so once UTF16 became variable length too it lost its main advantage over UTF8.
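A small sketch of what that means in practice: the length of a UTF16 string is its number of 16 bit units, not its number of code points. The function name count_code_points_utf16 is made up for this example. It treats a high surrogate (D800 to DBFF) followed by a low surrogate (DC00 to DFFF) as one code point, and counts a lone surrogate as one code point as well.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-16 string.
// A valid surrogate pair (high unit followed by low unit) counts once.
std::size_t count_code_points_utf16(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t unit = s[i];
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        if (is_high && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;                          // skip the low half of the pair
        }
        ++count;
    }
    return count;
}
```

For example, the string u"A\U0001F600" has a size() of 3 but contains 2 code points, because U+1F600 does not fit in a single 16 bit unit.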

Windows adopted UTF16 when it was first made, and to maintain backward compatibility they still use it today. I believe UTF8 is the "standard" these days and used almost everywhere except Windows which uses UTF16.

If you'd like to use UTF8, the advice I've read says to store everything as a normal char/std::string then convert to UTF16 at the last possible moment when you need to call a Windows W function.

Lastly, you should be careful with what you consider a "letter"/"character" in your strings. There are lots of definitions in Unicode for characters, you can read up on them here: http://utf8everywhere.org/#characters (I recommend reading that whole page but I linked the relevant section)
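One concrete case: "é" can be stored as the single code point U+00E9, or as "e" followed by the combining acute accent U+0301. Both look the same on screen but they are different sequences with different lengths. The sketch below (function name approx_visible_chars is made up) counts code points while skipping the Combining Diacritical Marks block, U+0300 to U+036F. It is only a rough approximation of what a user sees as a character. Proper segmentation follows the Unicode text segmentation rules and usually needs a library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: rough count of user-visible characters in a
// UTF-32 string. Only skips code points in the Combining Diacritical
// Marks block (U+0300..U+036F); real grapheme rules are more involved.
std::size_t approx_visible_chars(const std::u32string& s)
{
    std::size_t count = 0;
    for (char32_t cp : s) {
        const bool is_combining_mark = cp >= 0x0300 && cp <= 0x036F;
        if (!is_combining_mark) {
            ++count;
        }
    }
    return count;
}
```

With this, U"\u00E9" and U"e\u0301" both count as 1 visible character, even though the first has 1 code point, the second has 2, and a plain == comparison between them is false.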

It might help to know the exact context and what kind of operations you're planning on doing if you want a recommendation on how to approach this.

binbinhfr
Posts: 78
Joined: May 9th, 2019, 10:57 pm

Re: Wstring ? international chars ?

Post by binbinhfr » April 4th, 2020, 7:57 pm

Thanks for the answer Slidy.

I'm currently writing a game. I would like to make it international, and to offer different translations of the UI, menus, messages, etc. These messages will be constructed by concatenation of strings. So some of these strings will contain language-specific characters (like Â Ñ àéèïöôâ ÆÇÑÐßæñƎ 傛僀僖傕傭傯僃 ﯣﯦﺡﺸﺵﻆ). These language-specific messages will be stored in text files, one for each translation (UTF8 seems to do the job as far as I can see in Notepad++).

That's why I need to manage all this stuff. So I would like to be as "wide" as possible. Even if I do not offer every translation at the beginning, I would like to prepare my code to accept this variety of chars.

Slidy
Posts: 80
Joined: September 9th, 2017, 1:19 pm

Re: Wstring ? international chars ?

Post by Slidy » April 5th, 2020, 3:28 am

For simple concatenation, you don't need anything complicated, at least with UTF8. Concatenation, finding substrings, and single ASCII char operations all work on UTF8 strings as they would on dumb ASCII char arrays.
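The reason this works is that in UTF8 the bytes of a multi-byte letter are all 0x80 or above, so an ASCII char such as '=' or ',' can never show up in the middle of another letter. Here is a minimal sketch with made-up helper names (make_greeting, text_after). The non-ASCII text is written with byte escapes so that the example does not depend on how the compiler reads the source file.

```cpp
#include <string>

// Hypothetical helper: builds a message from translated pieces.
// Plain byte-wise concatenation keeps valid UTF-8 valid.
std::string make_greeting(const std::string& hello, const std::string& name)
{
    return hello + ", " + name + "!";
}

// Hypothetical helper: returns what follows the first ASCII separator,
// or an empty string if the separator is absent. Cutting right after an
// ASCII byte never splits a multi-byte UTF-8 sequence.
std::string text_after(const std::string& text, char separator)
{
    const std::string::size_type pos = text.find(separator);
    return pos == std::string::npos ? std::string() : text.substr(pos + 1);
}
```

For example, make_greeting("Ol\xC3\xA1", "Jos\xC3\xA9") yields the bytes of "Olá, José!". What is not safe is cutting at an arbitrary byte index with substr or operator[], since that index may fall inside a multi-byte letter.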

Overall I'd recommend you just stick to char/std::string with UTF8 and make sure you convert to UTF16 wide string for any WinAPI stuff.

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Wstring ? international chars ?

Post by albinopapa » April 5th, 2020, 7:19 am

If you're still looking for information and an example, I just happened to come across this on Stack Overflow:

https://stackoverflow.com/questions/257 ... or-wchar-t
