String Insights and the LSet trick
( As with many interesting insights and tricks, I was shown this one here: https://eileenslounge.com/viewtopic....323073#p323073 )
We are still not far enough along to finally tackle strings in the Win32 API in explicit detail, so once again we are talking around them, looking at string characteristics generally.
Some basic maths review
Once again, it is worth revising the relevant, very basic, low level computer number issues.
For clarity, in this post I am restricting myself to 16 digits ( bits ). From previous posts related to string things it may be obvious why, and if not, it will be after this revision.
Everything in this initial review section is based on conventional school maths. We will discuss the Microsoft / computer deviations from the norm in the next section, LSet Type trick to explore UTF-16 LE
We are probably all aware of the base 2 ( binary ) system, and are certainly aware of the base 10 ( decimal ) system.
We can have any base system, and the basic idea and workings are the same.
Let us consider a few bases using 16 digits for an old friend of ours, the number which in decimal is 8230
Unicode code point 8230.JPG
( In binary, so deep down in computer 0s and 1s, we need 14 digits for decimal 8230, so 16 digits is sufficient )
We consider a spread of bases with 16 digits ( bits ) :
base 2 (binary) ;
base 16 (Hexadecimal);
and base 256
The following sketch shows that fundamentally the base 2, or 0/1 state bits are the same in either base system.
The final number we see or use, whether it is
0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0
or
2 0 2 6
or
32 38
or
8230
, is just the form in which the software or system we are using chooses to present the same 0/1 state bits to us.
Although the 0/1 state thing ( a bit ) is the most fundamental computer number unit, for many reasons, in many computer systems, we consider a Byte ( 8 bits ) as a fundamental unit. For example, in computer memory, if a position has been defined as address 123456788, then the next 8 bits along ( so the next Byte along ) will have the address 123456789. The address in both cases refers to 8 bits. So a Byte could have a decimal value from 0 to ( 128+64+32+16+8+4+2+1 ) = 255, so 256 numbers, 0-255
Code:
' Base 2 (Binary) with 16 digits
'  2^15  2^14  2^13  2^12  2^11  2^10   2^9   2^8   2^7   2^6   2^5   2^4   2^3   2^2   2^1   2^0
' 32768 16384  8192  4096  2048  1024   512   256   128    64    32    16     8     4     2     1
'     0     0     1     0     0     0     0     0     0     0     1     0     0     1     1     0   - Binary ( Base 2 )
'     0+    0+ 8192+    0+    0+    0+    0+    0+    0+    0+   32+    0+    0+    4+    2+    0 = 8230   - calculating the decimal 8230
'  0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0
'
' Base 16 (Hexadecimal) with 16 digits
'   16^3 = 4096          |   16^2 = 256           |   16^1 = 16            |   16^0 = 1
'  2^3  2^2  2^1  2^0    |  2^3  2^2  2^1  2^0    |  2^3  2^2  2^1  2^0    |  2^3  2^2  2^1  2^0
'    8    4    2    1    |    8    4    2    1    |    8    4    2    1    |    8    4    2    1
'    0    0    1    0    |    0    0    0    0    |    0    0    1    0    |    0    1    1    0       ( - Binary ( Base 2 ) )
'    0+   0+   2+   0 = 2|    0+   0+   0+   0 = 0|    0+   0+   2+   0 = 2|    0+   4+   2+   0 = 6
'  ( 2 x 4096 ) + ( 0 x 256 ) + ( 2 x 16 ) + ( 6 x 1 ) = 8192 + 0 + 32 + 6 = 8230   - calculating the decimal 8230
'                       2|                       0|                       2|                       6
'  2 0 2 6   - Hexadecimal ( Base 16 )
'
' Base 256 with 16 digits
'   256^1 = 256                               |   256^0 = 1
'  2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0     |  2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0
'  128   64   32   16    8    4    2    1     |  128   64   32   16    8    4    2    1
'    0    0    1    0    0    0    0    0     |    0    0    1    0    0    1    1    0       ( - Binary ( Base 2 ) )
'    0+   0+  32+   0+   0+   0+   0+   0 = 32|    0+   0+  32+   0+   0+   4+   2+   0 = 38
'  ( 32 x 256 ) + ( 38 x 1 ) = 8192 + 38 = 8230   - calculating the decimal 8230
'                                           32|                                           38
'  32 38   - Base 256
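The hand calculations in the sketch can be cross-checked with a short coding. This is only a rough sketch of my own: the function name DigitsInBase and its parameter names are made up for illustration and are not from the referenced posts. It works by repeated division and assumes a non-negative whole number.

```vba
' Hypothetical helper (not from the original posts): lists the digits of a
' non-negative whole number in a given base, most significant digit first.
' Each digit is shown as a decimal number, so a hexadecimal F would show as 15.
Function DigitsInBase(ByVal Num As Long, ByVal Radix As Long, ByVal DigitCount As Long) As String
    Dim i As Long, Digits As String
    For i = 1 To DigitCount
        If i = 1 Then
            Digits = CStr(Num Mod Radix)                  ' least significant digit
        Else
            Digits = CStr(Num Mod Radix) & " " & Digits   ' next digit goes on the left
        End If
        Num = Num \ Radix                                 ' integer division drops the digit just used
    Next i
    DigitsInBase = Digits
End Function

Sub ShowBasesFor8230()
    Debug.Print DigitsInBase(8230, 2, 16)    ' 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0
    Debug.Print DigitsInBase(8230, 16, 4)    ' 2 0 2 6
    Debug.Print DigitsInBase(8230, 256, 2)   ' 32 38
End Sub
```

The same 16 bits are behind all three lines; only the grouping ( 1 bit, 4 bits or 8 bits per digit ) changes.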
Microsoft chose to use a number system similar to the last one in the sketch above, but with the two bytes placed the other way around, so the least significant byte comes first. They call this ( 2 byte ) UTF-16 LE, as we have discussed before. This means that we are likely to see the character ChrW(8230) , … , somehow represented in the form 38 32
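Before looking at memory, the expected 38 32 can be worked out with plain arithmetic. The following is just a minimal sketch with made up names ( LowByteFirst, LowByte, HighByte ), and it assumes a code point in the range 0 to 65535.

```vba
Sub LowByteFirst()
    Const CodePoint As Long = 8230
    Dim LowByte As Byte, HighByte As Byte
    LowByte = CodePoint Mod 256     ' 38 , the 256^0 digit ( least significant byte )
    HighByte = CodePoint \ 256      ' 32 , the 256^1 digit ( most significant byte )
    Debug.Print HighByte, LowByte   ' 32 38   conventional base 256 order
    Debug.Print LowByte, HighByte   ' 38 32   the other way around, as expected for UTF-16 LE
End Sub
```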
We explore this in the next section
LSet Type trick to explore UTF-16 LE
We will not be looking directly at strings, just at the number 8230, to investigate how it may look in memory, or rather how it might be presented to us. In other words, we investigate the theoretical memory storage that we mentioned at the end of the last section: ….. Microsoft chose to use a number system similar to the last one in the sketch above, but with the two bytes placed the other way around. They call this UTF-16 LE, as we have discussed before. This means that we are likely to see the character ChrW(8230) , … , somehow represented in the form 38 32 ….
We are interested in simple whole number variables. Long is like this, with 4 bytes. Integer is like it as well, but with 2 bytes.
If things go as we expect, then 2 bytes should be sufficient as the Destination to which we use the LSet trick to copy our number, but we will use a destination user defined type made of 6 bytes, just to see what happens
Code:
Private Type MeDestBytes
 byte0 As Byte
 byte1 As Byte
 byte2 As Byte
 byte3 As Byte
 byte4 As Byte
 byte5 As Byte
End Type

So that above is our destination.
We will use 2 different sources, the Long and Integer structures of these user defined types
Code:
Private Type UTF_16LEint
 CharBytes As Integer
End Type
Private Type UTF_16LElng
 CharBytes As Long
End Type

So the coding below, which uses those, will have two similar sections. In those two similar sections we place the number 8230 in the Long or Integer structure, which we are expecting to get put in memory in the two byte ( 256 base ) form of the last part of the sketch above, but with the bytes the other way around
Each of the two resulting bytes in memory will get copied by LSet ( "Left Set" ) to our destination memory place.
The Debug.Printed results seem to tie up with the prediction, - for example, the 38 of the last byte from our sketch above gets switched around to the first position
Code:
Sub CharBytesInMemory() ' https://www.excelfox.com/forum/showthread.php/2824-Tests-Copying-Pasting-API-Cliipboard-issues-and-Rough-notes-on-Advanced-API-stuff?p=17891&viewfull=1#post17891
Rem 1 Long
Dim LELng As UTF_16LElng, Dest As MeDestBytes
 Let LELng.CharBytes = 8230 ' In Microsoft Unicode UTF-16 LE encoding, this will look like 38 32
 LSet Dest = LELng
 Debug.Print Dest.byte0, Dest.byte1, Dest.byte2, Dest.byte3, Dest.byte4, Dest.byte5 ' 38 32 0 0 0 0   The last 2 bytes never get used here so stay at 0
Rem 2 Integer
Dim LEInt As UTF_16LEint
 Let LEInt.CharBytes = 8230
 LSet Dest = LEInt
 Debug.Print Dest.byte0, Dest.byte1, Dest.byte2, Dest.byte3, Dest.byte4, Dest.byte5 ' 38 32 0 0 0 0   The last 4 bytes never get used here so stay at 0
End Sub
As things seem to look exactly as expected, we can do a neater, tidier representation of the way Microsoft represents characters ( or rather their code point numbers ) in memory by restricting ourselves to 2 bytes everywhere, as in the initial reference
Code:
Private Type UTF_16LElng ' ( not used in the procedure below )
 CharBytes As Long
End Type
Private Type DestHiLoBytePair ' For the two bytes, as in memory for a character in Microsoft Unicode UTF-16 LE 2 byte encoding
 byteHi As Byte ' we are expecting this to get 38   ( first byte in memory: despite the name, this is the low order byte of 8230 )
 byteLo As Byte ' we are expecting this to get 32   ( second byte in memory: the high order byte of 8230 )
End Type
Private Type SourceUTF_16LEint
 CharBytes As Integer ' this will be given the code point of the single character that looks like 3 small dots, ChrW(8230) ( it is also mostly Chr(133) ) …
End Type

Sub UTF_16LE2ByteChr() ' https://eileenslounge.com/viewtopic.php?p=323073#p323073
Dim Dest As DestHiLoBytePair, Source As SourceUTF_16LEint
 Let Source.CharBytes = 8230
 LSet Dest = Source
 Debug.Print Dest.byteHi, Dest.byteLo ' 38 32
End Sub
Remember finally that we are just dealing with numbers here. All we have really done is show that the decimal number 8230 is held in a backward base 256 form when we hold it in a 2 byte Integer.
In other words, the little endian backward byte shuffle dance is done by an Integer just as it is by a Long.
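It may help to see the same thing in hexadecimal, since each byte is exactly two hexadecimal digits. This is a minimal sketch in the same spirit as the codings above. The type and variable names ( SrcInt, TwoBytes, First, Second ) are made up here for illustration, and the type declarations need to go at the top of a code module.

```vba
Private Type SrcInt      ' hypothetical name: a 2 byte source
    Value As Integer
End Type
Private Type TwoBytes    ' hypothetical name: the same 2 bytes seen one at a time
    First As Byte        ' first byte in memory
    Second As Byte       ' second byte in memory
End Type

Sub HexViewOfBytes()
    Dim Src As SrcInt, Dst As TwoBytes
    Src.Value = &H2026                               ' the same number as decimal 8230
    LSet Dst = Src
    Debug.Print Hex$(Src.Value)                      ' 2026
    Debug.Print Hex$(Dst.First), Hex$(Dst.Second)    ' 26 20
End Sub
```

The number reads 20 26 as we would write it, and 26 20 as it sits in memory: hexadecimal 26 is decimal 38, and hexadecimal 20 is decimal 32.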
Here is a quick check / correlation with a short coding, using the 1 to 1 Byte to string character phenomenon discussed at the start of this post (
https://www.excelfox.com/forum/showt...ll=1#post17885 )
Code:
Sub ChrW8230ByteArray()
Dim arrBytes() As Byte
 Let arrBytes() = "…" ' 1 to 1 Byte to string character phenomena https://www.excelfox.com/forum/showthread.php/2824-Tests-Copying-Pasting-API-Cliipboard-issues-and-Rough-notes-on-Advanced-API-stuff?p=17885&viewfull=1#post17885
 Debug.Print arrBytes(LBound(arrBytes)), arrBytes(UBound(arrBytes)) ' 38 32
End Sub
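The check can also be run in the other direction: put the two bytes 38 32 in a Byte array, assign the array to a string, and see which character comes out. This is a small sketch of my own with a made up procedure name, relying on the same 1 to 1 Byte to string behaviour.

```vba
Sub BytesBackToChr()
    Dim arrBytes() As Byte, strChr As String
    ReDim arrBytes(0 To 1)
    arrBytes(0) = 38                         ' low order byte first
    arrBytes(1) = 32                         ' high order byte second
    strChr = arrBytes                        ' the 2 bytes are taken as one UTF-16 LE character
    Debug.Print Len(strChr), AscW(strChr)    ' 1 8230
End Sub
```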




