Say I do this:
>>> 'é' #1
'\xc3\xa9'
>>> u'é' #2
u'\xe9'
>>> print u'é' #3
é
This is my understanding:
- When I pasted 'é' into my Python session, a byte array containing 2 bytes somehow landed in stdin, which Python read from. The same bytes are sent to stdout and displayed in hexadecimal form.
- This time Python has to decode the bytes: it reads sys.stdin.encoding, finds utf-8, and decodes the 2 bytes into unicode. Then I am not sure what happens. Can we send a unicode string to stdout? Or maybe Python takes the hexadecimal representation of the unicode code point, encodes it in utf-8 and sends that to stdout?
- Python decodes the 2 bytes into unicode. Then print encodes it again in utf-8 and sends the result to stdout.
Is my understanding correct?
The Python interactive interpreter echoes the result of any expression, unless that result is None. Echoing always uses the repr() function to create a usable representation. Under the hood, objects have a __repr__ special method that does all the hard work here.
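As a rough illustration of the echo step, the sketch below calls the interpreter's default display hook by hand and captures what it writes. It uses Python 3 syntax (the session in the question is Python 2), and the `echo` helper and `Point` class are hypothetical names made up for this example.

```python
import contextlib
import io
import sys


def echo(value):
    """Return the text the default interactive hook writes for value."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # sys.__displayhook__ is the original hook the interactive
        # interpreter calls with the result of each expression.
        sys.__displayhook__(value)
    return buffer.getvalue()


class Point(object):
    """Hypothetical class with a custom __repr__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return "Point(%r, %r)" % (self.x, self.y)


echo_text = echo("a\nb")        # the repr, with the newline escaped
echo_none = echo(None)          # nothing is written for None
echo_point = echo(Point(1, 2))  # the hook relies on Point.__repr__
```

The hook writes repr(value) followed by a newline, which is why the echoed text shows quotes and escapes rather than the raw characters.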
For strings, the value printed can be used directly in Python again to recreate the string; any non-printable or non-ASCII bytes are represented with an escape sequence. Newlines become \n, for example, and the two UTF-8 bytes for é are each represented with a \xhh hex escape.
Thus, for point 1, Python indeed received two bytes from the terminal and stored them in a string; the representation of that string consists of the characters ', \, x, c, 3, etc. If you pasted that back into Python, you'd get the same string value again.
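The round trip for point 1 can be checked with a small sketch. This is Python 3 syntax, where a byte string literal carries a b prefix; the Python 2 session in the question shows the same escapes without that prefix. The variable names are only illustrative.

```python
import ast

# The two UTF-8 bytes the terminal sends for é.
raw = b'\xc3\xa9'

# repr() produces plain ASCII text made of quotes, backslashes,
# the letter x and hex digits.
shown = repr(raw)

# Evaluating that text as a literal recreates the same two bytes.
round_tripped = ast.literal_eval(shown)

print(len(raw))
print(shown)
print(round_tripped == raw)
```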
For 2., you created a Unicode string object. The terminal again sent two UTF-8 bytes, but this time you told Python to parse a u'..' string literal, which is indeed decoded using sys.stdin.encoding.
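A minimal sketch of that decoding step, written so it runs on Python 3. The `assumed_encoding` value stands in for whatever sys.stdin.encoding reports; utf-8 is taken from the question, not read from a real terminal.

```python
# The two bytes the terminal sent for é.
terminal_bytes = b'\xc3\xa9'

# Assumption: the terminal encoding is utf-8, as in the question.
assumed_encoding = 'utf-8'

# Decoding turns the two bytes into one Unicode code point.
text = terminal_bytes.decode(assumed_encoding)
code_point = ord(text)

# With a mismatched encoding the same bytes decode to two characters.
mismatched = terminal_bytes.decode('latin-1')

print(len(text), hex(code_point))
print(len(mismatched))
```

The mismatched case shows why the encoding the interpreter assumes for its input matters: the bytes are the same, only the interpretation changes.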
The representation for a Unicode string object is another string literal, prefixed with u to show it is a Unicode string, not a regular string. Unicode codepoints in the range U+0080 through U+00FF (the Latin-1 range) are represented by the \xhh escape code. é is Unicode codepoint U+00E9, so it is represented by \xe9. Codepoints from U+0100 up to U+FFFF use the \uhhhh representation; for higher codepoints, \Uhhhhhhhh is used.
Again, you can copy this representation, paste it back into Python and get the exact same value again.
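The three escape forms can be compared side by side. This sketch runs on Python 3, where repr() leaves printable non-ASCII characters unescaped, so it uses the built-in ascii() to get escapes like those Python 2's repr() shows (without the u prefix). The `escape_for` helper is hypothetical and only covers non-ASCII code points.

```python
import ast


def escape_for(code_point):
    """Pick the escape form by range (non-ASCII code points only)."""
    if code_point <= 0xFF:
        return '\\x%02x' % code_point
    if code_point <= 0xFFFF:
        return '\\u%04x' % code_point
    return '\\U%08x' % code_point


# One sample per range: U+00E9, U+0142 and U+1F600.
samples = [u'\xe9', u'\u0142', u'\U0001f600']

escaped = [ascii(ch) for ch in samples]
expected = ["'" + escape_for(ord(ch)) + "'" for ch in samples]

# Each escaped form evaluates back to the original one-character string.
restored = [ast.literal_eval(text) for text in escaped]

for text in escaped:
    print(text)
```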
print writes directly to sys.stdout, and if you give print a Unicode string object, it will first use sys.stdout.encoding to encode the Unicode string value to a bytestring before writing it to sys.stdout.
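To see which bytes print ends up writing, the sketch below swaps sys.stdout for an in-memory stand-in with a chosen encoding. It uses the Python 3 print() function, and `bytes_printed` is a hypothetical helper; a real terminal's encoding comes from sys.stdout.encoding rather than from an argument.

```python
import io


def bytes_printed(text, encoding):
    """Return the bytes print() writes to a stream using encoding."""
    raw = io.BytesIO()
    # The text wrapper plays the role of sys.stdout: it encodes
    # Unicode text to bytes before they reach the underlying stream.
    stream = io.TextIOWrapper(raw, encoding=encoding, newline='\n')
    print(text, file=stream)
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # leave the byte buffer alone when the wrapper goes away
    return data


utf8_bytes = bytes_printed(u'\xe9', 'utf-8')
latin1_bytes = bytes_printed(u'\xe9', 'latin-1')

print(repr(utf8_bytes))
print(repr(latin1_bytes))
```

With a UTF-8 stream the single code point U+00E9 goes out as the two bytes from point 1 plus a newline, which a UTF-8 terminal then renders as é.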