=encoding utf8 =head1 NAME Unicode introduction =head1 LANGUAGE en =head1 ABSTRACT This tutorial will give you a first notion of Unicode standard. It explains how to get started with Unicode in Perl and tries to focus on the most common errors. =head1 DESCRIPTION This tutorial will give you a first notion of Unicode standard. It explains how to get started with Unicode in Perl and tries to focus on the most common errors. =head1 TUTORIAL =head2 Unicode == Standard I is a computing industry standard for the consistent representation and handling of text expressed in most of the world's writing systems. Perl language is known for its excellent Unicode handling capabilities. There is a lot to know about peculiarities of dealing with Unicode, but in this tutorial we'll concentrate on the most basic things to get you started with Unicode in Perl right away. =head2 Code points and UTF Very simply put, main part of the Unicode standard is just a giant table, which assigns a number to every L, would that be a letter, a punctuation, a diacritic and so on. Those numbers are called I and normally a Unicode code point is referred to by writing I followed by its number in hexadecimal form. For example, B refers to B, and B is the B etc. But this giant table of code points itself yet has nothing to do with programming, computers or whatsoever. To actually use the power of Unicode in your programs you have to deal with the notion of I encodings, i.e. rules by which the code points could be translated into sequence of bits. The dominant (and most preferrable) character encoding for the World-Wide Web is I and in this tutorial we'll be dealing only with this representation of Unicode standard. =head2 Beware of wide characters Before diving into examples we need to take precaution against a very common problem when dealing with Unicode in Perl. When you will start trying to output some non-ASCII characters, chances are you will run into following warning message: Wide character in say in ./my_script.pl line 3 Well, what's that? What is I? How did it sneak into my coding chef d'œuvre?! This warning usually happens when you output a Unicode string to a non-unicode filehandle, i.e. a filehandle with no unicode-compatible I on it. IO layers is kinda close topic but we won't go into it right now, instead we'll show you possible solution to avoid this warning: binmode FILEHANDLE, ":encoding(UTF-8)"; This command should be put before your printing statement and it will specify the encoding layer for desired C. So, to be able to print to console C should be C, which is the filehandle used by Perl's C and C functions by default. =head2 Printing symbols First of all, let's look how we can output symbols denoted by code points. Your first option is to simply use hexadecimal code point number inside C<\x{}>. Let's look at some simple examples from the mathematic logic. To denote I (more commonly known as B operator), mathematicians use symbol B<∧>, which has code point B. So in Perl you should binmode STDOUT, ":encoding(UTF-8)"; say "1 \x{2227} 0 = 0"; =head4 Exercise Try to the write same example for I (B operator). B: logical disjunction symbol comes right after the conjunction one in Unicode table. binmode STDOUT, ":encoding(UTF-8)"; say ''; __TEST__ like($code, qr|\\x{[0-9a-f]+?}|i, '\x{} notation should be used'); like($stdout, qr/1 ∨ 0 = 1/, 'Should print out 1 ∨ 0 = 1'); You can also use the name of a code point, which would make your script more readable. To do that you would use C<\N{}> syntax instead of C<\x{}> (if you are using version of Perl less than 5.16 you'll need to put C at the top of you script in order to use C<\N{}>). The name of a code point could be seen directly in the L or, for example, with L utility. Remember I operator? Yes, it's just good old B. As B is just an addition modulo 2, mathematicians write it as a plus sing in a circle — B<⊕>. use charnames qw(:full); binmode STDOUT, ":encoding(UTF-8)"; say "1 \N{CIRCLED PLUS} 0 = 1"; =head4 Exercise Let's do some negation. Write an equation for negating bit 1. B: use Unicode B. use charnames qw(:full); binmode STDOUT, ":encoding(UTF-8)"; say ''; __TEST__ like($code, qr|\\N{[ A-Z]+?}|, '\N{} notation should be used'); like($stdout, qr/¬\s*1\s*=\s*0/, 'Should print out ¬ 1 = 0'); =head2 Unicode in your source But you don't have to type code points or even their names everytime. You can use any Unicode symbol in source code, for example in your string literals. All you have to do, is to use C pragma and then to save your script as UTF-8 text file. Watch this: use utf8; binmode STDOUT, ":encoding(UTF-8)"; my %notes = ( quarter => '♩', eighth => '♪', ); while (my( $k, $v ) = each %notes) { say "$k note is $v"; } Even more, you can use Unicode symbols in your indentifiers use utf8; my $cliché = 42; say "The answer is $cliché!"; and inside regular expressions! use utf8; my $snowman = "Hello, I'm \x{2603}."; say 'Snowman is here!' if $snowman =~ /☃/; =head4 Exercise Edit this piece of code by filling internationalized country code top-level domains. B: L might help. use utf8; binmode STDOUT, ":encoding(UTF-8)"; my %tlds = ( Russia => ..., Ukraine => ..., ); say "ccTLD for $_ is '$tlds{$_}'" for keys %tlds; __TEST__ like($code, qr/рф/, 'cyrillic tld for Russia should be used'); like($code, qr/укр/, 'cyrillic tld for Ukraine should be used'); =head2 Decode-process-encode Our previous examples had Unicode symbols in source code itself. When dealing with real world application this is not usually the case. Most of the time you'll perform processing of some kind of data that came from an external source, would that be a database, World-Wide Web of something else. Outside of your program data exists in form of bytes, and a set of rules which one would use to convert writing symbols into sequence of bytes is called I. L module is the tool for doing encoding convertions in Perl. So very typical workflow for some script would be following: =over =item * Decode you input data using L module. =item * Do some some stuff with your I data. =item * Encode your text into suitable encoding and pass it outside. =back Note that last step can include printing into some filehandle, in which case you can, for example, use C function as we did before, instead of C function. Also, you don't have to always perform steps 1 and 3 by yourself. In case you are using some encoding-aware module to fetch or parse data, decoding/encoding steps can be automaticaly taken for you by that module (e.g. L, L). Main point here is that you should always be aware of what state your data is in and carefully read the docs for modules you use. But for simple string juggling three steps above should be enough to get you going. =head2 Byte vs. character There is one common B about encodings: I<"1 character takes 1 byte">. This is obviously true for single-byte encodings, such as I. use Encode qw(encode); say length encode('latin1', '$'); # says 1 But since Unicode now defines more then B<1 million> code points (1,114,112 to be precise) it's absolutely impossible to use one byte (which can take only 256 combinations of bits) to hold them all. That's where L step in. And UTF-8 is one of the most interesting of them all. See, depending on code point UTF-8 encoded Unicode symbol can take from 1 byte use Encode qw(encode); say length encode('utf8', '$'); # says 1 up to 6 bytes for a single code point! Once again, see the difference between I and I representing that symbol: use utf8; use Encode qw(encode); say length '€'; # says 1 say length encode('utf8', '€'); # says 3 =head2 See also =over =item * L =item * L =back =head1 AUTHOR Sergey Romanov, C =head1 LICENSE L