=encoding utf8

=head1 NAME

Unicode introduction

=head1 LANGUAGE

en

=head1 ABSTRACT

This tutorial will give you a first notion of the Unicode standard. It
explains how to get started with Unicode in Perl and tries to focus on the
most common errors.

=head1 DESCRIPTION

This tutorial will give you a first notion of the Unicode standard. It
explains how to get started with Unicode in Perl and tries to focus on the
most common errors.

=head1 TUTORIAL

=head2 Unicode == Standard

I<Unicode> is a computing industry standard for the consistent representation
and handling of text expressed in most of the world's writing systems.
The Perl language is known for its excellent Unicode handling capabilities.

There is a lot to know about the peculiarities of dealing with Unicode, but in
this tutorial we'll concentrate on the most basic things to get you started
with Unicode in Perl right away.

=head2 Code points and UTF

Very simply put, the main part of the Unicode standard is just a giant table
that assigns a number to every character, be it a letter, a punctuation mark,
a diacritic and so on. Those numbers are called
I<code points> and normally a Unicode code point is referred to by
writing I<U+> followed by its number in hexadecimal form. For example, B
refers to B, and B is the B etc.

But this giant table of code points by itself has nothing to do with
programming or computers. To actually use the power of Unicode in
your programs you have to deal with the notion of
I<character encodings>, i.e. rules by which code
points are translated into sequences of bytes.
The dominant (and most preferable) character encoding for the World Wide Web
is I<UTF-8>, and in this tutorial we'll be dealing only with this representation
of the Unicode standard.
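
To make the step from code point to bytes concrete, here is a minimal sketch
that prints the code point of the euro sign and the bytes it becomes in UTF-8.
It assumes the core L<Encode> module, which is covered later in this tutorial;
the variable names are only illustrative.

    use strict;
    use warnings;
    use Encode qw(encode);

    # U+20AC is the EURO SIGN, chosen here only as a sample character.
    my $char  = "\x{20AC}";

    # Translate the character into its UTF-8 byte sequence.
    my $bytes = encode('UTF-8', $char);

    printf "code point: U+%04X\n", ord $char;
    printf "UTF-8 bytes: %s\n",
        join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;

Under these assumptions it should report C<U+20AC> for the code point and
C<E2 82 AC> for the bytes: one code point, three bytes.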

=head2 Beware of wide characters

Before diving into examples we need to take precautions against a very common
problem when dealing with Unicode in Perl. When you start trying to
output non-ASCII characters, chances are you will run into the following
warning message:

    Wide character in say in ./my_script.pl line 3

Well, what's that? What is a I<wide character>? How did it sneak into my coding
chef d'œuvre?!

This warning usually happens when you output a Unicode string to a
non-Unicode filehandle, i.e. a filehandle with no Unicode-compatible
I<IO layer> on it. IO layers are a closely related topic, but we won't go into
them right now; instead we'll show you a possible solution to avoid this warning:

    binmode FILEHANDLE, ":encoding(UTF-8)";

This command should be put before your printing statement; it specifies
the encoding layer for the desired C<FILEHANDLE>. So, to be able to print to
the console, C<FILEHANDLE> should be C<STDOUT>, which is the filehandle used by
Perl's C<print> and C<say> functions by default.
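
As a small sketch of the above, the following lines attach the layer to both
C<STDOUT> and C<STDERR>. Adding C<STDERR> is our own assumption here, so that
C<warn> messages with non-ASCII text are handled as well; the printed text is
just sample data.

    use strict;
    use warnings;

    # Attach a UTF-8 encoding layer to both standard output streams.
    binmode STDOUT, ':encoding(UTF-8)';
    binmode STDERR, ':encoding(UTF-8)';

    # U+20AC (euro sign) is a wide character; with the layer in place
    # it is printed without a "Wide character" warning.
    print "Price: 2 \x{20AC}\n";
    warn  "Cheaper than 3 \x{20AC}\n";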

=head2 Printing symbols

First of all, let's look at how we can output symbols denoted by code points.

Your first option is to simply use the hexadecimal code point number inside C<\x{}>.

Let's look at some simple examples from mathematical logic. To denote
I<logical conjunction> (more commonly known as the B<AND> operator), mathematicians
use the symbol B<∧>, which has code point B<U+2227>. So in Perl you would write:

    binmode STDOUT, ":encoding(UTF-8)";

    say "1 \x{2227} 0 = 0";

=head4 Exercise

Try to write the same example for I<logical disjunction> (the B<OR> operator).

B<Hint>: the logical disjunction symbol comes right after the conjunction one in
the Unicode table.

    binmode STDOUT, ":encoding(UTF-8)";

    say '';

    __TEST__
    like($code, qr|\\x{[0-9a-f]+?}|i, '\x{} notation should be used');
    like($stdout, qr/1 ∨ 0 = 1/, 'Should print out 1 ∨ 0 = 1');

You can also use the name of a code point, which would make your script more
readable. To do that you would use C<\N{}> syntax instead of C<\x{}> (if you
are using a version of Perl older than 5.16 you'll need to put
C<use charnames qw(:full)> at the top of your script in order to use
C<\N{}>). The name of a code point can be found directly in the
L or,
for example, with L utility.
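
If you would rather look names up from Perl itself, the core C<charnames>
module can help. The sketch below is only an illustration: it uses
C<charnames::viacode> to get the name for a code point number and
C<charnames::vianame> to go the other way. The snowman code point is picked
arbitrarily.

    use strict;
    use warnings;
    use charnames ();

    binmode STDOUT, ':encoding(UTF-8)';

    # viacode() maps a code point number to its name ...
    my $name = charnames::viacode(0x2603);

    # ... and vianame() maps a name back to the code point number.
    my $number = charnames::vianame('SNOWMAN');

    printf "U+%04X is %s (%s)\n", $number, $name, chr $number;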

Remember the I<exclusive disjunction> operator? Yes, it's just good old B<XOR>.
As B<XOR> is just addition modulo 2, mathematicians write it as a
plus sign in a circle: B<⊕>.

    use charnames qw(:full);

    binmode STDOUT, ":encoding(UTF-8)";

    say "1 \N{CIRCLED PLUS} 0 = 1";

=head4 Exercise

Let's do some negation. Write an equation for negating bit 1.

B<Hint>: use the Unicode B<NOT SIGN>.

    use charnames qw(:full);

    binmode STDOUT, ":encoding(UTF-8)";

    say '';

    __TEST__
    like($code, qr|\\N{[ A-Z]+?}|, '\N{} notation should be used');
    like($stdout, qr/¬\s*1\s*=\s*0/, 'Should print out ¬ 1 = 0');


=head2 Unicode in your source

But you don't have to type code points or even their names every time. You
can use any Unicode symbol directly in your source code, for example in string
literals. All you have to do is use the C<use utf8> pragma and save your
script as a UTF-8 text file. Watch this:

    use utf8;

    binmode STDOUT, ":encoding(UTF-8)";

    my %notes = (
        quarter => '♩',
        eighth  => '♪',
    );

    while (my( $k, $v ) = each %notes) {
        say "$k note is $v";
    }

Even more, you can use Unicode symbols in your identifiers

    use utf8;

    my $cliché = 42;

    say "The answer is $cliché!";

and inside regular expressions!

    use utf8;

    my $snowman = "Hello, I'm \x{2603}.";

    say 'Snowman is here!' if $snowman =~ /☃/;

=head4 Exercise

Edit this piece of code by filling in the internationalized country code
top-level domains.

B: L might help.

    use utf8;

    binmode STDOUT, ":encoding(UTF-8)";

    my %tlds = (
        Russia  => ...,
        Ukraine => ...,
    );

    say "ccTLD for $_ is '$tlds{$_}'" for keys %tlds;

    __TEST__
    like($code, qr/рф/, 'cyrillic tld for Russia should be used');
    like($code, qr/укр/, 'cyrillic tld for Ukraine should be used');

=head2 Decode-process-encode

Our previous examples had Unicode symbols in the source code itself. In
real-world applications this is not usually the case. Most of
the time you'll be processing some kind of data that came from an
external source, be it a database, the World Wide Web or something else.

Outside of your program data exists in the form of bytes, and the set of rules
used to convert written symbols into a sequence of bytes is
called an I<encoding>. The L<Encode> module is the tool for doing encoding
conversions in Perl.

So a very typical workflow for a script would be the following:

=over

=item *

Decode your input data using the L<Encode> module.

=item *

Do some stuff with your I<decoded> data.

=item *

Encode your text into a suitable encoding and pass it outside.

=back

Note that the last step can include printing to a filehandle, in which case
you can, for example, use the C<binmode> function as we did before, instead of
the C<encode> function.

Also, you don't always have to perform steps 1 and 3 yourself. If you
are using an encoding-aware module to fetch or parse data, the decoding/encoding
steps can be done for you automatically by that module (e.g. L, L).

The main point here is that you should always be aware of what state your data
is in and carefully read the docs for the modules you use. But for simple string
juggling the three steps above should be enough to get you going.
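
The three steps above can be sketched roughly as follows. This is only an
illustration: the input bytes are hard-coded sample data standing in for
whatever your external source delivers, and the "processing" is a plain
C<uc> call.

    use strict;
    use warnings;
    use Encode qw(decode encode);

    # Sample data: the UTF-8 bytes of "naïve café", as if read from outside.
    my $input = "na\xC3\xAFve caf\xC3\xA9";

    # 1. Decode: bytes -> Perl character string.
    my $text = decode('UTF-8', $input);

    # 2. Process: string functions now work on characters, not bytes.
    my $shout = uc $text;

    # 3. Encode: characters -> bytes again, ready to leave the program.
    my $output = encode('UTF-8', $shout);

    printf "%d bytes in, %d characters, %d bytes out\n",
        length $input, length $text, length $output;
    print $output, "\n";

Note that the lengths differ: the sample is 12 bytes long on the way in and
out, but only 10 characters long while it is being processed.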

=head2 Byte vs. character

There is one common B<misconception> about encodings:
I<"1 character takes 1 byte">.
This is obviously true for single-byte encodings, such as I<Latin-1>.

    use Encode qw(encode);

    say length encode('latin1', '$'); # says 1

But since Unicode now defines more than B<1 million> code points (1,114,112 to
be precise) it's absolutely impossible to use one byte (which can take only
256 different values) to hold them all. That's where
L step in.
And UTF-8 is one of the most interesting of them all. See, depending on the code
point, a UTF-8 encoded Unicode symbol can take from 1 byte

    use Encode qw(encode);

    say length encode('utf8', '$'); # says 1

up to 4 bytes for a single code point! Once again, see the difference
between the I<character> and the I<bytes> representing that symbol:

    use utf8;

    use Encode qw(encode);

    say length '€';                 # says 1

    say length encode('utf8', '€'); # says 3
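
To see how the byte count grows with the code point, here is a small sketch
that encodes four sample characters. The code points are our own arbitrary
picks (dollar sign, cent sign, euro sign and a character outside the Basic
Multilingual Plane), not anything prescribed by Perl or Unicode.

    use strict;
    use warnings;
    use Encode qw(encode);

    # Four sample code points that need 1, 2, 3 and 4 bytes in UTF-8.
    for my $cp (0x24, 0xA2, 0x20AC, 0x1F42A) {
        my $bytes = encode('UTF-8', chr $cp);
        printf "U+%04X: 1 character, %d byte(s)\n", $cp, length $bytes;
    }

Each of the four strings is exactly one character long, while the encoded
forms take 1, 2, 3 and 4 bytes respectively.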

=head2 See also

=over

=item * L

=item * L

=back

=head1 AUTHOR

Sergey Romanov, C

=head1 LICENSE

L