Character sets and iconv
This presentation is about character sets and the iconv library (with usage examples in PHP)
By Daniel Rhodes of Warp Asylum (http://www.warpasylum.co.uk)
What is a character set?
A mapping: character x in human language y is represented by value z
Western European languages often use 8-bit ISO 8859-1
English possible in 7-bit ASCII!
Some languages have complex / numerous characters and need 2, 3 or even 4 bytes to represent one character!
So, many different character sets exist
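To make the byte counts above concrete, here is a small PHP sketch. It assumes the script file is saved as UTF-8 (a string literal simply carries whatever encoding the file uses), and the four sample characters are illustrative choices, not taken from the presentation.

```php
<?php
// Assumption: this file is saved as UTF-8, so each literal below is UTF-8 encoded.
$samples = ['A', 'é', '日', '😀'];

foreach ($samples as $char) {
    // strlen() counts bytes, not characters; bin2hex() shows the stored byte values.
    echo $char, ': ', strlen($char), ' byte(s), hex ', bin2hex($char), "\n";
}
```

In UTF-8 these four characters take 1, 2, 3 and 4 bytes respectively. In another character set the same characters would be stored as different values, or might not exist at all.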
More about character sets
Even same language may have many different character sets
Character sets tend not to be compatible
So, conversion is necessary and useful
But Unicode is coming through as a modernising, unifying character set
Unicode is one HUGE character set that can be used to represent any character from any language!
Character sets? Who cares!
Anglophones very lucky as everything seems to just work (even if in the background different character sets are interacting)
English is not the only language!
An app expecting character set x but getting y (or an incorrect character set conversion) will result in mojibake
Mojibake? What's that?
A great Japanese word meaning garbled (bake) characters (moji)
Often encountered in Japanese computing, with its two traditional character sets (Shift_JIS and EUC-JP), Unicode and a separate character set for emails (ISO-2022-JP)!
Shouldn't really happen at all in modern computing
But it still does, mostly due to lack of implementation knowledge
Mojibake in English
A slight case of mojibake here, the pound symbols (£) have garbled
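One common way a pound sign ends up garbled (not necessarily the cause in the screenshot this slide showed) is that UTF-8 bytes get treated as ISO-8859-1. A minimal PHP sketch of that mistake, assuming the file is saved as UTF-8 and using the iconv() function covered later in this presentation:

```php
<?php
// Assumption: this file is saved as UTF-8, so '£' is stored as the two bytes C2 A3.
$pound = '£';

// The mistake: declare those UTF-8 bytes to be ISO-8859-1 and convert to UTF-8 again.
// Each of the two bytes is then treated as a separate ISO-8859-1 character.
$garbled = iconv('ISO-8859-1', 'UTF-8', $pound);

echo $garbled, "\n";          // shows two characters: Â followed by £
echo bin2hex($garbled), "\n"; // c382c2a3
```

The conversion itself succeeds without any error, which is why this kind of mojibake often goes unnoticed until a human reads the output.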
Mojibake in German
More severe now, umlauted vowels (ä, ö and ü) have garbled
Mojibake in Japanese
Ouch!
What is the iconv library?
API to convert between character sets
Works on strings
Some support for transliteration (changing / substituting characters in source character set that don't exist in target character set)
Your implementation may vary, but a HUGE number of character sets are supported
Some iconv use cases
Convert a legacy character set to Unicode
Convert between backend and frontend character sets
Convert file's character set for import / export
Transliterate to remove unwanted characters
Transliterate to make safe for URL / filename
Let's look at some iconv usage examples in PHP..
What is PHP's iconv extension?
Interface to iconv library
See http://uk.php.net/manual/en/book.iconv.php
iconv library should be on your OS
If not, need to install it before using the PHP extension
See http://www.gnu.org/software/libiconv
iconv extension presence
If the extension is loaded, phpinfo() output will include an iconv section
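A scripted check is an alternative to reading phpinfo() output by eye. The sketch below uses extension_loaded() and the ICONV_IMPL and ICONV_VERSION constants that the extension defines; the variable name and the messages are illustrative only.

```php
<?php
// True when the iconv extension is compiled in or loaded.
$hasIconv = extension_loaded('iconv');

if ($hasIconv) {
    // ICONV_IMPL names the underlying library (for example "glibc" or "libiconv").
    echo 'iconv is available: ', ICONV_IMPL, ' ', ICONV_VERSION, "\n";
} else {
    echo "iconv extension is not loaded\n";
}
```

Knowing the implementation matters later on, because transliteration and //IGNORE behave differently between iconv libraries.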
A few directives
iconv.input_encoding currently unused
iconv.output_encoding for ob_iconv_handler() [iconv handler for PHP's output buffering]
iconv.internal_encoding for ob_iconv_handler(), iconv_mime_*() and iconv's string utility functions (which are present from PHP 5)
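The current values of these three settings can be inspected at runtime with iconv_get_encoding(). This is only a sketch for looking at a given installation. Note that from PHP 5.6 the iconv.* directives are deprecated in favour of default_charset, so on newer versions the values shown usually derive from that setting.

```php
<?php
// With 'all', iconv_get_encoding() returns an array keyed by
// input_encoding, output_encoding and internal_encoding.
$encodings = iconv_get_encoding('all');

foreach ($encodings as $name => $value) {
    echo $name, ' = ', $value, "\n";
}
```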
First play
Basic usage
iconv() is the conversion function
Pass it the input string's character set,
the desired output character set
and the input string
BUT within reason...
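The call described above can be sketched as follows. The sample string is an illustrative choice and the file is assumed to be saved as UTF-8. Every character in it also exists in ISO-8859-1, so the conversion is comfortably "within reason".

```php
<?php
// Assumption: this file is saved as UTF-8.
$utf8 = 'Fish & chips: £5';

// Argument order: input character set, output character set, input string.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8);

// £ takes two bytes in UTF-8 but one byte in ISO-8859-1,
// so the converted string is one byte shorter.
echo strlen($utf8), ' bytes in, ', strlen($latin1), " bytes out\n";
```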
Within reason
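A sketch of a conversion that is not within reason: Japanese text has no representation in ISO-8859-1. With many iconv implementations PHP raises a notice and iconv() returns false, but others substitute or truncate instead, so the code below handles both outcomes rather than assuming one.

```php
<?php
// Assumption: this file is saved as UTF-8.
$japanese = '日本語';

// ISO-8859-1 contains no Japanese characters.
// The @ suppresses the notice that PHP typically raises for this call.
$result = @iconv('UTF-8', 'ISO-8859-1', $japanese);

if ($result === false) {
    echo "Conversion failed\n";
} else {
    // Some iconv implementations substitute or truncate instead of failing.
    echo 'Converted to bytes: ', bin2hex($result), "\n";
}
```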
Character mapping
You might not get every character from set x present in set y
So what to do if character absent? Bomb out and return an empty string?
NO! iconv gives us a few options
Let's look at transliteration first...
First transliteration
Transliteration
Append //TRANSLIT to output character set as passed to iconv()
Approximates characters not present in output character set with closest equivalent
Closest equivalent might simply be '?' for wildly different character sets
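A minimal transliteration sketch, assuming a UTF-8 source file and an iconv build that accepts the //TRANSLIT suffix (not all do). The sample text is illustrative. No expected output is shown because the approximation chosen for each character depends on the iconv implementation and the current locale.

```php
<?php
// Assumption: this file is saved as UTF-8.
$utf8 = 'Café costs €5';

// //TRANSLIT is appended to the OUTPUT character set.
// The @ hides the warning raised by iconv builds that do not support the suffix.
$ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);

// string on success (approximations vary), false if the call was rejected.
var_dump($ascii);
```

On one system the euro sign may come back as "EUR", on another as "?", which is the "closest equivalent" caveat from the bullet above in practice.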
More realistic transliteration
Ignore option
We can also append //IGNORE to the output character set as passed to iconv()
This will simply skip over any characters that are absent from the output character set
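A sketch of the ignore option, under the same assumptions as before (UTF-8 source file, illustrative sample text). On some PHP and iconv combinations the call is reported as failed even though //IGNORE was requested, so the result is checked before use.

```php
<?php
// Assumption: this file is saved as UTF-8.
$utf8 = 'Naïve café';

// //IGNORE asks iconv to skip characters that have no ASCII equivalent.
// The @ suppresses the notice PHP may still raise.
$ascii = @iconv('UTF-8', 'ASCII//IGNORE', $utf8);

if ($ascii === false) {
    echo "This PHP/iconv combination reported a failure despite //IGNORE\n";
} else {
    echo $ascii, "\n"; // the accented letters are dropped where //IGNORE is honoured
}
```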
Ignore example
Transliterate and ignore
You may (or may not!) be able to combine the //TRANSLIT and //IGNORE behaviours
This will transliterate transliteratable characters and ignore the rest
Action it by appending //TRANSLIT//IGNORE to the output character set as passed to iconv()
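Combining both suffixes fits the "make safe for URL / filename" use case mentioned earlier. The helper below is a hypothetical sketch: the name make_slug() and its fallback policy are illustrative, not part of iconv or of this presentation's source pack. Because support for the combined suffix varies, it falls back to stripping non-ASCII bytes when iconv() reports failure.

```php
<?php
// Hypothetical helper: the name and the fallback policy are illustrative only.
function make_slug(string $utf8Text): string
{
    // Transliterate what can be transliterated and skip the rest.
    // Support for the combined suffix depends on the iconv implementation.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $utf8Text);
    if ($ascii === false) {
        // Fall back to the raw input; the regex below discards non-ASCII bytes.
        $ascii = $utf8Text;
    }
    // Replace every run of characters other than a-z and 0-9 with one hyphen.
    $slug = preg_replace('/[^a-z0-9]+/', '-', strtolower($ascii));
    return trim($slug, '-');
}

echo make_slug('Hello World'), "\n";           // hello-world
echo make_slug('Crème brûlée für alle'), "\n"; // result varies by implementation and locale
```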
Output buffer handler
We also get a handler for PHP's output buffering
Allows us to, for example, output everything to the browser as ISO-8859-1 though our PHP scripts etc are using UTF-8
An automatic way to convert character sets for output without necessarily touching anything internally
Let's take a look...
ob_iconv_handler
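A sketch of the handler in use, assuming PHP 5.6 or later (where the internal_encoding and output_encoding ini settings exist) and a UTF-8 source file. The outer buffer is only there so the converted bytes can be inspected; in a real page a single ob_start('ob_iconv_handler') near the top of the script would normally be enough.

```php
<?php
// Assumption: this file is saved as UTF-8 and the client should receive ISO-8859-1.
ini_set('internal_encoding', 'UTF-8');
ini_set('output_encoding', 'ISO-8859-1');

ob_start();                   // outer buffer, only used to capture the result
ob_start('ob_iconv_handler'); // inner buffer converts its contents when flushed
echo 'Price: £5';
ob_end_flush();               // runs ob_iconv_handler and hands the result outward
$converted = ob_get_clean();

// £ is the single byte A3 in ISO-8859-1.
echo bin2hex($converted), "\n";
```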
Utility functions
As of PHP 5, we also get some non-conversion utility functions
iconv_strlen()
iconv_strpos(), iconv_strrpos()
iconv_substr()
Character equivalents of core strlen(), strpos(), strrpos() and substr() [which are really byte functions]
Quite trivial so we'll look only at one, iconv_strlen()...
iconv_strlen()
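A short comparison of the byte-based and character-based length functions, assuming the file is saved as UTF-8. The sample word is an illustrative choice.

```php
<?php
// Assumption: this file is saved as UTF-8.
$text = '日本語';

echo strlen($text), "\n";                // 9: strlen() counts bytes
echo iconv_strlen($text, 'UTF-8'), "\n"; // 3: iconv_strlen() counts characters
```

The second argument names the character set of the input string. If it is omitted, iconv_strlen() falls back to the configured internal encoding.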
Food for thought
Unicode is the character set of the future
PHP iconv extension uses system locale [setlocale()] for transliteration
PHP iconv extension issues a notice even when //IGNORE is used
iconv library has no mechanism for custom character maps
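The locale point above can be explored with a sketch like the one below. The locale names are assumptions about what might be installed on a given system, and no output is shown because results differ between iconv implementations and locales.

```php
<?php
// Assumption: this file is saved as UTF-8. The locale names are examples only
// and may not be installed on every system.
$utf8 = 'Müller';

foreach (['C', 'de_DE.UTF-8', 'en_GB.UTF-8'] as $locale) {
    // setlocale() returns false when the requested locale is unavailable.
    if (setlocale(LC_CTYPE, $locale) === false) {
        echo $locale, ": not available\n";
        continue;
    }
    // The @ hides warnings from iconv builds without //TRANSLIT support.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);
    echo $locale, ': ', var_export($ascii, true), "\n";
}
```

If the transliteration of ü changes between locales on your system, that is the locale dependence in action.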
Summary
iconv library can be accessed on the command line
But there is also an extension for PHP (and for many other languages!)
Many character sets supported
All or nothing conversion or softer transliteration
Links
Should be able to get a PHP source code pack from wherever you got this presentation
http://spin.atomicobject.com/2011/07/13/some-useful-iconv-functionality
http://developer.loftdigital.com/blog/php-utf-8-cheatsheet
http://blog.grayproductions.net/articles/encoding_conversion_with_iconv
http://czyborra.com/charsets/iso8859.html