Unicode
DrModiford (Talk | contribs) (→Unicode)
== External Links ==
* [http://opal.com/freebsd/unicode.html Unicode support on FreeBSD]
Revision as of 10:16, 27 September 2007
Unicode
The simplest character set commonly used on computers is ASCII, the American Standard Code for Information Interchange. It uses a 7-bit numerical value, normally stored in one 8-bit byte, to represent the unaccented Latin letters used in English, along with digits, punctuation and control codes. Accented characters (as seen in French writing, for example) are not part of ASCII; they are provided by 8-bit extensions such as ISO8859-1. Almost all modern operating systems, as well as many older computer systems, use ASCII or an encoding built on it.
Unicode takes this concept of a numerical value representing a character and extends it to cover the alphabets of (virtually) all the known languages in the world. This is more than 100,000 characters, and so UTF-8, the most widely used Unicode encoding, uses 1-, 2-, 3- and even 4-byte sequences to represent them:
- 1-byte sequences cover the ASCII characters, including the simple English alphabet;
- 2-byte sequences cover the more common alphabets, including Arabic, Armenian, Cyrillic, Greek, Hebrew, accented Latin, and Syriac;
- 3-byte sequences cover most of the remaining characters in everyday use, including Chinese, Japanese and Korean;
- 4-byte sequences cover additional, rarer characters and scripts, so they are not used often.
In addition to the characters themselves, the standard also defines "handedness", that is, the direction in which the text flows. Western languages are typically written left-to-right (as per the text on this page), while some other languages, typically Middle Eastern ones such as Arabic and Hebrew, are written right-to-left.
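ASCII uses one byte per character, so a 100-letter document is (theoretically) 100 bytes on disk. UTF-8 is backwards compatible with ASCII: every ASCII character is encoded as the same single byte, so that document is still 100 bytes in UTF-8. Characters outside ASCII take 2, 3 or 4 bytes each, so a document written in another script can be 2, 3 or 4 times the size of its character count. Other Unicode encodings differ: UTF-16 uses at least 2 bytes per character and UTF-32 always uses 4, even for ASCII text.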
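The byte counts described above can be checked with iconv. This is only a minimal sketch: the helper name utf8_bytes is made up for illustration, and it assumes the installed iconv accepts the encoding names iso8859-1, UTF-32BE and utf-8. The input characters are given as octal byte escapes so the snippet itself stays plain ASCII.

```shell
# utf8_bytes (hypothetical helper): convert some input bytes to UTF-8
# and print how many bytes the result occupies.
#   $1  source encoding name understood by iconv (assumed to be supported)
#   $2  the input bytes, written as printf-style octal escapes
utf8_bytes() {
    printf "$2" | iconv -f "$1" -t utf-8 | wc -c | tr -d ' '
}

utf8_bytes iso8859-1 'A'                  # U+0041, plain ASCII      -> 1
utf8_bytes iso8859-1 '\351'               # U+00E9, e with acute     -> 2
utf8_bytes UTF-32BE  '\000\000\040\254'   # U+20AC, euro sign        -> 3
utf8_bytes UTF-32BE  '\000\001\321\036'   # U+1D11E, musical G clef  -> 4
```

The first two calls also show the ASCII compatibility mentioned above: the letter A keeps its single byte, while the one-byte ISO8859-1 character grows to two bytes in UTF-8.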
Another character encoding, typically found on older mainframes (most notably from IBM), is EBCDIC, the Extended Binary Coded Decimal Interchange Code. A Unicode encoding called UTF-EBCDIC exists to enable legacy applications running on these systems to utilise Unicode.
Using UTF-8 in FreeBSD
First we need to set the LC_ALL and LANG variables. To find out which locales support UTF-8, list the locale directories:
$ cd /usr/share/locale/; ls -d *UTF-8
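Add the following environment variable to the appropriate shell startup file, ~/.profile or ~/.bashrc. For csh or tcsh, put setenv LC_ALL sv_SE.UTF-8 in ~/.login instead, since export is Bourne shell syntax.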
export LC_ALL=sv_SE.UTF-8
Now log out and log back in for the change to take effect. After that you should enable UTF-8 support in your terminal; see the Applications section for this.
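After logging back in, it can be worth confirming that the new locale is really in effect. The following is a small sketch, not part of the original instructions; the function name is_utf8_locale is made up for illustration, and it relies only on the standard locale charmap command.

```shell
# is_utf8_locale (hypothetical helper): succeed if the character encoding
# of the locale currently in effect is UTF-8.
is_utf8_locale() {
    [ "$(locale charmap 2>/dev/null)" = "UTF-8" ]
}

if is_utf8_locale; then
    echo "UTF-8 locale is active"
else
    echo "UTF-8 locale is NOT active, check LC_ALL and LANG"
fi
```

If the check fails, running locale without arguments shows the value of each locale variable, which usually reveals whether the startup file was read at all.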
Converting files
Now you're ready to convert some files. This is done with the iconv command; install it if you don't already have it.
# pkg_add -r libiconv
Then use the following to convert a file.
$ iconv -f iso8859-1 -t utf-8 file > file.new
This is a small script that converts a bunch of files and creates a backup of them in another directory.
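The script itself is not reproduced on this page. As a rough sketch of what such a script could look like (the function name convert_with_backup and its argument order are assumptions, not the original author's code), the following copies each file into a backup directory and then converts it in place using the same iconv call as above:

```shell
# convert_with_backup (hypothetical): convert files from ISO8859-1 to UTF-8,
# keeping an untouched copy of each one in a backup directory.
#   $1       backup directory (created if missing)
#   $2 ...   files to convert
convert_with_backup() {
    backup_dir=$1
    shift
    mkdir -p "$backup_dir" || return 1

    for f in "$@"; do
        # Save the original first; stop if the backup cannot be written.
        cp -p "$f" "$backup_dir/" || return 1

        # Convert into a temporary file, and replace the original only on success.
        if iconv -f iso8859-1 -t utf-8 "$f" > "$f.new"; then
            mv "$f.new" "$f"
        else
            echo "conversion failed, left unchanged: $f" >&2
            rm -f "$f.new"
        fi
    done
}
```

It could be called as convert_with_backup ~/backup-latin1 *.txt. Note that backups are stored by file name only, so files with the same name from different directories would overwrite each other in the backup directory, and running it twice on the same file would convert it twice.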
Applications
xterm
To make xterm play nicely, I added the following resource to ~/.Xdefaults:
$ echo "xterm*locale: UTF-8" >> ~/.Xdefaults
irssi + screen
Unfortunately I haven't found any way to get irssi+screen+FiSH to work without a restart of irssi. So restart screen with the new locale in effect. The following configuration makes irssi send ISO8859-1 by default:
/set term_charset UTF-8
/set recode_out_default_charset ISO8859-1
/set recode yes
/set recode_autodetect_utf8 no
/set recode_fallback ISO8859-1
/set recode_transliterate no
/recode add #utf8channel UTF-8
For use with FiSH (an IRC encryption module, http://fish.sekure.us/) some more adjustments are needed. Read the instructions and apply the patches from http://iiice.net/~ice/programs/FiSH/