Text encoding in Ruby

Created 2007-06-20, updated 2026-03-16

When writing a C extension for Ruby, you may need to know the text encoding of the strings it handles at runtime. In my case I needed to convert an input string to UCS-2 before handing it off to an external library, because that is the encoding the library expects. To perform the conversion I first had to be sure of the source encoding. The question is, how?
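To make the conversion problem concrete, here is a minimal sketch of a UTF-8 to UCS-2 converter in plain C. It assumes the source encoding is already known to be UTF-8, which is exactly the thing this page is trying to establish. The function name `utf8_to_ucs2` and its signature are hypothetical, and it is not the code used in my extension.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Convert UTF-8 bytes to UCS-2 code units in host byte order.
 * Assumption: the caller already knows the input is UTF-8.
 * Returns the number of code units written, or -1 if the input is
 * malformed, needs a code point above U+FFFF (not representable in
 * UCS-2), or does not fit in `out`.
 */
long utf8_to_ucs2(const unsigned char *in, size_t in_len,
                  uint16_t *out, size_t out_cap)
{
    size_t i = 0, n = 0;

    while (i < in_len) {
        unsigned char b = in[i];
        uint32_t cp;
        size_t extra;

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else return -1; /* 4-byte sequences exceed UCS-2; anything else is invalid */

        if (extra > in_len - i - 1) return -1; /* truncated sequence */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = in[i + k];
            if ((c & 0xC0) != 0x80) return -1; /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        /* Reject overlong forms and surrogate code points. */
        if (extra == 1 && cp < 0x80) return -1;
        if (extra == 2 && cp < 0x800) return -1;
        if (cp >= 0xD800 && cp <= 0xDFFF) return -1;

        if (n >= out_cap) return -1; /* output buffer too small */
        out[n++] = (uint16_t)cp;
        i += extra + 1;
    }
    return (long)n;
}
```

The sketch produces code units in host byte order. A library that expects a specific byte order would need an extra swap on some of the platforms listed under "Empirical analysis".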

I asked about this on comp.lang.ruby but didn't receive any replies. What follows are the results of my own research and analysis.

Build time

Analysis of configure.in and configure suggests that the default $KCODE value is KCODE_NONE, although it can be manually overridden with the --with-default-kcode switch to configure. Before compilation the default is added to config.h as a DEFAULT_KCODE macro.

If DEFAULT_KCODE is not set, defines.h sets it to either KCODE_SJIS or KCODE_EUC, depending on the build platform. In practice DEFAULT_KCODE should never be unset, because config.h is included before defines.h in every file I've looked at (including ruby.h), so I don't believe this fallback is ever used.

As far as I can tell, the only place where DEFAULT_KCODE is actually used is re.c.
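The include-order argument can be illustrated with a small preprocessor sketch. All of the `EXAMPLE_*` names and values here are hypothetical stand-ins, not the real macros or the real platform test from defines.h. The sketch only shows the shape of the logic: a default supplied by the generated header means the fallback is never reached.

```c
/* Hypothetical stand-ins for the KCODE_* constants; the values are illustrative. */
#define EXAMPLE_KCODE_NONE 0
#define EXAMPLE_KCODE_EUC  1
#define EXAMPLE_KCODE_SJIS 2

/* Plays the role of the configure-generated header, which is included first. */
#define EXAMPLE_DEFAULT_KCODE EXAMPLE_KCODE_NONE

/* Plays the role of the later header: only reached if no default exists yet. */
#ifndef EXAMPLE_DEFAULT_KCODE
# if defined(EXAMPLE_PLATFORM_PREFERS_SJIS) /* hypothetical platform test */
#  define EXAMPLE_DEFAULT_KCODE EXAMPLE_KCODE_SJIS
# else
#  define EXAMPLE_DEFAULT_KCODE EXAMPLE_KCODE_EUC
# endif
#endif

/* Report which default survived preprocessing. */
int example_default_kcode(void)
{
    return EXAMPLE_DEFAULT_KCODE;
}
```

Removing the first `EXAMPLE_DEFAULT_KCODE` definition is the only way to make the fallback branch take effect in this sketch.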

Analysis of `string.c`

There are three main string creation functions defined in string.c, and they appear to be largely encoding-agnostic:

str_new
rb_str_new
rb_str_new2

Ultimately it seems that String objects are nothing more than containers for bytes of data. There is no default encoding enforced at the Ruby level.

`str_new`

Merely allocates memory and copies the source bytes into it using memcpy. Expects a pointer and length describing the source block from which to copy. No notion of encoding, just works with raw bytes. Takes a klass parameter which allows you to create instances of String-like classes.
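The "container for bytes" behaviour can be sketched in standalone C. This is not the code from string.c. `struct byte_string`, `byte_string_new` and `byte_string_free` are hypothetical names, and the klass parameter is left out. The sketch only shows that a pointer-plus-length copy never needs to look at what the bytes mean.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a string object: a length plus raw bytes. */
struct byte_string {
    size_t len;
    char  *ptr;
};

/* Copy `len` bytes from `src` without inspecting them.
 * Returns NULL if allocation fails. */
struct byte_string *byte_string_new(const char *src, size_t len)
{
    struct byte_string *s = malloc(sizeof *s);
    if (s == NULL) return NULL;

    s->ptr = malloc(len + 1);
    if (s->ptr == NULL) {
        free(s);
        return NULL;
    }

    if (len > 0) memcpy(s->ptr, src, len); /* raw copy, no decoding */
    s->ptr[len] = '\0'; /* convenience terminator, not counted in len */
    s->len = len;
    return s;
}

/* Release a string created by byte_string_new. */
void byte_string_free(struct byte_string *s)
{
    if (s != NULL) {
        free(s->ptr);
        free(s);
    }
}
```

Because the length is passed explicitly, embedded NUL bytes and bytes above 0x7F survive the copy unchanged.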

`rb_str_new`

Calls str_new and uses it to create a standard String instance. Expects a pointer and length describing the source block from which to copy. No notion of encoding.

`rb_str_new2`

Expects a pointer to a NUL-terminated string. Calls strlen to determine string length. strlen is encoding-agnostic insofar as it just counts bytes until it hits a terminating NUL byte. This means that all standard C strings should work. Non-ASCII encodings will also work, provided the data contains no embedded (non-terminating) NUL bytes. Encodings such as UCS-2, which routinely contain zero bytes, will be truncated at the first one.
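The embedded-NUL problem is easy to demonstrate without Ruby at all. The sketch below uses two hand-written byte arrays and a hypothetical helper, `length_seen_by_strlen`, to show the length that any strlen-based constructor would compute.

```c
#include <string.h>

/* "Hi" encoded as UCS-2 big-endian, followed by a two-byte terminator. */
static const char ucs2_hi[] = { 0x00, 'H', 0x00, 'i', 0x00, 0x00 };

/* "H" plus e-acute encoded as UTF-8: no zero bytes before the terminator. */
static const char utf8_he[] = { 'H', (char)0xC3, (char)0xA9, 0x00 };

/* Length that a strlen-based constructor would see for these bytes. */
size_t length_seen_by_strlen(const char *bytes)
{
    return strlen(bytes);
}
```

With these arrays the UCS-2 data is seen as zero bytes long, because its very first byte is zero, while the UTF-8 data is seen as three bytes long. For data that may contain zero bytes, rb_str_new with an explicit length is the safer choice.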

Analysis of `io.c`

Methods like puts eventually end up calling the io_write function, which in turn calls io_fwrite, which in turn calls write. So there is generally no awareness of encoding when emitting strings, just as there is none when creating them.
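The output side can be sketched the same way. This is a standalone illustration using standard C stdio rather than the functions in io.c, and `emit_bytes` is a hypothetical name. It shows that a byte-oriented write passes data through without interpreting it.

```c
#include <stdio.h>

/* Write `len` bytes to `out` exactly as given.
 * Returns the number of bytes actually written. */
size_t emit_bytes(FILE *out, const char *bytes, size_t len)
{
    return fwrite(bytes, 1, len, out);
}
```

Whatever bytes went into the string come out again unchanged, so the party reading the output (a terminal, a file viewer, a client) decides how they are interpreted.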

Basically, then, the input encoding depends on one of several things:

The input file encoding, if reading from a file
The terminal encoding, if running inside a terminal (for example, when using IRB)
The transmission encoding, if sent by a client to a web server
The database encoding, if the input comes from a database

Empirical analysis

Mac OS X, PowerPC (big-endian)

Output of uname -psrv:

Darwin 8.9.0 Darwin Kernel Version 8.9.0: Thu Feb 22 20:54:07 PST 2007; root:xnu-792.17.14~1/RELEASE_PPC powerpc

Output of ruby -v:

ruby 1.8.2 (2004-12-25) [powerpc-darwin8.0]

On Mac OS X the environment settings in LANG and LC_* have no effect. The encoding setting in the Terminal (accessible by pressing Command-I) controls what encoding is used for input text, and also how output is interpreted.

Mac OS X, Intel (little-endian)

Output of uname -psrv:

Darwin 8.9.1 Darwin Kernel Version 8.9.1: Thu Feb 22 20:55:00 PST 2007; root:xnu-792.18.15~1/RELEASE_I386 i386

Output of ruby -v:

ruby 1.8.6 (2007-03-13 patchlevel 0) [i686-darwin8.8.1]

Red Hat Enterprise Linux, AMD (little-endian)

Output of uname -srvmpio:

Linux 2.4.21-50.EL #1 Tue May 8 17:18:10 EDT 2007 i686 athlon i386 GNU/Linux

Output of ruby -v:

ruby 1.8.6 (2007-03-13 patchlevel 0) [i686-linux]

Windows, Intel (little-endian)

Banner on running cmd.exe:

Microsoft Windows XP [Version 5.1.2600]

Output of ruby -v:

ruby 1.8.6 (2007-03-13 patchlevel 0) [i386-mswin32]
