Encoding problem

'John Keeping' john at keeping.me.uk
Sun Oct 6 12:46:33 CEST 2013


On Sat, Oct 05, 2013 at 11:32:54AM +0100, Jorge Bastos wrote:
> > On Sat, Sep 28, 2013 at 12:19:38AM +0100, Jorge Bastos wrote:
> > > Is it possible to define charset in cgitrc?
> > >
> > > I'm having encoding problems in the frontend, in the latest version
> > > 1.8.4 from version 0.9.2, and now non-ascii chars are shown with ?? or
> > > some other char instead of the correct one.
> > >
> > >
> > >
> > > Is there a charset option for cgit ? I can't find it.
> > 
> > The charset is hardcoded to "UTF-8", which should be the default
> > encoding for Git commit messages and CGit does attempt to transcode Git
> > messages to the correct encoding.
> > 
> > Are you seeing '??' in the commit message or in blob/tree content?
> > 
> > Do you have a public repository that is exhibiting these symptoms?
> 
> I was checking and the file in question was indeed in ANSI, changed the file
> encoding to utf8 and it's OK.
> Anyway, I have gitweb installed side-by-side, and in gitweb it was shown
> correctly.
> 
> I have other places where chars are not shown OK but didn't reach any
> conclusion about the file encoding, I'll tell you later.

I've had another look at this, and Gitweb is doing this for all data it
outputs:

    # decode sequences of octets in utf8 into Perl's internal form,
    # which is utf-8 with utf8 flag set if needed.  gitweb writes out
    # in utf-8 thanks to "binmode STDOUT, ':utf8'" at beginning
    sub to_utf8 {
        my $str = shift;
        return undef unless defined $str;

        if (utf8::is_utf8($str) || utf8::decode($str)) {
            return $str;
        } else {
            return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
        }
    }

Do you know what the fallback encoding on your Gitweb installation is?
(The default is 'latin1').

If you're not using any other source filter with CGit, you should get
the same result by configuring the script at the end of this message as
"source-filter" in your cgitrc file.
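
As a rough sketch, the cgitrc entry could look like the following. The
path is only an assumption for illustration; use wherever you actually
save the script, and make sure it is executable by the user CGit runs as.

```
# Hypothetical install location for the filter script from this message
source-filter=/usr/local/lib/cgit/filters/to-utf8.pl
```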

We'll still get it wrong in "plain" view though, since we
unconditionally set the charset to UTF-8 there and dump the content out
raw.  That can be tweaked in the config file, but it looks like we get
that wrong too and unconditionally append a "charset=" to the MIME type,
even for binary types.

-- >8 --
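#!/usr/bin/perl
# CGit source filter: emit the blob as UTF-8, falling back to latin1
# when the input is not valid UTF-8 (same approach as Gitweb's to_utf8).
use strict;
use warnings;
use Encode;

# Encode everything we print as UTF-8.
binmode STDOUT, ':utf8';

# Slurp the whole blob from STDIN as raw bytes.
my $str = do { local $/; <STDIN> };

# utf8::decode returns false if the bytes are not well-formed UTF-8.
if (utf8::decode($str)) {
        print $str;
} else {
        # Otherwise assume latin1, Gitweb's default fallback encoding.
        print decode('latin1', $str, Encode::FB_DEFAULT);
}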

