Saturday, August 16, 2008

man2po2tmx2po2man \n bug


This entry is just an attachment to a bug report. Nothing interesting here.




My goal is to move the current framework for translating Linux manual pages into Czech to OmegaT. Since OmegaT cannot read man pages directly but can handle .po files, I'm using po4a to convert the man pages to po files and back. To ease the transition, I'm creating translation memories from the existing translations using po2tmx.



The original en/intro.1 looks like this:

...
.BI "knuth login: " aeb
.BI "Password: " ********
.BI "% " date
Tue Aug  6 23:50:44 CEST 2002
.BI "% " cal
     August 2002
Su Mo Tu We Th Fr Sa
             1  2  3
 4  5  6  7  8  9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31
.BI "% " ls
...

Its Czech translation, cs/intro.1, looks like this:
...
.BI "knuth login: " aeb
.BI "Password: " ********
.BI "% " date
Út Srp  6 23:50:44 CEST 2002
.BI "% " cal
     Srpen 2002
Ne Po Út St Čt Pá So
             1  2  3
 4  5  6  7  8  9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31
.BI "% " ls
...


This is how it looks when viewed using man en/intro.1 or man cs/intro.1:





To create a po file from the man pages I use the following command:

$ po4a-gettextize -f man -m en/intro.1 -l cs/intro.1 -L utf8 -p intro.1.po

The relevant part of the created intro.1.po file looks like this:
...
# type: Plain text
#: en/intro.1:106
#, fuzzy, no-wrap
msgid ""
"B<knuth login: >I<aeb>\n"
"B<Password: >I<********>\n"
"B<% >I<date>\n"
"Tue Aug  6 23:50:44 CEST 2002\n"
"B<% >I<cal>\n"
"     August 2002\n"
"Su Mo Tu We Th Fr Sa\n"
"             1  2  3\n"
" 4  5  6  7  8  9 10\n"
"11 12 13 14 15 16 17\n"
"18 19 20 21 22 23 24\n"
"25 26 27 28 29 30 31\n"
msgstr ""
"B<knuth login: >I<aeb>\n"
"B<Password: >I<********>\n"
"B<% >I<date>\n"
"Út Srp  6 23:50:44 CEST 2002\n"
"B<% >I<cal>\n"
"     Srpen 2002\n"
"Ne Po Út St Čt Pá So\n"
"             1  2  3\n"
" 4  5  6  7  8  9 10\n"
"11 12 13 14 15 16 17\n"
"18 19 20 21 22 23 24\n"
"25 26 27 28 29 30 31\n"
...

I have to remove the ', fuzzy' indications that were automatically inserted by po4a, otherwise po2tmx would ignore all the messages:
$ sed -i 's/, fuzzy//' intro.1.po

Because of bug 2021007 in OmegaT, I also have to remove the strings ', no-wrap' from the po file:
$ sed -i 's/, no-wrap//' intro.1.po
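
Both substitutions above match anywhere in the file, so a msgid or msgstr that happens to contain ', fuzzy' or ', no-wrap' would be altered as well. A more cautious variant, sketched below, restricts the edits to the flag lines that begin with '#,'. The file demo.po and its contents are made up for illustration; with GNU sed the same two expressions could be applied in place with -i.

```shell
# Build a small throwaway po fragment (hypothetical sample data).
tmpdir=$(mktemp -d)
cat > "$tmpdir/demo.po" <<'EOF'
#: en/intro.1:106
#, fuzzy, no-wrap
msgid "results are approximate, fuzzy at best\n"
msgstr "vysledky jsou priblizne\n"
EOF

# Only lines starting with "#," carry gettext flags, so both substitutions
# are limited to those lines and the message text is left untouched.
# A flag line that loses all its flags ends up as a bare "#", exactly as
# it would with the unrestricted commands.
cleaned=$(sed -e '/^#,/s/, fuzzy//' -e '/^#,/s/, no-wrap//' "$tmpdir/demo.po")
printf '%s\n' "$cleaned"
rm -rf "$tmpdir"
```

In this sample the flag line is reduced to '#', while the msgid that mentions ', fuzzy' is printed unchanged.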

Then I create a tmx file from the po file:
$ po2tmx -l cs -i intro.1.po -o intro.1.tmx

The relevant part of the intro.1.tmx file looks like this:
...
<tu>
<tuv xml:lang="en">
<seg>B&lt;knuth login: &gt;I&lt;aeb&gt;
B&lt;Password: &gt;I&lt;********&gt;
B&lt;% &gt;I&lt;date&gt;
Tue Aug  6 23:50:44 CEST 2002
B&lt;% &gt;I&lt;cal&gt;
     August 2002
Su Mo Tu We Th Fr Sa
             1  2  3
 4  5  6  7  8  9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31
</seg>
</tuv>
<tuv xml:lang="cs">
<seg>B&lt;knuth login: &gt;I&lt;aeb&gt;
B&lt;Password: &gt;I&lt;********&gt;
B&lt;% &gt;I&lt;date&gt;
Út Srp  6 23:50:44 CEST 2002
B&lt;% &gt;I&lt;cal&gt;
     Srpen 2002
Ne Po Út St Čt Pá So
             1  2  3
 4  5  6  7  8  9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31
</seg>
</tuv>
</tu>
...

Then I launch OmegaT, select 'Project'->'New' and give 'manpagestest' as the new project's name. The Source files language is en-us, target language is cs. Sentence-level segmenting is disabled. I quit the project.

Then I copy the tmx to the project:
$ cp intro.1.tmx manpagestest/tm/

Then I create an English po file from the original man page and place it in the source directory of the project:
$ po4a-gettextize -f man -m en/intro.1 -p manpagestest/source/intro.1.po


Then I launch OmegaT, open the manpagestest project and click on the intro.1.po file to edit it. When I move to the message containing the calendar, I can see that the suggested translation (based on the tmx) is not a 100% match as with the other strings; it's only 67%:



Still hoping it will turn out OK, I accept the fuzzy match as the new translation:



Then I save the project and click 'Project'->'Create translated documents'. I close OmegaT now.


The relevant part of the translated manpagestest/target/intro.1.po file now looks like this:

...
# type: Plain text
#: en/intro.1:106
#, no-wrap
msgid ""
"B<knuth login: >I<aeb>\n"
"B<Password: >I<********>\n"
"B<% >I<date>\n"
"Tue Aug  6 23:50:44 CEST 2002\n"
"B<% >I<cal>\n"
"     August 2002\n"
"Su Mo Tu We Th Fr Sa\n"
"             1  2  3\n"
" 4  5  6  7  8  9 10\n"
"11 12 13 14 15 16 17\n"
"18 19 20 21 22 23 24\n"
"25 26 27 28 29 30 31\n"
msgstr B<knuth login: >I<aeb>
B<Password: >I<********>
B<% >I<date>
Út Srp  6 23:50:44 CEST 2002
B<% >I<cal>
     Srpen 2002
Ne Po Út St Čt Pá So
             1  2  3
 4  5  6  7  8  9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31
...

This is obviously already wrong: the msgstr is not quoted, and its line breaks are literal instead of \n escapes. For the sake of the experiment, I go on anyway:
$ po4a-translate -f man -m en/intro.1 -p manpagestest/target/intro.1.po -l intro.cs.1
...
manpagestest/target/intro.1.po:134: (po4a::po)
Strange line: --><--
manpagestest/target/intro.1.po:165: (po4a::po)
Strange line: -->msgstr B<knuth login: >I<aeb><--
manpagestest/target/intro.1.po:166: (po4a::po)
Strange line: -->B<Password: >I<********><--
manpagestest/target/intro.1.po:167: (po4a::po)
Strange line: -->B<% >I<date><--
manpagestest/target/intro.1.po:168: (po4a::po)
Strange line: -->Út Srp  6 23:50:44 CEST 2002<--
manpagestest/target/intro.1.po:169: (po4a::po)
Strange line: -->B<% >I<cal><--
manpagestest/target/intro.1.po:170: (po4a::po)
Strange line: -->     Srpen 2002<--
manpagestest/target/intro.1.po:171: (po4a::po)
Strange line: -->Ne Po Út St Čt Pá So<--
manpagestest/target/intro.1.po:172: (po4a::po)
Strange line: -->             1  2  3<--
manpagestest/target/intro.1.po:173: (po4a::po)
Strange line: --> 4  5  6  7  8  9 10<--
manpagestest/target/intro.1.po:174: (po4a::po)
Strange line: -->11 12 13 14 15 16 17<--
manpagestest/target/intro.1.po:175: (po4a::po)
Strange line: -->18 19 20 21 22 23 24<--
manpagestest/target/intro.1.po:176: (po4a::po)
Strange line: -->25 26 27 28 29 30 31<--
manpagestest/target/intro.1.po:236: (po4a::po)
Strange line: --><--
manpagestest/target/intro.1.po:309: (po4a::po)
Strange line: --><--
...

As you can see, po4a complains about some extra blank lines in the file and rejects the translated part entirely. This is because po4a expects every line of a message to be enclosed in double quotes, which the msgstr written by OmegaT lacks (a bug in OmegaT?). I fix that manually:
$ cp manpagestest/target/intro.1.po manpagestest/target/intro.1.fixed.po
$ vi manpagestest/target/intro.1.fixed.po
$ cat manpagestest/target/intro.1.fixed.po
...
# type: Plain text
#: en/intro.1:106
#, no-wrap
msgid ""
"B<knuth login: >I<aeb>\n"
"B<Password: >I<********>\n"
"B<% >I<date>\n"
"Tue Aug  6 23:50:44 CEST 2002\n"
"B<% >I<cal>\n"
"     August 2002\n"
"Su Mo Tu We Th Fr Sa\n"
"             1  2  3\n"
" 4  5  6  7  8  9 10\n"
"11 12 13 14 15 16 17\n"
"18 19 20 21 22 23 24\n"
"25 26 27 28 29 30 31\n"
msgstr "B<knuth login: >I<aeb>"
"B<Password: >I<********>"
"B<% >I<date>"
"Út Srp  6 23:50:44 CEST 2002"
"B<% >I<cal>"
"     Srpen 2002"
"Ne Po Út St Čt Pá So"
"             1  2  3"
" 4  5  6  7  8  9 10"
"11 12 13 14 15 16 17"
"18 19 20 21 22 23 24"
"25 26 27 28 29 30 31"
...
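
Before retrying, a rough structural check can imitate what po4a reports as a 'Strange line'. The sketch below is an assumption about what counts as well formed, not po4a's actual parser: a line is accepted if it is blank, a comment, or a quoted string optionally preceded by a po keyword, and everything else is printed with its line number. The file check.po is hypothetical sample data.

```shell
# Hypothetical sample: one well-formed entry and one with an unquoted msgstr.
tmpdir=$(mktemp -d)
cat > "$tmpdir/check.po" <<'EOF'
#, no-wrap
msgid "B<% >I<date>\n"
msgstr "B<% >I<date>\n"

msgid ""
"B<% >I<cal>\n"
"     August 2002\n"
msgstr B<% >I<cal>
     Srpen 2002
EOF

# -v with several -e patterns prints the lines that match none of them:
# not blank, not a comment, and not a (keyword-prefixed) quoted string.
suspicious=$(grep -nEv \
    -e '^$' \
    -e '^#' \
    -e '^((msgctxt|msgid|msgid_plural|msgstr(\[[0-9]+\])?) )?".*"$' \
    "$tmpdir/check.po" || true)
printf '%s\n' "$suspicious"
rm -rf "$tmpdir"
```

On this sample only the unquoted msgstr line and its continuation line are reported. Unlike po4a, this check treats blank lines as harmless.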

Then I retry:
$ po4a-translate -f man -m en/intro.1 -p manpagestest/target/intro.1.fixed.po -l intro.cs.1

The important part of the created intro.cs.1 looks like this:
...
\fBknuth login: \fP\fIaeb\fP\fBPassword: \fP\fI********\fP\fB% \fP\fIdate\fPÚt Srp  6 23:50:44 CEST 2002\fB% \fP\fIcal\fP     Srpen 2002Ne Po Út St Čt Pá So             1  2  3 4  5  6  7  8  9 1011 12 13 14 15 16 1718 19 20 21 22 23 2425 26 27 28 29 30 31
...


and this is how it looks when viewed with man ./intro.cs.1, i.e. badly:
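
The joined lines in intro.cs.1 are consistent with the hand-edited msgstr: each line received quotes but no trailing \n, whereas every msgid line ends in \n. As a hypothetical alternative to the manual edit, the awk filter below re-quotes an unquoted msgstr and appends \n to every line. It is only a sketch: the function name po_quote, the sample file broken.po, and the rule that a raw block ends at a blank line, a comment or the next msgid are all assumptions, and it has not been run against the real intro.1.po.

```shell
# Hypothetical sample: an unquoted msgstr followed by a well-formed entry.
tmpdir=$(mktemp -d)
cat > "$tmpdir/broken.po" <<'EOF'
msgid ""
"B<% >I<cal>\n"
"     August 2002\n"
msgstr B<% >I<cal>
     Srpen 2002

msgid "B<% >I<ls>\n"
msgstr "B<% >I<ls>\n"
EOF

fixed=$(awk '
# Escape backslashes and double quotes, then wrap the text as "...\n".
function po_quote(s,    out, i, c) {
    out = ""
    for (i = 1; i <= length(s); i++) {
        c = substr(s, i, 1)
        if (c == "\\" || c == "\"") out = out "\\"
        out = out c
    }
    return "\"" out "\\n\""
}
# An unquoted msgstr starts a raw block; its text begins after "msgstr ".
/^msgstr [^"]/ {
    print "msgstr \"\""
    print po_quote(substr($0, 8))
    raw = 1
    next
}
# Assumption: a blank line, a comment or the next msgid ends the raw block.
raw && ($0 == "" || /^#/ || /^msgid /) { raw = 0 }
raw { print po_quote($0); next }
{ print }
' "$tmpdir/broken.po")
printf '%s\n' "$fixed"
rm -rf "$tmpdir"
```

The leading spaces of the calendar lines are kept inside the quotes, and entries that were already quoted pass through unchanged.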




All the mentioned files are attached.

