Discussion:
TeXML, the XML vocabulary for TeX
Oleg Paraschenko
2004-03-25 11:28:41 UTC
Hello colleagues,

I'd like to introduce you to TeXML, an XML vocabulary for TeX:

http://getfo.sourceforge.net/texml/

I think you could use TeXML to some extent in your project.

| Example of TeXML to TeX translation
|
| TeXML:
|
| <cmd name="documentclass">
| <opt>12pt</opt>
| <parm>letter</parm>
| </cmd>
|
| TeX:
|
| \documentclass[12pt]{letter}
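
To make the mapping concrete, here is a minimal Python sketch of how a
single <cmd> element could be turned into a TeX command. It is not the
code of the TeXML processor; the function name cmd_to_tex is made up
for illustration, and the sketch does no escaping of the text inside
<opt> and <parm>.

```python
import xml.etree.ElementTree as ET


def cmd_to_tex(xml_text):
    """Render one <cmd> element as a TeX command (illustrative only)."""
    cmd = ET.fromstring(xml_text)
    parts = ["\\" + cmd.get("name")]
    for child in cmd:
        text = child.text or ""
        if child.tag == "opt":
            # Optional arguments go into square brackets.
            parts.append("[" + text + "]")
        elif child.tag == "parm":
            # Mandatory arguments go into braces.
            parts.append("{" + text + "}")
    return "".join(parts)


example = """<cmd name="documentclass">
  <opt>12pt</opt>
  <parm>letter</parm>
</cmd>"""

print(cmd_to_tex(example))
```

For the input above the sketch prints the same line as the TeX output
in the example.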

One of the main benefits of using TeXML is the automatic translation
of TeX special symbols.

| Example of translation of special TeX symbols
|
| TeXML:
|
| <TeXML>\section{No&#xa0;break}</TeXML>
|
| TeX:
|
| $\backslash$section\{No~break\}
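
The idea can be sketched in a few lines of Python. The table below is
an assumption: it holds only the four characters that occur in the
example, while the real processor certainly covers many more. The
names TEXT_ESCAPES and escape_text are hypothetical.

```python
# Hypothetical, deliberately tiny escape table. It contains only the
# characters from the example; a real table would be much larger.
TEXT_ESCAPES = {
    "\\": "$\\backslash$",
    "{": "\\{",
    "}": "\\}",
    "\u00a0": "~",  # no-break space becomes a TeX tie
}


def escape_text(text):
    """Replace each TeX special character by its escaped form."""
    return "".join(TEXT_ESCAPES.get(ch, ch) for ch in text)


print(escape_text("\\section{No\u00a0break}"))
```

Because every character is looked up on its own, the generating code
never has to remember which characters are special.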

The default output encoding is UTF-8. The TeXML processor automatically
escapes characters that cannot be represented in the output encoding.

| Example of translation of non-ASCII characters
|
| TeXML:
|
| <TeXML>&#x422;&#x435;&#x425;</TeXML>
|
| TeX in ASCII encoding:
|
| \cyrchar\CYRT \cyrchar\cyre \cyrchar\CYRH
|
| TeX in Russian encoding:
|
| ТеХ
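
A rough Python sketch of this behaviour follows. It is a guess at the
approach, not the processor's code: the table LATEX_FALLBACK holds only
the three letters from the example, the function name encode_for_tex is
made up, and writing a numeric character reference for unmapped
characters is an assumption of this sketch.

```python
# Hypothetical excerpt of a Unicode-to-LaTeX table, copied from the
# example above.
LATEX_FALLBACK = {
    "\u0422": "\\cyrchar\\CYRT ",
    "\u0435": "\\cyrchar\\cyre ",
    "\u0425": "\\cyrchar\\CYRH ",
}


def encode_for_tex(text, encoding):
    """Keep characters the encoding can represent, escape the others."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            # Assumption: unmapped characters become numeric references.
            out.append(LATEX_FALLBACK.get(ch, "&#x{:X};".format(ord(ch))))
    return "".join(out)


word = "\u0422\u0435\u0425"
print(encode_for_tex(word, "ascii"))
# In a Cyrillic encoding the letters pass through unchanged.
print(encode_for_tex(word, "koi8-r") == word)
```

Note that the ASCII result of the sketch ends with a trailing space,
because each command in the table carries its own terminating space.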

There are some advantages to generating TeXML instead of TeX:

* you avoid painful handling of TeX special characters,
* you do not have to worry about encodings,
* you are more likely to write error-free code.

About the last item: suppose, for example, that you want to generate

| {\bf bold}

One approach is to generate "{", then "\bf " (with a trailing space),
and then "}". It is easy to miss the space, forget a brace, or write
the wrong brace. When you use TeXML, the processor takes care of this
for you:

| <group><cmd name="bf"/>bold</group>
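
If the TeXML itself is built with an XML library rather than by
printing strings, unbalanced markup becomes impossible as well. Here is
a small sketch using Python's standard xml.etree.ElementTree module;
the helper name bold_group is made up for illustration.

```python
import xml.etree.ElementTree as ET


def bold_group(text):
    """Build <group><cmd name="bf"/>text</group> as a string."""
    group = ET.Element("group")
    cmd = ET.SubElement(group, "cmd", name="bf")
    # Text that follows an element is stored in its "tail".
    cmd.tail = text
    return ET.tostring(group, encoding="unicode")


print(bold_group("bold"))
```

The library closes every element it opens, and the TeXML processor
then emits the matching braces and the space after the command.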

Your comments are welcome.

Regards, Oleg


-------------------------------------------------------
This SF.Net email is sponsored by: IBM Linux Tutorials
Free Linux tutorial presented by Daniel Robbins, President and CEO of
GenToo technologies. Learn everything from fundamentals to system
administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
Torsten Bronger
2004-03-25 11:42:49 UTC
Halloechen!
Post by Oleg Paraschenko
I think that you can use TeXML to some extent in the your project.
| Example of TeXML to TeX translation
|
|
| <cmd name="documentclass">
| <opt>12pt</opt>
| <parm>letter</parm>
| </cmd>
|
|
| \documentclass[12pt]{letter}
One of the main benefits of TeXML usage is an automatical translation
of the TeX special symbols.
Interesting, but how is it implemented? In XSLT, or a scripting
language, or what? How fast is it (I'm not prepared to accept a
further significant drop in speed)?

How are different \usepackage[???]{inputenc}'s dealt with?

Tschoe,
Torsten.
--
Torsten Bronger, aquisgrana, europa vetus



Oleg Paraschenko
2004-03-25 12:26:13 UTC
Hi!

On Thu, 25 Mar 2004 12:42:49 +0100
Post by Torsten Bronger
Halloechen!
...
Post by Torsten Bronger
Post by Oleg Paraschenko
One of the main benefits of TeXML usage is an automatical
translation
of the TeX special symbols.
Interesting, but how is it implemented? In XSLT, or a scripting
language, or what?
It is implemented in Python. It uses only core Python modules (the
expat XML parser, the Unicode database, and a few others), so it should
work on any recent system. The mapping from Unicode characters to LaTeX
commands is taken from an attachment to the MathML specification
(http://www.w3.org/Math/characters/unicode.xml (note: 1.5 MB)).
Post by Torsten Bronger
How fast is it (I'm not prepared to accept a
further significant drop down in speed)?
It is hard to say exactly, but I think it is fast. In any case,
it should be faster than processing the specials in XSLT.
Post by Torsten Bronger
How are different \usepackage[???]{inputenc}'s dealt with?
The processor does not know about \usepackage; it only translates
characters. It is the task of the XSLT to insert the \usepackage
command into the output, if required.

The user can specify an output encoding. The processor attempts to
produce the best possible translation for it. For example, for the
letter &szlig;, if the output encoding is ascii, the processor outputs
"\ss "; if the output encoding is latin1, the processor outputs "ß". In
the latter case the correct header is \usepackage[latin1]{inputenc},
but creating this header is not the processor's job.
Post by Torsten Bronger
Tschoe,
Torsten.
--
Torsten Bronger, aquisgrana, europa vetus
Bye!

--
Oleg


Torsten Bronger
2004-03-25 12:43:51 UTC
Halloechen!
Post by Oleg Paraschenko
On Thu, 25 Mar 2004 12:42:49 +0100
[...] Interesting, but how is it implemented? In XSLT, or a
scripting language, or what?
It is implemented in the Python scripting language.
I don't know Python. How easily can this be installed on a Windows
system?
Post by Oleg Paraschenko
It uses only core Python modules (expat XML parser, unicode
database, something other), so it should work on any recent
system. Mapping from Unicode characters to LaTeX commands is taken
from attachment for the MathML specification
(http://www.w3.org/Math/characters/unicode.xml (note: 1,5 Mb)).
And is it mode-aware? Does an alpha become \alpha in formulae and a
Greek letter elsewhere? What about ligatures like "--"? Is this an
en-dash or two hyphens? What about typographic things like thin
spaces, soft hyphens, zero-width non-joiner and "break permitted
here"? How much of Unicode is covered so far?
Post by Oleg Paraschenko
How fast is it (I'm not prepared to accept a further significant
drop down in speed)?
It is hard to said exactly, but I think it is fast. In any case,
it should be faster then processing of specials by xslt.
Okay; I asked because using it would mean translating
XML-->XML-->text instead of XML-->text-->filter-->text, where
"filter" is *very* fast. But faster than XSLT may be enough.
Post by Oleg Paraschenko
How are different \usepackage[???]{inputenc}'s dealt with?
The processor does not know about \usepackage, it only translates
characters. It is a task of an xslt to insert \usepackage command into
the output, if required.
So I always have to include things like wasy, pifont, textcomp etc?
Wouldn't be a problem, I just need a complete list.
Post by Oleg Paraschenko
User can specify an output encoding. The processor attempts to make as
good translation as possible for it.
Sounds nice. Are you aware of the very new utf-8 support that was added
to the LaTeX core two months ago? How well does it work?

Tschoe,
Torsten.
--
Torsten Bronger, aquisgrana, europa vetus



Oleg A. Paraschenko
2004-03-25 15:17:05 UTC
Hi!

On Thu, 25 Mar 2004 13:43:51 +0100
Post by Torsten Bronger
Halloechen!
Post by Oleg Paraschenko
On Thu, 25 Mar 2004 12:42:49 +0100
[...] Interesting, but how is it implemented? In XSLT, or a
scripting language, or what?
It is implemented in the Python scripting language.
I don't know Python. How easy can this be installed on a Windows
system?
It should not be a problem. You can download Python from
http://www.python.org/download/ , install it, and run the scripts from
the command line. For example:

d:\python23\python.exe texml.py -e ascii test.xml test.tex
Post by Torsten Bronger
Post by Oleg Paraschenko
It uses only core Python modules (expat XML parser, unicode
database, something other), so it should work on any recent
system. Mapping from Unicode characters to LaTeX commands is taken
from attachment for the MathML specification
(http://www.w3.org/Math/characters/unicode.xml (note: 1,5 Mb)).
And is it mode-aware?
Yes, it is mode-aware. It knows text and math.
Post by Torsten Bronger
Does an alpha become \alpha in formulae and a
Greek letter elsewhere?
I tested and found that in both modes the result is "\alpha ". (Or the
letter alpha itself if the output is in a Greek encoding. I consider
this acceptable because I see very little difference between
"$\alpha $" and "$a$".)
Post by Torsten Bronger
What about ligatures like "--"? Is this an
en-dash or two hyphens?
As ligatures in TeX are a property of fonts and not a property of a
document, and as the TeXML processor can't guess what font will be
used, the processor ignores ligatures altogether. As a result, "--" in
TeXML is translated into "--" in TeX, which is interpreted as an
en-dash. At the time of development I considered this to be correct
behaviour. Now I'm changing my mind and adding the handling of "--" and
"---" to the list of bugs. Anyway, I don't plan to break ligatures like
"fi", "fl" etc.
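
For illustration only: one common TeX idiom for keeping hyphens apart
is to put an empty group between them. The Python sketch below applies
that idiom. It is not the fix planned for the TeXML processor, and the
function name break_dash_ligatures is made up.

```python
import re


def break_dash_ligatures(text):
    """Insert {} after every hyphen that is followed by another hyphen."""
    return re.sub(r"-(?=-)", "-{}", text)


print(break_dash_ligatures("pages 1--2"))
print(break_dash_ligatures("---"))
```

Single hyphens are left untouched, so ordinary hyphenated words are
not affected.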
Post by Torsten Bronger
What about typographic things like thin
spaces, soft hyphens, zero-width non-joiner and "break permitted
here"? How much of Unicode is covered yet?
There are two translation tables, one for text mode and one for math
mode. There are 2361 symbols for text mode and 195 symbols for math
mode (math mode falls back to the text-mode table if a symbol is not
found).

For the typographic characters you mention, here is a test:

| TeXML:
|
| <TeXML>&#x3B1;<math>&#x3B1;</math>
| thin space: [&#x2009;]
| soft hyphens: [&#xAD;]
| zero-width non-joiner: [&#8204;] oops here ...
| break permitted here: [&#x82;] ... and here
| </TeXML>
|
| TeX:
| \alpha $\alpha $
| thin space: [\hspace{0.167em}]
| soft hyphens: [\-]
| zero-width non-joiner: [&#x200C;] oops here ...
| break permitted here: [&#x82;] ... and here

As we see, not all characters are mapped. If this is an issue, it is
an issue for the maintainers of the Unicode map of the MathML
specification. Once they accept and fix a problem, the TeXML processor
will be updated as well.
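
The lookup with two tables and the fallback from math mode to text mode
can be sketched in Python as follows. This is only an illustration of
the idea, not the processor's code. The entries for alpha, thin space
and soft hyphen are taken from the test above; the two entries for the
multiplication sign are invented to show a case where the tables
differ, and the sketch ignores the output encoding.

```python
# Text-mode table (excerpt). The last entry is hypothetical.
TEXT_TABLE = {
    "\u03b1": "\\alpha ",
    "\u2009": "\\hspace{0.167em}",
    "\u00ad": "\\-",
    "\u00d7": "\\texttimes ",  # hypothetical entry
}

# Math-mode table: holds only symbols that differ from text mode.
MATH_TABLE = {
    "\u00d7": "\\times ",  # hypothetical entry
}


def translate_char(ch, math_mode=False):
    """Translate one character, trying the math table first in math mode."""
    if ord(ch) < 128:
        return ch  # escaping of ASCII specials is left out of this sketch
    if math_mode and ch in MATH_TABLE:
        return MATH_TABLE[ch]
    if ch in TEXT_TABLE:
        return TEXT_TABLE[ch]  # math mode reuses the text-mode table
    # Unmapped characters come out as numeric references, as in the test.
    return "&#x{:X};".format(ord(ch))


print(translate_char("\u03b1", math_mode=True))
print(translate_char("\u00d7", math_mode=True))
print(translate_char("\u200c"))
```

Keeping the math table small and falling back to the text table means
that a symbol has to be listed twice only when its two forms differ.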
Post by Torsten Bronger
Post by Oleg Paraschenko
How fast is it (I'm not prepared to accept a further significant
drop down in speed)?
It is hard to said exactly, but I think it is fast. In any case,
it should be faster then processing of specials by xslt.
Okay; I asked because using it would mean to translate
XML--XML-->text instead of XML-->text-->filter-->text, where
"filter" is *very* fast. But faster than XSLT may be enough.
Post by Oleg Paraschenko
How are different \usepackage[???]{inputenc}'s dealt with?
The processor does not know about \usepackage, it only translates
characters. It is a task of an xslt to insert \usepackage command into
the output, if required.
So I always have to include things like wasy, pifont, textcomp etc?
Wouldn't be a problem, I just need a complete list.
Maybe I don't understand the question well, so please repeat it if this
does not answer it. The TeXML processor does not add anything. So if,
say, the processor generates "\alpha", and using "\alpha" in a TeX
document requires a package "greekfont", you will probably get an error
from LaTeX. I have no good solution yet.
Post by Torsten Bronger
Post by Oleg Paraschenko
User can specify an output encoding. The processor attempts to make as
good translation as possible for it.
Sounds nice. Are you aware of the very new utf-8 that was added to
the LaTeX core two months ago? How good does it work?
I don't know yet whether it works well. One of the problems is that
Unicode by itself is not enough. There are right-to-left languages,
dynamic ligatures and other issues, so I'm investigating Omega/Lambda
rather than the LaTeX core.
Post by Torsten Bronger
Tschoe,
Torsten.
--
Torsten Bronger, aquisgrana, europa vetus
Bye!

--
Oleg


