The ModulaTor
Erlangen's First Independent Modula_2 Journal! Nr. 4, May-1994
Character-less Programming
__________________________
by K Hopper, RH Barbour, Dept of Computer Science, University of Waikato,
New Zealand, Email: kh@athens.cs.waikato.ac.nz
Abstract
This paper discusses issues of programming language translator portability. A
proposed mechanism to provide for defining and implementing programming
languages is described. The mechanism depends upon the separation of the
language, character set and other culturally dependent features from the
process of translating a program or executing it when translated.
The mechanism is developed against a background of a review of programming
languages and their visible manifestation. The proposed mechanism applies the
principles of abstraction using abstract data types in order to specify some
aspects of programming languages. A corresponding strategy for the
implementation of programming language tools and the environment in which
they may be used is also described.
Introduction
New techniques are urgently required for providing access to Information
Technology to people from all nations and cultures. To help meet this need,
research into portable software for programming language translation has been
carried out in New Zealand.
Such software isolates translator activities from the peculiarities of various
machine and operating system environments. The research questioned what
was common across cultures in every mapping from an external representation
of a programmer's source text to an internal encoding suitable for a programming
language translator.
Further work is continuing to prepare lists of those common features of
individual cultures which require specific hardware or software solutions. These
lists, together with the portable translators referred to, will form the enabling
technology for each culture to participate in Information Technology by invitation
and on its own terms.
Cultural Needs
The need for improved inter-cultural communication poses a challenge to the
Information Technology community. This need has indeed generated interest
and effort among international standards groups working on behalf of the IT
industry and its clients. A recent development from this effort has been the
recognition of a requirement for a generic set of tools within a coherent
theoretical framework. These are needed to provide culturally appropriate
representations of ideas to, from and between computer systems [Bar].
The use of a programming language for describing to a computer the desired
manipulation of ideas is the principal currently available technology. The general
concept of language is intended to describe the sounds, marks and signs which
people use for communication. The mechanism proposed for programming
language translation reflects the nature of the human communication process
rather than any specific cultural perspective.
It is suggested, based upon experimental implementation, that the proposed
mechanisms and techniques are as applicable to those programming languages
currently in use as to those yet to be developed.
The Expression of Meaning
Languages vary in complexity and expressive power. All language users use the
notion of 'token-in-shared-context' for determining the meaning of a particular
utterance (a mark or sign) in a particular context. Examples are two friends in
idle conversation or a programmer and a translation tool.
The shared context is provided in two parts :--
- The structure of the utterance itself -- that is the syntax of the language.
- The common milieu of those communicating. For example, when using the
English word path, the two friends in conversation would probably be referring to
a walkway between two hedgerows 'over there'. The programmer
communicating with his computer almost certainly considers that path refers to
a sequence of tokens used to identify some name space.
The token made visible as the word path therefore has no intrinsic meaning of
its own. All English speakers who are computer literate, however, have an
internal map from visible (audible) token to (at least) the two
'meanings-in-context' given as examples.
To extend this notion further, homophones are just as able to have different
'mappings-in-context' as may homographs. Providing the recipient of the
utterance knows the context then all is well. However, if the shared context is
not in fact determinate as, for example, on reading the isolated visible token fin
-- is it English and part of a fish or aeroplane, or French and this is the end! --
then communication is not possible.
Token Abstraction
Linguists are aware that no token has an intrinsic meaning in isolation from a
context. This deduction has rarely been applied in computing -- particularly not
in the design and implementation of computer programming languages. This is
understandable from a historical perspective. In the past, computer
representations of visible character sets were in use well before the
development of language theory as it is known today. The definition of the Algol
60 language was given in terms of the visible marks in the document because of
the differences in available character sets.
Current practice also reflects the natural attitude of programmers who are
working within a single culture. A programmer would take for granted that other
people would share both context and the meaning of the tokens used without
need for further explanation.
The idea of token abstraction was first used in the programming language APL.
It was defined in this way because the richness of the visible forms of the tokens
used to express it was not available in any computer character encoding at the
time of its development. The latest version of this language, Extended APL (DIS
13751 [ISO-a]), follows this scheme by naming tokens rather than giving them
explicit visible expression in the standard (although, naturally, the names
themselves are visible when printed).
An implementation of APL is required to provide a mapping from some external
representation of a token to the language-specified token. No distinction is
necessarily made, however, between these two conceptually different things in
the language implementation.
The programming language Ada (including its latest revision as Ada-9X [ISO-b])
goes a step further in indicating a possible solution to the overall problem
(although it must be pointed out that this has not yet been taken to its inevitable
conclusion as seen from the perspective of the suggestions offered in this
paper). The Ada language defines enumeration types, each value of which is
itself a token, provided with a position (in the enumeration). In a real sense the
value of this token is its position in the value domain defined by the type.
However, Ada goes one step further -- it provides for an explicit representation
to be defined for each value. Referring to the heading of this section and
inverting the above statement, the Ada language provides an abstract value for
each representation (visible/audible token) in an enumeration.
The Modula-2 programming language (DIS 10514 [ISO-c]) offers a variant,
though related, definition of the pre-defined language type CHAR which it
defines as an enumeration. Some of the names for the values of the type are
pre-defined in the standard while additional names are (if provided) to be
defined by an implementation. The important point about this for the purposes of
this paper, however, is that Modula-2 leaves a person carrying out an
implementation entirely free to choose any suitable representation for the
transfer of values into and out of a program.
Language Tokens and Representation
While all programming language standards (or other specifying documents) are
provided as written/displayable text, this is solely for the human reader. If it were
necessary to provide standards documents for blind people
then it is likely that they would be prepared as audio recordings of the spoken
word. The language tokens would then be phoneme rather than grapheme
sequences.
Similarly, in the foreseeable future programming will become an activity in which
some design tool generates the syntactic tokens of a language -- not a string of
lexemes derived from a visible character representation. It is this notion,
together with the ideas taken from the APL, Ada and Modula-2 language
definitions, which leads to the first part of the proposed solution to token-rich
but character-less programming.
Provided that a programming language translator is passed a stream of syntax
tokens in whatever internal representation it requires then it can attempt to carry
out the translation desired. The task of providing the internal representation of
the syntax tokens is a simple one of mapping short sequences of
graphemes/phonemes into a token. The encoding of the lexemes is irrelevant --
provided that a mapping is made available to the translator for its use when
translating! Using this technique it is possible to produce an almost
language-independent lexical analyser. The only problems arise where a
language makes use of constructed tokens (usually only numbers).
What is needed, therefore, is a list of all of those terminal tokens which exist in a
programming language -- a list specified as an ordered enumeration in the
language standard (or other specifying document). A typical list would include
such tokens as -- Full-stop, End-token, Line-mark, For-token, White-space,
Procedure-token, Quote-mark. The list, apart from being ordered in the
language specification, must contain all of the tokens specified by the language,
including separately those lexemes used in constructing tokens such as
numbers.
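As an illustration only, such an ordered list might be sketched in Python as
follows. The token names are taken from the examples in the text; the positions
assigned to them and the helper is_composing are assumptions made for the
sketch and are not drawn from any language standard.

```python
from enum import IntEnum, unique


@unique
class Token(IntEnum):
    """Hypothetical ordered lexis; positions, not spellings, identify tokens."""
    FULL_STOP = 0
    END_TOKEN = 1
    LINE_MARK = 2
    FOR_TOKEN = 3
    WHITE_SPACE = 4
    PROCEDURE_TOKEN = 5
    QUOTE_MARK = 6
    # Lexemes listed separately because they only help construct numbers.
    DIGIT_ZERO = 7
    DIGIT_ONE = 8
    EXPONENT_MARK = 9


# Lexemes that may start or continue a constructed (numeric) token.
COMPOSING = frozenset({Token.DIGIT_ZERO, Token.DIGIT_ONE, Token.EXPONENT_MARK})


def is_composing(token: Token) -> bool:
    """Report whether a token is one of the number-building lexemes."""
    return token in COMPOSING
```

Nothing in the enumeration records how a token is written or spoken; that
information belongs entirely to a separate mapping.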
It is necessary to distinguish tokens in this list from those specified in an
individual program. Programmer-defined tokens may be divided into the
following classes :--
- Compositions of language defined lexemes into a constructed token. In all
known programming languages such compositions are restricted to the
representation in visible form of a numeric value (which requires translator
processing in order to determine its value). For languages in which such
compositions are needed then the list of terminal tokens will necessarily include
such additional language-defined tokens as Digit-zero, Digit-one,
Exponent-mark, etc.
- Bracketed data. For such data values a language need only define one or
more tokens to provide the brackets. The single token to 'toggle' between
program source and data (the Quote-mark) as used in this sentence is another
possible mechanism. Many existing languages consider strings of visible
characters in this kind of way (see Note).
- All other programmer-defined tokens -- considered as identifiers.
Note: Where an existing language does interpret the contents of some character
string in a particular way (eg when I/O formatting) there is always some 'escape'
token followed by tokens known to the I/O formatting translator -- sometimes
followed by a further separator to indicate the end of the 'nested' bracketing
(dependent on the syntax of the formatting language). This is just another
language for which syntax tokens need definition. The fact that it may be
incorporated within some programming language translator must not be allowed
to obscure the fact that it is a different language and, as such, needs definition
too.
It is suggested that a lexical analyser has no need to be aware of the external
form of representation of any token at all. The lexical analyser for a
(programming) language translator only needs to be given a mapping of the
external representations for language tokens, including those necessary to
provide for programmer composable numeric value representations. It has no
need to be aware of any character encodings used in its host environment -- for
example any use of codes which a suitable rendering engine may produce as
Arabic, English, Chinese or any other humanly readable/audible language
'words' or 'digits' or 'punctuation'.
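A minimal Python sketch of this idea is given here, assuming that the source
has already been divided into lexemes, each held as an octet sequence. The
three maps and the function tokenise are hypothetical; the octet values were
chosen only for illustration.

```python
from enum import IntEnum


class Token(IntEnum):
    """A tiny hypothetical lexis."""
    FOR_TOKEN = 0
    END_TOKEN = 1
    FULL_STOP = 2


# Each map is prepared locally; the analyser never inspects what the
# octets look like when rendered.
ENGLISH_MAP = {b"FOR": Token.FOR_TOKEN, b"END": Token.END_TOKEN,
               b".": Token.FULL_STOP}
FRENCH_MAP = {b"POUR": Token.FOR_TOKEN, b"FIN": Token.END_TOKEN,
              b".": Token.FULL_STOP}
# Arbitrary octets with no rendering implied at all.
OPAQUE_MAP = {bytes.fromhex("0001"): Token.FOR_TOKEN,
              bytes.fromhex("0002"): Token.END_TOKEN,
              bytes.fromhex("0003"): Token.FULL_STOP}


def tokenise(lexemes, mapping):
    """Replace each lexeme (an octet sequence) by its mapped token value."""
    return [mapping[lexeme] for lexeme in lexemes]
```

All three sources below yield the same token stream, so the later phases of a
translator would be unable to tell them apart:

```python
tokenise([b"FOR", b"END", b"."], ENGLISH_MAP)
tokenise([b"POUR", b"FIN", b"."], FRENCH_MAP)
tokenise([bytes.fromhex("0001"), bytes.fromhex("0002"),
          bytes.fromhex("0003")], OPAQUE_MAP)
```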
Multi-culturalism
A mapping provided for a translator in this way should be made available in the
environment in which it is to execute as part of the culture-dependent
components of the host operating system. A multi-cultural operating system will
provide, say, for Arab, English and Chinese programmers to work in their own
natural language alongside each other on the same joint project and all be able
to use the same programming language tools, including the translator.
The prospect described in the previous paragraph of joint project working in a
multi-cultural environment does raise with it a pair of additional problems which
need to be addressed if the technique of providing mapping files for lexical
analysers is to be a useful step towards improving application portability. There
needs to be :--
- An operating system 'lingua franca' so that those multi-cultural users may
share objects which others have created.
- A way of sharing concepts identified by some encoding of this lingua franca so
that users from different cultures can use their own native terminology and
produce a dynamic addition to what is effectively a project common concept
name space.
While this observation may seem to offer cause for reconsideration of the
practicality of the projected mechanism, it is really only serving to reiterate the
point made earlier that before communication can take place there must be a
shared concept about which communication would be meaningful.
The consequence, therefore, is that two additional facilities may be needed
where joint working is important, whether this is on a joint
project or a multi-user machine :--
- A facility for creating unique identifiers in the name-space concerned when
creating a new object. This is the equivalent to adding a new element to the
enumerated list of lexical tokens for a programming language definition. It is this
identifier, known only to the underlying operating system or project data-base
(eg where a project is under PCTE [ECMA] control), to which a user may attach
an encoding of an associated lexical token for the associated object.
- An addition to the lexical mapping for the environment tools which need to find
and manipulate such an object.
This, therefore, extends the concept to the dynamic creation (and, necessarily,
destruction too) of new identifiers and encoding mappings. Such a concept has,
it may be remarked, been used in 'secret' by the Unix operating system for many
years -- identifiers for file system objects are contained in uniquely numbered
and dynamically created entities known as i-nodes. The dynamically extensible
mapping is provided by a directory system. The technique is thus well-known by
implementers and this paper is merely suggesting that it can be extended into
other areas for the purposes of promoting application portability and the goals of
culturally sensitive Information Technology products in general.
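The two facilities can be sketched in Python as a small name-space class,
loosely modelled on the i-node and directory analogy. The class NameSpace, its
method names and the culture labels are all assumptions made for illustration;
they do not describe PCTE or any real operating system interface.

```python
import itertools


class NameSpace:
    """Hypothetical shared name-space: unique identifiers plus per-culture maps."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._objects = {}       # unique identifier -> object
        self._directories = {}   # culture -> {encoding: unique identifier}

    def create(self, obj):
        """Create a new object and return its unique identifier."""
        uid = next(self._next_id)
        self._objects[uid] = obj
        return uid

    def attach(self, culture, encoding, uid):
        """Add an encoding for an existing object to one culture's mapping."""
        if uid not in self._objects:
            raise KeyError(uid)
        self._directories.setdefault(culture, {})[encoding] = uid

    def lookup(self, culture, encoding):
        """Find an object through one culture's encoding of its name."""
        return self._objects[self._directories[culture][encoding]]

    def destroy(self, uid):
        """Remove an object together with every encoding that refers to it."""
        del self._objects[uid]
        for directory in self._directories.values():
            for encoding in [e for e, u in directory.items() if u == uid]:
                del directory[encoding]
```

Two users may then attach different encodings, say b"report" and b"rapport", to
the same identifier and both will reach the same object.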
The Mechanism in Practice
Whether used statically in relation to the definition of a (programming) language
standard, or dynamically for common access to a project name-space, the
technique described is intended to be used by the appropriate tool(s) in the
following general manner.
The initialization phase of a lexical analyser/generator involves first finding (in
an environment dependent way) the mapping file for the currently defined
culture. The contents of the map must then be read into some internal data
structure for later use as a lexicon for analysis (or generation) of external
representation encodings.
During subsequent translation of some unit or program the actions of a lexical
analyser are as follows -- the encodings received in the source stream are
matched against the mappings and further lexical actions proceed in the way
outlined below :--
- If a match is found then the appropriate token value (not the encoding) is used
in further processing.
- If the lexeme found is needed to start a token construction then subsequent
lexemes are used in such construction as may be required by language-specific
rules until a non-composing lexeme is detected.
- If the token is a start-data-delimiter then subsequent encodings (these are not
tokens since there is no mapping for them) are collected as data (for
interpretation by some other language engine as needed at some later time)
until an appropriate end-data-delimiter token is detected.
- Every other sequence of encodings is treated as an identifier token and (most
likely) looked-up/entered into the translator identifier table.
Note: Comments in the source of a translation unit could be treated as data
which may or may not be discarded for the purposes of the translator.
Further translation syntactic and semantic processing can then take place in the
normal way.
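The four lexical actions listed above might be sketched in Python as follows.
This is an illustration under stated assumptions, not the implementation used
in the research: the source is assumed to arrive already divided into lexemes
(octet sequences), only two digit lexemes are defined for brevity, numbers are
assumed to be decimal, and the names Token, MAP and analyse are hypothetical.

```python
from enum import IntEnum


class Token(IntEnum):
    """A tiny hypothetical lexis."""
    FOR_TOKEN = 0
    WHITE_SPACE = 1
    QUOTE_MARK = 2
    DIGIT_ZERO = 3
    DIGIT_ONE = 4


# Composing lexemes and the values they contribute to a constructed number.
DIGIT_VALUES = {Token.DIGIT_ZERO: 0, Token.DIGIT_ONE: 1}

# Example map for one culture; the octets are arbitrary choices.
MAP = {b"FOR": Token.FOR_TOKEN, b" ": Token.WHITE_SPACE,
       b'"': Token.QUOTE_MARK, b"0": Token.DIGIT_ZERO, b"1": Token.DIGIT_ONE}


def analyse(lexemes, mapping, identifiers):
    """Turn a list of lexemes into (kind, value) pairs using the four rules."""
    out = []
    i = 0
    while i < len(lexemes):
        token = mapping.get(lexemes[i])
        if token in DIGIT_VALUES:
            # Construction: consume lexemes until a non-composing one appears.
            value = 0
            while i < len(lexemes) and mapping.get(lexemes[i]) in DIGIT_VALUES:
                value = value * 10 + DIGIT_VALUES[mapping[lexemes[i]]]
                i += 1
            out.append(("number", value))
        elif token is Token.QUOTE_MARK:
            # Data: collect raw encodings, unmapped, up to the end delimiter.
            i += 1
            data = []
            while i < len(lexemes) and mapping.get(lexemes[i]) is not Token.QUOTE_MARK:
                data.append(lexemes[i])
                i += 1
            if i == len(lexemes):
                raise ValueError("missing end-data-delimiter")
            i += 1
            out.append(("data", b"".join(data)))
        elif token is not None:
            # Match: the token value, not the encoding, is passed on.
            out.append(("token", token))
            i += 1
        else:
            # Anything else is an identifier, looked up or entered in the table.
            out.append(("identifier",
                        identifiers.setdefault(lexemes[i], len(identifiers))))
            i += 1
    return out
```

Note that a white-space encoding met between data delimiters is collected as
data and is not converted to a token, in keeping with the third rule.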
For the generation of a stream of lexemes, similar (though rather simpler) rules
of operation may be described.
Note that no characters at all are needed in this description or in the actual
processing involved in implementing such a mechanism -- merely an encoding
map together with three auxiliary lexical rules, only the composition and data
delimiting rules being language-dependent as would be expected.
Annex A contains a set of guidelines for use by programming language and data
abstraction designers and tool implementers, based on the above description of
operation. The formal specification for the format for a universal file to contain
such a mapping for any (programming) language and cultural environment is
given in Annex B and an example programming language lexical token
enumeration in Annex C.
The format specification is suitable for standardisation so that anyone with a
modicum of skill (and/or intelligent tools) can define a mapping file for a
language which is sensitive to the local culture, devices and operating system.
Providing facilities to enable this kind of thing to be done locally is a very
important part of being culturally sensitive in spreading the benefits of IT to a
wider community.
Note 1: Any need to translate the source form of a program exported from some
other site therefore requires a copy of the mapping file used originally.
Note 2: Any need to read the source form will also require a suitable rendering
engine for the encoding used -- possibly also someone who knows the language
involved to interpret the lexical tokens presented.
Data Manipulation
The suggested language-defined translator map completely specifies a
character-less (and hence considerably more portable and culturally-sensitive)
translator mechanism. It does not at first sight appear to solve the allied problem
of data manipulation by an executing program.
Where data appeared in the source of a program, the translator lexical analyser
merely collected encodings in a totally transparent manner. After all, it is the
manipulation of such unknown data, together with that read from some input
channel during processing, which is the principal purpose of a program. The
production of data through some output channel is only of related concern if it is
intended for use by some other tool (eg a rendering engine producing sounds or
visible marks for human users).
The encoding of data, whether as part of some program source or as data
obtained/generated during program execution, is therefore of major concern to
the writer of a program -- but only insofar as the actual encoding employed may
be 'understood' by the program. The solution to this problem given above in the
case of a programming language translator (which is, after all, only another
program), relied upon providing a mapping from encodings to lexical tokens in a
completely portable, transparent, manner.
While it would, perhaps, be ideal for the specification of every program to
indicate its input/output languages in such a way, this is unlikely to be practical
for some time to come (if ever). As an interim solution, therefore, it will be
necessary to provide a simple, practical mechanism which will improve upon the
present situation -- ie it must be possible to determine the incoming tokens from
the external representation.
The current work being undertaken world-wide to collect a number of abstract
concepts which are shared by more than one culture offers, it is suggested, a
small step in the right direction. If, for each shared abstraction, a unique
Abstract Data Type (ADT) is provided as part of the environment within which a
program executes then the program may make use of such a type and its
standard operations without any concern for the culturally specific interpretation
of such a concept. Similarly, in addition to not needing to be concerned with
local interpretation, the program does not need to 'understand' either the syntax
or the encoding of any human interpretable form of the abstraction -- that is hidden
inside the implementation of the lexical analyser/generator for import/export of
values of the abstraction.
The implementer of the ADT facility for some specific culture and system will, for
lexical analysis/generation purposes, need to initialise an internal mapping of a
culturally-specific encoding in exactly the same way that a language translator
lexical analyser had to read its lexical mapping. Once again, there is no need for
the culturally-specific ADT implementer to be aware of what form an encoding
may take.
Providing input and output of values of an abstract data type is not, however,
quite as simple as it may seem. Until a general
technique for describing language syntax in a simple tabular manner
dynamically can be developed, the implementation of the ADT must contain its
own culturally appropriate syntax rules for the internalisation and externalisation
of values of the type. This is, of course, already current practice -- but
the production/analysis of the lexis for the ADT's human-interpretable
language, once separated out, merely needs to conform to the rules given
for the language translator lexical analyser mapping file.
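A minimal Python sketch of such an ADT is given here for a shared concept of
natural number. The names Natural and NaturalCodec are hypothetical, and the
codec embodies one assumed syntax rule (positional decimal notation, most
significant digit first, every digit encoding of equal length). A culture whose
written numbers follow a different rule would need a different codec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Natural:
    """Hypothetical shared abstraction: a non-negative whole number."""
    value: int

    def __add__(self, other):
        return Natural(self.value + other.value)


class NaturalCodec:
    """Import/export of Natural values through a culture's digit encodings."""

    def __init__(self, digit_encodings):
        # digit_encodings[d] is the octet sequence standing for digit d.
        if len(digit_encodings) != 10:
            raise ValueError("ten digit encodings are required")
        self._width = len(digit_encodings[0])
        if any(len(e) != self._width for e in digit_encodings):
            raise ValueError("digit encodings must share one length")
        self._from_digit = list(digit_encodings)
        self._to_digit = {e: d for d, e in enumerate(digit_encodings)}

    def internalise(self, octets):
        """Build a Natural from its external encoding."""
        if not octets or len(octets) % self._width:
            raise ValueError("not a whole number of digit encodings")
        value = 0
        for i in range(0, len(octets), self._width):
            value = value * 10 + self._to_digit[octets[i:i + self._width]]
        return Natural(value)

    def externalise(self, number):
        """Produce the external encoding of a Natural."""
        digits = []
        value = number.value
        while True:
            value, digit = divmod(value, 10)
            digits.append(self._from_digit[digit])
            if value == 0:
                break
        return b"".join(reversed(digits))
```

A program adding two such values never handles their written form. One codec
might be initialised with the octets 0x30 to 0x39 and another with the UTF-8
encodings of U+0660 to U+0669 (the Arabic-Indic digits), and values read through
either may be combined freely.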
The use of ADTs in this way for compiler lexical token lists and for cultural
service lexical token lists implies that the provider of a computing environment in
which these operate is only specifying encoding in a mapping file which will
represent the concepts concerned. This in turn implies that these may be
rendered in a visible form in any natural language. It seems, therefore, that this
offers the opportunity for a system programmer who is a member of any culture
to prepare mapping files provided only that he/she shares the concepts used in
defining either a programming language or an ADT.
Are Characters Really Necessary?
While the discussion in this paper has so far hardly referred to the word,
characters are necessary although only in a very limited way -- far more limited
than much current thinking about programming would indicate.
A quick review of any average selection of programs written in whatever
programming language will reveal that, with very few exceptions, characters are
currently used for the following :--
- Expressing in visible form in program source some complete or partial
message for export to a human (or other text reading entity) -- eg ''Death in
family!''.
- For use in converting to/from some internal token or value (converting a
number or date, say).
- Either individually or in combination as a substitute syntax token (in much the
same way as programs are currently written, just as some form of textual data).
The relation of these three to comments/data, language tokens and value
construction as specified for programming languages is evident. What, perhaps,
is not so obvious is that these have essentially all been eliminated from
application programs which :--
- Provide for 'messages' to be external as a list of messages in the same order
as some internal enumeration. Naturally this is done to permit the use of
different natural languages and orthographic forms of message. Such
messages then become a program-specific token list and mapping!
- Use culturally shareable ADTs which embody conversion analyser/generator
code as described earlier.
- Merely use encoding pattern matching to determine whether or not some
encoding sequence matches a known syntax token. This uses encodings but
not characters!
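The first of these techniques can be sketched in Python as follows. The
enumeration Message, the separator convention and both message lists are
assumptions made for illustration.

```python
from enum import IntEnum


class Message(IntEnum):
    """Hypothetical internal enumeration of a program's messages."""
    FILE_NOT_FOUND = 0
    DISK_FULL = 1


def load_messages(octets, separator):
    """Split an external message list; entry k belongs to Message(k)."""
    entries = octets.split(separator)
    if len(entries) != len(Message):
        raise ValueError("message list does not match the enumeration")
    return entries


def emit(message, catalogue):
    """Return the encoding to export for a message; it is never inspected."""
    return catalogue[message]
```

The program refers only to Message.DISK_FULL. Whether the catalogue was loaded
from a list prepared by an English or a French speaker makes no difference to
the program text.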
The only point where characters are actually needed, therefore, is as visible
forms (where an encoding uses a suitable rendering engine) for human
interpretation/generation. This could occur, for example, where a human user is
preparing a list of messages for some program to use. Note that the list is
prepared using the visible marks on a display, say -- but that what is really
wanted is the encoding of the message. The using program merely uses the
encoding, not the characters.
In essence, programming is not a task which involves characters. In restricted
circumstances detailed manipulation of character encodings may be required,
but this is almost exclusively restricted to the use and preparation of mappings
to lexical tokens, tokens which have some representation internal to a using
program which is not the concern of any external entity.
Words and Strings
It is proposed, as a final part of the interim solution using ADTs suggested
earlier, that two very special ADTs be made mandatory in every program
execution environment -- Word and String, where a String is defined as a
sequence of words separated by white space encodings.
From the point of view of the human user, the ability of a computer to organise
visible text in accordance with some culturally appropriate rules is of paramount
importance. The organised list of names, etc is a very important feature of many
human activities.
Strangely enough, these two ADTs (and others which may be derived from
them) are the only ones which necessarily use the concept of character as
understood by the human user. The need is to divide the encoding stream into
sub-components which, when applied to a rendering engine, result in the
generation of a single identifiably separate mark representing what is called a
character. Once this has been done, then in the semantics of the operations on
these objects, which include such things as ordering, equality/inequality testing,
etc the human concept of character is needed.
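As a rough Python sketch of these two ADTs, assume that every character
encoding occupies the same number of octets and that the culture supplies both
its white space encodings and a collation list. The function names and the
representation of a Word as a tuple of character encodings are hypothetical.

```python
def split_characters(octets, width):
    """Divide an encoding stream into single-character encodings."""
    if len(octets) % width:
        raise ValueError("stream is not a whole number of characters")
    return tuple(octets[i:i + width] for i in range(0, len(octets), width))


def make_string(octets, width, white_space):
    """Build a String: a list of Words separated by white space encodings."""
    words, current = [], []
    for character in split_characters(octets, width):
        if character in white_space:
            if current:
                words.append(tuple(current))
                current = []
        else:
            current.append(character)
    if current:
        words.append(tuple(current))
    return words


def sort_words(words, collation):
    """Order Words by a culture-supplied list of character encodings."""
    rank = {character: r for r, character in enumerate(collation)}
    return sorted(words, key=lambda word: [rank[c] for c in word])
```

The ordering is wholly determined by the collation list supplied, so two
cultures sharing one encoding may still organise the same list of names
differently.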
Summary
The paper has identified an urgent need for the expansion of information
technology to all nations and cultures. As the result of some New Zealand
research into translator portability, it has been possible to propose a generic
mechanism to extend the way in which the lexis of programming language
standards is defined to improve the cultural sensitivity of implementations. A
similar approach to the definition of shared cultural concepts permits greater
cultural independence of application programs written using existing
programming languages. The key components of this are :--
- A named list of lexical tokens for a language.
- A locally-defined mapping between representation encodings and the items in
this list.
- Where separate translation of program components is permitted, then a tool to
map between different cultural encodings of names defined for use in another
component may be needed.
- Application of the above principles in the design and implementation of ADTs
which are shared by more than one culture.
Acknowledgments
Many thanks are owed to those kind people who have listened to our often
tentative explanations of embryo ideas -- D Andrews, A la Bonte, R Hicks, H
Jespersen, RJ Mathis, PJ Plauger, K Pronk, P Rabin, K Simonsen, M
Woodman, D Wong. Much of the credit for the ideas expressed is due to them;
any errors are ours.
Bibliography
[Bar] Barbour RH, Cunningham S-J & Ford G: Maori word-processing for
indigenous New Zealand young children, British Journal of Educational
Technology, (2), p114-124, 1993
[ECMA] European Computer Manufacturers Association: Portable Common
Tools Environment -- Abstract Specification, ECMA-149, June 1993
[ISO-a] ISO/IEC: Information Technology -- Programming Languages, their
environments and system software interfaces -- Programming Language
Extended APL, CD 13751 draft, 26 Aug 1993
[ISO-b] ISO/IEC: Information Technology -- Programming Languages, their
environments and system software interfaces -- Programming Language Ada,
CD 8652:1993 draft, 16 Sep 1993
[ISO-c] ISO/IEC: Information Technology -- Programming Languages, their
environments and system software interfaces -- Programming Language
Modula-2, DIS 10514 draft, 31 Jan 1994
[ISO-d] ISO/IEC: Information Technology -- Universal Multiple-Octet Coded
Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane,
ISO/IEC 10646-1, 1 May 1993
[ISO-e] ISO/IEC: Information Technology -- Programming languages and their
environments and system software interfaces -- Vienna Development
Method/Specification language, Part 1 -- Base language, CD 13817-1 draft, Dec
1993
Annex A
Design Guidelines
The following paragraphs contain suggested guidelines which, if adopted by the
designers of programming languages and implementers of translators, should
lead to the benefits described in this paper.
It is important to recognise that all tools which provide abstractions of human
concepts are responsible for translating the values embodied in those concepts
between the form in which they may be uttered/absorbed by a human and some
internal concretisation of the abstraction concerned. In this respect a
programming language is one of the most complex abstractions currently
handled by computer software tools. As such it requires greater effort in
providing a tool -- but that tool must operate on the same principles as that
which merely recognises a s
Language Design
The designer of a programming language which has a specified form of
concrete representation should consider the following :--
- What are the legal tokens required by the language? These do not include
user-defined tokens (which may be called names or identifiers) -- which are
defined in a language dependent manner in terms of the legal tokens in this list
and other previously defined tokens in the program unit being translated.
- What are the needs of the language for determining a meaning from some
primitive token structure in order to produce a composed language token? For
example, what is the set of character tokens used in presenting a numeric value
to a language translator? Is this a sub-language to be treated by a separate
recogniser?
- What subordinate (sub-)languages are used in the definition of the
programming language itself? For example does some input/output formatting
sub-language also need definition?
- For the programming language itself and for each sub-language as defined
above, produce an enumeration of all of the tokens which are either a legal
token or needed to compose a legal token. Define this as the lexis for the
language.
Language Implementation
The implementer of a programming language should consider treating each of
the sub-languages defined (eg I/O formatting) as a separate abstraction to be
handled in the way defined below for data abstractions in general.
The implementation of the lexical analyser should be initialised by reading some
lexical token mapping file as defined in Annex B following and then using that for
look-up purposes in determining the lexical tokens needed for syntax analysis.
Where some part of the input stream is determined to be data for interpretation
by some other sub-language abstraction, then the raw encodings should be
passed to that as required without modification.
It should be noted that the look-up tables provided in this way are very similar to
the lexicon or dictionary provided for natural language users.
Data Abstraction Design
It is most important that the concept for which an abstraction is being designed
is fully and unambiguously defined. There is, however, little point in designing an
abstraction around a concept which is not shared by at least two cultures. It is,
for example, important to distinguish between the concept of time as a
countable series of regularly occurring events (eg clock ticks) and time as a
sequence of events which are ordered but not regular -- these two are entirely
different concepts of time -- both equally valid, just different abstract concepts.
Once a concept has been fully defined, this definition should be standardised.
The grammar (syntax) used for humanly visible/audible forms of utterance of
values of the abstraction will differ from culture to culture and a registry of such
grammars should be set up and maintained. It is important that all such
grammars include a
Employing the same mapping file principle as used for programming languages,
individual cultures are then free to prepare appropriate mappings as may be
required.
Annex B
Proposed Lexical Mapping File Format
The format described below for the contents of a mapping file to provide the
lexical independence described is based upon the fact that all standard
character encodings (for example those defined in ISO/IEC 10646-1 [ISO-d])
occupy an integral number of octets.
The lexical mapping file contains a file header followed by one or more mapping
entries as described in the following sections. There shall be as many entries as
there are lexical tokens specified in the lexis of the language for which the
map-file has been created. The effect of there being any lesser number of
entries is tool-dependent.
File header
Apart from the first octet in this header, two encodings are defined. These two
are specified so that they may not form part of any external encoding which the
file defines. They are only used by the translation tool which converts this
mapping file into a lexical look-up facility. As with the external encoding entries
defined after the header, their meaning is determined by their location in the
mapping file.
- First octet -- this shall be an unsigned binary numeric value which specifies the
number of octets of which all other entries in the file are to be a multiple. This is
called size in the following sub-paragraphs.
- The next size octets -- this shall be an encoding for a separator token which is
not used elsewhere in the mapping file as a component of any other external
encoding for a token.
- The next size octets -- this shall be an encoding for a token to be known as an
'alternate token'. The encoding for this alternate token shall not be used
anywhere else in the mapping file as a token with any other meaning than the
following. The alternate token shall indicate that for the entry (defined below) in
which it appears, the external encoding following it is a valid alternative external
encoding to the one(s) preceding it.
- The next size octets -- a separator token.
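
A reader for this header might look like the sketch below. The names
MapHeader and read_header are hypothetical, and rejecting a size of zero is
an assumption made here; the text above does not say how that case should be
treated.

```python
from typing import NamedTuple


class MapHeader(NamedTuple):
    size: int          # value of the first octet
    separator: bytes   # encoding of the separator token
    alternate: bytes   # encoding of the alternate token


def read_header(data: bytes) -> tuple:
    """Parse the file header; return (header, offset of the first entry)."""
    if not data:
        raise ValueError("empty mapping file")
    size = data[0]
    if size == 0:
        # Assumption: a zero size cannot describe any encoding.
        raise ValueError("size octet must be non-zero")
    end = 1 + 3 * size
    if len(data) < end:
        raise ValueError("truncated header")
    separator = data[1:1 + size]
    alternate = data[1 + size:1 + 2 * size]
    closing = data[1 + 2 * size:end]
    if separator == alternate:
        raise ValueError("separator and alternate tokens must differ")
    if closing != separator:
        raise ValueError("header must be closed by the separator token")
    return MapHeader(size, separator, alternate), end
```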
File Entries
Each entry in the file consists of one or more encoding sequences followed by a
separator. Where there is more than one sequence in an entry then each
sequence is separated from a following one by an alternate token as defined in
the file header. Each encoding sequence must be a multiple of size octets.
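
As a rough illustration, the entry area can be split into entries by walking
it in units of size octets. The function name read_entries and its parameters
are hypothetical; the separator and alternate encodings are assumed to have
been taken from the file header already.

```python
def read_entries(body: bytes, size: int, separator: bytes, alternate: bytes) -> list:
    """Split the entry area into entries, each a list of alternative encodings."""
    if len(body) % size != 0:
        raise ValueError("entry area is not a multiple of size octets")
    entries, alternatives, current = [], [], b""
    for start in range(0, len(body), size):
        unit = body[start:start + size]
        if unit == separator or unit == alternate:
            if not current:
                raise ValueError("empty encoding sequence")
            alternatives.append(current)   # one encoding sequence is complete
            current = b""
            if unit == separator:          # the separator closes the entry
                entries.append(alternatives)
                alternatives = []
        else:
            current += unit
    if current or alternatives:
        raise ValueError("last entry is not closed by a separator")
    return entries
```

The position of an entry in the returned list identifies its token: entry 0
belongs to the first token of the lexis, entry 1 to the second, and so on.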
Formal Definition
The following abstract syntax (in the form specified by CD 13817 [ISO-e])
formally defines the required file format in such a way that it may be made
concrete by implementing the data structures defined as streams of binary digits
where the abstract type Binary-digit is implemented as a single bit.
types
Map-file ::
header : Map-header
entries : Map-entry+
Map-header ::
size : Encoding-sizes
separator : Map-code
alternate : Map-code
terminator : Map-code
inv mk-Map-header(size,sep,alt,term) =^
(sep = term) &
(sep # alt) &
(sep = SEPARATOR) &
(alt = ALTERNATE) &
(size = OCTET-COUNT)
Encoding-sizes =
N
inv sz : Encoding-sizes =^
let bin : Octet be st (bin = sz)
Map-code =
Octet+
inv mp : Map-code =^
len mp = OCTET-COUNT
Octet =
Binary-digit*
inv oct : Octet =^
len oct = 8
Binary-digit = (SET | CLEAR)
values
SEPARATOR : Map-code = <File-dependent>
ALTERNATE : Map-code = <File-dependent>
OCTET-COUNT : Encoding-sizes = <File-dependent>
Annotate: These three values, which must, of course, conform to the invariants
not only in their own type definitions but also in the type Map-header, are
otherwise arbitrarily chosen according to the practical needs of the encodings
being defined in the mapping file.
types
Map-entry ::
encode : Encoding
alternates : [Option*]
term : Map-code
inv mk-Map-entry(enc,alt,term) =^
(forall elem <= dom inds enc . ((enc(elem) # SEPARATOR) &
(enc(elem) # ALTERNATE))) &
(forall opt <= dom inds alt . let mk-Option(_,val) = alt(opt) in
forall elem <= dom inds val . ((val(elem) # SEPARATOR) &
(val(elem) # ALTERNATE))) &
(term = SEPARATOR)
Encoding = Map-code+
Option ::
all-sep : Map-code
val : Encoding
inv mk-Option(alt,_) =^
(alt = ALTERNATE)
Annex C
Example Language Lexical Token Enumeration
The following listing of lexical tokens for the Modula-2 language is merely an
exemplar, chosen because the source material was readily available to the authors.
Following the listing there is some discussion about the way in which the entries
in the encoding file could be made.
types
Modula-lexeme =
Digit-0 | Digit-1 | Digit-2 | Digit-3 |
Digit-4 | Digit-5 | Digit-6 | Digit-7 |
Digit-8 | Digit-9 | Digit-10 | Digit-11 |
Digit-12 | Digit-13 | Digit-14 | Digit-15 |
Hex-number-mark | Octal-number-mark | Character-mark |
Exponent-mark | Single-quote | Double-quote |
White-space | Comment-start | Comment-end |
Source-code-directive-start | Source-code-directive-end | Colon |
Comma | Ellipsis | Equals | Period | Semicolon |
Left-parenthesis | Right-parenthesis | Left-bracket | Right-bracket |
Left-brace | Right-brace | Assignment-operator | Plus-operator |
Minus-operator | Logical-disjunction |
Multiplication-operator | Division-operator |
Logical-conjunction | Logical-negation | Inequality | Less-than |
Greater-than | Less-than-or-equal |
Greater-than-or-equal | Dereferencing |
AND-SY | ARRAY-SY | BEGIN-SY | BY-SY |
CASE-SY | CONST-SY | DEFINITION-SY | DIV-SY |
DO-SY | ELSE-SY | ELSIF-SY | END-SY |
EXIT-SY | EXCEPT-SY | EXPORT-SY | FINALLY-SY |
FOR-SY | FORWARD-SY | FROM-SY | IF-SY |
IMPLEMENTATION-SY | IMPORT-SY | IN-SY | LOOP-SY |
MOD-SY | MODULE-SY | NOT-SY | OF-SY |
OR-SY | PACKEDSET-SY | POINTER-SY | PROCEDURE-SY |
QUALIFIED-SY | RECORD-SY | REM-SY | RETRY-SY |
REPEAT-SY | RETURN-SY | SET-SY | THEN-SY |
TO-SY | TYPE-SY | UNTIL-SY | VAR-SY |
WHILE-SY | WITH-SY |
ABS-Ident | BITSET-Ident |
BOOLEAN-Ident | CARDINAL-Ident | CAP-Ident | CHR-Ident |
CHAR-Ident | COMPLEX-Ident | CMPLX-Ident | DEC-Ident |
DISPOSE-Ident | EXCL-Ident | FALSE-Ident | FLOAT-Ident |
HALT-Ident | HIGH-Ident | IM-Ident | INC-Ident |
INCL-Ident | INT-Ident | INTERRUPTIBLE-Ident | INTEGER-Ident |
LENGTH-Ident | LFLOAT-Ident |
LONGCOMPLEX-Ident | LONGREAL-Ident |
MAX-Ident | MIN-Ident | NEW-Ident | NIL-Ident |
ODD-Ident | ORD-Ident | PROC-Ident | PROTECTION-Ident |
RE-Ident | REAL-Ident | SIZE-Ident | TRUE-Ident |
TRUNC-Ident | UNINTERRUPTIBLE-Ident | VAL-Ident
Remarks
These are the hundred and thirty or so lexical tokens of the Modula-2
programming language. Some of them are permitted more than one encoding
by the nature of the language; for example, the visible marks '#' and '<>' are, as
currently defined, valid alternatives in an 'English' representation of the
Inequality lexical token.
White-space too has several possibilities including a space (' '), a horizontal tab
and a line mark (however that may be indicated/detected in a particular
operating system).
Note that in an 'English' version of a mapping file there are a couple of
potentially awkward problems as the letter 'C' is used as a Character-mark as
well as Digit-12. Such duplication is, however, peculiar to a particular form of
visible representation which is always context dependent in a composition.
Disambiguation of the context dependency is a simple problem within the value
composing routine.
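
One way such a value composing routine could resolve the ambiguity is
sketched below. It assumes the conventional Modula-2 literal forms, in which
a trailing 'H' marks a hexadecimal number, a trailing 'B' an octal number and
a trailing 'C' a character given by its octal code; the function name compose
is hypothetical.

```python
HEX_DIGITS = "0123456789ABCDEF"   # position n is the visible mark for Digit-n


def compose(literal: str):
    """Compose a value from an 'English' whole-number or character literal."""
    if not literal:
        raise ValueError("empty literal")
    mark, body = literal[-1], literal[:-1]
    if mark == "H":        # Hex-number-mark: a 'C' inside the body is Digit-12
        base, as_char = 16, False
    elif mark == "B":      # a trailing 'B' is the Octal-number-mark
        base, as_char = 8, False
    elif mark == "C":      # a trailing 'C' is the Character-mark
        base, as_char = 8, True
    else:                  # no mark at all: a plain decimal number
        base, as_char, body = 10, False, literal
    if not body:
        raise ValueError("literal has no digits")
    value = 0
    for ch in body:
        digit = HEX_DIGITS.find(ch)
        if digit < 0 or digit >= base:
            raise ValueError(f"{ch!r} is not a base-{base} digit")
        value = value * base + digit
    return chr(value) if as_char else value
```

Here compose("0CH") reads 'C' as Digit-12 and yields 12, while compose("15C")
reads 'C' as the Character-mark and yields the character with octal code 15.
The position of the mark within the composition is enough to decide between
the two readings.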
The way in which the enumeration has been defined is designed to show that,
for example, Arabic or Chinese digit encodings (as well as Roman digits, etc)
could be used in an appropriate culture (and made visible by an appropriate
local rendering engine) for Digit-0, Digit-1, etc.
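
The point can be illustrated with a small sketch in which two different sets
of visible marks are mapped onto the same Digit-0 to Digit-9 tokens. The names
digit_lookup and compose_decimal are hypothetical, and the Arabic-Indic digits
U+0660 to U+0669 are used purely as an example of a second mapping.

```python
WESTERN = "0123456789"
# Arabic-Indic digits, U+0660 .. U+0669, built from code points for clarity.
ARABIC_INDIC = "".join(chr(0x0660 + n) for n in range(10))


def digit_lookup(marks: str) -> dict:
    """Map each visible mark to n, the index of its Digit-n token."""
    return {mark: n for n, mark in enumerate(marks)}


def compose_decimal(text: str, lookup: dict) -> int:
    """Compose a decimal value from marks, most significant digit first."""
    value = 0
    for mark in text:
        value = value * 10 + lookup[mark]
    return value
```

Only the look-up table differs between the two cultures; compose_decimal works
on the Digit-n tokens and never inspects the marks themselves.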
[Ed. note: This article was submitted for publication in the ModulaTor by the
authors on 10-Mar-94.]
________________________________________________________________
IMPRESSUM: The ModulaTor is an unrefereed journal. Technical papers are to be
taken as working papers and personal rather than organizational statements.
Items are printed at the discretion of the Editor based upon his judgement on
the interest and relevancy to the readership. Letters, announcements, and
other items of professional interest are selected on the same basis. The Editor
of The ModulaTor is Guenter Dotzel.