Natural Language Processing of Integer Values

2011-01-03

I just pushed my most recent changes to NaturalNum - a python library for natural language representation of integer values. E.g. usage:

$ python example.py 123456 en_GB [‘one’, ‘hundred’, ‘and’, ‘twenty’, ‘three’, ‘thousand’, ‘four’, ‘hundred’,‘and’, ‘fifty’, ‘six’] $ python example.py 123456 fr_FR [‘cent’, ‘vingt’, ‘trois’, ‘mille’, ‘quatre’, ‘cent’, ‘cinquante’, ‘six’]

Currently, only English and French are supported, for values up to hundreds of thousands. More languages will be added as inspiration strikes. The library can be downloaded from github. I stress that the implementation is a Proof of Concept, and is not a shining example of best practices. Really, it took far too much work to parse and validate the rules. I didn’t want to write a full-blown DSL as I thought the requirements were fairly simple. In future, I would not attempt this kind of thing with manually hand-crafted code. There is heavy use of regexps for validation, and it is probably quite hard to understand what the code is doing. Pyparsing could perhaps yield an alternative implementation. NaturalNum NaturalNum is a python module for easy conversion of numeric values to natural language, with full internationalization. E.g. “2100” can be mapped to “two thousand one hundred”, “deux mille cent”, “2.wav,1000.wav,1.wav,100.wav”, or any other representation, according to rules-based configuration. Quick Start The script example.py provides an example usage of the library, allowing command line evaluation of natural language. E.g.:

$ python example.py Usage: example.py $ python example.py 123456 fr_FR [‘cent’, ‘vingt’, ‘trois’, ‘mille’, ‘quatre’, ‘cent’, ‘cinquante’, ‘six’] $ python example.py 123456 en_GB [‘one’, ‘hundred’, ‘and’, ‘twenty’, ‘three’, ‘thousand’, ‘four’, ‘hundred’,‘and’, ‘fifty’, ‘six’]

The locale code must match a file in the config directory.
example.py can be used as a starting point to use the library any way you want.
If new language rules are required (or amendments are required for existing ones), browse the language config files, and read about how rules are set up below.

Motivation Generation of natural language - even just for numbers - can be complex. Generally, in natural languages a number is represented as a series of words drawn from a relatively small vocabulary (e.g. ‘one’, ‘two’, …), together with some multipliers (e.g. ‘hundred’, ‘thousand’) and linking words (e.g. ‘and’). For example, when learning how to count, you perhaps might start learning how to say one to ten. Then it is relatively easy to get as far as one hundred, by learning the words for twenty, thirty, etc., and combining them with what you already know. Then it is even easier to go into the hundreds and thousands once you know how to group numbers by place-value and combine them with their multiplier. Based on this understanding, a first attempt in converting numeric values to natural language would be to create a lookup table for each word in the vocabulary, then write an algorithm to stitch the words together based on some logic. At first this seems a perfectly adequate approach. In fact it probably would be the best choice for a single language, but as soon as internationalization is attempted the complexity would increase drastically. This is due to differences in not just the vocabulary across different languages, but also the rules for grouping, multipliers and other tricky corner-cases. For example, in English the number 100 is spoken as “one hundred”. But in French the same number is rendered “cent”. So not only is the vocabulary different, but so is the syntax (there is no need to specify the multiplier of hundreds if it is 1, in French). Additionally, the number 101 is said “one hundred and one” in English, but is simply “cent un” in French, missing the linking word. Another interesting example is to compare how tens and units are combined by English and German. In English, 21 is “twenty one”. But in German it is “einundzwanzig” (“one and twenty”). Furthermore, the German rendering is a single word, whereas the English is two separate words. Further complexities can be found in all languages. Features NaturalNum offers the following:

Arbitrary mapping of numeric values to lists of string tokens, allowing a rich set of natural language representations
Rules-based approach, expressing knowledge of natural languages declaratively. This strategy scales horizontally - the rules for different languages are separately maintained, rather than nested among each other in a single algorithm.
Rules are triggered by pattern-matching. This is a natural choice considering how equivalence classes often depend on the ‘place value’ of digits.
Wildcarding and variable-binding is possible (‘digivars’). This allows generic rules to be created for cases where the natural language representation is based on digits within the input value, e.g. when mapping to resource tokens such as audio filenames.
Rules can resolve a subset of the input number recursively, to allow reuse. This means the size of the rule-set scales very well with an increasing domain of input numbers.

Configuration BASICS Each supported natural language must be configured with rules; sample languages are provided in config/. For a given input numeric value, the rules config for the current language is searched from top-to-bottom. A rule is triggered as soon as the input numeric value matches the pattern on the left-hand-side of each rule. The pattern-matching is based first on the number of digits, and then on the values of the digits themselves. As soon as a match is found no further matching will be attempted. This means rules should be specified from the most specific to the most general. The right-hand-side of each rule is always resolved to a list of tokens representing the natural language representation for the matched numeric input. The left-hand and right-hand sides of each rule are separated by ‘=’. A very simple example rule for English follows (note that ‘#’ indicates the beginning of a comment, which is ignored by the reader):

21=twenty,one ## rule for 21

The list of tokens on the right-hand-side can be anything, so they could be used to identify resources, e.g. audio files containing spoken utterences for playback. A very simple example mapping the digits 0-9 to a corresponding audio filename is given below:

0=0.wav ## an input of “0” is matched, and mapped to the output ## string “0.wav” 1=1.wav ## “1” -> “1.wav” ## (…) 9=9.wav ## “9” -> “9.wav”

So far this does not seem too interesting; it is just a simple lookup table mapping literal numeric values to literal strings. It would not scale, and there would be a lot of repetition. However, there is a way to make these rules more compact. DIGIVARS A digivar is a wildcard character which, when appearing on the left-hand side of a rule will match any single digit in the same position. What makes digivars especially useful is that they act as a variable which is bound to the matched digit, and when referenced on the right-hand-side (preceded by $) the digivar will be replaced with its bound value. Digivars can be mixed freely with literal values on the right-hand side of a rule. For example, we can simplify the rules previously given for 0-9, as a single rule:

u=$u.wav

Any single digit value (0-9) will match this rule. The single digit in the matched numeric value will be bound to the digivar ‘u’ (any alphabetic character could have been chosen, but ‘u’ seems a natural choice for ‘units’). When $u is encountered on the right-hand-side, this will be replaced with the digit matched to digivar ‘u’, followed by ‘.wav’. So ‘1’ would resolve to ‘1.wav’ and so on, up to ‘9.wav’. There are a few restrictions to the use of digivars:

A digivar can only be a single alphabetic character (this makes the left-hand-side value easier to read from a pattern-matching perspective). If several alphabetic characters appear on the left-hand side in sequence, it will be assumed they are seperate single-character digivars.
Digivars must be alphabetic characters (case-sensitive), to distinguish from literal digits.
The same digivar cannot be used more than once on the left-hand-side (you cannot bind the same digivar to more than one digit).
It is perhaps intuitive that digivars are scoped only to the single rule they are defined in. So if the same digivar ‘u’ appears in a number of rules, it is bound in each case to the digit matched in that rule only.

Note that it is not mandatory for a digivar to appear on the right-hand-side. So digivars can be used as straightforward wildcards, although there is no obvious use-case for this. Care must be taken in defining rules in the correct order (most specific to least specific) to get the correct result. The following is a more advanced example of how we might continue the previous examples as far as 0-999 using digivars. Note that for the remainder of this example, we will define a minimal vocabulary of audio filenames to be the following:

0.wav, 1.wav, …, 19.wav
20.wav, 30.wav, up to 90.wav
100.wav (represents “hundred”, not “one hundred”)
and.wav (linking word)

The limited vocabularly means we will have to compose tokens together in sequence on the right-hand side, where necessary. For clarity, we will not specify the file extension (.wav) on the right-hand side of each rule (in practice it would be straightforward for a calling application to append the extension to each of the generated tokens)

## Single Digits (units) u=$u ## Units (e.g. 7 = 7.wav)

## Double Digits (tens and units) t0=$t0 ## Exact tens (e.g. 30 = 30.wav) 1u=1$u ## 11-19 (e.g. 19 = 19.wav) tu=$t0,$u ## Remaining ‘tens and units’ ## (e.g 56 = 50.wav, 6.wav)

## Three Digits (hundreds, tens and units) h00=$h,100 ## Exact hundreds ## (e.g. 300.wav = 3.wav, ## 100.wav) h0u=$h,100,and,$u ## Hundreds and unit with no tens ## (e.g. 709 = 7.wav, 100.wav, ## and.wav, 9.wav) ht0=$h,100,and,$t0 ## Hundreds and exact tens ## (e.g. 850 = 8.wav, 100.wav, ## and.wav, 50.wav) h1u=$h,100,and,1$u ## Hundreds and 11-19 ## (e.g. 218 = 2.wav, 100.wav, ## and.wav, 18.wav) htu=$h,100,and,$t0,$u ## Hundreds and remaining tens ## and units ## (e.g. 546 = 5.wav, 100.wav, ## and.wav, 50.wav, 6.wav)

This is some good progress; we have specified the natural language rules for all numbers 0-999 in 9 rules. However this is not perfect, as there is some repetition. The rules for the single and double digits are duplicated further down in the rules for 3 digits. Without any further tricks, this redundancy would fan out as patterns for more digits are provided. This is where recursive rules come into the picture. RECURSIVE RULES It would be useful if a rule could defer part of its evaluation to another rule. For example, a number of languages have the same rules for hundreds, tens and units, regardless of where they appear in the value. More specifically, in English, we should not have to repeat ourselves when we define how the value “34” is represented when evaluating “34,000”, “134”, or “34,034,034”. The latter case could be decomposed into steps in an informal ‘rule’ as follows: 34,034,034: +- (34) million +- (34) thousand +- and +- (34) In all but one of the steps, the value 34 is placed in brackets to show it is delegated to a separate rule which is already defined for ‘tens and units’. This is very similar to how recursive rules work in NaturalNum. In the example above, the rules for single and double digits were repeated within the rules for hundreds, tens and units. This would fan out into increasing redundancy as values of increasing length are supported. However, we could change the way three digits are resolved, as follows:

## Single Digits (units) u=$u ## Units (e.g. 7 = 7.wav)

## Double Digits (tens and units) t0=$t0 ## Exact tens (e.g. 30 = 30.wav) 1u=1$u ## 11-19 (e.g. 19 = 19.wav) tu=$t0,$u ## Remaining ‘tens and units’ ## (e.g 56 = 50.wav, 6.wav)

## Three Digits (hundreds, tens and units) h00=$h,100 ## Exact hundreds ## (e.g. 300.wav = 3.wav, 100.wav) h0u=$h,100,and,$u ## Hundreds and unit with no tens ## (e.g. 709 = 7.wav, 100.wav, ## and.wav, 9.wav) htu=$h,100,and,($t$u) ## Hundreds and tens and units - ## tens and units are deferred to ## a matching rule ## e.g. 546 = 5.wav, ## 100.wav, ## and.wav, ## (56) ## = 50.wav, 6.wav

The single and double digits are specified as in the previous example. But there are now only three rules for three digits, instead of five. This doesn’t sound like a huge improvement, but as more digits are added the benefits are compounded (especially when the rules for three digits are re-used in 4+ digits). The logic here is as follows. When “546” is resolved, the first rule matched is the final one (matching the pattern “htu”). This resolves to (“5”, “100”, “and”), but the tens and units are parenthesised, so are resolved recursively. This means that the digits matching “$t$u” (“56”) are fed back through the rules, matching the rule with pattern “tu”. This rule would then give the remainder of the result, which is “50”, “6”. Conclusion NaturalNum goes a long way towards solving the problem of natural language representation of integer values. It is not perfect however, as shown by the following issues:

It is still hard to write new rules. Defining rules for a new language requires some study, particularly finding the edge cases and identifying where candidates for re-use exist. Validation is time-consuming, perhaps requiring a native speaker of the language to confirm the details. However, this is the nature of the problem domain.
A fairly large assumption is made that only cardinal, nominative numbers are required to be represented. This is fine for counting, but if we are trying to say ‘six chairs’, then the problem of gender and accusative, dative, etc. comes into play. For example, Polish has a table of variations of the number 1, depending on gender, plurality and cases (nominative, accusative, etc). [1], [2]. This can introduce problems if representing currency amounts, for example.
The project to analyse the suitability of this approach has not progressed beyond Europe (due to limited scope of requirements at the outset). It is not yet known how feasible Asian language systems (for example) would adapt to this scheme. Either a Romanized system or Unicode could be used for the representations, but it is not yet known whether the syntax would lend itself well to the approach taken by NaturalNum.

Further Work NaturalNum could be developed in the following ways:

Add new languages, particularly untested classes of language, e.g. Chinese
Expose as a web service
Extend configuration into the millions, and beyond
Consider alternative implementation using a parser generator
Create further abstractions to represent variations with respect to the gender and class of the related noun. A possible way to capture this:

1<masc,norm>=jeden References 1. http://lightning.prohosting.com/~popolsku/Numerals.htm 2. http://german.speak7.com/german_cases.htm