Look at the following little CCG lexicon:
girls, boys :- NP
dance :- S\NP
love :- (S\NP)/NP
themselves :- (S\NP)\((S\NP)/NP)
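Make sure you understand how you can use the rules of forwards and backwards functional application to generate all and only the following eight sentences:
girls dance
boys dance
girls love boys
girls love girls
boys love girls
boys love boys
girls love themselves
boys love themselves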
For example, the following derivation succeeds, telling us that "girls love themselves" is a valid sentence according to this lexicon:
girls   love        themselves
-----   ---------   ------------------
NP      (S\NP)/NP   (S\NP)\((S\NP)/NP)
        ------------------------------<
                     S\NP
--------------------------------------<
                  S
But the following one fails, since "girls dance themselves" is not a valid sentence according to our CCG lexicon:
girls   dance   themselves
-----   -----   ------------------
NP      S\NP    (S\NP)\((S\NP)/NP)
        --------------------------<
                    *****
Note that the ***** denotes "failure" of functional application.
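The two application rules can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of how OpenCCG itself works: categories are modelled as nested tuples of the form (result, slash, argument), and the names forward, backward and parses are our own hypothetical helpers.

```python
from itertools import product

# Atomic categories are strings; complex ones are (result, slash, argument).
NP, S = "NP", "S"
IV = (S, "\\", NP)        # S\NP
TV = (IV, "/", NP)        # (S\NP)/NP
REFL = (IV, "\\", TV)     # (S\NP)\((S\NP)/NP)

LEXICON = {
    "girls": [NP], "boys": [NP],
    "dance": [IV], "love": [TV], "themselves": [REFL],
}

def forward(left, right):
    """Forward application: X/Y followed by Y gives X."""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None

def backward(left, right):
    """Backward application: Y followed by X\\Y gives X."""
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None

def parses(sentence, goal=S):
    """CKY-style recognition using only the two application rules."""
    words = sentence.split()
    n = len(words)
    chart = {(i, i + 1): set(LEXICON.get(w, [])) for i, w in enumerate(words)}
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            cell = set()
            for mid in range(start + 1, end):
                for l, r in product(chart[(start, mid)], chart[(mid, end)]):
                    for rule in (forward, backward):
                        result = rule(l, r)
                        if result is not None:
                            cell.add(result)
            chart[(start, end)] = cell
    return goal in chart[(0, n)]

print(parses("girls love themselves"))
print(parses("girls dance themselves"))
```

The second call fails for the same reason as the starred derivation above: no rule can combine S\NP with the reflexive pronoun's category.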
We are now going to look at how to implement this little grammar in the OpenCCG XML notation, and how to run and test it. If you don't understand how CCG derivations work, then now would be a good time to consult an introductory guide. For example, this one. Or you could read a more basic guide here, starting from a recap of context-free grammars.
Atomic (i.e. "saturated") CCG categories, like S and NP, are encoded in the OpenCCG XML representation using atomcat elements, as follows:
<atomcat type="S"/>
<atomcat type="NP"/>
The type attribute is obligatory, and it must be of the XML Schema type NMTOKEN, i.e. it must consist of just letters, digits, full stops, hyphens, underscores and colons, with no whitespace.
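If you generate category names programmatically, a quick sanity check along the following lines may be useful. It is a deliberately conservative, ASCII-only approximation of NMTOKEN (the real definition also admits many non-ASCII name characters), and the function name is our own.

```python
import re

# ASCII-only approximation: letters, digits, full stop, underscore, colon, hyphen.
NMTOKEN_ASCII = re.compile(r"[A-Za-z0-9._:-]+")

def looks_like_nmtoken(value):
    """True if value is a non-empty string of ASCII NMTOKEN characters."""
    return NMTOKEN_ASCII.fullmatch(value) is not None

print(looks_like_nmtoken("NP"))
print(looks_like_nmtoken("noun phrase"))
```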
Complex (i.e. "unsaturated") CCG categories, like (S\NP)/NP or (S\NP)\((S\NP)/NP), are encoded in the OpenCCG XML representation using complexcat elements, as follows:
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP"/>
<slash dir="/"/>
<atomcat type="NP"/>
</complexcat>
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP"/>
<slash dir="\"/>
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP"/>
<slash dir="/"/>
<atomcat type="NP"/>
</complexcat>
</complexcat>
Note that the first child of a complexcat element represents the "ultimate result" category, and hence cannot be another complexcat element, but rather must be an atomcat. All subsequent child elements of a complexcat must be "argument pairs" of a slash element followed by either an atomcat element or a complexcat element. What this means is that OpenCCG requires complex categories to be "uncurried", and that the slash operators are left-associative, so that S\NP/NP is read as (S\NP)/NP. Note also that the dir attribute on slash elements is strictly speaking optional, and if present its value must be one of /, \ or | (where | means "either / or \").
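To make the "uncurrying" concrete, here is a small Python sketch that flattens a nested (curried) category into the element layout described above. The tuple representation (result, slash, argument) and the function name to_xml are assumptions made for this illustration; only the element and attribute names come from the OpenCCG notation shown in this section.

```python
import xml.etree.ElementTree as ET

def to_xml(cat):
    """Convert a curried category into a flat atomcat/complexcat element.

    Atomic categories are strings; complex ones are (result, slash, argument).
    """
    if isinstance(cat, str):
        return ET.Element("atomcat", {"type": cat})
    # Walk down the result side, collecting (slash, argument) pairs.
    args = []
    while isinstance(cat, tuple):
        result, slash, arg = cat
        args.append((slash, arg))
        cat = result
    elem = ET.Element("complexcat")
    elem.append(ET.Element("atomcat", {"type": cat}))  # ultimate result first
    for slash, arg in reversed(args):                  # innermost argument first
        elem.append(ET.Element("slash", {"dir": slash}))
        elem.append(to_xml(arg))                       # arguments may be complex
    return elem

IV = ("S", "\\", "NP")      # S\NP
TV = (IV, "/", "NP")        # (S\NP)/NP
REFL = (IV, "\\", TV)       # (S\NP)\((S\NP)/NP)

print(ET.tostring(to_xml(TV), encoding="unicode"))
print(ET.tostring(to_xml(REFL), encoding="unicode"))
```

The only place a nested complexcat can arise is in argument position, which is exactly the restriction stated above.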
When we are writing CCG lexicons in OpenCCG, unfortunately we cannot simply assign words to lexical categories directly, as we did when defining the little CCG lexicon at the start of section 1. Instead we have to work through the intermediary of "lexical families". A family is a set of categories, which are assumed to be related in some way. We associate a word directly with a set of lexical families, and thus indirectly with a set of lexical categories.
Lexical families are encoded in the OpenCCG XML representation using family elements with the following general structure:
<family name="..." pos="...">
<entry name="...">...</entry>
<entry name="...">...</entry>
...
</family>
Every family element has an obligatory name attribute, whose value must be unique within the lexicon. A family also has an obligatory pos attribute (not necessarily unique), which is used to link words in the lexicon to families of categories (hence rendering the name attribute irrelevant?).
A family contains a set of entry elements, each of which consists of exactly one atomcat or complexcat element, i.e. a family specifies a set of related lexical categories. Each entry element has an obligatory name attribute, which must be unique within the family that contains it, although its purpose is unclear. The first entry element in each family conventionally has name="Primary".
Our CCG mini lexicon will contain the following four lexical families (one for each distinct lexical category in the grammar), each of which contains just the one category:
<family name="A" pos="plural_common_noun">
<entry name="Primary">
<atomcat type="NP"/>
</entry>
</family>
<family name="B" pos="intransitive_verb">
<entry name="Primary">
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP"/>
</complexcat>
</entry>
</family>
<family name="C" pos="transitive_verb">
<entry name="Primary">
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP"/>
<slash dir="/"/>
<atomcat type="NP"/>
</complexcat>
</entry>
</family>
<family name="D" pos="reflexive_pronoun">
<entry name="Primary">
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP"/>
<slash dir="\"/>
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP"/>
<slash dir="/"/>
<atomcat type="NP"/>
</complexcat>
</complexcat>
</entry>
</family>
Note that since the family names are irrelevant but obligatory, I have chosen to name them A, B, C, etc. The POS values are more sensible.
A lexical assignment associates a word form with a family of categories. Lexical assignments are implemented in the OpenCCG XML format by means of entry elements, with the following schema:
<entry word="..." pos="..."/>
In other words, an entry element associates an orthographic form (in the obligatory word attribute) with a set of families (and hence a set of categories), by means of the obligatory pos attribute. Basically, the set of lexical categories associated with a word is the "set union" of the lexical families which share the word's pos attribute.
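The pos-based linking can be pictured with the following Python sketch. It is an illustration of the lookup as described in this section, not OpenCCG's own code; the two tiny XML documents and the function name categories_for are made up for the example.

```python
import xml.etree.ElementTree as ET

LEXICON = ET.fromstring(r"""
<ccg-lexicon name="demo">
  <family name="A" pos="plural_common_noun">
    <entry name="Primary"><atomcat type="NP"/></entry>
  </family>
  <family name="B" pos="intransitive_verb">
    <entry name="Primary">
      <complexcat>
        <atomcat type="S"/><slash dir="\"/><atomcat type="NP"/>
      </complexcat>
    </entry>
  </family>
</ccg-lexicon>
""")

MORPH = ET.fromstring("""
<morph name="demo">
  <entry word="girls" pos="plural_common_noun"/>
  <entry word="dance" pos="intransitive_verb"/>
</morph>
""")

def categories_for(word, lexicon=LEXICON, morph=MORPH):
    """Collect the categories of every family whose pos matches the word's pos."""
    cats = []
    for entry in morph.findall("entry"):
        if entry.get("word") != word:
            continue
        for family in lexicon.findall("family"):
            if family.get("pos") == entry.get("pos"):
                # Each family entry wraps exactly one atomcat or complexcat.
                cats.extend(e[0] for e in family.findall("entry"))
    return cats

print([c.tag for c in categories_for("girls")])
print([c.tag for c in categories_for("dance")])
```

A word whose pos value matches no family simply ends up with no categories, which is worth remembering when a grammar unexpectedly fails to parse anything.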
It is unfortunate that there are two different kinds of entry element defined in the OpenCCG XML notation - children of family elements, and lexical assignment statements - but we will just have to live with that.
Our little CCG lexicon has the following five lexical entries:
<entry word="girls" pos="plural_common_noun"/>
<entry word="boys" pos="plural_common_noun"/>
<entry word="dance" pos="intransitive_verb"/>
<entry word="love" pos="transitive_verb"/>
<entry word="themselves" pos="reflexive_pronoun"/>
The simplest kind of OpenCCG grammar consists of a directory containing four XML data files: grammar.xml, rules.xml, lexicon.xml and morph.xml.
The grammar.xml file simply lists the other files which make up the grammar. Assuming that our OpenCCG grammar is called "grammar1", since we have to call it something and "grammar1" is as good a name as anything else, it will occupy a directory called grammar1, and the grammar.xml file will look like this:
<grammar name="grammar1">
<rules file="rules.xml"/>
<lexicon file="lexicon.xml"/>
<morphology file="morph.xml"/>
</grammar>
The rules.xml file lists the CCG rules of combination that we want to be used in deriving sentences in our grammar. Since at the moment we are just concerned with functional application rules, this file will look like this:
<rules name="grammar1">
<application dir="forward"/>
<application dir="backward"/>
</rules>
The lexicon.xml file lists the categories that may be referred to in the lexicon, by specifying a list of families. The general schema for this file is as follows, with the details of each family as specified in section 1.3 above:
<ccg-lexicon name="grammar1">
<family name="A" pos="plural_common_noun">...</family>
<family name="B" pos="intransitive_verb">...</family>
<family name="C" pos="transitive_verb">...</family>
<family name="D" pos="reflexive_pronoun">...</family>
</ccg-lexicon>
Finally, the morph.xml file contains the lexicon proper, i.e. the list of words assigned to lexical families. For our little lexicon, this file is as follows, with the five lexical entries defined as in section 1.4 above:
<morph name="grammar1">
<entry word="girls" pos="plural_common_noun"/>
<entry word="boys" pos="plural_common_noun"/>
<entry word="dance" pos="intransitive_verb"/>
<entry word="love" pos="transitive_verb"/>
<entry word="themselves" pos="reflexive_pronoun"/>
</morph>
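Since words and families are linked only through matching pos strings, a typo in either file silently leaves a word without categories. The following Python sketch checks a grammar directory for such mismatches. It assumes the four-file layout described above and reads the file names from grammar.xml; the function name unmatched_pos is our own.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def unmatched_pos(grammar_dir):
    """Return the pos values used in the morphology file but by no family."""
    grammar_dir = Path(grammar_dir)
    grammar = ET.parse(str(grammar_dir / "grammar.xml")).getroot()
    lexicon_file = grammar.find("lexicon").get("file")
    morph_file = grammar.find("morphology").get("file")

    lexicon = ET.parse(str(grammar_dir / lexicon_file)).getroot()
    known = {family.get("pos") for family in lexicon.iter("family")}

    morph = ET.parse(str(grammar_dir / morph_file)).getroot()
    used = {entry.get("pos") for entry in morph.findall("entry")}
    return sorted(used - known)

if __name__ == "__main__":
    # Run from inside a grammar directory such as grammar1.
    if Path("grammar.xml").exists():
        print(unmatched_pos("."))
```

An empty list means that every word's pos value is covered by at least one family.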
We can run and test this OpenCCG grammar by running the tccg command from within the grammar1 grammar directory:
> tccg
Loading grammar from URL: file:.../grammar1/grammar.xml
Grammar 'grammar1' loaded.
Enter strings to parse.
Type ':r' to realize selected reading of previous parse.
Type ':h' for help on display options and ':q' to quit.
You can use the tab key for command completion,
Ctrl-P (prev) and Ctrl-N (next) to access the command history,
and emacs-style control keys to edit the line.
tccg> girls love themselves
1 parse found.
Parse: S
tccg> boys dance girls
Unable to parse
tccg>
This tells us that: (a) our grammar is well-formed, since it loads without any errors; and (b) it accepts the sentence "girls love themselves" but rejects the non-sentence "boys dance girls".
We can use the :derivs command to view the derivation as well:
tccg> :derivs
tccg> girls love themselves
1 parse found.
Parse: S
------------------------------
(lex) girls :- NP
(lex) love :- S\.NP/.NP
(lex) themselves :- S\.NP\.(S\.NP/.NP)
(<) love themselves :- S\.NP
(<) girls love themselves :- S
tccg>:q
Exiting tccg.
>
You can turn the derivations back off again using the :noderivs command.
Finally, we can add another XML data file to our grammar directory, testbed.xml, which contains a list of sentences to be tested, along with information about whether they should be accepted by the grammar or not (i.e. in the value of the numOfParses attribute):
<regression>
<item numOfParses="1" string="girls dance"/>
<item numOfParses="1" string="boys dance"/>
<item numOfParses="1" string="girls love boys"/>
<item numOfParses="1" string="girls love girls"/>
<item numOfParses="1" string="boys love girls"/>
<item numOfParses="1" string="boys love boys"/>
<item numOfParses="1" string="girls love themselves"/>
<item numOfParses="1" string="boys love themselves"/>
<item numOfParses="0" string="themselves dance"/>
<item numOfParses="0" string="girls dance themselves"/>
<item numOfParses="0" string="boys dance girls"/>
. . .
</regression>
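For a grammar this small, the test bed can also be generated rather than typed. The sketch below builds the item elements in Python; the helper name build_testbed is hypothetical, and the list of negative examples is just the three shown above.

```python
import xml.etree.ElementTree as ET

NOUNS = ["girls", "boys"]

# Sentences the grammar should accept: NOUN dance, NOUN love NOUN/themselves.
GOOD = [f"{s} dance" for s in NOUNS] + [
    f"{s} love {o}" for s in NOUNS for o in NOUNS + ["themselves"]
]
# A few strings it should reject.
BAD = ["themselves dance", "girls dance themselves", "boys dance girls"]

def build_testbed(good, bad):
    """Build a regression element with one item per test sentence."""
    root = ET.Element("regression")
    for expected, sentences in (("1", good), ("0", bad)):
        for sentence in sentences:
            ET.SubElement(root, "item",
                          {"numOfParses": expected, "string": sentence})
    return root

testbed = build_testbed(GOOD, BAD)
print(ET.tostring(testbed, encoding="unicode"))
```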
We can run this set of test sentences on the grammar using the ccg-test command:
> ccg-test -norealization
Loading grammar from URL: file:.../grammar1/grammar.xml
Loading: testbed
Parse Realize String
----- ------- ------
ok - girls dance
ok - boys dance
ok - girls love boys
ok - girls love girls
ok - boys love girls
ok - boys love boys
ok - girls love themselves
ok - boys love themselves
ok - *themselves dance
ok - *girls dance themselves
ok - *boys dance girls
. . .
>
We can see from the left-hand column that the grammar passed all the tests.
Look at the following "feature-enhanced" CCG lexicon, where syntactic features have been added to CCG atomic categories:
Using the rules of forwards and backwards functional application, this lexicon allows us to derive all and only the following sentences, where agreement is preserved between subject and verb and between subject and reflexive pronoun:
In this section, we look at how we go about representing this kind of feature-enhanced CCG lexicon in OpenCCG XML notation. If you are unsure about how feature sets work, consult a good introduction to unification-based grammar, such as this one here.
Here are some examples involving feature unification in functional application:
Audrey                loves               herself
-------------------   -----------------   -----------------------------------------
NP[num=sg,gend=fem]   (S\NP[num=sg])/NP   (S\NP[num=sg,gend=fem][1])\((S\NP[1])/NP)
                      -------------------------------------------------------------<
                                          S\NP[num=sg,gend=fem]
-----------------------------------------------------------------------------------<
                                        S
Audrey                loves               themselves
-------------------   -----------------   --------------------------------
NP[num=sg,gend=fem]   (S\NP[num=sg])/NP   (S\NP[num=pl][1])\((S\NP[1])/NP)
                      ----------------------------------------------------<
                                              *****
Note that ***** denotes unification failure, which happens here because the two feature sets [num=sg] and [num=pl] are incompatible.
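For the flat feature sets used in this tutorial, unification can be sketched in Python as follows. This is a simplified illustration (features are atomic attribute/value pairs held in dictionaries), and the function name unify is our own.

```python
def unify(fs1, fs2):
    """Unify two flat feature sets; return None if any attribute clashes."""
    result = dict(fs1)
    for attr, val in fs2.items():
        if attr in result and result[attr] != val:
            return None          # e.g. num=sg against num=pl
        result[attr] = val
    return result

print(unify({"num": "sg", "gend": "fem"}, {"num": "sg"}))
print(unify({"num": "sg"}, {"num": "pl"}))
```

The first call corresponds to "Audrey loves herself", where the information from both sides is merged; the second corresponds to the failed derivation for "Audrey loves themselves".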
In section 1.1, we assumed the following format for representing atomic categories as OpenCCG XML atomcat elements:
<atomcat type="NP"/>
We henceforth elaborate this by allowing an atomcat element to contain an optional fs (standing for "feature set") child element, which itself contains a set of feat elements, each of which represents an attribute/value pair:
<atomcat type="...">
<fs>
<feat attr="..." val="..."/>
<feat attr="..." val="..."/>
...
</fs>
</atomcat>
In this way, a feature-enhanced saturated category like NP[num=sg,gend=fem] can be encoded as follows:
<atomcat type="NP">
<fs>
<feat attr="num" val="sg"/>
<feat attr="gend" val="fem"/>
</fs>
</atomcat>
An fs element can have an optional id attribute, which acts as a variable for structure sharing with other feature sets across a complex category (see the next section for details). The value of this attribute must always be an integer. Thus, atomic categories like NP[num=pl][1] and NP[2] can be represented as follows in OpenCCG XML:
<atomcat type="NP">
<fs id="1">
<feat attr="num" val="pl"/>
</fs>
</atomcat>
<atomcat type="NP">
<fs id="2"/>
</atomcat>
Consider the lexical category for the female reflexive pronoun "herself": (S\NP[num=sg,gend=fem][1])\((S\NP[1])/NP). Here, we have one feature set shared between two different saturated categories, indicated by the box variable [1]. This category is represented in OpenCCG XML as follows, using the same id value on both fs elements:
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs id="1">
<feat attr="num" val="sg"/>
<feat attr="gend" val="fem"/>
</fs>
</atomcat>
<slash dir="\"/>
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs id="1"/>
</atomcat>
<slash dir="/"/>
<atomcat type="NP"/>
</complexcat>
</complexcat>
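One way to picture what the shared id means is to collect, for each id, all the features declared on any fs element carrying it. The Python sketch below does this for the "herself" category; the function name shared_features is hypothetical, and the sketch simply merges features without checking for clashes.

```python
import xml.etree.ElementTree as ET

HERSELF = ET.fromstring(r"""
<complexcat>
  <atomcat type="S"/>
  <slash dir="\"/>
  <atomcat type="NP">
    <fs id="1">
      <feat attr="num" val="sg"/>
      <feat attr="gend" val="fem"/>
    </fs>
  </atomcat>
  <slash dir="\"/>
  <complexcat>
    <atomcat type="S"/>
    <slash dir="\"/>
    <atomcat type="NP">
      <fs id="1"/>
    </atomcat>
    <slash dir="/"/>
    <atomcat type="NP"/>
  </complexcat>
</complexcat>
""")

def shared_features(category):
    """Map each fs id to the features declared on any fs with that id."""
    merged = {}
    for fs in category.iter("fs"):
        if fs.get("id") is None:
            continue
        feats = merged.setdefault(fs.get("id"), {})
        for feat in fs.findall("feat"):
            feats[feat.get("attr")] = feat.get("val")
    return merged

print(shared_features(HERSELF))
```

Both NP slots marked with id 1 are thus understood to carry the same num and gend values, even though the features are written out only once.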
Now that we have seen how to encode feature-enhanced atomic categories in OpenCCG XML, and how to enforce coindexation across a complex category, we turn to defining a complete OpenCCG grammar, which for convenience we will call grammar2, encoding the CCG lexicon listed at the very start of section 2.
We assume the obvious grammar.xml and rules.xml files, and focus on the structure of lexicon.xml and morph.xml.
The feature-enhanced CCG lexicon has ten distinct lexical categories, each of which will need to be converted into a lexical family in the lexicon.xml file:
<ccg-lexicon name="grammar2">
<family name="A" pos="plural_common_noun">
<entry name="Primary">
<atomcat type="NP">
<fs>
<feat attr="num" val="pl"/>
</fs>
</atomcat>
</entry>
</family>
<family name="B" pos="masculine_proper_noun">
<entry name="Primary">
<atomcat type="NP">
<fs>
<feat attr="num" val="sg"/>
<feat attr="gend" val="masc"/>
</fs>
</atomcat>
</entry>
</family>
<family name="C" pos="feminine_proper_noun">
<entry name="Primary">
<atomcat type="NP">
<fs>
<feat attr="num" val="sg"/>
<feat attr="gend" val="fem"/>
</fs>
</atomcat>
</entry>
</family>
<family name="D" pos="plural_intransitive_verb">
<entry name="Primary">
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs>
<feat attr="num" val="pl"/>
</fs>
</atomcat>
</complexcat>
</entry>
</family>
<family name="E" pos="singular_intransitive_verb">
<entry name="Primary">
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs>
<feat attr="num" val="sg"/>
</fs>
</atomcat>
</complexcat>
</entry>
</family>
<family name="F" pos="plural_transitive_verb">
<entry name="Primary">
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs>
<feat attr="num" val="pl"/>
</fs>
</atomcat>
<slash dir="/"/>
<atomcat type="NP"/>
</complexcat>
</entry>
</family>
<family name="G" pos="singular_transitive_verb">
<entry name="Primary">
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs>
<feat attr="num" val="sg"/>
</fs>
</atomcat>
<slash dir="/"/>
<atomcat type="NP"/>
</complexcat>
</entry>
</family>
<family name="H" pos="plural_reflexive_pronoun">
<entry name="Primary">
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs id="1">
<feat attr="num" val="pl"/>
</fs>
</atomcat>
<slash dir="\"/>
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs id="1"/>
</atomcat>
<slash dir="/"/>
<atomcat type="NP"/>
</complexcat>
</complexcat>
</entry>
</family>
<family name="I" pos="masculine_reflexive_pronoun">
<entry name="Primary">
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs id="1">
<feat attr="num" val="sg"/>
<feat attr="gend" val="masc"/>
</fs>
</atomcat>
<slash dir="\"/>
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs id="1"/>
</atomcat>
<slash dir="/"/>
<atomcat type="NP"/>
</complexcat>
</complexcat>
</entry>
</family>
<family name="J" pos="feminine_reflexive_pronoun">
<entry name="Primary">
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs id="1">
<feat attr="num" val="sg"/>
<feat attr="gend" val="fem"/>
</fs>
</atomcat>
<slash dir="\"/>
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs id="1"/>
</atomcat>
<slash dir="/"/>
<atomcat type="NP"/>
</complexcat>
</complexcat>
</entry>
</family>
</ccg-lexicon>
Once we have the lexical families all defined, the specification of the lexical entries in the morph.xml file is simplicity itself:
<morph name="grammar2">
<entry word="girls" pos="plural_common_noun"/>
<entry word="boys" pos="plural_common_noun"/>
<entry word="Audrey" pos="feminine_proper_noun"/>
<entry word="Hunter" pos="masculine_proper_noun"/>
<entry word="dance" pos="plural_intransitive_verb"/>
<entry word="dances" pos="singular_intransitive_verb"/>
<entry word="love" pos="plural_transitive_verb"/>
<entry word="loves" pos="singular_transitive_verb"/>
<entry word="themselves" pos="plural_reflexive_pronoun"/>
<entry word="himself" pos="masculine_reflexive_pronoun"/>
<entry word="herself" pos="feminine_reflexive_pronoun"/>
</morph>
Again, the pos attribute links each lexical entry to a set of relevant lexical categories in the lexicon.xml file.
We run the grammar2 grammar just as we did with grammar1, this time using the :feats option to view feature sets in the derivation (note that you can turn this off using the :nofeats command):
> tccg
Loading grammar from URL: file:.../grammar2/grammar.xml
Grammar 'grammar2' loaded.
Enter strings to parse.
Type ':r' to realize selected reading of previous parse.
Type ':h' for help on display options and ':q' to quit.
You can use the tab key for command completion,
Ctrl-P (prev) and Ctrl-N (next) to access the command history,
and emacs-style control keys to edit the line.
tccg> :derivs
tccg> :feats
tccg> Audrey loves herself
1 parse found.
Parse: S
------------------------------
(lex) Audrey :- N{gend=fem, num=sg}
(lex) loves :- S\.N{num=sg}/.N
(lex) herself :- S\.N<1>{gend=fem, num=sg}\.(S\.N<1>/.N)
(<) loves herself :- S\.N<1>{gend=fem, num=sg}
(<) Audrey loves herself :- S
tccg> Audrey love herself
Unable to parse
tccg> Audrey loves himself
Unable to parse
tccg> boys love themselves
1 parse found.
Parse: S
------------------------------
(lex) boys :- N{num=pl}
(lex) love :- S\.N{num=pl}/.N
(lex) themselves :- S\.N<1>{num=pl}\.(S\.N<1>/.N)
(<) love themselves :- S\.N<1>{num=pl}
(<) boys love themselves :- S
tccg>
We can also run the appropriate test set:
> ccg-test -norealization
Loading grammar from URL: file:.../grammar2/grammar.xml
Loading: testbed
Parse Realize String
----- ------- ------
ok - girls dance
ok - boys dance
ok - Audrey dances
ok - Hunter dances
ok - *girls dances
ok - *boys dances
ok - *Audrey dance
ok - *Hunter dance
. . .
Here is the 'feature-enhanced' CCG lexicon we have developed, encoded and tested in section 2:
In order to encode this in OpenCCG XML, we had to create a lexicon.xml file listing TEN distinct lexical families, one for each different lexical category. However, many of these families shared a lot of information. In this section, we look at how we can create a more elegant distribution of information in an OpenCCG XML lexicon, using "macros" to eliminate redundancy in the specification of lexical families.
Our feature-enhanced lexicon contains the following lexical entries for common nouns and proper nouns:
In section 2.3, we encoded these in OpenCCG XML by defining THREE distinct lexical families - plural_common_noun, feminine_proper_noun, and masculine_proper_noun:
<family name="A" pos="plural_common_noun">
<entry name="Primary">
<atomcat type="NP">
<fs>
<feat attr="num" val="pl"/>
</fs>
</atomcat>
</entry>
</family>
<family name="B" pos="masculine_proper_noun">
<entry name="Primary">
<atomcat type="NP">
<fs>
<feat attr="num" val="sg"/>
<feat attr="gend" val="masc"/>
</fs>
</atomcat>
</entry>
</family>
<family name="C" pos="feminine_proper_noun">
<entry name="Primary">
<atomcat type="NP">
<fs>
<feat attr="num" val="sg"/>
<feat attr="gend" val="fem"/>
</fs>
</atomcat>
</entry>
</family>
Doing things in this way, by defining a new lexical family for every distinct lexical category, is a very inefficient manner of building a large lexicon, basically because the same information will be specified many times over in the lexicon.xml file. We can counteract this problem by "factorising" our lexical categories into category "spines" and feature set "macros", specifying these independently in the lexicon, and reusing them as necessary.
Let's start with the category spines. All three lexical families for common and proper nouns have in common that they are atomic categories of type NP. We can represent this common spine as NP[...], or in OpenCCG XML as the following family:
<family name="A" pos="noun">
<entry name="Primary">
<atomcat type="NP">
<fs id="1"/>
</atomcat>
</entry>
</family>
Note that the feature set associated with the saturated category in this family has been left empty, but has been assigned the identifier "1". This identifier will be used in macro definitions to refer to this particular feature set, and 'decorate' it with features.
We then define four macros (in the morph.xml file) to add features to the category spine defined in the noun family:
<macro name="@singular">
<fs id="1">
<feat attr="num" val="sg"/>
</fs>
</macro>
<macro name="@plural">
<fs id="1">
<feat attr="num" val="pl"/>
</fs>
</macro>
<macro name="@feminine">
<fs id="1">
<feat attr="gend" val="fem"/>
</fs>
</macro>
<macro name="@masculine">
<fs id="1">
<feat attr="gend" val="masc"/>
</fs>
</macro>
Every macro needs to have a name, and that name needs to start with @, for system-internal reasons. A macro essentially defines a named feature set - for example, the first macro defined above assigns the name @singular to the feature set [num=sg]. The feature set defined in a macro also needs to have an identifier, which allows it to refer to feature sets introduced in lexical families. Thus, the lexical family noun interacts with the @singular macro to yield the following "expanded" category:
<atomcat type="NP">
<fs id="1">
<feat attr="num" val="sg"/>
</fs>
</atomcat>
Note that this expansion is only possible because both the family and the macro use the same identifier for the feature sets to be combined. Thus, the feature set identifiers act as global variables across the entire CCG grammar.
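The expansion step can be sketched in Python as follows. This is our own illustration of the idea, not OpenCCG's implementation: the function name expand is hypothetical, and the sketch simply copies each macro's feat elements into every fs of the category that has the same id.

```python
import copy
import xml.etree.ElementTree as ET

NOUN = ET.fromstring('<atomcat type="NP"><fs id="1"/></atomcat>')
SINGULAR = ET.fromstring(
    '<macro name="@singular"><fs id="1"><feat attr="num" val="sg"/></fs></macro>')
FEMININE = ET.fromstring(
    '<macro name="@feminine"><fs id="1"><feat attr="gend" val="fem"/></fs></macro>')

def expand(category, macros):
    """Return a copy of category with each macro's features added by fs id."""
    cat = copy.deepcopy(category)          # leave the family's spine untouched
    for macro in macros:
        for macro_fs in macro.findall("fs"):
            for fs in list(cat.iter("fs")):
                if fs.get("id") == macro_fs.get("id"):
                    fs.extend([copy.deepcopy(f) for f in macro_fs.findall("feat")])
    return cat

audrey = expand(NOUN, [SINGULAR, FEMININE])
feats = {f.get("attr"): f.get("val") for f in audrey.find("fs")}
print(feats)
```

Because matching is done purely on the id value, a macro written for nouns applies equally to any other family that uses the same id for the relevant feature set.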
Lexical families are specified in the lexicon.xml file, and macros in morph.xml. Lexical entries now need to specify both the pos and a list of macros as well:
<entry word="girls" pos="noun" macros="@plural"/>
<entry word="boys" pos="noun" macros="@plural"/>
<entry word="Audrey" pos="noun" macros="@singular @feminine"/>
<entry word="Hunter" pos="noun" macros="@singular @masculine"/>
Thus, we have been able to reduce the number of lexical families for nouns from three to just the one, by adding four feature macros to the lexicon.
Our feature-enhanced lexicon from section 2 contains the following lexical entries for transitive verbs:
In section 2.3, we encoded these in OpenCCG XML using two distinct lexical families - plural_transitive_verb and singular_transitive_verb:
<family name="F" pos="plural_transitive_verb">
<entry name="Primary">
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs>
<feat attr="num" val="pl"/>
</fs>
</atomcat>
<slash dir="/"/>
<atomcat type="NP"/>
</complexcat>
</entry>
</family>
<family name="G" pos="singular_transitive_verb">
<entry name="Primary">
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs>
<feat attr="num" val="sg"/>
</fs>
</atomcat>
<slash dir="/"/>
<atomcat type="NP"/>
</complexcat>
</entry>
</family>
Again, the lexical families for singular and plural transitive verbs can be factorised into a syntactic spine and feature macros. The following family encodes the syntactic spine, generalising both previous families:
<family name="C" pos="transitive_verb">
<entry name="Primary">
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs id="1"/>
</atomcat>
<slash dir="/"/>
<atomcat type="NP"/>
</complexcat>
</entry>
</family>
When combined with the @singular and @plural macros defined in section 3.1, this family gets "fleshed out" with the addition of the appropriate features. The relevant lexical entries are thus as follows:
<entry word="love" pos="transitive_verb" macros="@plural"/>
<entry word="loves" pos="transitive_verb" macros="@singular"/>
So, we are able to collapse two lexical families for transitive verbs into just the one, using macros we had already defined for nouns. We can do similar things for intransitive verbs and reflexive pronouns.
Let's bring all this together into a complete OpenCCG XML grammar, this time called grammar3 for convenience. We assume the usual grammar.xml and rules.xml files. Here are the contents of the lexicon.xml file, containing just the syntactic spines of the lexical categories:
<ccg-lexicon name="grammar3">
<family name="A" pos="noun">
<entry name="Primary">
<atomcat type="NP">
<fs id="1"/>
</atomcat>
</entry>
</family>
<family name="B" pos="intransitive_verb">
<entry name="Primary">
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs id="1"/>
</atomcat>
</complexcat>
</entry>
</family>
<family name="C" pos="transitive_verb">
<entry name="Primary">
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs id="1"/>
</atomcat>
<slash dir="/"/>
<atomcat type="NP"/>
</complexcat>
</entry>
</family>
<family name="D" pos="reflexive_pronoun">
<entry name="Primary">
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs id="1"/>
</atomcat>
<slash dir="\"/>
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs id="1"/>
</atomcat>
<slash dir="/"/>
<atomcat type="NP"/>
</complexcat>
</complexcat>
</entry>
</family>
</ccg-lexicon>
Note that we now have just four lexical families, down from ten before we started to use macros.
Finally, here are the contents of the morph.xml file, specifying both the lexical entries themselves and the various feature macros:
<morph name="grammar3">
<entry word="girls" pos="noun" macros="@plural"/>
<entry word="boys" pos="noun" macros="@plural"/>
<entry word="Audrey" pos="noun" macros="@singular @feminine"/>
<entry word="Hunter" pos="noun" macros="@singular @masculine"/>
<entry word="dance" pos="intransitive_verb" macros="@plural"/>
<entry word="dances" pos="intransitive_verb" macros="@singular"/>
<entry word="love" pos="transitive_verb" macros="@plural"/>
<entry word="loves" pos="transitive_verb" macros="@singular"/>
<entry word="themselves" pos="reflexive_pronoun" macros="@plural"/>
<entry word="herself" pos="reflexive_pronoun" macros="@singular @feminine"/>
<entry word="himself" pos="reflexive_pronoun" macros="@singular @masculine"/>
<macro name="@singular">
<fs id="1">
<feat attr="num" val="sg"/>
</fs>
</macro>
<macro name="@plural">
<fs id="1">
<feat attr="num" val="pl"/>
</fs>
</macro>
<macro name="@feminine">
<fs id="1">
<feat attr="gend" val="fem"/>
</fs>
</macro>
<macro name="@masculine">
<fs id="1">
<feat attr="gend" val="masc"/>
</fs>
</macro>
</morph>
> tccg
Loading grammar from URL: file:.../grammar3/grammar.xml
Grammar 'grammar3' loaded.
Enter strings to parse.
Type ':r' to realize selected reading of previous parse.
Type ':h' for help on display options and ':q' to quit.
You can use the tab key for command completion,
Ctrl-P (prev) and Ctrl-N (next) to access the command history,
and emacs-style control keys to edit the line.
tccg> :derivs
tccg> :feats
tccg> Audrey loves herself
1 parse found.
Parse: S
------------------------------
(lex) Audrey :- N<1>{gend=fem, num=sg}
(lex) loves :- S\.N<2>{num=sg}/.N
(lex) herself :- S\.N<3>{gend=fem, num=sg}\.(S\.N<3>{gend=fem, num=sg}/.N)
(<) loves herself :- S\.N<5>{gend=fem, num=sg}
(<) Audrey loves herself :- S
tccg> Audrey love herself
Unable to parse
tccg> :q
Exiting tccg.
We can use the same test set as before:
> ccg-test -norealization ../grammar2/testbed.xml
Loading grammar from URL: file:.../grammar3/grammar.xml
Loading: testbed
Parse Realize String
----- ------- ------
ok - girls dance
ok - boys dance
ok - Audrey dances
ok - Hunter dances
ok - *girls dances
ok - *boys dances
ok - *Audrey dance
ok - *Hunter dance
. . .
Here is the feature-enhanced categorial grammar we have been playing with for the last couple of sections (ignoring reflexive pronouns for the moment), this time with semantics attached, in the form of elementary predications of hybrid logic dependency structure:
We can use this lexicon to perform derivations as follows, where the set of elementary predications is extended monotonically, as the derivation progresses:
girls            love                   Hunter
--------------   --------------------   ---------------------
NP[num=pl]x      (Se\NP[num=pl]x)/NPy   NP[num=sg,gend=masc]y
@x group         @e love                @y Hunter
@x <of> z        @e <sbj> x
@z girl          @e <obj> y
                 -------------------------------------------->
                               Se\NP[num=pl]x
                               @e love, @e <sbj> x,
                               @e <obj> y, @y Hunter
-------------------------------------------------------------<
                             Se
      @e love, @e <sbj> x, @x group, @x <of> z, @z girl,
      @e <obj> y, @y Hunter
Note that what is happening here is that, when functional application is used to combine two adjacent words or phrases into a new, composite phrase, the elementary predications associated with the new phrase are simply the set union of those associated with each of the two inputs (assuming, of course, that the relevant referential indices have been 'unified').
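The monotonic growth of the semantics is easy to mimic with Python sets. In this sketch, each elementary predication is a tuple, and we assume that index unification has already happened, i.e. the same variable names are used on both sides; the names below are our own.

```python
# (variable, label) is a node predication; (variable, relation, variable) an arc.
GIRLS = {("x", "group"), ("x", "of", "z"), ("z", "girl")}
LOVE = {("e", "love"), ("e", "sbj", "x"), ("e", "obj", "y")}
HUNTER = {("y", "Hunter")}

def combine(left, right):
    """Semantics of a derived phrase: the union of the semantics of its parts."""
    return left | right

verb_phrase = combine(LOVE, HUNTER)        # love Hunter
sentence = combine(GIRLS, verb_phrase)     # girls love Hunter

print(len(verb_phrase), len(sentence))
```

Nothing is ever removed or rewritten along the way, which is what "monotonically" means here.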
If you are unclear how hybrid logic dependency semantics works, then you need to read sections 2 and 3 of this paper here.
In this section, we discuss how to encode semantic representations in OpenCCG XML.
Elementary predications consisting of a node label predicate and a single variable identifier, for example "@x love" or "@y Audrey", are represented as follows, using the satop and prop elements:
<satop nomvar="X">
<prop name="love"/>
</satop>
<satop nomvar="Y">
<prop name="Audrey"/>
</satop>
Elementary predications consisting of an arc label relation and two variable identifiers, e.g. "@x <sbj> y" or "@w <obj> z", are represented as follows using satop, diamond and nomvar elements:
<satop nomvar="X">
<diamond mode="sbj">
<nomvar name="Y"/>
</diamond>
</satop>
<satop nomvar="W">
<diamond mode="obj">
<nomvar name="Z"/>
</diamond>
</satop>
Finally, consider the following set of elementary predications:
@x want, @x <sbj> y, @y Audrey, @x <obj> z,
@z love, @z <sbj> y, @z <obj> w, @w Hunter
This can be represented in OpenCCG XML as the following sequence of eight satop elements:
<satop nomvar="X">
<prop name="want"/>
</satop>
<satop nomvar="X">
<diamond mode="sbj">
<nomvar name="Y"/>
</diamond>
</satop>
<satop nomvar="Y">
<prop name="Audrey"/>
</satop>
<satop nomvar="X">
<diamond mode="obj">
<nomvar name="Z"/>
</diamond>
</satop>
<satop nomvar="Z">
<prop name="love"/>
</satop>
<satop nomvar="Z">
<diamond mode="sbj">
<nomvar name="Y"/>
</diamond>
</satop>
<satop nomvar="Z">
<diamond mode="obj">
<nomvar name="W"/>
</diamond>
</satop>
<satop nomvar="W">
<prop name="Hunter"/>
</satop>
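The mapping from elementary predications to satop elements is mechanical, as the following Python sketch suggests. The tuple encoding and the function name satop are assumptions made for this illustration; the element and attribute names are those shown above.

```python
import xml.etree.ElementTree as ET

def satop(pred):
    """Build a satop element from (var, label) or (var, relation, var2)."""
    elem = ET.Element("satop", {"nomvar": pred[0].upper()})
    if len(pred) == 2:
        # Node label predicate, e.g. @x want
        ET.SubElement(elem, "prop", {"name": pred[1]})
    else:
        # Arc label relation, e.g. @x <sbj> y
        diamond = ET.SubElement(elem, "diamond", {"mode": pred[1]})
        ET.SubElement(diamond, "nomvar", {"name": pred[2].upper()})
    return elem

PREDICATIONS = [
    ("x", "want"), ("x", "sbj", "y"), ("y", "Audrey"), ("x", "obj", "z"),
    ("z", "love"), ("z", "sbj", "y"), ("z", "obj", "w"), ("w", "Hunter"),
]

for pred in PREDICATIONS:
    print(ET.tostring(satop(pred), encoding="unicode"))
```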
The next step is to look at how we can add semantic representations to the OpenCCG grammar files, starting with the lexicon.xml file. The previous grammar, grammar3, which we developed earlier using both feature sets and macros, had three families listed in its lexicon.xml file (ignoring the reflexive pronouns for now). One for nouns:
<family name="A" pos="noun">
<entry name="Primary">
<atomcat type="NP">
<fs id="1"/>
</atomcat>
</entry>
</family>
One for intransitive verbs:
<family name="B" pos="intransitive_verb">
<entry name="Primary">
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs id="1"/>
</atomcat>
</complexcat>
</entry>
</family>
And finally one for transitive verbs:
<family name="C" pos="transitive_verb">
<entry name="Primary">
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="NP">
<fs id="1"/>
</atomcat>
<slash dir="/"/>
<atomcat type="NP"/>
</complexcat>
</entry>
</family>
Recall that the id attributes on the fs ('feature set') elements are used to link between the lexical families and macros, thus adding number and gender information.
The first thing we have to do is add a referential index to each saturated category, i.e. to each atomcat element. The only way that OpenCCG allows us to do this is to add an index feature to the category's feature set, whose value is an lf element incorporating a nomvar element. In other words, a referential index looks like this:
<feat attr="index">
<lf>
<nomvar name="X"/>
</lf>
</feat>
Note that the name of the index is, by convention, an uppercase letter.
So, after adding referential indices to saturated categories, the noun family looks like this:
<family name="A" pos="noun">
<entry name="Primary">
<atomcat type="NP">
<fs id="1">
<feat attr="index">
<lf>
<nomvar name="X"/>
</lf>
</feat>
</fs>
</atomcat>
</entry>
</family>
The intransitive verb family looks like this:
<family name="B" pos="intransitive_verb">
<entry name="Primary">
<complexcat>
<atomcat type="S">
<fs>
<feat attr="index">
<lf>
<nomvar name="E"/>
</lf>
</feat>
</fs>
</atomcat>
<slash dir="\"/>
<atomcat type="NP">
<fs id="1">
<feat attr="index">
<lf>
<nomvar name="X"/>
</lf>
</feat>
</fs>
</atomcat>
</complexcat>
</entry>
</family>
And the transitive verb family looks like this:
<family name="C" pos="transitive_verb">
<entry name="Primary">
<complexcat>
<atomcat type="S">
<fs>
<feat attr="index">
<lf>
<nomvar name="E"/>
</lf>
</feat>
</fs>
</atomcat>
<slash dir="\"/>
<atomcat type="NP">
<fs id="1">
<feat attr="index">
<lf>
<nomvar name="X"/>
</lf>
</feat>
</fs>
</atomcat>
<slash dir="/"/>
<atomcat type="NP">
<fs>
<feat attr="index">
<lf>
<nomvar name="Y"/>
</lf>
</feat>
</fs>
</atomcat>
</complexcat>
</entry>
</family>
Next, we add a set of elementary predications to each entry of each family, specifying those aspects of the semantics which all lexical entries in the family have in common. To do so, we add an lf element to every "top-level" complexcat or atomcat element inside an entry element.
Let's start with intransitive verbs. The lexicon we are implementing here has two intransitive verbs listed:
Since the elementary predications are the same for both entries, we can just add them directly into the specification of the family, using the keyword [*DEFAULT*] to abstract over the particular kind of action involved:
<family name="B" pos="intransitive_verb">
<entry name="Primary">
<complexcat>
<atomcat type="S">
<fs>
<feat attr="index">
<lf>
<nomvar name="E"/>
</lf>
</feat>
</fs>
</atomcat>
<slash dir="\"/>
<atomcat type="NP">
<fs id="1">
<feat attr="index">
<lf>
<nomvar name="X"/>
</lf>
</feat>
</fs>
</atomcat>
<lf>
<satop nomvar="E">
<prop name="[*DEFAULT*]"/>
</satop>
<satop nomvar="E">
<diamond mode="sbj">
<nomvar name="X"/>
</diamond>
</satop>
</lf>
</complexcat>
</entry>
</family>
To be precise, the value of the [*DEFAULT*] keyword is derived from the stem attribute of the lexical entry itself. If no stem attribute is set, then the value comes from the word attribute. See section 4.3 below for more details.
Turning to transitive verbs, the lexicon we are implementing has two of these:
These can again be added directly to the relevant family as follows:
<family name="C" pos="transitive_verb">
<entry name="Primary">
<complexcat>
<atomcat type="S">
<fs>
<feat attr="index">
<lf>
<nomvar name="E"/>
</lf>
</feat>
</fs>
</atomcat>
<slash dir="\"/>
<atomcat type="NP">
<fs id="1">
<feat attr="index">
<lf>
<nomvar name="X"/>
</lf>
</feat>
</fs>
</atomcat>
<slash dir="/"/>
<atomcat type="NP">
<fs>
<feat attr="index">
<lf>
<nomvar name="Y"/>
</lf>
</feat>
</fs>
</atomcat>
<lf>
<satop nomvar="E">
<prop name="[*DEFAULT*]"/>
</satop>
<satop nomvar="E">
<diamond mode="sbj">
<nomvar name="X"/>
</diamond>
</satop>
<satop nomvar="E">
<diamond mode="obj">
<nomvar name="Y"/>
</diamond>
</satop>
</lf>
</complexcat>
</entry>
</family>
What about nouns? The grammar that we are in the process of implementing has four noun entries:
In this case, the lexical semantic representations do not have anything in common. Rather than define two different lexical families to handle this (singular nouns and plural nouns), we will use macros to handle these semantic representations, in section 4.3.
Having added semantic representations to the specifications of the lexical families in section 4.2, we now turn to the morph.xml file, which assigns words to the lexical families. Our previous version of the feature-enhanced categorial grammar contained the following entries (again ignoring reflexive pronouns):
<entry word="girls" pos="noun" macros="@plural"/>
<entry word="boys" pos="noun" macros="@plural"/>
<entry word="Audrey" pos="noun" macros="@singular @feminine"/>
<entry word="Hunter" pos="noun" macros="@singular @masculine"/>
<entry word="dance" pos="intransitive_verb" macros="@plural"/>
<entry word="dances" pos="intransitive_verb" macros="@singular"/>
<entry word="love" pos="transitive_verb" macros="@plural"/>
<entry word="loves" pos="transitive_verb" macros="@singular"/>
Recall from section 4.2 that the verbal families themselves contain all the elementary predications required to build up the full lexical entry for a transitive or intransitive verb. All that we need to add to the lexical entries is a stem attribute, for those verb forms which are inflectional variants of a basic, citation form:
<entry word="dance" pos="intransitive_verb" macros="@plural"/>
<entry word="dances" pos="intransitive_verb" macros="@singular" stem="dance"/>
<entry word="love" pos="transitive_verb" macros="@plural"/>
<entry word="loves" pos="transitive_verb" macros="@singular" stem="love"/>
The stem information interacts with the elementary predications specified in the appropriate family, by substituting for the [*DEFAULT*] keyword, as discussed in section 4.2. Thus, the semantic representation for a verb like "dances" will assign the associated event to the same semantic type as the verb "dance". If no stem attribute is specified, then the value of the word attribute is used instead.
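To make the substitution concrete, here is a sketch of what the lf of the intransitive verb family amounts to for the entry "dances", once [*DEFAULT*] has been replaced by the value of stem="dance". This fragment is purely illustrative: OpenCCG performs the substitution internally when it loads the grammar, and nothing like this needs to be written into the grammar files.

```xml
<!-- Illustration only: the family's lf as instantiated for word="dances", stem="dance" -->
<lf>
  <satop nomvar="E">
    <!-- [*DEFAULT*] replaced by the stem value -->
    <prop name="dance"/>
  </satop>
  <satop nomvar="E">
    <diamond mode="sbj">
      <nomvar name="X"/>
    </diamond>
  </satop>
</lf>
```

The entry for "dance" has no stem attribute, so its word attribute supplies the same value, and both verb forms end up with identical elementary predications.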
The lexical entries specified here thus get "fleshed out" with two sources of additional information:
Recall from section 4.2.2 that the lexical family that nouns are assigned to, i.e. noun, does not contain any elementary predications, so we need to use macros to add these. We assume two semantic macros, one for plural nouns (@group) and one for singular nouns (@individual):
<entry word="girls" pos="noun" macros="@plural @group" stem="girl"/>
<entry word="boys" pos="noun" macros="@plural @group" stem="boy"/>
<entry word="Audrey" pos="noun" macros="@singular @feminine @individual"/>
<entry word="Hunter" pos="noun" macros="@singular @masculine @individual"/>
The semantic macros themselves are defined as follows, again using the keyword [*DEFAULT*] to refer to the value of the stem attribute (or the word attribute if there is no stem):
<macro name="@individual">
<lf>
<satop nomvar="X">
<prop name="[*DEFAULT*]"/>
</satop>
</lf>
</macro>
<macro name="@group">
<lf>
<satop nomvar="X">
<prop name="group"/>
</satop>
<satop nomvar="X">
<diamond mode="of">
<nomvar name="Y"/>
</diamond>
</satop>
<satop nomvar="Y">
<prop name="[*DEFAULT*]"/>
</satop>
</lf>
</macro>
As with verbs, the lexical entries for nouns get fleshed out with information from both the specified macros and the associated lexical family.
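As a rough sketch of the combined result, the entry for "girls" can be thought of as the noun family's category with the elementary predications of the @group macro attached, and [*DEFAULT*] replaced by stem="girl". This is an illustration under that assumption, not something OpenCCG prints or expects you to write; the features contributed by @plural are left as a comment, since their definitions are not repeated in this section.

```xml
<!-- Illustration only: the noun family category as fleshed out for word="girls", stem="girl" -->
<atomcat type="NP">
  <fs id="1">
    <feat attr="index">
      <lf>
        <nomvar name="X"/>
      </lf>
    </feat>
    <!-- features contributed by the @plural macro would also appear here -->
  </fs>
  <lf>
    <!-- contributed by the @group macro -->
    <satop nomvar="X">
      <prop name="group"/>
    </satop>
    <satop nomvar="X">
      <diamond mode="of">
        <nomvar name="Y"/>
      </diamond>
    </satop>
    <satop nomvar="Y">
      <!-- [*DEFAULT*] replaced by the stem value -->
      <prop name="girl"/>
    </satop>
  </lf>
</atomcat>
```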
The new version of our OpenCCG grammar, complete with feature sets, macros and semantic representations, is contained inside the grammar4 grammar directory. We run it as follows:
> tccg
Loading grammar from URL: file:.../potomac/grammar.xml
Grammar 'grammar4' loaded.
Enter strings to parse.
Type ':r' to realize selected reading of previous parse.
Type ':h' for help on display options and ':q' to quit.
You can use the tab key for command completion,
Ctrl-P (prev) and Ctrl-N (next) to access the command history,
and emacs-style control keys to edit the line.
tccg> :sem
tccg> girls love Hunter
1 parse found.
Parse: S :
@l1(love ^
<obj>(h1 ^ Hunter) ^
<sbj>(g1 ^ group ^
<of>(g2 ^ girl)))
tccg> :r
[1.000] girls love Hunter :- S : (@g1(group) ^ @g1(<of>g2) ^ @g2(girl)
^ @h1(Hunter) ^ @l1(love) ^ @l1(<obj>h1) ^ @l1(<sbj>g1))
tccg>
Note that we can turn the output semantic representations on and off using :sem and :nosem.
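To relate this output back to the XML encoding of elementary predications, the seven conjuncts printed by :r for "girls love Hunter" can be written as a sequence of satop elements. This is a hand-written sketch using the same notation as earlier in this chapter: the identifiers g1, g2, h1 and l1 are copied from the tccg output and placed in nomvar attributes for illustration only, and OpenCCG's own serialized logical forms may use different element or attribute names.

```xml
<!-- Illustration only: the flat semantics of "girls love Hunter" as satop elements -->
<satop nomvar="g1">
  <prop name="group"/>
</satop>
<satop nomvar="g1">
  <diamond mode="of">
    <nomvar name="g2"/>
  </diamond>
</satop>
<satop nomvar="g2">
  <prop name="girl"/>
</satop>
<satop nomvar="h1">
  <prop name="Hunter"/>
</satop>
<satop nomvar="l1">
  <prop name="love"/>
</satop>
<satop nomvar="l1">
  <diamond mode="obj">
    <nomvar name="h1"/>
  </diamond>
</satop>
<satop nomvar="l1">
  <diamond mode="sbj">
    <nomvar name="g1"/>
  </diamond>
</satop>
```

The first three elements come from the @group macro on "girls", the fourth from the @individual macro on "Hunter", and the last three from the transitive verb family, with the variables E, X and Y unified with l1, g1 and h1 respectively.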
<family name="D" pos="reflexive">
<entry name="Primary">
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="N">
<fs id="1">
<feat attr="index">
<lf>
<nomvar name="X"/>
</lf>
</feat>
</fs>
</atomcat>
<slash dir="\"/>
<complexcat>
<atomcat type="S"/>
<slash dir="\"/>
<atomcat type="N">
<fs id="1"/>
</atomcat>
<slash dir="/"/>
<atomcat type="N">
<fs>
<feat attr="index">
<lf>
<nomvar name="X"/>
</lf>
</feat>
</fs>
</atomcat>
</complexcat>
</complexcat>
</entry>
</family>
<entry word="themselves" pos="reflexive" macros="@plural"/>
<entry word="himself" pos="reflexive" macros="@singular @masculine"/>
<entry word="herself" pos="reflexive" macros="@singular @feminine"/>