Lab 5

Preliminaries

Requirements for this assignment

Make sure you have a baseline test suite corresponding to your lab 4 grammar (the final grammar from the Matrix). Add some (positive and negative) examples of adjectival and adverbial modifiers.
- update your skeleton
- create a profile and parse it (this is the baseline test suite)
Add adjectival and adverbial modifiers.
If your language has agreement between adjectives and head nouns, implement the appropriate lexical rules for adjectives to model this.
Add demonstratives and markers of definiteness (if any).
Refine the rules allowing for optional arguments (argument drop) to reflect the discourse constraints on when arguments can be dropped.
If there's anything simple that you've been waiting for tdl editing to fix, optionally fix it now. (Keyword here is simple!)
Collect a small test corpus and create a testsuite file for it.
Test!
- create a new profile and parse it
- analyse with PyDelphin and compare to the baseline
Write up the phenomena you have analyzed.

Modification

Head-modifier rules

The Matrix distinguishes scopal from intersective modification. We're going to pretend that everything is intersective and just not worry about the scopal guys for now (aside from negative adverbs, if you got one from the customization system).

Create an instance of head-adj-int-phrase, an instance of adj-head-int-phrase, or both, depending on whether you need only prehead modifiers, only posthead modifiers, or both. (You may already have some of these, depending on what you said about negation in the customization system.) To do this, add the following to rules.tdl:
```
head-adj-int := head-adj-int-phrase.
adj-head-int := adj-head-int-phrase.
```
Try parsing a sentence without a modifier, and examine the parse chart. Did the head-adj phrase fire? If so, constrain your non-adj/adv subtypes of head. You can do this by adding the following to your my_language.tdl:
```
+nvcdmo :+ [ MOD < > ].
```
That adds the constraint [MOD < > ] to the type +nvcdmo, which is the supertype of all the head types other than adj and adv. (Notes: You might already have this constraint or a functionally equivalent one if your customized grammar includes adverbial negation. Constraining +nvcdmo isn't necessarily the best way to do this, as you might want to allow nouny or verby things to be modifiers. The alternative is to constrain the relevant lexeme types to have empty MOD values.)
Try parsing the misbehaving sentence again.
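The alternative mentioned in the note above, constraining individual lexeme types rather than +nvcdmo, might look roughly like the following sketch. The type names noun-lex and verb-lex are assumptions (they are typical of customized grammars, but check what your own grammar calls them), and you would leave out any type that you want to be able to act as a modifier.

```
; Sketch only: give specific lexeme types an empty MOD list
; instead of constraining the head supertype +nvcdmo.
; noun-lex and verb-lex are assumed names from your own grammar.
noun-lex :+ [ SYNSEM.LOCAL.CAT.HEAD.MOD < > ].
verb-lex :+ [ SYNSEM.LOCAL.CAT.HEAD.MOD < > ].
```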

Adjectives

Create a type adjective-lex which inherits from basic-adjective-lex. The following type works for English assuming that:
1. We're not worried about predicative adjectives or adjectives taking complements for now.
2. English has both pre-head and post-head modifiers (head-adj and adj-head), but simple adjectives are (almost) always prehead (hence the value of POSTHEAD).
3. We're only dealing with intersective adjectives (as stipulated).
```
adjective-lex := basic-adjective-lex & intersective-mod-lex &
	      norm-ltop-lex-item &
  [ SYNSEM [ LOCAL [ CAT [ HEAD.MOD < [ LOCAL.CAT [ HEAD noun,
                                                    VAL.SPR cons ]]>,
			   VAL [ SPR < >,
				 SUBJ < >,
				 COMPS < >,
				 SPEC < > ],
			   POSTHEAD - ]]]].
```
If adjectives display agreement in your language, you'll be adding that information to the MOD value in the Adjective Agreement section below. For now, leave it underspecified (this will cause your grammar to overgenerate).
Create one or more adjective instances.
Parse sentences with your adjectives, and examine the MRSs. Are the adjective relations being predicated of the right indices?
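As a rough illustration of an adjective instance, an entry in lexicon.tdl might look like the sketch below. The instance name, the STEM value, and the PRED string are made up for English; substitute forms from your own language and follow whatever PRED naming convention your existing entries use.

```
; Sketch only: a hypothetical adjective entry for lexicon.tdl.
; "big" and "_big_a_rel" are illustrative values.
big := adjective-lex &
  [ STEM < "big" >,
    SYNSEM.LKEYS.KEYREL.PRED "_big_a_rel" ].
```

In the MRS for a sentence containing this adjective, the ARG1 of the adjective's relation should be the same variable as the ARG0 of the noun it modifies.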

Adverbs

Create one or more types for adverbs. The following type definition inherits from appropriately-defined Matrix supertypes, and constrains the modified constituent to be verbal.

```
adverb-lex := basic-adverb-lex & intersective-mod-lex &
  [ SYNSEM [ LOCAL [ CAT [ HEAD.MOD < [ LOCAL.CAT.HEAD verb ] >,
                           VAL [ SPR < >,
                                 SUBJ < >,
                                 COMPS < >,
                                 SPEC < > ] ] ] ] ].
```

Try parsing a sentence with an adverb and then generating to see where else the adverb can show up. If your language allows multiple attachment sites for the adverb, admire the results. If it doesn't, or doesn't allow *that* many, constrain them further.
In order to constrain the possible attachment sites for adverbs, you may need to constrain the value of POSTHEAD, or the value of SPR inside MOD or the value of LIGHT inside MOD. ([LIGHT +] picks out lexical Vs, including both transitive and intransitive ones.)
Parse sentences with your adverbs, and examine the MRSs. Are the adverb relations being predicated of the right indices?
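The attachment-site constraints described above could be sketched as subtypes of adverb-lex along the following lines. Both subtype names are hypothetical, and the second one assumes that LIGHT is a top-level feature of the synsem on the MOD list; check matrix.tdl in your grammar before relying on that path.

```
; Sketch only: adverbs that can only follow the verb they modify.
posthead-adverb-lex := adverb-lex &
  [ SYNSEM.LOCAL.CAT.POSTHEAD + ].

; Sketch only: adverbs that attach only to lexical verbs.
; Assumes LIGHT sits directly on the modified synsem.
light-adverb-lex := adverb-lex &
  [ SYNSEM.LOCAL.CAT.HEAD.MOD < [ LIGHT + ] > ].
```

Which of these (if either) is right depends on the word order facts of your language, so test both parsing and generation after adding a constraint.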

Adjective Agreement

To model adjective agreement, you'll probably want to write lexical rules that inflect the adjectives and constrain the features inside the MOD value so that each inflected adjective can only modify the right kind of nouns.

Below is some general information on writing lexical rules. Please also refer to the lexical rules emitted by the customization system. Adjective agreement lexical rules should be of the "add only" type. Note that if you have an apparently uninflected form, you'll need to make sure it goes through a constant lexical rule (no spelling change) which fills in the relevant feature values.

Lexical rules

Pick a supertype for your rule:
- Determine whether your lexical rule needs to change SYNSEM information, or just add to it. (Examples: If the input has a non-empty SPR list and the output has an empty SPR list, that's changing information. If the input has no value specified for CASE and the output is [CASE nom], that's just adding information.)
- Determine whether your lexical rule creates fully inflected forms, or whether there's more inflection you'd like to stack on top of it.
- Rules creating fully inflected forms and only adding information to SYNSEM can inherit from infl-ltow-rule and add-only-no-ccont-rule.
- Rules creating not-yet fully inflected forms and only adding information should inherit from infl-add-only-no-ccont-ltol-rule.
- If your rule needs to change the SYNSEM value, determine which part of SYNSEM is changing (e.g., VAL only, HEAD only, CAT only) and choose an appropriate type out of the types called infl-***-change-only-ltol-rule. Unless you're adding any relations, your rule should also inherit from no-ccont-lex-rule. I expect most lexical rules created for this lab to be of the add-only variety, rather than changing information.
Define a rule type in my_language.tdl which contains all of the information about your rule except the spelling changes. The value of DTR should be specific enough to constrain the rule to only applying to the right type of inputs. The value of SYNSEM should include the primary information contributed by the rule (e.g., the constraints on MOD reflected in the agreement morphology). If you aren't using one of the "add only" types, then you need to be sure that the rest of the SYNSEM value is sufficiently constrained. Here's an example from English (where the value of SYNSEM ends up being very specific since all the information from the daughter is also in the mother):
```
3sg_verb-lex-rule := infl-ltow-rule &
  [ SYNSEM.LOCAL.CAT.VAL.SUBJ < [ LOCAL.CONT.HOOK.INDEX.PNG [ PER third,
							      NUM sg ]] >,
    DTR verb-lex ].
```
If you have multiple rules applying to the same form, constrain the innermost (rightmost prefix or leftmost suffix) to take (some subtype of) lex-item as its DTR. The next one to apply should take the first rule as its DTR, etc. If multiple rules can appear in one slot, define a supertype for them which can be the DTR of the next rule type out.
Define an instance of the rule type in irules.tdl. This instance should give the spelling change subrules on a line beginning with %prefix or %suffix. Assuming you're working from regularized morphophonology, these should be simple concatenation, of the form (* pref) or (* suff).
A slightly more complicated example from English (without regularized morphophonology) follows. After %suffix there is a list of pairs in which the first member matches the input form and the second member describes the output form. * matches the empty string. ! signals a letter-set. More specific subrules go to the right.
```
3sg_verb :=
%suffix (!s !ss) (!ss !ssses) (ss sses)
3sg_verb-lex-rule.
```
And here's the letter set that's used:
```
%(letter-set (!s abcedfghijklmnopqrtuvwxyz))
```
Make sure that your lexical entries give the stem instead of the inflected word (i.e., so that your lexical rule can do the work). Be sure that the lexical type says [INFLECTED -] if the rule is obligatory. (And note that lexical rules for agreement usually are.)
Test your grammar. Does the lexical rule apply to the words it should apply to? Does it apply to words it shouldn't apply to?
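Putting the steps above together for adjective agreement, a pair of rules might look roughly like this sketch. It assumes a language where feminine adjectives take a suffix -a and masculine adjectives are unmarked; the rule names, the GEND feature with values fem and masc, and the use of const-ltow-rule for the unmarked form are all assumptions to check against your own grammar and matrix.tdl.

```
; Sketch only, for my_language.tdl.
; Overt feminine agreement: restricts the modified noun to GEND fem.
fem-adj-lex-rule := infl-ltow-rule & add-only-no-ccont-rule &
  [ SYNSEM.LOCAL.CAT.HEAD.MOD < [ LOCAL.CONT.HOOK.INDEX.PNG.GEND fem ] >,
    DTR adjective-lex ].

; Unmarked masculine form: no spelling change, but still fills in GEND.
masc-adj-lex-rule := const-ltow-rule & add-only-no-ccont-rule &
  [ SYNSEM.LOCAL.CAT.HEAD.MOD < [ LOCAL.CONT.HOOK.INDEX.PNG.GEND masc ] >,
    DTR adjective-lex ].

; Sketch only, for irules.tdl (the spelling-changing rule).
fem-adj :=
%suffix (* -a)
fem-adj-lex-rule.

; Sketch only, for lrules.tdl (the constant rule).
masc-adj := masc-adj-lex-rule.
```

For both rules to be obligatory, adjective-lex would also need to be [INFLECTED -], as described above.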

Demonstratives and definiteness

The basics

We are modeling the cognitive status attributed to discourse referents by particular referring expressions through a pair of features COG-ST and SPECI on ref-ind (the value of INDEX for nouns). Here is our first-pass guess at the cognitive status associated with various types of overt expressions (for dropped arguments, see below):

Marker	COG-ST value	SPECI value
Personal pronoun	activ-or-more	+
Demonstrative article/adjective	activ+fam
Definite article/inflection	uniq+fam+act
Indefinite article/inflection	type-id

If you have any overt personal pronouns, constrain their INDEX values to be [COG-ST activ-or-more, SPECI + ].
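A minimal sketch of that constraint, assuming your pronouns share a lexical type called pronoun-lex (a hypothetical name; use the type in your grammar):

```
; Sketch only: pronoun-lex is an assumed type name.
pronoun-lex :+
  [ SYNSEM.LOCAL.CONT.HOOK.INDEX [ COG-ST activ-or-more,
                                   SPECI + ] ].
```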

If you have any determiners which mark definiteness, have them constrain the COG-ST of their SPEC appropriately. For demonstrative determiners, see below.
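For instance, definite and indefinite articles might be given types like the following sketch. The supertype name determiner-lex and both subtype names are assumptions; the COG-ST values are the ones from the table above.

```
; Sketch only: determiner-lex is an assumed supertype name.
def-determiner-lex := determiner-lex &
  [ SYNSEM.LOCAL.CAT.VAL.SPEC < [ LOCAL.CONT.HOOK.INDEX.COG-ST uniq+fam+act ] > ].

indef-determiner-lex := determiner-lex &
  [ SYNSEM.LOCAL.CAT.VAL.SPEC < [ LOCAL.CONT.HOOK.INDEX.COG-ST type-id ] > ].
```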

If you have any nominal inflections associated with discourse status, implement lexical rules which add them and constrain the COG-ST value appropriately.

Note that in some cases an unmarked form is underspecified, where in others it stands in contrast to a marked form. You should figure out which is the case for any unmarked forms in your language (e.g., bare NPs in a language with determiners, unmarked nouns in a language with definiteness markers), and constrain the unmarked forms appropriately. For bare NPs, the place to do this is the bare NP rule (note that you might have to create separate bare NP rules for pronouns v. common nouns in this case). For definiteness affixes, you'll want a constant-lex-rule that constrains COG-ST, and that is parallel to the inflecting-lex-rule that adds the affix for the overtly marked case.
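For a language with a definiteness affix whose unmarked form contrasts with it, the parallel pair of rules could be sketched as below. The rule names and the DTR type noun-lex are hypothetical, and const-ltow-rule is assumed to be the lexeme-to-word flavor of constant-lex-rule in your copy of the Matrix. If the unmarked form is merely underspecified in your language, the second rule should not constrain COG-ST at all.

```
; Sketch only: overtly marked definite nouns.
def-noun-lex-rule := infl-ltow-rule & add-only-no-ccont-rule &
  [ SYNSEM.LOCAL.CONT.HOOK.INDEX.COG-ST uniq+fam+act,
    DTR noun-lex ].

; Sketch only: unmarked nouns, assuming they contrast with the
; definite form and are therefore indefinite.
indef-noun-lex-rule := const-ltow-rule & add-only-no-ccont-rule &
  [ SYNSEM.LOCAL.CONT.HOOK.INDEX.COG-ST type-id,
    DTR noun-lex ].
```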

Some languages have agreement for definiteness on adjectives. In this case, you'll want to add lexical rules for adjectives that constrain the COG-ST of the item on their MOD list.

Demonstratives

All demonstratives (determiners, adjectives and pronouns [not on the todo list this year]) will share a set of relations which express the proximity to hearer and speaker. We will arrange these relations into a hierarchy so that languages with just a one- or two-way distinction can be more easily mapped to languages with a two- or three-way distinction. In order to do this, we're using types for these PRED values rather than strings. Note the absence of quotation marks. We will treat the demonstrative relations as adjectival relations, no matter how they are introduced (via pronouns, determiners, or quantifiers).

There are (at least) two different types of three-way distinctions. Here are two of them. Let me know if your language isn't modeled by either.

```
demonstrative_a_rel := predsort.
proximal+dem_a_rel := demonstrative_a_rel. ; close to speaker
distal+dem_a_rel := demonstrative_a_rel.   ; away from speaker
remote+dem_a_rel := distal+dem_a_rel.      ; away from speaker and hearer
hearer+dem_a_rel := distal+dem_a_rel.      ; near hearer
```

```
demonstrative_a_rel := predsort.
proximal+dem_a_rel := demonstrative_a_rel. ; close to speaker
distal+dem_a_rel := demonstrative_a_rel.   ; away from speaker
mid+dem_a_rel := distal+dem_a_rel.         ; away, but not very far away
far+dem_a_rel := distal+dem_a_rel.         ; very far away
```

Demonstrative adjectives

Demonstrative adjectives come out as the easy case in this system. They are just like regular adjectives, except that in addition to introducing a relation whose PRED value is one of the subtypes of demonstrative_a_rel defined above, they also constrain the INDEX.COG-ST of their MOD value to be activ+fam.
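A sketch of what this could look like, building on the adjective-lex type from the Adjectives section. The type name dem-adjective-lex, the instance name, and the STEM value are illustrative assumptions.

```
; Sketch only: a type for demonstrative adjectives.
dem-adjective-lex := adjective-lex &
  [ SYNSEM [ LOCAL.CAT.HEAD.MOD < [ LOCAL.CONT.HOOK.INDEX.COG-ST activ+fam ] >,
             LKEYS.KEYREL.PRED demonstrative_a_rel ] ].

; Sketch only: a hypothetical entry for lexicon.tdl.
; The PRED value is a type, so it has no quotation marks.
this-adj := dem-adjective-lex &
  [ STEM < "this" >,
    SYNSEM.LKEYS.KEYREL.PRED proximal+dem_a_rel ].
```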

Demonstrative determiners

Demonstrative determiners introduce two relations. This time, they are introducing the quantifier relation (Let's say "exist_q_rel") and the demonstrative relation. This analysis entails changes to the Matrix core, as basic-determiner-lex assumes just one relation being contributed. Accordingly, we are going to by-pass the current version of basic-determiner-lex and define instead determiner-lex-supertype as follows:

```
determiner-lex-supertype := norm-hook-lex-item & basic-zero-arg &
  [ SYNSEM [ LOCAL [ CAT [ HEAD det,
                           VAL [ SPEC.FIRST.LOCAL.CONT.HOOK [ INDEX #ind,
                                                              LTOP #larg ],
                                 SPR < >,
                                 SUBJ < >,
                                 COMPS < > ] ],
                     CONT.HCONS < ! qeq &
                                    [ HARG #harg,
                                      LARG #larg ] ! > ],
             LKEYS.KEYREL quant-relation &
                          [ ARG0 #ind,
                            RSTR #harg ] ] ].
```

This type should have two subtypes (assuming you have demonstrative determiners as well as others in your language --- otherwise, just incorporate the constraints for demonstrative determiners into the type above).

The subtype for ordinary (non-demonstrative) determiners should add the constraint that the RELS list has exactly one thing on it, by adding the supertype single-rel-lex-item.
The subtype for demonstrative determiners should specify a RELS list with two things on it: the first should have the "exist_q_rel" for its PRED value. (It's already constrained to be a quant-relation because the type norm-hook-lex-item inherited by determiner-lex-supertype identifies the first element of the RELS list with the LKEYS.KEYREL.) The second one should be identified with LKEYS.ALTKEYREL and should be an arg1-ev-relation (the type we use for the relations of intransitive adjectives). The HOOK.INDEX.COG-ST inside the SPEC value should be constrained to activ+fam. Finally, the LBL and ARG1 of the arg1-ev-relation should be identified with the SPEC.FIRST.LOCAL.CONT.HOOK.LTOP and SPEC.FIRST.LOCAL.CONT.HOOK.INDEX of the determiner, respectively. (This will result in the demonstrative adjective relation sharing its handle with the N' the determiner attaches to.)

Make sure your ordinary determiners in the lexicon inherit from the first subtype, and that your demonstrative determiners inherit from the second subtype. Demonstrative determiner lexical entries should constrain their LKEYS.ALTKEYREL.PRED to be an appropriate subtype of demonstrative_a_rel.
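The two subtypes described above might be sketched as follows. This is one possible rendering of the prose, not a tested definition: the subtype names and the lexical entry are hypothetical, and the RELS value is written as a difference list on the assumption that your version of the Matrix uses the same < ! ... ! > notation as the HCONS value in determiner-lex-supertype.

```
; Sketch only: ordinary determiners contribute exactly one relation.
ordinary-determiner-lex := determiner-lex-supertype & single-rel-lex-item.

; Sketch only: demonstrative determiners contribute the quantifier
; relation plus a demonstrative relation (ALTKEYREL) that shares its
; label and ARG1 with the LTOP and INDEX of the specified N'.
demonstrative-determiner-lex := determiner-lex-supertype &
  [ SYNSEM [ LOCAL [ CAT.VAL.SPEC.FIRST.LOCAL.CONT.HOOK [ INDEX #ind & [ COG-ST activ+fam ],
                                                          LTOP #larg ],
                     CONT.RELS < ! [ PRED "exist_q_rel" ],
                                   #altkey ! > ],
             LKEYS.ALTKEYREL #altkey & arg1-ev-relation &
                             [ LBL #larg,
                               ARG1 #ind ] ] ].

; Sketch only: a hypothetical entry for lexicon.tdl.
this-det := demonstrative-determiner-lex &
  [ STEM < "this" >,
    SYNSEM.LKEYS.ALTKEYREL.PRED proximal+dem_a_rel ].
```

After loading, parse a noun phrase with a demonstrative determiner and check that the MRS contains both relations and that the demonstrative relation's ARG1 is the noun's index.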

Optional arguments

The customization system now includes an argument optionality library which we believe to be fairly thorough, regarding the syntax of optional arguments. The goal of this part of this lab (this year!) therefore is to (a) fix up anything that is not quite right in the syntax and (b) try to model the semantics, and in particular, the cognitive status associated with different kinds of dropped arguments. Regarding (a), if the analysis provided by the customization system isn't quite working, email me and we'll discuss how to fix it with tdl editing.

Regarding (b), you need to do the following:

Determine the cognitive status of the different types of dropped arguments in your language. For example, dropped subjects might always be the equivalent of unstressed pronouns, i.e., [COG-ST in-foc], while objects might be [COG-ST activ-or-more] (like the dropped argument of "told" in "I already told you!") and others might be [COG-ST type-id] (like the dropped argument of "eat" in "Did you already eat?"). Languages with object markers might forgo the object markers in the case of [COG-ST type-id] arguments. In addition, the COG-ST of the dropped argument might depend on the verb.
Edit the lexical rules and lexical entries involved in licensing dropped arguments to provide the COG-ST value. Since the same argument might be overtly realized in most cases, rather than constraining the COG-ST directly, use the feature OPT-CS instead. This feature takes the same range of values as COG-ST, and the phrase structure rules that discharge the optional arguments check it for the value to put in COG-ST.
Test examples and examine the MRS to see if the expected COG-ST values are appearing.
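As an illustration of the second step, verb types that differ in the cognitive status of their dropped object could be sketched like this. The type names, including the supertype transitive-verb-lex, are assumptions; use the verb types your grammar already defines.

```
; Sketch only: verbs like "eat", whose dropped object is indefinite.
indef-obj-drop-verb-lex := transitive-verb-lex &
  [ SYNSEM.LOCAL.CAT.VAL.COMPS < [ OPT-CS type-id ] > ].

; Sketch only: verbs like "tell", whose dropped object must be
; recoverable from the discourse.
def-obj-drop-verb-lex := transitive-verb-lex &
  [ SYNSEM.LOCAL.CAT.VAL.COMPS < [ OPT-CS activ-or-more ] > ].
```

Because the constraint is on OPT-CS rather than COG-ST, it has no effect when the object is overtly realized.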

Note that the Matrix currently assumes that dropped subjects are always [COG-ST in-foc]. This may not be true, especially in various impersonal constructions. If it's not true for your language, please let me know.

Test corpus

In order to get a sense of the coverage of our grammars over naturally occurring text, we are going to collect small test corpora. Minimally, these should consist of 10-20 sentences from running text. They could be larger; however, that is not recommended unless:

Your language has a simple enough morphophonology that your grammar is directly targeting surface forms.
You have easy access to large digitized texts (i.e., you don't have to type something in by hand).
1,000 sentences is the maximum practical size for any single [incr tsdb()] skeleton. You could of course split your test corpus over multiple skeletons, but I'd be surprised if anyone got close to 1,000 sentences!

Note also that our grammars won't cover anything without lexicon. If you have access to a digitized lexical resource that you can import lexical items from, you can address this to a certain extent. Otherwise, you'll want to limit your test corpus to a size that you are willing to hand-enter vocabulary for. (If you have access to a Toolbox lexicon for your language, contact me about importing via the customization system.)

For Lab 5, your task is to locate your test corpus (10-20 sentences will be sufficient, more if you want) and format it for [incr tsdb()]. If you have IGT to work with in the first place, it may be convenient to use the make_item.pl script to create the test corpus skeleton. (Note that you want this to be separate from your regular test suite skeleton.) Otherwise, you can use [incr tsdb()]'s own import tool (File | Import | Test items) which expects a plain text file with one item per line. The result of that command is a testsuite profile from which you'll need to copy the item (and relations) file to create a testsuite skeleton.

Check list:

tsdb/skeletons directory should include two subdirectories: one for the test corpus, and one for the test suite.
tsdb/skeletons/Index.lisp should include two lines: one for your test corpus and one for your test suite.
When the Skeletons Root is pointed at your tsdb/skeletons directory, File | Create should show two possibilities (test suite and test corpus).
The items in your test corpus should be in the format (standard orthography or transliteration, morpheme segmented or not) that your grammar expects.

Write up your analyses

For each of the following phenomena, please include the following in your final write up:

A descriptive statement of the facts of your language.
Illustrative IGT examples from your testsuite.
A statement of how you implemented the phenomenon, in terms of types you added/modified and particular tdl constraints. That is, I want to see actual tdl snippets with prose descriptions around them.
If the analysis is not (fully) working, a description of the problems you are encountering.

Adjectival and adverbial modifiers.
Agreement between adjectives and head nouns. (If your language doesn't have this, then just say so.)
Demonstratives and markers of definiteness.
Argument optionality.
Anything else you fixed.

In addition, your write up should include a statement of the current coverage of your grammar over your test suite (using numbers you can get from PyDelphin); and a comparison between your baseline test suite run and your final one for this lab.

Finally, please briefly describe your test corpus, including: where you collected it, how many sentences it contains, and what format (transliterated, etc.) it is in.

Submit your assignment

Create a tarball of your grammar, your tsdb directory, and your write up.
tar czf yourname.tgz yourgrammar
Email the tarball to bond@ieee.org.

Course materials borrow heavily from Linguistics 567: Knowledge Engineering for NLP at the University of Washington. Thanks to Emily Bender for letting us use them.