Grammar Induction using Brill's Automated Parsing Method

INDEX:

  1. Background
  2. Goals
  3. Implementation
  4. Code and Sample Data
  5. Sources



BACKGROUND:

The Relevance of Grammar Induction

  • Grammars are useful in defining complex languages, but often, the more complex the language, the more difficult it becomes to manually produce a working grammar.
  • In applications such as natural language processing the grammars in use are ambiguous and can be difficult to define, whereas parsing the texts by hand is far easier for a human annotator but has the drawback of being extremely time consuming.
  • These difficulties provide a good argument for developing tools to automatically generate grammars from hand-parsed texts.
  • Current methods for grammar induction include a variety of statistical algorithms:
    • Inside Outside Probabilities
    • Maximum Entropy Method
    • Conditional Markov Models
    • Conditional Random Fields
  • Brill's grammar induction algorithm, uses a transformational-learning method.

Brill's Method of Grammar Induction

  • Brill's system uses the idea that a series of structural transformations applied to a very naively parsed grammar will eventually produce a good parse.
  • The system first parses a sentence (or any parsable structure) in a purely left recursive manner, with the endmarking high in the parse
    		Example: ((The (boy (kicked (the ball)))).) 
    	
  • It then repeatedly tests rule/transform combinations againsts a hand-parsed grammar to "learn" a series of transformations that best parse the text
  • The tests involve repeatedly checking the parser's productions againsts the hand-parsed productions using a previously defined error function.
  • The error function used in most tests of the algorithm checks to see if the parser's productions cross productions in the hand parsed version.
    		Example: For a hand parsed (((The big dog)(ate)).) and machine parsed (((The big)(dog ate)).)
    		
    				(dog ate) would register an error while (the big) would not.
    	
  • There are 12 possible transformations:
    • 1-8: (Add | Delete) a (left | right) parenthesis to the (left | right) of a terminal t.
    • 9-12: (Add | Delete) a (left | right) parenthesis between terminals t1 and t2.
  • Each transformation actually involves multiple actions to balance parenthesis.
  • Examples of specific transformational rules:
    	
    			Delete a left paren to the left of terminal 1.
    			
    			Add a right paren to the left of terminal 2.
    		
  • Unknown texts are then naively parsed like the example text, then "corrected" by applying the series of transformations learned from the process described above.
  • This corrected parse then becomes the algorithm's guess for the parse of the unknown text.
  • One of the drawbacks of Brill's algorithm is that it is possible to overtrain it on an ambiguous language. A transformation that results in a small improvement of a single example text may, in fact, result be detrimental in parsing a majority of texts.



  • GOAL: To generate an unambiguous grammar using a hand-parsed example text

    An overview of my specifics goals

    • Using hand-parsed arithmetic expressions, I will use a modification of Brill's parsing method to generate an unambiguous grammar in Chomsky Normal Form.
    • Basing my grammar induction algorithm on Brill's method has the advantages computational simplicity and relative ease of implementation over statistically based methods while maintaining a similar degree of accuracy in deducing grammars.
    • Although this project incorporates Brill's core method of using transformations to parse a text, my algorithm differs in a few important ways:
      1. I have separated the parser from the induction side of the algorithm, so I use Brill's transformations to generate a grammar, rather than to parse a text directly (see "Part 1" above).
      2. Brill's paper deals exclusively with natural language. This project attempts to infer a well-defined unambiguous grammar from hand-parsed texts produced by that grammar. Specifically, I will attempt to infer the basic grammar of an arithmetic expression from hand-parsed examples.
      3. The parser side of my algorithm, like Brill's, parses grammars in Chomsky Normal Form. Unlike Brill's parser, my parser produces all posible parses of a text, given a grammar. I will use these ambiguous parses to measure my success in inferring an unambiguous grammar, and for future analysis of parsing algorithms.
      • The arithmetic expressions that the program will analyze will include the following tokens:
        		([1-9][0-9]*)|0 - the regular expression denoting a number
        		'+' - addition operator
        		'*' - multiplication operator
        		'(' - left parenthesis
        		')' - right parenthesis
        		

      • In the hand-parsed input to my algorithm, parentheses will be used to denote a gramatical production. However, the language that I wish to deduce also includes parentheses as individual tokens.
      • To resolve this conflict hand-parses use backslashes as escape characters ("\(" or "\)") to denote actual parentheses in arithmetic expressions.
      • Here are some examples of hand-parsed expressions:
        		"(1+(2*3))"
        		"((1+2)+3)"
        		"((\((1+2)\))*3)"
        	
      • The grammar for these arithmetic expressions appears below in Chomsky Normal Form:
        		E ::= EQ | TN | a number | CD
        		Q ::= PT
        		P ::= '+'
        		T ::= TN | a number | CD
        		N ::= MF
        		M ::= '*'
        		F ::= a number | CD
        		C ::= '('
        		D ::= EB
        		B ::= ')'
        	
        	
    • The grammar above is what I wish to deduce using my modified version of Brill's algorithm. It will also be used to generate example arithmetic expressions that will be fed into the induction engine.



    IMPLEMENTATION

    An overview of the project

  • The project is organized into four main pieces: a program that generates random productions given a grammar, an induction algorithm, a program that translates the parse trees produced by the induction algorithm into CNF grammars, and a parser.
  • I have thus far referred to hand-parsed strings as the input to Brill's algorithm. In actuality, the application that generates grammar productions produces parse trees and strings of terminals which then fed directly into the induction engine, skipping the step of generating a text string and then having to parse that string.
    • The entire process involves the following steps:
      1. A string of terminals and an accurate parse tree for those terminals are generated.
      2. The string is naively parsed.
      3. Transformations based on specific rules are applied step by step to the naive parse, and the transformation is chosen that most closely matches the actual parse tree.
      4. After each transformation is chosen, the grammar generator generates a new set of rules from the parse tree. These rule supercede the rule generated from the previous transform.
      5. Transformations are applied until there are no possible transforms that produce a tree that more closely matches the actual parse.
      6. The final grammar is fed with the string of non-terminals into the parser, which produces a list of actions.
    • Ideally the actions would be evaluated to determine how closely the results match the proper evaluation of the arithmetic expression, but I'm still not sure how to assign an action to a given grammar production.
  • Part 1: The generator

    • Given a grammar as input, the program generates a random production in the language that the grammar describes.

    Part 2: Implementing Brill's transformation-based learning method

    • The algorithm first creates a purely left-recursive binary tree from the expression, the naive parse in Brill's algorithm. (see naive_grammar in brill1.c)
    • It then searches the naive parse for potential transformations, transform the tree accordingly, and compared the modified tree to the hand-parsed tree.
      • The four potential configurations of the tree when comparing adjacent leaves are pictured below:
        Configuration A:
        < --- > Configuration B:
        Configuration C:
        < --- > Configuration D:

      • Subtrees 1 and 2 represent common roots of adjacent terminals in the expression.
      • More than one of these configurations is possible for two adjacent terminals:
        • If the common root of 1 and 2 has a root then either B or C is possible.
        • If a branch of the common root is not a leaf containing 1 or 2 then A or D or both are possible.
      • Manipulating the tree from A to C and from C to D is the same transform. Similarly, moving from b to a and from d to c is the same transform. So only two basic manipulations of the tree are possible. - (bt_trans in brill1.c)
    • Because I am using a binary tree to represent the parse of a text, the possible transforamtional rules in my algorithm are somewhat different from Brill's.
      • The six possible rules are the following:
        1. Join the parent tree of terminal t with the parent tree of the terminal to the (left | right) according to transform A.
        2. Join the parent tree of terminal t with the parent tree of the terminal to the (left | right) according to transform D.
        3. Separate the parent tree of terminal t from the terminal to the (left | right) according to transform B or C (only one is possible).
      • I have not implemented rules that consider pairs of terminals in my algorithm.
    • Once the algorithm has compared all possible transformations of its own parse to the hand parse, it chooses the best transformation, records that transformation, and repeats the process on the newly transformed tree.
    • The algorithm compare two trees (and thereby chooses the best transformation) by calculating the distances of adjacent nodes to their common roots in the tree. Each pair of differing distances between the hand-parse and the test parse detract from the test-parse's score.
    • This process is repeated until all possible transformations of the tree produce no noticeable improvement.

    Part 3: Converting a series of transformations into a CNF grammar

    • I am assuming that it is fairly straightforward to produce a grammar from a parse tree; each unique pair of branches produces a root non-terminal.
    • At each transformation, a new tree is produced, and new grammar rules can be derived from that tree.
    • These grammar productions supercede the productions from the untransformed parse-tree. For example:
      CONTAINS

      Rule1 ::= A B
      Rule2 ::= Rule1 C

      WHEREAS
      CONTAINS

      Rule3 ::= B C
      Rule4 ::= A Rule3


    • If the original tree had been transformed into the second, the program cannot assume that the first productions are invalid.
    • The program can only infer that 3 ::= bc supercedes 1 ::= ab.
    • It cannot infer that 2 ::= 1c supercedes 4 ::= a3 (although in this case, the productions cannot interfere).
    • However, if the 'c' terminal was an 'a' then this would be an issue because it creates ambiguity in the grammar: aba could be parsed into either a 2 or a 4 non-terminal as (ab)a or a(ba).
    • I have decided to allow for recursive rules by assuming that a rule is recursive if the left child of the right subnode is the same as the left child of the current node or if the right child of the left subnode is the same as the right child current node.
      For Example:
      PRODUCES THE RULES

      Rule1 ::= A B
      Rule1 ::= Rule1 B

      SIMILARLY
      PRODUCES THE RULES

      Rule1 ::= A B
      Rule1 ::= A Rule1

    Part 4: Implementating an ambiguous-grammar parser for Chomsky Normal Form grammars

    • The parser is a shift-reduce parser that takes CNF grammars and produces all possible parses of a string.
    • Given a stream of tokens, it outputs a tree of possible actions, where the unique paths along the tree represent the sequences of actions that occurs for each possible parse.
    • The parser does not produce every possible parse tree, although this feature may be implemented in the future.
    • It can also distinguish between conflicting rules by choosing the rule with a higher value of the "round" field.
    • Data Model:
      • The function yylex() returns the next token in the input stream. The parser uses this function to interface with LEX-generated scanners.
      • Each token and non-terminal is represented by a nonzero index. - get_ptable(int index)
        • These indices are associated with a structure that contains an input, output, and action list. - PartTable_t
          • The input list determines what tokens can come after the index token.
          • A zero value means that the token can be reduced with no input, a 1-1 reduce.
          • The output list specifies what non-terminal the reduction will produce.
          • The action list specifies what action to take on reducing the given rule.
          • A zero action means do nothing.
      • The parser uses a tree of possible actions for ambiguous parses. - ActionTree_t
        • For each time ambiguity in the grammar, the leaf of the action-tree with the ambiguity splits into two branches
        • Each of the leaves is paired with a stack and a marker denoting the next input from the token stream.
          • EndNode_t links the shift/reduce stack with a leaf on the action tree.
      • The grammar produces binary-branching parse trees.
        • Any two symbols, terminals or non-terminals can be reduced into one symbol.
        • Any one symbol can be reduced to a different symbol.
          • One to one reduces can simplify certain grammars, but they present problems because they can lead to infinite action trees.
          • I've addressed this problem using the sp_back value in EndNode_t.
          • The sp_back variable attempts to detect cycles in 1-1 reductions.
      • Evaluating branches of the action tree will determine the validity of a given parse.
        • The action tree is especially useful to test whether I have parsed an arithmetic expression properly.
        • Arithmetic functions can be paired with sequences of actions to determine whether a given parse returns a correct evaluation of an expression.



    CODE AND SAMPLE DATA:

    The main program and arithmetic expression generator:

    • All of the program's source files
    • The main program currently outputs the following items:
      1. The randomly generated arithmetic expression
      2. The specifications of each transform
      3. Dot files of the hand parse and each of the transformations
      4. These .dot files can be transformed into graphical diagrams by Graphviz
      5. The parses and sequence of actions generated by the parser
    • "exp-gen.c"-The main source file
    • A sample parse tree generated by the grammar:
    • A sample output of the main program including full output from the parser:
      "main-output.txt"

    My implementation of Brill's grammar induction algorithm:

    • "brill1.c"-The main source file
    • An example of naive parse with a transformational rule applied:
      The naive parse tree of an expression:

      The parse tree after applying the rule "Join the parent tree of a number with the parent tree of the terminal to the right according to transform D" :


    • An example of a hand parse and the cooresponding transformed tree:
      The hand parse:

      The fully transformed tree:

      • This tree was generated by applying the following two rules to a naive parse:
        1. Join the parent tree of a number with the parent tree of the terminal to the right according to transform D.
        2. Join the parent tree of a number with the parent tree of the terminal to the left according to transform D.
      • The four nodes of final tranformed tree differ in their configurations with the corresponding nodes in the hand-parsed tree.
      • My implementation of Brill's algorithm cannot yet do any better because it does not include enough rules to get closer to the hand parse.

    The program to produce CNF grammars from binary trees

    The ambiguous CNF grammar parser

    • "mop2.c"-The main source file for my ambiguous-grammar parser
    • The input grammar I used to test the parser appears below:
      			1 ::= 1,1 | 2,2 | 2
      			2 ::= 2,1 | 1,2
      	
    • "mop-output.txt"-The program output of the parser run with the above grammar on the array of non-terminals {2,1,2,2,1}

    A scanner in lex for arithmetic expressions




    SOURCES:

    1. Brill, Eric. "Automatic Grammar Induction and Parsing of Free Text"
    2. Crocker, Mathew. "Inside and Outside Probabilities and Viterbi Parsing"
    3. Dupont, P. "Inductive and Statisstical Learning of Formal Grammars"
    4. Langley, P. "Machine Learning and Grammar Induction"
    5. Tasker, B., Klein, D., Collins, M., Koller, D., Manning, C. "Maximum-Margin Parsing"
    6. Klein, D., Manning, C. "Natural Language Grammar Induction using a Constituent-Context Model"
    7. Lieberman, Henry. Grammex
    8. Flex Man Page from GNU, a Lex-like scanner generator
    9. Graphviz, a program that produces graphical diagrams