Generating Language – Part 3

In the last edition, we looked at some early-stage content determination. Now we’ll move on to something a bit more technical, talk about grammar, and show some actual code. Armed with basic content, we move on to sentence planning and surface realization. Sentence planning is done by me; for surface realization, there’s a well-established Java API that does the vast majority of what we need. The upside: somebody else has already done the least interesting, most tedious part of NLG for us. The downside: it’s in Java, and I don’t (or didn’t, a couple of weeks ago) have exposure to Java.

Good thing I have no fear of learning new languages. I found this tutorial helpful, even though the website design makes me want to stab myself in the eyes. And I have to give a mega thanks to my good friend Cody, who made himself available as a resource to answer all of my Java questions, no matter how amateur some of them likely seemed to him.

Once I had gone through enough of the tutorial that I felt like I could just wing it the rest of the way, I started learning SimpleNLG, which has its own tutorial. SimpleNLG’s tutorial really isn’t all that comprehensive; it covers the bare-bones basics, and only just barely. I think that having deep knowledge of morphology and grammar might have been a precondition for my success in using it (with Google’s help, of course).

The basic procedure is as follows:

  1. Decide on a piece of content you want to communicate.
  2. Formulate a sentence to communicate it.
  3. Tag each word in the sentence according to the role it plays (remember elementary school sentence diagramming? It’s actually useful!)
  4. Code it all up in SimpleNLG, abstracting away some words as variables and possibly conditioning some of the behavior on others. We’re about to see an example.

So, let’s say we have data for homes in Twin Peaks, WA, which show that the average home has 3 bedrooms and 1 bathroom. That’s step 1 above. We might, then, plan the sentence, “Homes in Twin Peaks typically have 3 bedrooms and 1 bathroom.” That’s step 2.

Step 3 is more complex: Our subject is the noun phrase (NP) “homes in Twin Peaks”, and our predicate is the verb phrase (VP) “typically have 3 bedrooms and 1 bathroom.”

Looking at the subject first, we can further decompose it into the NP “homes”, which is just the plural of “home”, and the prepositional phrase (PP) “in Twin Peaks”, which is the preposition “in” complemented by the proper noun “Twin Peaks.”

The predicate is built around the main verb “(to) have.” It gets modified by the adverb “typically.” What follows is the coordinated NP “3 bedrooms and 1 bathroom,” whose two halves are joined by the coordinating conjunction “and.” The first coordinate is the NP “3 bedrooms,” which breaks down into “bedroom,” modified by “3” and pluralized; the second coordinate is the NP “1 bathroom,” which breaks down into “bathroom,” modified by “1.”
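
To make that breakdown a bit more concrete, here’s a rough sketch of the same tree in plain Java. To be clear, this is not SimpleNLG and it doesn’t use SimpleNLG’s API; every class and method name below is something I made up for illustration, and the “pluralizer” just sticks an “s” on the end, which only works for regular nouns like these. The point is only to show how tagging each piece by its role lets the number decide the form of the noun.

```java
// Hypothetical, simplified model of the tagged sentence; not the SimpleNLG API.
final class SentenceSketch {

    /** Anything that can be turned into a string of words. */
    interface Phrase {
        String realize();
    }

    /** A noun phrase: head noun, optional number modifier, optional trailing modifier. */
    static class NounPhrase implements Phrase {
        private final String head;
        private final Integer count;  // null when no number modifies the noun
        private final boolean plural;
        private Phrase postModifier;  // e.g. the PP "in Twin Peaks"

        NounPhrase(String head, Integer count, boolean plural) {
            this.head = head;
            this.count = count;
            this.plural = plural;
        }

        /** Noun modified by a number: plural only when the number is greater than 1. */
        static NounPhrase counted(String head, int count) {
            return new NounPhrase(head, count, count > 1);
        }

        NounPhrase withPostModifier(Phrase modifier) {
            this.postModifier = modifier;
            return this;
        }

        public String realize() {
            StringBuilder sb = new StringBuilder();
            if (count != null) {
                sb.append(count).append(' ');
            }
            // Naive assumption: regular nouns only, so the plural is head + "s".
            sb.append(plural ? head + "s" : head);
            if (postModifier != null) {
                sb.append(' ').append(postModifier.realize());
            }
            return sb.toString();
        }
    }

    /** A prepositional phrase: preposition plus its complement. */
    static class PrepPhrase implements Phrase {
        private final String preposition;
        private final Phrase complement;

        PrepPhrase(String preposition, Phrase complement) {
            this.preposition = preposition;
            this.complement = complement;
        }

        public String realize() {
            return preposition + " " + complement.realize();
        }
    }

    /** Two coordinates joined by a coordinating conjunction such as "and". */
    static class Coordination implements Phrase {
        private final Phrase left;
        private final String conjunction;
        private final Phrase right;

        Coordination(Phrase left, String conjunction, Phrase right) {
            this.left = left;
            this.conjunction = conjunction;
            this.right = right;
        }

        public String realize() {
            return left.realize() + " " + conjunction + " " + right.realize();
        }
    }

    /** Builds the example sentence from the two numbers in the data. */
    static String build(int bedrooms, int bathrooms) {
        Phrase subject = new NounPhrase("home", null, true)
                .withPostModifier(new PrepPhrase("in", new NounPhrase("Twin Peaks", null, false)));
        Phrase object = new Coordination(
                NounPhrase.counted("bedroom", bedrooms), "and",
                NounPhrase.counted("bathroom", bathrooms));
        // The verb is left uninflected here; a real realizer would handle agreement.
        String body = subject.realize() + " typically have " + object.realize();
        return Character.toUpperCase(body.charAt(0)) + body.substring(1) + ".";
    }

    public static void main(String[] args) {
        System.out.println(build(3, 1));
    }
}
```

Calling `build(3, 1)` gives back the sentence we planned in step 2, and swapping the numbers to `build(1, 2)` flips which noun gets pluralized without touching anything else. A real realizer handles far more than this (irregular plurals, verb agreement, and so on), which is exactly why I’d rather lean on a library.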

That’s all very tedious, I know, but getting things tagged in this fine detail is important. It will allow us to have a computer make intelligent morphological decisions, such as pluralizing “bedroom” when modified by a number greater than 1 but not otherwise, and a whole host of other complicated linguistic things we might want to do later on. In the next post, we’ll write some actual SimpleNLG code, I promise.
