DBpedia Evolution – NLG for Statistical Reports – Cordoba Data Science Meetup 2015/09/02



Talk at Cordoba Data Science meetup 2015/09/02

On Github DrDub / talk_meetupcbadatascience_20150902

DBpedia Evolution

NLG for Statistical Reports

Cordoba Data Science Meetup 2015/09/02

Created by Pablo Duboue / @pabloduboue

What is NLG

Natural Language Generation

The other NLP

Decisions, decisions, decisions

Continuous enrichment

Pablo Duboue

My thesis (defended ten years ago!) was in Machine Learning for NLG. I worked on two full NLG systems:

  • MAGIC, a bypass surgery report system written in LISP.
  • ProGenIE, a biography generator written in Java.

Even though I gravitated towards ML and IR, half my papers are in NLG and I'm coming back to the field. (I recently ran for a position on the board of SIGGEN.)

NLG & Data Science

Why Big Data needs natural language generation to work: http://www.thoughtsoncloud.com/2015/07/why-big-data-needs-natural-language-generation-to-work/

Standard Pipeline

Adapted from https://code.google.com/p/simplenlg/wiki/AppendixA

Document Planner

content determination decides what information will appear in the output text. This depends on what your goal is, who the audience is, what sort of input information is available to you in the first place and other constraints such as allowed text length.

document structuring decides how chunks of content should be grouped in a document, how to relate these groups to each other and in what order they should appear. For instance, when describing last month’s weather, you might talk first about temperature, then rainfall. Or you might start off generally talking about the weather and then provide specific weather events that occurred during the month.

OpenSchema performs both tasks.
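The two document planning tasks can be sketched in a few lines of Java. This is only an illustration of the idea, not how OpenSchema works: the `Fact` class, the importance scores and the fixed topic order are all assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy document planner (hypothetical, for illustration only). */
final class DocumentPlannerSketch {

    /** A candidate piece of content with a made-up importance score. */
    static final class Fact {
        final String topic;
        final String text;
        final double importance;

        Fact(String topic, String text, double importance) {
            this.topic = topic;
            this.text = text;
            this.importance = importance;
        }
    }

    /** Content determination: keep only the maxFacts most important facts. */
    static List<Fact> select(List<Fact> facts, int maxFacts) {
        List<Fact> sorted = new ArrayList<>(facts);
        sorted.sort(Comparator.comparingDouble((Fact f) -> f.importance).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(maxFacts, sorted.size())));
    }

    /** Document structuring: order facts by a fixed topic order (unknown topics come first). */
    static List<Fact> structure(List<Fact> selected, List<String> topicOrder) {
        List<Fact> ordered = new ArrayList<>(selected);
        ordered.sort(Comparator.comparingInt((Fact f) -> topicOrder.indexOf(f.topic)));
        return ordered;
    }

    public static void main(String[] args) {
        List<Fact> facts = Arrays.asList(
                new Fact("rainfall", "The month was drier than average.", 0.9),
                new Fact("wind", "Winds were light.", 0.1),
                new Fact("temperature", "The month was cooler than average.", 0.8));
        // Talk first about temperature, then rainfall, as in the weather example.
        for (Fact f : structure(select(facts, 2), Arrays.asList("temperature", "rainfall"))) {
            System.out.println(f.text);
        }
    }
}
```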

Microplanner

lexicalization decides what specific words should be used to express the content. For example, the actual nouns, verbs, adjectives and adverbs to appear in the text are chosen from a lexicon. Particular syntactic structures are chosen as well. For example you can say ‘the car owned by Mary’ or you might prefer the phrase ‘Mary’s car’.

referring expressions decides which expressions should be used to refer to entities (both concrete and abstract). The same entity can be referred to in many ways. For example March of last year can be referred to as:

  • March 2014
  • March
  • March of the previous year
  • it
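A minimal sketch of how such a choice could be made, assuming a very small notion of context (the last month mentioned and the year the text is about). The class and its parameters are hypothetical; real referring expression generation considers much more.

```java
import java.time.YearMonth;
import java.time.format.TextStyle;
import java.util.Locale;

/** Toy referring-expression chooser for months (illustrative only). */
final class MonthReferenceSketch {

    /**
     * @param target        the month to refer to
     * @param lastMentioned the month mentioned in the previous clause, or null
     * @param yearInFocus   the year the text is currently about, or null
     */
    static String refer(YearMonth target, YearMonth lastMentioned, Integer yearInFocus) {
        // Just mentioned: a pronoun is enough.
        if (target.equals(lastMentioned)) {
            return "it";
        }
        String month = target.getMonth().getDisplayName(TextStyle.FULL, Locale.ENGLISH);
        // The reader already knows the year: drop it.
        if (yearInFocus != null && yearInFocus.intValue() == target.getYear()) {
            return month;
        }
        // No helpful context: use the full, unambiguous form.
        return month + " " + target.getYear();
    }

    public static void main(String[] args) {
        YearMonth march = YearMonth.of(2014, 3);
        System.out.println(refer(march, null, null));                  // March 2014
        System.out.println(refer(march, YearMonth.of(2014, 2), 2014)); // March
        System.out.println(refer(march, march, 2014));                 // it
    }
}
```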

Microplanner

aggregation decides how the structures created by document planning should be mapped onto linguistic structures such as sentences and paragraphs. For instance, two ideas can be expressed in two sentences or in one:

The month was cooler than average. The month was drier than average.

vs.

The month was cooler and drier than average.
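The weather example can be reproduced with a toy aggregation rule: adjacent clauses that differ only in their adjective are merged into one sentence. The `Clause` shape below is an assumption chosen to fit this one example, not a general representation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy aggregation: merge adjacent clauses that differ only in their adjective. */
final class AggregationSketch {

    /** A clause of the fixed shape "<subject> <verb> <adjective> <tail>". */
    static final class Clause {
        final String subject, verb, adjective, tail;

        Clause(String subject, String verb, String adjective, String tail) {
            this.subject = subject;
            this.verb = verb;
            this.adjective = adjective;
            this.tail = tail;
        }

        boolean sameFrame(Clause other) {
            return subject.equals(other.subject) && verb.equals(other.verb) && tail.equals(other.tail);
        }
    }

    static List<String> aggregate(List<Clause> clauses) {
        List<String> sentences = new ArrayList<>();
        int i = 0;
        while (i < clauses.size()) {
            Clause first = clauses.get(i);
            List<String> adjectives = new ArrayList<>();
            adjectives.add(first.adjective);
            int j = i + 1;
            // Absorb every following clause that shares subject, verb and tail.
            while (j < clauses.size() && clauses.get(j).sameFrame(first)) {
                adjectives.add(clauses.get(j).adjective);
                j++;
            }
            // Naive coordination: more than two adjectives would read "a and b and c".
            sentences.add(first.subject + " " + first.verb + " "
                    + String.join(" and ", adjectives) + " " + first.tail + ".");
            i = j;
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<Clause> clauses = Arrays.asList(
                new Clause("The month", "was", "cooler", "than average"),
                new Clause("The month", "was", "drier", "than average"));
        System.out.println(aggregate(clauses)); // [The month was cooler and drier than average.]
    }
}
```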

Surface Realiser

linguistic realisation uses rules of grammar (about morphology and syntax) to convert abstract representations of sentences into actual text.

structure realization converts abstract structures such as paragraphs and sentences into mark-up symbols which are used to display the text.

SimpleNLG performs the last part, namely surface realisation.

Statistical Reports

PostGraphe, a system developed as part of Dr. Fasciano's thesis at UdeM

Basic intentions covered in PostGraphe:

  • The presentation of a variable
  • The comparison of variables or sets of variables
  • The evolution of a variable along another one
  • The correlation of variables
  • The distribution of a variable over another one

OpenSchema

OpenSchema selects what to say and structures the selected information. It does so by executing an augmented transition network (ATN), which for the purposes of this software package is a grammar for a regular language (think regular expressions) over discourse predicates. The predicates themselves are also defined as part of the schema.

Input: RDF

RDF is a graph description notation used in the Semantic Web.

Output: Clauses

The output of the OpenSchemaPlanner is a DocumentPlan, which contains a list of paragraphs, each of which is a list of aggregation segments. An aggregation segment, in turn, is a list of clauses, where each clause is a hierarchical attribute-value matrix represented as a Java Map from String to Object.
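To make the nesting concrete, here is a sketch that builds the same shape with plain Java collections. It does not use the actual OpenSchema classes; the `avm` helper, the example values and the choice of keys are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-collections picture of a document plan (not the OpenSchema API). */
final class DocumentPlanSketch {

    /** Builds an attribute-value matrix from alternating keys and values. */
    static Map<String, Object> avm(Object... keysAndValues) {
        Map<String, Object> map = new HashMap<>();
        for (int i = 0; i + 1 < keysAndValues.length; i += 2) {
            map.put((String) keysAndValues[i], keysAndValues[i + 1]);
        }
        return map;
    }

    /** document plan -> paragraphs -> aggregation segments -> clauses */
    static List<List<List<Map<String, Object>>>> example() {
        // Values can themselves be matrices, which makes the structure hierarchical.
        Map<String, Object> person = avm("name", avm("first-name", "Jane", "last-name", "Doe"));
        Map<String, Object> clause1 = avm("pred", "attributive", "pred0", person, "pred1", "physicist");
        Map<String, Object> clause2 = avm("pred", "attributive", "pred0", person, "pred1", "writer");
        List<Map<String, Object>> segment = Arrays.asList(clause1, clause2);
        List<List<Map<String, Object>>> paragraph = Collections.singletonList(segment);
        return Collections.singletonList(paragraph);
    }

    static int countClauses(List<List<List<Map<String, Object>>>> plan) {
        int count = 0;
        for (List<List<Map<String, Object>>> paragraph : plan) {
            for (List<Map<String, Object>> segment : paragraph) {
                count += segment.size();
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countClauses(example())); // 2
    }
}
```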

Schema

schema biography(self: c-person)  
  ; name of the schema 'biography'
  ; self is the person the bio is about, required

  ; first paragraph, the person
  plus
    pred-person(person|self)
  optional
    pred-birth(person|self)
  star ; zero or more aliases
    pred-alias(person|self)
  star ; zero or more parents
    choice
      pred-father(self|self,parent|parent)
      pred-mother(self|self,parent|parent)
    star
      pred-person(person|parent)
  star ; zero or more education
    pred-education(person|self)
  paragraph-boundary
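Read as a regular expression, the schema above says which sequences of predicates are acceptable. The sketch below encodes a simplified version (the parents branch is left out) as a `java.util.regex.Pattern` over predicate names. It only illustrates the "regular language over predicates" idea and is not how OpenSchema executes a schema.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** The biography schema, simplified, as a regular expression over predicate names. */
final class SchemaAsRegexSketch {

    // plus pred-person, optional pred-birth, star pred-alias, star pred-education
    // (the parents branch of the schema is left out to keep the sketch short).
    static final Pattern BIOGRAPHY = Pattern.compile(
            "(pred-person )+(pred-birth )?(pred-alias )*(pred-education )*");

    /** True if the predicate sequence is in the language described by the schema. */
    static boolean accepts(List<String> predicates) {
        StringBuilder sequence = new StringBuilder();
        for (String predicate : predicates) {
            sequence.append(predicate).append(' ');
        }
        return BIOGRAPHY.matcher(sequence).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts(Arrays.asList("pred-person", "pred-birth", "pred-alias"))); // true
        System.out.println(accepts(Arrays.asList("pred-birth", "pred-person")));               // false
    }
}
```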
    

Predicates

predicate pred-person
  variables
    req def person : c-person 
    occupation : c-occupation
  properties  ; properties that the variables have to hold
    occupation == person.occupation
  output
    ; use this for template generation
    template "{{name-first}} {{name-last}} is a {{occupation}}. "
    name-first    person.name.first-name
    name-last     person.name.last-name
    occupation    occupation.#TYPE
    ; use these preds for SimpleNLG
    pred attributive
    pred0 person 
    pred1 occupation
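The template line of the predicate can be filled with a small slot replacer like the one below. This is a generic sketch of `{{slot}}` substitution with hypothetical example values, not the template engine OpenSchema uses.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Minimal {{slot}} template filler (illustrative only). */
final class TemplateSketch {

    private static final Pattern SLOT = Pattern.compile("\\{\\{([^}]+)\\}\\}");

    static String fill(String template, Map<String, String> values) {
        Matcher m = SLOT.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("No value for slot: " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in values from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("name-first", "Jane");      // hypothetical person
        values.put("name-last", "Doe");
        values.put("occupation", "physicist");
        System.out.println(fill("{{name-first}} {{name-last}} is a {{occupation}}. ", values));
    }
}
```

Note that a template cannot adjust "a" to "an" when the occupation starts with a vowel; that kind of decision is what a surface realiser handles.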

SimpleNLG

Tutorial, adapted from https://code.google.com/p/simplenlg/wiki/Tutorial

        Lexicon lexicon = Lexicon.getDefaultLexicon();
        NLGFactory nlgFactory = new NLGFactory(lexicon);
        Realiser realiser = new Realiser(lexicon);

        NLGElement s1 = 
            nlgFactory.createSentence("my dog is happy");

        String output = realiser.realiseSentence(s1);
        System.out.println(output);


My dog is happy.

        SPhraseSpec p = nlgFactory.createClause();
        p.setSubject("Mary");
        p.setVerb("chase");
        p.setObject("the monkey");

        String output = realiser.realiseSentence(p);
        System.out.println(output);

Mary chases the monkey.

        NPPhraseSpec subject1 = 
          nlgFactory.createNounPhrase("Mary");
        NPPhraseSpec subject2 = 
          nlgFactory.createNounPhrase("your", "giraffe");

        CoordinatedPhraseElement subj = 
          nlgFactory.createCoordinatedPhrase(subject1, subject2);
        p.setSubject(subj);

Mary and your giraffe chase the monkey.

        NPPhraseSpec object1 = 
            nlgFactory.createNounPhrase("the monkey");
        NPPhraseSpec object2 = 
            nlgFactory.createNounPhrase("George");

        CoordinatedPhraseElement obj = 
            nlgFactory.createCoordinatedPhrase(object1, object2);
        obj.addCoordinate("Martha");
        p.setObject(obj);

        obj.setFeature(Feature.CONJUNCTION, "or");

Mary and your giraffe chase the monkey, George or Martha.

        p.setFeature(Feature.TENSE, Tense.FUTURE);
        p.setFeature(Feature.NEGATED, true);

Mary will not chase the monkey.

In-between

Going from the OpenSchema predicates to SimpleNLG classes:

  • Aggregation
  • Lexicalization
  • Referring Expression Generation: Alusivo

Alusivo

https://github.com/DrDub/Alusivo

{ Dilma_Rousseff Antonio_Palocci Celso_Amorim Hu_Jintao }
http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://dbpedia.org/ontology/OfficeHolder
http://dbpedia.org/ontology/profession http://dbpedia.org/resource/Economist

Alusivo

Input: RDF Triples

Output: distinguishing properties
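What "distinguishing properties" means can be shown with a simple greedy selection, in the spirit of incremental referring expression algorithms: walk over the target's properties and keep each one that rules out at least one remaining distractor. This is a sketch under stated assumptions and does not use the Alusivo API; the entities `ex:A` to `ex:D` and their "property=value" strings are made-up toy data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Greedy selection of distinguishing properties (illustrative only). */
final class DistinguishingPropertiesSketch {

    /** Returns properties that single out the target, or empty if none do. */
    static Optional<List<String>> distinguish(String target, Map<String, Set<String>> facts) {
        Set<String> distractors = new LinkedHashSet<>(facts.keySet());
        distractors.remove(target);
        List<String> chosen = new ArrayList<>();
        for (String property : facts.get(target)) {
            if (distractors.isEmpty()) {
                break;
            }
            // Keep a property only if it rules out at least one remaining distractor.
            if (distractors.removeIf(d -> !facts.get(d).contains(property))) {
                chosen.add(property);
            }
        }
        if (distractors.isEmpty()) {
            return Optional.of(chosen);
        }
        return Optional.empty();
    }

    /** Hypothetical toy data: four entities, each with "property=value" strings. */
    static Map<String, Set<String>> toyFacts() {
        Map<String, Set<String>> facts = new LinkedHashMap<>();
        facts.put("ex:A", new LinkedHashSet<>(Arrays.asList("type=OfficeHolder", "profession=Economist")));
        facts.put("ex:B", new LinkedHashSet<>(Arrays.asList("type=OfficeHolder", "profession=Physician")));
        facts.put("ex:C", new LinkedHashSet<>(Arrays.asList("type=OfficeHolder", "profession=Diplomat")));
        facts.put("ex:D", new LinkedHashSet<>(Arrays.asList("type=Athlete", "profession=Economist")));
        return facts;
    }

    public static void main(String[] args) {
        // ex:A needs both properties: the type rules out ex:D, the profession rules out ex:B and ex:C.
        System.out.println(distinguish("ex:A", toyFacts()));
    }
}
```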

Alusivo

In Use

MICAI 2015

Evaluating Robustness of Referring Expression Generation Algorithms

We took DBpedia 2011/1, generated referring expressions (REs) and evaluated them on DBpedia 2014/5
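One way to picture this kind of evaluation: a referring expression generated on the older data is robust if, on the newer data, it still matches the target and nothing else. The sketch below is an assumption about the general idea, not the evaluation protocol of the paper; the entities and properties are made-up toy data.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Checks whether an old referring expression still works on newer data (illustrative only). */
final class RobustnessSketch {

    /** All entities in kb that hold every property of the referring expression. */
    static Set<String> referents(Collection<String> expression, Map<String, Set<String>> kb) {
        Set<String> matches = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entity : kb.entrySet()) {
            if (entity.getValue().containsAll(expression)) {
                matches.add(entity.getKey());
            }
        }
        return matches;
    }

    /** The expression survives if, in the newer data, it still picks out only the target. */
    static boolean stillIdentifies(String target, Collection<String> expression,
                                   Map<String, Set<String>> newerKb) {
        Set<String> matches = referents(expression, newerKb);
        return matches.size() == 1 && matches.contains(target);
    }

    public static void main(String[] args) {
        // Hypothetical newer snapshot in which ex:B has also become an economist.
        Map<String, Set<String>> newer = new HashMap<>();
        newer.put("ex:A", new HashSet<>(Arrays.asList("type=OfficeHolder", "profession=Economist")));
        newer.put("ex:B", new HashSet<>(Arrays.asList("type=OfficeHolder", "profession=Economist")));
        // An expression that was unique for ex:A in the older snapshot no longer is.
        System.out.println(stillIdentifies("ex:A", Arrays.asList("profession=Economist"), newer)); // false
    }
}
```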

Most algorithms behaved correctly

DBpedia Evolution

Sampling to obtain differences beyond 3 years

Data Visualization

Dimensionality reduction

What is missing?

Capturing generalizations

Thoughtland

Website

Wrap-up

Do you want to learn more?

I have here the material for my NLG class from 2011 (a graduate-level, semester-long course).

http://wiki.duboue.net/index.php/2011_FaMAF_Intro_to_NLG

Aprendizaje Sobre Grandes Volumenes de Datos

http://aprendizajengrande.net

Now on YouTube

THE END

Keep in touch with Pablo at:
