Week 11&12

Wiki Extraction & Sentences Filtering

Posted by Stuart Chen on August 18, 2019


We focus on:

1) get the natural language materials from the Wikipedia page for templates generation;

2) automate the paraphrases for predicates.

1. Wiki Extraction & Sentences Filtering

1.1 Wikipedia Articles Extraction

The Wiki pages are the excellent resources for us to get the natural language texts that containing the RDFs in DBpedia.

We use the library called Wikipedia-API to use Python wrapper for Wikipedias’ API, which facilitates extracting texts, sections, links, categories, translations, etc from the pages of Wikipedia. Documentation provides code snippets for the most common use cases.

For example, a ‘dbr’ should be like: ‘Audrey_Hepburn’, which is the identifier of both the DBPeida entity and the Wiki page.

def extract_page(dbr, clas):
    #wiki_wiki = wikipediaapi.Wikipedia( language='en', extract_format=wikipediaapi.ExtractFormat.WIKI )    
    wiki_wiki = wikipediaapi.Wikipedia('en')
    page_wiki = wiki_wiki.page(dbr)
    #page_wiki = wiki_wiki.page('Audrey_Hepburn')    
    print(" Page - Exists: %s" % page_wiki.exists())
    # Page - Exists: True
    fulltext = page_wiki.text
    savepath = save_page(dbr, fulltext, clas)
    return savepath

1.2 Preprocesing

We then use the spaCy to do the sentences segmentation. When using a space-based pre-trained model, the sentences are split based on training data provided during the model’s training procedure.

And, we also remove the puntuations except the comma in the sentences, because we need to use the comma to decompose the senetences in the part of converting them into questions.

Then we calculate the average length of the sentences we get and set it as the treshold to filter out those sentences that are too long.

2. Predicate-paraphrasing & RDF Matching

2.1. Coreference Resolution

Let’s see what is the coreference resolution.

This’s an explanative example(by StandfordNLP) [1],

Coreference Resolution

so, when the lady stated as above for this man’s election, we can see the corefernce relations as below:

    THE LADY: [ I, my, She ]
    THE MAN:  [ Nader, he ]

This can clarify what the words or pronouns are talking about in the passage: to which entity? from which? who? what’s this entity/pronoun?

  • in this example,

his wife Michelle

through the Coreference Resolution, we can figure out that whom does it mean by 'his' ? in the context, which helps a lot in the RDF entities confirmation.

Then we get the RDF:

'his' <dbr:Barack_Obama> –> 'wife' <dbo:spouse> –> 'Michelle' <dbr:Michelle_Obama>

Resolution of coreference is a well-studied problem in computer linguistics discourse. To obtain the correct interpretation of a text, or even to assess the relative significance of the different topics listed above, pronouns and other reference phrases must be related to the right individuals. Usually, algorithms designed to solve coreferences first look for the closestpre.

`Coreference Resolution in spaCy with Neural Networks.’

NeuralCoref[2] is a spaCy 2.1 + pipeline extension that uses a neural network to annotate and resolve coreference clusters. NeuralCoref is ready for production, incorporated into the NLP pipeline of spaCy and can be extended to fresh training datasets.

It solves the problem of pinpointing the pronounces and indicating the right corefernce of an entity.

2.2. Automatic Predicate-paraphrasing

At first, we use the API of an online paraphraser to do the part of automatic paraphrasing, which we found that might not be suitable to do the paraphrasing of short phrases like the predicates.

Then, we use the wordnet[3] via NLTK to take the paraphrasing task.

from nltk.corpus import wordnet

WordNet is the lexical database, equivalent to kind of English language dictionary, specifically intended for the processing of natural language.

Synset is a unique kind of easy interface to look up phrases in WordNet that is present in NLTK. Synset instances are synonymous word groupings expressing the same concept. Some of the phrases have just one synset, while others have several.

syn = wordnet.synsets('friend')[0]
print ("Synset name :  ", syn.name()) 

# Defining the word 
print ("\nSynset meaning : ", syn.definition()) 

# list of phrases that use the word in context 
print ("\nSynset example : ", syn.examples())
Synset name :   friend.n.01

Synset meaning :  a person you know well and regard with affection and trust

Synset example :  ['he was my best friend at the university']

To understand how the paraphraser works, we need to konw more about the Hypernyms and Hyponyms :

Hypernyms are the more abstract terms for the entity that a word describs.

Hyponyms are the specific terms, we can say that they are the subclasses of the entity.

print ("\nSynset abstract term :  ", syn.hypernyms()[:5]) 

print ("\nSynset specific term :  ",  syn.hypernyms()[0].hyponyms()[:5]) 

print ("\nSynset root hypernerm :  ", syn.root_hypernyms()[:5])
Synset name :   friend.n.01

Synset abstract term :   [Synset('person.n.01')]

Synset specific term :   [Synset('abator.n.01'), Synset('abjurer.n.01'), Synset('abomination.n.01'), Synset('abstainer.n.02'), Synset('achiever.n.01')]

Synset root hypernerm :   [Synset('entity.n.01')]

What is the Hypernyms and Hyponyms distinction?

Hyponymy indicates the connection between a generic (hypernyme) word and a particular (hyponymous) term.

A hyponym is a word or sentence that has a more particular semantic field than its hypernym.

A hypernym’s semantic field, also known as a superordinate, is wider than a hyponym’s.

Then we can get:

ps = PorterStemmer() 

def paraphrase(word):
    word = str(word).strip().lower()
    synsets = wordnet.synsets(word)
    #synonyms = [[[lemma.name() for lemma in synonym.lemmas()] for synonym in synset.hyponyms()] for synset in synsets ]
    synonyms = [[lemma.name() for lemma in synonym.lemmas()] for synonym in synsets[0].hyponyms()]  
    synonymset = []
    for synonym in synonyms :
        for syn in synonym:
            syn = str(syn).replace('_', ' ').lower()
            if len(syn.split()) == 1:
                syn = ps.stem(syn)
            synonymset.append(syn) ;
    synonymset = list(set(synonymset))
    synonymset.insert(0, word)
    return synonymset

Then we can do the RDF matching, by given a triple of DBpedia RDF, e.g. http://dbpedia.org/resource/Barack_Obama http://dbpedia.org/property/parents http://dbpedia.org/resource/Barack_Obama_Sr. .

we can have

{"subject":CoreferenceResolution(dbr_Barack_Obama), "predicate":dbp_parents+synonyms("parents"), "obeject":dbr_Barack_Obama_Sr}

to match the sentences that are sitable for this RDF triple.


[1] Coreference Resolution, StandfordNLP, https://nlp.stanford.edu/projects/coref.shtml

[2] NeuralCoref, https://github.com/huggingface/neuralcoref

[3] WordNet, https://wordnet.princeton.edu/