Week 9 & 10

Universal Sentence Encoder for Template Matching & Template Bank

Posted by Stuart Chen on August 3, 2019

Preface

We focus on:

1) classifying new questions by vector-similarity matching based on the Universal Sentence Encoder;

2) construction of the Template Bank.


1. Template Matching

1.1 Universal Sentence Encoder

Why do we need it?

Considering the great human effort already spent on manually composing the templates, we believe we can make better use of them to generate more new templates. For example, when we see these two questions,

who is the wife of Obama? and who is the spouse of Obama?

they can actually be represented by the same query, since they carry the same semantic meaning.

Then, how do we accomplish this semantic search?

We can use the similarity of embedding vectors produced by the Universal Sentence Encoder.

The Universal Sentence Encoder [1] encodes text into high-dimensional vectors that can be used for document classification, semantic similarity, topic clustering, and other natural language processing tasks. Via TensorFlow Hub, we deploy the Universal Sentence Encoder to calculate the similarity between a new question and the existing templates for the matching task.

The encoder comes in two variants, one based on the Transformer [2] and the other on the Deep Averaging Network (DAN) [3]. The two trade accuracy against computational cost: the Transformer encoder is more accurate but more computationally intensive, while the DAN encoder is cheaper to run at some loss of precision.

Here we use the Transformer-based sentence embedding to match each question with its closest template.

Example:

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_2 (InputLayer)            (None, 512)          0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            (None, 512)          0                                            
__________________________________________________________________________________________________
model_1 (Model)                 (None, 512)          262656      input_2[0][0]                    
                                                                 input_3[0][0]                    
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 512)          0           model_1[1][0]                    
                                                                 model_1[2][0]                    
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 1)            513         lambda_1[0][0]                   
==================================================================================================
Total params: 263,169
Trainable params: 263,169
Non-trainable params: 0
__________________________________________________________________________________________________

So, if we compare who is the wife of Obama? and who is the spouse of Obama?, they would be classified as having the same semantic meaning by setting a threshold on the similarity score.

However, when we compare who is the father of Obama? and who is the spouse of Obama?, the difference shows up in the similarity score. If the score falls below the threshold, we call the template composing function instead.
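As a minimal sketch of this matching step (assuming TensorFlow 2.x, the tensorflow_hub package, and a published Universal Sentence Encoder module on TF-Hub; the 0.8 threshold is only illustrative and would be tuned on the template set):

import numpy as np
import tensorflow_hub as hub

# Load the Transformer-based Universal Sentence Encoder from TF-Hub
# (the module URL and version are assumptions; any USE release behaves the same way).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

def similarity(a, b):
    """Cosine similarity between the USE embeddings of two sentences."""
    va, vb = embed([a, b]).numpy()
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

THRESHOLD = 0.8  # illustrative value; tuned on the template set in practice

print(similarity("who is the wife of Obama?", "who is the spouse of Obama?") >= THRESHOLD)
print(similarity("who is the father of Obama?", "who is the spouse of Obama?") >= THRESHOLD)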


1.2 Preprocessing

The template set must first have the stop-words and punctuation filtered out, and then be converted to lower case for normalization.
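A minimal sketch of this step, assuming the NLTK stop-word list (any stop-word list would work); the helper name preprocess_text matches the one used by the search function later on:

import string
from nltk.corpus import stopwords  # assumes the NLTK stopwords corpus has been downloaded

STOP_WORDS = set(stopwords.words("english"))

def preprocess_text(text):
    """Lower-case the text, strip punctuation, and drop stop-words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in STOP_WORDS)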

1.3 Detecting The Class Of The Question

We use the DBpedia Spotlight API to detect the class of the noun subject in the question, which helps to assign the question to its corresponding template set.
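A possible sketch of this call against the public DBpedia Spotlight REST endpoint (the confidence value is illustrative):

import requests

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def detect_classes(question, confidence=0.4):
    """Annotate the question with DBpedia Spotlight and return the types of each detected entity."""
    resp = requests.get(
        SPOTLIGHT_URL,
        params={"text": question, "confidence": confidence},
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    resources = resp.json().get("Resources", [])
    # Each resource carries its URI and a comma-separated list of ontology types.
    return {r["@URI"]: r.get("@types", "").split(",") for r in resources}

print(detect_classes("which television show were created by joe austen?"))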

Why is this necessary? Because whether two sentences get confused depends on how close the two Classes they belong to are. For instance, dbo:SportsLeague and dbo:Monument cover quite different topics, so their templates are not easy to confuse. But take dbo:Place versus dbo:Monument:

dbo:Place;;;what is the address of <X>;

and

dbo:Monument;;;what is the location of <A>;

could easily be confused with each other.

To a human, “address” and “location” might carry the same meaning, but that is not the case in the DBpedia knowledge graph:

dbo:Place;;;what is the address of <X>;SELECT ?x WHERE { <X> <http://dbpedia.org/ontology/address> ?x };SELECT ?a WHERE { ?a <http://dbpedia.org/ontology/address> [] . ?a a <http://dbpedia.org/ontology/Place> }
dbo:Monument;;;what is the location of <A>;select ?a where { <A> dbo:location ?a };select distinct(?a) where { ?a dbo:location [] }

So, to ensure accuracy, we should determine the Class before matching: we first classify the new question into a Class, and only then match it against that Class's templates.

This is not entirely straightforward, because it requires finding the entity type first:

Statue of Liberty --> Place or Monument

However, not every entity belongs to a single upper Class. Certain entities might belong to multiple Classes. For example, in

"which television show were created by joe austen?" 

the

"joe austen"

is annotated as

"dbr:Joe_Austen" 

with three upper Classes,

dbo:Person, dbo:Artist, dbo:Agent, 

and Spotlight also indicates in its response that the most probable Class in this context is dbo:Person. Still, to decide which Class the question belongs to, we can fetch the list of properties that each of the three upper Classes has and check in which list the predicate of the question exists.
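A sketch of this check, assuming the public DBpedia SPARQL endpoint and the SPARQLWrapper package, and linking a property to a Class through rdfs:domain or rdfs:range (the range side matters when, as here, the entity sits on the object side of the triple); the predicate URI below is only illustrative:

from SPARQLWrapper import SPARQLWrapper, JSON  # assumes the SPARQLWrapper package is installed

ENDPOINT = "https://dbpedia.org/sparql"

def class_properties(class_uri):
    """Fetch the properties whose declared domain or range is the given Class."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
        "SELECT ?p WHERE { { ?p rdfs:domain <" + class_uri + "> } "
        "UNION { ?p rdfs:range <" + class_uri + "> } }"
    )
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return {row["p"]["value"] for row in rows}

candidates = [
    "http://dbpedia.org/ontology/Person",
    "http://dbpedia.org/ontology/Artist",
    "http://dbpedia.org/ontology/Agent",
]
predicate = "http://dbpedia.org/ontology/creator"  # illustrative predicate for "created by"

for class_uri in candidates:
    if predicate in class_properties(class_uri):
        print("the predicate belongs to the property list of", class_uri)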


1.4 Semantic Search

How do we use the Universal Sentence Encoder to build the semantic search?

Given the detected Class and its template set, we can use the similarity search function to get the matched template for the question:

def semantic_search(query, texts_processed, vectors):
    """Return the template most similar to the query, as (similarity, text, index)."""
    query = preprocess_text(query)
    print("\n Extracting features...")
    query_vec = get_features(query)[0].ravel()
    res = []
    for i, d in enumerate(texts_processed):
        qvec = vectors[i].ravel()
        sim = cosine_similarity(query_vec, qvec)
        res.append((sim, d[:100], i))
    res.sort(key=lambda x: x[0], reverse=True)  # sort so the best match comes first
    return res[0]

res = semantic_search("This is a new sentence. ", texts_processed, BASE_VECTORS)
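The helpers used above are not part of the snippet: preprocess_text was sketched in section 1.2, and the other two could look like this, reusing the embed module loaded from TF-Hub in the earlier sketch:

import numpy as np

def get_features(texts):
    """Embed one or more sentences with the Universal Sentence Encoder loaded earlier."""
    if isinstance(texts, str):
        texts = [texts]
    return embed(texts).numpy()

def cosine_similarity(v1, v2):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

BASE_VECTORS would then simply be get_features applied to the preprocessed template set of the detected Class.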

2. Template Bank

2.1 The Directory For The Bank

-- Bank
    -- Class 0
        -- class0.csv
        -- class0.vectors
    -- Class 1
        -- class1.csv 
        -- class1.vectors
    -- Class 2
        -- class2.csv
        -- class2.vectors
    -- Class ...

In this directory, the Bank is organized by the Classes of the DBpedia ontology, and each Class has its own folder. Each folder contains the Class's template set CSV and the set of sentence vectors for its templates.
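As a sketch of how a Class folder's vectors file could be produced (assuming ';'-separated template rows with the question text in the fourth field, as in the examples of section 1.3, and the get_features and preprocess_text helpers sketched earlier):

import csv
import pickle

def build_class_vectors(class_dir, csv_name, vectors_name):
    """Embed every template question of a Class and pickle the vectors next to the CSV."""
    questions = []
    with open(class_dir + "/" + csv_name) as f:
        for row in csv.reader(f, delimiter=";"):
            questions.append(preprocess_text(row[3]))  # assumption: the question sits in the 4th field
    vectors = get_features(questions)
    with open(class_dir + "/" + vectors_name, "wb") as f:
        pickle.dump((questions, vectors), f)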

We use this snippet to list the existing Classes:

import csv
import pickle
import os

path = "../data/Bank"  # the Bank directory lives in neural-qa/data

files = os.listdir(path)  # all file/folder names in the Bank directory

templates_pool = []

for file in files:  # keep only the Class folders
    if os.path.isdir(os.path.join(path, file)):
        templates_pool.append(file)

print('\n The existing templates pool contains these Classes: \n ', templates_pool)
    

2.2 Allocation To The Matched Class Folder

From the templates_pool above, we can check whether the detected Class already exists in the Bank.

Then, we can load the corresponding template set and vectors.
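A minimal sketch of this lookup, under the same layout assumptions (one folder per Class containing a <class>.csv and a <class>.vectors file, both assumed to be named after the Class in lower case):

def load_class_templates(detected_class, bank_path="../data/Bank"):
    """Load the template set and precomputed vectors for a detected Class, if the Bank has it."""
    if detected_class not in templates_pool:
        return None  # unknown Class: fall back to the template composing function
    class_dir = os.path.join(bank_path, detected_class)
    name = detected_class.lower()  # assumption: files are named after the Class in lower case
    with open(os.path.join(class_dir, name + ".csv")) as f:
        templates = list(csv.reader(f, delimiter=";"))
    with open(os.path.join(class_dir, name + ".vectors"), "rb") as f:
        texts_processed, vectors = pickle.load(f)
    return templates, texts_processed, vectors

The returned texts_processed and vectors can then be passed directly to semantic_search from section 1.4.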

References

[1] D. Cer et al., 2018. Universal Sentence Encoder.

[2] A. Vaswani et al., 2017. Attention Is All You Need.

[3] M. Iyyer et al., 2015. Deep Averaging Networks.