Week 5

Issue Analysis & Vector Embeddings

Posted by Stuart Chen on June 30, 2019

The Week

This week, I believe, was a dive.

I can now see many things, subtle and seemingly extending down toward the abyss.

This week, I tried transforming natural language questions into SPARQL queries with the trained Neural SPARQL Machine (NSpM) models, which revealed how restricted the vocabulary mapping is.

Tracing back, I noticed that the Spotlight annotation seems to perform entity recognition on word-wise input, so an entity consisting of more than one word could be missed.
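To check this, a minimal sketch (my own, not part of the project code) can send a question to the public DBpedia Spotlight REST endpoint and print the surface forms it recognizes; if a multi-word name such as "Xu Fan" comes back as separate single-word matches, or not at all, the multi-word entity has been missed.

    import requests

    def spotlight_annotate(text, confidence=0.5):
        """Annotate text with the public DBpedia Spotlight endpoint."""
        resp = requests.get(
            "https://api.dbpedia-spotlight.org/en/annotate",
            params={"text": text, "confidence": confidence},
            headers={"Accept": "application/json"},
        )
        resp.raise_for_status()
        # Each resource carries the matched surface form and its DBpedia URI.
        return [(r["@surfaceForm"], r["@URI"])
                for r in resp.json().get("Resources", [])]

    print(spotlight_annotate("Whom did Xu Fan marry?"))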

Also, to make better use of the existing templates, it is more efficient to vectorize the new question templates generated in the previous work and compare them against the embeddings of the existing question templates.
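As a rough sketch of that comparison, assuming each template has already been reduced to a fixed-size vector (for instance by averaging pretrained word vectors), cosine similarity can rank the existing templates against a new one:

    import numpy as np

    def cosine(u, v):
        """Cosine similarity between two 1-D vectors."""
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def rank_templates(new_vec, existing):
        """Rank existing template embeddings (name -> vector) by similarity."""
        scores = {name: cosine(new_vec, vec) for name, vec in existing.items()}
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)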

What’s more, thanks to some helpful advice, the project is moving forward in an insightful and pioneering direction.

1. Issue Analysis

1.1 Mismatches When Trained Models Work With Unseen Vocabulary

Case 1: Entity Ambiguity

For example, when the NSpM model took as input a natural language question containing "region", it mapped the word to "dbr:wineRegion", the only term similar to "region" that it had learned, which may be biased and divergent from the original sense.

Case 2: Misused Mappings

That is to say, when the model handles vocabulary outside the topic it was trained on, it mistakenly slots the unseen word into the place of another word it has learned, and the output is then mismatched to the value of the substituted word.
  • For example, if we use the model trained on the dbo:Monument template set to infer questions about locations, which contain vocabulary absent from the training data, then "rdf:type dbo:Place" and "dbo:location" are two of the tokens most frequently misused in the inference output for such unseen vocabulary.

Case 3: Reproducibility Issues

Models that have attained a good BLEU score may still not perfectly translate natural language questions with new, different vocabulary into the correct query.
  • For example, I picked some questions from the training set and fed them to the trained model to test whether it could reproduce the queries paired with those questions in the training set, such as:

“whom did xu fan marry?”,

the generated result was

“select distinct var_uri where brack_open dbr_Rugrats dbo_composer var_uri”,

Apparently, "dbr_Rugrats" and "dbo_composer" are not related to the question. Moreover, the crucial entity, the person named "Xu Fan" (i.e. "dbr_Xu_Fan"), was not properly recognized, so the output does not match the correct paired query

“select distinct var_uri where brack_open var_uri dbo_spouse dbr_Xu_Fan sep_dot brack_close” in the training set.
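The check itself is easy to script; here is a hedged sketch in which model_translate stands in for whatever interface the trained model exposes (a hypothetical wrapper, not an actual NSpM function):

    def reproduction_rate(model_translate, training_pairs):
        """Fraction of training questions whose output equals the gold query.

        training_pairs: (question, gold_query) pairs drawn from the training set.
        model_translate: callable mapping a question string to a query string.
        """
        hits = 0
        for question, gold_query in training_pairs:
            if model_translate(question).strip() == gold_query.strip():
                hits += 1
        return hits / len(training_pairs)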

Summing Up:

The current neural SPARQL machine has a strong dependency on the vocabulary it has learned. Also, the present model design mostly focuses on fitting a neural machine translation between the natural language questions and the SPARQL queries.

1.2 Expanding the Question

So, what should we do next?

I got some helpful pointers to research in NL2SQL that majors in deploying reinforcement learning methods to train the neural model toward more and more correct returned answers.

Figure 1: Seq2SQL (Zhong et al.) [1]

To elaborate on this direction of research, the paper MAPO (Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing) [2], from NIPS 2018, is worthy of our attention.

MAPO is a weakly supervised reinforcement learning method. The paper casts the NL2SQL task as a reinforcement learning task by mapping the basic components of NL2SQL onto the basic elements of reinforcement learning. In MAPO, the state x of reinforcement learning is the input natural language question together with its corresponding environment (e.g. an interpreter, a knowledge graph, or, typically in the paper's experiments, a database), and the action space A is the collection of all possible programs for the current natural language question. Each action sequence a, i.e. each reinforcement learning trajectory, corresponds to one possible program.

It can be seen that the key to the algorithm is training the policy function. In MAPO, the authors use a seq2seq model to fit the policy, so training the policy function is equivalent to training the seq2seq model.

It is worth noting that in reinforcement learning, the parameter update of the policy function differs from that of an ordinary deep neural network: it is based not on a loss function but on the expected return, and the parameters are updated in the direction that maximizes the expected return, which is given by the reward function.
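In the standard policy-gradient (REINFORCE) form, which is what I understand the paper builds on, the expected return and its gradient can be written as:

    O(\theta) = \mathbb{E}_{a \sim \pi_\theta(a \mid x)}[R(a)]

    \nabla_\theta O(\theta) = \mathbb{E}_{a \sim \pi_\theta(a \mid x)}[R(a)\,\nabla_\theta \log \pi_\theta(a \mid x)]

so each update makes high-reward programs more probable under the policy.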

Since our task is to generate program statements, we can simply run the generated program in the real environment and compare the result with the label from the weakly supervised training data to obtain a binary 0-1 reward function. The core idea is to correct the agent's behavior by interacting with the environment, so as to achieve the effect of "learning".
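For our setting, such a reward could look like the following sketch, where execute_query is a hypothetical helper that runs a generated SPARQL query against an endpoint and returns its result set:

    def binary_reward(generated_query, gold_answers, execute_query):
        """Return 1.0 iff the generated query yields exactly the labeled answers.

        execute_query: hypothetical helper running a SPARQL query and
        returning its result bindings.
        """
        try:
            returned = set(execute_query(generated_query))
        except Exception:  # malformed programs simply earn no reward
            return 0.0
        return 1.0 if returned == set(gold_answers) else 0.0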

Based on the above reinforcement learning ideas, MAPO proposes the following innovations in its implementation. First, to improve training efficiency, the MAPO algorithm stores sampled high-reward programs in a memory buffer. When training the network that fits the policy function, the objective of maximizing expected return is divided into two parts: the expected return over programs sampled inside the memory buffer, and the expected return over programs sampled outside the buffer, as shown in the following figure.


Figure 2: MAPO (NIPS 2018) [2]
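As I read the paper, this split objective can be written as:

    O_{\text{MAPO}}(\theta) = \pi_B\,\mathbb{E}_{a \sim \pi_\theta^+}[R(a)] + (1 - \pi_B)\,\mathbb{E}_{a \sim \pi_\theta,\; a \notin \mathcal{B}}[R(a)]

where \mathcal{B} is the memory buffer, \pi_B = \sum_{a \in \mathcal{B}} \pi_\theta(a) is the total probability mass of the programs inside it, and \pi_\theta^+ is the policy renormalized over the buffer. The first term can be computed exactly by enumerating the buffer, which lowers the variance of the gradient estimate.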

The team [2] applied MAPO to the WikiSQL dataset. For each sample in WikiSQL, 1,000 candidate programs were generated, and five high-reward samples were stored in the memory buffer. GloVe embeddings were fed to the policy-fitting network, the LSTM hidden size was 200, and training ran for 15,000 steps.

Figure 3: performance and experiments of MAPO (Figure 2 in [2])

For the machine translation approach, learning the connection between the natural language question and the structured query is the major objective: the task is designed to make the generated query fit the target query as closely as possible during training.
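Concretely, this is the usual token-level cross-entropy over the target query y given the question x:

    \mathcal{L}_{\text{MT}}(\theta) = -\sum_t \log p_\theta(y_t \mid y_{<t}, x)

which rewards matching the reference query token by token, regardless of whether the query would actually return the right answer.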

In parallel, research in NL2SQL places comparatively stronger emphasis on the result that the generated query returns and on its correctness, because the correct return of the query is their only goal, not the matching similarity of the queries.

That is to say, their models build a more direct function from the natural language question, through the generation of the structured query, to the returned answer. On this basis, reinforcement learning algorithms of this type receive a reward according to whether the result of the generated query is right or wrong, and thus learn to make the query generator produce more and more pertinent returned answers.

To sum up, the machine-translation-based model is query-driven, while the reinforcement learning model, which tries to take a step further, can be classified as a final-answer-driven method.

However, this type of final-answer-driven model has a prerequisite: to work on our problem, the natural language questions, the structured queries, and the answer entities must be embedded into a computable form.

2. Vector Embeddings

Then, why embed the data?

The problems of restricted vocabulary mapping and misused vocabulary matching can, in my humble opinion, be traced back to the limited vocabulary the model was able to learn from the training data set.

So, how could we remove this limitation from the current model?

There are two aspects to pay attention to:

1) Uniqueness and Directivity: the embedding vector must be unique in representing its entity, otherwise mappings could conflict. It must be guaranteed that, in the vector space, each entity-vector pair is matched in exactly one unique key-value pair (see the sanity-check sketch after this list).

  • For example, the token "brack_close" was duplicated in the generated data file of the training data set for the "place_v2" topic.

2) Comprehensive Inclusion: the vector set should, comprehensively and accurately, comprise all the entities and relation properties in the DBpedia space, and also include embeddings of the query-grammar keywords, e.g. "SELECT", "DISTINCT", "WHERE", etc.
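For the first requirement, a few lines suffice as a sanity check for duplicated tokens in a generated vocabulary file (the path below is illustrative, not the project's actual layout):

    from collections import Counter

    def find_duplicates(vocab_path):
        """Report tokens that appear more than once in a vocabulary file."""
        with open(vocab_path) as f:
            tokens = [line.strip() for line in f if line.strip()]
        return [tok for tok, n in Counter(tokens).items() if n > 1]

    # This kind of check flags the duplicated "brack_close" in the place_v2 data.
    print(find_duplicates("data/place_v2/vocab.sparql"))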

After attempting to train a new vector set on the given templates, I figured out that it might be more efficient to employ the DBpedia embedding vectors from the previous projects.
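Assuming those pretrained vectors come in the common word2vec text format (an assumption about the artifacts, not a confirmed detail), they can be loaded directly with gensim; the file name is illustrative:

    from gensim.models import KeyedVectors

    # Path is illustrative; any word2vec-format DBpedia embedding file works.
    vectors = KeyedVectors.load_word2vec_format(
        "dbpedia_entity_vectors.txt", binary=False)

    # Look up an entity vector and its nearest neighbours.
    print(vectors.most_similar("dbr:Xu_Fan", topn=5))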

References

[1] Victor Zhong, Caiming Xiong, Richard Socher (2017). Seq2SQL: Generating Structured Queries from Natural Language Using Reinforcement Learning.

[2] Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc V. Le, Ni Lao (2018). Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing. NIPS 2018.