So far in our paper, the reward values demonstrated in the examples are fixed: correct labels receive a fixed reward c1 and wrong labels a fixed penalty c2,
and typically we have c1 > c2.
This strategy makes the RL-based approach no different from classification approaches with cross-entropy in terms of “treating wrong labels equally”, as discussed in the introductory section. Moreover, recent RL approaches to relation extraction [21, 22] adopt fixed reward values for the different phases of entity and relation detection based on empirical tuning, which requires additional tuning work when switching to another data set or schema.
In the event extraction task, entity, event and argument role labels form a complex structure with varying difficulty. Errors should be evaluated case by case, and from epoch to epoch. In the earlier epochs, when the parameters of the neural networks are only slightly optimized, all errors are tolerable; e.g., in sequence labeling, the extractor within the first two or three iterations usually labels most tokens with O labels. As the epoch number increases, the extractor is expected to output more correct labels. However, the extractor may make repeated mistakes (e.g., it persistently labels “death” as O in the example sentence “... are punishable by death ... ” during multiple epochs) or get stuck in difficult cases (e.g., whether the FAC (facility) token “bridges” serves as a Place or Target role in an Attack event triggered by “bombed” in the sentence “US aircraft bombed Iraqi tanks holding bridges... ”). A mechanism is therefore required to assess these challenges and to correct them with salient and dynamic rewards.
We describe the training approach as a process in which the extractor (agent A) imitates the ground-truth (expert E); during this process, a mechanism ensures that the highest reward values are issued to correct labels (actions a), including the ones from both expert E and agent A:
This mechanism is Inverse Reinforcement Learning, which first estimates the reward in an RL framework.
Equation (12) reveals an adversarial scenario between the ground truth and the extractor, and GAIL, which is based on GAN, fits such an adversarial nature.
In the original GAN, a generator generates (fake) data and attempts to confuse a discriminator D, which is trained to distinguish fake data from real data. In our proposed GAIL framework, the extractor (agent A) substitutes for the generator and submits labels to the discriminator D; the discriminator D, which now serves as the reward estimator, aims to issue the largest rewards to labels (actions) from the ground-truth (expert E) or identical ones from the extractor, and lower rewards to other (wrong) labels.
Rewards R(s, a) and the output of D are now equivalent, and we ensure:
\[
R(s, a_E) \ge R(s, a_A) \;\Leftrightarrow\; D(s, a_E) \ge D(s, a_A), \tag{13}
\]
where s, a_E and a_A are inputs of the discriminator. In the sequence labeling task, s consists of the context embedding of the current token v_t and a one-hot vector that represents the previous action a_{t−1} according to Equation (1); in the argument role labeling task, s comes from the representations of all elements mentioned in Equation (6). The vector a_E is a one-hot vector of the ground-truth label (expert, or “real data”), while a_A denotes the counterpart from the extractor (agent, or “generator”). The concatenation of s and a_E is the input for the “real data” channel, while s and a_A form the input for the “generator” channel of the discriminator.
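The two input channels described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the helper names `one_hot` and `discriminator_inputs`, the 3-dimensional state and the 4-label inventory are all hypothetical.

```python
def one_hot(index, size):
    """Return a one-hot list of length `size` with 1.0 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def discriminator_inputs(state, expert_label, agent_label, num_labels):
    """Build the two channel inputs: [s; a_E] ("real data") and [s; a_A] ("generator")."""
    real = list(state) + one_hot(expert_label, num_labels)
    generated = list(state) + one_hot(agent_label, num_labels)
    return real, generated


# Toy example: hypothetical 3-dimensional state and 4 possible labels.
state = [0.2, -0.1, 0.7]
real, generated = discriminator_inputs(
    state, expert_label=2, agent_label=0, num_labels=4
)
print(real)
print(generated)
```

When the extractor's label matches the ground truth, the two channels receive identical vectors, which corresponds to the "identical ones from the extractor" case mentioned earlier.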
In our framework, due to the different dimensions in the two tasks and event types, we have 34 discriminators (one for sequence labeling, and 33 for event argument role labeling, one for each of the 33 event types). Every discriminator consists of two fully-connected layers with a sigmoid output. The original output of D denotes a probability bounded in [0,1], and we use a linear transformation to shift and expand it:
\[
R(s, a) = \alpha \, (D(s, a) - \beta), \tag{14}
\]
e.g., in our experiments, we set α = 20 and β = 0.5, which makes R(s, a) ∈ [−10, 10].
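The rescaling of the discriminator output can be sketched in a few lines. The sketch assumes that the transformation takes the form α(D − β), which with β = 0.5 maps [0, 1] onto a range symmetric around zero; the function name is hypothetical.

```python
ALPHA = 20.0  # scale factor, value reported in the text
BETA = 0.5    # shift, value reported in the text


def reward_from_probability(d, alpha=ALPHA, beta=BETA):
    """Map a discriminator probability d in [0, 1] to a reward.

    Assumption: the linear transformation has the form alpha * (d - beta).
    """
    if not 0.0 <= d <= 1.0:
        raise ValueError("d must be a probability in [0, 1]")
    return alpha * (d - beta)


for d in (0.0, 0.5, 0.75, 1.0):
    print(d, reward_from_probability(d))
```

Under this assumption, a discriminator output of 0.5 (maximal uncertainty) yields a zero reward, and the sign of the reward indicates which side of that threshold the output falls on.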
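To pursue Equation (13), we minimize the following loss function and optimize the parameters of the neural network:
\[
J_D = -\left( \mathbb{E}\left[\log D(s, a_E)\right] + \mathbb{E}\left[\log\left(1 - D(s, a_A)\right)\right] \right). \tag{15}
\]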
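A discriminator objective of this kind can be sketched as the standard GAN binary cross-entropy over a mini-batch. This is an assumption about the form of the loss rather than a transcription of it; the function name and the `eps` guard against log(0) are hypothetical.

```python
import math


def discriminator_loss(d_expert, d_agent, eps=1e-12):
    """Binary cross-entropy over a mini-batch.

    d_expert: outputs D(s, a_E) for the "real data" channel
    d_agent:  outputs D(s, a_A) for the "generator" channel
    eps:      hypothetical numerical guard against log(0)
    """
    if not d_expert or not d_agent:
        raise ValueError("both batches must be non-empty")
    real_term = sum(math.log(max(d, eps)) for d in d_expert) / len(d_expert)
    fake_term = sum(math.log(max(1.0 - d, eps)) for d in d_agent) / len(d_agent)
    return -(real_term + fake_term)


# A discriminator that separates the two channels has a lower loss
# than one that outputs 0.5 everywhere.
good = discriminator_loss([0.9, 0.8], [0.1, 0.2])
poor = discriminator_loss([0.5, 0.5], [0.5, 0.5])
print(good < poor)
```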
During the training process, after we feed the neural network mentioned in Sections 4.1 and 4.2 with a mini-batch of the data, we collect the features (or states s), the corresponding extractor labels (agent actions a_A) and the ground-truth labels (expert actions a_E) to update the discriminators according to Equation (15); then we feed the features and extractor labels into the discriminators to acquire reward values and train the extractor, which is the generator from the GAN's perspective.
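The alternation described above (update the discriminator, then score the extractor's labels) can be sketched with a toy example. Several simplifications are assumed: a single logistic unit stands in for the two-layer discriminator, the loss is the standard GAN cross-entropy, the reward uses the assumed form α(D − β), and the extractor update itself is omitted. All class and function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


class TinyDiscriminator:
    """Single logistic unit over [s; a]; a stand-in for the two-layer network."""

    def __init__(self, input_dim, seed=0):
        rng = random.Random(seed)
        self.w = [rng.uniform(-0.1, 0.1) for _ in range(input_dim)]
        self.b = 0.0

    def prob(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def update(self, real_batch, fake_batch, lr=0.5):
        """One gradient step on -(mean log D(real) + mean log(1 - D(fake)))."""
        grad_w = [0.0] * len(self.w)
        grad_b = 0.0
        for x in real_batch:  # gradient of -log(sigmoid(z)) w.r.t. z is D - 1
            g = (self.prob(x) - 1.0) / len(real_batch)
            grad_w = [gw + g * xi for gw, xi in zip(grad_w, x)]
            grad_b += g
        for x in fake_batch:  # gradient of -log(1 - sigmoid(z)) w.r.t. z is D
            g = self.prob(x) / len(fake_batch)
            grad_w = [gw + g * xi for gw, xi in zip(grad_w, x)]
            grad_b += g
        self.w = [wi - lr * gw for wi, gw in zip(self.w, grad_w)]
        self.b -= lr * grad_b


def training_step(disc, states, expert_labels, agent_labels, num_labels,
                  alpha=20.0, beta=0.5):
    """Update the discriminator, then return rewards for the extractor's labels."""
    real = [s + one_hot(a, num_labels) for s, a in zip(states, expert_labels)]
    fake = [s + one_hot(a, num_labels) for s, a in zip(states, agent_labels)]
    disc.update(real, fake)
    # Assumed reward form: alpha * (D - beta).
    return [alpha * (disc.prob(x) - beta) for x in fake]


# Toy mini-batch: the extractor keeps predicting label 0 while the expert label is 1.
states = [[1.0, 0.0], [0.0, 1.0]]
expert_labels = [1, 1]
agent_labels = [0, 0]
disc = TinyDiscriminator(input_dim=2 + 3)
for _ in range(50):
    rewards = training_step(disc, states, expert_labels, agent_labels, num_labels=3)
print(rewards)
```

In this toy setting the repeated mistake drives the rewards for the wrong label further below zero with each update, while the same state paired with the expert label is scored above 0.5 by the discriminator.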
Since the discriminators are continuously optimized, if the extractor (generator) makes repeated mistakes or surprising ones (e.g., considering a PER as a Place), the margin between the rewards for correct and wrong labels expands, and the discriminators output rewards with larger absolute values. Hence, in the sequence labeling task, the Q-values are updated with a more discriminative difference, and, similarly, in the argument role labeling task, P(a | s) also increases or decreases more significantly with rewards of larger absolute value.
Figure 5 illustrates how we utilize a GAN for reward estimation.
An illustrative example of the GAN structure in the sequence labeling scenario (the argument role labeling scenario has an identical framework except for vector dimensions). As introduced in Section 5, the “real data” in the original GAN is replaced by the feature/state representation (Equation (1), or Equation (6) for the argument role labeling scenario) and ground-truth labels (expert actions) in our framework, while the “generator data” consists of features and the extractor's attempted labels (agent actions). The discriminator serves as the reward estimator, and a linear transformation is used to extend D's original output probability range of [0,1].
In cases where the discriminators are not sufficiently optimized (e.g., in early epochs) and may output undesired values (e.g., negative rewards for correct actions), we impose a hard margin
to ensure that correct actions always receive positive reward values and wrong ones receive negative values.
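One way to realize such a hard margin is to clamp the estimated reward on the appropriate side of zero. The sketch below is illustrative only: the function name and the margin value of 0.1 are hypothetical and not taken from the text.

```python
MARGIN = 0.1  # hypothetical margin value, chosen only for illustration


def apply_hard_margin(reward, is_correct, margin=MARGIN):
    """Force correct actions to at least +margin and wrong ones to at most -margin."""
    if margin <= 0:
        raise ValueError("margin must be positive")
    if is_correct:
        return max(margin, reward)
    return min(-margin, reward)


print(apply_hard_margin(-2.0, is_correct=True))   # clamped up to the margin
print(apply_hard_margin(7.5, is_correct=True))    # already positive, unchanged
print(apply_hard_margin(3.0, is_correct=False))   # clamped down to the negative margin
print(apply_hard_margin(-6.0, is_correct=False))  # already negative, unchanged
```

Rewards that already lie on the correct side of the margin pass through unchanged, so the clamp only intervenes while the discriminators are still poorly calibrated.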