We cast Chinese keyphrase extraction as a character-level sequence labeling task and used the IOB format as the input format for the model. This task can be formally stated as follows:
Let d = {w1, w2, …, wn} be an input text, where wt denotes the t-th element. If the input text mixes Chinese and English, the element is a character for Chinese and a word for English. Each wt in the text is assigned one of three class labels Y = {KB, KI, KO}, where KB denotes that wt is at the beginning of a keyphrase, KI denotes that wt is inside or at the end of a keyphrase, and KO denotes that wt is not part of any keyphrase. For example, consider the sentence “X 连锁先天性肾上腺发育不良患儿的临床及 NR0B1 基因突变分析 (Clinical and NR0B1 gene mutation analysis in children with X-linked congenital adrenal dysplasia)”, whose keyphrases are “X 连锁先天性肾上腺发育不良 (X-linked congenital adrenal dysplasia)” and “NR0B1 基因 (NR0B1 gene).”
After the IOB format transformation, the character-level tagging result of this sentence is shown in Figure 1. As the figure shows, we split the sentence according to language, taking each English word and each Chinese character as an elementary unit. This character-level formulation avoids the errors introduced by Chinese tokenizers, which have long been a troublesome problem in Chinese keyphrase extraction.
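This split of mixed-language text into elementary units can be sketched as follows (a minimal illustration with a hypothetical `split_units` helper, not the paper's actual code):

```python
import re

def split_units(text):
    """Split mixed Chinese/English text into elementary units:
    each contiguous run of ASCII letters/digits (an English word or a
    code such as NR0B1) is one unit, and every other non-whitespace
    character (e.g. a Chinese character) is its own unit."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)

units = split_units("X 连锁先天性 NR0B1 基因")
```

Here `re.findall` tries the multi-character English pattern first, so "NR0B1" stays a single unit while each Chinese character becomes a separate unit and whitespace is discarded.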
Figure 1. An example of character-level sequence labeling.
We collected data from the Chinese Science Citation Database, which contains more than 1,000 high-quality journals in mathematics, physics, chemistry, biology, medicine, health, and other fields. To ensure data quality, we applied constraints that restrict the data to Chinese medical records and exclude incomplete and duplicated records. The constraints were as follows:
(1) According to the Chinese Library Classification (CLC), the CLC codes of medical data start with the capital letter “R”. We therefore restricted the data to records whose CLC code metadata field starts with the capital letter “R”.
(2) The metadata field of language was set to Chinese.
(3) The metadata fields of title, abstract, and keyphrases were not null. Here, keyphrases refer to author-assigned keyphrases.
Statistics showed that 757,277 records met the above constraints in total. The title and abstract of each article were concatenated to form the source input text. Furthermore, keyphrases fall into two types: extractive keyphrases, which are present in the source input text, and abstractive keyphrases, which are absent from it. Because we formulated keyphrase extraction as a character-level sequence labeling task, which can only extract keyphrases present in the source input text, we considered only the extractive keyphrases.
For a given text, we wanted all author-assigned keyphrases to be extractive keyphrases, so that we could annotate as many extractive keyphrases as possible. To achieve this, we first matched each author-assigned keyphrase against the given text to check whether all author-assigned keyphrases could be found in the text, and then limited our dataset to records in which all author-assigned keyphrases are extractive. After filtering, 169,094 records remained. We aimed to construct a large-scale dataset for our deep neural network model because, although deep neural networks can learn highly non-linear features, they are prone to over-fitting compared with traditional machine learning methods.
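The filtering step can be sketched as follows (a hypothetical `all_extractive` helper; spaces are omitted from the example text so that substring matching succeeds):

```python
def all_extractive(text, keyphrases):
    """True iff every author-assigned keyphrase occurs verbatim in the
    source input text (title + abstract); records failing this check
    are dropped from the dataset."""
    return all(kp in text for kp in keyphrases)

# Record kept: both keyphrases appear verbatim in the text.
kept = all_extractive("X连锁先天性肾上腺发育不良患儿的NR0B1基因突变分析",
                      ["X连锁先天性肾上腺发育不良", "NR0B1基因"])
# Record dropped: the keyphrase is absent from the text.
dropped = all_extractive("X连锁先天性肾上腺发育不良患儿的基因突变分析",
                         ["NR0B1基因"])
```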
We chose 100,000 records as our training set, 6,000 records as our development set, and 3,094 records as our test set. The training set was used to train the keyphrase extraction model. The development set was used during training to monitor the generalization error of the model and to tune hyper-parameters. The test set was used to evaluate the performance of the model. Note that there was no overlap among the three sets. Next, we processed the three sets into IOB format to make them suitable for modeling the sequence labeling task.
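The partition into disjoint sets can be sketched as follows (illustrative sizes and a hypothetical `split_records` helper; the actual sizes are 100,000 / 6,000 / 3,094):

```python
import random

def split_records(records, n_train, n_dev, seed=42):
    """Shuffle records and partition them into disjoint
    train / dev / test sets; the test set takes the remainder."""
    rng = random.Random(seed)
    records = records[:]          # avoid mutating the caller's list
    rng.shuffle(records)
    train = records[:n_train]
    dev = records[n_train:n_train + n_dev]
    test = records[n_train + n_dev:]
    return train, dev, test

train, dev, test = split_records(list(range(20)), n_train=12, n_dev=5)
```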
In this paper, we compare word-level and character-level formulations for Chinese keyphrase extraction, so we constructed datasets for character-level and word-level sequence labeling separately. To generate the word-level dataset, we used the Chinese tokenizer Jieba (https://github.com/fxsjy/jieba) to segment words. The tagging process was almost the same as that of the character-level dataset construction, except that we tagged words rather than characters. An example of word-level sequence labeling is shown in Figure 2.
Figure 2. An example of word-level sequence labeling.
For character-level IOB format generation, we performed the following preprocessing steps:
(1) Using Unicode code points to distinguish Chinese and English. Because English words and Chinese characters are mixed together in Chinese medical abstracts, we used Unicode code points to distinguish English from Chinese. Our datasets thus split English words and Chinese characters properly, with the English word and the Chinese character as the minimal units, respectively.
(2) Converting half-width punctuation to full-width. Punctuation in Chinese medical text comes in two formats: full width and half width. Authors may be inconsistent about punctuation width, which prevents keyphrases from matching the abstract. For example, an author might provide the keyphrase “er:yag 激光” (er:yag laser) with a half-width colon but write “er：yag 激光” with a full-width colon in the abstract. We therefore converted all half-width punctuation to full width, with the exception of the full stop.
(3) Dealing with special characters. Scientific Chinese medical abstracts contain many special characters, sometimes with space characters next to them and sometimes not. To unify the format, we dropped all space characters adjacent to special characters.
(4) Lowercasing. We converted all English words to lowercase.
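Preprocessing steps (1), (2), and (4) can be sketched as follows (hypothetical helpers `to_fullwidth_punct` and `is_chinese`; the full-stop exception and code-point ranges follow the rules above):

```python
def to_fullwidth_punct(text):
    # Convert half-width ASCII punctuation (U+0021..U+007E, non-alphanumeric)
    # to its full-width counterpart by adding 0xFEE0 to the code point,
    # keeping letters, digits, and the full stop '.' unchanged.
    out = []
    for ch in text:
        if ch != '.' and 0x21 <= ord(ch) <= 0x7E and not ch.isalnum():
            out.append(chr(ord(ch) + 0xFEE0))
        else:
            out.append(ch)
    return ''.join(out)

def is_chinese(ch):
    # Distinguish Chinese from English via Unicode code points:
    # CJK Unified Ideographs occupy U+4E00..U+9FFF.
    return '\u4e00' <= ch <= '\u9fff'

# Lowercase English, then normalize punctuation width.
normalized = to_fullwidth_punct("er:yag 激光".lower())
```

After normalization, the colon in the keyphrase and in the abstract have the same width, so substring matching succeeds.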
After preprocessing, we performed the tagging process: we matched the keyphrases against the source input text to find the locations where keyphrases appear, tagged the characters within those locations with label “B” or label “I”, and tagged all other characters with label “O”. The first character of a keyphrase is tagged with label “B”, and every subsequent character of the keyphrase is tagged with label “I”.
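The tagging step can be sketched as follows, assuming keyphrase locations are already given as inclusive (start, end) index spans over the character sequence (the `tag_iob` helper is hypothetical):

```python
def tag_iob(units, keyphrase_spans):
    """Assign 'B' to the first unit of each matched keyphrase span,
    'I' to the remaining units of the span, and 'O' everywhere else.
    Spans are (start, end) inclusive indices into the unit sequence."""
    labels = ['O'] * len(units)
    for start, end in keyphrase_spans:
        labels[start] = 'B'
        for i in range(start + 1, end + 1):
            labels[i] = 'I'
    return labels

# The keyphrase "X连锁先天性肾上腺发育不良" occupies positions 0 through 12.
units = list("X连锁先天性肾上腺发育不良患儿")
labels = tag_iob(units, [(0, 12)])
```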
Figure 3. An example of character-level IOB format generation.
Figure 3 shows an example of character-level IOB format generation. In this example, the keyphrase is “X 连锁先天性肾上腺发育不良” (X-linked congenital adrenal dysplasia). Matching the keyphrase returned the location span from 0 to 12, so we tagged the character at position 0 with label “B” and the characters at positions 1 through 12 with label “I”. All other characters, outside the span, were tagged with label “O”.
Note that two special cases arose in our tagging process, for which we applied the following rules:
(1) Given two author-assigned keyphrases of the input text, if the location span of one keyphrase contains that of the other, we apply a maximum-matching rule and tag only the longer keyphrase. For example:
Text: “穴位注射罗哌卡因分娩镇痛对产妇产程的影响” (Effect of acupoint injection of ropivacaine labor analgesia on maternal labor)
This text has two author-assigned keyphrases: “分娩” (childbirth) and “分娩镇痛” (labor analgesia). The location span of “分娩” is positions 8 to 9, while that of “分娩镇痛” is positions 8 to 11, so we tagged the characters of the longer keyphrase “分娩镇痛” (labor analgesia) with labels “B” and “I”.
(2) If the first few characters of one keyphrase are identical to the last few characters of another keyphrase, and the first keyphrase appears immediately after the other in the given text, we concatenate the two keyphrases through their common characters. For example:
Text: “术中经食管超声心动图对心脏瓣膜置换术后即刻人工瓣膜功能异常的诊断价值” (Diagnostic value of intraoperative transesophageal echocardiography for abnormal prosthetic valve function in the immediate postoperative period after heart valve replacement)
This text has two author-assigned keyphrases: “人工瓣膜” (prosthetic valve) and “瓣膜功能异常” (abnormal valve function). These two keyphrases share the common characters “瓣膜” (valve) and appear next to each other in the text, so we tagged the merged span “人工瓣膜功能异常” (abnormal prosthetic valve function) instead of either “人工瓣膜” (prosthetic valve) or “瓣膜功能异常” (abnormal valve function) alone. This step makes our dataset suitable for flat keyphrase extraction rather than nested keyphrase extraction: each character is assigned exactly one label.
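Both special cases amount to merging keyphrase match spans that contain or overlap one another before tagging, so that each character receives a single label. A minimal sketch (the `resolve_spans` helper is hypothetical):

```python
def resolve_spans(spans):
    """Merge (start, end) inclusive spans that are contained in or
    overlap a neighboring span, keeping one maximal span per region
    so every character gets exactly one label (flat, not nested)."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Contained or overlapping: extend the previous span.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

contained = resolve_spans([(8, 9), (8, 11)])   # rule (1): keep the longer span
overlapping = resolve_spans([(0, 3), (2, 5)])  # rule (2): concatenate via overlap
```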
To examine the quality of our datasets, we counted the number of recognized keyphrases, the number of correctly recognized keyphrases, and the number of ground-truth keyphrases in the generated datasets, and used the evaluation measures described in Section 3.2 to assess IOB generation performance. The character-level and word-level IOB generation results are summarized in Table 1 and Table 2, respectively.
Table 1 Character-level IOB generation results on data sets.

| Data Set | P | R | F | Recognized Keyphrases | Correctly Recognized Keyphrases | Ground-truth Keyphrases |
|---|---|---|---|---|---|---|
| Training Set | 99.18% | 99.42% | 99.30% | 416,013 | 409,371 | 408,373 |
| Development Set | 99.13% | 99.54% | 99.34% | 25,942 | 26,169 | 26,061 |
| Test Set | 99.15% | 99.56% | 99.36% | 13,344 | 13,458 | 13,403 |
Table 2 Word-level IOB generation results on data sets.

| Data Set | P | R | F | Recognized Keyphrases | Correctly Recognized Keyphrases | Ground-truth Keyphrases |
|---|---|---|---|---|---|---|
| Training Set | 91.15% | 96.93% | 93.96% | 395,852 | 434,266 | 408,373 |
| Development Set | 91.35% | 97.03% | 94.11% | 25,287 | 27,680 | 26,061 |
| Test Set | 90.99% | 97.11% | 93.95% | 13,016 | 14,305 | 13,403 |
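For reference, the evaluation measures behind these tables are the standard exact-match precision, recall, and F1 computed from the three counts; a minimal sketch with illustrative numbers (not the actual counts from the tables):

```python
def prf(n_correct, n_recognized, n_gold):
    # Precision: fraction of recognized keyphrases that are correct.
    # Recall: fraction of ground-truth keyphrases that were recognized.
    # F1: harmonic mean of precision and recall.
    p = n_correct / n_recognized
    r = n_correct / n_gold
    f = 2 * p * r / (p + r)
    return p, r, f

p, r, f = prf(n_correct=90, n_recognized=100, n_gold=95)
```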
As the tables show, the F1-score of each character-level dataset exceeded that of the corresponding word-level dataset by more than 5 percentage points. For the character-level datasets, the evaluation measures do not reach 100% owing to the above-mentioned rules applied during IOB generation, but the results on all three sets still show that our datasets are of good quality. For the word-level datasets, segmentation errors of the Chinese tokenizer are a critical reason the evaluation measures are lower than those of the character-level datasets. Taking the example from Section 3.1, whose word-level tagging result is shown in Figure 2, one incorrect keyphrase “nr0b1 基因突变” (nr0b1 gene mutation) was produced where “nr0b1 基因” (nr0b1 gene) was expected. Besides incorrectly tagged keyphrases, segmentation errors can also cause keyphrases to be missed entirely in word-level sequence labeling.