{"title": "Glyce: Glyph-vectors for Chinese Character Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 2746, "page_last": 2757, "abstract": "It is intuitive that NLP tasks for logographic languages like Chinese should benefit from the use of the glyph information in those languages. However, due to the lack of rich pictographic evidence in glyphs and the weak generalization ability of  standard computer vision models on character data, an effective way to utilize the glyph information remains to be found.\r\n\r\nIn this paper, we address this gap by presenting  Glyce, the glyph-vectors for Chinese character representations. We make three major innovations:   (1) We use historical Chinese scripts (e.g., bronzeware script, seal script, traditional Chinese, etc) to enrich the pictographic evidence in characters;    (2) We design CNN structures (called tianzege-CNN) tailored to Chinese character image processing; and   (3) We use image-classification as an auxiliary task in a  multi-task learning setup to increase the model's ability to generalize.   \r\n\r\nWe show that glyph-based models are able to consistently outperform word/char ID-based models  in a wide range of Chinese NLP tasks. When combing with BERT,  we  are able to  set new state-of-the-art results for a variety of Chinese NLP tasks, including  language modeling, tagging (NER, CWS, POS), \r\nsentence pair classification (BQ, LCQMC,  XNLI, NLPCC-DBQA), \r\nsingle sentence classification tasks (ChnSentiCorp, the Fudan corpus, iFeng),\r\ndependency parsing, and semantic role labeling. \r\nFor example, the proposed model achieves an F1 score of 81.6 on the OntoNotes dataset of NER, +1.5 over BERT; it achieves an almost perfect accuracy of 99.8\\% on the the Fudan corpus for text classification.", "full_text": "Glyce: Glyph-vectors for Chinese Character\n\nRepresentations\n\nYuxian Meng*, Wei Wu*, Fei Wang*, Xiaoya Li*, Ping Nie, Fan Yin\n\nMuyu Li, Qinghong Han, Xiaofei Sun and Jiwei Li\n\n{yuxian meng, wei wu, fei wang, xiaoya li, ping nie, fan yin,\nmuyu li, qinghong han, xiaofei sun, jiwei li}@shannonai.com\n\nShannon.AI\n\nAbstract\n\nIt is intuitive that NLP tasks for logographic languages like Chinese should bene\ufb01t\nfrom the use of the glyph information in those languages. However, due to the\nlack of rich pictographic evidence in glyphs and the weak generalization ability of\nstandard computer vision models on character data, an effective way to utilize the\nglyph information remains to be found.\nIn this paper, we address this gap by presenting Glyce, the glyph-vectors for\nChinese character representations. We make three major innovations: (1) We use\nhistorical Chinese scripts (e.g., bronzeware script, seal script, traditional Chinese,\netc) to enrich the pictographic evidence in characters; (2) We design CNN structures\n(called tianzege-CNN) tailored to Chinese character image processing; and (3)\nWe use image-classi\ufb01cation as an auxiliary task in a multi-task learning setup to\nincrease the model\u2019s ability to generalize.\nWe show that glyph-based models are able to consistently outperform word/char\nID-based models in a wide range of Chinese NLP tasks. We are able to set new state-\nof-the-art results for a variety of Chinese NLP tasks, including tagging (NER, CWS,\nPOS), sentence pair classi\ufb01cation, single sentence classi\ufb01cation tasks, dependency\nparsing, and semantic role labeling. 
For example, the proposed model achieves an\nF1 score of 80.6 on the OntoNotes dataset of NER, +1.5 over BERT; it achieves an\nalmost perfect accuracy of 99.8% on the Fudan corpus for text classi\ufb01cation. 1 2\n\n1\n\nIntroduction\n\nChinese is a logographic language. The logograms of Chinese characters encode rich information of\ntheir meanings. Therefore, it is intuitive that NLP tasks for Chinese should bene\ufb01t from the use of\nthe glyph information. Taking into account logographic information should help semantic modeling.\nRecent studies indirectly support this argument: Radical representations have proved to be useful\nin a wide range of language understanding tasks [Shi et al., 2015, Li et al., 2015, Yin et al., 2016,\nSun et al., 2014, Shao et al., 2017]. Using the Wubi scheme \u2014 a Chinese character encoding method\nthat mimics the order of typing the sequence of radicals for a character on the computer keyboard\n\u2014- is reported to improve performances on Chinese-English machine translation [Tan et al., 2018].\nCao et al. [2018] gets down to units of greater granularity, and proposed stroke n-grams for character\nmodeling.\nRecently, there have been some efforts applying CNN-based algorithms on the visual features of\ncharacters. Unfortunately, they do not show consistent performance boosts [Liu et al., 2017, Zhang\n\n1* indicates equal contribution.\n2code is available at https://github.com/ShannonAI/glyce.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fChinese\n\n\u91d1\u6587\n\u96b6\u4e66\n\u7bc6\u4e66\n\u9b4f\u7891\n\nEnglish\n\nBronzeware script\n\nClerical script\n\nSeal script\nTablet script\n\n\u7e41\u4f53\u4e2d\u6587\n\nTraditional Chinese\n\n\u7b80\u4f53\u4e2d\u6587(\u5b8b\u4f53)\n\u7b80\u4f53\u4e2d\u6587(\u4eff\u5b8b\u4f53)\n\nSimpli\ufb01ed Chinese - Song\n\nSimpli\ufb01ed Chinese - FangSong\n\n\u8349\u4e66\n\nCursive script\n\nTime Period\n\nShang and Zhou dynasty (2000 BC \u2013 300 BC)\n\nHan dynasty (200BC-200AD)\n\nHan dynasty and Wei-Jin period (100BC - 420 AD)\nNorthern and Southern dynasties 420AD - 588AD\n\n600AD - 1950AD (mainland China).\n\nstill currently used in HongKong and Taiwan\n\n1950-now\n1950-now\n\nJin Dynasty to now\n\nTable 1: Scripts and writing styles used in Glyce.\n\nand LeCun, 2017], and some even yield negative results [Dai and Cai, 2017]. For instance, Dai and\nCai [2017] run CNNs on char logos to obtain Chinese character representations and used them in the\ndownstream language modeling task. They reported that the incorporation of glyph representations\nactually worsens the performance and concluded that CNN-based representations do not provide\nextra useful information for language modeling. Using similar strategies, Liu et al. [2017] and Zhang\nand LeCun [2017] tested the idea on text classi\ufb01cation tasks, and performance boosts were observed\nonly in very limited number of settings. Positive results come from Su and Lee [2017], which found\nglyph embeddings help two tasks: word analogy and word similarity. Unfortunately, Su and Lee\n[2017] only focus on word-level semantic tasks and do not extend improvements in the word-level\ntasks to higher level NLP tasks such as phrase, sentence or discourse level. Combined with radical\nrepresentations, Shao et al. 
[2017] run CNNs on character \ufb01gures and use the output as auxiliary\nfeatures in the POS tagging task.\nWe propose the following explanations for negative results reported in the earlier CNN-based models\n[Dai and Cai, 2017]: (1) not using the correct version(s) of scripts: Chinese character system has a\nlong history of evolution. The characters started from being easy-to-draw, and slowly transitioned\nto being easy-to-write. Also, they became less pictographic and less concrete over time. The most\nwidely used script version to date, the Simpli\ufb01ed Chinese, is the easiest script to write, but inevitably\nloses the most signi\ufb01cant amount of pictographic information. For example, \u201d\u4eba\u201d (human) and \u201d\u5165\u201d\n(enter), which are of irrelevant meanings, are highly similar in shape in simpli\ufb01ed Chinese, but very\ndifferent in historical languages such as bronzeware script. (2) not using the proper CNN structures:\nunlike ImageNet images [Deng et al., 2009], the size of which is mostly at the scale of 800*600,\ncharacter logos are signi\ufb01cantly smaller (usually with the size of 12*12). It requires a different CNN\narchitecture to capture the local graphic features of character images; (3) no regulatory functions\nwere used in previous work: unlike the classi\ufb01cation task on the imageNet dataset, which contains\ntens of millions of data points, there are only about 10,000 Chinese characters. Auxiliary training\nobjectives are thus critical in preventing over\ufb01tting and promoting the model\u2019s ability to generalize.\nIn this paper, we propose GLYCE, the GLYph-vectors for Chinese character representations. We\ntreat Chinese characters as images and use CNNs to obtain their representations. We resolve the\naforementioned issues by using the following key techniques:\n\n\u2022 We use the ensemble of the historical and the contemporary scripts (e.g., the bronzeware\nscript, the clerical script, the seal script, the traditional Chinese etc), along with the scripts\nof different writing styles (e.g, the cursive script) to enrich pictographic information from\nthe character images.\n\u2022 We utilize the Tianzige-CNN (\u7530\u5b57\u683c) structures tailored to logographic character modeling.\n\u2022 We use multi-task learning methods by adding an image-classi\ufb01cation loss function to\n\nincrease the model\u2019s ability to generalize.\n\nGlyce is found to improve a wide range of Chinese NLP tasks. We are able to obtain the SOTA\nperformances on a wide range of Chinese NLP tasks, including tagging (NER, CWS, POS), sen-\ntence pair classi\ufb01cation (BQ, LCQMC, XNLI, NLPCC-DBQA), single sentence classi\ufb01cation tasks\n(ChnSentiCorp, the Fudan corpus, iFeng), dependency parsing, and semantic role labeling.\n\n2\n\n\fFigure 1: Illustration of the Tianzege-CNN used in Glyce.\n\n2 Glyce\n\n2.1 Using Historical Scripts\n\nAs discussed in Section 1, pictographic information is heavily lost in the simpli\ufb01ed Chinese script.\nWe thus propose using scripts from various time periods in history and also of different writing styles.\nWe collect the following major historical script with details shown in Table 1. Scripts from different\nhistorical periods, which are usually very different in shape, help the model to integrate pictographic\nevidence from various sources; Scripts of different writing styles help improve the model\u2019s ability to\ngeneralize. 
Both strategies are akin to widely-used data augmentation strategies in computer vision.\n\n2.2 The Tianzige-CNN Structure for Glyce\n\nDirectly using deep CNNs [He et al., 2016, Szegedy et al., 2016, Ma et al., 2018a] in our task results in very poor performance because of (1) the relatively small size of the character images: ImageNet images are usually at the scale of 800*600, while Chinese character images are significantly smaller, usually at the scale of 12*12; and (2) the lack of training examples: classification on the ImageNet dataset utilizes tens of millions of different images, whereas there are only about 10,000 distinct Chinese characters. To tackle these issues, we propose the Tianzige-CNN structure, which is tailored to Chinese character modeling as illustrated in Figure 1. Tianzige (\u7530\u5b57\u683c) is a traditional form of Chinese calligraphy: a four-squared format (similar to the Chinese character \u7530) used by beginners to learn to write Chinese characters. The input image x_image is first passed through a convolution layer with kernel size 5 and 1024 output channels to capture lower-level graphic features. A max-pooling of kernel size 4 is then applied to the feature map, which reduces the resolution from 8 \u00d7 8 to 2 \u00d7 2. This 2 \u00d7 2 tianzige structure reflects how radicals are arranged in Chinese characters and also the order in which Chinese characters are written. Finally, we apply group convolutions [Krizhevsky et al., 2012, Zhang et al., 2017] rather than conventional convolutional operations to map the tianzige grids to the final outputs. Group convolutional filters are much smaller than their normal counterparts, and thus are less prone to overfitting. It is fairly easy to adjust the model from a single script to multiple scripts, which can be achieved by simply changing the input from 2D (i.e., d_font \u00d7 d_font) to 3D (i.e., d_font \u00d7 d_font \u00d7 N_script), where d_font denotes the font size and N_script the number of scripts we use.\n\n2.3 Image Classification as an Auxiliary Objective\n\nTo further prevent overfitting, we use the task of image classification as an auxiliary training objective. The glyph embedding h_image from the CNN is forwarded to an image classification objective to predict its corresponding character ID. Suppose the label of image x is z. The training objective for the image classification task L(cls) is given as follows:\n\nL(cls) = - log p(z|x) = - log softmax(W \u00d7 h_image)    (1)\n\nLet L(task) denote the task-specific objective for the task we need to tackle, e.g., language modeling, word segmentation, etc. We linearly combine L(task) and L(cls), making the final training objective as follows:\n\nL = (1 - \u03bb(t)) L(task) + \u03bb(t) L(cls)    (2)\n\nwhere \u03bb(t) controls the trade-off between the task-specific objective and the auxiliary image-classification objective. \u03bb is a function of the number of epochs t: \u03bb(t) = \u03bb0 \u00b7 \u03bb1^t, where \u03bb0 \u2208 [0, 1] denotes the starting value and \u03bb1 \u2208 [0, 1] denotes the decaying value. This means that the influence from the image classification objective decreases as the training proceeds, with the intuitive explanation being that at the early stage of training, we need more regularization from the image classification task. Adding image classification as a training objective mimics the idea of multi-task learning.
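To make the structure and the combined objective concrete, the following PyTorch-style sketch shows one way the Tianzige-CNN and the loss in Eqs. (1)-(2) could be implemented. It is a minimal illustration, not the released Glyce code: the 1024-channel convolution, kernel sizes and 2 \u00d7 2 pooling follow the description above, while the number of scripts, group count, output dimension, character-vocabulary size and the values of \u03bb0 and \u03bb1 are placeholder assumptions.\n\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nclass TianzigeCNN(nn.Module):\n    # Sketch: conv (kernel 5) -> 2x2 max-pooled tianzige grid -> group conv -> glyph vector.\n    def __init__(self, n_scripts=8, mid_channels=1024, out_dim=512, n_chars=10000, groups=16):\n        super().__init__()\n        # each character is rendered as a stack of n_scripts 12x12 glyph images\n        self.conv = nn.Conv2d(n_scripts, mid_channels, kernel_size=5)   # 12x12 -> 8x8\n        self.pool = nn.MaxPool2d(kernel_size=4)                         # 8x8 -> 2x2 tianzige grid\n        self.group_conv = nn.Conv2d(mid_channels, out_dim, kernel_size=2, groups=groups)\n        self.cls_head = nn.Linear(out_dim, n_chars)                     # auxiliary char-ID classifier\n\n    def forward(self, images):                                          # images: (batch, n_scripts, 12, 12)\n        h = F.relu(self.conv(images))\n        h = self.pool(h)\n        h_image = self.group_conv(h).flatten(1)                         # glyph embedding h_image\n        return h_image, self.cls_head(h_image)                          # glyph vector and char-ID logits\n\ndef glyce_loss(task_loss, char_logits, char_ids, epoch, lambda0=0.1, lambda1=0.8):\n    # L = (1 - lambda(t)) * L(task) + lambda(t) * L(cls), with lambda(t) = lambda0 * lambda1 ** t\n    lam = lambda0 * (lambda1 ** epoch)\n    cls_loss = F.cross_entropy(char_logits, char_ids)                   # -log p(z|x) over character IDs\n    return (1 - lam) * task_loss + lam * cls_loss\n\nIn this sketch the auxiliary head simply predicts the character ID of the input glyph images, and its weight decays geometrically with the epoch index, matching the schedule above.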
2.4 Combining Glyph Information with BERT\n\nThe glyph embeddings can be directly output to downstream models such as RNNs, LSTMs, and transformers. Since large-scale pretrained language models such as BERT [Devlin et al., 2018], ELMo [Peters et al., 2018] and GPT [Radford et al., 2018] have proved to be effective in a wide range of NLP tasks, we explore the possibility of combining glyph embeddings with BERT embeddings. Such a strategy potentially endows the model with the advantages of both glyph evidence and large-scale pretraining. The overview of the combination is shown in Figure 2. The model consists of four layers: the BERT layer, the glyph layer, the Glyce-BERT layer and the task-specific output layer.\n\nFigure 2: Combining glyph information with BERT.\n\n\u2022 BERT Layer: Each input sentence S is concatenated with a special CLS token, denoting the start of the sentence, and a SEP token, denoting the end of the sentence. Given a pre-trained BERT model, the embedding for each token of S is computed using BERT. We use the output from the last layer of the BERT transformer to represent the current token.\n\n\u2022 Glyph Layer: the output glyph embeddings of S from the Tianzige-CNNs.\n\n\u2022 Glyce-BERT Layer: Position embeddings are first added to the glyph embeddings. The addition is then concatenated with the BERT embedding to obtain the full Glyce representation.\n\n\u2022 Task-specific Output Layer: Glyce representations are used to represent the token at each position, similar to word embeddings or ELMo embeddings [Peters et al., 2018]. Context-aware information has already been encoded in the BERT representations, but not in the glyph representations. We thus need additional context models to encode context-aware glyph representations. Here, we choose multi-layer transformers [Vaswani et al., 2017]. The output representations from the transformers are used as inputs to the prediction layer. It is worth noting that the representations of the special CLS and SEP tokens are maintained at the final task-specific embedding layer.
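The sketch below illustrates one possible wiring of the glyph layer, position embeddings, concatenation with the BERT output and the transformer-based context model described above. It is an approximation rather than the authors' released implementation: the BERT encoder is assumed to be run separately (only its last-layer states are consumed here), and the hidden sizes, head count and two-layer depth are assumptions (the depth matches the setting reported in Section 5.2).\n\nimport torch\nimport torch.nn as nn\n\nclass GlyceBertLayer(nn.Module):\n    # Sketch of the glyph/BERT fusion of Section 2.4; the BERT encoder itself is run outside\n    # this module and only its last-layer output (bert_out) is consumed here.\n    def __init__(self, glyph_cnn, bert_dim=768, glyph_dim=512, max_len=512, n_heads=8, n_layers=2):\n        super().__init__()\n        self.glyph_cnn = glyph_cnn                    # e.g. the TianzigeCNN sketched earlier\n        self.pos = nn.Embedding(max_len, glyph_dim)   # position embeddings added to glyph vectors\n        fused_dim = bert_dim + glyph_dim\n        layer = nn.TransformerEncoderLayer(d_model=fused_dim, nhead=n_heads, batch_first=True)\n        self.context = nn.TransformerEncoder(layer, num_layers=n_layers)\n\n    def forward(self, bert_out, glyph_images):\n        # bert_out:     (batch, seq_len, bert_dim) last-layer BERT states for [CLS, s, SEP]\n        # glyph_images: (batch, seq_len, n_scripts, 12, 12); CLS and SEP can be given blank images\n        b, t = glyph_images.shape[:2]\n        h_image, _ = self.glyph_cnn(glyph_images.flatten(0, 1))\n        glyph = h_image.view(b, t, -1) + self.pos(torch.arange(t, device=bert_out.device))\n        fused = torch.cat([bert_out, glyph], dim=-1)  # full Glyce representation\n        return self.context(fused)                    # context-aware input for the task-specific head\n\nThe output of self.context then feeds the task-specific head: a CRF layer for tagging, or a softmax over the CLS position for single-sentence and sentence-pair classification, as described in the next section.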
3 Tasks\n\nIn this section, we describe how glyph embeddings can be used for different NLP tasks. In the vanilla version, glyph embeddings are simply treated as character embeddings, which are fed to models built on top of the word-embedding layers, such as RNNs, CNNs or more sophisticated ones. If combined with BERT, we need to specifically handle the integration between the glyph embeddings and the pretrained embeddings from BERT in different scenarios, as discussed in order below.\n\nFigure 3: Using the Glyce-BERT model for different tasks.\n\nSequence Labeling Tasks: Many Chinese NLP tasks, such as named entity recognition (NER), Chinese word segmentation (CWS) and part-of-speech tagging (POS), can be formalized as character-level sequence labeling tasks, in which we need to predict a label for each character. For the Glyce-BERT model, the embedding output from the task-specific layer (described in Section 2.4) is fed to a CRF model for label prediction.\n\nSingle Sentence Classification: For text classification tasks, a single label is to be predicted for the entire sentence. In the BERT model, the representation for the CLS token in the final layer of BERT is output to the softmax layer for prediction. We adopt a similar strategy, in which the representation for the CLS token in the task-specific layer is fed to the softmax layer to predict labels.\n\nSentence Pair Classification: For sentence pair classification tasks like SNLI [Bowman et al., 2015], a model needs to handle the interaction between the two sentences and output a label for a pair of sentences. In the BERT setting, a sentence pair (s1, s2) is concatenated with one CLS and two SEP tokens, denoted by [CLS, s1, SEP, s2, SEP]. The concatenation is fed to the BERT model, and the obtained CLS representation is then fed to the softmax layer for label prediction. We adopt a similar strategy for Glyce-BERT, in which [CLS, s1, SEP, s2, SEP] is subsequently passed through the BERT layer, glyph layer, Glyce-BERT layer and the task-specific output layer. The CLS representation from the task-specific output layer is fed to the softmax function for the final label prediction.\n\n4 Experiments\n\nTo enable apples-to-apples comparison, we perform grid parameter search for both baselines and the proposed model on the dev set. The tasks that we work on are described in order below.\n\n4.1 Tagging\n\nNER: For the task of Chinese NER, we used the widely-used OntoNotes, MSRA, Weibo and resume datasets. Since most datasets don't have gold-standard segmentation, the task is normally treated as a char-level tagging task: outputting an NER tag for each character. The currently most widely used non-BERT model is the Lattice-LSTM [Yang et al., 2018, Zhang and Yang, 2018], which achieves better performance than CRF+LSTM [Ma and Hovy, 2016].\n\nCWS: The task of Chinese word segmentation (CWS) is normally treated as a char-level tagging problem. We used the widely-used PKU, MSR, CITYU and AS benchmarks from the SIGHAN 2005 bake-off for evaluation.\n\nPOS: The task of Chinese part-of-speech tagging is normally formalized as a character-level sequence labeling task, assigning labels to each of the characters within the sequence. We use the CTB5, CTB9 and UD1 (Universal Dependencies) benchmarks to test our models.\n\nOntoNotes | P | R | F\nCRF-LSTM | 74.36 | 69.43 | 71.81\nLattice-LSTM | 76.35 | 71.56 | 73.88\nGlyce+Lattice-LSTM | 82.06 | 68.74 | 74.81 (+0.93)\nBERT | 78.01 | 80.35 | 79.16\nGlyce+BERT | 81.87 | 81.40 | 80.62 (+1.46)\n\nMSRA | P | R | F\nCRF-LSTM | 92.97 | 90.80 | 91.87\nLattice-LSTM | 93.57 | 92.79 | 93.18\nGlyce+Lattice-LSTM | 93.86 | 93.92 | 93.89 (+0.71)\nBERT | 94.97 | 94.62 | 94.80\nGlyce+BERT | 95.57 | 95.51 | 95.54 (+0.74)\n\nWeibo | P | R | F\nCRF-LSTM | 51.16 | 51.07 | 50.95\nLattice-LSTM | 52.71 | 53.92 | 53.13\nGlyce+Lattice-LSTM | 53.69 | 55.30 | 54.32 (+1.19)\nBERT | 67.12 | 66.88 | 67.33\nGlyce+BERT | 67.68 | 67.71 | 67.60 (+0.27)\n\nresume | P | R | F\nCRF-LSTM | 94.53 | 94.29 | 94.41\nLattice-LSTM | 94.81 | 94.11 | 94.46\nGlyce+Lattice-LSTM | 95.72 | 95.63 | 95.67 (+1.21)\nBERT | 96.12 | 95.45 | 95.78\nGlyce+BERT | 96.62 | 96.48 | 96.54 (+0.76)\n\nTable 2: Results for NER tasks.\n\nResults for NER, CWS and POS are respectively shown in Tables 2, 3 and 4. When comparing with non-BERT models, Glyce+Lattice-LSTM performs better than all other non-BERT models across all datasets in all tasks. BERT outperforms non-BERT models on all datasets except Weibo.
This is due to the\ndiscrepancy between the dataset which BERT is pretrained on (i.e., wikipedia) and weibo. The\nGlyce-BERT model outperforms BERT and sets new SOTA results across all datasets, manifesting\nthe effectiveness of incorporating glyph information. We are able to achieve SOTA performances on\nall of the datasets using either Glyce model itself or BERT-Glyce model.\n\nModel\nYang et al. [2017]\nMa et al. [2018b]\nHuang et al. [2019]\nBERT\nGlyce+BERT\n\nModel\nYang et al. [2017]\nMa et al. [2018b]\nHuang et al. [2019]\nBERT\nGlyce+BERT\n\nPKU\nP\n-\n-\n-\n96.8\n97.1\n\nMSR\nP\n-\n-\n-\n98.1\n98.2\n\nR\n-\n-\n-\n96.3\n96.4\n\nR\n-\n-\n-\n98.2\n98.3\n\nF\n96.3\n96.1\n96.6\n96.5\n96.7\n(+0.2)\n\nF\n97.5\n98.1\n97.9\n98.1\n98.3\n(+0.2)\n\nCITYU\n\nModel\nYang et al. [2017]\nMa et al. [2018b]\nHuang et al. [2019]\nBERT\nGlyce+BERT\n\nP\n-\n-\n-\n97.5\n97.9\n\nModel\nYang et al. [2017]\nMa et al. [2018b]\nHuang et al. [2019]\nBERT\nGlyce+BERT\n\nAS\nP\n-\n-\n-\n96.7\n96.6\n\nR\n-\n-\n-\n97.7\n98.0\n\nR\n-\n-\n-\n96.4\n96.8\n\nF\n96.9\n97.2\n97.6\n97.6\n97.9\n(+0.3)\n\nF\n95.7\n96.2\n96.6\n96.5\n96.7\n(+0.2)\n\nTable 3: Results for CWS tasks.\n\n4.2 Sentence Pair Classi\ufb01cation\n\nFor sentence pair classi\ufb01cation tasks, we need to output a label for each pair of sentence. We employ\nthe following four different datasets: (1) BQ (binary classi\ufb01cation task) [Bowman et al., 2015]; (2)\nLCQMC (binary classi\ufb01cation task) [Liu et al., 2018], (3) XNLI (three-class classi\ufb01cation task)\n[Williams and Bowman], and (4) NLPCC-DBQA (binary classi\ufb01cation task) 3.\n\n3https://github.com/xxx0624/QA_Model\n\n6\n\n\fModel\nShao et al. [2017] (Sig)\nShao et al. [2017] (Ens)\nLattice-LSTM\nGlyce+Lattice-LSTM\n\nCTB5\nP\n93.68\n93.95\n94.77\n95.49\n\nBERT\nGlyce+BERT\n\n95.86\n96.50\n\nModel\nShao et al. [2017] (Sig)\nLattice-LSTM\nGlyce+Lattice-LSTM\n\nCTB6\nP\n-\n92.00\n92.72\n\nBERT\nGlyce+BERT\n\n94.91\n95.56\n\nR\n94.47\n94.81\n95.51\n95.72\n\n96.26\n96.74\n\nR\n-\n90.86\n91.14\n\n94.63\n95.26\n\nF\n94.07\n94.38\n95.14\n95.61\n(+0.47)\n96.06\n96.61\n(+0.55)\n\nF\n90.81\n91.43\n91.92\n(+0.49)\n94.77\n95.41\n(+0.64)\n\nModel\nShao et al. [2017] (Sig)\nShao et al. [2017] (Ens)\nLattice-LSTM\nLattice-LSTM+Glyce\n\nCTB9\nP\n91.81\n92.28\n92.53\n92.28\n\nBERT\nGlyce+BERT\n\nModel\nShao et al. [2017] (Sig)\nShao et al. [2017] (Ens)\nLattice-LSTM\nLattice-LSTM+Glyce\n\nBERT\nGlyce+BERT\n\n92.43\n93.49\n\nUD1\nP\n89.28\n89.67\n90.47\n91.57\n\n95.42\n96.19\n\nR\n94.47\n92.40\n91.73\n92.85\n\n92.15\n92.84\n\nR\n89.54\n89.86\n89.70\n90.19\n\n94.17\n96.10\n\nF\n91.89\n92.34\n92.13\n92.38\n(+0.25)\n92.29\n93.15\n(+0.86)\n\nF\n89.41\n89.75\n90.09\n90.87\n(+0.78)\n94.79\n96.14\n(+1.35)\n\nTable 4: Results for POS tasks.\n\nThe current non-BERT SOTA model is based on the bilateral multi-perspective matching model\n(BiMPM) [Wang et al., 2017], which speci\ufb01cally tackles the subunit matching between sentences.\nGlyph embeddings are incorporated into BiMPMs, forming the Glyce+BiMPM baseline. Results\nregarding each model on different datasets are given in Table 5. As can be seen, BiPMP+Glyce\noutperforms BiPMPs, achieving the best results among non-bert models. 
BERT outperforms all\nnon-BERT models, and BERT+Glyce performs the best, setting new SOTA results on all of the four\nbenchmarks.\n\nP\nModel\nBiMPM\n82.3\nGlyce+BiMPM 81.9\n\nBQ\nR\n81.2\n85.5\n\nBERT\nGlyce+BERT\n\n83.5\n84.2\n\n85.7\n86.9\n\nP\nModel\nBiMPM\n-\nGlyce+BiMPM -\n\nXNLI\nR\n-\n-\n\nBERT\nGlyce+BERT\n\n-\n-\n\n-\n-\n\nF\n81.7\n83.7\n(+2.0)\n84.6\n85.5\n(+0.9)\n\nF\n-\n-\n\n-\n-\n\nA\n81.9\n83.3\n(+1.4)\n84.8\n85.8\n(+1.0)\n\nA\n67.5\n67.7\n(+0.2)\n78.4\n79.2\n(+0.8)\n\nP\nModel\nBiMPM\n77.6\nGlyce+BiMPM 80.4\n\nLCQMC\nR\n93.9\n93.4\n\nBERT\nGlyce+BERT\n\n83.2\n86.8\n\n94.2\n91.2\n\nNLPCC-DBQA\n\nP\nModel\nBiMPM\n78.8\nGlyce+BiMPM 76.3\n\nR\n56.5\n59.9\n\nBERT\nGlyce+BERT\n\n79.6\n81.1\n\n86.0\n85.8\n\nF\n85.0\n86.4\n(+1.4)\n88.2\n88.8\n(+0.6)\n\nF\n65.8\n67.1\n(+1.3)\n82.7\n83.4\n(+0.7)\n\nA\n83.4\n85.3\n(+1.9)\n87.5\n88.7\n(+1.2)\n\nA\n-\n-\n-\n-\n-\n-\n\nTable 5: Results for sentence-pair classi\ufb01cation tasks.\n\nModel\nLSTM\n\nLSTM + Glyce\n\nBERT\n\nGlyce+BERT\n\nChnSentiCorp\n\n91.7\n93.1\n(+ 1.4)\n95.4\n95.9\n(+0.5)\n\nthe Fudan corpus\n\n95.8\n96.3\n(+0.5)\n99.5\n99.8\n(+0.3)\n\niFeng\n84.9\n85.8\n(+0.9)\n87.1\n87.5\n(+0.4)\n\nTable 6: Accuracies for Single Sentence Classi\ufb01cation task.\n\n7\n\n\fDependency Parsing\n\nModel\nBallesteros et al. [2016]\nKiperwasser and Goldberg [2016]\nCheng et al. [2016]\nBiaf\ufb01ne\nBiaf\ufb01ne+Glyce\n\nUAS\n87.7\n87.6\n88.1\n89.3\n90.2\n(+0.9)\n\nLAS\n86.2\n86.1\n85.7\n88.2\n89.0\n(+0.8)\n\nSemantic Role Labeling\n\nModel\nRoth and Lapata [2016]\nMarcheggiani and Titov [2017]\nHe et al. [2018]\nk-order pruning+Glyce\n\nP\n76.9\n84.6\n84.2\n85.4\n(+0.8)\n\nR\n73.8\n80.4\n81.5\n82.1\n(+0.6)\n\nF\n75.3\n82.5\n82.8\n83.7\n(+0.9)\n\nTable 7: Results for dependency parsing and SRL.\n\n4.3 Single Sentence Classi\ufb01cation\n\nFor single sentence/document classi\ufb01cation, we need to output a label for a text sequence. The label\ncould be either a sentiment indicator or a news genre. Datasets that we use include: (1) ChnSentiCorp\n(binary classi\ufb01cation); (2) the Fudan corpus (5-class classi\ufb01cation) [Li, 2011]; and (3) Ifeng (5-class\nclassi\ufb01cation).\nResults for different models on different tasks are shown in Table 6. We observe similar phenomenon\nas before: Glyce+BERT achieves SOTA results on all of the datasets. Speci\ufb01cally, the Glyce+BERT\nmodel achieves an almost perfect accuracy (99.8) on the Fudan corpus.\n\n4.4 Dependency Parsing and Semantic Role Labeling\n\nFor dependency parsing [Chen and Manning, 2014, Dyer et al., 2015], we used the widely-used\nChinese Penn Treebank 5.1 dataset for evaluation. Our implementation uses the previous state-of-the-\nart Deep Biaf\ufb01ne model Dozat and Manning [2016] as a backbone. We replaced the word vectors\nfrom the biaf\ufb01ne model with Glyce-word embeddings, and exactly followed its model structure and\ntraining/dev/test split criteria. We report scores for unlabeled attachment score (UAS) and labeled\nattachment score (LAS). Results for previous models are copied from [Dozat and Manning, 2016,\nBallesteros et al., 2016, Cheng et al., 2016]. Glyce-word pushes SOTA performances up by +0.9 and\n+0.8 in terms of UAS and LAS scores.\nFor the task of semantic role labeling (SRL) [Roth and Lapata, 2016, Marcheggiani and Titov, 2017,\nHe et al., 2018], we used the CoNLL-2009 shared-task. We used the current SOTA model, the k-order\npruning algorithm [He et al., 2018] as a backbone.4 We replaced word embeddings with Glyce\nembeddings. 
Glyce outperforms the previous SOTA performance by 0.9 with respect to the F1 score, achieving a new SOTA score of 83.7. BERT does not perform competitively in these two tasks, and results are thus omitted.\n\n4Code open sourced at https://github.com/bcmi220/srl_syn_pruning\n\n5 Ablation Studies\n\nIn this section, we discuss the influence of different factors of the proposed model. We use the LCQMC dataset of the sentence-pair prediction task for illustration. Factors that we discuss include the training strategy, the model architecture, the auxiliary image-classification objective, etc.\n\n5.1 Training Strategy\n\nThis section examines a training tactic (denoted by BERT-glyce-joint) in which, given task-specific supervision, we first fine-tune the BERT model, then freeze BERT to fine-tune the glyph layer, and finally jointly tune both layers until convergence. We compare this strategy with other tactics, including (1) the Glyph-Joint strategy, in which BERT is not fine-tuned in the beginning: we first freeze BERT to tune the glyph layer, and then jointly tune both layers until convergence; and (2) the joint strategy, in which we directly train the two models jointly until convergence.\n\nStrategy | P | R | F | Acc\nBERT-glyce-joint | 86.8 | 91.2 | 88.8 | 88.7\nGlyph-Joint | 82.5 | 94.0 | 87.9 | 87.1\njoint | 81.5 | 95.1 | 87.8 | 86.8\nonly BERT | 83.2 | 94.2 | 88.2 | 87.5\n\nTable 8: Impact of different training strategies.\n\nResults are shown in Table 8. As can be seen, BERT-glyce-joint outperforms the other two strategies. Our explanation for the inferior performance of the joint strategy is as follows: the BERT layer is pretrained but the glyph layer is randomly initialized. Given the relatively small amount of training signal, the BERT layer could be misled by the randomly initialized glyph layer at the early stage of training, leading to inferior final performance.
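As an illustration of the BERT-glyce-joint schedule, the sketch below shows the freeze/unfreeze order of the three stages. The sub-module names (model.bert, model.glyph_layers), the train_fn callback and the per-stage epoch counts are hypothetical placeholders; only the staging logic reflects the description above.\n\ndef set_trainable(module, flag):\n    # freeze or unfreeze every parameter of a sub-module\n    for p in module.parameters():\n        p.requires_grad = flag\n\ndef bert_glyce_joint(model, train_fn, stage_epochs=(2, 2, 4)):\n    # train_fn(model, epochs) is assumed to run ordinary supervised fine-tuning and to\n    # optimize only the parameters that currently have requires_grad=True.\n    # Stage 1: fine-tune BERT alone on the task-specific supervision.\n    set_trainable(model.bert, True)\n    set_trainable(model.glyph_layers, False)\n    train_fn(model, epochs=stage_epochs[0])\n    # Stage 2: freeze BERT and fine-tune the randomly initialized glyph layers.\n    set_trainable(model.bert, False)\n    set_trainable(model.glyph_layers, True)\n    train_fn(model, epochs=stage_epochs[1])\n    # Stage 3: unfreeze everything and jointly tune both layers until convergence.\n    set_trainable(model.bert, True)\n    train_fn(model, epochs=stage_epochs[2])\n\nUnder this schedule the randomly initialized glyph layers are warmed up before they can push gradients into BERT, which is the intuition given above for why the purely joint strategy underperforms.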
5.2 Structures of the task-specific output layer\n\nThe concatenation of the glyph embedding and the BERT embedding is fed to the task-specific output layer, which is composed of two transformer layers. Here we replace the transformers with other structures, such as BiLSTMs and CNNs, to explore their influence. We also try the BiMPM structure [Wang et al., 2017].\n\nStrategy | Precision | Recall | F1 | Accuracy\nTransformers | 86.8 | 91.2 | 88.8 | 88.7\nBiLSTMs | 81.8 | 94.9 | 87.9 | 86.9\nCNNs | 81.5 | 94.8 | 87.6 | 86.6\nBiMPM | 81.1 | 94.6 | 87.3 | 86.2\n\nTable 9: Impact of structures for the task-specific output layer.\n\nPerformances are shown in Table 9. As can be seen, transformers not only outperform BiLSTMs and CNNs, but also the BiMPM structure, which is specifically built for the sentence pair classification task. We conjecture that this is because of the consistency between transformers and the BERT structure.\n\n5.3 The image-classification training objective\n\nWe also explore the influence of the image-classification training objective, which outputs the glyph representation to an image-classification objective. Table 10 shows its influence. As can be seen, this auxiliary training objective gives a +0.8 performance boost in accuracy.\n\nStrategy | P | R | F | Acc\nW image-cls | 86.8 | 91.2 | 88.8 | 88.7\nWO image-cls | 83.9 | 93.6 | 88.4 | 87.9\n\nTable 10: Impact of the auxiliary image classification training objective.\n\n5.4 CNN structures\n\nResults for different CNN structures are shown in Table 11. As can be seen, the adoption of the Tianzige-CNN structure introduces a performance boost of about +1.0 in F1. Directly using deep CNNs in our task results in very poor performance because of (1) the relatively small size of the character images: ImageNet images are usually at the scale of 800*600, while Chinese character images are significantly smaller, usually at the scale of 12*12; and (2) the lack of training examples: classification on the ImageNet dataset utilizes tens of millions of different images, whereas there are only about 10,000 distinct Chinese characters. We therefore utilize the Tianzige-CNN (\u7530\u5b57\u683c) structure tailored to logographic character modeling for Chinese. This tianzige structure is of significant importance in extracting character meanings.\n\nModel | P | R | F\nVanilla-CNN | 85.3 | 89.8 | 87.4\nHe et al. [2016] | 84.5 | 90.8 | 87.5\nTianzige-CNN | 86.8 | 91.2 | 88.8\n\nTable 11: Impact of CNN structures.\n\n6 Conclusion\n\nIn this paper, we propose Glyce, glyph-vectors for Chinese character representations. Glyce treats Chinese characters as images and uses the Tianzige-CNN to extract character semantics. Glyce provides a general way to model character semantics of logographic languages. It is general and fundamental: just like word embeddings, Glyce can be integrated into any existing deep learning system.\n\nReferences\n\nXinlei Shi, Junjie Zhai, Xudong Yang, Zehua Xie, and Chao Liu. Radical embedding: Delving deeper to chinese radicals. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 594\u2013598, 2015.\n\nYanran Li, Wenjie Li, Fei Sun, and Sujian Li. Component-enhanced chinese character embeddings. arXiv preprint arXiv:1508.06669, 2015.\n\nRongchao Yin, Quan Wang, Peng Li, Rui Li, and Bin Wang. Multi-granularity chinese word embedding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 981\u2013986, 2016.\n\nYaming Sun, Lei Lin, Nan Yang, Zhenzhou Ji, and Xiaolong Wang. Radical-enhanced chinese character embedding. In International Conference on Neural Information Processing, pages 279\u2013286. Springer, 2014.\n\nYan Shao, Christian Hardmeier, J\u00f6rg Tiedemann, and Joakim Nivre. Character-based joint segmentation and pos tagging for chinese using bidirectional rnn-crf. arXiv preprint arXiv:1704.01314, 2017.\n\nMi Xue Tan, Yuhuang Hu, Nikola I Nikolov, and Richard HR Hahnloser. wubi2en: Character-level chinese-english translation through ascii encoding. arXiv preprint arXiv:1805.03330, 2018.\n\nShaosheng Cao, Wei Lu, Jun Zhou, and Xiaolong Li. cw2vec: Learning chinese word embeddings with stroke n-gram information. 2018.\n\nFrederick Liu, Han Lu, Chieh Lo, and Graham Neubig. Learning character-level compositionality with visual features. arXiv preprint arXiv:1704.04859, 2017.\n\nXiang Zhang and Yann LeCun. Which encoding is the best for text classification in chinese, english, japanese and korean? arXiv preprint arXiv:1708.02657, 2017.\n\nFalcon Z Dai and Zheng Cai. Glyph-aware embedding of chinese characters. arXiv preprint arXiv:1709.00028, 2017.\n\nTzu-Ray Su and Hung-Yi Lee.
Learning chinese word representations from glyphs of characters.\n\narXiv preprint arXiv:1708.04755, 2017.\n\nJia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale\nhierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009.\nIEEE Conference on, pages 248\u2013255. Ieee, 2009.\n\nKaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\nrecognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,\npages 770\u2013778, 2016.\n\nChristian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking\nthe inception architecture for computer vision. In The IEEE Conference on Computer Vision and\nPattern Recognition (CVPR), June 2016.\n\nNingning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shuf\ufb02enet v2: Practical guidelines for\n\nef\ufb01cient cnn architecture design. arXiv preprint arXiv:1807.11164, 5, 2018a.\n\nAlex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classi\ufb01cation with deep convolu-\ntional neural networks. In Advances in neural information processing systems, pages 1097\u20131105,\n2012.\n\nTing Zhang, Guo-Jun Qi, Bin Xiao, and Jingdong Wang. Interleaved group convolutions. In Computer\n\nVision and Pattern Recognition, 2017.\n\nJacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep\nbidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.\n\n10\n\n\fMatthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and\nLuke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365,\n2018.\n\nAlec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.\n\nImproving language\nunderstanding by generative pre-training. URL https://s3-us-west-2. amazonaws. com/openai-\nassets/research-covers/languageunsupervised/language understanding paper. pdf, 2018.\n\nAshish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz\nIn Advances in Neural Information\n\nKaiser, and Illia Polosukhin. Attention is all you need.\nProcessing Systems, pages 5998\u20136008, 2017.\n\nSamuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated\ncorpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical\nMethods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21,\n2015, pages 632\u2013642, 2015.\n\nJie Yang, Yue Zhang, and Shuailong Liang. Subword encoding in lattice LSTM for chinese word\n\nsegmentation. CoRR, abs/1810.12594, 2018.\n\nYue Zhang and Jie Yang. Chinese NER using lattice LSTM. In Proceedings of the 56th Annual\nMeeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July\n15-20, 2018, Volume 1: Long Papers, pages 1554\u20131564, 2018.\n\nXuezhe Ma and Eduard H. Hovy. End-to-end sequence labeling via bi-directional lstm-cnns-crf. In\nProceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL\n2016, August 7-12, 2016, Berlin, Germany, Volume 1, 2016.\n\nJie Yang, Yue Zhang, and Fei Dong. Neural word segmentation with rich pretraining. In Proceedings\nof the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver,\nCanada, July 30 - August 4, Volume 1: Long Papers, pages 839\u2013849, 2017.\n\nJi Ma, Kuzman Ganchev, and David Weiss. 
State-of-the-art chinese word segmentation with bi-lstms.\n\nCoRR, abs/1808.06511, 2018b. URL http://arxiv.org/abs/1808.06511.\n\nWeipeng Huang, Xingyi Cheng, Kunlong Chen, Taifeng Wang, and Wei Chu. Toward fast and\naccurate neural chinese word segmentation with multi-criteria learning. CoRR, abs/1903.04190,\n2019. URL http://arxiv.org/abs/1903.04190.\n\nXin Liu, Qingcai Chen, Chong Deng, Huajun Zeng, Jing Chen, Dongfang Li, and Buzhou Tang.\nLcqmc: A large-scale chinese question matching corpus. In Proceedings of the 27th International\nConference on Computational Linguistics, pages 1952\u20131962, 2018.\n\nAdina Williams and Samuel R Bowman. The multi-genre nli corpus 0.2: Repeval shared task\n\npreliminary version description paper.\n\nZhiguo Wang, Wael Hamza, and Radu Florian. Bilateral multi-perspective matching for natural\nlanguage sentences. In Proceedings of the Twenty-Sixth International Joint Conference on Arti\ufb01cial\nIntelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pages 4144\u20134150, 2017.\n\nRonglu Li. Fudan corpus for text classi\ufb01cation. 2011.\n\nDanqi Chen and Christopher Manning. A fast and accurate dependency parser using neural networks.\nIn Proceedings of the 2014 conference on empirical methods in natural language processing\n(EMNLP), pages 740\u2013750, 2014.\n\nChris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A Smith. Transition-based\ndependency parsing with stack long short-term memory. arXiv preprint arXiv:1505.08075, 2015.\n\nTimothy Dozat and Christopher D Manning. Deep biaf\ufb01ne attention for neural dependency parsing.\n\narXiv preprint arXiv:1611.01734, 2016.\n\nMiguel Ballesteros, Yoav Goldberg, Chris Dyer, and Noah A Smith. Training with exploration\n\nimproves a greedy stack-lstm parser. arXiv preprint arXiv:1603.03793, 2016.\n\n11\n\n\fHao Cheng, Hao Fang, Xiaodong He, Jianfeng Gao, and Li Deng. Bi-directional attention with\n\nagreement for dependency parsing. arXiv preprint arXiv:1608.02076, 2016.\n\nEliyahu Kiperwasser and Yoav Goldberg. Simple and accurate dependency parsing using bidirectional\n\nlstm feature representations. arXiv preprint arXiv:1603.04351, 2016.\n\nMichael Roth and Mirella Lapata. Neural semantic role labeling with dependency path embeddings.\n\narXiv preprint arXiv:1605.07515, 2016.\n\nDiego Marcheggiani and Ivan Titov. Encoding sentences with graph convolutional networks for\n\nsemantic role labeling. arXiv preprint arXiv:1703.04826, 2017.\n\nShexia He, Zuchao Li, Hai Zhao, and Hongxiao Bai. Syntax for semantic role labeling, to be, or not\nto be. 
In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics\n(Volume 1: Long Papers), volume 1, pages 2061\u20132071, 2018.\n\n12\n\n\f", "award": [], "sourceid": 1572, "authors": [{"given_name": "Yuxian", "family_name": "Meng", "institution": "Shannon.AI"}, {"given_name": "Wei", "family_name": "Wu", "institution": "Shannon.AI"}, {"given_name": "Fei", "family_name": "Wang", "institution": "Shannon.AI"}, {"given_name": "Xiaoya", "family_name": "Li", "institution": "Shannon.AI"}, {"given_name": "Ping", "family_name": "Nie", "institution": "Shannon.AI"}, {"given_name": "Fan", "family_name": "Yin", "institution": "Shannon.AI"}, {"given_name": "Muyu", "family_name": "Li", "institution": "Shannon.AI"}, {"given_name": "Qinghong", "family_name": "Han", "institution": "Shannon.AI"}, {"given_name": "Xiaofei", "family_name": "Sun", "institution": "Shannon.AI"}, {"given_name": "Jiwei", "family_name": "Li", "institution": "Shannon.AI"}]}