
A Summary of Word Vector Learning

Date: 2018-08-21

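Word vectors, also called word embeddings, are a mathematical representation of the word, the basic unit in natural language processing. Methods for generating word vectors include neural networks, dimensionality reduction of the word co-occurrence matrix, and probabilistic language models.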

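Representing word vectors

Discrete representation (one-hot representation)

Traditional rule-based or statistical natural language processing methods treat each word as an atomic symbol. A one-hot representation encodes each word as a long vector whose dimension equals the size of the vocabulary (the word space); exactly one component is 1 and all the others are 0.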
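
A minimal sketch of the one-hot encoding described above. The four-word vocabulary and the helper name `one_hot` are made up for illustration; a real vocabulary would have tens of thousands of entries. The dot product at the end shows why one-hot vectors carry no notion of similarity: any two different words are orthogonal.

```python
def one_hot(word, vocab):
    """Return the one-hot vector for `word` over the ordered list `vocab`."""
    index = vocab.index(word)  # raises ValueError for out-of-vocabulary words
    vector = [0] * len(vocab)  # one component per vocabulary entry
    vector[index] = 1          # only the word's own position is set
    return vector


# Hypothetical toy vocabulary, not taken from the corpora used in this post.
vocab = ["insurance", "policy", "claim", "premium"]

claim = one_hot("claim", vocab)
policy = one_hot("policy", vocab)
print(claim)   # [0, 0, 1, 0]
print(policy)  # [0, 1, 0, 0]

# Two different words never share a non-zero position, so their dot product is 0.
print(sum(a * b for a, b in zip(claim, policy)))  # 0
```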

Distributed representation

Each word is mapped to a distributed representation: a fixed-length, continuous, dense vector, typically with 50 to 200 dimensions.

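The advantage is that there is now a notion of "distance" between words, and the vectors carry more information. Each dimension contributes a latent feature of the word, although individual dimensions are usually not directly interpretable.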
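
To make the "distance" idea concrete, here is a small sketch that compares dense vectors with cosine similarity, one common choice of measure (the post does not say which measure it has in mind). The three 4-dimensional vectors are hand-picked for illustration and are not trained embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical dense vectors, chosen by hand so that the two number words
# point in a similar direction and the unrelated word does not.
vectors = {
    "eight": [0.9, 0.1, 0.3, 0.0],
    "nine": [0.8, 0.2, 0.4, 0.1],
    "soccer": [0.0, 0.9, 0.1, 0.7],
}

close = cosine_similarity(vectors["eight"], vectors["nine"])
far = cosine_similarity(vectors["eight"], vectors["soccer"])
print(close > far)  # True
```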

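Generating word vectors

Statistics-based methods:

Count how often words appear together in the corpus to build a pairwise word co-occurrence matrix. This matrix is huge, so it is reduced in dimension with singular value decomposition (SVD), which yields a dense matrix of word vectors. That matrix has a useful property: semantically similar words lie close together in the vector space.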
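
The counting step can be sketched as follows. This covers only the construction of the co-occurrence matrix; the SVD step is left out because it needs a linear algebra library. The window size, the toy sentence, and the helper names `cooccurrence_counts` and `to_matrix` are assumptions made for illustration.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered pair of words occurs within `window` positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own neighbour
                counts[(word, tokens[j])] += 1
    return counts


def to_matrix(counts, vocab):
    """Lay the pair counts out as a dense |V| x |V| matrix in `vocab` order."""
    return [[counts[(row, col)] for col in vocab] for row in vocab]


# Toy corpus; a real corpus would produce a very large, very sparse matrix.
tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
matrix = to_matrix(cooccurrence_counts(tokens, window=1), vocab)

for word, row in zip(vocab, matrix):
    print(word, row)
```

Each row of this matrix is already a (sparse, high-dimensional) word vector; SVD would then compress the rows into a small number of dense dimensions.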

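Probabilistic language models:

Here word vectors are obtained by training a neural network language model (NNLM); the word vectors are a by-product of the language model.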

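Training word vectors

There are many methods for training word vectors, including C&W's SENNA, M&H's HLBL, Mikolov's RNNLM, and Huang's semantically enhanced model, each of which improves on an earlier method. The earliest of these approaches is due to Bengio, whose method is also the conceptual origin of word2vec: a three-layer neural network is trained as an n-gram language model, and the word vectors are obtained in the process.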

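Word2vec is a typical predictive model for learning word vectors efficiently. It comes in two variants: CBOW (predict the target word from its context) and Skip-Gram (predict the context from the target word). The two models are similar. Generally speaking, CBOW smooths over much of the distributional information and is better suited to smaller datasets, while Skip-Gram tends to suit larger datasets.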
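
The difference between the two variants shows up in how training examples are built from a sentence. The sketch below only generates the (input, label) pairs; it does not train a model. The window size of 1, the toy sentence, and the function names are assumptions made for illustration.

```python
def context_of(tokens, i, window):
    """Words within `window` positions of index i, excluding the word at i."""
    start = max(0, i - window)
    end = min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(start, end) if j != i]


def skipgram_pairs(tokens, window=1):
    """Skip-Gram: one (target, context_word) example per neighbouring word."""
    return [
        (target, neighbour)
        for i, target in enumerate(tokens)
        for neighbour in context_of(tokens, i, window)
    ]


def cbow_pairs(tokens, window=1):
    """CBOW: one (context_words, target) example per position."""
    return [(context_of(tokens, i, window), target) for i, target in enumerate(tokens)]


tokens = "the quick brown fox".split()

print(skipgram_pairs(tokens))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]

print(cbow_pairs(tokens))
# [(['quick'], 'the'), (['the', 'brown'], 'quick'),
#  (['quick', 'fox'], 'brown'), (['brown'], 'fox')]
```

Skip-Gram turns every context word into its own training example, while CBOW merges the whole context into a single example per position, which is one way to see why CBOW smooths over more of the distributional information.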

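Evaluating word vectors

Evaluation of word vectors falls roughly into two kinds. The first plugs the word vectors into an existing system and measures the improvement in system performance. The second analyzes the word vectors directly from a linguistic perspective, for example similarity and semantic shift.

Finally: the larger the corpus, the better the word vectors tend to be.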

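Computing word vectors in practice

Below, an existing tool is used to train the Word2vec predictive model and obtain word vectors. An English dataset is used here because its corpus is large enough and its words are relatively simple and easy to understand.

This English corpus contains 17,005,207 words in total.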
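
The UNK entry that appears in the logs of this run suggests that rare words were mapped to a shared "unknown" token before training, which is a common preprocessing step. The post does not show its preprocessing, so the sketch below is an assumption about what such a step typically looks like; the function name `build_vocab` and the tiny vocabulary size are hypothetical.

```python
from collections import Counter


def build_vocab(tokens, vocab_size):
    """Keep the (vocab_size - 1) most frequent words; map everything else to UNK."""
    word_to_id = {"UNK": 0}  # id 0 is reserved for all rare words
    for word, _count in Counter(tokens).most_common(vocab_size - 1):
        word_to_id[word] = len(word_to_id)
    ids = [word_to_id.get(token, 0) for token in tokens]
    return word_to_id, ids


# Toy token stream; "c" and "d" are too rare to make it into a 3-entry vocabulary.
tokens = "a b a c a b d".split()
word_to_id, ids = build_vocab(tokens, vocab_size=3)

print(word_to_id)  # {'UNK': 0, 'a': 1, 'b': 2}
print(ids)         # [1, 2, 1, 0, 1, 2, 0]
```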

Step 0: Average loss at step 0 : 263.6611633300781

Results

Nearest to to : bradycardia, pejoratively, excited, ulein, vesta, psychometrics, plein, primigenius,

Nearest to at : indira, evaluating, teaches, ton, glamorgan, mexicana, schoolteacher, linnean,

Nearest to not : torquay, wells, bookmarks, bitter, eagerness, sld, mot, colliding,

Nearest to are : reissues, soccer, eprint, kierkegaard, laboratory, unaccented, influences, netsplit,

Nearest to more : recognisable, lectured, hey, kournikova, recover, romano, main, punctuation,

Nearest to states : wrest, unbounded, formality, mayonnaise, observatories, gladius, weir, bolt,

Nearest to on : karzai, cytology, fas, inconsequential, deficient, puebla, busway, penicillin,

Nearest to zero : methadone, haitians, introduce, bigg, diverging, percussionist, epithets, sancti,

Nearest to will : prevenient, harlem, evade, lured, extinguished, reviews, sarai, happens,

Nearest to were : conscience, megabytes, plans, prudent, logos, mormons, emacs, leonis,

Nearest to UNK : chiron, huffman, lafur, ubiquity, remember, grouping, fret, senile,

Nearest to b : dualities, contradicted, tab, circumcision, sailplane, moslem, sus, storylines,

Nearest to united : scuttled, popularly, opposition, accord, viceroy, technologically, outsiders, yusuf,

Nearest to most : degrades, xico, leverett, chronometers, boyfriend, clouds, seville, flung,

Nearest to was : ph, adherent, inventory, picker, thai, commonplace, twitching, speedy,

Nearest to that : nomen, fets, yakko, interceptors, deformations, jumpers, icons, choral,

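The lists above show the words nearest to each query word, eight per word. As expected at step 0, the results are poor: for example, the words nearest to zero are methadone, haitians, introduce, bigg, diverging, percussionist, epithets, sancti.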
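
"Nearest to" lists like the ones above are commonly produced by ranking every other vocabulary word by its similarity to the query word and keeping the top few. The sketch below assumes cosine similarity as the measure, which the post does not confirm for its tool, and uses hand-made 2-dimensional vectors in place of trained embeddings.

```python
import math


def unit(vector):
    """Scale a vector to length 1 so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def nearest(query, embeddings, top_k=8):
    """Return the top_k words most similar to `query`, most similar first."""
    q = unit(embeddings[query])
    scored = []
    for word, vector in embeddings.items():
        if word == query:
            continue  # the query itself is not listed as its own neighbour
        similarity = sum(a * b for a, b in zip(q, unit(vector)))
        scored.append((similarity, word))
    scored.sort(reverse=True)
    return [word for _similarity, word in scored[:top_k]]


# Hypothetical embeddings, not taken from the trained model in this post.
embeddings = {
    "zero": [1.0, 0.0],
    "nine": [0.9, 0.1],
    "eight": [0.8, 0.3],
    "soccer": [0.0, 1.0],
}

print(nearest("zero", embeddings, top_k=2))  # ['nine', 'eight']
```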

Step 10000: Average loss at step 10000 : 17.92200917339325

Nearest to to : in, and, specialized, of, prominent, for, three, vs,

Nearest to at : by, in, victoriae, and, UNK, main, british, aberdeen,

Nearest to not : bitter, torquay, aetius, also, made, operates, ash, desired,

Nearest to are : is, laboratory, reginae, were, and, implicit, phobias, kiev,

Nearest to more : recover, cl, kournikova, main, lectured, largest, phi, closer,

Nearest to states : aberdeen, liberals, vertigo, point, obsolete, gland, areas, response,

Nearest to on : in, and, aol, although, during, victoriae, of, karzai,

Nearest to zero : nine, eight, austin, gland, archie, reginae, cl, six,

Nearest to will : reviews, insisted, fran, as, harlem, rebel, happens, economically,

Nearest to were : was, are, conscience, asceticism, is, and, vs, action,

Nearest to UNK : one, rudolph, two, and, the, markov, victoriae, austin,

Nearest to b : agave, circumcision, kilgore, choices, surgery, cl, rocky, busiest,

Nearest to united : homomorphism, behind, opposition, alpina, accord, gulfs, carry, typology,

Nearest to most : if, gland, nazis, xico, ipod, clouds, canaris, one,

Nearest to was : is, were, manner, gollancz, and, depression, had, fails,

Nearest to that : and, know, this, aberdeen, reginae, billboard, it, cambrian,

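At this point the words nearest to zero are nine, eight, austin, gland, archie, reginae, cl, six. Several of them are number words, which shows that the word vectors are starting to capture something useful.

Step 20000: Average loss at step 20000 : 7.809222069978714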

Nearest to to : for, with, in, and, nine, by, three, would,

Nearest to at : in, by, and, main, victoriae, from, on, for,

Nearest to not : it, also, torquay, bitter, there, desired, operates, to,

Nearest to are : were, is, was, dasyprocta, laboratory, reginae, phobias, detective,

Nearest to more : recover, kournikova, main, cl, hey, lectured, largest, franks,

Nearest to states : agouti, dasyprocta, liberals, sancti, obsolete, luther, exponentiation, vertigo,

Nearest to on : in, and, two, during, although, victoriae, at, penicillin,

Nearest to zero : eight, nine, seven, five, three, six, two, four,

Nearest to will : prevenient, insisted, as, reviews, fran, harlem, economically, phoneme,

Nearest to were : are, was, is, and, asceticism, conscience, in, gollancz,

Nearest to UNK : dasyprocta, hbox, agouti, apatosaurus, two, victoriae, markov, polyhedra,

Nearest to b : circumcision, d, agave, fold, kilgore, fetus, contradicted, imitate,

Nearest to united : accord, alpina, behind, homomorphism, typology, carry, yusuf, outsiders,

Nearest to most : polyhedra, xico, vivaldi, if, congo, illustrated, gland, nazis,

Nearest to was : is, were, by, had, are, in, as, has,

Nearest to that : which, yakko, this, and, aes, it, agouti, in

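Now the words nearest to zero are clearly all number words (eight, nine, seven, five, three, six, two, four). So far, when the corpus is large enough, the word vectors look fairly good and there is a meaningful notion of distance between words.

Next, the same test is run on a Chinese insurance corpus. The data currently consists only of insurance questions, so there are about 70,000 Chinese words in total.

Step 0: Average loss at step 0 : 195.21290588378906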

Nearest to 在: 利润, 特别, 电池板, 支援, 老兵, 关闭, 税率, 纤维化,

Nearest to 好: 技巧, 母乳喂养, 认可, 供款, 美化, 跨越, 暖通, 盖床,

Nearest to 哪些: 科技, 生效, 良好, 新旗, 帅哥, 职业, 专员, 一棵树,

Nearest to 的: 移动, 历史, 蟑螂, 借着, 混合, 生日, 二手车, 持有,

Nearest to 发生: 第一次, 主动脉瓣, 水坑, 状况, 每月, 淋巴瘤, 申请, 签发,

Nearest to 便宜: 龙卷风, 费城, 堆, 交易会, 干涸, 覆盖面, 租用, 租借,

Nearest to 扣除: 参与, 提交, 初始化, UNK, 烧毁, 证明, 信息, 死亡,

Nearest to 和: 路上, 很快, 上网, 精神病, 选择, 双相, 自愿, 信任,

Nearest to 为什么: 退出, 转入, 程序, 处罚, 运动, 便携式, 孕产, 阿肯色州,

Nearest to 护理: 共事, 当, 行政, 发作, 公牛, 都, 住, 赔率,

Nearest to 计划: 最早, 跑车, 骨密度, 残骸, 母乳喂养, 冰冷, 不同于, 关节,

Nearest to 上: 油箱, 会因, 联保, 分配, 旋转, 会多大, 停电, 封盖,

Nearest to 年金: 静脉, 集合, 用作, 两类, 数额, 一个, 路边, 救护车,

Nearest to 医疗: 优点, 车子, 切除术, 最, 处方药, 临时, 年级, 执照,

Nearest to 吗: 真的, 改, 自行车, 加强, 合, 自驾, 戴维森, 集体,

Nearest to 医疗保险: 吉普, 自然灾害, 卖年, 肾衰, 损伤, 品牌, 变动, 金新,

As before, the results at the start of training are poor.

Step 100000: Average loss at step 100000 : 2.132530316889286

Nearest to 在: 支援, 推行, 消失, 经纪人, 无法, 订婚戒指, 关闭, 丢掉,

Nearest to 好: 母乳喂养, 伟大, 好事, 差异, 家园, 暖通, 技巧, 昂贵,

Nearest to 哪些: 科技, 组合, 退款, 整形术, 一棵树, 下车, 第一年, 帅哥,

Nearest to 的: 胎儿, 签署, 谈到, 蟑螂, 来到, 错过, 照片, 列出,

Nearest to 发生: 主动脉瓣, 是非, 水坑, 寻找, 金年, 是禧, 费随, 意味着,

Nearest to 便宜: 堆, 廉价, 用药, 学习, 覆盖面, 总统, 南卡罗来纳州, 联盟,

Nearest to 扣除: 死亡, 谈谈, 产业, 爱沙尼亚, 推荐, 尽早, 自由, 租赁者,

Nearest to 和: 职业, 很快, 路上, 胃套, 协助, 上网, 双相, 自由,

Nearest to 为什么: 地下, 追逐, 程序, 正确, 处罚, 退出, 转入, 附加费,

Nearest to 护理: 赔率, 共事, 洪水, 分析, 中级, 主, 照顾, 里面,

Nearest to 计划: 最早, 关节, 人力, 能放, 损害, 大, 代表, 利用,

Nearest to 上: 人身, 会多大, 转向, 字典, 而, 溢流, 建议, 德国,

Nearest to 年金: 集合, 泳池, 数额, 浴缸, 通用, 年, 行政, 无人认领,

Nearest to 医疗: 宝宝, 切除术, 睡眠, 执照, 人命, 优点, 作用, 临时,

Nearest to 吗: 保险箱, 怎么样, 成熟, 合, 毯子, 期末, 自驾, 巨大,

Nearest to 医疗保险: 仍然, 吉普, 总收入, 国会, 俄亥俄州, 顶级, 箱式, 德国

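The results are still not very good. There are roughly three reasons. First, the corpus is too small. Second, the corpus itself is flawed: it was translated from English and does not read fluently. Third, the questions span every possible area of the insurance business, and their meanings largely do not repeat.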

