Index
n-gram Precision, 109
n 元语法单元, 49
Action-value Function, 448
Active Learning, 458
Actor-critic, 449
AdaGrad, 313
Adam, 314
Additive Smoothing, 52
Adequacy, 105
Adversarial Attack, 440
Adversarial Examples, 439
Adversarial Training, 439
Adversarial-NMT, 446
Algorithm, 28
Ambiguity, 94
Anchor, 566
Anneal, 458
Annotated Data, 76
Artificial Neural Networks, 275
Asymmetric Word Alignment, 156
Asynchronous Update, 315
Attention Mechanism, 340
Attention Weight, 362
Automated Machine Learning, 533
Automatic Differentiation, 310
Automatic Post-editing, 131
Automatic Speech Recognition, 584
Autoregressive Decoding, 481
Autoregressive Model, 374
Autoregressive Translation, 481
Average Pooling, 383
Back Translation, 542
Backward Mode, 311
Backward Propagation, 303
Batch Gradient Descent, 308
Batch Inference, 478
Batch Normalization, 317
Bayes’ Rule, 42
Beam Pruning, 67
Beam Search, 66
Bellman Equation, 450
BERT, 549
Bias, 46
Bidirectional Inference, 468
Bilingual Dictionary Induction, 564
Binarization, 258
Binarized Neural Networks, 480
Bits Per Second, 583
Breadth-first Search, 62
Brevity Penalty, 110
Broadcast Mechanism, 300
Bucket, 479
Byte Pair Encoding, 73
Calculus, 303
Catastrophic Forgetting, 461
Chain Rule, 41
Chart, 264
Chart Cell, 264
Child Model, 560
Chomsky Normal Form, 235
Clarity, 105
Class Imbalance Problem, 574
Classifier, 88
Co-adaptation, 436
Coherence, 105
Composition, 266
Compositional Translation, 192
Computation Graph, 303
Computational Linguistics, 35
Concept, 182
Conditional Probability, 39
Conditional Random Fields, 80
Confusion Network, 492
Connectionism, 276
Connectionist Temporal Classification, 588
Constraint-based Translation, 615
Context Vector, 361
Context-free Grammar, 92
Continual Learning, 461
Continuous Dynamic Model, 502
Convex Function, 186
Convolution Kernel, 381
Convolutional Neural Network, 379
Correlation, 120
Cost Function, 306
Coverage Model, 215
Cross-entropy, 44
Cross-entropy Difference, 457
Cube Pruning, 239
Curriculum Learning, 445, 459
Curriculum Schedule, 459
Data Augmentation, 542
Data-driven, 19
Decay, 313
Decoding, 58
Deep Learning, 275
Deficiency, 184
Denoising, 433
Denoising Autoencoder, 544
Dependency Parsing, 90
Depth-first Search, 62
Depthwise Convolution, 395
Depthwise Separable Convolution, 395
Derivation, 93
Difficulty Criteria, 459
Direct Assessment, 106
Directed Hyper-graph, 263
Disambiguation, 94
Discriminative Model, 98
Disfluency Detection, 585
Distributed Representation, 118
Domain Adaptation, 456
DREEM, 118
Dropout, 420
Dual Supervised Learning, 555
Dual Unsupervised Learning, 556
Dynamic Convolution, 395
Dynamic Linear Combination of Layers, 512
Dynamic Programming Encoding (DPE), 432
Element-wise Addition, 282
Element-wise Product, 284
Emission Probability, 82
Empty Alignment, 157
Encoder-Decoder, 32
Encoder-Decoder Attention Sub-layer, 406
Encoder-Decoder Paradigm, 346
End-to-End Learning, 278
End-to-End Speech Translation, 586
Ensemble Learning, 438
Entropy, 17
Estimate, 38
Estimation, 38
Euclidean Norm, 286
Event, 37
Evil Feature, 210
Exact Model, 111
Expectation Maximization, 166
Expected Count, 167
Explainable Machine Learning, 335
Exposure Bias, 444
Expression Swell, 310
Feature, 80
Feature Decay Algorithms, 457
Feature Engineering, 81
Feed-Forward Sub-layer, 405
Fertility, 178
Fidelity, 102
Filter-bank, 584
Fine-tuning, 548
First Moment Estimation, 369
Fluency, 102
FNNLM, 326
Forward Propagation, 302
Frame Length, 584
Frame Shift, 584
Framing, 584
Frobenius Norm, 286
Frobenius 范数, 286
Frontier Set, 253
Full Parsing, 90
Full Search, 374
Future Mask, 414
Gated Linear Units, 388
Gated Recurrent Unit, 357
Generalization, 434
Generation, 353
Generative Adversarial Networks, 441
Generative Model, 84
Glue Rule, 231
Good-Turing Estimate, 53
GPT, 549
GPU, 20
Gradient Clipping, 316
Gradient Explosion, 316
Gradient Vanishing, 316
Gradual Warmup, 371
Grammar, 28
Graphics Processing Unit, 20
Greedy Search, 65
Grid Search, 212
Hamming Window, 584
Harmonic-mean, 113
Heuristic Function, 65
Heuristic Search, 65
Hidden Markov Model, 80
Hierarchical Phrase-based Grammar, 230
Hierarchical Phrase-based Model, 228
High-resource Language, 541
Histogram Pruning, 67
Human Translation Error Rate, 127
Hyper-edge, 263
Hypothesis Recombination, 217
Hypothesis Selection, 490
Ill-posed Problem, 433
Image Captioning, 592
Image-to-Image Translation, 597
Implicit Bridging, 562
Incremental Learning, 461
Inductive Bias, 498
Inference, 58
Informativeness, 105
Intelligibility, 105
Interactive Machine Translation, 613
Interlingua-based Translation, 27
Inverse Problem, 433
Iterative Back Translation, 543
Iterative Refinement, 568
Joint Probability, 39
Joint-BPE, 430
KL Distance, 43
KL 距离, 43
Kullback-Leibler 距离, 43
Label, 88
Label Bias, 86
Label Smoothing, 420
Language, 94
Language Model, 49
Language Modeling, 49
Law of Total Probability, 41
Layer Normalization, 317
Learning Difficulty, 452
Learning Rate, 307
Left-hand Side, 93
Left-most Derivation, 94
Left-to-Right Generation, 59
Lexical Analysis, 74
Lexical Chain, 600
Lexical Translation Probability, 205
Lexicalized Normal Form, 269
Lexically Constrained Translation, 615
Lifelong Learning, 461
Lightweight Convolution, 395
Line Search, 212
Linear Mapping, 284
Linear Transformation, 284
Linearization, 245
Locally Connected, 380
Log-linear Model, 200
Logical Deficiency, 188
Long Short-term Memory, 355
Loss Function, 305
Low-resource Language, 541
LSH Attention, 508
Machine Translation, 13
Marginal Probability, 40
Mask, 414
MASS, 551
Matrix, 281
Max Pooling, 383
Maximum Entropy, 80
Mel 频率倒谱系数, 584
Meteor Normalizer, 114
Mini-batch Gradient Descent, 308
Mini-batch Training, 420
Minimal Rules, 254
Minimum Error Rate Training, 211
Minimum Risk Training, 447
MLM, 550
Modality, 582
Model Compression, 452
Model Parameters, 305
Model Score, 62
Model Training, 211
Momentum, 312
Multi-branch, 504
Multi-head Attention, 413
Multi-hop Attention, 388
Multi-lingual Single Model-based Method, 561
Multi-modal Machine Translation, 582
Multi-stage Inference, 469
Multi-step Attention, 388
Multimodality Problem, 483
Multitask Learning, 531
Named Entity, 79
Named Entity Recognition, 79
Natural Language Processing, 35
Nesterov Accelerated Gradient, 393
Nesterov 加速梯度下降法, 393
Neural Architecture Search, 533
Neural Language Model, 325
Neural Machine Translation, 337
Neural Networks, 275
Noisy Channel Model, 153
Non-autoregressive Model, 374
Non-Autoregressive Translation, 481
Non-terminal, 90
Norm, 285
Numerical Differentiation, 309
Objective Function, 305
Offline Speech Translation, 583
One-hot 编码, 330
Open Vocabulary, 428
Optimal Stopping Criteria, 67
Out-of-vocabulary Word, 52
Over Translation, 470
Overfitting, 318
Padding Mask, 414
Parameter, 51
Parameter Estimation, 45
Parameter Server, 316
Paraphrase Matcher, 114
Paraphrasing, 546
Parent Model, 560
Parsing, 72
Perplexity, 57
Phrasal Segmentation, 195
Phrase Extraction, 202
Phrase Pairs, 196
Phrase Structure Parsing, 90
Phrase Table, 205
Physical Deficiency, 188
Piecewise Constant Decay, 371
Pivot Language, 558
Pointwise Convolution, 395
Policy Gradient, 448
Porter Stem Model, 112
Position Embedding, 388
Position-independent Word Error Rate, 108
Post-editing, 613
Post-norm, 417
Post-processing, 73
Pre-emphasis, 584
Pre-norm, 417
Pre-processing, 73
Pre-terminal, 90
Pre-training, 548
Precision, 110
Prediction, 58
Probabilistic Context-free Grammar, 95
Probabilistic Graphical Model, 81
Probability, 38
Procrustes Analysis, 565
Procrustes Problem, 566
Production Rule, 93
Pruning, 64
Quality Estimation, 124
Quality Evaluation of Translation, 101
Random Variable, 38
Rank, 298
Recall, 110
Receptive Field, 385
Reconstruction Loss, 557
Recurrent Neural Network, 328
Recursive Auto-encoder Embedding, 119
Regularization, 318
Reinforcement Learning, 447
Relative Entropy, 44
Relative Positional Representation, 499
Relative Ranking, 106
Reordering, 206
Representation Learning, 278
Reranking, 468
Residual Connection, 388
Residual Networks, 317
Reverse Mode, 311
Reward Shaping, 451
Right-hand Side, 93
RMSProp, 314
RNN Cell, 329
RNNLM, 328
Robustness, 439
Round-off Error, 309
Scalar, 281
Scalar Multiplication, 283
Scaled Dot-product Attention, 411
Scheduled Sampling, 445
Search Problem, 58
Second Moment Estimation, 369
Segmentation, 72
Selection, 457
Self-attention Mechanism, 330
Self-Attention Sub-layer, 405
Self-information, 43
Self-paced Learning, 460
Self-supervised, 590
Semi-ring Parsing, 264
Sentence, 94
Sequence Generation, 58
Sequence Labeling, 79
Sequence-level Knowledge Distillation, 453
Significance Level, 122
Simultaneous Translation, 583
Singular Value Decomposition, 566
Skip Connection, 317
Smoothing, 52
Solver, 502
Source Domain, 456
Source Language, 13
Span, 235
Speech Translation, 583
Speech-to-Speech Translation, 583
Speech-to-Text Translation, 583
Spiritual Deficiency, 188
Stability-Plasticity, 461
Statistical Hypothesis Testing, 121
Statistical Inference, 494
Statistical Language Modeling, 37
Stochastic Gradient Descent, 308
Stride Convolution, 508
String-based Decoding, 266
Structural Position Representations, 501
Student Model, 453
Sub-word, 429
Supervised Training, 305
Support Vector Machine, 80
Symbolic Differentiation, 309
Symbolicism, 277
Symmetrization, 187
Synchronous Context-free Grammar, 229
Synchronous Tree-substitution Grammar, 247
Synchronous Update, 315
Syntax, 90
System Bias, 189
System Combination, 490
Target Domain, 456
Target Language, 13
Teacher Model, 453
Technical Deficiency, 188
Tensor, 298
Terminal, 90
Text-to-Image Translation, 597
The Gradient Descent Method, 307
The Gradient-based Method, 307
The Lagrange Multiplier Method, 164
The Reversible Residual Network, 508
Threshold Pruning, 217
Time-constrained Search, 472
Token, 74
Top-Down Parsing, 236
Training, 51
Training Data Set, 305
Transfer Learning, 560
Transfer-based Translation, 26
Transformer, 404
Transition Probability, 82
Translation Candidate, 140
Translation Error Rate, 108
Translation Hypothesis, 216
Translation Memory, 615
Transpose, 282
Tree Fragment, 246
Tree-based Decoding, 266
Tree-to-String Translation Rule, 246
Tree-to-Tree Translation Rule, 246
Treebank, 97
Truncation Error, 309
Tunable Weight Vector, 114
Tuning Set, 211
Under Translation, 470
Uniform-cost Search, 64
Uninformed Search, 64
Unsupervised Machine Translation, 564
Update Rule, 307
Variational Autoencoders, 566
Variational Methods, 494
Vector, 281
Vectorization, 300
Viterbi Algorithm, 84
Warmup, 419
Waveform, 583
Weight, 286
Weight Sharing, 380
Weight Tuning, 211
Well-posed, 433
Windowing, 584
WN Synonymy Model, 112
Word Alignment, 147
Word Alignment Link, 147
Word Embedding, 118
Word Error Rate, 108
Word Feature, 81
Word Lattice, 492
Word-level Knowledge Distillation, 453
Word-to-Word, 111
“同义词”匹配模型, 112
“波特词干”匹配模型, 112
“绝对”匹配模型, 111
一致代价搜索, 64
一阶矩估计, 369
上下文向量, 361
上下文无关文法, 92
不适定问题, 433
与位置无关的单词错误率, 108
主动学习, 458
乔姆斯基范式, 235
事件, 37
二值网络, 480
二叉化, 258
二阶矩估计, 369
交互式机器翻译, 613
交叉熵, 44
交叉熵差, 457
产出率, 178
产生式规则, 93
人工神经网络, 275
人工神经网络方法, 50
人工译后编辑距离, 127
代价函数, 306
估计, 38
估计值, 38
位置编码, 388
低资源机器翻译, 541
低资源语言, 541
依存分析, 90
信息性, 105
假设选择, 490
假设重组, 217
偏置, 46
健壮性, 439
充分性, 105
全搜索, 374
全概率公式, 41
准确率, 110
凸函数, 186
函数塑形, 451
分布式表示, 118
分布式表示评价度量, 118
分帧, 584
分段常数衰减, 371
分类器, 88
分词, 72
判别式模型, 98
前向传播, 302
前标准化, 417
前馈神经网络子层, 405
前馈神经网络语言模型, 326
剪枝, 64
加法平滑, 52
加窗, 584
动作价值函数, 448
动态卷积, 395
动态线性聚合网络, 512
动态规划编码, 432
半环分析, 264
单元, 74
单词, 72
单词到单词, 111
单词错误率, 108
卷积核, 381
卷积神经网络, 379
参数, 51
参数估计, 45
参数更新的规则, 307
参数服务器, 316
双向推断, 468
双向编码器表示, 549
双字节编码, 73
双字节联合编码, 430
反向传播, 303
反向模式, 311
反问题, 433
发射概率, 82
变分方法, 494
变分自编码器, 566
古德-图灵估计, 53
句子, 94
句子的表示, 333
句子表示模型, 333
句法, 90
句法分析, 72
句长补全掩码, 414
召回率, 110
可理解度, 105
可解释机器学习, 335
可调权值向量, 114
可逆残差网络结构, 508
右部, 93
同步上下文无关文法, 229
同步更新, 315
同步树替换文法, 247
后处理, 73
后标准化, 417
向量, 281
向量化, 300
启发式函数, 65
启发式搜索, 65
命名实体, 79
命名实体识别, 79
噪声信道模型, 153
回译, 542
困惑度, 57
图像到图像的翻译, 597
图片描述生成, 592
基于中间语言的机器翻译, 27
基于串的解码, 266
基于单词的知识蒸馏, 453
基于句法的特征, 262
基于层次短语的文法, 230
基于层次短语的模型, 228
基于序列的知识蒸馏, 453
基于树的解码, 266
基于梯度的方法, 307
基于短语的特征, 262
基于约束的翻译, 615
基于结构化位置编码, 501
基于转换规则的机器翻译, 26
基于连续动态系统, 502
基于频次的方法, 50
增量式学习, 461
多任务学习, 531
多分支, 504
多头注意力机制, 413
多峰问题, 483
多模态机器翻译, 582
多语言单模型方法, 561
多跳注意力机制, 388
多阶段推断, 469
奇异值分解, 566
子模型, 560
子词, 429
学习率, 307
学习难度, 452
学生模型, 453
完全分析, 90
实时语音翻译, 583
宽度优先搜索, 62
富资源语言, 541
对抗攻击, 440
对抗样本, 439
对抗神经机器翻译, 446
对抗训练, 439
对数线性模型, 200
对称化, 187
小批量梯度下降, 308
小批量训练, 420
局部敏感哈希注意力机制, 508
局部连接, 380
层标准化, 317
左部, 93
帧移, 584
帧长, 584
平均池化, 383
平滑, 52
广播机制, 300
序列化, 245
序列标注, 79
序列生成, 58
开放词表, 428
异步更新, 315
张量, 298
强化学习, 447
归纳偏置, 498
循环单元, 329
循环神经网络, 328
循环神经网络语言模型, 328
微调, 548
忠诚度, 102
感受野, 385
成分分析, 90
截断误差, 309
批量推断, 478
批量标准化, 317
批量梯度下降, 308
技术缺陷, 188
拉格朗日乘数法, 164
持续学习, 461
按元素乘积, 284
按元素加法, 282
损失函数, 305
推导, 93
推断, 58
掩码, 414
掩码端到端预训练, 551
掩码语言模型, 550
搜索问题, 58
支持向量机, 80
教师模型, 453
数乘, 283
数值微分, 309
数据增强, 542
数据并行, 315
数据选择, 457
数据驱动, 19
文本到图像的翻译, 597
文本规范器, 114
无信息搜索, 64
无监督机器翻译, 564
时间受限的搜索, 472
映射锚点, 566
显著性水平, 122
普氏分析, 565
普鲁克问题, 566
曝光偏置, 444
最佳停止条件, 67
最大池化, 383
最大熵, 80
最小规则, 254
最小错误率训练, 211
最小风险训练, 447
最左优先推导, 94
有向超图, 263
有害特征, 210
有指导的训练, 305
有监督的训练, 305
期望最大化, 166
期望频次, 166
未来信息掩码, 414
未登录词, 52
机器翻译, 13
权值共享, 380
权重, 286
权重调优, 211
束剪枝, 67
束搜索, 66
条件概率, 39
条件随机场, 80
枢轴语言, 558
标注偏置, 86
标注数据, 76
标签, 88
标签平滑, 420
标量, 281
树到串翻译规则, 246
树到树翻译规则, 246
树库, 96
树片段, 246
格搜索, 212
桶, 479
梯度下降方法, 307
梯度消失, 316
梯度爆炸, 316
梯度裁剪, 316
概念, 182
概念单元, 182
概率, 38
概率上下文无关文法, 95
概率分布函数, 39
概率图模型, 81
概率密度函数, 39
模型压缩, 452
模型参数, 305
模型并行, 372
模型得分, 62
模型训练, 211
模态, 582
欠翻译, 470
欧几里得范数, 286
正则化, 318
歧义, 94
残差网络, 317
残差连接, 388
比特率, 583
求解器, 502
汉明窗, 584
池化层, 383
泛化, 434
注意力机制, 340
注意力权重, 362
流畅度, 102
消歧, 94
深度优先搜索, 62
深度可分离卷积, 395
深度学习, 275
混淆网络, 492
清晰度, 105
源语言, 13
源领域, 456
滤波器, 381
滤波器组, 584
演员-评论家, 449
灾难性遗忘, 461
熵, 17
父模型, 560
物理缺陷, 188
特征, 80
特征工程, 81
特征衰减算法, 457
独热编码, 330
生成, 353
生成对抗网络, 441
生成式模型, 84
生成式预训练, 549
目标函数, 305
目标语言, 13
目标领域, 456
直接评估, 106
直方图剪枝, 67
相互适应, 436
相关性, 120
相对位置编码, 499
相对排序, 106
相对熵, 44
矩阵, 281
短句惩罚因子, 110
短语, 193
短语切分, 195
短语对, 196
短语抽取, 202
短语结构分析, 90
短语表, 205
神经机器翻译, 337
神经架构搜索, 533
神经网络, 275
神经语言模型, 325
离线语音翻译, 583
稳定性-可塑性, 461
空对齐, 157
立方剪枝, 239
端到端学习, 278
端到端的语音翻译模型, 586
符号主义, 277
符号微分, 309
策略梯度, 448
算子, 303
算法, 28
类别不均衡问题, 574
系统偏置, 189
系统融合, 490
繁衍率, 178
线性化, 245
线性变换, 284
线性映射, 284
线搜索, 212
组合, 266
组合性翻译, 192
终结符, 90
统计假设检验, 121
统计推断, 494
统计语言建模, 37
维特比算法, 84
编码-解码注意力子层, 406
编码器-解码器, 32
编码器-解码器框架, 346
编码器-解码器模型, 346
缩放的点乘注意力, 411
缺陷, 184
翻译候选, 140
翻译假设, 216
翻译记忆, 615
翻译错误率, 108
联合概率, 39
胶水规则, 231
自下而上的分析, 236
自信息, 43
自动后编辑, 131
自动微分, 310
自动机器学习, 533
自动评价, 108
自动语音识别, 584
自回归模型, 374
自回归翻译, 481
自回归解码, 481
自左向右生成, 59
自步学习, 460
自注意力子层, 405
自注意力机制, 330
自然语言处理, 35
自监督, 590
舍入误差, 309
范数, 285
表格, 264
表格单元, 264
表示学习, 278
表达式膨胀, 310
衰减, 313
覆盖度模型, 215
解码, 58
计算图, 303
计算语言学, 35
训练, 51
训练数据集合, 305
记忆更新, 356
词典归纳, 564
词对齐, 147
词对齐连接, 147
词嵌入, 118
词格, 492
词汇化标准形式, 269
词汇化翻译概率, 205
词汇约束翻译, 615
词汇链, 600
词法分析, 74
词特征, 81
译后编辑, 613
译文质量评价, 101
语法, 28
语言, 94
语言建模, 49
语言模型, 49
语音到文本翻译, 583
语音到语音翻译, 583
语音翻译, 583
课程学习, 445, 459
课程规划, 459
调优集合, 211
调和均值, 113
调序, 206
调度采样, 445
贝叶斯法则, 42
贝尔曼方程, 450
质量评估, 124
贪婪搜索, 65
超边, 263
跨度, 235
跨步卷积, 508
跳接, 317
转移概率, 82
转置, 282
转述, 546
轻量卷积, 395
输出, 356
边缘概率, 40
边缘集合, 253
迁移学习, 560
过拟合, 318
过翻译, 470
连接主义, 276
连接时序分类, 588
连贯性, 105
迭代优化, 568
迭代式回译, 543
退火, 458
适定的, 433
逐渐预热, 371
逐点卷积, 395
逐通道卷积, 395
递归自动编码, 119
逻辑缺陷, 188
遗忘, 355
释义匹配器, 114
重排序, 468
重构损失, 557
链式法则, 41
长短时记忆, 355
门循环单元, 357
阈值剪枝, 217
阶, 298
降噪, 433
降噪自编码器, 544
随机事件, 38
随机变量, 38
随机梯度下降, 308
隐式桥接, 562
隐马尔可夫模型, 80
难度评估准则, 459
集成学习, 438
非对称的词对齐, 156
非终结符, 90
非自回归模型, 374
非自回归翻译, 481
顺滑, 585
预加重, 584
预处理, 73
预测, 58
预热, 419
预终结符, 90
预训练, 548
领域适应, 456