[1] 慧立, 彦悰, and 道宣. 大慈恩寺三藏法師傳. Volume 2. 中华书局, 2000 (cited
on page 27).
[2] 中国翻译协会. 2019 中国语言服务行业发展报告. 中国翻译协会, 2019 (cited
on page 27).
[3] 姚恺 and 赵军. “改革探讨创新进发展——全国翻译专业学位研究生
教育 2019 年会综述”. In: 中国翻译, 2019 (cited on page 27).
[4] James Knowlson. Universal Language Schemes in England and France 1600-1800.
University of Toronto Press, 1975 (cited on page 27).
[5] Claude E. Shannon. “A mathematical theory of communication”. In: volume 27.
3. Bell System Technical Journal, 1948, pages 379–423 (cited on page 27).
[6] Claude E. Shannon and Warren Weaver. “The mathematical theory of communi-
cation”. In: volume 13. IEEE Transactions on Instrumentation and Measurement,
1949 (cited on page 27).
[7] Warren Weaver. “Translation”. In: volume 14. 15-23. Cambridge: Technology Press,
MIT, 1955, page 10 (cited on page 27).
[8] Noam Chomsky. “Syntactic Structures”. In: volume 33. 3. Language, 1957 (cited
on pages 28, 99).
[9] Peter F. Brown, John Cocke, Stephen Della Pietra, Vincent J. Della Pietra, Freder-
ick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. “A Statistical
Approach to Machine Translation”. In: volume 16. 2. Computational Linguistics,
1990, pages 79–85 (cited on pages 29, 41, 161).
[10] Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer.
“The Mathematics of Statistical Machine Translation: Parameter Estimation”. In:
volume 19. 2. Computational Linguistics, 1993, pages 263–311 (cited on pages 29,
145, 146, 161, 163, 181, 183, 186, 190, 229).
[11] Makoto Nagao. “A framework of a mechanical translation between Japanese and
English by analogy principle”. In: Artificial and human intelligence, 1984, pages 351–
354 (cited on pages 29, 40).
[12] Satoshi Sato and Makoto Nagao. “Toward Memory-based Translation”. In: Inter-
national Conference on Computational Linguistics, 1990, pages 247–252 (cited on
page 29).
[13] Sergei Nirenburg. “Knowledge-based machine translation”. In: volume 4. 1. Springer,
1989, pages 5–24 (cited on page 35).
[14] William John Hutchins. Machine translation: past, present, future. Ellis Horwood
Chichester, 1986 (cited on page 35).
[15] Michael Zarechnak. “The history of machine translation”. In: volume 1979. Ma-
chine Translation, 1979, pages 1–87 (cited on page 35).
[16] 冯志伟. 机器翻译研究. 中国对外翻译出版公司, 2004 (cited on page 36).
[17] Dan Jurafsky and James H. Martin. Speech and language processing: an introduc-
tion to natural language processing, computational linguistics, and speech recog-
nition, 2nd Edition. Prentice Hall, Pearson Education International, 2009 (cited on
pages 36, 66, 146).
[18] . 述语 (CTRDL)”. In:
volume 5. 4. 中文信息学报, 1991 (cited on page 39).
[19] 姚天顺 and 唐泓英. “基于搭配词典的词汇语义驱动算法”. In: volume 6. A01. 软件
学报, 1995, pages 78–85 (cited on page 39).
[20] William A. Gale and Kenneth W. Church. “A program for aligning sentences in
bilingual corpora”. In: volume 19. 1. Computational Linguistics, 1993, pages 75–
102 (cited on page 41).
[21] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. “Sequence to Sequence Learning
with Neural Networks”. In: Advances in Neural Information Processing Systems,
2014, pages 3104–3112 (cited on pages 42, 348, 359).
[22] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural Machine Trans-
lation by Jointly Learning to Align and Translate”. In: International Conference on
Learning Representations, 2015 (cited on pages 42, 198, 343, 348, 359, 369, 374,
386, 398, 477, 633).
[23] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan
Gomez, Lukasz Kaiser, and Illia Polosukhin. “Attention is All You Need”. In: In-
ternational Conference on Neural Information Processing, 2017, pages 5998–6008
(cited on pages 42, 78, 198, 338, 349, 352, 386, 410, 411, 426, 514, 525).
[24] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin.
“Convolutional Sequence to Sequence Learning”. In: volume 70. International Con-
ference on Machine Learning, 2017, pages 1243–1252 (cited on pages 42, 349, 352,
387, 394).
[25] Thang Luong, Hieu Pham, and Christopher D. Manning. “Effective Approaches to
Attention-based Neural Machine Translation”. In: Conference on Empirical Meth-
ods in Natural Language Processing, 2015, pages 1412–1421 (cited on pages 42,
359, 369, 386, 398, 506).
[26] Philipp Koehn. Statistical Machine Translation. Cambridge University Press, 2010
(cited on page 44).
[27] Philipp Koehn. Neural Machine Translation. Cambridge University Press, 2020
(cited on page 44).
[28] Christopher D Manning and Hinrich Schütze. Foundations of statistical
natural language processing. Massachusetts Institute of Technology Press,
1999 (cited on page 44).
[29] 宗成庆. 统计自然语言处理. 清华大学出版社, 2013 (cited on page 44).
[30] Ian J. Goodfellow, Yoshua Bengio, and Aaron C. Courville. Deep Learning. MIT
Press, 2016 (cited on pages 44, 343).
[31] Yoav Goldberg. “Neural network methods for natural language processing”. In: vol-
ume 10. 1. Morgan & Claypool Publishers, 2017, pages 1–309 (cited on pages 44,
[32] 周志华. 机器学习. 清华大学出版社, 2016 (cited on pages 44, 96).
[33] 李航. 统计学习方法. 清华大学出版社, 2019 (cited on pages 44, 96, 97).
[34] 邱锡鹏. 神经网络与深度学习. 机械工业出版社, 2020 (cited on page 44).
[35] 魏宗
, 2011 (cited
on page 48).
[36] Andrey Nikolaevich Kolmogorov and Albert T Bharucha-Reid. Foundations of the
theory of probability: Second English Edition. Courier Dover Publications, 2018
(cited on page 48).
[37] 刘克. 实用马尔可夫决策过程. 清华大学出版社, 2004 (cited on page 60).
[38] A. Barbour and Sidney Resnick. “Adventures in Stochastic Processes.” In: vol-
ume 88. Journal of the American Statistical Association, Dec. 1993, page 1474
(cited on page 60).
[39] Irving J Good. “The population frequencies of species and the estimation of popu-
lation parameters”. In: volume 40. 3-4. Oxford University Press, 1953, pages 237–
264 (cited on page 63).
[40] William A. Gale and Geoffrey Sampson. “Good-Turing Frequency Estimation With-
out Tears”. In: volume 2. 3. Journal of Quantitative Linguistics, 1995, pages 217–
237 (cited on page 63).
[41] Reinhard Kneser and Hermann Ney. “Improved backing-off for M-gram language
modeling”. In: International Conference on Acoustics, Speech, and Signal Process-
ing, 1995, pages 181–184 (cited on page 64).
[42] Stanley F. Chen and Joshua Goodman. “An empirical study of smoothing tech-
niques for language modeling”. In: volume 13. 4. Computer Speech & Language,
1999, pages 359–393 (cited on pages 64, 66, 78).
[43] Hermann Ney and Ute Essen. “On smoothing techniques for bigram-based natu-
ral language modelling”. In: International Conference on Acoustics, Speech, and
Signal Processing, 1991, pages 825–828 (cited on page 65).
[44] Hermann Ney, Ute Essen, and Reinhard Kneser. “On structuring probabilistic de-
pendences in stochastic language modelling”. In: volume 8. 1. Computer Speech
& Language, 1994, pages 1–38 (cited on pages 65, 66).
[45] Kenneth Heafield. “KenLM: Faster and Smaller Language Model Queries”. In: An-
nual Meeting of the Association for Computational Linguistics, 2011, pages 187–
197 (cited on pages 66, 78).
[46] Andreas Stolcke. “SRILM - an extensible language modeling toolkit”. In: Interna-
tional Conference on Spoken Language Processing, 2002 (cited on page 66).
[47] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction
to Algorithms. The MIT Press and McGraw-Hill Book Company, 1989 (cited on
page 69).
[48] Shimon Even. Graph algorithms. Cambridge University Press, 2011 (cited on page 72).
[49] Robert Endre Tarjan. “Depth-First Search and Linear Graph Algorithms”. In: vol-
ume 1. 2. SIAM Journal on Computing, 1972, pages 146–160 (cited on page 72).
[50] Ashish Sabharwal and Bart Selman. “S. Russell, P. Norvig, Artificial Intelligence:
A Modern Approach, Third Edition”. In: volume 175. 5-6. Artificial Intelligence,
2011, pages 935–937 (cited on page 74).
[51] Sartaj Sahni and Ellis Horowitz. Fundamentals of Computer Algorithms. Computer
Science Press, 1978 (cited on page 74).
[52] Peter E. Hart, Nils J. Nilsson, and Bertram Raphael. “A Formal Basis for the Heuris-
tic Determination of Minimum Cost Paths”. In: volume 4. 2. IEEE Transactions on
Systems Science and Cybernetics, 1968, pages 100–107 (cited on page 75).
[53] Bruce T. Lowerre. The HARPY speech recognition system. Carnegie Mellon Uni-
versity, 1976 (cited on page 75).
[54] Christopher M. Bishop. Neural networks for pattern recognition. Oxford University
Press, 1995 (cited on page 75).
[55] Karl Johan Åström. “Optimal control of Markov processes with incomplete state in-
formation”. In: volume 10. 1. Journal of Mathematical Analysis and Applications,
1965, pages 174–205 (cited on page 75).
[56] Richard E. Korf. “Real-time heuristic search”. In: volume 42. 2. Artificial Intelli-
gence, 1990, pages 189–211 (cited on page 75).
[57] Liang Huang, Kai Zhao, and Mingbo Ma. “When to Finish? Optimal Beam Search
for Neural Text Generation (modulo beam size)”. In: Annual Meeting of the Asso-
ciation for Computational Linguistics, 2017, pages 2134–2139 (cited on pages 77,
[58] Yilin Yang, Liang Huang, and Mingbo Ma. “Breaking the Beam Search Curse: A
Study of (Re-)Scoring Methods and Stopping Criteria for Neural Machine Transla-
tion”. In: Annual Meeting of the Association for Computational Linguistics, 2018,
pages 3054–3059 (cited on pages 77, 477, 479).
[59] F. Jelinek. “Interpolated estimation of Markov source parameters from sparse data”.
In: Pattern Recognition in Practice, 1980, pages 381–397 (cited on page 78).
[60] S. Katz. “Estimation of probabilities from sparse data for the language model com-
ponent of a speech recognizer”. In: volume 35. 3. International Conference on
Acoustics, Speech and Signal Processing, 1987, pages 400–401 (cited on page 78).
[61] Timothy C. Bell, John G. Cleary, and Ian H. Witten. Text compression. Prentice
Hall, 1990 (cited on page 78).
[62] I.H. Witten and T.C. Bell. “The zero-frequency problem: estimating the probabili-
ties of novel events in adaptive text compression”. In: volume 37. 4. IEEE Trans-
actions on Information Theory, 1991, pages 1085–1094 (cited on page 78).
[63] Joshua T. Goodman. “A bit of progress in language modeling”. In: volume 15. 4.
Computer Speech & Language, 2001, pages 403–434 (cited on page 78).
[64] Katrin Kirchhoff and Mei Yang. “Improved Language Modeling for Statistical Ma-
chine Translation”. In: Annual Meeting of the Association for Computational Lin-
guistics, 2005, pages 125–128 (cited on page 78).
[65] Ruhi Sarikaya and Yonggang Deng. “Joint Morphological-Lexical Language Mod-
eling for Machine Translation”. In: Annual Meeting of the Association for Compu-
tational Linguistics, 2007, pages 145–148 (cited on page 78).
[66] Philipp Koehn and Hieu Hoang. “Factored Translation Models”. In: Annual Meet-
ing of the Association for Computational Linguistics, 2007, pages 868–876 (cited
on page 78).
[67] Marcello Federico and Mauro Cettolo. “Efficient Handling of N-gram Language
Models for Statistical Machine Translation”. In: Annual Meeting of the Association
for Computational Linguistics, 2007, pages 88–95 (cited on page 78).
[68] Marcello Federico and Nicola Bertoldi. “How Many Bits Are Needed To Store
Probabilities for Phrase-Based Translation?” In: Annual Meeting of the Associa-
tion for Computational Linguistics, 2006, pages 94–101 (cited on page 78).
[69] David Talbot and Miles Osborne. “Smoothed Bloom Filter Language Models: Tera-
Scale LMs on the Cheap”. In: Annual Meeting of the Association for Computa-
tional Linguistics, 2007, pages 468–476 (cited on page 78).
[70] David Talbot and Miles Osborne. “Randomised Language Modelling for Statistical
Machine Translation”. In: Annual Meeting of the Association for Computational
Linguistics, 2007, pages 512–519 (cited on page 78).
[71] Kun Jing and Jungang Xu. “A Survey on Neural Network Language Models.” In:
arXiv preprint arXiv:1906.03591, 2019 (cited on page 78).
[72] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. “A neu-
ral probabilistic language model”. In: volume 3. 6. Journal of Machine Learning
Research, 2003, pages 1137–1155 (cited on pages 78, 126, 288, 334).
[73] Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khu-
danpur. “Recurrent neural network based language model”. In: International Speech
Communication Association, 2010, pages 1045–1048 (cited on pages 78, 288, 336).
[74] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. “LSTM Neural Networks
for Language Modeling”. In: International Speech Communication Association,
2012, pages 194–197 (cited on page 78).
[75] Franz Josef Och, Nicola Ueffing, and Hermann Ney. “An Efficient A* Search Algo-
rithm for Statistical Machine Translation”. In: Proceedings of the ACL Workshop
on Data-Driven Methods in Machine Translation, 2001 (cited on page
[76] Ye-Yi Wang and Alex Waibel. “Decoding Algorithm in Statistical Machine Trans-
lation”. In: Morgan Kaufmann Publishers, 1997, pages 366–372 (cited on pages 78,
[77] Christoph Tillmann, Stephan Vogel, Hermann Ney, and Alex Zubiaga. “A DP-
based Search Using Monotone Alignments in Statistical Translation”. In: Morgan
Kaufmann Publishers, 1997, pages 289–296 (cited on pages 78, 228).
[78] Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada.
“Fast Decoding and Optimal Decoding for Machine Translation”. In: Morgan Kauf-
mann Publishers, 2001, pages 228–235 (cited on pages 78, 179).
[79] Ulrich Germann. “Greedy decoding for statistical machine translation in almost
linear time”. In: Annual Meeting of the Association for Computational Linguistics,
2003, pages 1–8 (cited on pages 78, 179).
[80] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Fed-
erico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens,
Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. “Moses: Open
Source Toolkit for Statistical Machine Translation”. In: Annual Meeting of the As-
sociation for Computational Linguistics, 2007 (cited on pages 78, 198, 215, 228,
472, 475, 631).
[81] Philipp Koehn. “Pharaoh: A Beam Search Decoder for Phrase-Based Statistical
Machine Translation Models”. In: volume 3265. Springer, 2004, pages 115–124
(cited on pages 78, 198, 228, 472).
[82] S. Bangalore and G. Riccardi. “A finite-state approach to machine translation”. In:
Annual Meeting of the Association for Computational Linguistics, 2001, pages 381–
388 (cited on page 78).
[83] Srinivas Bangalore and Giuseppe Riccardi. “Stochastic Finite-State Models for
Spoken Language Machine Translation”. In: volume 17. 3. Machine Translation,
2002, pages 165–184 (cited on page 78).
[84] Ashish Venugopal, Andreas Zollmann, and Stephan Vogel. “An Efficient Two-Pass
Approach to Synchronous-CFG Driven Statistical MT”. In: Annual Meeting
of the Association for Computational Linguistics, 2007, pages 500–507 (cited on
page 78).
[85] Andreas Zollmann, Ashish Venugopal, Matthias Paulik, and Stephan Vogel. “The
Syntax Augmented MT (SAMT) System at the Shared Task for the 2007 ACL
Workshop on Statistical Machine Translation”. In: Annual Meeting of the Asso-
ciation for Computational Linguistics, 2007, pages 216–219 (cited on pages 78,
[86] Yang Liu, Qun Liu, and Shouxun Lin. “Tree-to-String Alignment Template for
Statistical Machine Translation”. In: Annual Meeting of the Association for Com-
putational Linguistics, 2006 (cited on pages 78, 251, 278).
[87] Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei
Wang, and Ignacio Thayer. “Scalable Inference and Training of Context-Rich Syn-
tactic Translation Models”. In: Annual Meeting of the Association for Computa-
tional Linguistics, 2006 (cited on pages 78, 251, 258, 278).
[88] David Chiang. “A Hierarchical Phrase-Based Model for Statistical Machine Trans-
lation”. In: Annual Meeting of the Association for Computational Linguistics, 2005,
pages 263–270 (cited on pages 78, 236, 278).
[89] Rico Sennrich, Barry Haddow, and Alexandra Birch. “Neural Machine Translation
of Rare Words with Subword Units”. In: Annual Meeting of the Association for
Computational Linguistics, 2016 (cited on pages 81, 435, 436, 480, 549).
[90] 刘挺, 吴岩, and 王开铸. “最大概率分词问题及其解法”. In: 06. 哈尔滨工业大学学报,
1998, pages 37–41 (cited on page 86).
[91] 丁洁. “基于最大概率分词算法的中文分词方法研究”. In: 21. 科技信息, 2010,
pages I0075–I0075 (cited on page 86).
[92] Richard Bellman. “Dynamic programming”. In: volume 153. 3731. Science, 1966,
pages 34–37 (cited on page 86).
[93] Kevin Humphreys, Robert J. Gaizauskas, Saliha Azzam, Charles Huyck, Brian
Mitchell, Hamish Cunningham, and Yorick Wilks. University of Sheffield: Descrip-
tion of the LaSIE-II system as used for MUC-7. Annual Meeting of the Association
for Computational Linguistics, 1995 (cited on page 88).
[94] George Krupka and Kevin Hausman. “IsoQuest Inc.: Description of the NetOwl™
Extractor System as Used for MUC-7”. In: Annual Meeting of the Association for
Computational Linguistics, 1998 (cited on page 88).
[95] William J Black, Fabio Rinaldi, and David Mowatt. “FACILE: Description of the
NE System Used for MUC-7”. In: Annual Meeting of the Association for Compu-
tational Linguistics, 1998 (cited on page 88).
[96] Sean R Eddy. “Hidden Markov models.” In: volume 6. 3. Current Opinion in Structural
Biology, 1996, pages 361–365 (cited on pages 88, 90).
[97] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. “Conditional
Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data”.
In: Proceedings of the Eighteenth International Conference on Machine Learning,
2001, pages 282–289 (cited on pages 88, 94, 95, 106).
[98] Jagat Narain Kapur. Maximum-entropy models in science and engineering. John
Wiley & Sons, 1989 (cited on page 88).
[99] Marti A. Hearst, Susan T Dumais, Edgar Osuna, John Platt, and Bernhard Schölkopf.
“Support vector machines”. In: volume 13. 4. IEEE Intelligent Systems & Their
Applications, 1998, pages 18–28 (cited on page 88).
[100] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu,
and Pavel Kuksa. “Natural Language Processing (almost) from Scratch”. In: vol-
ume 12. 1. Journal of Machine Learning Research, 2011, pages 2493–2537 (cited
on pages 88, 343, 406, 552).
[101] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami,
and Chris Dyer. “Neural Architectures for Named Entity Recognition”. In: Annual
Meeting of the Association for Computational Linguistics, 2016, pages 260–270
(cited on page 89).
[102] Leonard E Baum and Ted Petrie. “Statistical Inference for Probabilistic Functions
of Finite State Markov Chains”. In: volume 37. 6. Annals of Mathematical Statistics,
1966, pages 1554–1563 (cited on page 90).
[103] Leonard E Baum, Ted Petrie, George Soules, and Norman Weiss. “A maximization
technique occurring in the statistical analysis of probabilistic functions of Markov
chains”. In: volume 41. 1. Annals of Mathematical Statistics, 1970, pages 164–171
(cited on pages 90, 92).
[104] Arthur P Dempster, Nan M Laird, and Donald B Rubin. “Maximum likelihood
from incomplete data via the EM algorithm”. In: volume 39. 1. Journal of the Royal
Statistical Society: Series B (Methodological), 1977, pages 1–22 (cited on page 92).
[105] Andrew Viterbi. “Error bounds for convolutional codes and an asymptotically op-
timum decoding algorithm”. In: volume 13. 2. IEEE Transactions on Information
Theory, 1967, pages 260–269 (cited on page 92).
[106] Peter Harrington. ”. In: 版社, , 2013 (cited on
page 97).
[107] Andrew Y Ng and Michael I Jordan. “On Discriminative vs. Generative Classi-
fiers: A comparison of logistic regression and naive Bayes”. In: MIT Press, 2001,
pages 841–848 (cited on page 106).
[108] Christopher D Manning, Hinrich Schütze, and Prabhakar Raghavan. Introduction
to information retrieval. Cambridge University Press, 2008 (cited on page 106).
[109] Adam Berger, Stephen A Della Pietra, and Vincent J Della Pietra. “A maximum
entropy approach to natural language processing”. In: volume 22. 1. Computational
Linguistics, 1996, pages 39–71 (cited on page 106).
[110] Tom Mitchell. Machine Learning. McGraw Hill, 1996 (cited on page 106).
[111] Franz Josef Och and Hermann Ney. “Discriminative Training and Maximum En-
tropy Models for Statistical Machine Translation”. In: Annual Meeting of the Asso-
ciation for Computational Linguistics, 2002, pages 295–302 (cited on pages 106,
[112] Liang Huang. “Coling 2008: Advanced Dynamic Programming in Computational
Linguistics: Theory, Algorithms and Applications-Tutorial notes”. In: International
Conference on Computational Linguistics, 2008 (cited on page 106).
[113] Mehryar Mohri, Fernando Pereira, and Michael Riley. “Speech recognition with
weighted finite-state transducers”. In: Springer, 2008, pages 559–584 (cited on
page 106).
[114] Alfred V Aho and Jeffrey D Ullman. The theory of parsing, translation, and com-
piling. Prentice-Hall Englewood Cliffs, NJ, 1973 (cited on page 106).
[115] Thorsten Brants. “TnT - A Statistical Part-of-Speech Tagger”. In: Annual Meeting
of the Association for Computational Linguistics, 2000, pages 224–231 (cited on
page 106).
[116] Yoshimasa Tsuruoka and Jun’ichi Tsujii. “Chunk Parsing Revisited”. In: Annual
Meeting of the Association for Computational Linguistics, 2005, pages 133–140
(cited on page 106).
[117] Sujian Li, Houfeng Wang, Shiwen Yu, and Chengsheng Xin. “News-Oriented Au-
tomatic Chinese Keyword Indexing”. In: Annual Meeting of the Association for
Computational Linguistics, 2003, pages 92–97 (cited on page 106).
[118] Noam Chomsky. Lectures on government and binding: The Pisa lectures. Walter
de Gruyter, 1993 (cited on page 106).
[119] Zhiheng Huang, Wei Xu, and Kai Yu. “Bidirectional LSTM-CRF Models for Se-
quence Tagging”. In: CoRR, 2015 (cited on page 106).
[120] Jason PC Chiu and Eric Nichols. “Named entity recognition with bidirectional
LSTM-CNNs”. In: volume 4. MIT Press, 2016, pages 357–370 (cited on page 106).
[121] Andrej Zukov Gregoric, Yoram Bachrach, and Sam Coope. “Named Entity Recog-
nition With Parallel Recurrent Neural Networks”. In: Annual Meeting of the Asso-
ciation for Computational Linguistics, 2018, pages 69–74 (cited on page 107).
[122] Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. “A Survey on Deep Learning
for Named Entity Recognition”. In: volume PP. 99. IEEE Transactions on Knowl-
edge and Data Engineering, 2020, pages 1–1 (cited on page 107).
[123] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training
of deep bidirectional transformers for language understanding”. In: Annual
Meeting of the Association for Computational Linguistics, 2019, pages 4171–4186
(cited on pages 107, 127, 475, 493, 548, 552–554, 575).
[124] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. “Improving
language understanding by generative pre-training”. In: 2018 (cited on pages 107,
127, 552–554).
[125] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guil-
laume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer,
and Veselin Stoyanov. “Unsupervised Cross-lingual Representation Learning at
Scale”. In: Annual Meeting of the Association for Computational Linguistics, 2020,
pages 8440–8451 (cited on page 107).
[126] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-jing Zhu. “Bleu: a Method
for Automatic Evaluation of Machine Translation”. In: Annual Meeting of the As-
sociation for Computational Linguistics, 2002, pages 311–318 (cited on pages 109,
117, 118).
[127] Kenneth W Church and Eduard H Hovy. “Good applications for crummy machine
translation”. In: volume 8. 4. Springer, 1993, pages 239–258 (cited on page 110).
[128] John B. Carroll. “An experiment in evaluating the quality of translations”. In:
volume 9. 3-4. Mech. Transl. Comput. Linguistics, 1966, pages 55–66 (cited on
page 113).
[129] John S White, Theresa A O’Connell, and Francis E O’Mara. “The ARPA MT evaluation
methodologies: evolution, lessons, and future approaches”. In: Proceedings
of the First Conference of the Association for Machine Translation in the Americas,
1994 (cited on pages 113, 114).
[130] Keith J. Miller and Michelle Vanni. “Inter-rater Agreement Measures, and the Re-
finement of Metrics in the PLATO MT Evaluation Paradigm”. In: The tenth Ma-
chine Translation Summit, 2005, pages 125–132 (cited on page 113).
[131] Margaret King, Andrei Popescu-Belis, and Eduard Hovy. “FEMTI: creating and
using a framework for MT evaluation”. In: Proceedings of MT Summit IX, New
Orleans, LA, 2003, pages 224–231 (cited on page
[132] Mark A. Przybocki, Kay Peterson, Sebastien Bronsart, and Gregory A. Sanders.
“The NIST 2008 Metrics for machine translation challenge - overview, methodol-
ogy, metrics, and results”. In: volume 23. 2-3. Machine Translation, 2009, pages 71–
103 (cited on page 114).
[133] Florence Reeder. “Direct application of a language learner test to MT evaluation”.
In: Proceedings of AMTA, 2006 (cited on page 114).
[134] Chris Callison-Burch, Cameron S. Fordyce, Philipp Koehn, Christof Monz, and
Josh Schroeder. “(Meta-) Evaluation of Machine Translation”. In: Annual Meeting
of the Association for Computational Linguistics, 2007, pages 136–158 (cited on
page 114).
[135] Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut,
and Lucia Specia. “Findings of the 2012 Workshop on Statistical Machine Transla-
tion”. In: Annual Meeting of the Association for Computational Linguistics, 2012,
pages 10–51 (cited on page 114).
[136] Adam Lopez. “Putting Human Assessments of Machine Translation Systems in Or-
der”. In: Annual Meeting of the Association for Computational Linguistics, 2012,
pages 1–9 (cited on page 114).
[137] Philipp Koehn. “Simulating human judgment in machine translation evaluation
campaigns”. In: International Workshop on Spoken Language Translation, 2012,
pages 179–184 (cited on page 115).
[138] Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias
Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo
Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. “Findings of
the 2015 Workshop on Statistical Machine Translation”. In: Annual Meeting of the
Association for Computational Linguistics, 2015, pages 1–46 (cited on page 115).
[139] Shujian Huang and Kevin Knight. Machine Translation: 15th China Conference,
CCMT 2019, Nanchang, China, September 27–29, 2019, Revised Selected Papers.
Volume 1104. Springer Nature, 2019 (cited on page 115).
[140] Dan Jurafsky. Speech & language processing. Pearson Education India, 2000 (cited
on pages 116, 599).
[141] Christoph Tillmann, Stephan Vogel, Hermann Ney, Alex Zubiaga, and Hassan
Sawaf. “Accelerated DP based search for statistical translation”. In: European Con-
ference on Speech Communication and Technology, 1997 (cited on page 116).
[142] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul.
“A study of translation edit rate with targeted human annotation”. In: volume 200.
6. Proceedings of the Association for Machine Translation in the Americas, 2006 (cited
on page 116).
[143] Nancy Chinchor. “MUC-4 evaluation metrics”. In: Annual Meeting of the Associ-
ation for Computational Linguistics, 1992, pages 22–29 (cited on page 118).
[144] David Chiang, Steve DeNeefe, Yee Seng Chan, and Hwee Tou Ng. “Decomposabil-
ity of Translation Metrics for Improved Evaluation and Efficient Algorithms”. In:
Annual Meeting of the Association for Computational Linguistics, 2008, pages 610–
619 (cited on page 118).
[145] Matt Post. “A Call for Clarity in Reporting BLEU Scores”. In: Annual Meeting
of the Association for Computational Linguistics, 2018, pages 186–191 (cited on
page 118).
[146] Satanjeev Banerjee and Alon Lavie. “METEOR: An Automatic Metric for MT
Evaluation with Improved Correlation with Human Judgments”. In: Annual Meet-
ing of the Association for Computational Linguistics, 2005, pages 65–72 (cited on
page 119).
[147] Michael J. Denkowski and Alon Lavie. “METEOR-NEXT and the METEOR Para-
phrase Tables: Improved Evaluation Support for Five Target Languages”. In: An-
nual Meeting of the Association for Computational Linguistics, 2010, pages 339–
342 (cited on page 122).
[148] Michael J. Denkowski and Alon Lavie. “Meteor 1.3: Automatic Metric for Reliable
Optimization and Evaluation of Machine Translation Systems”. In: Annual Meet-
ing of the Association for Computational Linguistics, 2011, pages 85–91 (cited on
page 122).
[149] Michael J. Denkowski and Alon Lavie. “Meteor Universal: Language Specific
Translation Evaluation for Any Target Language”. In: Annual Meeting of the As-
sociation for Computational Linguistics, 2014, pages 376–380 (cited on page 122).
[150] Shiwen Yu. “Automatic evaluation of output quality for Machine Translation sys-
tems”. In: volume 8. 1-2. Mach. Transl., 1993, pages 117–126 (cited on page 122).
[151] Ming Zhou, Bo Wang, Shujie Liu, Mu Li, Dongdong Zhang, and Tiejun Zhao. “Di-
agnostic Evaluation of Machine Translation Systems Using Automatically Con-
structed Linguistic Check-Points”. In: International Conference on Computational
Linguistics, 2008, pages 1121–1128 (cited on page 123).
[152] Joshua Albrecht and Rebecca Hwa. “A Re-examination of Machine Learning Ap-
proaches for Sentence-Level MT Evaluation”. In: Annual Meeting of the Associa-
tion for Computational Linguistics, 2007 (cited on page 123).
[153] Joshua Albrecht and Rebecca Hwa. “Regression for Sentence-Level MT Evalua-
tion with Pseudo References”. In: Annual Meeting of the Association for Compu-
tational Linguistics, 2007 (cited on page 123).
[154] Ding Liu and Daniel Gildea. “Source-Language Features and Maximum Correla-
tion Training for Machine Translation Evaluation”. In: Annual Meeting of the As-
sociation for Computational Linguistics, 2007, pages 41–48 (cited on page 124).
[155] Jesús Giménez and Lluís Màrquez. “Heterogeneous Automatic MT Evaluation
Through Non-Parametric Metric Combinations”. In: Annual Meeting of the Asso-
ciation for Computational Linguistics, 2008, pages 319–326 (cited on page 124).
[156] Markus Dreyer and Daniel Marcu. “HyTER: Meaning-Equivalent Semantics for
Translation Evaluation”. In: Annual Meeting of the Association for Computational
Linguistics, 2012, pages 162–171 (cited on page 124).
[157] Ondrej Bojar, Matous Machácek, Ales Tamchyna, and Daniel Zeman. “Scratching
the Surface of Possible Translations”. In: volume 8082. Springer, 2013, pages 465–
474 (cited on page 125).
[158] Ying Qin and Lucia Specia. “Truly Exploring Multiple References for Machine
Translation Evaluation”. In: European Association for Machine Translation, 2015
(cited on page 126).
[159] Boxing Chen and Hongyu Guo. “Representation Based Translation Evaluation Met-
rics”. In: Annual Meeting of the Association for Computational Linguistics, 2015,
pages 150–155 (cited on page 126).
[160] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christo-
pher D. Manning. “Semi-Supervised Recursive Autoencoders for Predicting Sen-
timent Distributions”. In: Annual Meeting of the Association for Computational
Linguistics, 2011, pages 151–161 (cited on pages 126, 127).
[161] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning,
Andrew Y. Ng, and Christopher Potts. “Recursive Deep Models for Semantic Com-
positionality Over a Sentiment Treebank”. In: Annual Meeting of the Association
for Computational Linguistics, 2013, pages 1631–1642 (cited on page 126).
[162] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. “Efficient Estimation
of Word Representations in Vector Space”. In: arXiv preprint arXiv:1301.3781,
2013 (cited on pages 127, 343).
[163] Quoc Le and Tomas Mikolov. “Distributed representations of sentences and documents”.
In: International Conference on Machine Learning, 2014, pages 1188–1196
(cited on pages 127, 551).
[164] Ben Athiwaratkun and Andrew Gordon Wilson. “Multimodal Word Distributions”.
In: Annual Meeting of the Association for Computational Linguistics, 2017, pages 1645–
1656 (cited on page 127).
[165] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark,
Kenton Lee, and Luke Zettlemoyer. “Deep Contextualized Word Representations”.
In: Annual Conference of the North American Chapter of the Association for Com-
putational Linguistics, 2018, pages 2227–2237 (cited on pages 127, 343, 552, 553,
[166] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. “GloVe: Global
Vectors for Word Representation”. In: Annual Meeting of the Association for Com-
putational Linguistics, 2014, pages 1532–1543 (cited on pages 127, 340, 343).
[167] Ryan Kiros, Yukun Zhu, Russ R Salakhutdinov, Richard Zemel, Raquel Urtasun,
Antonio Torralba, and Sanja Fidler. “Skip-thought vectors”. In: Advances in neural
information processing systems, 2015, pages 3294–3302 (cited on page 127).
[168] Junki Matsuo, Mamoru Komachi, and Katsuhito Sudoh. “Word-Alignment-Based
Segment-Level Machine Translation Evaluation using Word Embeddings”. In: vol-
ume abs/1704.00380. CoRR, 2017 (cited on page 127).
[169] Francisco Guzmán, Shafiq Joty, Lluís Màrquez, and Preslav Nakov. “Machine
translation evaluation with neural networks”. In: volume 45. Computer Speech &
Language, 2017, pages 180–200 (cited on page 128).
[170] Karl Pearson. “Notes on the history of correlation”. In: volume 13. 1. JSTOR, 1920,
pages 25–45 (cited on page 128).
[171] Deborah Coughlin. “Correlating automated and human assessments of machine
translation quality”. In: 2003 (cited on pages 128, 129).
[172] Andrei Popescu-Belis. “An experiment in comparative evaluation: humans vs. com-
puters”. In: Proceedings of the Ninth Machine Translation Summit. New Orleans,
2003 (cited on page 128).
[173] Christopher Culy and Susanne Z Riehemann. “The limits of N-gram translation
evaluation metrics”. In: MT Summit IX, 2003, pages 71–78 (cited on page 129).
[174] Andrew Finch, Yasuhiro Akiba, and Eiichiro Sumita. “Using a paraphraser to im-
prove machine translation evaluation”. In: International Joint Conference on Natu-
ral Language Processing, 2004 (cited on page 129).
[175] Olivier Hamon and Djamel Mostefa. “The Impact of Reference Quality on Auto-
matic MT Evaluation”. In: International conference on machine learning, 2008,
pages 39–42 (cited on page 129).
[176] George Doddington. “Automatic evaluation of machine translation quality using
n-gram co-occurrence statistics”. In: Proceedings of the second international con-
ference on Human Language Technology Research, 2002, pages 138–145 (cited
on page 129).
[177] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. “Re-evaluating the role
of BLEU in machine translation research”. In: 11th Conference of the European Chapter
of the Association for Computational Linguistics, 2006 (cited on page 129).
[178] Hirotugu Akaike. “A new look at the statistical model identification”. In: volume 19.
6. IEEE, 1974, pages 716–723 (cited on page 129).
[179] Bradley Efron and Robert Tibshirani. An Introduction to the Bootstrap. Springer,
1993 (cited on page 130).
[180] Philipp Koehn. “Statistical Significance Tests for Machine Translation Evaluation”.
In: ACL, 2004, pages 388–395 (cited on page 130).
[181] Eric W Noreen. Computer-intensive methods for testing hypotheses. Wiley New
York, 1989 (cited on page 130).
[182] Stefan Riezler and John T. Maxwell III. “On Some Pitfalls in Automatic Evalua-
tion and Significance Testing for MT”. In: Annual Meeting of the Association for
Computational Linguistics, 2005, pages 57–64 (cited on page 130).
[183] Taylor Berg-Kirkpatrick, David Burkett, and Dan Klein. “An Empirical Investiga-
tion of Statistical Significance in NLP”. In: Annual Meeting of the Association for
Computational Linguistics, 2012, pages 995–1005 (cited on pages 130, 131).
[184] Michael Gamon, Anthony Aue, and Martine Smets. “Sentence-level MT evalua-
tion without reference translations: Beyond language modeling”. In: Proceedings
of EAMT, 2005, pages 103–111 (cited on page 135).
[185] Christopher Quirk. “Training a Sentence-Level Machine Translation Confidence
Measure”. In: European Language Resources Association, 2004 (cited on page 135).
[186] Douglas A. Jones, Edward Gibson, Wade Shen, Neil Granoien, Martha Herzog,
Douglas A. Reynolds, and Clifford J. Weinstein. “Measuring human readability
of machine generated text: three case studies in speech recognition and machine
translation”. In: IEEE, 2005, pages 1009–1012 (cited on page
[187] Carolina Scarton, Marcos Zampieri, Mihaela Vela, Josef van Genabith, and Lucia
Specia. “Searching for Context: a Study on Document-Level Labels for Transla-
tion Quality Estimation”. In: European Association for Machine Translation, 2015
(cited on page 136).
[188] Pablo Fetter, Frédéric Dandurand, and Peter Regel-Brietzmann. “Word graph rescor-
ing using confidence measures”. In: volume 1. Proceeding of Fourth International
Conference on Spoken Language Processing, 1996, pages 10–13 (cited on page 137).
[189] Ergun Biçici. “Referential Translation Machines for Quality Estimation”. In: An-
nual Meeting of the Association for Computational Linguistics, 2013, pages 343–
351 (cited on pages 137, 138).
[190] José Guilherme Camargo de Souza, Christian Buck, Marco Turchi, and Matteo Ne-
gri. “FBK-UEdin Participation to the WMT13 Quality Estimation Shared Task”. In:
Annual Meeting of the Association for Computational Linguistics, 2013, pages 352–
358 (cited on page 137).
[191] Ergun Biçici and Andy Way. “Referential Translation Machines for Predicting
Translation Quality”. In: Annual Meeting of the Association for Computational
Linguistics, 2014, pages 313–321 (cited on pages 137, 141).
[192] José Guilherme Camargo de Souza, Jesús González-Rubio, Christian Buck, Marco
Turchi, and Matteo Negri. “FBK-UPV-UEdin participation in the WMT14 Quality
Estimation shared-task”. In: Annual Meeting of the Association for Computational
Linguistics, 2014, pages 322–328 (cited on pages 137, 138).
[193] Miquel Esplà-Gomis, Felipe Sánchez-Martínez, and Mikel L. Forcada. “UAlacant
word-level machine translation quality estimation system at WMT 2015”. In: An-
nual Meeting of the Association for Computational Linguistics, 2015, pages 309–
315 (cited on page 137).
[194] Julia Kreutzer, Shigehiko Schamoni, and Stefan Riezler. “QUality Estimation from
ScraTCH (QUETCH): Deep Learning for Word-level Translation Quality Estima-
tion”. In: Annual Meeting of the Association for Computational Linguistics, 2015,
pages 316–322 (cited on page 138).
[195] André F. T. Martins, Ramón Fernández Astudillo, Chris Hokamp, and Fabio Ke-
pler. “Unbabel’s Participation in the WMT16 Word-Level Translation Quality Es-
timation Shared Task”. In: Annual Meeting of the Association for Computational
Linguistics, 2016, pages 806–811 (cited on page 138).
[196] Zhiming Chen, Yiming Tan, Chenlin Zhang, Qingyu Xiang, Lilin Zhang, Maoxi
Li, and Mingwen Wang. “Improving Machine Translation Quality Estimation with
Neural Network Features”. In: Annual Meeting of the Association for Computa-
tional Linguistics, 2017, pages 551–555 (cited on page 138).
[197] Julia Kreutzer, Shigehiko Schamoni, and Stefan Riezler. “Quality estimation from
scratch (quetch): Deep learning for word-level translation quality estimation”. In:
Proceedings of the Tenth Workshop on Statistical Machine Translation, 2015, pages 316–
322 (cited on page 138).
[198] Kashif Shah, Varvara Logacheva, Gustavo Paetzold, Frédéric Blain, Daniel Beck,
Fethi Bougares, and Lucia Specia. “SHEF-NN: Translation Quality Estimation
with Neural Networks”. In: Annual Meeting of the Association for Computational
Linguistics, 2015, pages 342–347 (cited on page 138).
[199] Carolina Scarton, Daniel Beck, Kashif Shah, Karin Sim Smith, and Lucia Specia.
“Word embeddings and discourse information for Quality Estimation”. In: Annual
Meeting of the Association for Computational Linguistics, 2016, pages 831–837
(cited on page 138).
[200] Amal Abdelsalam, Ondrej Bojar, and Samhaa El-Beltagy. “Bilingual Embeddings
and Word Alignments for Translation Quality Estimation”. In: Annual Meeting
of the Association for Computational Linguistics, 2016, pages 764–771 (cited on
page 138).
[201] Prasenjit Basu, Santanu Pal, and Sudip Kumar Naskar. “Keep It or Not: Word Level
Quality Estimation for Post-Editing”. In: Annual Meeting of the Association for
Computational Linguistics, 2018, pages 759–764 (cited on page 138).
[202] Hou Qi. “NJU Submissions for the WMT19 Quality Estimation Shared Task”. In:
Annual Meeting of the Association for Computational Linguistics, 2019, pages 95–
100 (cited on page 138).
[203] Junpei Zhou, Zhisong Zhang, and Zecong Hu. “SOURCE: SOURce-Conditional
Elmo-style Model for Machine Translation Quality Estimation”. In: Annual Meet-
ing of the Association for Computational Linguistics, 2019, pages 106–111 (cited
on page 138).
[204] Chris Hokamp. “Ensembling Factored Neural Machine Translation Models for Au-
tomatic Post-Editing and Quality Estimation”. In: Annual Meeting of the Associa-
tion for Computational Linguistics, 2017, pages 647–654 (cited on page 138).
[205] Ziyang Wang, Hui Liu, Hexuan Chen, Kai Feng, Zeyang Wang, Bei Li, Chen Xu,
Tong Xiao, and Jingbo Zhu. “NiuTrans Submission for CCMT19 Quality Estima-
tion Task”. In: Springer, 2019, pages 82–92 (cited on page
[206] Fábio Kepler, Jonay Trénous, Marcos Treviso, Miguel Vera, António Góis, M Amin
Farajian, António V Lopes, and André FT Martins. “Unbabel’s Participation in the
WMT19 Translation Quality Estimation Shared Task”. In: 2019, pages 78–84 (cited
on page 138).
[207] Elizaveta Yankovskaya, Andre Tättar, and Mark Fishel. “Quality Estimation and
Translation Metrics via Pre-trained Word and Sentence Embeddings”. In: Annual
Meeting of the Association for Computational Linguistics, 2019, pages 101–105
(cited on page 138).
[208] Hyun Kim, Joon-Ho Lim, Hyun-Ki Kim, and Seung-Hoon Na. “QE BERT: Bilin-
gual BERT Using Multi-task Learning for Neural Quality Estimation”. In: An-
nual Meeting of the Association for Computational Linguistics, 2019, pages 85–89
(cited on page 138).
[209] Silja Hildebrand and Stephan Vogel. “MT Quality Estimation: The CMU System
for WMT’13”. In: Annual Meeting of the Association for Computational Linguis-
tics, 2013, pages 373–379 (cited on page 138).
[210] André FT Martins, Ramón Astudillo, Chris Hokamp, and Fabio Kepler. “Unbabel’s
participation in the WMT16 word-level translation quality estimation shared task”.
In: Proceedings of the First Conference on Machine Translation, 2016, pages 806–
811 (cited on page 138).
[211] Ding Liu and Daniel Gildea. “Syntactic Features for Evaluation of Machine Trans-
lation”. In: Annual Meeting of the Association for Computational Linguistics, 2005,
pages 25–32 (cited on page 140).
[212] Jesús Giménez and Lluís Màrquez. “Linguistic Features for Automatic Evaluation
of Heterogenous MT Systems”. In: Annual Meeting of the Association for Compu-
tational Linguistics, 2007, pages 256–264 (cited on page 140).
[213] Sebastian Padó, Daniel M. Cer, Michel Galley, Dan Jurafsky, and Christopher D.
Manning. “Measuring machine translation quality as semantic equivalence: A met-
ric based on entailment features”. In: volume 23. 2-3. Machine Translation, 2009,
pages 181–193 (cited on page 140).
[214] Karolina Owczarzak, Josef van Genabith, and Andy Way. “Dependency-Based Au-
tomatic Evaluation for Machine Translation”. In: Annual Meeting of the Associa-
tion for Computational Linguistics, 2007, pages 80–87 (cited on page 140).
[215] Karolina Owczarzak, Josef van Genabith, and Andy Way. “Labelled Dependencies
in Machine Translation Evaluation”. In: Annual Meeting of the Association for
Computational Linguistics, 2007, pages 104–111 (cited on page
[216] Hui Yu, Xiaofeng Wu, Jun Xie, Wenbin Jiang, Qun Liu, and Shouxun Lin. “RED:
A Reference Dependency Based MT Evaluation Metric”. In: Annual Meeting of
the Association for Computational Linguistics, 2014, pages 2042–2051 (cited on
page 140).
[217] Rafael E. Banchs and Haizhou Li. “AM-FM: A Semantic Framework for Trans-
lation Quality Assessment”. In: Annual Meeting of the Association for Computa-
tional Linguistics, 2011, pages 153–158 (cited on page 140).
[218] Florence Reeder. “Measuring MT adequacy using latent semantic analysis”. In:
Proceedings of the 7th Conference of the Association for Machine Translation of
the Americas. Cambridge, Massachusetts, 2006, pages 176–184 (cited on page 140).
[219] Chi-kiu Lo, Meriem Beloucif, Markus Saers, and Dekai Wu. “XMEANT: Bet-
ter semantic MT evaluation without reference translations”. In: Annual Meeting
of the Association for Computational Linguistics, 2014, pages 765–771 (cited on
page 140).
[220] David Vilar, Jia Xu, Luis Fernando D’Haro, and Hermann Ney. “Error Analysis of
Statistical Machine Translation Output”. In: European Language Resources Asso-
ciation (ELRA), 2006, pages 697–702 (cited on page 140).
[221] Maja Popovic, Aljoscha Burchardt, et al. “From human to automatic error clas-
sification for machine translation output”. In: European Association for Machine
Translation, 2011 (cited on page 140).
[222] Ângela Costa, Wang Ling, Tiago Luís, Rui Correia, and Luísa Coheur. “A linguistically
motivated taxonomy for Machine Translation error analysis”. In: volume 29.
2. Machine Translation, 2015, pages 127–161 (cited on page 140).
[223] Arle Lommel, Aljoscha Burchardt, Maja Popovic, Kim Harris, Eleftherios Avramidis,
and Hans Uszkoreit. “Using a new analytic measure for the annotation and analy-
sis of MT errors on real data”. In: European Association for Machine Translation,
2014, pages 165–172 (cited on page 140).
[224] Maja Popovic, Adrià de Gispert, Deepa Gupta, Patrik Lambert, Hermann Ney, José
B. Mariño, Marcello Federico, and Rafael E. Banchs. “Morpho-syntactic Informa-
tion for Automatic Error Analysis of Statistical Machine Translation Output”. In:
Annual Meeting of the Association for Computational Linguistics, 2006, pages 1–
6 (cited on page 140).
[225] Maja Popovic and Hermann Ney. “Word Error Rates: Decomposition over POS
classes and Applications for Error Analysis”. In: Annual Meeting of the Associa-
tion for Computational Linguistics, 2007, pages 48–55 (cited on page 140).
[226] Meritxell González, Laura Mascarell, and Lluís Màrquez. “tSEARCH: Flexible
and Fast Search over Automatic Translations for Improved Quality/Error Analy-
sis”. In: Annual Meeting of the Association for Computational Linguistics, 2013,
pages 181–186 (cited on page 140).
[227] Alex Kulesza and Stuart Shieber. “A learning approach to improving sentence-
level MT evaluation”. In: Proceedings of the 10th International Conference on
Theoretical and Methodological Issues in Machine Translation, 2004 (cited on
page 141).
[228] Simon Corston-Oliver, Michael Gamon, and Chris Brockett. “A machine learning
approach to the automatic evaluation of machine translation”. In: Annual Meeting
of the Association for Computational Linguistics, 2001, pages 148–155 (cited on
page 141).
[229] Joshua S Albrecht and Rebecca Hwa. “Regression for machine translation evalu-
ation at the sentence level”. In: volume 22. 1-2. Springer, 2008, page 1 (cited on
page 141).
[230] Kevin Duh. “Ranking vs. regression in machine translation evaluation”. In: Pro-
ceedings of the Third Workshop on Statistical Machine Translation, 2008, pages 191–
194 (cited on page 141).
[231] Boxing Chen, Hongyu Guo, and Roland Kuhn. “Multi-level evaluation for ma-
chine translation”. In: Proceedings of the Tenth Workshop on Statistical Machine
Translation, 2015, pages 361–365 (cited on page 141).
[232] Franz Josef Och. “Minimum Error Rate Training in Statistical Machine Transla-
tion”. In: Annual Meeting of the Association for Computational Linguistics, 2003,
pages 160–167 (cited on pages 141, 219, 571).
[233] Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang
Liu. “Minimum Risk Training for Neural Machine Translation”. In: Annual Meet-
ing of the Association for Computational Linguistics, 2016 (cited on pages 141,
377, 450, 453, 478, 479).
[234] Xiaodong He and Li Deng. “Maximum expected BLEU training of phrase and lexicon
translation models”. In: Annual Meeting of the Association for Computational
Linguistics, 2012, pages 292–301 (cited on page 141).
[235] Markus Freitag, Isaac Caswell, and Scott Roy. “APE at Scale and Its Implications
on MT Evaluation Biases”. In: Annual Meeting of the Association for Computa-
tional Linguistics, 2019, pages 34–44 (cited on page 141).
[236] Ergun Biçici, Declan Groves, and Josef van Genabith. “Predicting sentence trans-
lation quality using extrinsic and language independent features”. In: volume 27.
3-4. Machine Translation, 2013, pages 171–192 (cited on page 141).
[237] Ergun Biçici, Qun Liu, and Andy Way. “Referential Translation Machines for Pre-
dicting Translation Quality and Related Statistics”. In: Annual Meeting of the As-
sociation for Computational Linguistics, 2015, pages 304–308 (cited on page 141).
[238] Kevin Knight. “Decoding Complexity in Word-Replacement Translation Models”.
In: volume 25. 4. Computational Linguistics, 1999, pages 607–615 (cited on pages 158,
179, 223).
[239] Claude Elwood Shannon. “Communication theory of secrecy systems”. In: vol-
ume 28. 4. Bell system technical journal, 1949, pages 656–715 (cited on page 161).
[240] Franz Josef Och and Hermann Ney. “A Systematic Comparison of Various Sta-
tistical Alignment Models”. In: volume 29. 1. Computational Linguistics, 2003,
pages 19–51 (cited on pages 164, 178, 186, 195, 632).
[241] Robert C. Moore. “Improving IBM Word Alignment Model 1”. In: Annual Meeting
of the Association for Computational Linguistics, 2004, pages 518–525 (cited on
page 178).
[242] 肖桐, 李天宁, 陈如山, 朱靖波, and 王会珍. “面向统计机器翻译的重对齐方法研究”.
In: volume 24. 110–116. 中文信息学报, 2010 (cited on page 178).
[243] Hua Wu and Haifeng Wang. “Improving Statistical Word Alignment with Ensem-
ble Methods”. In: volume 3651. International Joint Conference on Natural Lan-
guage Processing, 2005, pages 462–473 (cited on page 178).
[244] Ye-Yi Wang and Wayne Ward. “Grammar Inference and Statistical Machine Trans-
lation”. In: Carnegie Mellon University, 1999 (cited on page 178).
[245] Ido Dagan, Kenneth Ward Church, and William Gale. “Robust Bilingual Word
Alignment for Machine Aided Translation”. In: Very Large Corpora, 1993 (cited
on page 178).
[246] Abraham Ittycheriah and Salim Roukos. “A Maximum Entropy Word Aligner for
Arabic-English Machine Translation”. In: Annual Meeting of the Association for
Computational Linguistics, 2005 (cited on page
[247] William A. Gale and Kenneth Ward Church. “Identifying Word Correspondences
in Parallel Texts”. In: Morgan Kaufmann, 1991 (cited on page 178).
[248] Tong Xiao and Jingbo Zhu. “Unsupervised sub-tree alignment for tree-to-tree trans-
lation”. In: volume 48. Journal of Artificial Intelligence Research, 2013, pages 733–
782 (cited on pages 178, 268, 269).
[249] Percy Liang, Benjamin Taskar, and Dan Klein. “Alignment by Agreement”. In:
Annual Meeting of the Association for Computational Linguistics, 2006 (cited on
page 178).
[250] Chris Dyer, Victor Chahuneau, and Noah A. Smith. “A Simple, Fast, and Effective
Reparameterization of IBM Model 2”. In: Annual Meeting of the Association for
Computational Linguistics, 2013, pages 644–648 (cited on pages 178, 212, 632).
[251] Benjamin Taskar, Simon Lacoste-Julien, and Dan Klein. “A Discriminative Match-
ing Approach to Word Alignment”. In: Annual Meeting of the Association for Com-
putational Linguistics, 2005, pages 73–80 (cited on pages 178, 212).
[252] Alexander Fraser and Daniel Marcu. “Measuring Word Alignment Quality for Sta-
tistical Machine Translation”. In: volume 33. 3. Computational Linguistics, 2007,
pages 293–303 (cited on page 178).
[253] John DeNero and Dan Klein. “Tailoring Word Alignments to Syntactic Machine
Translation”. In: Annual Meeting of the Association for Computational Linguistics,
2007 (cited on page 178).
[254] Paul C. Davis, Zhuli Xie, and Kevin Small. “All Links are not the Same: Evaluating
Word Alignments for Statistical Machine Translation”. In: Machine Translation
Summit XI, 2007 (cited on page 178).
[255] , , , , and . 词对
”. In: volume 23. 88-94. 中文信息学报, 2009 (cited on page 178).
[256] Shi Feng, Shujie Liu, Mu Li, and Ming Zhou. “Implicit Distortion and Fertility
Models for Attention-based Encoder-Decoder NMT Model”. In: volume abs/1601.03317.
CoRR, 2016 (cited on page 179).
[257] Raghavendra Udupa, Tanveer A. Faruquie, and Hemanta Kumar Maji. “An Algo-
rithmic Framework for Solving the Decoding Problem in Statistical Machine Trans-
lation”. In: International Conference on Computational Linguistics, 2004 (cited on
page 179).
[258] Sebastian Riedel and James Clarke. “Revisiting Optimal Decoding for Machine
Translation IBM Model 4”. In: Annual Meeting of the Association for Computa-
tional Linguistics, 2009 (cited on page 179).
[259] Raghavendra Udupa and Hemanta Kumar Maji. “Computational Complexity of
Statistical Machine Translation”. In: Annual Meeting of the Association for Com-
putational Linguistics, 2006 (cited on page 179).
[260] Gregor Leusch, Evgeny Matusov, and Hermann Ney. “Complexity of Finding the
BLEU-optimal Hypothesis in a Confusion Network”. In: Annual Meeting of the As-
sociation for Computational Linguistics, 2008, pages 839–847 (cited on page 179).
[261] Noah Fleming, Antonina Kolokolova, and Renesa Nizamee. “Complexity of align-
ment and decoding problems: restrictions and approximations”. In: volume 29. 3-4.
Machine Translation, 2015, pages 163–187 (cited on page 179).
[262] Stephan Vogel, Hermann Ney, and Christoph Tillmann. “HMM-Based Word Align-
ment in Statistical Translation”. In: International Conference on Computational
Linguistics, 1996, pages 836–841 (cited on pages 181, 185).
[263] D. C. Brown. “Decentering Distortion of Lenses”. In: volume 32. Photogrammetric
Engineering, 1966, pages 444–462 (cited on page 198).
[264] David Claus and Andrew W. Fitzgibbon. “A Rational Function Lens Distortion
Model for General Cameras”. In: IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, 2005, pages 213–219 (cited on page 198).
[265] Jerneja Žganec Gros. “MSD Recombination Method in Statistical Machine Transla-
tion”. In: volume 1060. American Institute of Physics, 2008, pages 186–189 (cited
on page 198).
[266] Deyi Xiong, Qun Liu, and Shouxun Lin. “Maximum Entropy Based Phrase Re-
ordering Model for Statistical Machine Translation”. In: Annual Meeting of the
Association for Computational Linguistics, 2006 (cited on pages 198, 216, 228).
[267] Franz Josef Och and Hermann Ney. “The Alignment Template Approach to Sta-
tistical Machine Translation”. In: volume 30. 4. Computational Linguistics, 2004,
pages 417–449 (cited on pages 198, 216, 228).
[268] Shankar Kumar and William J. Byrne. “Local Phrase Reordering Models for Sta-
tistical Machine Translation”. In: Annual Meeting of the Association for Compu-
tational Linguistics, 2005, pages 161–168 (cited on pages 198, 216, 228).
[269] Peng Li, Yang Liu, Maosong Sun, Tatsuya Izuha, and Dakun Zhang. “A Neural
Reordering Model for Phrase-based Translation”. In: Annual Meeting of the Asso-
ciation for Computational Linguistics, 2014, pages 1897–1907 (cited on pages
[270] David Chiang, Adam Lopez, Nitin Madnani, Christof Monz, Philip Resnik, and
Michael Subotin. “The Hiero Machine Translation System: Extensions, Evaluation,
and Analysis”. In: Annual Meeting of the Association for Computational Linguis-
tics, 2005, pages 779–786 (cited on page 198).
[271] Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, and Richard Socher.
“Non-Autoregressive Neural Machine Translation”. In: International Conference
on Learning Representations, 2018 (cited on pages 198, 382, 476, 486, 488–490).
[272] Andrew J. Viterbi. “Error bounds for convolutional codes and an asymptotically
optimum decoding algorithm”. In: volume 13. 2. IEEE Transactions on Information
Theory, 1967, pages 260–269 (cited on page 207).
[273] Philipp Koehn and Kevin Knight. “Estimating Word Translation Probabilities from
Unrelated Monolingual Corpora Using the EM Algorithm”. In: AAAI Press, 2000,
pages 711–715 (cited on page 212).
[274] Franz Josef Och and Hermann Ney. “A Comparison of Alignment Models for
Statistical Machine Translation”. In: Morgan Kaufmann, 2000, pages 1086–1090
(cited on page 212).
[275] Kevin Knight. “Learning a translation lexicon from monolingual corpora”. In: An-
nual Meeting of the Association for Computational Linguistics, 2002, pages 9–16
(cited on page 213).
[276] M. J. D. Powell. “An efficient method for finding the minimum of a function of
several variables without calculating derivatives”. In: volume 7. 2. The Computer
Journal, 1964, pages 155–162 (cited on page 220).
[277] David Chiang, Yuval Marton, and Philip Resnik. “Online Large-Margin Training
of Syntactic and Structural Translation Features”. In: Annual Meeting of the Asso-
ciation for Computational Linguistics, 2008, pages 224–233 (cited on pages 222,
[278] Mark Hopkins and Jonathan May. “Tuning as Ranking”. In: Annual Meeting of
the Association for Computational Linguistics, 2011, pages 1352–1362 (cited on
pages 222, 229).
[279] Franz Josef Och and Hans Weber. “Improving Statistical Natural Language Trans-
lation with Categories and Rules”. In: Annual Meeting of the Association for Com-
putational Linguistics, 1998, pages 985–989 (cited on page
[280] Franz Josef Och. “Statistical machine translation: from single word models to align-
ment templates”. PhD thesis. 2002 (cited on page 228).
[281] Ye-Yi Wang and Alex Waibel. “Modeling with Structures in Statistical Machine
Translation”. In: Annual Meeting of the Association for Computational Linguistics,
1998, pages 1357–1363 (cited on pages 228, 278).
[282] Taro Watanabe, Eiichiro Sumita, and Hiroshi G. Okuno. “Chunk-Based Statistical
Translation”. In: Annual Meeting of the Association for Computational Linguistics,
2003, pages 303–310 (cited on page 228).
[283] Daniel Marcu. “Towards a Unified Approach to Memory- and Statistical-Based
Machine Translation”. In: Morgan Kaufmann Publishers, 2001, pages 378–385
(cited on page 228).
[284] Philipp Koehn, Franz Josef Och, and Daniel Marcu. “Statistical Phrase-Based Trans-
lation”. In: Annual Meeting of the Association for Computational Linguistics, 2003
(cited on pages 228, 229).
[285] Richard Zens, Franz Josef Och, and Hermann Ney. “Phrase-Based Statistical Ma-
chine Translation”. In: Annual Conference on Artificial Intelligence, 2002, pages 18–
32 (cited on pages 228, 570).
[286] Richard Zens and Hermann Ney. “Improvements in Phrase-Based Statistical Ma-
chine Translation”. In: Annual Meeting of the Association for Computational Lin-
guistics, 2004, pages 257–264 (cited on page 228).
[287] Daniel Marcu and Daniel Wong. “A Phrase-Based, Joint Probability Model for
Statistical Machine Translation”. In: Conference on Empirical Methods in Natural
Language Processing, 2002, pages 133–139 (cited on page 228).
[288] John DeNero, Dan Gillick, James Zhang, and Dan Klein. “Why Generative Phrase
Models Underperform Surface Heuristics”. In: Annual Meeting of the Association
for Computational Linguistics, 2006, pages 31–38 (cited on page 228).
[289] German Sanchis-Trilles, Daniel Ortiz-Martinez, Jesus Gonzalez-Rubio, Jorge Gon-
zalez, and Francisco Casacuberta. “Bilingual segmentation for phrasetable pruning
in Statistical Machine Translation”. In: Conference of the European Association for
Machine Translation, 2011, pages 257–264 (cited on page 228).
[290] Graeme W. Blackwood, Adrià de Gispert, and William Byrne. “Phrasal Segmenta-
tion Models for Statistical Machine Translation”. In: International Conference on
Computational Linguistics, 2008, pages 19–22 (cited on page 228).
[291] Deyi Xiong, Min Zhang, and Haizhou Li. “Learning Translation Boundaries for
Phrase-Based Decoding”. In: Annual Meeting of the Association for Computa-
tional Linguistics, 2010, pages 136–144 (cited on page 228).
[292] Christoph Tillmann. “A Unigram Orientation Model for Statistical Machine Translation”.
In: Annual Meeting of the Association for Computational Linguistics, 2004
(cited on page 228).
[293] Masaaki Nagata, Kuniko Saito, Kazuhide Yamamoto, and Kazuteru Ohashi. “A
Clustered Global Phrase Reordering Model for Statistical Machine Translation”.
In: Annual Meeting of the Association for Computational Linguistics, 2006 (cited
on page 228).
[294] Richard Zens and Hermann Ney. “Discriminative Reordering Models for Statistical
Machine Translation”. In: Annual Meeting of the Association for Computational
Linguistics, 2006, pages 55–63 (cited on page 228).
[295] Spence Green, Michel Galley, and Christopher D. Manning. “Improved Models of
Distortion Cost for Statistical Machine Translation”. In: Annual Meeting of the As-
sociation for Computational Linguistics, 2010, pages 867–875 (cited on page 228).
[296] Colin Cherry. “Improved Reordering for Phrase-Based Translation using Sparse
Features”. In: Annual Meeting of the Association for Computational Linguistics,
2013, pages 22–31 (cited on page 228).
[297] Matthias Huck, Joern Wuebker, Felix Rietig, and Hermann Ney. “A Phrase Orien-
tation Model for Hierarchical Machine Translation”. In: Annual Meeting of the As-
sociation for Computational Linguistics, 2013, pages 452–463 (cited on page 228).
[298] Matthias Huck, Stephan Peitz, Markus Freitag, and Hermann Ney. “Discriminative
Reordering Extensions for Hierarchical Phrase-Based Machine Translation”. In:
International Conference on Material Engineering and Advanced Manufacturing
Technology, 2012 (cited on page 228).
[299] Vinh Van Nguyen, Akira Shimazu, Minh Le Nguyen, and Thai Phuong Nguyen.
“Improving a Lexicalized Hierarchical Reordering Model Using Maximum En-
tropy”. In: Machine Translation Summit XII, 2009 (cited on page 228).
[300] Arianna Bisazza and Marcello Federico. “A Survey of Word Reordering in Statis-
tical Machine Translation: Computational Models and Language Phenomena”. In:
volume 42. 2. Computational Linguistics, 2016, pages 163–205 (cited on page
[301] Fei Xia and Michael C. McCord. “Improving a Statistical MT System with Au-
tomatically Learned Rewrite Patterns”. In: International Conference on Computa-
tional Linguistics, 2004 (cited on page 228).
[302] Michael Collins, Philipp Koehn, and Ivona Kucerova. “Clause Restructuring for
Statistical Machine Translation”. In: Annual Meeting of the Association for Com-
putational Linguistics, 2005, pages 531–540 (cited on page 228).
[303] Chao Wang, Michael Collins, and Philipp Koehn. “Chinese Syntactic Reordering
for Statistical Machine Translation”. In: Annual Meeting of the Association for
Computational Linguistics, 2007, pages 737–745 (cited on pages 228, 556).
[304] Xianchao Wu, Katsuhito Sudoh, Kevin Duh, Hajime Tsukada, and Masaaki Na-
gata. “Extracting Pre-ordering Rules from Predicate-Argument Structures”. In: An-
nual Meeting of the Association for Computational Linguistics, 2011, pages 29–37
(cited on page 228).
[305] Christoph Tillmann and Hermann Ney. “Word Re-ordering and DP-based Search
in Statistical Machine Translation”. In: Morgan Kaufmann, 2000, pages 850–856
(cited on page 229).
[306] Wade Shen, Brian Delaney, and Timothy R. Anderson. “An efficient graph search
decoder for phrase-based statistical machine translation”. In: International Sympo-
sium on Computer Architecture, 2006, pages 197–204 (cited on page 229).
[307] Robert C. Moore and Chris Quirk. “Faster Beam-Search Decoding for Phrasal Sta-
tistical Machine Translation”. In: Machine Translation Summit XI, 2007 (cited on
page 229).
[308] Kenneth Heafield, Michael Kayser, and Christopher D. Manning. “Faster Phrase-
Based Decoding by Refining Feature State”. In: Annual Meeting of the Association
for Computational Linguistics, 2014, pages 130–135 (cited on page 229).
[309] Joern Wuebker, Hermann Ney, and Richard Zens. “Fast and Scalable Decoding
with Language Model Look-Ahead for Phrase-based Statistical Machine Transla-
tion”. In: Annual Meeting of the Association for Computational Linguistics, 2012,
pages 28–32 (cited on page 229).
[310] Richard Zens and Hermann Ney. “Improvements in dynamic programming beam
search for phrase-based statistical machine translation”. In: International Workshop
on Spoken Language Translation, 2008, pages 198–205 (cited on page 229).
[311] Franz Josef Och, Daniel Gildea, Sanjeev Khudanpur, Anoop Sarkar, Kenji Yamada,
Alexander M. Fraser, Shankar Kumar, Libin Shen, David Smith, Katherine Eng,
Viren Jain, Zhen Jin, and Dragomir R. Radev. “A Smorgasbord of Features for
Statistical Machine Translation”. In: Annual Meeting of the Association for Com-
putational Linguistics, 2004, pages 161–168 (cited on pages 229, 279).
[312] David Chiang, Kevin Knight, and Wei Wang. “11,001 New Features for Statistical
Machine Translation”. In: Annual Meeting of the Association for Computational
Linguistics, 2009, pages 218–226 (cited on page 229).
[313] Daniel Gildea. “Loosely Tree-Based Alignment for Machine Translation”. In: An-
nual Meeting of the Association for Computational Linguistics, 2003, pages 80–87
(cited on page 229).
[314] Phil Blunsom, Trevor Cohn, and Miles Osborne. “A Discriminative Latent Variable
Model for Statistical Machine Translation”. In: Annual Meeting of the Association
for Computational Linguistics, 2008, pages 200–208 (cited on page 229).
[315] Phil Blunsom, Trevor Cohn, Chris Dyer, and Miles Osborne. “A Gibbs Sampler for
Phrasal Synchronous Grammar Induction”. In: Annual Meeting of the Association
for Computational Linguistics, 2009, pages 782–790 (cited on page 229).
[316] Trevor Cohn and Phil Blunsom. “A Bayesian Model of Syntax-Directed Tree to
String Grammar Induction”. In: Annual Meeting of the Association for Computa-
tional Linguistics, 2009, pages 352–361 (cited on page 229).
[317] David A. Smith and Jason Eisner. “Minimum Risk Annealing for Training Log-
Linear Models”. In: Annual Meeting of the Association for Computational Lin-
guistics, 2006 (cited on page 229).
[318] Zhifei Li and Jason Eisner. “First- and Second-Order Expectation Semirings with
Applications to Minimum-Risk Training on Translation Forests”. In: Annual Meet-
ing of the Association for Computational Linguistics, 2009, pages 40–51 (cited on
page 229).
[319] Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki. “Online Large-
Margin Training for Statistical Machine Translation”. In: Annual Meeting of the As-
sociation for Computational Linguistics, 2007, pages 764–773 (cited on page 229).
[320] Markus Dreyer and Yuanzhe Dong. “APRO: All-Pairs Ranking Optimization for
MT Tuning”. In: Annual Meeting of the Association for Computational Linguistics,
2015, pages 1018–1023 (cited on page 229).
[321] Tong Xiao, Derek F. Wong, and Jingbo Zhu. “A Loss-Augmented Approach to
Training Syntactic Machine Translation Systems”. In: volume 24. 11. IEEE Trans-
actions on Audio, Speech, and Language Processing, 2016, pages 2069–2083 (cited
on page 229).
[322] Harold Charles Daumé III. Practical structured learning techniques for natural
language processing. University of Southern California, 2006 (cited on page 229).
[323] Holger Schwenk, Marta R. Costa-jussà, and José A. R. Fonollosa. “Smooth Bilin-
gual N-Gram Translation”. In: Annual Meeting of the Association for Computa-
tional Linguistics, 2007, pages 430–438 (cited on page 229).
[324] Boxing Chen, Roland Kuhn, George Foster, and Howard Johnson. “Unpacking
and Transforming Feature Functions: New Ways to Smooth Phrase Tables”. In:
Machine Translation Summit, 2011 (cited on page 229).
[325] Nan Duan, Hong Sun, and Ming Zhou. “Translation Model Generalization using
Probability Averaging for Machine Translation”. In: International Conference on
Computational Linguistics, 2010 (cited on page 229).
[326] Christopher Quirk and Arul Menezes. “Do we need phrases? Challenging the con-
ventional wisdom in Statistical Machine Translation”. In: Annual Meeting of the
Association for Computational Linguistics, 2006 (cited on page 229).
[327] José B. Mariño, Rafael E. Banchs, Josep Maria Crego, Adrià de Gispert, Patrik
Lambert, José A. R. Fonollosa, and Marta R. Costa-jussà. “N-gram-based Machine
Translation”. In: volume 32. 4. Computational Linguistics, 2006, pages 527–549
(cited on page 229).
[328] Richard Zens, Daisy Stanton, and Peng Xu. “A Systematic Comparison of Phrase
Table Pruning Techniques”. In: Annual Meeting of the Association for Computa-
tional Linguistics, 2012, pages 972–983 (cited on pages 229, 481).
[329] Howard Johnson, Joel D. Martin, George F. Foster, and Roland Kuhn. “Improving
Translation Quality by Discarding Most of the Phrasetable”. In: Annual Meeting
of the Association for Computational Linguistics, 2007, pages 967–975 (cited on
pages 229, 481).
[330] Wang Ling, João Graça, Isabel Trancoso, and Alan W. Black. “Entropy-based Prun-
ing for Phrase-based Machine Translation”. In: Annual Meeting of the Association
for Computational Linguistics, 2012, pages 962–971 (cited on pages 229, 481).
[331] Luke S. Zettlemoyer and Robert C. Moore. “Selective Phrase Pair Extraction for
Improved Statistical Machine Translation”. In: Annual Meeting of the Association
for Computational Linguistics, 2007, pages 209–212 (cited on page 229).
[332] Matthias Eck, Stephan Vogel, and Alex Waibel. “Translation Model Pruning via
Usage Statistics for Statistical Machine Translation”. In: Annual Meeting of the
Association for Computational Linguistics, 2007, pages 21–24 (cited on page
[333] Chris Callison-Burch, Colin J. Bannard, and Josh Schroeder. “Scaling Phrase-Based
Statistical Machine Translation to Larger Corpora and Longer Phrases”. In: Annual
Meeting of the Association for Computational Linguistics, 2005, pages 255–262
(cited on page 229).
[334] Richard Zens and Hermann Ney. “Efficient Phrase-Table Representation for Ma-
chine Translation with Applications to Online MT and Speech Translation”. In: An-
nual Meeting of the Association for Computational Linguistics, 2007, pages 492–
499 (cited on page 229).
[335] Ulrich Germann. “Dynamic Phrase Tables for Machine Translation in an Interac-
tive Post-editing Scenario”. In: Association for Machine Translation in the Ameri-
cas, 2014 (cited on page 229).
[336] David Chiang. “Hierarchical Phrase-Based Translation”. In: volume 33. 2. Compu-
tational Linguistics, 2007, pages 201–228 (cited on pages 236, 241, 251).
[337] John Cocke and J.T. Schwartz. Programming Languages and Their Compilers: Pre-
liminary Notes. Courant Institute of Mathematical Sciences, New York University,
1970 (cited on page 243).
[338] Daniel H. Younger. “Recognition and Parsing of Context-Free Languages in Time
n^3”. In: volume 10. 2. Information and Control, 1967, pages 189–208 (cited on
page 243).
[339] Tadao Kasami. “An efficient recognition and syntax-analysis algorithm for context-
free languages”. In: Coordinated Science Laboratory Report no. R-257, 1966 (cited
on page 243).
[340] Liang Huang and David Chiang. “Better k-best Parsing”. In: Annual Meeting of the
Association for Computational Linguistics, 2005, pages 53–64 (cited on page 246).
[341] Dekai Wu. “Stochastic Inversion Transduction Grammars and Bilingual Parsing of
Parallel Corpora”. In: volume 23. 3. Computational Linguistics, 1997, pages 377–
403 (cited on pages 251, 278).
[342] Liang Huang, Kevin Knight, and Aravind Joshi. “Statistical syntax-directed trans-
lation with extended domain of locality”. In: Computationally Hard Problems &
Joint Inference in Speech & Language Processing, 2006, pages 66–73 (cited on
page 251).
[343] Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. “What’s in a
translation rule?” In: Proceedings of the Human Language Technology Conference
of the North American Chapter of the Association for Computational Linguistics,
2004, pages 273–280 (cited on pages 251, 258).
[344] Jason Eisner. “Learning Non-Isomorphic Tree Mappings for Machine Translation”.
In: Annual Meeting of the Association for Computational Linguistics, 2003, pages 205–
208 (cited on page 251).
[345] Min Zhang, Hongfei Jiang, AiTi Aw, Haizhou Li, Chew Lim Tan, and Sheng Li.
“A Tree Sequence Alignment-based Tree-to-Tree Translation Model”. In: Annual
Meeting of the Association for Computational Linguistics, 2008, pages 559–567
(cited on page 251).
[346] Daniel Marcu, Wei Wang, Abdessamad Echihabi, and Kevin Knight. “SPMT: Sta-
tistical Machine Translation with Syntactified Target Language Phrases”. In: An-
nual Meeting of the Association for Computational Linguistics, 2006, pages 44–52
(cited on pages 264, 278).
[347] Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. “Building a large
annotated Chinese corpus: the Penn Chinese treebank”. In: volume 11. 2. Journal of
Natural Language Engineering, 2005, pages 207–238 (cited on page 265).
[348] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. “Building a
Large Annotated Corpus of English: The Penn Treebank”. In: volume 19. 2. Com-
putational Linguistics, 1993, pages 313–330 (cited on page 265).
[349] Hao Zhang, Liang Huang, Daniel Gildea, and Kevin Knight. “Synchronous Bi-
narization for Machine Translation”. In: Annual Meeting of the Association for
Computational Linguistics, 2006 (cited on pages 266, 277).
[350] Tong Xiao, Mu Li, Dongdong Zhang, Jingbo Zhu, and Ming Zhou. “Better Syn-
chronous Binarization for Machine Translation”. In: Annual Meeting of the Asso-
ciation for Computational Linguistics, 2009, pages 362–370 (cited on pages 266,
[351] Dan Klein and Christopher D. Manning. “Accurate Unlexicalized Parsing”. In: An-
nual Meeting of the Association for Computational Linguistics, 2003, pages 423–
430 (cited on page 266).
[352] Yang Liu, Yajuan Lü, and Qun Liu. “Improving Tree-to-Tree Translation with
Packed Forests”. In: Annual Meeting of the Association for Computational Lin-
guistics, 2009, pages 558–566 (cited on pages 266, 278).
[353] Declan Groves, Mary Hearne, and Andy Way. “Robust Sub-Sentential Alignment
of Phrase-Structure Trees”. In: International Conference on Computational Linguis-
tics, 2004 (cited on page
[354] Jun Sun, Min Zhang, and Chew Lim Tan. “Discriminative Induction of Sub-Tree
Alignment using Limited Labeled Data”. In: International Conference on Compu-
tational Linguistics, 2010, pages 1047–1055 (cited on pages 268, 269).
[355] Yang Liu, Tian Xia, Xinyan Xiao, and Qun Liu. “Weighted Alignment Matrices
for Statistical Machine Translation”. In: Annual Meeting of the Association for
Computational Linguistics, 2009, pages 1017–1026 (cited on pages 268, 269).
[356] Jun Sun, Min Zhang, and Chew Lim Tan. “Exploring Syntactic Structural Fea-
tures for Sub-Tree Alignment Using Bilingual Tree Kernels”. In: Annual Meeting
of the Association for Computational Linguistics, 2010, pages 306–315 (cited on
page 269).
[357] Dan Klein and Christopher D. Manning. “Parsing and Hypergraphs”. In: volume 65.
3. New Developments in Parsing Technology, 2001, pages 123–134 (cited on page 270).
[358] Joshua Goodman. “Semiring Parsing”. In: volume 25. 4. Computational Linguis-
tics, 1999, pages 573–605 (cited on page 271).
[359] Jason Eisner. “Parameter Estimation for Probabilistic Finite-State Transducers”.
In: Annual Meeting of the Association for Computational Linguistics, 2002, pages 1–
8 (cited on page 271).
[360] Jingbo Zhu and Tong Xiao. “Improving Decoding Generalization for Tree-to-String
Translation”. In: Annual Meeting of the Association for Computational Linguistics,
2011, pages 418–423 (cited on pages 275, 278).
[361] Hiyan Alshawi, Adam L. Buchsbaum, and Fei Xia. “A Comparison of Head Trans-
ducers and Transfer for a Limited Domain Translation Application”. In: Morgan
Kaufmann Publishers, 1997, pages 360–365 (cited on page 278).
[362] Dekai Wu. “Trainable Coarse Bilingual Grammars for Parallel Text Bracketing”.
In: Third Workshop on Very Large Corpora, 1995 (cited on page 278).
[363] Dekai Wu and Hongsing Wong. “Machine Translation with a Stochastic Grammat-
ical Channel”. In: Morgan Kaufmann Publishers, 1998, pages 1408–1415 (cited on
page 278).
[364] J. A. Sánchez and J. M. Benedí. “Obtaining Word Phrases with Stochastic Inversion
Transduction Grammars for Phrase-based Statistical Machine Translation”. In: An-
nual Meeting of the Association for Computational Linguistics, 2006 (cited on
page 278).
[365] Hao Zhang, Chris Quirk, Robert C. Moore, and Daniel Gildea. “Bayesian Learning
of Non-Compositional Phrases with Synchronous Parsing”. In: Annual Meeting of
the Association for Computational Linguistics, 2008 (cited on page 278).
[366] Andreas Zollmann, Ashish Venugopal, Franz Josef Och, and Jay M. Ponte. “A
Systematic Comparison of Phrase-Based, Hierarchical and Syntax-Augmented Sta-
tistical MT”. In: International Conference on Computational Linguistics, 2008,
pages 1145–1152 (cited on page 278).
[367] Taro Watanabe, Hajime Tsukada, and Hideki Isozaki. “Left-to-Right Target
Generation for Hierarchical Phrase-Based Translation”. In: Annual Meeting of the
Association for Computational Linguistics, 2006 (cited on page 278).
[368] Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. “What’s in a trans-
lation rule?” In: Annual Meeting of the Association for Computational Linguistics,
2004, pages 273–280 (cited on page 278).
[369] Bryant Huang and Kevin Knight. “Relabeling Syntax Trees to Improve Syntax-
Based Machine Translation Quality”. In: Annual Meeting of the Association for
Computational Linguistics, 2006 (cited on page 278).
[370] Steve DeNeefe, Kevin Knight, Wei Wang, and Daniel Marcu. “What Can Syntax-
Based MT Learn from Phrase-Based MT?” In: Annual Meeting of the Association
for Computational Linguistics, 2007, pages 755–763 (cited on pages 278, 535).
[371] Ding Liu and Daniel Gildea. “Improved Tree-to-String Transducer for Machine
Translation”. In: Annual Meeting of the Association for Computational Linguistics,
2008, pages 62–69 (cited on page 278).
[372] Andreas Zollmann and Ashish Venugopal. “Syntax Augmented Machine Transla-
tion via Chart Parsing”. In: Annual Meeting of the Association for Computational
Linguistics, 2006, pages 138–141 (cited on page 278).
[373] Yuval Marton and Philip Resnik. “Soft Syntactic Constraints for Hierarchical Phrased-
Based Translation”. In: Annual Meeting of the Association for Computational Lin-
guistics, 2008, pages 1003–1011 (cited on page 278).
[374] Rebecca Nesson, Stuart M. Shieber, and Alexander Rush. “Induction of probabilis-
tic synchronous tree-insertion grammars for machine translation”. In: Annual Meet-
ing of the Association for Computational Linguistics, 2006 (cited on page 278).
[375] Min Zhang, Hongfei Jiang, Ai Ti Aw, Jun Sun, Sheng Li, and Chew Lim Tan.
A Tree-to-Tree Alignment-based Model for Statistical Machine Translation. 2007
(cited on page 278).
[376] Haitao Mi, Liang Huang, and Qun Liu. “Forest-Based Translation”. In: Annual
Meeting of the Association for Computational Linguistics, 2008, pages 192–199
(cited on page
[377] Haitao Mi and Liang Huang. “Forest-based Translation Rule Extraction”. In: An-
nual Meeting of the Association for Computational Linguistics, 2008, pages 206–
214 (cited on page 278).
[378] Jiajun Zhang, Feifei Zhai, and Chengqing Zong. “Augmenting String-to-Tree Trans-
lation Models with Fuzzy Use of Source-side Syntax”. In: Annual Meeting of the
Association for Computational Linguistics, 2011, pages 204–215 (cited on page 278).
[379] Martin Popel, David Mareček, Nathan Green, and Zdeněk Žabokrtský. “Influence
of Parser Choice on Dependency-Based MT”. In: Annual Meeting of the Associa-
tion for Computational Linguistics, 2011, pages 433–439 (cited on page 278).
[380] Tong Xiao, Jingbo Zhu, Hao Zhang, and Muhua Zhu. “An Empirical Study of
Translation Rule Extraction with Multiple Parsers”. In: Chinese Information Pro-
cessing Society of China, 2010, pages 1345–1353 (cited on page 278).
[381] Feifei Zhai, Jiajun Zhang, Yu Zhou, and Chengqing Zong. “Unsupervised Tree
Induction for Tree-based Translation”. In: volume 1. Transactions of the Association
for Computational Linguistics, 2013, pages 243–254 (cited on page 278).
[382] Christopher Quirk and Arul Menezes. “Dependency treelet translation: the con-
vergence of statistical and example-based machine-translation?” In: volume 20. 1.
Machine Translation, 2006, pages 43–65 (cited on page 279).
[383] Deyi Xiong, Qun Liu, and Shouxun Lin. “A Dependency Treelet String Correspon-
dence Model for Statistical Machine Translation”. In: Annual Meeting of the As-
sociation for Computational Linguistics, 2007, pages 40–47 (cited on page 279).
[384] Dekang Lin. “A Path-based Transfer Model for Machine Translation”. In: Interna-
tional Conference on Computational Linguistics, 2004 (cited on page 279).
[385] Yuan Ding and Martha Palmer. “Machine Translation Using Probabilistic Syn-
chronous Dependency Insertion Grammars”. In: Annual Meeting of the Associ-
ation for Computational Linguistics, 2005, pages 541–548 (cited on page 279).
[386] Hongshen Chen, Jun Xie, Fandong Meng, Wenbin Jiang, and Qun Liu. “A Depen-
dency Edge-based Transfer Model for Statistical Machine Translation”. In: Annual
Meeting of the Association for Computational Linguistics, 2014, pages 1103–1113
(cited on page 279).
[387] Jinsong Su, Yang Liu, Haitao Mi, Hongmei Zhao, Yajuan Lü, and Qun Liu.
“Dependency-Based Bracketing Transduction Grammar for Statistical Machine Translation”. In:
Chinese Information Processing Society of China, 2010, pages 1185–1193 (cited
on page 279).
[388] Jun Xie, Jinan Xu, and Qun Liu. “Augment Dependency-to-String Translation with
Fixed and Floating Structures”. In: Annual Meeting of the Association for Compu-
tational Linguistics, 2014, pages 2217–2226 (cited on page 279).
[389] Liangyou Li, Andy Way, and Qun Liu. “Dependency Graph-to-String Translation”.
In: Annual Meeting of the Association for Computational Linguistics, 2015, pages 33–
43 (cited on page 279).
[390] Haitao Mi and Qun Liu. “Constituency to Dependency Translation with Forests”.
In: Annual Meeting of the Association for Computational Linguistics, 2010, pages 1433–
1442 (cited on page 279).
[391] Zhaopeng Tu, Yang Liu, Young-Sook Hwang, Qun Liu, and Shouxun Lin. “Depen-
dency Forest for Statistical Machine Translation”. In: International Conference on
Computational Linguistics, 2010, pages 1092–1100 (cited on page 279).
[392] Srinivas Bangalore, German Bordel, and Giuseppe Riccardi. “Computing consensus
translation from multiple machine translation systems”. In: IEEE Workshop on
Automatic Speech Recognition and Understanding, 2001, pages 351–354 (cited on
page 279).
[393] Antti-Veikko I. Rosti, Necip Fazil Ayan, Bing Xiang, Spyridon Matsoukas, Richard
M. Schwartz, and Bonnie J. Dorr. “Combining Outputs from Multiple Machine
Translation Systems”. In: Annual Meeting of the Association for Computational
Linguistics, 2007, pages 228–235 (cited on page 279).
[394] Tong Xiao, Jingbo Zhu, and Tongran Liu. “Bagging and boosting statistical ma-
chine translation systems”. In: volume 195. Artificial Intelligence, 2013, pages 496–
527 (cited on pages 279, 478, 495).
[395] Yang Feng, Yang Liu, Haitao Mi, Qun Liu, and Yajuan Lü. “Lattice-based System
Combination for Statistical Machine Translation”. In: Annual Meeting of the Asso-
ciation for Computational Linguistics, 2009, pages 1105–1113 (cited on page 279).
[396] Xiaodong He, Mei Yang, Jianfeng Gao, Patrick Nguyen, and Robert C. Moore.
“Indirect-HMM-based Hypothesis Alignment for Combining Outputs from Ma-
chine Translation Systems”. In: Annual Meeting of the Association for Compu-
tational Linguistics, 2008, pages 98–107 (cited on page 279).
[397] Chi-Ho Li, Xiaodong He, Yupeng Liu, and Ning Xi. “Incremental HMM Align-
ment for MT System Combination”. In: Annual Meeting of the Association for
Computational Linguistics, 2009, pages 949–957 (cited on page 279).
[398] Yang Liu, Haitao Mi, Yang Feng, and Qun Liu. “Joint Decoding with Multiple
Translation Models”. In: Annual Meeting of the Association for Computational
Linguistics, 2009, pages 576–584 (cited on page 279).
[399] Mu Li, Nan Duan, Dongdong Zhang, Chi-Ho Li, and Ming Zhou. “Collaborative
Decoding: Partial Hypothesis Re-ranking Using Translation Consensus between
Decoders”. In: Annual Meeting of the Association for Computational Linguistics,
2009, pages 585–592 (cited on page 279).
[400] Tong Xiao, Jingbo Zhu, Chunliang Zhang, and Tongran Liu. “Syntactic Skeleton-
Based Translation”. In: AAAI Conference on Artificial Intelligence, 2016, pages 2856–
2862 (cited on pages 279, 535).
[401] Eugene Charniak. “Immediate-Head Parsing for Language Models”. In: Morgan
Kaufmann Publishers, 2001, pages 116–123 (cited on page 279).
[402] Libin Shen, Jinxi Xu, and Ralph M. Weischedel. “A New String-to-Dependency
Machine Translation Algorithm with a Target Dependency Language Model”. In:
Annual Meeting of the Association for Computational Linguistics, 2008, pages 577–
585 (cited on page 279).
[403] Tong Xiao, Jingbo Zhu, and Muhua Zhu. “Language Modeling for Syntax-Based
Machine Translation Using Tree Substitution Grammars: A Case Study on Chinese-
English Translation”. In: volume 10. 4. ACM Transactions on Asian Language
Information Processing (TALIP), 2011, pages 1–29 (cited on pages 279, 499).
[404] Peter F. Brown, Vincent J. Della Pietra, Peter V. De Souza, Jennifer C. Lai, and
Robert L. Mercer. “Class-based n-gram models of natural language”. In: volume 18.
4. Computational linguistics, 1992, pages 467–479 (cited on page 288).
[405] Tomas Mikolov and Geoffrey Zweig. “Context dependent recurrent neural net-
work language model”. In: IEEE Spoken Language Technology Workshop, 2012,
pages 234–239 (cited on pages 288, 343).
[406] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. “Recurrent Neural Network
Regularization”. In: arXiv: Neural and Evolutionary Computing, 2014 (cited on
page 288).
[407] Julian G. Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber.
“Recurrent Highway Networks”. In: International Conference on Machine Learn-
ing, 2016 (cited on page
[408] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. “Regularizing and op-
timizing LSTM language models”. In: International Conference on Learning Rep-
resentations, 2017 (cited on page 288).
[409] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya
Sutskever. “Language models are unsupervised multitask learners”. In: volume 1.
8. OpenAI Blog, 2019, page 9 (cited on pages 288, 436).
[410] Atılım Güneş Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey
Mark Siskind. “Automatic differentiation in machine learning: a survey”. In:
volume 18. 1. Journal of Machine Learning Research, 2017, pages 5595–5637
(cited on pages 317, 319).
[411] Ning Qian. “On the momentum term in gradient descent learning algorithms”. In:
volume 12. 1. Neural Networks, 1999, pages 145–151 (cited on page 320).
[412] John C. Duchi, Elad Hazan, and Yoram Singer. “Adaptive Subgradient Methods for
Online Learning and Stochastic Optimization”. In: volume 12. Journal of Machine
Learning Research, 2011, pages 2121–2159 (cited on pages 320, 321).
[413] Matthew D. Zeiler. “ADADELTA: An Adaptive Learning Rate Method”. In: arXiv
preprint arXiv:1212.5701, 2012 (cited on page 320).
[414] Tijmen Tieleman and Geoffrey Hinton. “Lecture 6.5-rmsprop: Divide the gradient
by a running average of its recent magnitude”. In: volume 4. 2. COURSERA: Neu-
ral networks for machine learning, 2012, pages 26–31 (cited on pages 320, 322).
[415] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimiza-
tion”. In: International Conference on Learning Representations, 2015 (cited on
pages 320, 322, 377).
[416] Timothy Dozat. “Incorporating Nesterov Momentum into Adam”. In: International
Conference on Learning Representations, 2016 (cited on page 320).
[417] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. “On the Convergence of Adam
and Beyond”. In: International Conference on Learning Representations, 2018 (cited
on page 320).
[418] Tong Xiao, Jingbo Zhu, Tongran Liu, and Chunliang Zhang. “Fast Parallel Train-
ing of Neural Language Models”. In: International Joint Conference on Artificial
Intelligence, 2017, pages 4193–4199 (cited on pages 324, 380).
[419] Sergey Ioffe and Christian Szegedy. “Batch Normalization: Accelerating Deep Net-
work Training by Reducing Internal Covariate Shift”. In: volume 37. International
Conference on Machine Learning, 2015, pages 448–456 (cited on page
[420] Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey Hinton. “Layer Normalization”.
In: volume abs/1607.06450. CoRR, 2016 (cited on pages 325, 422, 514).
[421] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learn-
ing for Image Recognition”. In: IEEE Conference on Computer Vision and Pattern
Recognition, 2016, pages 770–778 (cited on pages 325, 387, 395, 422, 514).
[422] Ngoc-quan Pham, German Kruszewski, and Gemma Boleda. “Convolutional Neu-
ral Network Language Models”. In: Conference on Empirical Methods in Natural
Language Processing, 2016 (cited on page 338).
[423] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean.
“Distributed Representations of Words and Phrases and their Compositionality”.
In: Conference on Neural Information Processing Systems, 2013, pages 3111–3119
(cited on pages 340, 343).
[424] Raha Moraffah, Mansooreh Karami, Ruocheng Guo, Adrienne Raglin, and Huan
Liu. “Causal Interpretability for Machine Learning-Problems, Methods and Eval-
uation”. In: volume 22. 1. ACM SIGKDD Conference on Knowledge Discovery
and Data Mining, 2020, pages 18–33 (cited on page 343).
[425] Boris Kovalerchuk, Muhammad Ahmad, and Ankur Teredesai. “Survey of explain-
able machine learning with visual and granular methods beyond quasi-explanations”.
In: volume abs/2009.10221. ArXiv, 2020 (cited on page 343).
[426] Finale Doshi-Velez and Been Kim. “Towards A Rigorous Science of Interpretable
Machine Learning”. In: arXiv preprint arXiv:1702.08608, 2017 (cited on page 343).
[427] Philip Arthur, Graham Neubig, and Satoshi Nakamura. “Incorporating Discrete
Translation Lexicons into Neural Machine Translation”. In: Conference on Empir-
ical Methods in Natural Language Processing, 2016, pages 1557–1567 (cited on
page 343).
[428] Jiacheng Zhang, Yang Liu, Huanbo Luan, Jingfang Xu, and Maosong Sun. “Prior
Knowledge Integration for Neural Machine Translation using Posterior Regulariza-
tion”. In: Annual Meeting of the Association for Computational Linguistics, 2017,
pages 1514–1523 (cited on pages 343, 386).
[429] Felix Stahlberg, Eva Hasler, Aurelien Waite, and Bill Byrne. “Syntactically Guided
Neural Machine Translation”. In: Annual Meeting of the Association for Compu-
tational Linguistics, 2016 (cited on page 343).
[430] Anna Currey and Kenneth Heafield. “Incorporating source syntax into transformer-
based neural machine translation”. In: Annual Meeting of the Association for Com-
putational Linguistics, 2019, pages 24–33 (cited on page
[431] Baosong Yang, Derek Wong, Tong Xiao, Lidia Chao, and Jingbo Zhu. “Towards
Bidirectional Hierarchical Representations for Attention-based Neural Machine
Translation”. In: Conference on Empirical Methods in Natural Language Process-
ing, 2017, pages 1432–1441 (cited on pages 343, 386, 530).
[432] David Mareček and Rudolf Rosa. “Extracting syntactic trees from transformer en-
coder self-attentions”. In: Conference on Empirical Methods in Natural Language
Processing, 2018, pages 347–349 (cited on page 343).
[433] Terra Blevins, Omer Levy, and Luke Zettlemoyer. “Deep RNNs Encode Soft
Hierarchical Syntax”. In: Annual Meeting of the Association for Computational
Linguistics, 2018 (cited on page 343).
[434] Youzheng Wu, Xugang Lu, Hitoshi Yamamoto, Shigeki Matsuda, Chiori Hori,
and Hideki Kashioka. “Factored Language Model based on Recurrent Neural Net-
work”. In: International Conference on Computational Linguistics, 2012 (cited on
page 343).
[435] Heike Adel, Ngoc Vu, Katrin Kirchhoff, Dominic Telaar, and Tanja Schultz. “Syn-
tactic and Semantic Features For Code-Switching Factored Language Models”. In:
volume 23. IEEE/ACM Transactions on Audio, Speech, and Language Processing,
2015, pages 431–440 (cited on page 343).
[436] Tian Wang and Kyunghyun Cho. “Larger-Context Language Modelling”. In: An-
nual Meeting of the Association for Computational Linguistics, 2015 (cited on
page 343).
[437] Sungjin Ahn, Heeyoul Choi, Tanel Pärnamaa, and Yoshua Bengio. “A Neural Knowl-
edge Language Model”. In: arXiv preprint arXiv:1608.00318, 2016 (cited on page 343).
[438] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. “Character-
Aware Neural Language Models”. In: AAAI Conference on Artificial Intelligence,
2016 (cited on page 343).
[439] Kyuyeon Hwang and Wonyong Sung. “Character-level language modeling with
hierarchical recurrent neural networks”. In: International Conference on Acoustics,
Speech and Signal Processing, 2017, pages 5720–5724 (cited on page 343).
[440] Yasumasa Miyamoto and Kyunghyun Cho. “Gated Word-Character Recurrent Lan-
guage Model”. In: Conference on Empirical Methods in Natural Language Process-
ing, 2016, pages 1992–1997 (cited on page 343).
[441] Lyan Verwimp, Joris Pelemans, Hugo Van Hamme, and Patrick Wambacq. “Character-
Word LSTM Language Models”. In: Annual Conference of the European Associa-
tion for Machine Translation, 2017 (cited on page
[442] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. “Hybrid speech recog-
nition with Deep Bidirectional LSTM”. In: IEEE Workshop on Automatic Speech
Recognition and Understanding, 2013, pages 273–278 (cited on page 343).
[443] Jetic Gu, Hassan S. Shavarani, and Anoop Sarkar. “Top-down Tree Structured
Decoding with Syntactic Connections for Neural Machine Translation and Pars-
ing”. In: Conference on Empirical Methods in Natural Language Processing, 2018,
pages 401–413 (cited on pages 343, 386).
[444] Pengcheng Yin, Chunting Zhou, Junxian He, and Graham Neubig. “StructVAE:
Tree-structured Latent Variable Models for Semi-supervised Semantic Parsing”.
In: Annual Meeting of the Association for Computational Linguistics, 2018 (cited
on page 343).
[445] Roee Aharoni and Yoav Goldberg. “Towards String-To-Tree Neural Machine Trans-
lation”. In: Annual Meeting of the Association for Computational Linguistics, 2017
(cited on pages 343, 534).
[446] Jasmijn Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Sima’an.
“Graph Convolutional Encoders for Syntax-aware Neural Machine Translation”.
In: Conference on Empirical Methods in Natural Language Processing, 2017 (cited
on page 343).
[447] Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan, Mirella Lapata, and Hannaneh
Hajishirzi. “Text Generation from Knowledge Graphs with Graph Transformers”.
In: Annual Conference of the North American Chapter of the Association for Com-
putational Linguistics, 2019 (cited on page 343).
[448] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. “Learned in
Translation: Contextualized Word Vectors”. In: Conference on Neural Information
Processing Systems, 2017, pages 6294–6305 (cited on pages 343, 553).
[449] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard M. Schwartz,
and John Makhoul. “Fast and Robust Neural Network Joint Models for Statistical
Machine Translation”. In: Annual Meeting of the Association for Computational
Linguistics, 2014, pages 1370–1380 (cited on pages 348, 387).
[450] Holger Schwenk. “Continuous Space Translation Models for Phrase-Based Statisti-
cal Machine Translation”. In: International Conference on Computational Linguis-
tics, 2012, pages 1071–1080 (cited on page 348).
[451] Nal Kalchbrenner and Phil Blunsom. “Recurrent Continuous Translation Models”.
In: Annual Meeting of the Association for Computational Linguistics, 2013, pages 1700–
1709 (cited on pages 348, 359, 387, 394).
[452] Sepp Hochreiter. “The Vanishing Gradient Problem During Learning Recurrent
Neural Nets and Problem Solutions”. In: volume 6. 2. International Journal of Un-
certainty, Fuzziness and Knowledge-Based Systems, 1998, pages 107–116 (cited
on page 348).
[453] Yoshua Bengio, Patrice Y. Simard, and Paolo Frasconi. “Learning long-term
dependencies with gradient descent is difficult”. In: volume 5. 2. IEEE Transactions
on Neural Networks, 1994, pages 157–166 (cited on page 348).
[454] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi,
Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff
Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan
Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian,
Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick,
Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. “Google’s Neu-
ral Machine Translation System: Bridging the Gap between Human and Machine
Translation”. In: volume abs/1609.08144. CoRR, 2016 (cited on pages 349, 359,
374, 375, 394, 476).
[455] Felix Stahlberg. “Neural Machine Translation: A Review”. In: volume 69. Journal
of Artificial Intelligence Research, 2020, pages 343–418 (cited on pages 349, 404,
[456] Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. “Neu-
ral versus Phrase-Based Machine Translation Quality: a Case Study”. In: Annual
Meeting of the Association for Computational Linguistics, 2016, pages 257–267
(cited on pages 350, 351).
[457] Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark,
Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis,
Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank
Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang,
Zhirui Zhang, and Ming Zhou. “Achieving Human Parity on Automatic Chinese
to English News Translation”. In: volume abs/1803.05567. CoRR, 2018 (cited on
pages 350, 351, 546, 557).
[458] Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey,
George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish
Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Mac-
duff Hughes. “The Best of Both Worlds: Combining Recent Advances in Neural
Machine Translation”. In: Annual Meeting of the Association for Computational
Linguistics, 2018, pages 76–86 (cited on pages 352, 483, 510).
[459] Tianyu He, Xu Tan, Yingce Xia, Di He, Tao Qin, Zhibo Chen, and Tie-Yan Liu.
“Layer-Wise Coordination between Encoder and Decoder for Neural Machine Trans-
lation”. In: Conference on Neural Information Processing Systems, 2018 (cited on
page 352).
[460] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. “Self-Attention with Relative
Position Representations”. In: Proceedings of the Human Language Technology
Conference of the North American Chapter of the Association for Computational
Linguistics, 2018, pages 464–468 (cited on pages 352, 416, 429, 497, 502, 503,
[461] Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek Wong, and
Lidia Chao. “Learning Deep Transformer Models for Machine Translation”. In: An-
nual Meeting of the Association for Computational Linguistics, 2019, pages 1810–
1822 (cited on pages 352, 423, 426, 429, 497, 514, 516, 519, 520, 522, 524, 525).
[462] Bei Li, Ziyang Wang, Hui Liu, Yufan Jiang, Quan Du, Tong Xiao, Huizhen Wang,
and Jingbo Zhu. “Shallow-to-Deep Training for Neural Machine Translation”. In:
Conference on Empirical Methods in Natural Language Processing, 2020 (cited on
pages 352, 429, 502, 519, 524, 526).
[463] Xiangpeng Wei, Heng Yu, Yue Hu, Yue Zhang, Rongxiang Weng, and Weihua
Luo. “Multiscale Collaborative Deep Models for Neural Machine Translation”. In:
Annual Meeting of the Association for Computational Linguistics, 2020 (cited on
pages 352, 429).
[464] Yanyang Li, Qiang Wang, Tong Xiao, Tongran Liu, and Jingbo Zhu. “Neural Ma-
chine Translation with Joint Representation”. In: AAAI Conference on Artificial
Intelligence, 2020, pages 8285–8292 (cited on pages 355, 542, 544).
[465] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio.
“On the Properties of Neural Machine Translation: Encoder-Decoder Approaches”.
In: Annual Meeting of the Association for Computational Linguistics, 2014, pages 103–
111 (cited on page 359).
[466] Sébastien Jean, KyungHyun Cho, Roland Memisevic, and Yoshua Bengio. “On
Using Very Large Target Vocabulary for Neural Machine Translation”. In: Annual
Meeting of the Association for Computational Linguistics, 2015, pages 1–10 (cited
on pages 359, 434, 480).
[467] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-term Memory”. In:
volume 9. Neural Computation, Dec. 1997, pages 1735–1780 (cited on page 363).
[468] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi
Bougares, Holger Schwenk, and Yoshua Bengio. “Learning Phrase Representations
using RNN Encoder-Decoder for Statistical Machine Translation”. In: Conference
on Empirical Methods in Natural Language Processing, 2014, pages 1724–1734
(cited on page 365).
[469] Rico Sennrich, Orhan Firat, Kyunghyun Cho, Barry Haddow, Alexandra Birch, Ju-
lian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli
Barone, Jozef Mokry, and Maria Nadejde. “Nematus: a Toolkit for Neural Ma-
chine Translation”. In: Annual Conference of the European Association for Ma-
chine Translation, 2017, pages 65–68 (cited on pages 374, 633).
[470] Xavier Glorot and Yoshua Bengio. “Understanding the difficulty of training deep
feedforward neural networks”. In: volume 9. International Conference on Artificial
Intelligence and Statistics, 2010, pages 249–256 (cited on pages 377, 520, 521).
[471] Hirotugu Akaike. “Fitting autoregressive models for prediction”. In: volume 21. 1.
Annals of the Institute of Statistical Mathematics, 1969, pages 243–247 (cited on
page 382).
[472] Yanyang Li, Tong Xiao, Yinqiao Li, Qiang Wang, Changming Xu, and Jingbo Zhu.
“A Simple and Effective Approach to Coverage-Aware Neural Machine Transla-
tion”. In: Annual Meeting of the Association for Computational Linguistics, 2018,
pages 292–297 (cited on pages 385, 473, 477, 479, 552).
[473] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. “Modeling
Coverage for Neural Machine Translation”. In: Annual Meeting of the Association
for Computational Linguistics, 2016 (cited on pages 385, 476, 477, 552).
[474] Biao Zhang and Rico Sennrich. “A Lightweight Recurrent Network for Sequence
Modeling”. In: Annual Meeting of the Association for Computational Linguistics,
2019, pages 1538–1548 (cited on page 386).
[475] Tao Lei, Yu Zhang, and Yoav Artzi. “Training RNNs as Fast as CNNs”. In: vol-
ume abs/1709.02755. CoRR, 2017 (cited on page 386).
[476] Biao Zhang, Deyi Xiong, Jinsong Su, Qian Lin, and Huiji Zhang. “Simplifying
Neural Machine Translation with Addition-Subtraction Twin-Gated Recurrent Net-
works”. In: Conference on Empirical Methods in Natural Language Processing,
2018, pages 4273–4283 (cited on page 386).
[477] Xing Wang, Zhengdong Lu, Zhaopeng Tu, Hang Li, Deyi Xiong, and Min Zhang.
“Neural Machine Translation Advised by Statistical Machine Translation”. In: AAAI
Conference on Artificial Intelligence, 2017, pages 3330–3336 (cited on page 386).
[478] Wei He, Zhongjun He, Hua Wu, and Haifeng Wang. “Improved Neural Machine
Translation with SMT Features”. In: AAAI Conference on Artificial Intelligence,
2016, pages 151–157 (cited on page 386).
[479] Xintong Li, Guanlin Li, Lemao Liu, Max Meng, and Shuming Shi. “On the Word
Alignment from Neural Machine Translation”. In: Annual Meeting of the Associa-
tion for Computational Linguistics, 2019, pages 1293–1303 (cited on page 386).
[480] Yau-Shian Wang, Hung-yi Lee, and Yun-Nung Chen. “Tree Transformer: Integrat-
ing Tree Structures into Self-Attention”. In: Conference on Empirical Methods in
Natural Language Processing, 2019, pages 1061–1070 (cited on page 386).
[481] Xinyi Wang, Hieu Pham, Pengcheng Yin, and Graham Neubig. “A Tree-based De-
coder for Neural Machine Translation”. In: Conference on Empirical Methods in
Natural Language Processing, 2018, pages 4772–4777 (cited on pages 386, 535).
[482] Jiajun Zhang and Chengqing Zong. “Bridging Neural Machine Translation and
Bilingual Dictionaries”. In: volume abs/1610.07272. CoRR, 2016 (cited on page 386).
[483] Xiangyu Duan, Baijun Ji, Hao Jia, Min Tan, Min Zhang, Boxing Chen, Weihua Luo,
and Yue Zhang. “Bilingual Dictionary Based Neural Machine Translation without
Using Parallel Sentences”. In: Annual Meeting of the Association for Computa-
tional Linguistics, 2020, pages 1570–1579 (cited on page 386).
[484] Qian Cao and Deyi Xiong. “Encoding Gated Translation Memory into Neural Ma-
chine Translation”. In: Conference on Empirical Methods in Natural Language Pro-
cessing, 2018, pages 3042–3047 (cited on page 386).
[485] Haitao Mi, Zhiguo Wang, and Abe Ittycheriah. “Supervised Attentions for Neural
Machine Translation”. In: Annual Meeting of the Association for Computational
Linguistics, 2016, pages 2283–2288 (cited on page 386).
[486] Lemao Liu, Masao Utiyama, Andrew M. Finch, and Eiichiro Sumita. “Neural Ma-
chine Translation with Supervised Attention”. In: Annual Meeting of the Associa-
tion for Computational Linguistics, 2016, pages 3093–3102 (cited on page 386).
[487] Lesly Miculicich Werlen, Dhananjay Ram, Nikolaos Pappas, and James Hender-
son. “Document-Level Neural Machine Translation with Hierarchical Attention
Networks”. In: Conference on Empirical Methods in Natural Language Process-
ing, 2018, pages 2947–2954 (cited on pages 386, 601, 603).
[488] Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. “Context-Aware
Neural Machine Translation Learns Anaphora Resolution”. In: Annual Meeting
of the Association for Computational Linguistics, 2018, pages 1264–1274 (cited
on pages 386, 600–603).
[489] Bei Li, Hui Liu, Ziyang Wang, Yufan Jiang, Tong Xiao, Jingbo Zhu, Tongran Liu,
and Changliang Li. “Does Multi-Encoder Help? A Case Study on Context-Aware
Neural Machine Translation”. In: Annual Meeting of the Association for Compu-
tational Linguistics, 2020, pages 3512–3518 (cited on pages 386, 448, 602, 607).
[490] Alexander Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and
Kevin J. Lang. “Phoneme recognition using time-delay neural networks”. In: vol-
ume 37. International Conference on Acoustics, Speech and Signal Processing,
1989, pages 328–339 (cited on page 387).
[491] Yann LeCun, Bernhard Boser, John Denker, Don Henderson, Richard E. Howard,
Wayne E. Hubbard, and Larry Jackel. “Backpropagation Applied to Handwritten
Zip Code Recognition”. In: volume 1. Neural Computation, 1989, pages 541–551
(cited on page 387).
[492] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. “Gradient-based
learning applied to document recognition”. In: volume 86. 11. Proceedings of the
IEEE, 1998, pages 2278–2324 (cited on pages 387, 405).
[493] Yu Zhang, William Chan, and Navdeep Jaitly. “Very deep convolutional networks
for end-to-end speech recognition”. In: International Conference on Acoustics,
Speech and Signal Processing, 2017, pages 4845–4849 (cited on page 387).
[494] Li Deng, Ossama Abdel-Hamid, and Dong Yu. “A deep convolutional neural net-
work using heterogeneous pooling for trading acoustic invariance with phonetic
confusion”. In: International Conference on Acoustics, Speech and Signal Process-
ing, 2013, pages 6669–6673 (cited on page 387).
[495] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. “A Convolutional Neu-
ral Network for Modelling Sentences”. In: Annual Meeting of the Association for
Computational Linguistics, 2014, pages 655–665 (cited on pages 387, 392, 402,
[496] Yoon Kim. “Convolutional Neural Networks for Sentence Classification”. In: Con-
ference on Empirical Methods in Natural Language Processing, 2014, pages 1746–
1751 (cited on pages 387, 392, 393, 402, 406).
[497] Mingbo Ma, Liang Huang, Bowen Zhou, and Bing Xiang. “Dependency-based
Convolutional Neural Networks for Sentence Embedding”. In: Annual Meeting
of the Association for Computational Linguistics, 2015, pages 174–179 (cited on
page 387).
[498] Cícero Nogueira dos Santos and Maira Gatti. “Deep Convolutional Neural Networks
for Sentiment Analysis of Short Texts”. In: International Conference on
Computational Linguistics, 2014, pages 69–78 (cited on pages 387, 392).
[499] Mingxuan Wang, Zhengdong Lu, Hang Li, Wenbin Jiang, and Qun Liu. “genCNN:
A Convolutional Architecture for Word Sequence Prediction”. In: Annual Meeting
of the Association for Computational Linguistics, 2015, pages 1567–1576 (cited
on page 387).
[500] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. “Language Mod-
eling with Gated Convolutional Networks”. In: volume 70. International Confer-
ence on Machine Learning, 2017, pages 933–941 (cited on pages 387, 395, 396).
[501] Jonas Gehring, Michael Auli, David Grangier, and Yann N. Dauphin. “A Con-
volutional Encoder Model for Neural Machine Translation”. In: Annual Meeting
of the Association for Computational Linguistics, 2017, pages 123–135 (cited on
pages 387, 394).
[502] Lukasz Kaiser, Aidan N. Gomez, and François Chollet. “Depthwise Separable Con-
volutions for Neural Machine Translation”. In: International Conference on Learn-
ing Representations, 2018 (cited on pages 387, 394, 402).
[503] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-
Yang Fu, and Alexander C. Berg. “SSD: Single Shot MultiBox Detector”. In: vol-
ume 9905. European Conference on Computer Vision, 2016, pages 21–37 (cited
on page 388).
[504] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. “Faster R-CNN: Towards
Real-Time Object Detection with Region Proposal Networks”. In: volume 39. 6.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, pages 1137–
1149 (cited on pages 388, 596).
[505] Rie Johnson and Tong Zhang. “Effective Use of Word Order for Text Categoriza-
tion with Convolutional Neural Networks”. In: Proceedings of the Human Lan-
guage Technology Conference of the North American Chapter of the Association
for Computational Linguistics, 2015, pages 103–112 (cited on pages 392, 402).
[506] Thien Huu Nguyen and Ralph Grishman. “Relation Extraction: Perspective from
Convolutional Neural Networks”. In: Proceedings of the Human Language Tech-
nology Conference of the North American Chapter of the Association for Compu-
tational Linguistics, 2015, pages 39–48 (cited on page 392).
[507] Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. “Pay
Less Attention with Lightweight and Dynamic Convolutions”. In: International
Conference on Learning Representations, 2019 (cited on pages 394, 402, 404, 405,
429, 483, 508, 510).
[508] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. “End-To-End
Memory Networks”. In: Conference on Neural Information Processing Systems,
2015, pages 2440–2448 (cited on page 395).
[509] Md. Amirul Islam, Sen Jia, and Neil Bruce. “How much Position Information Do
Convolutional Neural Networks Encode?” In: International Conference on Learn-
ing Representations, 2020 (cited on page 396).
[510] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey Hinton. “On the im-
portance of initialization and momentum in deep learning”. In: International Con-
ference on Machine Learning, 2013, pages 1139–1147 (cited on page 400).
[511] Yoshua Bengio, Nicolas Boulanger-Lewandowski, and Razvan Pascanu. “Advances
in optimizing recurrent networks”. In: International Conference on Acoustics, Speech
and Signal Processing, 2013, pages 8624–8628 (cited on page 401).
[512] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. “Dropout: A Simple Way to Prevent Neural Networks from Over-
fitting”. In: volume 15. Journal of Machine Learning Research, 2014, pages 1929–
1958 (cited on pages 401, 426, 445).
[513] François Chollet. “Xception: Deep Learning with Depthwise Separable Convolu-
tions”. In: IEEE Conference on Computer Vision and Pattern Recognition, 2017,
pages 1800–1807 (cited on pages 402, 540).
[514] Andrew Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang,
Tobias Weyand, Marco Andreetto, and Hartwig Adam. “MobileNets: Efficient Con-
volutional Neural Networks for Mobile Vision Applications”. In: CoRR, 2017 (cited
on page 402).
[515] Rie Johnson and Tong Zhang. “Deep Pyramid Convolutional Neural Networks for
Text Categorization”. In: Annual Meeting of the Association for Computational
Linguistics, 2017, pages 562–570 (cited on page 402).
[516] Laurent Sifre and Stéphane Mallat. “Rigid-motion scattering for image classifica-
tion”. In: Citeseer, 2014 (cited on page 402).
[517] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. “DeepFace:
Closing the Gap to Human-Level Performance in Face Verification”. In: IEEE
Conference on Computer Vision and Pattern Recognition, 2014, pages 1701–1708
(cited on page 405).
[518] Yu-hsin Chen, Ignacio Lopez-Moreno, Tara Sainath, Mirkó Visontai, Raziel Al-
varez, and Carolina Parada. “Locally-connected and convolutional neural networks
for small footprint speaker recognition”. In: Conference of the International Speech
Communication Association, 2015, pages 1136–1140 (cited on page 405).
[519] Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Lu Yuan, and Zicheng
Liu. “Dynamic Convolution: Attention Over Convolution Kernels”. In: IEEE Con-
ference on Computer Vision and Pattern Recognition, 2020, pages 11027–11036
(cited on page 405).
[520] Peng Zhou, Suncong Zheng, Jiaming Xu, Zhenyu Qi, Hongyun Bao, and Bo Xu.
“Joint Extraction of Multiple Relations and Entities by Using a Hybrid Neural Net-
work”. In: volume 10565. Springer, 2017, pages 135–146 (cited on page 406).
[521] Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. “Event Extraction
via Dynamic Multi-Pooling Convolutional Neural Networks”. In: Annual Meeting
of the Association for Computational Linguistics, 2015, pages 167–176 (cited on
page 406).
[522] Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. “Relation Clas-
sification via Convolutional Deep Neural Network”. In: International Conference
on Computational Linguistics, 2014, pages 2335–2344 (cited on page 406).
[523] Thien Huu Nguyen and Ralph Grishman. “Event Detection and Domain Adaptation
with Convolutional Neural Networks”. In: Annual Meeting of the Association for
Computational Linguistics, 2015, pages 365–371 (cited on page 406).
[524] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. “Recurrent Convolutional Neural
Networks for Text Classification”. In: AAAI Conference on Artificial Intelligence,
2015, pages 2267–2273 (cited on page 406).
[525] Tao Lei, Regina Barzilay, and Tommi S. Jaakkola. “Molding CNNs for text: non-
linear, non-consecutive convolutions”. In: Conference on Empirical Methods in
Natural Language Processing, 2015, pages 1565–1575 (cited on page 406).
[526] Emma Strubell, Patrick Verga, David Belanger, and Andrew McCallum. “Fast and
Accurate Entity Recognition with Iterated Dilated Convolutions”. In: Conference
on Empirical Methods in Natural Language Processing, 2017, pages 2670–2680
(cited on page 406).
[527] Xuezhe Ma and Eduard H. Hovy. “End-to-end Sequence Labeling via Bi-directional
LSTM-CNNs-CRF”. In: Annual Meeting of the Association for Computational
Linguistics, 2016 (cited on page 406).
[528] Peng-Hsuan Li, Ruo-Ping Dong, Yu-Siang Wang, Ju-Chieh Chou, and Wei-Yun
Ma. “Leveraging Linguistic Structures for Named Entity Recognition with Bidi-
rectional Recursive Neural Networks”. In: Conference on Empirical Methods in
Natural Language Processing, 2017, pages 2664–2669 (cited on page 406).
[529] Changhan Wang, Kyunghyun Cho, and Douwe Kiela. “Code-Switched Named En-
tity Recognition with Embedding Attention”. In: Annual Meeting of the Associa-
tion for Computational Linguistics, 2018, pages 154–158 (cited on page 406).
[530] Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang,
Bowen Zhou, and Yoshua Bengio. “A Structured Self-Attentive Sentence Embed-
ding”. In: International Conference on Learning Representations, 2017 (cited on
pages 408, 518).
[531] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer,
and Alexander Ku. “Image Transformer”. In: volume abs/1802.05751. CoRR, 2018
(cited on page 410).
[532] Linhao Dong, Shuang Xu, and Bo Xu. “Speech-Transformer: A No-Recurrence
Sequence-to-Sequence Model for Speech Recognition”. In: International Confer-
ence on Acoustics, Speech and Signal Processing, 2018, pages 5884–5888 (cited
on page 410).
[533] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu,
Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. “Conformer:
Convolution-augmented Transformer for Speech Recognition”. In: Conference of the
International Speech Communication Association, 2020, pages 5036–5040 (cited on
pages 410, 508).
[534] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbig-
niew Wojna. “Rethinking the Inception Architecture for Computer Vision”. In:
IEEE Conference on Computer Vision and Pattern Recognition, 2016, pages 2818–
2826 (cited on pages 426, 441).
[535] Ashish Vaswani, Samy Bengio, Eugene Brevdo, François Chollet, Aidan Gomez,
Stephan Gouws, Llion Jones, Lukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan
Sepassi, Noam Shazeer, and Jakob Uszkoreit. “Tensor2Tensor for Neural Machine
Translation”. In: Association for Machine Translation in the Americas, 2018, pages 193–
199 (cited on pages 428, 429, 476, 515, 633).
[536] Matthieu Courbariaux and Yoshua Bengio. “BinaryNet: Training Deep Neural Net-
works with Weights and Activations Constrained to +1 or -1”. In: volume abs/1602.02830.
CoRR, 2016 (cited on page 428).
[537] Ye Lin, Yanyang Li, Tengbo Liu, Tong Xiao, Tongran Liu, and Jingbo Zhu. “To-
wards Fully 8-bit Integer Inference for the Transformer Model”. In: International
Joint Conference on Artificial Intelligence, 2020, pages 3759–3765 (cited on pages 428,
[538] Tong Xiao, Yinqiao Li, Jingbo Zhu, Zhengtao Yu, and Tongran Liu. “Sharing Atten-
tion Weights for Fast Transformer”. In: International Joint Conference on Artificial
Intelligence, 2019, pages 5292–5298 (cited on pages 428, 429, 481–483).
[539] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. “An-
alyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the
Rest Can Be Pruned”. In: Annual Meeting of the Association for Computational
Linguistics, 2019, pages 5797–5808 (cited on pages 429, 482, 499, 544).
[540] Biao Zhang, Deyi Xiong, and Jinsong Su. “Accelerating Neural Transformer via
an Average Attention Network”. In: Annual Meeting of the Association for Com-
putational Linguistics, 2018, pages 1789–1798 (cited on pages 429, 483).
[541] Ye Lin, Yanyang Li, Ziyang Wang, Bei Li, Quan Du, Tong Xiao, and Jingbo Zhu.
“Weight Distillation: Transferring the Knowledge in Neural Network Parameters”.
In: volume abs/2009.09152. ArXiv, 2020 (cited on page 429).
[542] Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. “Lite Transformer
with Long-Short Range Attention”. In: International Conference on Learning Rep-
resentations, 2020 (cited on pages 429, 509).
[543] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. “Reformer: The Efficient
Transformer”. In: International Conference on Learning Representations, 2020 (cited
on pages 429, 483, 512, 601).
[544] Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. “Scaling Neural Ma-
chine Translation”. In: Annual Meeting of the Association for Computational Lin-
guistics, 2018 (cited on page 429).
[545] Aishwarya Bhandare, Vamsi Sripathi, Deepthi Karkada, Vivek Menon, Sun Choi,
Kushal Datta, and Vikram Saletore. “Efficient 8-Bit Quantization of Transformer
Neural Machine Language Translation Model”. In: volume abs/1906.00532. CoRR,
2019 (cited on pages 429, 485, 499).
[546] Abigail See, Minh-Thang Luong, and Christopher D. Manning. “Compression of
Neural Machine Translation Models via Pruning”. In: International Conference on
Computational Linguistics, 2016, pages 291–301 (cited on page 429).
[547] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. “Distilling the Knowledge in a
Neural Network”. In: volume abs/1503.02531. CoRR, 2015 (cited on pages 429,
458, 459, 482, 499, 562, 563).
[548] Yoon Kim and Alexander Rush. “Sequence-Level Knowledge Distillation”. In:
Conference on Empirical Methods in Natural Language Processing, 2016, pages 1317–
1327 (cited on pages 429, 459).
[549] Yun Chen, Yang Liu, Yong Cheng, and Victor O. K. Li. “A Teacher-Student Frame-
work for Zero-Resource Neural Machine Translation”. In: Annual Meeting of the
Association for Computational Linguistics, 2017, pages 1925–1935 (cited on pages 429,
561, 562).
[550] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan
Salakhutdinov. “Transformer-XL: Attentive Language Models Beyond a Fixed-
Length Context”. In: Annual Meeting of the Association for Computational Lin-
guistics, 2019, pages 2978–2988 (cited on pages 429, 502, 505).
[551] Xuanqing Liu, Hsiang-Fu Yu, Inderjit Dhillon, and Cho-Jui Hsieh. “Learning to
Encode Position for Transformer with Continuous Dynamical Model”. In: vol-
ume abs/2003.09229. ArXiv, 2020 (cited on pages 429, 506).
[552] Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. “What Does BERT Learn about
the Structure of Language?” In: Annual Meeting of the Association for Computa-
tional Linguistics, 2019 (cited on pages 429, 508).
[553] Baosong Yang, Zhaopeng Tu, Derek Wong, Fandong Meng, Lidia Chao, and Tong
Zhang. “Modeling Localness for Self-Attention Networks”. In: Annual Meeting of
the Association for Computational Linguistics, 2018, pages 4449–4458 (cited on
pages 429, 506, 508).
[554] Baosong Yang, Longyue Wang, Derek F. Wong, Lidia S. Chao, and Zhaopeng Tu.
“Convolutional Self-Attention Networks”. In: Annual Meeting of the Association
for Computational Linguistics, 2019, pages 4040–4045 (cited on pages 429, 508).
[555] Qiang Wang, Fuxue Li, Tong Xiao, Yanyang Li, Yinqiao Li, and Jingbo Zhu. “Multi-
layer Representation Fusion for Neural Machine Translation”. In: volume abs/2002.06714.
International Conference on Computational Linguistics, 2018 (cited on pages
514, 516).
[556] Ankur Bapna, Mia Xu Chen, Orhan Firat, Yuan Cao, and Yonghui Wu. “Training
Deeper Neural Machine Translation Models with Transparent Attention”. In: An-
nual Meeting of the Association for Computational Linguistics, 2018, pages 3028–
3033 (cited on pages 429, 514, 516, 518, 519).
[557] Zi-Yi Dou, Zhaopeng Tu, Xing Wang, Shuming Shi, and Tong Zhang. “Exploit-
ing Deep Representations for Neural Machine Translation”. In: Annual Meeting of
the Association for Computational Linguistics, 2018, pages 4253–4262 (cited on
pages 429, 514, 516, 520, 524).
[558] Xing Wang, Zhaopeng Tu, Longyue Wang, and Shuming Shi. “Exploiting Senten-
tial Context for Neural Machine Translation”. In: Annual Meeting of the Associa-
tion for Computational Linguistics, 2019 (cited on pages 429, 514).
[559] Zi-Yi Dou, Zhaopeng Tu, Xing Wang, Longyue Wang, Shuming Shi, and Tong
Zhang. “Dynamic Layer Aggregation for Neural Machine Translation with Routing-
by-Agreement”. In: AAAI Conference on Artificial Intelligence, 2019, pages 86–
93 (cited on pages 429, 514, 516, 519).
[560] Mercedes Garcia-Martinez, Loïc Barrault, and Fethi Bougares. “Factored Neural
Machine Translation Architectures”. In: International Workshop on Spoken Lan-
guage Translation (IWSLT’16), 2016 (cited on page 434).
[561] Jason Lee, Kyunghyun Cho, and Thomas Hofmann. “Fully Character-Level Neural
Machine Translation without Explicit Segmentation”. In: volume 5. Transactions
of the Association for Computational Linguistics, 2017, pages 365–378 (cited on
pages 434, 564, 580).
[562] Minh-Thang Luong and Christopher Manning. “Achieving Open Vocabulary Neu-
ral Machine Translation with Hybrid Word-Character Models”. In: Annual Meeting
of the Association for Computational Linguistics, 2016 (cited on pages 435, 633).
[563] Philip Gage. “A new algorithm for data compression”. In: volume 12. The C Users
Journal, 1994, pages 23–38 (cited on page 435).
[564] Taku Kudo. “Subword Regularization: Improving Neural Network Translation Mod-
els with Multiple Subword Candidates”. In: Annual Meeting of the Association for
Computational Linguistics, 2018, pages 66–75 (cited on pages 436, 438).
[565] Mike Schuster and Kaisuke Nakajima. “Japanese and Korean voice search”. In:
IEEE International Conference on Acoustics, Speech and Signal Processing, 2012,
pages 5149–5152 (cited on page
[566] Taku Kudo and John Richardson. “SentencePiece: A simple and language indepen-
dent subword tokenizer and detokenizer for Neural Text Processing”. In: Confer-
ence on Empirical Methods in Natural Language Processing, 2018, pages 66–71
(cited on page 438).
[567] Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. “BPE-Dropout: Simple
and Effective Subword Regularization”. In: Annual Meeting of the Association
for Computational Linguistics, 2020, pages 1882–1892 (cited on page 438).
[568] Xuanli He, Gholamreza Haffari, and Mohammad Norouzi. “Dynamic Program-
ming Encoding for Subword Segmentation in Neural Machine Translation”. In: An-
nual Meeting of the Association for Computational Linguistics, 2020, pages 3042–
3051 (cited on page 438).
[569] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. In: volume 521.
7553. Nature, 2015, pages 436–444 (cited on pages 441, 450).
[570] Geoffrey Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. “Improving neural networks by preventing co-adaptation of feature
detectors”. In: volume abs/1207.0580. CoRR, 2012 (cited on page 443).
[571] Mathias Müller, Annette Rios, and Rico Sennrich. “Domain Robustness in Neural
Machine Translation”. In: Association for Machine Translation in the Americas,
2020, pages 151–164 (cited on page 445).
[572] Nicholas Carlini and David Wagner. “Towards Evaluating the Robustness of Neu-
ral Networks”. In: IEEE Symposium on Security and Privacy, 2017, pages 39–57
(cited on page 445).
[573] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. “Deep-
Fool: A Simple and Accurate Method to Fool Deep Neural Networks”. In: IEEE
Conference on Computer Vision and Pattern Recognition, 2016, pages 2574–2582
(cited on pages 445, 469).
[574] Yong Cheng, Lu Jiang, and Wolfgang Macherey. “Robust Neural Machine Trans-
lation with Doubly Adversarial Inputs”. In: Annual Meeting of the Association for
Computational Linguistics, 2019, pages 4324–4333 (cited on pages 445, 448).
[575] Anh Mai Nguyen, Jason Yosinski, and Jeff Clune. “Deep neural networks are easily
fooled: High confidence predictions for unrecognizable images”. In: IEEE Confer-
ence on Computer Vision and Pattern Recognition, 2015, pages 427–436 (cited on
pages 445, 469).
[576] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan,
Ian J. Goodfellow, and Rob Fergus. “Intriguing properties of neural networks”. In:
International Conference on Learning Representations, 2014 (cited on page 445).
[577] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. “Explaining and Har-
nessing Adversarial Examples”. In: International Conference on Learning Repre-
sentations, 2015 (cited on pages 445–447).
[578] Robin Jia and Percy Liang. “Adversarial Examples for Evaluating Reading Com-
prehension Systems”. In: Conference on Empirical Methods in Natural Language
Processing, 2017, pages 2021–2031 (cited on pages 446, 469).
[579] Giannis Bekoulis, Johannes Deleu, Thomas Demeester, and Chris Develder. “Ad-
versarial training for multi-context joint entity and relation extraction”. In: Confer-
ence on Empirical Methods in Natural Language Processing, 2018, pages 2830–
2836 (cited on page 446).
[580] Michihiro Yasunaga, Jungo Kasai, and Dragomir Radev. “Robust Multilingual Part-
of-Speech Tagging via Adversarial Training”. In: Annual Conference of the North
American Chapter of the Association for Computational Linguistics, 2018, pages 976–
986 (cited on page 446).
[581] Yonatan Belinkov and Yonatan Bisk. “Synthetic and Natural Noise Both Break
