List of Tables
List of Figures
Table of Notations
Preface
Road Map
I Preliminaries
1 Introduction
1.1 Rationalist and Empiricist Approaches to Language
1.2 Scientific Content
1.2.1 Questions that linguistics should answer
1.2.2 Non-categorical phenomena in language
1.2.3 Language and cognition as probabilistic phenomena
1.3 The Ambiguity of Language: Why NLP Is Difficult
1.4 Dirty Hands
1.4.1 Lexical resources
1.4.2 Word counts
1.4.3 Zipf’s laws
1.4.4 Collocations
1.4.5 Concordances
1.5 Further Reading
1.6 Exercises
2 Mathematical Foundations
2.1 Elementary Probability Theory
2.1.1 Probability spaces
2.1.2 Conditional probability and independence
2.1.3 Bayes’ theorem
2.1.4 Random variables
2.1.5 Expectation and variance
2.1.6 Notation
2.1.7 Joint and conditional distributions
2.1.8 Determining P
2.1.9 Standard distributions
2.1.10 Bayesian statistics
2.1.11 Exercises
2.2 Essential Information Theory
2.2.1 Entropy
2.2.2 Joint entropy and conditional entropy
2.2.3 Mutual information
2.2.4 The noisy channel model
2.2.5 Relative entropy or Kullback-Leibler divergence
2.2.6 The relation to language: Cross entropy
2.2.7 The entropy of English
2.2.8 Perplexity
2.2.9 Exercises
2.3 Further Reading
3 Linguistic Essentials
3.1 Parts of Speech and Morphology
3.1.1 Nouns and pronouns
3.1.2 Words that accompany nouns: Determiners and adjectives
3.1.3 Verbs
3.1.4 Other parts of speech
3.2 Phrase Structure
3.2.1 Phrase structure grammars
3.2.2 Dependency: Arguments and adjuncts
3.2.3 X’ theory
3.2.4 Phrase structure ambiguity
3.3 Semantics and Pragmatics
3.4 Other Areas
3.5 Further Reading
3.6 Exercises
4 Corpus-Based Work
4.1 Getting Set Up
4.1.1 Computers
4.1.2 Corpora
4.1.3 Software
4.2 Looking at Text
4.2.1 Low-level formatting issues
4.2.2 Tokenization: What is a word?
4.2.3 Morphology
4.2.4 Sentences
4.3 Marked-up Data
4.3.1 Markup schemes
4.3.2 Grammatical tagging
4.4 Further Reading
4.5 Exercises
II Words
5 Collocations
5.1 Frequency
5.2 Mean and Variance
5.3 Hypothesis Testing
5.3.1 The t test
5.3.2 Hypothesis testing of differences
5.3.3 Pearson’s chi-square test
5.3.4 Likelihood ratios
5.4 Mutual Information
5.5 The Notion of Collocation
5.6 Further Reading
6 Statistical Inference: n-gram Models over Sparse Data
6.1 Bins: Forming Equivalence Classes
6.1.1 Reliability vs. discrimination
6.1.2 n-gram models
6.1.3 Building n-gram models
6.2 Statistical Estimators
6.2.1 Maximum Likelihood Estimation (MLE)
6.2.2 Laplace’s law, Lidstone’s law and the Jeffreys-Perks law
6.2.3 Held out estimation
6.2.4 Cross-validation (deleted estimation)
6.2.5 Good-Turing estimation
6.2.6 Briefly noted
6.3 Combining Estimators
6.3.1 Simple linear interpolation
6.3.2 Katz’s backing-off
6.3.3 General linear interpolation
6.3.4 Briefly noted
6.3.5 Language models for Austen
6.4 Conclusions
6.5 Further Reading
6.6 Exercises
7 Word Sense Disambiguation
7.1 Methodological Preliminaries
7.1.1 Supervised and unsupervised learning
7.1.2 Pseudowords
7.1.3 Upper and lower bounds on performance
7.2 Supervised Disambiguation
7.2.1 Bayesian classification
7.2.2 An information-theoretic approach
7.3 Dictionary-Based Disambiguation
7.3.1 Disambiguation based on sense definitions
7.3.2 Thesaurus-based disambiguation
7.3.3 Disambiguation based on translations in a second-language corpus
7.3.4 One sense per discourse, one sense per collocation
7.4 Unsupervised Disambiguation
7.5 What Is a Word Sense?
7.6 Further Reading
7.7 Exercises
8 Lexical Acquisition
8.1 Evaluation Measures
8.2 Verb Subcategorization
8.3 Attachment Ambiguity
8.3.1 Hindle and Rooth (1993)
8.3.2 General remarks on PP attachment
8.4 Selectional Preferences
8.5 Semantic Similarity
8.5.1 Vector space measures
8.5.2 Probabilistic measures
8.6 The Role of Lexical Acquisition in Statistical NLP
8.7 Further Reading
III Grammar
9 Markov Models
9.1 Markov Models
9.2 Hidden Markov Models
9.2.1 Why use HMMs?
9.2.2 General form of an HMM
9.3 The Three Fundamental Questions for HMMs
9.3.1 Finding the probability of an observation
9.3.2 Finding the best state sequence
9.3.3 The third problem: Parameter estimation
9.4 HMMs: Implementation, Properties, and Variants
9.4.1 Implementation
9.4.2 Variants
9.4.3 Multiple input observations
9.4.4 Initialization of parameter values
9.5 Further Reading
10 Part-of-Speech Tagging
10.1 The Information Sources in Tagging
10.2 Markov Model Taggers
10.2.1 The probabilistic model
10.2.2 The Viterbi algorithm
10.2.3 Variations
10.3 Hidden Markov Model Taggers
10.3.1 Applying HMMs to POS tagging
10.3.2 The effect of initialization on HMM training
10.4 Transformation-Based Learning of Tags
10.4.1 Transformations
10.4.2 The learning algorithm
10.4.3 Relation to other models
10.4.4 Automata
10.4.5 Summary
10.5 Other Methods, Other Languages
10.5.1 Other approaches to tagging
10.5.2 Languages other than English
10.6 Tagging Accuracy and Uses of Taggers
10.6.1 Tagging accuracy
10.6.2 Applications of tagging
10.7 Further Reading
10.8 Exercises
11 Probabilistic Context Free Grammars
11.1 Some Features of PCFGs
11.2 Questions for PCFGs
11.3 The Probability of a String
11.3.1 Using inside probabilities
11.3.2 Using outside probabilities
11.3.3 Finding the most likely parse for a sentence
11.3.4 Training a PCFG
11.4 Problems with the Inside-Outside Algorithm
11.5 Further Reading
11.6 Exercises
12 Probabilistic Parsing
12.1 Some Concepts
12.1.1 Parsing for disambiguation
12.1.2 Treebanks
12.1.3 Parsing models vs. language models
12.1.4 Weakening the independence assumptions of PCFGs
12.1.5 Tree probabilities and derivational probabilities
12.1.6 There’s more than one way to do it
12.1.7 Phrase structure grammars and dependency grammars
12.1.8 Evaluation
12.1.9 Equivalent models
12.1.10 Building parsers: Search methods
12.1.11 Use of the geometric mean
12.2 Some Approaches
12.2.1 Non-lexicalized treebank grammars
12.2.2 Lexicalized models using derivational histories
12.2.3 Dependency-based models
12.2.4 Discussion
12.3 Further Reading
12.4 Exercises
IV Applications and Techniques
13 Statistical Alignment and Machine Translation
13.1 Text Alignment
13.1.1 Aligning sentences and paragraphs
13.1.2 Length-based methods
13.1.3 Offset alignment by signal processing techniques
13.1.4 Lexical methods of sentence alignment
13.1.5 Summary
13.1.6 Exercises
13.2 Word Alignment
13.3 Statistical Machine Translation
13.4 Further Reading
14 Clustering
14.1 Hierarchical Clustering
14.1.1 Single-link and complete-link clustering
14.1.2 Group-average agglomerative clustering
14.1.3 An application: Improving a language model
14.1.4 Top-down clustering
14.2 Non-Hierarchical Clustering
14.2.1 K-means
14.2.2 The EM algorithm
14.3 Further Reading
14.4 Exercises
15 Topics in Information Retrieval
15.1 Some Background on Information Retrieval
15.1.1 Common design features of IR systems
15.1.2 Evaluation measures
15.1.3 The probability ranking principle (PRP)
15.2 The Vector Space Model
15.2.1 Vector similarity
15.2.2 Term weighting
15.3 Term Distribution Models
15.3.1 The Poisson distribution
15.3.2 The two-Poisson model
15.3.3 The K mixture
15.3.4 Inverse document frequency
15.3.5 Residual inverse document frequency
15.3.6 Usage of term distribution models
15.4 Latent Semantic Indexing
15.4.1 Least-squares methods
15.4.2 Singular Value Decomposition
15.4.3 Latent Semantic Indexing in IR
15.5 Discourse Segmentation
15.5.1 TextTiling
15.6 Further Reading
15.7 Exercises
16 Text Categorization
16.1 Decision Trees
16.2 Maximum Entropy Modeling
16.2.1 Generalized iterative scaling
16.2.2 Application to text categorization
16.3 Perceptrons
16.4 k Nearest Neighbor Classification
16.5 Further Reading
Tiny Statistical Tables
Bibliography
Index
Statistical approaches to processing natural language text have become dominant in recent years. This foundational text is the first comprehensive introduction to statistical natural language processing (NLP). The book contains all the theory and algorithms needed for building NLP tools, providing broad but rigorous coverage of mathematical and linguistic foundations as well as detailed discussion of statistical methods, so that students and researchers can construct their own implementations. Applications covered include collocation finding, word sense disambiguation, probabilistic parsing, and information retrieval.