HMRC Masterclass: Natural Language Processing Syllabus

1. Introduction

This is a second draft of the proposed syllabus for the masterclass to be delivered in Manchester, 12–14 September 2017. It is circulated for review and subsequent refinement.

The labs below assume that you have followed the instructions for preparing the machine(s) you plan to use.

2. Topics

2.1. Day 1: Foundations

Lecture 1 Language as data
- Language as such
- Corpora and annotation
- Why is NLP hard?
Lab 1 NLP using Python and NLTK
- Reading and viewing
- Tokenisation
- Frequency distributions: counting and plotting
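
As an indication of the level of Lab 1, the tokenisation and counting steps might look roughly like the sketch below. It is an illustration only, not the lab material itself: it uses the Python standard library as a stand-in for NLTK's `word_tokenize` and `FreqDist`, and the function names `tokenise` and `frequency_distribution`, the regular expression, and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter

def tokenise(text):
    """Lowercase the text and split it into runs of letters, digits and apostrophes.

    A deliberately crude tokeniser (an assumption for this sketch); real
    tokenisers treat punctuation, hyphens and clitics more carefully.
    """
    return re.findall(r"[a-z0-9']+", text.lower())

def frequency_distribution(tokens):
    """Map each distinct token to the number of times it occurs."""
    return Counter(tokens)

# Hypothetical sample text, not taken from any course corpus.
sample = "The cat sat on the mat. The mat was flat."
tokens = tokenise(sample)
fd = frequency_distribution(tokens)

# Print the most frequent tokens with their counts.
for word, count in fd.most_common(3):
    print(word, count)
```

Plotting is omitted here so that the sketch has no dependencies beyond the standard library.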

Lecture 2 Probabilistic Language Models
- Language modelling
- NGram models
- Smoothing
Lab 2 NGram models
- Character-level models: creating, generating, scoring
- Word-level models: authorship
- Word-level models: smoothing and fallback
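
The three character-level activities in Lab 2 (creating, generating, scoring) can be sketched with a bigram model as below. This is a minimal illustration under stated assumptions, not the lab code: the function names, the `^` and `$` boundary markers, the training string, and the choice of add-one smoothing are all hypothetical choices made for the example.

```python
import math
import random
from collections import Counter, defaultdict

def train_bigrams(text):
    """Create: count character bigrams, with '^' and '$' marking start and end."""
    padded = "^" + text + "$"
    counts = defaultdict(Counter)
    for prev, nxt in zip(padded, padded[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, rng, max_len=20):
    """Generate: sample characters from the bigram counts until '$' or max_len."""
    out, prev = [], "^"
    while len(out) < max_len:
        successors = counts[prev]
        chars = list(successors)
        weights = [successors[c] for c in chars]
        nxt = rng.choices(chars, weights=weights)[0]
        if nxt == "$":
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)

def log_prob(counts, text, vocab_size):
    """Score: add-one smoothed log2 probability of text under the model.

    Assumes every character of text was seen in training; characters
    outside the training vocabulary are not handled in this sketch.
    """
    padded = "^" + text + "$"
    total = 0.0
    for prev, nxt in zip(padded, padded[1:]):
        seen = counts.get(prev, Counter())
        total += math.log2((seen[nxt] + 1) / (sum(seen.values()) + vocab_size))
    return total

# Hypothetical training data for illustration.
training = "abracadabra"
model = train_bigrams(training)
vocab_size = len(set(training) | {"$"})  # distinct characters plus the end marker

print(generate(model, random.Random(0)))
print(log_prob(model, "abra", vocab_size))
print(log_prob(model, "dcdc", vocab_size))
```

The string "abra" follows bigrams that are frequent in the training data and so scores higher than "dcdc", whose bigrams never occur there and receive only the smoothed probability.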

Lecture 3 The noisy channel model: Part-of-speech tagging and HMMs
- Evaluation
- Noisy channel model
- Spelling correction
- Finite State Machines and Hidden Markov Models
Lab 3 Name normalisation and stemming
- Viterbi search
- Surname equivalence
- Roots and affixes
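
For orientation, Viterbi search over a Hidden Markov Model can be sketched as follows. The two-tag model and every probability in it are invented for illustration and are not drawn from any course data; the function name `viterbi` and its parameter layout are likewise assumptions. Log probabilities are used to avoid underflow on longer sequences.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations.

    Assumes every probability that is looked up is non-zero, since the
    sketch takes logarithms without any smoothing.
    """
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + math.log(trans_p[p][s]))
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Toy part-of-speech model; all numbers are hypothetical.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.8, "VERB": 0.2}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.6, "VERB": 0.4}}
emit_p = {"NOUN": {"fish": 0.6, "sleep": 0.4},
          "VERB": {"fish": 0.3, "sleep": 0.7}}

print(viterbi(["fish", "sleep"], states, start_p, trans_p, emit_p))
```

With these invented numbers the best path for "fish sleep" tags "fish" as a noun and "sleep" as a verb.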

2.2. Day 2: Tasks and technologies

Lecture 1 Categorisation
- Naive Bayes
- Feature spaces
- Other generative models
Lab 1
- Stop words
- Language identification
- Authorship again

Lecture 2 Disambiguation
- Logistic Regression models
- Support Vector Machines
- Sequential Minimal Optimisation
Lab 2
- Logistic regression in practice
- [HMRC to suggest classification tasks]

Lecture 3 Words and meaning
- Semantic roles
- Vector-space models
- Relational models
Lab 3
- Disambiguation
- Template filling
- [...]

2.3. Day 3: Real tasks, real data

Lecture 1 Classification
- Topic Modelling
- Summarisation
- Clustering
Lab 1 Freeform textbox feedback
- Topic clustering
- Supervised vs unsupervised
- Duplicate detection

Lecture 2 Named entity recognition
- Multi-word terms
- Mutual information
- Coreference
Lab 2 Patient discharge letters
- Regular expressions
- Specialised lexicons
- Pipelines

Lecture 3 Sentiment analysis: the impact of different domains
- Product reviews
- Social media
- Company reports
Lab 3 Favourable vs. unfavourable (vs. spurious)
- Sentiment vocabularies
- Large numbers of small 'documents'
- Troll detection