HMRC Masterclass
Natural Language Processing
Syllabus


ILCC, School of Informatics
University of Edinburgh

$Id: syllabus.html,v 1.4 2017/08/01 10:58:35 ht Exp $

1. Introduction

This is the second draft of the proposed syllabus for the masterclass to be delivered in Manchester on 12–14 September 2017, circulated for review and subsequent refinement.

The labs below assume that you have followed the instructions for preparing the machine(s) you plan to use.

2. Topics

2.1. Day 1: Foundations

  • Lecture 1    Language as data
    • Language as such
    • Corpora and annotation
    • Why is NLP hard?
  • Lab 1    NLP using Python and NLTK
    • Reading and viewing
    • Tokenisation
    • Frequency distributions: counting and plotting
  • Lecture 2    Probabilistic Language Models
    • Language modelling
    • N-gram models
    • Smoothing
  • Lab 2    N-gram models
    • Character-level models: creating, generating, scoring
    • Word-level models: authorship
    • Word-level models: smoothing and fallback
  • Lecture 3    The noisy channel model: Part-of-speech tagging and HMMs
    • Evaluation
    • Noisy channel model
    • Spelling correction
    • Finite State Machines and Hidden Markov Models
  • Lab 3    Name normalisation and stemming
    • Viterbi search
    • Surname equivalence
    • Roots and affixes

2.2. Day 2: Tasks and technologies

  • Lecture 1    Categorisation
    • Naive Bayes
    • Feature spaces
    • Other generative models
  • Lab 1   
    • Stop words
    • Language identification
    • Authorship again
  • Lecture 2    Disambiguation
    • Logistic Regression models
    • Support Vector Machines
    • Sequential Minimal Optimisation
  • Lab 2   
    • Logistic regression in practice
    • [HMRC to suggest classification tasks]
  • Lecture 3    Words and meaning
    • Semantic roles
    • Vector-space models
    • Relational models
  • Lab 3   
    • Disambiguation
    • Template filling
    • [...]

2.3. Day 3: Real tasks, real data

  • Lecture 1    Classification
    • Topic Modelling
    • Summarisation
    • Clustering
  • Lab 1    Freeform textbox feedback
    • Topic clustering
    • Supervised vs unsupervised
    • Duplicate detection
  • Lecture 2    Named entity recognition
    • Multi-word terms
    • Mutual information
    • Coreference
  • Lab 2    Patient discharge letters
    • Regular expressions
    • Specialised lexicons
    • Pipelines
  • Lecture 3    Sentiment analysis: the impact of different domains
    • Product reviews
    • Social media
    • Company reports
  • Lab 3    Favourable vs. unfavourable (vs. spurious)
    • Sentiment vocabularies
    • Large numbers of small 'documents'
    • Troll detection