Annie Louis: Software and Resources
Solution Complexity Corpus
This corpus contains the data used for training and evaluating the models described in
Annie Louis and Mirella Lapata, Which Step Do I Take First? Troubleshooting with Bayesian Models, Transactions of ACL, 2015.
The corpus contains 300 FAQs related to personal computers and smartphones, together with their solution lists. The data was compiled from multiple sources on the web. The cleaned-up text is provided together with the output of several preprocessing steps: sentence segmentation, lemmatization, part-of-speech tagging, and constituent parsing. All preprocessing was done using the Stanford CoreNLP Toolkit.
The corpus also includes complexity ratings obtained through a Mechanical Turk annotation experiment (for a subset of 100 problems). Annotators were asked to rank the randomly permuted set of solutions for a problem according to complexity. For details, please see the paper above.
The file below contains the 300 FAQ problems and their solution lists. The order in which the solutions appeared on the source FAQ website is also indicated. We use this order as a proxy for a low-to-high complexity ranking. The format of this file is as follows.
Each line corresponds to one solution. There are 9 tab-separated columns.
- A FAQ identifier. Lines with the same FAQ identifier belong to the same problem.
- Another identifier corresponding to the source website from which the problem was taken. (This field is not used for experiments.)
- Solution identifier within the problem. The identifier values start from 0, and their increasing order is the order in which the solutions appear on the FAQ website. This order is the proxy low-to-high complexity order used for training our models.
- A textual description of the troubleshooting problem (not used for experiments).
- The text of the solution (cleaned and tokenized).
- The solution text after sentence segmentation and lemmatization. Tokens are separated by spaces. '#@#' indicates sentence boundaries.
- POS tags for the solution tokens.
- Parse trees for the solution sentences. As above, '#@#' marks sentence boundaries.
- Normalized tokens taken from the full solution text. (Used for the topic models.)
Download the file here: corpus.txt
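The nine-column layout described above can be read with a few lines of Python. The following is only a minimal sketch: the field names (faq_id, source_id, and so on) and the helper functions are invented here for illustration and are not part of the corpus distribution. It assumes the solution identifier is an integer, and it leaves the POS tag and normalized token columns as raw strings because their internal delimiters are not specified above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple

SENT_SEP = "#@#"  # sentence boundary marker described in the column list


class Solution(NamedTuple):
    """One line of the corpus file. Field names are illustrative only."""
    faq_id: str
    source_id: str
    solution_id: int
    problem: str
    text: str
    lemma_sentences: List[List[str]]  # one list of lemmas per sentence
    pos_tags: str                     # kept raw: delimiter not specified
    parse_trees: List[str]            # one parse tree string per sentence
    normalized_tokens: str            # kept raw: delimiter not specified


def split_sentences(field: str) -> List[str]:
    """Split a column on the '#@#' marker and drop empty pieces."""
    return [part.strip() for part in field.split(SENT_SEP) if part.strip()]


def parse_corpus_line(line: str) -> Solution:
    """Parse one tab-separated line with exactly 9 columns."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    faq_id, source_id, sol_id, problem, text, lemmas, pos, trees, norm = cols
    return Solution(
        faq_id=faq_id,
        source_id=source_id,
        solution_id=int(sol_id),  # assumption: identifiers are integers
        problem=problem,
        text=text,
        lemma_sentences=[s.split() for s in split_sentences(lemmas)],
        pos_tags=pos,
        parse_trees=split_sentences(trees),
        normalized_tokens=norm,
    )


def group_by_problem(lines: Iterable[str]) -> Dict[str, List[Solution]]:
    """Group solutions by FAQ identifier, sorted by solution identifier
    (the proxy low-to-high complexity order)."""
    problems: Dict[str, List[Solution]] = defaultdict(list)
    for line in lines:
        if line.strip():
            sol = parse_corpus_line(line)
            problems[sol.faq_id].append(sol)
    for sols in problems.values():
        sols.sort(key=lambda s: s.solution_id)
    return dict(problems)
```

With the file downloaded locally, something like `group_by_problem(open("corpus.txt", encoding="utf-8"))` should return one sorted solution list per problem; the UTF-8 encoding is an assumption.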
The file below contains the complexity annotations obtained for a subset of 100 problems from our corpus above. Each solution set was randomly permuted and then presented to annotators, who ranked the solutions from low to high complexity. For each judgement, the file records the order in which the solutions were presented to the annotator (in terms of their true solution identifiers) and the ranking produced by the annotator.
Four annotators ranked each solution set. Each line of the file contains one annotator's judgements for one problem's solution set. There are 6 tab-separated columns.
- The FAQ identifier from the corpus file (indicates which problem was annotated).
- Textual description of the problem.
- Annotator identifier.
- The solution order that was presented on the interface, in terms of the true solution identifiers. E.g., 1 0 2 indicates that the solution with id 1 was listed first, followed by the solution with id 0 and then the solution with id 2. In this example, the solution shown in the middle (id 0) has the lowest true complexity.
- The ranking assigned by the annotator. Note that the annotator's ranking starts from 1 (not 0).
- The Kendall Tau correlation value between the true FAQ order and the ranking produced by the annotator.
Download the file here: annot.txt
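As an illustration of the six-column annotation format, the sketch below parses one line and recomputes a Kendall Tau value. Several points are assumptions rather than facts about the file: that the two order columns are space-separated (as in the 1 0 2 example), that the annotator's ranks are listed in the same positions as the presented solutions, and that a plain tau without tie correction is appropriate. The function and key names are invented for illustration, and the value stored in the file may have been computed with a different variant.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def parse_annotation_line(line: str) -> Dict[str, object]:
    """Parse one tab-separated annotation line with exactly 6 columns."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) != 6:
        raise ValueError(f"expected 6 columns, found {len(cols)}")
    faq_id, problem, annotator, shown, ranking, tau = cols
    return {
        "faq_id": faq_id,
        "problem": problem,
        "annotator": annotator,
        # Assumption: both order columns are space-separated integers.
        "shown_order": [int(x) for x in shown.split()],
        "annotator_ranks": [int(x) for x in ranking.split()],
        "tau": float(tau),
    }


def kendall_tau(xs: Sequence[int], ys: Sequence[int]) -> float:
    """Kendall Tau without tie correction (tau-a):
    (concordant pairs - discordant pairs) / total pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


def recompute_tau(record: Dict[str, object]) -> float:
    """Compare true solution ids with annotator ranks, position by position.
    Assumption: rank k in position p refers to the solution shown at p."""
    shown: List[int] = record["shown_order"]        # true ids, 0-based
    ranks: List[int] = record["annotator_ranks"]    # annotator ranks, 1-based
    return kendall_tau(shown, ranks)
```

Because Kendall Tau depends only on the relative order of values, the offset between 0-based solution identifiers and 1-based annotator ranks does not affect the result.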
Please send any questions or suggestions to: Annie Louis
alouis at inf dot ed dot ac dot uk