Show all abstracts
 Canonical Correlation Inference for Mapping Abstract Scenes to Text, Helen Jiang, Nikos Papasarantopoulos and Shay B. Cohen, In arXiv (preprint)
 LatentVariable PCFGs: Background and Applications, Shay B. Cohen, In MoL 2017 (to appear) [pdf]
 The SUMMA Platform Prototype, Renars Liepins et al., In EACL 2017 (demo track) [pdf]
 An Incremental Parser for Abstract Meaning Representation, Marco Damonte, Shay B. Cohen and Giorgio Satta, In EACL 2017 [pdf] [arxiv] [Marco's slides]
 SemiSupervised Learning of Sequence Models with the Method of Moments, Zita Marinho, André F. T. Martins, Shay B. Cohen and Noah A. Smith , In EMNLP 2016
[pdf] [abstract] [bibtex] [Zita's slides]
We propose a fast and scalable method for semisupervised learning of sequence models,
based on anchor words and moment matching. Our method can handle hidden Markov models with featurebased loglinear emissions. Unlike other semisupervised methods, no decoding passes are necessary on the unlabeled data and no graph needs to be constructedonly one pass is necessary to collect moment statistics.
The model parameters are estimated by solving a small quadratic program for each feature.
Experiments on partofspeech (POS) tagging for Twitter and for a lowresource language (Malagasy)
show that our method can learn from very few annotated sentences.
@book{marinho16,
title={SemiSupervised Learning of Sequence Models with the Method of Moments},
author={Z. Marinho and A. F. T. Martins and S. B. Cohen and N. A. Smith},
booktitle={Proceedings of {EMNLP}},
year={2016}
}
 Bayesian Analysis in Natural Language Processing, Shay B. Cohen, Synthesis Lectures on Human Language Technologies, Morgan and Claypool, 2016 [abstract] [bibtex] [website] [hardcopy] [amazon]
Natural language processing (NLP) went through a profound transformation in the mid1980s when it shifted to make heavy use of corpora and datadriven techniques to analyze language. Since then, the use of statistical techniques in NLP has evolved in several ways. One such example of evolution took place in the late 1990s or early 2000s, when fullfledged Bayesian machinery was introduced to NLP. This Bayesian approach to NLP has come to accommodate for various shortcomings in the frequentist approach and to enrich it, especially in the unsupervised setting, where statistical learning is done without target prediction examples.
We cover the methods and algorithms that are needed to fluently read Bayesian learning papers in NLP and to do research in the area. These methods and algorithms are partially borrowed from both machine learning and statistics and are partially developed "inhouse" in NLP. We cover inference techniques such as Markov chain Monte Carlo sampling and variational inference, Bayesian estimation, and nonparametric modeling. We also cover fundamental concepts in Bayesian statistics such as prior distributions, conjugacy, and generative modeling. Finally, we cover some of the fundamental modeling techniques in NLP, such as grammar modeling, and their use with Bayesian analysis.
Keywords: natural language processing, computational linguistics, Bayesian statistics, Bayesian NLP, statistical learning, inference in NLP, grammar modeling in NLP
@book{cohen16,
title={Bayesian Analysis in Natural Language Processing},
author={Shay B. Cohen},
series = {Synthesis Lectures on Human Language Technologies},
publisher={Morgan and Claypool},
year={2016}
}
 Encoding Prior Knowledge with Eigenword Embeddings, Dominique Osborne, Shashi Narayan and Shay B. Cohen, In TACL 2016 [pdf] [abstract] [bibtex] [Shashi's slides]
Canonical correlation analysis (CCA) is a method for reducing the dimension of data represented using two views. It has been previously used to derive word embeddings, where one view indicates a word, and the other view indicates its context. We describe a way to incorporate prior knowledge into CCA, give a theoretical justification for it, and test it by deriving word embeddings and evaluating them on a myriad of datasets.
@inproceedings{osborne16,
author = "D. Osborne and S. Narayan and S. B. Cohen",
title = "Encoding Prior Knowledge with Eigenword Embeddings",
booktitle = "Transactions of the Association for Computational Linguistics",
year = "2016"
}
 Optimizing Spectral Learning for Parsing, Shashi Narayan and Shay B. Cohen, In ACL 2016 [pdf] [abstract] [bibtex] [models] [Shashi's slides]
We describe a search algorithm for optimizing the number of latent
states when estimating latentvariable PCFGs with spectral methods.
Our results show that contrary to the common belief that the number of
latent states for each nonterminal in an LPCFG can be decided in
isolation with spectral methods, parsing results significantly improve
if the number of latent states for each nonterminal is globally
optimized, while taking into account interactions between the
different nonterminals. In addition, we contribute an empirical
analysis of spectral algorithms on eight morphologically rich
languages: Basque, French, German, Hebrew, Hungarian, Korean, Polish
and Swedish. Our results show that our estimation consistently
performs better or close to coarsetofine expectationmaximization
techniques for these languages
@inproceedings{narayan16b,
title={Optimizing Spectral Learning for Parsing},
author={Shashi Narayan and Shay B. Cohen},
booktitle={Proceedings of {ACL}},
year={2016}
}
 Paraphrase Generation from LatentVariable PCFGs for Semantic Parsing, Shashi Narayan, Siva Reddy and Shay B. Cohen, In INLG, 2016 [arxiv] [abstract] [bibtex] [Shashi's slides]
One of the limitations of semantic parsing approaches to opendomain question answering is the lexicosyntactic gap between natural language questions and knowledge base entries  there are many ways to ask a question, all with the same answer. In this paper we propose to bridge this gap by generating paraphrases of the input question with the goal that at least one of them will be correctly mapped to a knowledgebase query. We introduce a novel grammar model for paraphrase generation that does not require any sentencealigned paraphrase corpus. Our key idea is to leverage the flexibility and scalability of latentvariable probabilistic contextfree grammars to sample paraphrases. We do an extrinsic evaluation of our paraphrases by plugging them into a semantic parser for Freebase. Our evaluation experiments on the WebQuestions benchmark dataset show that the performance of the semantic parser significantly improves over strong baselines.
@inproceedings{narayan16,
title={Paraphrase Generation from LatentVariable PCFGs for Semantic Parsing},
author={Shashi Narayan and Siva Reddy and Shay B. Cohen},
booktitle={Proceedings of {INLG}},
year={2015}
}
 Parsing Linear ContextFree Rewriting Systems with Fast Matrix Multiplication, Shay B. Cohen and Daniel Gildea, In Computational Linguistics, 2016 [pdf] [abstract] [bibtex] [arxiv]
We describe a matrix multiplication recognition algorithm for a subset of binary linear contextfree rewriting systems (LCFRS) with running time O(n^{\omega d}) where M(m)=O(m^\omega) is the running time for m*m matrix multiplication and d is the "contact rank" of the LCFRS  the maximal number of combination and noncombination points that appear in the grammar rules. We also show that this algorithm can be used as a subroutine to get a recognition algorithm for general binary LCFRS with running time O(n^(\omega d+1)). The currently best known \omega is smaller than 2.38. Our result provides another proof for the best known result for parsing mildly context sensitive formalisms such as combinatory categorial grammars, head grammars, linear indexed grammars, and tree adjoining grammars, which can be parsed in time O(n^4.76). It also shows that inversion transduction grammars can be parsed in time O(n^5.76). In addition, binary LCFRS subsumes many other formalisms and types of grammars, for some of which we also improve the asymptotic complexity of parsing.
@article{cohen15a,
title={Parsing Linear ContextFree Rewriting Systems with Fast Matrix Multiplication},
author={Shay B. Cohen and Daniel Gildea},
journal={Computational Linguistics},
year={2016}
}
 LowRank Approximation of Weighted Tree Automata, Guillaume Rabusseau, Borja Balle and Shay B. Cohen, In AISTATS 2016 [pdf] [abstract] [bibtex] [supplementary material] [arxiv]
We describe a technique to minimize weighted tree automata (WTA), a powerful formalisms that subsumes probabilistic contextfree grammars (PCFGs) and latentvariable PCFGs. Our method relies on a singular value decomposition of the underlying Hankel matrix defined by the WTA. Our main theoretical result is an efficient algorithm for computing the SVD of an infinite Hankel matrix implicitly represented as a WTA. We provide an analysis of the approximation error induced by the minimization, and we evaluate our method on realworld data originating in newswire treebank. We show that the model achieves lower perplexity than previous methods for PCFG minimization, and also is much more stable due to the absence of local optima.
@inproceedings{rabusseua16,
title={LowRank Approximation of Weighted Tree Automata},
author={Guillaume Rabusseau and Borja Balle and Shay B. Cohen},
booktitle={Proceedings of {AISTATS}},
year={2016}
}
 Conversation Trees: A Grammar Model for Topic Structure in Forums, Annie Louis and Shay B. Cohen, In EMNLP 2015 [pdf] [abstract] [bibtex] [data]
Online forum discussions proceed differently from facetoface conversations and
any single thread on an online forum contains posts on different subtopics.
This work aims to characterize the content of
a forum thread as a \emph{conversation tree} of topics. We present models
that jointly perform two tasks: segment a thread into subparts, and assign
a topic to each part. Our core idea is a definition of
topic structure using probabilistic grammars. By leveraging the flexibility
of two grammar formalisms, ContextFree Grammars and Linear
ContextFree Rewriting Systems, our models
create desirable structures for forum threads:
our topic segmentation is hierarchical, links nonadjacent segments on the
same topic, and jointly labels the topic during segmentation. We show
that our models outperform a number of tree generation baselines.
@inproceedings{louis15,
title={Conversation Trees: A Grammar Model for Topic Structure in Forums},
author={Annie Louis and Shay B. Cohen},
booktitle={Proceedings of {EMNLP}},
year={2015}
}
 Diversity in Spectral Learning for Natural Language Parsing, Shashi Narayan and Shay B. Cohen, In EMNLP 2015 [pdf] [abstract] [bibtex]
We describe an approach to create a diverse set of predictions with spectral
learning of latentvariable PCFGs (LPCFGs). Our approach works by
creating multiple spectral models where noise is added to the
underlying features in the training set before the estimation of
each model. We describe three ways to decode with multiple
models. In addition, we describe a simple variant of the spectral
algorithm for LPCFGs that is fast and leads to compact models. Our
experiments for natural language parsing, for English and German,
show that we get a significant improvement over baselines comparable
to state of the art. For English, we achieve the $F_1$ score of
90.18, and for German we achieve the $F_1$ score of 83.38.
@inproceedings{narayan15,
title={Diversity in Spectral Learning for Natural Language Parsing},
author={Shashi Narayan and Shay B. Cohen},
booktitle={Proceedings of {EMNLP}},
year={2015}
}
 A Coactive Learning View of Online Structured Prediction in Statistical Machine Translation, Artem Sokolov, Stefan Riezler and Shay B. Cohen, In CoNLL 2015 [pdf] [abstract] [bibtex] [Artem's slides]
We present a theoretical analysis of online parameter tuning in statistical machine translation (SMT) from a coactive learning view. This perspective allows us to give regret and generalization bounds for latent perceptron algorithms that are common in SMT, but fall outside of the standard convex optimization scenario. Coactive learning also introduces the concept of weak feedback, which we apply in a proofofconcept experiment to SMT, showing that learning from feedback that consists of slight improvements over predictions leads to convergence in regret and translation error rate. This suggests that coactive learning might be a viable framework for interactive machine translation. Furthermore, we find that surrogate translations replacing references that are unreachable in the decoder search space can be interpreted as weak feedback and lead to convergence in learning, if they admit an underlying linear model.
@inproceedings{sokolv15,
title={A Coactive Learning View of Online Structured Prediction in Statistical Machine Translation},
author={Artem Sokolv and Stefan Riezler and Shay B. Cohen},
booktitle={Proceedings of {CoNLL}},
year={2015}
}
 Lexical Event Ordering with an EdgeFactored Model, Omri Abend, Shay B. Cohen and Mark Steedman, In NAACL 2015 [pdf] [abstract] [bibtex] [data] [slides]
Extensive lexical knowledge is necessary for temporal analysis and planning tasks.
We address in this paper a lexical setting that allows for the straightforward
incorporation of rich features and structural constraints. We explore a lexical event ordering task, namely
determining the likely temporal order of events based solely on the identity of their predicates and arguments.
We propose an ``edgefactored'' model for the task that decomposes over the edges of the event graph.
We learn it using the structured perceptron. As lexical tasks require large amounts of text, we do not attempt manual annotation and
instead use the textual order of events in a domain where this order is aligned
with their temporal order, namely cooking recipes.
@inproceedings{abend15,
author = "Omri Abend and S. B. Cohen and Mark Steedman",
title = "Lexical Event Ordering with an EdgeFactored Model",
booktitle = "Proceedings of {NAACL}",
year = "2015"
}
 Online Adaptor Grammars with Hybrid Inference, Ke Zhai, Jordan BoydGraber and Shay B. Cohen, In TACL 2014
[pdf] [supplementary material] [abstract] [bibtex] [Ke's code]
Adaptor grammars are a flexible, powerful formalism for defining nonparametric, unsupervised models of grammar productions. This flexibility comes at the cost of expensive inference. We address the difficulty of inference through an online algorithm which uses a hybrid of Markov chain Monte Carlo and variational inference. We show that this inference strategy improves scalability without sacrificing performance on unsupervised word segmentation and topic modeling tasks.
@inproceedings{zhai14,
author = "K. Zhai and J. BoydGraber and S. B. Cohen",
title = "Online Adaptor Grammars with Hybrid Inference",
booktitle = "Transactions of the Association for Computational Linguistics",
year = "2014"
}
 LatentVariable Synchronous CFGs for Hierarchical Translation, Avneesh Saluja, Chris Dyer and Shay B. Cohen, In EMNLP 2014
[pdf] [abstract] [bibtex] [code]
Datadriven refinement of nonterminal categories has been demonstrated to be a reliable technique for improving monolingual parsing with PCFGs. In this paper,
we extend these techniques to learn latent refinements of singlecategory synchronous grammars, so as to improve
translation performance. We compare two estimators for this latentvariable model:
one based on EM and the other is a spectral algorithm based on the method of moments. We evaluate their performance on a
ChineseEnglish translation task. The results indicate that we can achieve significant gains over the baseline with both approaches, but in particular the momentsbased
estimator is both faster and performs better than EM.
@inproceedings{saluja14,
author = "A. Saluja and C. Dyer and S. B. Cohen",
title = "LatentVariable Synchronous {CFGs} for Hierarchical Translation",
booktitle = "Proceedings of {EMNLP}",
year = "2014"
}
 Spectral Learning of LatentVariable PCFGs: Algorithms and Sample Complexity, Shay B. Cohen, Karl Stratos, Michael Collins, Dean P. Foster and Lyle Ungar, In JMLR 2014
[pdf] [abstract] [bibtex]
We introduce a spectral learning algorithm for latentvariable PCFGs (Matsuzaki et al., 2005; Petrov et al., 2006).
Under a separability (singular value) condition, we prove that the method provides statistically consistent parameter
estimates. Our result rests on three theorems: the first gives
a tensor form of the insideoutside algorithm for PCFGs; the second
shows that the required tensors can be estimated directly from
training examples where hiddenvariable values are missing; the third
gives a PACstyle convergence bound for the estimation method.
@article{cohen14c,
author = "S. B. Cohen and K. Stratos and M. Collins and D. P. Foster and L. Ungar",
title = "Spectral Learning of LatentVariable {PCFGs}: Algorithms and Sample Complexity",
journal = "Journal of Machine Learning Research",
year = "2014"
}
 A Provably Correct Learning Algorithm for LatentVariable PCFGs, Shay B. Cohen and Michael Collins, In ACL 2014
[pdf] [abstract] [bibtex] [slides]
We introduce a provably correct learning algorithm for
latentvariable PCFGs.
The algorithm relies on two steps: first, the use of a
matrixdecomposition algorithm applied to a cooccurrence matrix
estimated from the parse trees in a training sample; second, the use
of EM applied to a convex objective derived from the training samples
in combination with the output from the matrix decomposition.
Experiments on parsing and a language modeling problem
show that the algorithm is efficient and
effective in practice.
@inproceedings{cohen14b,
author = "S. B. Cohen and M. Collins",
title = "A Provably Correct Learning Algorithm for LatentVariable {PCFGs}",
booktitle = "Proceedings of {ACL}",
year = "2014"
}
 Spectral Unsupervised Parsing with Additive Tree Metrics, Ankur P. Parikh, Shay B. Cohen and Eric Xing, In ACL 2014
[pdf] [abstract] [bibtex] [appendix] [Ankur's slides]
We propose a spectral approach for unsupervised constituent parsing that comes with theoretical guarantees on latent structure recovery. Our approach is grammarless  we directly learn the bracketing structure of a given sentence without using a grammar model. The main algorithm is based on lifting the concept of additive tree metrics for structure learning of latent trees in the phylogenetic and machine learning communities to the case where the tree structure varies across examples. Although finding the ``minimal'' latent tree is NPhard in general, for the case of projective trees we find that it can be found using bilexical parsing algorithms. Empirically, our algorithm performs favorably compared to the constituent context model of Klein and Manning (2002) without the need for careful initialization.
@inproceedings{parikh14,
author = "A. P. Parikh and S. B. Cohen and E. Xing",
title = "Spectral Unsupervised Parsing with Additive Tree Metrics",
booktitle = "Proceedings of {ACL}",
year = "2014"
}
 Lexical Inference over MultiWord Predicates: A Distributional Approach, Omri Abend, Shay B. Cohen and Mark Steedman, In ACL 2014
[pdf] [abstract] [bibtex] [code for train/test split]
Representing predicates in terms of their argument distribution is common
practice in NLP.
Multiword predicates (MWPs) in this context are often either
disregarded or considered as fixed expressions.
The latter treatment is unsatisfactory in two ways: (1) identifying MWPs is notoriously difficult,
(2) MWPs show varying degrees of compositionality and could benefit from
taking into account the identity of their component parts.
We propose a novel approach that integrates the distributional
representation of multiple subsets of the MWP's words. We assume a latent
distribution over subsets of the MWP, and estimate it relative
to a downstream prediction task.
Focusing on the supervised identification of lexical inference relations, we
compare against stateoftheart baselines that consider
a single subset of an MWP, obtaining substantial improvements.
To our knowledge, this is the first work to address lexical relations between MWPs
of varying degrees of compositionality within distributional semantics.
@inproceedings{abend14,
author = "O. Abend and S. B. Cohen and M. Steedman",
title = "Lexical Inference over MultiWord Predicates: A Distributional Approach",
booktitle = "Proceedings of {ACL}",
year = "2014"
}
 Spectral Learning of Refinement HMMs, Karl Stratos, Alexander M. Rush, Shay B. Cohen and Michael Collins, In CoNLL 2013 [pdf] [abstract] [bibtex] [Karl's slides]
We derive a spectral algorithm for learning the parameters of a refinement HMM.
This method is simple, efficient, and can be applied to a wide range of supervised
sequence labeling tasks. Like other spectral methods, it avoids the problem of local optima and provides a consistent estimate of the parameters. Our experiments
on a phoneme recognition task show that when equipped with informative feature functions, it performs significantly better than a supervised HMM and competitively
with EM.
@inproceedings{stratos13,
author = "K. Stratos and A. M. Rush and S. B. Cohen and M. Collins",
title = "Spectral Learning of Refinement {HMMs}",
booktitle = "Proceedings of {CoNLL}",
year = "2013"
}
 The Effect of Nontightness on Bayesian Estimation of PCFGs, Shay B. Cohen and Mark Johnson, In ACL 2013 [pdf] [mathematica output] [abstract] [bibtex] [Mark's slides] [blog post]
Probabilistic contextfree grammars have the unusual property of not always defining tight distributions (i.e., the sum of the ``probabilities'' of the trees the grammar generates can
be less than one). This paper reviews how this nontightness can arise and discusses its impact on Bayesian estimation of PCFGs. We
begin by presenting the notion of ``almost everywhere tight grammars'' and show that linear CFGs follow it. We then propose three different ways of reinterpreting nontight PCFGs
to make them tight, show that the Bayesian estimators in Johnson et al. (2007) are correct
under one of them, and provide MCMC samplers for the other two. We conclude with a
discussion of the impact of tightness empirically.
@inproceedings{cohen13c,
author = "S. B. Cohen and M. Johnson",
title = "The Effect of Nontightness on Bayesian Estimation of {PCFGs}",
booktitle = "Proceedings of {ACL}",
year = "2013"
}
 Experiments with Spectral Learning of LatentVariable PCFGs, Shay B. Cohen, Karl Stratos, Michael Collins, Dean P. Foster and Lyle Ungar, In NAACL 2013 [pdf] [abstract] [bibtex] [talk (video)] [talk (slides)]
Latentvariable PCFGs (LPCFGs) are a highly successful model for
natural language parsing. Recent work (Cohen et al., 2012) has introduced a
spectral algorithm for parameter estimation of LPCFGs, whichunlike
the EM algorithmis guaranteed to give consistent parameter estimates
(it has PACstyle guarantees of sample complexity). This paper
describes experiments using the spectral algorithm. We show that the algorithm
provides models with the same accuracy as EM, but is an order of
magnitude more efficient.
We describe a number of key steps used to obtain this
level of performance; these should be relevant to other work on the
application of spectral learning algorithms. We view our results as strong
empirical evidence for the viability of spectral methods as an
alternative to EM.
@inproceedings{cohen13b,
author = "S. B. Cohen and K. Stratos and M. Collins and D. P. Foster and L. Ungar",
title = "Experiments with Spectral Learning of LatentVariable {PCFGs}",
booktitle = "Proceedings of {NAACL}",
year = "2013"
}
 Approximate PCFG Parsing Using Tensor Decomposition, Shay B. Cohen, Giorgio Satta and Michael Collins, In NAACL 2013 [pdf] [abstract] [bibtex]
We provide an approximation algorithm for PCFG parsing, which asymptotically improves time complexity with respect to the input grammar size, and prove upper bounds on the approximation quality.
We test our algorithm on two treebanks, and get significant improvements in parsing speed.
@inproceedings{cohen13a,
author = "S. B. Cohen and G. Satta and M. Collins",
title = "Approximate PCFG Parsing Using Tensor Decomposition",
booktitle = "Proceedings of {NAACL}",
year = "2013"
}
 Tensor Decomposition for Fast Parsing with LatentVariable PCFGs, Shay B. Cohen and Michael Collins, In NIPS 2012 [pdf] [abstract] [bibtex]
We describe an approach to speedup inference with latentvariable PCFGs, which have been shown to
be highly effective for natural language parsing. Our approach is based on a tensor formulation recently
introduced for spectral estimation of latentvariable PCFGs coupled with a tensor decomposition
algorithm wellknown in the multilinear algebra literature.
We also describe an error bound for this approximation, which gives guarantees showing
that if the underlying tensors are well approximated, then the probability distribution
over trees will also be well approximated.
Empirical evaluation on realworld natural
language parsing data demonstrates a significant speedup at minimal cost for parsing performance.
@inproceedings{cohen12c,
author = "S. B. Cohen and M. Collins",
title = "Tensor Decomposition for Fast Parsing with LatentVariable {PCFGs}",
booktitle = "Proceedings of NIPS",
year = "2012"
}
 Elimination of Spurious Ambiguity in TransitionBased Dependency Parsing, Shay B. Cohen, Carlos GómezRodríguez and Giorgio Satta, In arXiv (1206.6735), 2012 [pdf] [abstract] [bibtex]
We present a novel technique to remove spurious ambiguity from transition systems for dependency parsing. Our technique chooses a canonical sequence of transition operations (computation) for a given dependency tree. Our technique can be applied to a large class of bottomup transition systems, including for instance Nivre (2004) and Attardi (2006).
@techreport{cohen12b,
author = "S. B. Cohen and C. G{\'o}mezRodr{\'\i}guez and G. Satta",
title = "Elimination of Spurious Ambiguity in TransitionBased Dependency Parsing",
year = "2012",
eprint = "arXiv:1206.6735",
url = "http://arxiv.org/pdf/1206.6735v1"
}
 Spectral Learning of LatentVariable PCFGs, Shay B. Cohen, Karl Stratos, Michael Collins, Dean P. Foster and Lyle Ungar, In ACL 2012 [pdf] [longer JMLR version, stronger model] [abstract] [bibtex]
We introduce a spectral learning algorithm for latentvariable PCFGs
(Petrov et al., 2006; Matsuzaki et al., 2005). Under a separability (singular
value) condition, we prove that the method provides consistent parameter
estimates. Our result rests on three theorems: the first gives
a tensor form of the insideoutside algorithm for PCFGs; the second
shows that the required tensors can be estimated directly from
training examples where hiddenvariable values are missing; the third
gives a PACstyle convergence bound for the estimation method.
@inproceedings{cohen12a,
author = "S. B. Cohen and K. Stratos and M. Collins and D. P. Foster and L. Ungar",
title = "Spectral Learning of LatentVariable {PCFGs}",
booktitle = "Proceedings of ACL",
year = "2012"
}
 Empirical Risk Minimization for Probabilistic Grammars: Sample Complexity and Hardness of Learning, Shay B. Cohen and Noah A. Smith, Computational Linguistics (2012) [pdf] [abstract] [bibtex]
Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. They are used ubiquitously in computational linguistics. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of probabilistic grammars using the logloss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting. By making assumptions about the underlying distribution that are appropriate for natural language scenarios, we are able to derive distributiondependent sample complexity bounds for probabilistic grammars. We also give simple algorithms for carrying out empirical risk minimization using this framework in both the supervised and unsupervised settings. In the unsupervised case, we show that the problem of minimizing empirical risk is NPhard. We therefore suggest an approximate algorithm, similar to expectationmaximization, to minimize the empirical risk.
@article{cohen12c,
author = "S. B. Cohen and N. A. Smith",
title = "Empirical Risk Minimization for Probabilistic Grammars: Sample Complexity and Hardness of Learning",
journal = "Computational Linguistics",
volume = "38",
number = "3",
pages = "479526",
year = "2012"
}
 Unsupervised Structure Prediction with NonParallel Multilingual Guidance, Shay B. Cohen, Dipanjan Das and Noah A. Smith, In EMNLP 2011[pdf] [abstract] [bibtex]
We describe a method for prediction of linguistic structure in a language for which only unlabeled data is available,
using annotated data from a set of one or more helper languages.
Our approach is based on a model that locally mixes between supervised
models from the helper languages.
Parallel data is not used, allowing the technique to be applied even in domains where humantranslated texts are unavailable.
We obtain stateoftheart performance for two tasks of
structure prediction: unsupervised partofspeech tagging and unsupervised dependency
parsing.
@inproceedings{cohen11b,
author = "S. B. Cohen and D. Das and N. A. Smith",
title = "Unsupervised Structure Prediction with NonParallel Multilingual Guidance",
booktitle = "Proceedings of EMNLP",
year = "2011"
}
 Exact Inference for Generative Probabilistic NonProjective Dependency Parsing, Shay B. Cohen, Carlos GómezRodríguez and Giorgio Satta, In EMNLP 2011 [pdf] [abstract] [bibtex]
We describe a generative model for nonprojective dependency parsing
based on a simplified version of a transition system that has recently appeared in the literature.
We then develop a dynamic programming parsing algorithm for our model,
and derive an insideoutside algorithm that can be used
for unsupervised learning of nonprojective dependency trees.
@inproceedings{cohen11b,
author = "S. B. Cohen and C. G{\'o}mezRodr{\'\i}guez and G. Satta",
title = "Exact Inference for Generative Probabilistic NonProjective Dependency Parsing",
booktitle = "Proceedings of EMNLP",
year = "2011"
}
 Products of Weighted Logic Programs, Shay B. Cohen, Robert J. Simmons and Noah A. Smith, In Theory and Practice of Logic Programming, 2011 [pdf] [abstract] [bibtex]
Weighted logic programming, a generalization of bottomup logic
programming, is a wellsuited framework for specifying dynamic
programming algorithms. In this setting, proofs correspond to the
algorithm's output space, such as a path through a graph or a
grammatical derivation, and are given a realvalued score (often
interpreted as a probability) that depends on the real weights of the base
axioms used in the proof. The desired output is a function over all
possible proofs, such as a sum of scores or an optimal score. We
describe the PRODUCT transformation, which can merge two weighted logic
programs into a new one. The resulting program optimizes a product
of proof scores from the original programs, constituting a scoring
function known in machine learning as a ``product of experts.''
Through the addition of intuitive constraining side conditions, we
show that several important dynamic programming
algorithms can be derived by applying PRODUCT to weighted logic programs
corresponding to simpler weighted logic programs.
@article{cohen11a,
author = "S. B. Cohen and R. J. Simmons and N. A. Smith",
title = "Products of Weighted Logic Programs",
journal = "Theory and Practice of Logic Programming",
year = "2011"
}
 Empirical Risk Minimization with Approximations of Probabilistic Grammars, Shay B. Cohen and Noah A. Smith, In NIPS, 2010 [pdf] [appendix  pdf] [abstract] [bibtex]
Probabilistic grammars are generative statistical models that are useful for
compositional and sequential structures. We present a framework, reminiscent of structural risk minimization,
for empirical risk minimization of the parameters of a fixed probabilistic
grammar using the logloss. We derive sample complexity bounds in this framework that apply
both to the supervised setting and the unsupervised setting.
@inproceedings{cohen10e,
author = "S. B. Cohen and N. A. Smith",
title = "Empirical Risk Minimization with Approximations of Probabilistic Grammars",
booktitle = "Proceedings of {NIPS}",
year = "2010"
}
 Covariance in Unsupervised Learning of Probabilistic Grammars, Shay B. Cohen and Noah A. Smith, In JMLR, 2010 [pdf] [abstract] [bibtex]
Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text.
Their symbolic component is amenable to inspection by humans, while their probabilistic
component helps resolve ambiguity. They also permit the use of wellunderstood, generalpurpose learning
algorithms. There has been an increased interest in using probabilistic grammars in the
Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior.
The Dirichlet prior has several limitations, including that it cannot directly model covariance
between the probabilistic grammar's parameters. Yet, various grammar parameters are expected to be correlated because
the elements in language they represent share linguistic properties.
In this paper, we suggest an alternative to the
Dirichlet prior, a family of logistic normal distributions.
We derive an inference algorithm for this family of distributions and
experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors
on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, nonparallel data.
@article{cohen10d,
author = "S. B. Cohen and N. A. Smith",
title = "Covariance in Unsupervised Learning of Probabilistic Grammars",
journal = "Journal of Machine Learning Research",
volume = "11",
pages = "30173051",
year = "2010"
}
 Viterbi Training for PCFGs: Hardness Results and Competitiveness of Uniform Initialization, Shay B. Cohen and Noah A. Smith, In ACL 2010 [pdf] [abstract] [bibtex] [slides]
We consider the search for a maximum likelihood assignment of hidden derivations and
grammar weights for a probabilistic contextfree grammar, the problem approximately solved by ``Viterbi training.''
We show that solving and even approximating Viterbi training for PCFGs
is NPhard. We
motivate the use of uniformatrandom initialization
for Viterbi EM as an optimal initializer in
absence of further information about the correct model
parameters, providing an approximate bound on the
loglikelihood.
@inproceedings{cohen10c,
author = "S. B. Cohen and N. A. Smith",
title = "Viterbi Training for {PCFGs}: Hardness Results and Competitiveness of Uniform Initialization",
booktitle = "Proceedings of {ACL}",
year = "2010"
}
 Variational Inference for Adaptor Grammars, Shay B. Cohen, David M. Blei and Noah A. Smith, In NAACL 2010 [pdf] [abstract] [bibtex] [slides]
Adaptor grammars extend probabilistic contextfree grammars to
define prior distributions over trees with ``rich get richer''
dynamics. Inference for adaptor grammars seeks to find parse trees
for raw text. This paper describes a variational inference
algorithm for adaptor grammars, providing an alternative to Markov
chain Monte Carlo methods. To derive this method, we develop a
stickbreaking representation of adaptor grammars, a representation
that enables us to define adaptor grammars with recursion.
We report experimental results on a word segmentation
task, showing that variational inference performs comparably to
MCMC. Further, we show a significant speedup when parallelizing the
algorithm. Finally, we report promising results for a new
application for adaptor grammars, dependency grammar induction.
@inproceedings{cohen10b,
author = "S. B. Cohen and D. M. Blei and N. A. Smith",
title = "Variational Inference for Adaptor Grammars",
booktitle = "Proceedings of {NAACL}",
year = "2010"
}
 Variational Inference for Grammar Induction with Prior Knowledge, Shay B. Cohen and Noah A. Smith, In ACL 2009 (short paper track) [pdf] [abstract] [bibtex]
Variational EM has become a popular technique in probabilistic NLP
with hidden variables.
Commonly, for computational tractability,
we make strong independence assumptions, such as the meanfield
assumption, in approximating posterior distributions over hidden
variables.
We show how a looser restriction on the approximate posterior, requiring
it to be a mixture, can help inject prior knowledge to exploit soft constraints
during the variational Estep.
@inproceedings{cohen09b,
author = "S. B. Cohen and N. A. Smith",
title = "Variational Inference for Grammar Induction with Prior Knowledge",
booktitle = "Proceedings of {ACL}",
year = "2009"
}
 Shared Logistic Normal Distributions for Soft Parameter Tying in Unsupervised Grammar Induction, Shay B. Cohen and Noah A. Smith, In NAACL 2009 [pdf] [abstract] [bibtex]
We present a family of priors over probabilistic grammar weights, called the
shared logistic normal distribution. This
family extends the partitioned logistic normal distribution,
enabling factored covariance between the probabilities of different derivation events in the probabilistic grammar, providing a new
way to encode prior knowledge about an unknown grammar.
We describe a variational EM
algorithm for learning a probabilistic grammar based on this family of priors. We
then experiment with unsupervised dependency grammar induction and show
significant improvements using our model for both monolingual
learning and bilingual learning with a nonparallel, multilingual
corpus.
@inproceedings{cohen09a,
author = "S. B. Cohen and N. A. Smith",
title = "Shared Logistic Normal Distributions for Soft Parameter Tying in Unsupervised Grammar Induction",
booktitle = "Proceedings of {NAACL}",
year = "2009"
}
 Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction, Shay B. Cohen, Kevin Gimpel and Noah A. Smith, In NIPS 2008 [pdf] [code] [abstract] [bibtex]
We explore a new Bayesian model for probabilistic grammars,
a family of distributions over discrete structures that includes
hidden Markov models and probabilistic contextfree grammars.
Our model extends the correlated topic
model framework to probabilistic grammars, exploiting the logistic
normal distribution as a prior over the grammar parameters.
We derive a variational EM algorithm for that model,
and then experiment with the task of unsupervised grammar induction
for natural language dependency parsing. We show that our model
achieves superior results over
previous models that use different priors.
@inproceedings{cohen08b,
author = "S. B. Cohen and K. Gimpel and N. A. Smith",
title = "Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction",
booktitle = "Proceedings of {NIPS}",
year = "2009"
}
 Dynamic Programming Algorithms as Products of Weighted Logic Programs, Shay B. Cohen, Robert J. Simmons and Noah A. Smith, In ICLP 2008 (best student paper award) [springer] [journalversion] [abstract] [bibtex]
Weighted logic programming, a generalization of bottomup logic
programming, is a successful framework for specifying dynamic
programming algorithms. In this setting, proofs correspond to the
algorithm's output space, such as a path through a graph or a
grammatical derivation, and are given a weighted score, often
interpreted as a probability, that depends on the score of the base
axioms used in the proof. The desired output is a function over all
possible proofs, such as a sum of scores or an optimal score. We
describe the PRODUCT transformation, which can merge two weighted logic
programs into a new one. The resulting program optimizes a product
of proof scores from the original programs, constituting a scoring
function known in machine learning as a ``product of experts.''
Through the addition of intuitive constraining side conditions, we
show that several important dynamic programming
algorithms can be derived by applying PRODUCT to weighted logic programs corresponding to simpler weighted logic programs.
@inproceedings{cohen08a,
author = "S. B. Cohen and R. J. Simmons and N. A. Smith",
title = "Dynamic Programming Algorithms as Products of Weighted Logic Programs",
booktitle = "Proceedings of {ICLP}",
year = "2008"
}
 Joint Morphological and Syntactic Disambiguation, Shay B. Cohen and Noah A. Smith, In EMNLP 2007 [pdf] [abstract] [bibtex]
In morphologically rich languages, should morphological and
syntactic disambiguation be treated sequentially or as a single problem? We describe several efficient,
probabilisticallyinterpretable ways to apply joint inference to morphological and syntactic disambiguation using
lattice parsing. Joint inference is shown to compare
favorably to pipeline parsing methods across a variety of
component models. Stateoftheart performance on Hebrew Treebank
parsing is demonstrated using the new method. The benefits of joint inference
are modest with the current component models, but appear to increase as components themselves
improve.
@inproceedings{cohen07b,
author = "S. B. Cohen and N. A. Smith",
title = "Joint Morphological and Syntactic Disambiguation",
booktitle = "Proceedings of {EMNLP}",
year = "2007"
}
 Feature Selection Via Coalitional Game Theory, Shay B. Cohen, Gideon Dror and Eytan Ruppin, In Neural Computation 19:7, 2007 [pdf] [bibtex]
@article{cohen07a,
author = "S. B. Cohen and G. Dror and E. Ruppin",
title = "Feature Selection via Coalitional Game Theory",
journal = "Neural Computation",
year = "2007"
}
 Feature Selection Based on the Shapley Value, Shay B. Cohen, Gideon Dror and Eytan Ruppin, In IJCAI 2005 [pdf] [bibtex]
@inproceedings{cohen05,
author = "S. B. Cohen and G. Dror and E. Ruppin",
title = "Feature Selection Based on the {Shapley} Value",
booktitle = "Proceedings of {IJCAI}",
year = "2005"
}
