Public datasets for 1D Signal Analysis
Public Datasets
Text Databases
- SCRIBE - Spoken Corpus of British English: a pilot corpus of spoken British English produced in a collaboration between UCL, Cambridge University and Edinburgh University. Samples of phonetically annotated material are now available for download.
- Brown Corpus - The Brown University Standard Corpus of Present-Day American English was compiled by Henry Kucera and W. Nelson Francis at Brown University, Providence, RI, in 1967 as a general corpus for corpus linguistics. It contains 1,000,000 words. The link points to the user manual.
- British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English.
- American National Corpus (ANC) is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward.
- Oxford English Corpus (OEC) is a text corpus of the English language used by the makers of the Oxford English Dictionary and by Oxford University Press's language research programme. It is the largest corpus of its kind, containing over two billion words.
- Penn Treebank annotates naturally occurring text for linguistic structure. Most notably, the project produces skeletal parses showing rough syntactic and semantic information: a bank of linguistic trees.
Speech Databases
- UCL Speaker Database - High-quality recordings of a range of speech materials (from words to spontaneous speech) for a set of 45 speakers (women, men and children) from a single regional accent group.
- EUROM1 - Multilingual Speech Corpus: a spoken language resource for the EU with comparable speech recordings available in 7 different European languages.
- UCL Psychology Speech Group Dysfluency Database - Recordings of 61 speakers who have been studied by the UCL Psychology department speech group.
- Speech Synthesis Databases - Speech databases from the speech research group at Carnegie Mellon University.
- HCRC Map Task Corpus is a set of 128 dialogues that has been recorded, transcribed, and annotated for a wide range of behaviours, and has been released for research purposes.
Biology Databases
- TRANSFAC contains data on transcription factors, their experimentally proven binding sites, and regulated genes.
- Pfam 22.0 - The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
© 2008 Robert Fisher