corpus – should contain utt objects, not filenames, to avoid this in e.g. the VSM tagger / DT processors:
- for utt_name in speech_corpus:
-     utterance = Utterance(utt_name)
Document whole module like this:
Add an HTK header (user-specified parameter kind, USER = 9) to some data in place.
If all elements of sequence are of type test_type, return True, else False.
ConfigObj handles reading of string lists without validation, but for 1-item lists a plain string will of course be returned instead of a list. This function does type checking and conversion for that case.
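A minimal sketch of that conversion (the function name here is illustrative, not necessarily the one used in Ossian):

    def check_config_list(value):
        # ConfigObj gives a plain string for 1-item lists; normalise to a list.
        if isinstance(value, basestring):   # Python 2 string types, as used in Ossian
            return [value]
        return list(value)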
Find the final text element of an xpath, which we will assume is the name of an attribute.
TODO: find a better and less error-prone way to do this!
Turn the data into an int if possible, otherwise a float, otherwise a unicode string.
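A minimal sketch of that coercion (function name illustrative), assuming string input as read from a file:

    def fix_data_type(data):
        # Try int first, then float, otherwise fall back to unicode text.
        try:
            return int(data)
        except ValueError:
            pass
        try:
            return float(data)
        except ValueError:
            return unicode(data)   # Python 2, as elsewhere in Ossian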
these (together with lemma_name) will make up header entries.
TODO: assert tab character not inside keys / values
Parse the header of datafile, then return the length of the data in seconds according to the header.
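A hedged sketch of that computation (function name illustrative). The HTK header is 12 big-endian bytes: number of samples (int32), sample period in 100 ns units (int32), bytes per sample (int16), and parameter kind (int16):

    import struct

    def get_htk_filelength(fname):
        with open(fname, 'rb') as f:
            nsamples, period, sampsize, parmkind = struct.unpack('>iihh', f.read(12))
        # sample period is in 100 ns units, so scale by 1e-7 to get seconds
        return nsamples * period * 1e-7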
Convert non-negative integer to base 26 representation using uppercase A-Z as symbols. Can use this instead of numbers in feature delimiters because:
- gives shorter full context model names (esp. with many features)
- trivially, split-context-balanced.py expects delimiters to contain no digits
HTK wildcards allow item sets like {/feature:?/,*/feature:1?/*} to express “feature < 20” in question definitions. For a given integer max value n, return a list of strings with HTK wildcards matching non-negative integers less than n.
E.g.: make_htk_wildcards(236) gives: ['?', '??', '1??', '20?', '21?', '22?', '230', '231', '232', '233', '234', '235']
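A minimal sketch of how such wildcards can be constructed (illustrative; it may differ from the actual Ossian implementation). Here '?' matches exactly one digit:

    def make_htk_wildcards(n):
        patterns = []
        digits = str(n)
        width = len(digits)
        # Any number with fewer digits than n is below n.
        for w in range(1, width):
            patterns.append('?' * w)
        # Same width as n: fix a prefix of n's digits, enumerate the next digit
        # below n's digit, and wildcard the remaining positions.
        for i, d in enumerate(digits):
            start = 1 if (i == 0 and width > 1) else 0   # no leading zeros on multi-digit numbers
            for lower in range(start, int(d)):
                patterns.append(digits[:i] + str(lower) + '?' * (width - i - 1))
        return patterns

    print(make_htk_wildcards(236))
    # ['?', '??', '1??', '20?', '21?', '22?', '230', '231', '232', '233', '234', '235']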
Assumed format: lemma feature1 feature2 feature3 ... per line. Features are numeric.
The default dims_to_keep = 0 means keep all dimensions.
Read HTK header of datafile, return ...
Read an HTK label, assuming the format "start end phone word", where word is optional. Convert times from HTK units to ms.
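A minimal sketch of reading such a label file (function name illustrative). HTK label times are in 100 ns units, so dividing by 10000 gives milliseconds:

    def read_htk_label(fname):
        entries = []
        with open(fname) as f:
            for line in f:
                fields = line.split()
                if len(fields) < 3:
                    continue
                start, end, phone = int(fields[0]), int(fields[1]), fields[2]
                word = fields[3] if len(fields) > 3 else None   # word column is optional
                entries.append((start / 10000.0, end / 10000.0, phone, word))
        return entries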
Read HTK label with state alignment
Return [word, [seg, [s1, start, end], [s2, start, end], ... ]]
Function for reading config files, context files etc. Strip comments (#) and empty lines.
Remove everything in a string after the last dot, and the dot itself
Take a section from a ConfigObj and make it into a new ConfigObj.
Reverse flatten_mapping. Take a dict-like object (e.g. a config section), assumed to be UTF-8 encoded.
work one out. The substitute should be safe to use with applications of interest (e.g. in HTK modelnames), and a perhaps over-cautious subset of ASCII is used for this (uppercase A-Z).
TODO: [make this explanation complete]
To enable reverse mapping, multicharacter safetexts are delimited with _.
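A hedged sketch of the idea: characters that are unsafe in e.g. HTK model names are replaced by uppercase A-Z names delimited with _ so the mapping can be reversed. The mapping table below is illustrative (only _COMMA_ appears elsewhere in these docs; the other entries are made-up examples):

    SAFETEXT = {',': '_COMMA_', '.': '_FULLSTOP_', '?': '_QUESTIONMARK_'}

    def safetext(string):
        # Replace unsafe characters with their delimited uppercase names.
        return ''.join(SAFETEXT.get(ch, ch) for ch in string)

    print(safetext('yes, please'))   # -> 'yes_COMMA_ please'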
Add items from new_list to the end of old_list if those items are not already in old_list; the returned list will have unique entries. Preserve order (which is why we can't do this more quickly with dicts).
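A minimal sketch of that order-preserving merge (function name illustrative):

    def add_unique(old_list, new_list):
        merged = list(old_list)
        for item in new_list:
            if item not in merged:   # linear scan keeps order, hence slower than a dict/set
                merged.append(item)
        return merged

    print(add_unique(['a', 'b'], ['b', 'c', 'a', 'd']))   # ['a', 'b', 'c', 'd']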
Write keys & values in dict_like (e.g. ConfigObj) to a file to be read as a bash config. ConfigObj only writes .ini-style files. Basically, remove spaces around = and add double quotes.
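A minimal sketch of that conversion (function name illustrative): no spaces around =, values double-quoted so bash can source the file:

    def write_bash_config(dict_like, fname):
        with open(fname, 'w') as f:
            for key, value in dict_like.items():
                f.write('%s="%s"\n' % (key, value))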
Take data, in the form of a list of lists like:
[(0, u'response', 'True'), (1, u'token_text', '_COMMA_'), ...]
I.e. feature number, feature name, feature value for each feature on a line. Feature names must be the same on each line. Write a data file for R where the first line is a header with feature names, and each line contains feature values for one data point.
The default for writing UTF-8 is False. This is important because the default should be to write ASCII-compatible files (for compatibility with HTK etc.)
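A hedged sketch of such a writer (function name illustrative), using space-separated values that R's read.table can read:

    def write_r_datafile(data, fname):
        with open(fname, 'w') as f:
            header = [name for (number, name, value) in data[0]]
            f.write(' '.join(header) + '\n')   # first line: feature names
            for point in data:
                assert [name for (number, name, value) in point] == header
                f.write(' '.join(value for (number, name, value) in point) + '\n')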
Document functions like this:
work one out. The substitute should be safe to use with applications of interest (e.g. in HTK modelnames), and a perhaps over-cautious subset of ASCII is used for this (uppercase A-Z).
TODO: [make this explanation complete]
To enable reverse mapping, multicharacter safetexts are delimited with _.
makeelement(self, _tag, attrib=None, nsmap=None, **_extra)
Creates a new element associated with this parser.
- self.data holds the XML structure of the utterance.
Warning
external data? see add_external_data
Warning
UPDATE: If utt_location is not None, the utt filename is assumed to be relative to this, and only the partial path is stored in the utt structure.
Warning
UPDATE: If speech_location is not None, speech_file is assumed to be relative to this, and only the partial path is stored in the utt structure.
Store an archived version of an utterance that will not be overwritten, and also a PDF of a visualisation.
Find values for all attributes (matching the supplied regex) from any node of utterance. Do not unique the values (instances not types).
Return e.g. {"attrib1": ["val1", "val2"], "attrib2": ["val1", "val2", "val3"]}
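A hedged sketch of such a traversal (names illustrative; Ossian's implementation may differ), walking every node of the utterance's XML tree and keeping duplicate values:

    import re
    from collections import defaultdict

    def all_attribute_values(root, pattern='.*'):
        values = defaultdict(list)
        regex = re.compile(pattern)
        for node in root.iter():
            for attrib, value in node.attrib.items():
                if regex.match(attrib):
                    values[attrib].append(value)   # keep duplicates: instances, not types
        return dict(values)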
For each utterance node matching target_nodes, get values for the list of contexts at that node.
[Reroute to self.data] Get attribute key’s value at root node of utterance structure.
Get the default name for a filetype directory from an utterance’s “utterance_filename”. If utterance_filename is <PATH>/utt/name.utt the dirname for type lab will be <PATH>/lab/. Make the directory if it does not exist already.
Get the default name for a filetype from an utterance’s utterance_filename. If utterance_filename is <PATH>/utt/name.utt the filename for type lab will be <PATH>/lab/name.lab.
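A minimal sketch of that path manipulation (function name illustrative):

    import os

    def get_filename(utterance_filename, filetype):
        # <PATH>/utt/name.utt  ->  <PATH>/<filetype>/name.<filetype>
        path, name = os.path.split(utterance_filename)
        base = os.path.splitext(name)[0]
        parent = os.path.dirname(path)
        return os.path.join(parent, filetype, base + '.' + filetype)

    print(get_filename('/data/corpus/utt/utt_001.utt', 'lab'))
    # /data/corpus/lab/utt_001.lab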
Get the absolute path of the utt file where this structure is stored. Absolute paths are not stored directly in the structure so that files are portable.
If utt file is <PATH>/utt/name.utt and resource_name is lab, check if <PATH>/lab/name.lab exists as a file.
Instantiate an utterance from a string representing the utterance's text.
Save utterance structure as XML to file.
Parameters: fname – write here if specified, otherwise use utterance_location if it is set.
[Reroute to self.data] Set attribute key’s value at root node of utterance structure.
Todo
ever really necessary to check single line when init’ing utterance from text?
Use GraphViz to make an image of utterance structure (extension specifies image type).
Specialised Element class for utterances, has safe_xpath method. See here: http://lxml.de/1.3/element_classes.html on using custom Element classes
ElementBase(*children, attrib=None, nsmap=None, **_extra)
Get a dict of features made by concatenating the output of calls to xpath on this node. Return a dict like {feature_name: feature_value, ... }
Get values for list of contexts at an utterance node.
Parameters: context_list – e.g.: [('name1', 'xpath1'), ('name2', 'xpath2'), ... ]
Returns: vector of features made by concatenating the output of calls to xpath on this node, i.e. a list of items like ((name1, value1), (name2, value2), ... )
Provide padding for e.g. end-of-sentence contexts if the xpath doesn't find anything, in order to handle different padding types (e.g. a mean vector for VSM features).
The default padding is _NA_ – this will be used for e.g. end-of-sentence phone contexts.
For count based features, xpath gives 0 in sentence-edge positions, which is fine.
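A hedged sketch of this padding behaviour (function name illustrative; the real method lives on the utterance Element class):

    def get_context_vector(node, context_list, padding='_NA_'):
        vector = []
        for (name, xpath) in context_list:
            result = node.xpath(xpath)
            if isinstance(result, list):
                # node-set queries return lists; pad if nothing was found
                result = result[0] if result else padding
            vector.append((name, result))
        return vector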
Todo
Is there a more proper way to do this?
(The original entry is located in /Users/owatts/repos/ossian_git_clean/Ossian/doc/source/STORAGE/modules/corpus_utterance.rst, line 25.)
Todo
ever really necessary to check single line when init’ing utterance from text?
(The original entry is located in /Users/owatts/repos/ossian_git_clean/Ossian/scripts/main/Utterance.py:docstring of main.Utterance.Utterance.text, line 1.)
Todo
Add the text from the unspoken parts
(The original entry is located in /Users/owatts/repos/ossian_git_clean/Ossian/doc/source/STORAGE/queries.rst, line 20.)
Todo
Add some notes on corpus here – holds collections of utterances and text.
(The original entry is located in /Users/owatts/repos/ossian_git_clean/Ossian/doc/source/initial_voice.rst, line 52.)
Todo
tidy this up a bit
(The original entry is located in /Users/owatts/repos/ossian_git_clean/Ossian/doc/source/initial_voice.rst, line 63.)
Todo
Come back and make this recipe functional
(The original entry is located in /Users/owatts/repos/ossian_git_clean/Ossian/doc/source/refinements.rst, line 159.)
Todo
Add details
(The original entry is located in /Users/owatts/repos/ossian_git_clean/Ossian/doc/source/refinements.rst, line 173.)
Todo
Add information
(The original entry is located in /Users/owatts/repos/ossian_git_clean/Ossian/doc/source/refinements.rst, line 186.)
Todo
Add information
(The original entry is located in /Users/owatts/repos/ossian_git_clean/Ossian/doc/source/refinements.rst, line 201.)
Todo
Add information
(The original entry is located in /Users/owatts/repos/ossian_git_clean/Ossian/doc/source/refinements.rst, line 216.)
Todo
Add information
(The original entry is located in /Users/owatts/repos/ossian_git_clean/Ossian/doc/source/refinements.rst, line 229.)