[1]  Paul Anderson and Le Zhang. Fast and Secure Laptop Backups with Encrypted Deduplication. In Proceedings of the Large Installation System Administration (LISA) Conference, Berkeley, CA, November 2010. USENIX Association. [ bib  .pdf ] 
[2]  Philip N. Garner, John Dines, Thomas Hain, Asmaa El Hannani, Martin Karafiát, Danil Korchagin, Mike Lincoln, Vincent Wan, and Le Zhang. Real-Time ASR from Meetings. In Proceedings of Interspeech, Brighton, UK, September 2009. [ bib  .pdf ] 
[3] 
Le Zhang.
Modelling Speech Dynamics with Trajectory-HMMs.
PhD thesis, School of Informatics, University of Edinburgh, January
2009.
[ bib 
.pdf ]
The conditional independence assumption imposed by hidden Markov models (HMMs) makes it difficult to model temporal correlation patterns in human speech. Traditionally, this limitation is circumvented by appending the first and second-order regression coefficients to the observation feature vectors. Although this leads to improved performance in recognition tasks, we argue that a straightforward use of dynamic features in HMMs will result in an inferior model, due to the incorrect handling of dynamic constraints. In this thesis I will show that an HMM can be transformed into a Trajectory-HMM capable of generating smoothed output mean trajectories, by performing a per-utterance normalisation. The resulting model can be trained by either maximising model log-likelihood or minimising mean generation errors on the training data. To combat the exponential growth of paths in searching, the idea of delayed path merging is proposed and a new time-synchronous decoding algorithm built on the concept of token-passing is designed for use in the recognition task. The Trajectory-HMM brings a new way of sharing knowledge between speech recognition and synthesis components, by tackling both problems in a coherent statistical framework. I evaluated the Trajectory-HMM on two different speech tasks using the speaker-dependent MOCHA-TIMIT database. The first task used the Trajectory-HMM as a generative model to recover articulatory features from the speech signal, in a way complementary to conventional HMM modelling techniques, within a joint acoustic-articulatory framework. Experiments indicate that the jointly trained acoustic-articulatory models are more accurate (having a lower Root Mean Square error) than the separately trained ones, and that Trajectory-HMM training results in greater accuracy compared with conventional Baum-Welch parameter updating. In addition, the Root Mean Square (RMS) training objective proves to be consistently better than the Maximum Likelihood objective. 
However, experiments on the phone recognition task show that the MLE-trained Trajectory-HMM, while retaining the attractive properties of a proper generative model, tends to favour over-smoothed trajectories among competing hypotheses, and does not perform better than a conventional HMM. We use this to build an argument that models giving a better fit on training data may suffer a reduction of discrimination by being too faithful to the training data. Finally, experiments using triphone models show that increasing modelling detail is an effective way to improve modelling performance with little added complexity in training.
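The dynamic (delta) features mentioned in the abstract are regression coefficients computed from the static feature stream and appended to it. As a minimal illustration (my own sketch, using a simple ±1 frame window rather than any particular toolkit's regression formula):

```python
import numpy as np

def add_deltas(feats):
    """Append first- and second-order regression (delta) coefficients
    to a (T, D) matrix of static features, using a +/-1 frame window
    with edge frames replicated."""
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")
    delta = (padded[2:] - padded[:-2]) / 2.0          # first-order deltas
    padded_d = np.pad(delta, ((1, 1), (0, 0)), mode="edge")
    delta2 = (padded_d[2:] - padded_d[:-2]) / 2.0     # second-order deltas
    return np.hstack([feats, delta, delta2])

# 5 frames of 2-dimensional static features
obs = np.arange(10, dtype=float).reshape(5, 2)
print(add_deltas(obs).shape)   # (5, 6)
```

Treating these appended columns as if they were independent of the static features is precisely the "incorrect handling of dynamic constraints" that the Trajectory-HMM is designed to remove.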

[4] 
Le Zhang and Steve Renals.
Acoustic-Articulatory Modelling with the Trajectory HMM.
IEEE Signal Processing Letters, 15:245-248, 2008.
[ bib 
.pdf ]
In this letter, we introduce a hidden Markov model (HMM)-based inversion system to recover articulatory movements from speech acoustics. Trajectory HMMs are used as generative models for modelling articulatory data. Experiments on the MOCHA-TIMIT corpus indicate that the jointly trained acoustic-articulatory models are more accurate (lower RMS error) than the separately trained ones, and that trajectory HMM training results in greater accuracy compared with conventional maximum likelihood HMM training. Moreover, the system has the ability to synthesize articulatory movements directly from a textual representation.

[5] 
Le Zhang and Steve Renals.
Phone Recognition Analysis for Trajectory HMM.
In Proc. Interspeech 2006, Pittsburgh, USA, September 2006.
[ bib 
.pdf ]
The trajectory HMM has been shown to be useful for model-based speech synthesis, where a smoothed trajectory is generated using temporal constraints imposed by dynamic features. To evaluate the performance of such a model on an ASR task, we present a trajectory decoder based on tree search with delayed path merging. Experiments on a speaker-dependent phone recognition task using the MOCHA-TIMIT database show that the MLE-trained trajectory model, while retaining the attractive properties of a proper generative model, tends to favour over-smoothed trajectories among competing hypotheses, and does not perform better than a conventional HMM. We use this to build an argument that models giving a better fit on training data may suffer a reduction of discrimination by being too faithful to the training data. This partially explains why alternative acoustic models that try to explicitly model temporal constraints do not achieve significant improvements in ASR.

[6] 
Le Zhang, Jingbo Zhu, and Tianshun Yao.
An Evaluation of Statistical Spam Filtering Techniques.
ACM Transactions on Asian Language Information Processing
(TALIP), 3(4):243-269, December 2004.
[ bib 
zh1 corpus 
.ps.gz 
.pdf ]
This paper evaluates five supervised learning methods in the context of statistical spam filtering. We study the impact of different feature pruning methods and feature set sizes on each learner's performance using cost-sensitive measures. It is observed that the significance of feature selection varies greatly from classifier to classifier. In particular, we found that the Support Vector Machine, AdaBoost and the Maximum Entropy Model are the top performers in this evaluation, sharing similar characteristics: they are not sensitive to the feature selection strategy, scale easily to very high feature dimensions, and perform well across different datasets. In contrast, Naive Bayes, a commonly used classifier in spam filtering, is found to be sensitive to feature selection methods on small feature sets, and fails to function well in scenarios where false positives are penalized heavily. The experiments also suggest that aggressive feature pruning should be avoided when building filters to be used in applications where legitimate mails are assigned a cost much higher than spam (such as λ=999), so as to maintain a better-than-baseline performance. An interesting finding is the effect of mail headers on spam filtering, which is often ignored in previous studies. Experiments show that classifiers using features from the message header alone can achieve comparable or better performance than filters utilizing body features only. This suggests that message headers can be a reliable and powerfully discriminative feature source for spam filtering. 
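The cost-sensitive measure alluded to (λ=999) is, in this literature, usually a λ-weighted accuracy in which each legitimate mail counts λ times as much as a spam; a small sketch under that assumption (the paper's exact measure may differ):

```python
def weighted_accuracy(leg_ok, leg_total, spam_ok, spam_total, lam):
    """Cost-sensitive weighted accuracy: each legitimate mail counts
    lam times as much as a spam, so a false positive is penalised
    lam times as heavily as a missed spam (e.g. lam = 999)."""
    return (lam * leg_ok + spam_ok) / (lam * leg_total + spam_total)

# A filter that misclassifies 1 of 1000 legitimate mails and catches
# 900 of 1000 spams, scored at lam = 999:
print(round(weighted_accuracy(999, 1000, 900, 1000, 999), 4))  # 0.9989
```

At λ=999 a single false positive outweighs hundreds of missed spams, which is why aggressive feature pruning that trades precision for recall hurts here.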

[7] 
Xueqiang LÜ, Le Zhang, and Junfeng Hu.
Statistical Substring Reduction in Linear Time.
In Proceedings of the 1st International Joint Conference on
Natural Language Processing (IJCNLP-04), Sanya, Hainan Island, China, March
2004.
[ bib 
errata 
software 
.ps.gz 
.pdf ]
We study the problem of efficiently removing equal-frequency n-gram substrings from an n-gram set, an operation formally called Statistical Substring Reduction (SSR). SSR is useful in corpus-based multi-word unit research and in the new word identification task of oriental language processing. We present a new SSR algorithm that runs in linear time (O(n)), and prove its equivalence with the traditional O(n²) algorithm. In particular, using experimental results from several corpora of different sizes, we show that it is possible to achieve performance close to that theoretically predicted for this task. Even on a small corpus the new algorithm is several orders of magnitude faster than the O(n²) one. These results show that our algorithm is reliable and efficient, and is therefore an appropriate choice for large-scale corpus processing.
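One way to picture the linear-time idea (my own sketch, assuming the n-gram set is closed under taking substrings; the paper's actual algorithm may differ): rather than comparing all pairs of n-grams, visit each n-gram once and mark its two maximal proper substrings for removal when their frequencies match:

```python
def ssr(freq):
    """Statistical Substring Reduction: drop any n-gram whose frequency
    equals that of a superstring n-gram.  `freq` maps token tuples to
    counts.  One pass over the set gives O(n) overall."""
    doomed = set()
    for gram, f in freq.items():
        if len(gram) < 2:
            continue
        # The two maximal proper substrings: drop first or last token.
        for sub in (gram[1:], gram[:-1]):
            if freq.get(sub) == f:
                doomed.add(sub)
    return {g: f for g, f in freq.items() if g not in doomed}

counts = {("hong", "kong"): 10, ("hong",): 10, ("kong",): 12}
print(ssr(counts))
# ("hong",) is removed: it never occurs outside "hong kong"
```

Since an n-gram's frequency can never be lower than any of its superstrings', checking only the one-token-longer superstrings is enough when all intermediate n-grams are present in the set.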

[8] 
Le Zhang and Tianshun Yao.
Filtering Junk Mail with a Maximum Entropy Model.
In Proceedings of the 20th International Conference on Computer
Processing of Oriental Languages (ICCPOL-03), pages 446-453, 2003.
[ bib 
software 
slide 
.ps.gz 
.pdf ]
The task of junk mail filtering is to rule out unsolicited bulk email (junk) automatically from a user's mail stream. Two classes of methods have been shown to be useful for classifying email messages. The rule-based method uses a set of heuristic rules to classify email messages, while the statistics-based approach models the differences between messages statistically, usually within a machine learning framework. Generally speaking, the statistics-based methods are found to outperform the rule-based method; yet we found that, by combining the different kinds of evidence used in the two approaches into a single statistical model, further improvement can be obtained. We present such a hybrid approach, utilizing a Maximum Entropy Model, and show how to use it in a junk mail filtering task. In particular, we present an extensive experimental comparison of our approach with a Naive Bayes classifier, a widely used classifier in the email filtering task, and show that this approach performs comparably to or better than the Naive Bayes method.
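A toy sketch of the hybrid idea (not the paper's actual model or feature set; all feature names below are invented): because a binary maximum entropy model is equivalent to logistic regression over arbitrary binary features, heuristic-rule outcomes and word occurrences can share one feature space:

```python
import math

def train_maxent(data, epochs=200, lr=0.5):
    """Binary maximum entropy (logistic regression) model over sparse
    binary features, trained by plain gradient ascent.  Each example is
    (set_of_feature_names, label), with label 1 = junk."""
    w = {}
    for _ in range(epochs):
        for feats, y in data:
            z = sum(w.get(f, 0.0) for f in feats)
            p = 1.0 / (1.0 + math.exp(-z))      # P(junk | feats)
            for f in feats:                      # gradient of log-likelihood
                w[f] = w.get(f, 0.0) + lr * (y - p)
    return w

def classify(w, feats):
    z = sum(w.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-z)) > 0.5

# Word features and heuristic-rule features live in one feature space.
train = [({"word:viagra", "rule:all_caps_subject"}, 1),
         ({"word:meeting", "word:agenda"}, 0),
         ({"word:viagra", "word:free"}, 1),
         ({"word:lunch", "word:agenda"}, 0)]
w = train_maxent(train)
print(classify(w, {"word:viagra", "rule:all_caps_subject"}))  # True
```

The point of the hybrid: a rule firing is just another binary feature, so its weight is learned jointly with the word evidence instead of being hand-tuned.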

[9] 
Le Zhang, Xueqiang LÜ, Yanna Shen, and Tianshun Yao.
A Statistical Approach to Extract Chinese Chunk Candidates from
Large Corpora.
In Proceedings of the 20th International Conference on Computer
Processing of Oriental Languages (ICCPOL-03), pages 109-117, 2003.
[ bib 
software 
slide 
.ps.gz 
.pdf ]
The extraction of chunk candidates from real corpora is one of the fundamental tasks in building an example-based machine translation model. This paper presents a statistical approach to extracting Chinese chunk candidates from large monolingual corpora. The first step is to extract large N-grams (up to 20-grams) from the raw corpus. Then two newly proposed Fast Statistical Substring Reduction (FSSR) algorithms are applied to the initial N-gram set to remove unnecessary N-grams using their frequency information. The two algorithms are efficient (both have a time complexity of O(n)) and can reduce the size of the N-gram set by up to 50%. Finally, mutual information is used to obtain chunk candidates from the reduced N-gram set. Perhaps the biggest contribution of this paper is that it applies the Fast Statistical Substring Reduction algorithm to large corpora for the first time and demonstrates the effectiveness and efficiency of this algorithm, which, we hope, will shed new light on large-scale corpus-oriented research. Experiments on three corpora of different sizes show that this method can extract chunk candidates from corpora of gigabytes efficiently under current computational power. We obtain an extraction accuracy of 86.3% from the People's Daily 2000 news corpus.
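The mutual information score used in the final step can be illustrated with pointwise mutual information over a bigram (a sketch with invented counts; the paper may use a different variant of the measure):

```python
import math

def pmi(bigram_count, left_count, right_count, total):
    """Pointwise mutual information of a bigram: log2 of how much more
    often the pair co-occurs than independence of its parts would
    predict.  High PMI suggests a cohesive multi-word unit."""
    p_xy = bigram_count / total
    p_x = left_count / total
    p_y = right_count / total
    return math.log2(p_xy / (p_x * p_y))

# Toy counts from an imagined corpus of 100,000 tokens: the pair
# co-occurs in most of the occurrences of either word.
score = pmi(bigram_count=50, left_count=80, right_count=60, total=100_000)
print(round(score, 2))
```

A frequency threshold alone cannot separate chunks from accidental collocations of common words; PMI normalises away the unigram frequencies, which is why it is applied after the frequency-based substring reduction.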

This file has been generated by bibtex2html 1.82.