This paper introduces the speech synthesis systems developed by USTC and iFlytek for Blizzard Challenge 2007. Both systems are HMM-based and employ similar training algorithms, in which context-dependent HMMs for spectrum, F0 and duration are estimated from the acoustic features and contextual information of the training database. However, the two systems adopt different synthesis methods. In the USTC system, speech parameters are generated directly from the statistical models and a parametric synthesizer is used to reconstruct the speech waveform. The iFlytek system is a waveform concatenation system, which uses the maximum likelihood criterion of the statistical models to guide the selection of phone-sized candidate units. Comparing the evaluation results of the two systems in Blizzard Challenge 2007, we find that the parametric synthesis system achieves better intelligibility than the unit selection method. On the other hand, the speech synthesized by the unit selection system is more similar to the original speech and more natural, especially when the full training set is used.
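As an illustrative sketch, the maximum-likelihood parameter generation that underlies HMM-based synthesis of this kind can be written as follows; the notation follows the standard formulation in the literature and is an assumption for clarity, not necessarily the exact equations used in this paper. Given trained models $\lambda$ and a state sequence $\mathbf{q}$ obtained from the duration models, the observation sequence $\mathbf{o}$ (static plus dynamic features) is generated by

\[
\mathbf{o}^{*} = \arg\max_{\mathbf{o}} P(\mathbf{o} \mid \lambda, \mathbf{q}) \quad \text{subject to} \quad \mathbf{o} = \mathbf{W}\mathbf{c},
\]

where $\mathbf{c}$ is the static feature sequence and $\mathbf{W}$ is the matrix that appends delta and delta-delta features. Setting the derivative with respect to $\mathbf{c}$ to zero yields the closed-form solution

\[
\mathbf{W}^{\top}\mathbf{U}^{-1}\mathbf{W}\,\mathbf{c} = \mathbf{W}^{\top}\mathbf{U}^{-1}\boldsymbol{\mu},
\]

with $\boldsymbol{\mu}$ and $\mathbf{U}$ the mean vectors and covariance matrices of the states along $\mathbf{q}$, concatenated over the utterance. In a unit-selection variant such as the one described for the iFlytek system, the same model likelihood $P(\mathbf{o} \mid \lambda, \mathbf{q})$, evaluated on the acoustic features of phone-sized candidate units, can serve as the selection criterion.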
Bibliographic reference. Ling, Zhen-Hua / Qin, Long / Lu, Heng / Gao, Yu / Dai, Li-Rong / Wang, Ren-Hua / Jiang, Yuan / Zhao, Zhi-Wei / Yang, Jin-Hui / Chen, Jie / Hu, Guo-Ping (2007): "The USTC and iFlytek speech synthesis systems for Blizzard Challenge 2007", In BLZ3-2007, paper 017.