Not really cutting it like that...
If you display a text for exactly as long as its speech file lasts, then immediately trigger another display text with its own speech file, the second one plays straight after the first. So the general idea is to cut at the end of a word, preferably just before the next spoken word in your sentence starts.
I know it sounds like a strange thing to do, but it's all about where you snip up the speech files. It's really easy to pick out what is speech, background noise & silence by looking at the waveforms, especially if you zoom in on them.
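
If you'd rather not eyeball every gap, here's a rough sketch of the same idea in Python: scan the samples for quiet stretches and take the end of each one as a candidate cut point, i.e. just before the next word starts. This is only an illustration, not a finished tool. The function names, the `threshold` of 500 and the 50 ms minimum gap are all my own assumptions, and it only handles 16-bit mono PCM WAV files, so you'd need to tune it for your own recordings (background noise will want a higher threshold).

```python
import array
import sys
import wave


def read_mono_16bit(path):
    """Return (samples, sample_rate) for a 16-bit mono PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return samples, rate


def find_gaps(samples, rate, threshold=500, min_gap_s=0.05):
    """Return (start_s, end_s) for each quiet stretch lasting at least min_gap_s.

    threshold and min_gap_s are guessed defaults, not measured values.
    """
    min_len = int(min_gap_s * rate)
    gaps = []
    run_start = None  # index where the current quiet run began
    for i, sample in enumerate(samples):
        if abs(sample) <= threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_len:
                gaps.append((run_start / rate, i / rate))
            run_start = None
    # A quiet run that lasts until the end of the file also counts.
    if run_start is not None and len(samples) - run_start >= min_len:
        gaps.append((run_start / rate, len(samples) / rate))
    return gaps


def cut_points(gaps):
    """Cut at the end of each gap, just before the next word starts."""
    return [end for _, end in gaps]
```

You'd call it something like `samples, rate = read_mono_16bit("speech.wav")` and then `cut_points(find_gaps(samples, rate))` to get a list of times in seconds ("speech.wav" is just a placeholder name). I'd still check each point against the waveform before snipping, since a short pause in the middle of a word can look the same as a gap between words.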