If you look closely at your output, there are two events:
Recognizing and Recognized
Recognizing :
The event Recognizing signals that an intermediate recognition result is received.
Recognized :
The event Recognized signals that a final recognition result is received.
So the offset that you see is for the complete sentence (the Recognized event, usually everything spoken before the first pause): "My voice is my passport, verify me." All the Recognizing (intermediate) events for that sentence therefore report the same offset. If the audio contained another sentence, you would likely see an additional Recognized event, and its offset would be larger, growing the way you are expecting.
Update :
Additional Note :
The duration starts again from zero for every Recognized event.
Across the intermediate Recognizing events, it grows from zero up to the duration of the complete Recognized event.
So for instance:
Recognizing:my
Offset=6800000
Duration=2700000
Recognizing:my voice is
Offset=6800000
Duration=8500000
So if you want the timing of the words added in "my voice is" (that is, "voice is"), you could take the initial offset plus the duration of the previous event as the begin time (6800000 + 2700000), and the initial offset plus the current duration as the end time (6800000 + 8500000).
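The arithmetic above can be sketched in a few lines of Python. This is only an illustration using the numbers from the output; the function name `new_words_span` is made up for this example and is not part of any SDK.

```python
TICKS_PER_SECOND = 10_000_000  # values are in 100-nanosecond units


def new_words_span(offset, prev_duration, curr_duration):
    """Return (begin, end) in ticks for the words added since the previous
    Recognizing event. Hypothetical helper, not an SDK call."""
    begin = offset + prev_duration  # where the previous partial result ended
    end = offset + curr_duration    # where the current partial result ends
    return begin, end


# "my" (duration 2700000) followed by "my voice is" (duration 8500000)
begin, end = new_words_span(6_800_000, 2_700_000, 8_500_000)
print(begin, end)  # 9500000 15300000
print(begin / TICKS_PER_SECOND, end / TICKS_PER_SECOND)  # same span in seconds
```

Treat the result as an approximation: intermediate results can be revised by later events, so the boundaries are only as good as the partial hypotheses.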
Update 2
RECOGNIZED: Text=My voice is my passport, verify me.
Offset=6800000
Duration=28100000
Offset and duration are in units of 100 nanoseconds (10^-7 seconds).
So let us take your case.
Your offset is 6800000, which is 0.68 seconds.
That means the sentence (the complete Recognized event for that audio stream) started 0.68 seconds into the audio.
The duration, or the complete time taken to utter "My voice is my passport, verify me.", is 28100000, which is 2.81 seconds.
The offset of the 2nd sentence (Recognized event) would be greater than the end of this one, that is, this offset plus this duration (0.68 + 2.81 = 3.49 seconds).
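As a quick check of the conversion, here is a minimal Python sketch. It assumes only that both values are in 100-nanosecond units as described above; the helper names are hypothetical and not taken from any SDK.

```python
TICKS_PER_SECOND = 10_000_000  # 1 tick = 100 ns = 10^-7 s


def ticks_to_seconds(ticks):
    """Convert a 100-nanosecond tick count to seconds."""
    return ticks / TICKS_PER_SECOND


def utterance_span_seconds(offset, duration):
    """Return (start, end) of an utterance in seconds from the start of the audio."""
    return ticks_to_seconds(offset), ticks_to_seconds(offset + duration)


# Values from the RECOGNIZED event above
start, end = utterance_span_seconds(6_800_000, 28_100_000)
print(f"start={start:.2f}s end={end:.2f}s")  # start=0.68s end=3.49s
```

A second Recognized event in the same audio would be expected to report an offset later than that end time.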
The duration can be either less than or greater than the offset:
Starting at the 3rd second of the entire audio, I can utter 4 seconds of speech without a pause.
The offset would then be 3 seconds and the duration would be 4 seconds.