This study introduces a method for synchronizing the lip movements of virtual avatars with live speech input. To do so, we use a deep neural network that takes snippets of an FFT spectrum as input and yields values of seven facial blendshape parameters as output. Blendshapes allow the manipulation of 3D meshes, in this case the faces of virtual avatars. We first train a neural network to extract blendshapes from images, which allows us to translate videos, in the form of annotated image-audio databases, into blendshape-audio databases. This enables us to exploit the large quantity of readily available data containing images and audio, i.e. videos, to train our deep neural network for lip synchronization. Since the algorithm is intended to run under real-time conditions, we restrict the maximum delay to 200 ms. This time span is used both to exploit future temporal context for improved inference and to compensate for delays in communication (network data transmission). First results of a comparison between an existing real-time lip-sync algorithm and videos simulated with the proposed algorithm will be presented. The developed algorithm will be used in future studies on the influence of visual cues on auditory attention decoding and cortical tracking of speech.
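The core inference step described above, mapping an FFT spectrum snippet to seven blendshape weights, can be sketched as follows. This is a minimal illustration under assumed settings: the snippet length, layer sizes, and the (untrained, random) network weights are hypothetical and do not correspond to the trained model from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FFT = 512              # samples per audio snippet (assumed value)
N_BINS = N_FFT // 2 + 1  # bins of the one-sided magnitude spectrum
N_BLENDSHAPES = 7        # facial blendshape parameters, as in the study

# Toy two-layer perceptron with random, untrained weights (illustrative only).
W1 = rng.standard_normal((N_BINS, 64)) * 0.01
b1 = np.zeros(64)
W2 = rng.standard_normal((64, N_BLENDSHAPES)) * 0.01
b2 = np.zeros(N_BLENDSHAPES)

def spectrum(snippet):
    """One-sided FFT magnitude spectrum of an audio snippet."""
    return np.abs(np.fft.rfft(snippet, n=N_FFT))

def blendshapes(snippet):
    """Map a spectrum snippet to 7 blendshape weights in [0, 1]."""
    h = np.tanh(spectrum(snippet) @ W1 + b1)
    # Sigmoid keeps each blendshape weight in the valid [0, 1] range.
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

# Example: one snippet of a synthetic 220 Hz tone sampled at 16 kHz.
t = np.arange(N_FFT) / 16000.0
weights = blendshapes(np.sin(2 * np.pi * 220 * t))
print(weights.shape)  # (7,)
```

In a real-time setting, such a network would be fed a sliding window of spectrum snippets that may include up to 200 ms of future context before each output frame is due.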