SoFunction
Updated on 2024-10-28

Creating a speech recognition control system based on Python

For reference, see the related article on Python voice recognition by calling the Baidu API, which opens a website in the default browser based on the recognized text.

The topic is very simple: use speech recognition to recognize spoken words, then move a graphic according to the recognized text. For example, say "up", and once the text is recognized, the graphic on the canvas moves up. This article uses the Baidu speech recognition API (because it is free). A hand-drawn flowchart of the process:

Without further ado, let's start the program design. First, log in to Baidu Cloud and create an application.

Note that the API Key and Secret Key here must be replaced with your own for the code to work.

Baidu speech recognition has corresponding documentation that explains the call methods clearly; if you want to learn more, see the REST API documentation.

The documentation is quite detailed, so this article only explains the methods actually used: assemble a URL to obtain a token, then package the local audio into a JSON request, send it to the Baidu speech recognition server, and read back the result.

Baidu speech recognition supports pcm, wav and other formats. The Baidu server converts non-pcm formats to pcm, so using wav or amr incurs extra conversion time. Audio saved as raw pcm can be recognized, but the player that ships with Windows cannot play pcm, so wav is used instead, and the wave library is imported to read and write wav audio files. The sample rate is fixed at pcm's 16000 Hz, encoded as 16-bit mono.
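These format constraints can be checked with the wave library itself. A minimal sketch, using 'speech.wav' as a placeholder filename: it writes one second of silence with the parameters Baidu expects, then reads the header back to verify.

```python
import wave

# Parameters Baidu's API expects: 16000 Hz sample rate, 16-bit (2-byte)
# samples, mono. 'speech.wav' is a placeholder filename for illustration.
framerate, sampwidth, channels = 16000, 2, 1

with wave.open('speech.wav', 'wb') as wf:
    wf.setnchannels(channels)
    wf.setsampwidth(sampwidth)
    wf.setframerate(framerate)
    wf.writeframes(b'\x00\x00' * framerate)  # one second of silent 16-bit samples

with wave.open('speech.wav', 'rb') as wf:
    print(wf.getframerate(), wf.getsampwidth(), wf.getnchannels())  # 16000 2 1
```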

The recording function uses the PyAudio library, a Python audio-processing module that streams audio to the computer's sound card. A new audio file is opened in the current folder to store the recording data. Local recording:

def save_wave_file(filepath, data):
    wf = wave.open(filepath, 'wb')
    wf.setnchannels(channels)
    wf.setsampwidth(sampwidth)
    wf.setframerate(framerate)
    wf.writeframes(b''.join(data))
    wf.close()
 
 
# Recording
def my_record():
    pa = PyAudio()
    # Open a new audio stream
    stream = pa.open(format=paInt16, channels=channels,
                     rate=framerate, input=True, frames_per_buffer=num_samples)
    my_buf = []  # Store recording data
    t = time.time()
    print('Recording in progress...')
    while time.time() < t + 5:  # Set recording time (seconds)
        # Read in a loop, num_samples (2000) frames at a time
        string_audio_data = stream.read(num_samples)
        my_buf.append(string_audio_data)
    print('End of recording...')
    save_wave_file(FILEPATH, my_buf)
    stream.close()

Then obtain the token: assemble the URL from the API Key and Secret Key obtained when creating the application (use your own here). In the speech recognition function, the obtained token and the recorded audio data are written into JSON parameters in the required format for uploading the audio.

Baidu Speech requires the local audio's binary data to be base64-encoded, which is done with the base64 library. The recognition request is submitted via POST to the short speech recognition address provided by Baidu Speech, written into the recognition function. The result is returned immediately, encapsulated in JSON; the recognized text is in the "result" field, encoded in UTF-8.

# assemble url to get token
base_url = "https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id=%s&client_secret=%s"  # Baidu OAuth token endpoint
APIKey = "*****************"
SecretKey = "********************"
HOST = base_url % (APIKey, SecretKey)
 
 
def getToken(host):
    res = requests.get(host)
    r = res.json()['access_token']
    return r
 
 
# Incoming voice binary data, token
# dev_pid for Baidu speech recognition provides several language options, the default 1537 for punctuated Mandarin
def speech2text(speech_data, token, dev_pid=1537):
    FORMAT = 'wav'
    RATE = '16000'
    CHANNEL = 1
    CUID = '*******'
    SPEECH = base64.b64encode(speech_data).decode('utf-8')
    data = {
        'format': FORMAT,
        'rate': RATE,
        'channel': CHANNEL,
        'cuid': CUID,
        'len': len(speech_data),
        'speech': SPEECH,
        'token': token,
        'dev_pid': dev_pid
    }
    url = 'http://vop.baidu.com/server_api'  # Address for short speech recognition requests
    headers = {'Content-Type': 'application/json'}
    print('Recognizing...')
    r = requests.post(url, json=data, headers=headers)
    Result = r.json()
    if 'result' in Result:
        return Result['result'][0]
    else:
        return Result

Finally we write the function that moves the graphic; first we need to know how to move it on screen. This project uses the tkinter module: Tkinter is Python's interface to Tcl/Tk, a cross-platform scripting GUI toolkit and one of the more popular Python GUI options. Its best feature is portability; its drawback is mediocre performance and slow execution.

We use a tkinter Canvas, create a rectangle on it (the first canvas item, so its ID is 1), and display it. Two Buttons are added to the window; clicking them triggers recording and speech recognition via the callback functions written for them. To keep the code concise, the move function is called from inside the speech recognition callback: the recognition result is returned, checked, and finally the graphic is moved accordingly.

def move(result):
    print(result)
    if "Up." in result:
        canvas.move(1, 0, -30)  # move canvas item 1 [move(2, 0, -5) would move item 2]: x unchanged, y decreased by 30
    elif "Down." in result:
        canvas.move(1, 0, 30)
    elif "To the left." in result:
        canvas.move(1, -30, 0)
    elif "To the right." in result:
        canvas.move(1, 30, 0)
 
 
tk = Tk()
tk.title("Speech Recognition Controls Graphic Movement")
Button(tk, text="Start recording", command=AI.my_record).pack()
Button(tk, text="Begin recognition", command=speech2text).pack()
canvas = Canvas(tk, width=500, height=500)  # Set up the canvas
canvas.pack()  # Show the canvas
r = canvas.create_rectangle(180, 180, 220, 220, fill="red")  # First canvas item, ID 1
mainloop()

Out of personal habit, I wrote the speech recognition and the graphics control in two separate files, which means the control file cannot directly use the return value of the recognition file's function: tkinter blocks in its mainloop(), and a Button callback discards return values, so the result cannot be read from the call. The workaround is to reconstruct a wrapper function of the same name in the control file that calls the recognition file's function and declares a global variable; the wrapper stores the return value in the global, making it available, and the wrapper is what gets bound to the Button callback.
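The pattern above can be sketched minimally with placeholder names (`do_recognition` stands in for the imported recognition call; none of these names are from the article's own files):

```python
# tkinter Button callbacks discard return values, so a wrapper stores the
# result in a module-level global where the rest of the program can read it.
result = None  # global that captures the "return value" of the callback


def do_recognition():
    # Placeholder for the imported recognition function; returns a canned result.
    return "To the right."


def recognize_and_store():
    global result
    result = do_recognition()


recognize_and_store()  # this is what the Button callback would run
print(result)          # prints: To the right.
```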

In fact, the code is more troublesome to write this way; putting everything in one file would be simpler. I drew the call relationship between the two files:

The full demo follows, in two files: the speech recognition module (saved as AI.py) and the control program that imports it.

import wave  # Read and write wav audio files
import requests  # HTTP library based on urllib, Apache2 licensed; used here for the GET/POST requests and headers
import time
import base64  # Baidu Voice requires base64 encoding of native speech binary data
from pyaudio import PyAudio, paInt16  # Audio processing module for delivering an audio stream to a computer sound card
 
framerate = 16000  # Sampling rate
num_samples = 2000  # Sampling points
channels = 1  # Channels
sampwidth = 2  # Sample width: 2 bytes (16-bit)
FILEPATH = 'speech.wav'  # recording saved as the speech file in the current folder
 
# assemble url to get token
base_url = "https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id=%s&client_secret=%s"  # Baidu OAuth token endpoint
APIKey = "8bv3inF5roWBtEXYpZViCs39"
SecretKey = "HLXYiLGCpeOD6ddF1m6BvwcDZVOYtwwD"
HOST = base_url % (APIKey, SecretKey)
 
 
def getToken(host):
    res = requests.get(host)
    r = res.json()['access_token']
    return r
 
 
def save_wave_file(filepath, data):
    wf = wave.open(filepath, 'wb')
    wf.setnchannels(channels)
    wf.setsampwidth(sampwidth)
    wf.setframerate(framerate)
    wf.writeframes(b''.join(data))
    wf.close()
 
 
# Recording
def my_record():
    pa = PyAudio()
    # Open a new audio stream
    stream = pa.open(format=paInt16, channels=channels,
                     rate=framerate, input=True, frames_per_buffer=num_samples)
    my_buf = []  # Store recording data
    t = time.time()
    print('Recording in progress...')
    while time.time() < t + 5:  # Set recording time (seconds)
        # Read in a loop, num_samples (2000) frames at a time
        string_audio_data = stream.read(num_samples)
        my_buf.append(string_audio_data)
    print('End of recording...')
    save_wave_file(FILEPATH, my_buf)
    stream.close()
 
 
def get_audio(file):
    with open(file, 'rb') as f:
        data = f.read()
    return data
 
 
# Incoming voice binary data, token
# dev_pid for Baidu speech recognition provides several language options, the default 1537 for punctuated Mandarin
def speech2text(speech_data, token, dev_pid=1537):
    FORMAT = 'wav'
    RATE = '16000'
    CHANNEL = 1
    CUID = '*******'
    SPEECH = base64.b64encode(speech_data).decode('utf-8')
    data = {
        'format': FORMAT,
        'rate': RATE,
        'channel': CHANNEL,
        'cuid': CUID,
        'len': len(speech_data),
        'speech': SPEECH,
        'token': token,
        'dev_pid': dev_pid
    }
    url = 'http://vop.baidu.com/server_api'  # Address for short speech recognition requests
    headers = {'Content-Type': 'application/json'}
    print('Recognizing...')
    r = requests.post(url, json=data, headers=headers)
    Result = r.json()
    if 'result' in Result:
        return Result['result'][0]
    else:
        return Result

import AI  # the speech recognition module above, saved as AI.py
from tkinter import *  # Import all contents of the tkinter module
 
token = None
speech = None
result = None
 
 
def getToken():
    temptoken = AI.getToken(AI.HOST)
    return temptoken
 
 
def speech2text():
    global token, result
    if token is None:
        token = getToken()
    speech = AI.get_audio(AI.FILEPATH)
    result = AI.speech2text(speech, token, dev_pid=1537)
    print(result)
    move(result)
 
 
def move(result):
    print(result)
    if "Up." in result:
        canvas.move(1, 0, -30)  # move canvas item 1 [move(2, 0, -5) would move item 2]: x unchanged, y decreased by 30
    elif "Down." in result:
        canvas.move(1, 0, 30)
    elif "To the left." in result:
        canvas.move(1, -30, 0)
    elif "To the right." in result:
        canvas.move(1, 30, 0)
 
 
tk = Tk()
tk.title("Speech Recognition Controls Graphic Movement")
Button(tk, text="Start recording", command=AI.my_record).pack()
Button(tk, text="Begin recognition", command=speech2text).pack()
canvas = Canvas(tk, width=500, height=500)  # Set up the canvas
canvas.pack()  # Show the canvas
r = canvas.create_rectangle(180, 180, 220, 220, fill="red")  # First canvas item, ID 1
mainloop()

(Figure: the call relationship between the two files)

The recorded audio is automatically saved in the current folder as the speech file.

To test the result, run the program.

Click to start recording

Click to start recognizing

Then you can see the rectangle move to the right.

In testing, speaking loudly and clearly gives better recognition results.

This concludes the article on creating a speech recognition control system based on Python. For more related content on Python speech recognition control systems, please search my previous articles or continue to browse the related articles below. I hope you will support me in the future!