Converting Speech To Text Using AI

How many times have you wished for a device that could record what you are saying and write it down for you? 

Think of a scenario like this. You are on the phone with your boss, who assigns you a lot of tasks. How do you remember all of them if you don’t note them down somewhere? 

You might use a pen or your laptop to take down the minutes of the meeting. But what if there was a model that could simply do this job for you? 

Well, don’t worry: in the age of technology and the internet, this can definitely be done, and it has been going on for quite some time now. 

Since the beginning, we have been using text to feed instructions to our computers. With advances in NLP (natural language processing) and ML (machine learning), we now have tools that let us use speech as a medium to feed instructions to our computers! 

Tools like Siri, Google Assistant, and Alexa are milestones in delivering this speech-to-text feature. They have given us a more personal and more convenient way of talking to the digital world. 

Computers are machines, and they understand machine language, as in, programs. We use Python, which is one of the most widely used programming languages, to meet our purpose.



Before we look at speech to text in Python, we should know how much progress has been made in this field. 

  • Audrey, 1952: The first speech recognition system, built by three Bell Labs researchers. It could only recognize digits. 
  • IBM Shoebox, 1962: A system that recognized 16 words, including digits. It could solve simple dictated arithmetic problems and print the result. 
  • DARPA, 1970s: The Defense Advanced Research Projects Agency funded speech understanding research that ultimately led to the development of Harpy, which could recognize 1011 words. 
  • Hidden Markov Model (HMM), 1980s: A statistical model that helps solve problems involving sequential information. Its application led to advances in speech recognition. 
  • Voice Search, Google: In 2001, Google put forth the Voice Search feature to search using speech. It became prevalent afterwards. 
  • Siri, 2011: Apple introduced this assistant, offering a real-time and convenient way to interact with devices. 
  • Alexa, 2014, and Google Home, 2016: Voice-based virtual assistants such as Alexa and Google Home then became mainstream, together selling over 150 million units. 

Challenges In Speech To Text 

Despite the remarkable success in this field, some challenges remain. A few of them are: 

  • Imprecise interpretation: Speech recognition does not always interpret spoken words correctly. Voice user interfaces (VUIs) are not as good as humans at understanding the context that changes the relationship between sentences and phrases. 
  • Time: Sometimes the system may take too long to process speech, because human voice patterns are so diverse. Speaking more slowly or pronouncing words and sentences more clearly can help. 
  • Accents: VUIs may have difficulty understanding the accents and dialects that differ all over the world. 
  • Background noise: This would not be a problem in an ideal world, but the world isn’t one. VUIs may struggle in loud environments. 
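
To put a number on how imprecise a transcript is, one common measure is the word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the recognized text into what was actually said, divided by the number of words actually said. The sketch below is a minimal illustration in plain Python. The function name word_error_rate and the sample sentences are made up for this example and are not part of any library.

```python
def word_error_rate(reference, hypothesis):
    """Return the word error rate of `hypothesis` measured against `reference`."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Word-level edit distance, computed one row at a time.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref_words)


# One wrong word out of four spoken words.
print(word_error_rate("send the report today", "send a report today"))  # 0.25
```

A WER of 0 means the transcript matches what was said word for word. The value can exceed 1 when the recognizer inserts many extra words.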

Speech To Text: Guidelines 

Before you proceed, please make sure you have a recent version of Python and a working microphone. 

Step 1 

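Install the following packages: 

  • PyAudio: pip install PyAudio. It gives Python access to the microphone. 
  • PortAudio: the audio library that PyAudio is built on. It is not a pip package; where it is not bundled with PyAudio, install it with your system package manager (for example, brew install portaudio on macOS). 
  • SpeechRecognition: pip install SpeechRecognition (imported as speech_recognition). This is the main package that will run the most critical step of the speech-to-text conversion. There are alternatives, each with its own pros and cons. 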
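
Before moving on, it can help to confirm that the packages are actually importable. The following is a small sketch using only the standard library; the helper name missing_modules is made up for this example. It assumes the usual import names, which are speech_recognition for SpeechRecognition and pyaudio for PyAudio.

```python
from importlib.util import find_spec

# Import names of the Python packages from Step 1.
REQUIRED_MODULES = ["speech_recognition", "pyaudio"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```

If a module is reported as missing, install it with the matching pip command from the list above and run the check again.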

Step 2 

Create a project, give it a name, and import speech_recognition as sr. 

Create an instance of the Recognizer class. 
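
As a rough sketch, Step 2 could look like this. The wrapper function make_recognizer is made up for this example; sr.Recognizer is the class provided by the SpeechRecognition package. A single recognizer object is normally enough for the whole program.

```python
def make_recognizer():
    """Create the Recognizer object that the later steps will use."""
    # Imported inside the function so that this file still loads when the
    # SpeechRecognition package has not been installed yet.
    import speech_recognition as sr

    return sr.Recognizer()


# Example usage (requires `pip install SpeechRecognition`):
#   recognizer = make_recognizer()
```

In a short script you would more likely write import speech_recognition as sr at the top of the file and call sr.Recognizer() directly.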

Step 3 

After you have created the instance, it is time to define the source of the input. For now, let us use the microphone itself as the source. You can use an existing audio file too. 
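
One possible way to express the choice of source is shown below. The function open_source and its audio_path parameter are made up for this example; sr.Microphone and sr.AudioFile are the source classes provided by the SpeechRecognition package. This sketch assumes that any audio file you pass is in a format the package can read, such as WAV.

```python
def open_source(audio_path=None):
    """Return a microphone source, or a file source when a path is given."""
    # Imported inside the function so that this file still loads when the
    # SpeechRecognition package has not been installed yet.
    import speech_recognition as sr

    if audio_path is None:
        # Needs PyAudio and a working microphone.
        return sr.Microphone()
    # Reads from an existing recording instead of the microphone.
    return sr.AudioFile(audio_path)


# Example usage:
#   mic_source = open_source()
#   file_source = open_source("meeting.wav")  # hypothetical file name
```

Both kinds of source are meant to be opened in a with statement before audio is read from them.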

Step 4 

We now need a variable to store the input, so we use the ‘listen’ method to capture audio from the source. Here, we are going to use the mic as the source, as established in the previous step. 
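
Step 4 might be sketched as follows. The function name capture_audio and the half-second calibration time are choices made for this example; adjust_for_ambient_noise and listen are methods of the Recognizer class. The calibration call is optional, but it can help when there is background noise.

```python
def capture_audio(recognizer, source, calibration_seconds=0.5):
    """Record one spoken phrase from `source` and return the captured audio."""
    with source as active_source:
        # Sample the room briefly so that steady background noise is
        # less likely to be mistaken for speech.
        recognizer.adjust_for_ambient_noise(active_source, duration=calibration_seconds)
        # Waits for speech, then records until the speaker pauses.
        return recognizer.listen(active_source)


# Example usage (needs SpeechRecognition, PyAudio, and a microphone):
#   import speech_recognition as sr
#   audio = capture_audio(sr.Recognizer(), sr.Microphone())
```

When the source is an audio file rather than the microphone, the Recognizer's record method is the usual choice for reading it.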

Step 5

We now have the input stored in a variable. We just have to use the recognize_google method to convert it into text. 

We can store the result in a variable or simply print it. We don’t have to depend solely on recognize_google; the library also provides other recognizer methods, such as recognize_sphinx for offline recognition. 

Using this library saves you from having to build speech-to-text recognition software from scratch. 
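
A minimal sketch of Step 5 follows. The function name transcribe is made up for this example; recognize_google, UnknownValueError, and RequestError come from the SpeechRecognition package. The sketch assumes an internet connection, because recognize_google sends the audio to an online service.

```python
def transcribe(recognizer, audio, language="en-US"):
    """Return the recognized text, or None when nothing could be transcribed."""
    # Imported inside the function so that this file still loads when the
    # SpeechRecognition package has not been installed yet.
    import speech_recognition as sr

    try:
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        # The service could not make out any words in the audio.
        print("Sorry, the speech could not be understood.")
    except sr.RequestError as error:
        # The service could not be reached or rejected the request.
        print(f"Could not get a result from the service: {error}")
    return None


# Example usage, with `audio` captured in Step 4:
#   import speech_recognition as sr
#   text = transcribe(sr.Recognizer(), audio)
#   if text is not None:
#       print(text)
```

Handling the two exceptions separately lets the program tell the user whether to repeat the sentence or to check the connection.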

Some Speech To Text Converters 

The best tools to help you convert speech to text are listed here. Let us start with the ones that you have to pay for. 

  • Dragon Professional: Edit documents, create spreadsheets, search the web by voice, type by voice, import custom word lists, and transcribe files on your mobile device and transfer them to your computer. 
  • Verbit: This one gives you high accuracy because it uses human transcribers. You can transcribe regardless of accents, get results in real time, use it with Zoom, eliminate background noise, and integrate contextual information such as news into a recording. You can monitor your job’s progress and status at any time, access your reports, and update, edit, and share files. You also get access to a customer success manager, high security, and more. 
  • Speechmatics: High accuracy, keyword triggers, automatic speech recognition, speaker identification, adjustable timestamps, searchable and editable transcripts, highlighting and commenting, a custom dictionary, support for multiple languages, and transcription lead times measured in minutes. 
  • Just Press Record: Easy to set up and use, with no need to create an account. It offers unlimited recording time, transcription of speech into searchable text, file sharing with other iOS apps, viewing and organizing of recordings, in-app editing, and punctuation recognition. It is suitable for team collaboration, including multilingual teams. 

Let us now discuss some free speech-to-text converters. 

  • Gboard (Google Keyboard): Voice typing, audio capture and translation with Google Translate, voice-activated web search, sharing of graphics including GIFs and emojis, and predictive typing. It doesn’t feature ads, so it stays interruption-free while you’re working. 
  • Windows 10 Speech Recognition (WSR): Control the desktop user interface by voice, dictate text into documents, forms, and emails, and use a custom dictionary with custom language models. 
  • Transcribe: Record voice and transcribe it simultaneously, transcribe video and voice automatically, add captions to videos, import files from Dropbox, and export transcribed text into various file formats. 


This technology is very quickly going to be ubiquitous. Using it with Python is relatively straightforward and easy to understand, which is why it has such wide applications. We are paving a path to a world where we can access the digital world with just our fingertips and just a word.