Speech Recognition and Natural Language Processing (NLP)

K. Behera
6 min read · Sep 4, 2020


Speech Recognition vs. Natural Language Processing

Overview: -

  1. Basics of speech recognition.
  2. Basics of NLP.
  3. How NLP is related to speech recognition.
  4. Some of the techniques in speech recognition.
  5. Some of the challenges.

Speech Recognition: -

Speech recognition is a technology in high demand. Many companies are building speech recognition into their products to stay competitive (for example: Ok Google, Apple Siri, Amazon Alexa).

But how much do we know about speech recognition?

Speech recognition is a technique, or we can say software, that is capable of recognizing the speech a human says. It can work in any language that the solution already supports.

A speech recognition solution takes voice input and does some useful tasks for us.

How it works: -

When a speech recognition system takes a voice input, it first splits the input into a number of tokens, much as we do with text. It then analyzes each token, tries to recognize it, and uses the result as the solution requires.
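One common way to picture this splitting step is to cut the audio into short overlapping frames. Here is a minimal sketch of that idea, assuming NumPy and a synthetic tone in place of real speech. The helper name `frame_signal` and the 25 ms frame / 10 ms hop sizes are my own illustrative choices, not something a particular product is known to use.

```python
import numpy as np

def frame_signal(signal, frame_length, hop_length):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    if len(signal) < frame_length:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    # Row i starts at sample i * hop_length and holds frame_length samples.
    starts = hop_length * np.arange(n_frames)[:, None]
    offsets = np.arange(frame_length)[None, :]
    return signal[starts + offsets]

# One second of a synthetic 440 Hz tone at 16 kHz stands in for speech.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 440 * t)

# 400 samples = 25 ms and 160 samples = 10 ms at 16 kHz (assumed sizes).
frames = frame_signal(audio, frame_length=400, hop_length=160)
print(frames.shape)  # (98, 400)
```

Each row of `frames` is one small piece of audio that later steps can analyze on its own.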

Does that make sense? If not, read on.

What is NLP: -

Natural Language Processing is a technique that works on natural (human) language to make our tasks easier.

It is what enables a computer to understand human language (i.e. voice/audio, text).

NLP is an AI-based technique that allows a computer to communicate with a human.

Let's dive a little deeper into NLP.

How it works: -

The field of NLP is divided into two categories: -

  1. Natural Language Understanding (NLU)
  2. Natural Language Generation (NLG)

So how do these two things work?

Let's start with a short example, because explaining each of them in full would take too long.

Example: Suppose we want to build a solution that takes a voice command as input, does some useful task, and in return gives a voice output.

Let's simplify it.

Here we have three tasks to do:

  1. First, we need to understand what command the user is giving.
  2. As per the command, do some task.
  3. At last, return the output.

First, we need to understand what command the user is giving.

This is where NLU comes in. Our solution analyzes and preprocesses the voice input using NLP techniques, understands what was said, and works out the user's context from it.

As per the command, do some task.

In the next step we do the coding part, where we fulfill the user's request with ordinary code. It may be a search on Google or Wikipedia, or any other kind of task.

At last, return the output.

This last step is where NLG comes in. The output of the task may be text or numbers, but our solution needs to answer in speech, so NLG turns that output into a natural-language response, which is then converted to audio.

This is how NLP works in short.
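The three steps can be sketched in a few lines of Python. This is only a toy illustration under my own assumptions: the speech-to-text and text-to-speech parts are left out and replaced by plain strings, the function names `understand`, `do_task`, and `generate_reply` are hypothetical, and the search result is a hard-coded placeholder rather than a real lookup.

```python
def understand(text):
    """Toy NLU step: map a transcribed command to an intent and a query."""
    words = text.lower().split()
    if words[:2] == ["search", "for"] and len(words) > 2:
        return {"intent": "search", "query": " ".join(words[2:])}
    return {"intent": "unknown", "query": ""}

def do_task(parsed):
    """Ordinary application code; a real system might call a search API here."""
    if parsed["intent"] == "search":
        # Placeholder result, not a real search.
        return {"status": "ok", "query": parsed["query"], "hits": 3}
    return {"status": "failed"}

def generate_reply(result):
    """Toy NLG step: turn the structured result into a sentence."""
    if result["status"] == "ok":
        return f"I found {result['hits']} results for {result['query']}."
    return "Sorry, I did not understand that."

transcript = "Search for speech recognition"  # stands in for recognized speech
reply = generate_reply(do_task(understand(transcript)))
print(reply)  # I found 3 results for speech recognition.
```

In a full solution, `transcript` would come from the speech recognizer and `reply` would be handed to a speech synthesizer.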

How it is related to Speech recognition: -

Now we come to the main part that we were waiting for. By now we know how speech recognition works and how NLP works, so I think you already have an idea of how the two are related.

But still, let me explain.

I've seen a lot of misconceptions about how speech recognition is related to NLP. I have even seen recruiters who doubt whether it comes under NLP or not.

In my view, speech recognition is a part of NLP, since it only takes the voice command and recognizes it, but it is not the whole of NLP.

As we know, NLP is a very big topic with a lot of tasks and techniques, and speech recognition is one such task.

Let me mention some of the NLP tasks here.

In the area of NLP we have: -

  1. Morphology
  2. Grammar & Parsing (syntactic analysis)
  3. Semantics
  4. Pragmatics
  5. Discourse / Dialogue
  6. Spoken Language Understanding

The areas in Speech Recognition: -

  1. Signal Processing
  2. Phonetics
  3. Word Recognition

Some of the techniques in Speech Recognition.

These are some of the models we use when working with speech recognition.

  1. Hidden Markov Model (HMM): It is a statistical model that outputs a sequence of symbols or quantities. HMMs are used in speech recognition because a speech signal can be viewed as a piecewise stationary, or short-time stationary, signal. Speech can be thought of as a Markov model for many stochastic purposes.
  2. Dynamic time warping (DTW)-based speech recognition: This approach was historically used for speech recognition, but it has now largely been displaced by the HMM-based approach. DTW is an algorithm for measuring similarity between two sequences that may vary in time or speed. It allows a computer to find an optimal match between two given sequences (e.g. time series) with certain restrictions.
  3. Neural networks: This is an attractive acoustic modeling approach and is very popular now among data scientists. Neural networks have been used in many aspects of speech recognition, such as phoneme classification and audiovisual speech recognition. They make fewer explicit assumptions about feature statistical properties than HMMs and have several qualities that make them attractive recognition models for speech recognition.
  4. End-to-end automatic speech recognition, and so on.
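To make the DTW idea from the list above concrete, here is a minimal sketch of the classic dynamic-programming version, assuming NumPy and plain 1-D sequences of numbers. Real recognizers compare sequences of feature vectors, and the function name `dtw_distance` and the toy sequences are mine.

```python
import numpy as np

def dtw_distance(a, b):
    """Total cost of the best monotonic alignment between sequences a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = best cost of aligning the first i items of a with the first j of b.
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

slow = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]   # the same shape as `fast`, spoken slowly
fast = [0, 1, 2, 1, 0]
other = [2, 1, 0, 1, 2]                 # a different shape

print(dtw_distance(slow, fast))   # 0.0
print(dtw_distance(slow, other))  # larger than 0
```

The slow and fast versions match perfectly even though their lengths differ, which is exactly the "may vary in time or speed" property described above.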

Now let me go a little deeper into some of the signal processing techniques.

There are two modules in speech recognition: -

  1. Feature Matching
  2. Feature Extraction

Feature Matching: It is a technique to identify an unknown speaker by comparing the features extracted from the voice, for example whether it is a male voice or a female voice.

With this technique, we can better identify a person from a voice command.
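One simple way to picture feature matching is a nearest-neighbour comparison between the unknown speaker's feature vector and stored reference vectors. This is only a sketch under my own assumptions: the function name `match_speaker`, the speaker labels, and the three-number feature vectors are all made up for illustration, and Euclidean distance is just one possible way to compare features.

```python
import numpy as np

def match_speaker(features, references):
    """Return (label, distance) of the reference vector closest to `features`."""
    best_label, best_dist = None, float("inf")
    for label, ref in references.items():
        dist = float(np.linalg.norm(np.asarray(features) - np.asarray(ref)))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist

# Hypothetical reference features per known speaker (values are invented).
references = {
    "speaker_a": [120.0, 0.30, 0.10],
    "speaker_b": [210.0, 0.55, 0.25],
}
unknown = [205.0, 0.50, 0.20]

label, dist = match_speaker(unknown, references)
print(label)  # speaker_b
```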

Feature Extraction: If you are an aspiring data scientist, you may already be aware of this technique from machine learning.

Basically, feature extraction means extracting features/information from the dataset.

Here too, we need to extract features from the audio files.

I will not go deeper into feature extraction now, otherwise this blog will get too long, but I promise that in the next blog I will explain all the feature extraction techniques.

Why do we need it?

We can't feed the raw audio data directly into the model, because the machine can't make sense of it in that form. We first need to do some preprocessing and extract the features. Only then can we feed them to our model and get the required output and predictions.

Here are some feature extraction techniques.

  1. Zero crossing rate
  2. Energy
  3. Entropy of energy
  4. Spectral centroid
  5. Spectral spread
  6. Spectral entropy
  7. Spectral flux
  8. Spectral rolloff
  9. MFCCs
  10. Chroma
  11. Chroma Vector
  12. Chroma Deviation
  13. Perceptual linear prediction (PLP)
  14. Relative spectra filtering of log domain coefficients PLP (RASTA-PLP)
  15. Linear predictive coding (LPC)
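As a small taste of the list above, here is a sketch of three of the simpler features (zero crossing rate, energy, and spectral centroid) computed on a single frame. It assumes NumPy and a synthetic 1000 Hz tone instead of real speech. The function names are mine, and the definitions are the common textbook ones, so a library may scale or normalize them differently.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of neighbouring sample pairs whose signs differ."""
    signs = np.sign(frame)
    return float(np.mean(signs[:-1] != signs[1:]))

def energy(frame):
    """Mean squared amplitude of the frame."""
    return float(np.mean(frame ** 2))

def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency of the frame, in Hz."""
    magnitudes = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    total = magnitudes.sum()
    return float((freqs * magnitudes).sum() / total) if total > 0 else 0.0

# One 25 ms frame of a 1000 Hz tone at 16 kHz (small phase offset avoids exact zeros).
sample_rate = 16000
t = np.arange(400) / sample_rate
frame = np.sin(2 * np.pi * 1000 * t + 0.1)

print(zero_crossing_rate(frame))              # roughly 2 * 1000 / 16000
print(energy(frame))                          # about 0.5 for a unit sine
print(spectral_centroid(frame, sample_rate))  # close to 1000 Hz
```

In practice these values are computed for every frame of the recording, and the resulting table of numbers is what gets fed to the model.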

In the next blog, I will explain all of these one by one, with a focus on feature extraction.

Some of the challenges in speech recognition

When we deal with speech recognition, most of the challenges come from the audio or voice input itself. These are some of them.

  1. Noise: When giving input to the model, there may be some noise mixed in with our voice, which makes the input difficult for the model to analyze.
  2. Echo: For analysis, all the sounds are converted to waves, and echo makes those waves difficult to analyze.
  3. Similar sounds: Sounds similar to the input sound can cause confusion.
  4. Machine error sound: This can create a problem in recognizing the voice.
  5. Accents: Different regions and different languages have different accents, and it is difficult to analyze and recognize them.
  6. Disorganized speech: Some of the speech waves may be disorganized.
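To get a feel for the noise challenge in the list above, one option is to add artificial noise to a clean signal at a chosen signal-to-noise ratio (SNR) and see how a model copes. Here is a minimal sketch, assuming NumPy, white Gaussian noise, and a synthetic tone in place of speech. The function names `add_noise` and `measure_snr_db` are hypothetical, and real background noise is rarely this simple.

```python
import numpy as np

def add_noise(signal, target_snr_db, rng):
    """Add white Gaussian noise so the result has roughly the target SNR in dB."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (target_snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def measure_snr_db(clean, noisy):
    """SNR in dB, treating (noisy - clean) as the noise."""
    noise = noisy - clean
    return float(10 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2)))

rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = np.sin(2 * np.pi * 440 * t)

noisy = add_noise(clean, target_snr_db=10, rng=rng)
print(measure_snr_db(clean, noisy))  # close to 10
```

Lowering `target_snr_db` makes the noise louder relative to the voice, which is the situation the model finds hard to analyze.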

My other blogs you might be interested in:

Knowledgebase in Natural Language Processing

Resume/CV parser using Natural Language Processing/NER