Dev 101: Build A Text-To-Speech App Using Client-Side JavaScript

Building a Text-To-Speech Application with Tesseract OCR

The following implementation is broken down into 2 parts:

Part I. Image to Text Extraction with Tesseract-OCR

Tesseract OCR is a well-established open-source OCR engine. With sincere thanks to Jerome Wu, a pure JavaScript port of it (Tesseract.js) has been released to the online community.

For this application, a self-hosted copy of Tesseract.js v2 is used so that the app works offline and stays portable.

Step 1. Retrieve the following 4 files of Tesseract.js v2

- tesseract.min.js
- worker.min.js
- tesseract-core.wasm.js
- eng.traineddata.gz*

* For simplicity, all text to be extracted is assumed to be in English.

Import plugin

<script src='js/tesseract/tesseract.min.js'></script>

Proceed to set the worker attributes

const worker = Tesseract.createWorker({
  workerPath: 'js/tesseract/worker.min.js',
  langPath: 'js/tesseract/lang-data/4.0.0_best',
  corePath: 'js/tesseract/tesseract-core.wasm.js'
});

Note: Since the app is self-hosted, the worker, language-data, and core paths must be set to the local relative paths of these files instead of the library's default remote locations.
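Because the three paths share one folder, they can be derived from a single base directory. The sketch below assumes the folder layout shown above; buildWorkerOptions is a hypothetical helper for illustration and is not part of Tesseract.js.

```javascript
// Hypothetical helper (not a Tesseract.js API): builds the options object
// from one base directory so the three local paths stay in sync.
function buildWorkerOptions(baseDir) {
  // Make sure the base directory ends with exactly one slash
  const base = baseDir.endsWith('/') ? baseDir : baseDir + '/';
  return {
    workerPath: base + 'worker.min.js',
    langPath: base + 'lang-data/4.0.0_best',
    corePath: base + 'tesseract-core.wasm.js'
  };
}

// In the browser, the result would be passed straight to the library:
// const worker = Tesseract.createWorker(buildWorkerOptions('js/tesseract'));
```

Moving the Tesseract files to another folder then only requires changing the one argument.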

Step 2. Create User Interface for Image Upload

HTML File Input

<input id='uploadImg' type='file' />

JavaScript Code Snippet

var uploadImg = document.getElementById('uploadImg');

// Read the uploaded file as a base64-encoded data URL
function readFileAsDataURL(file) {
  return new Promise((resolve, reject) => {
    let fileredr = new FileReader();
    fileredr.onload = () => resolve(fileredr.result);
    fileredr.onerror = () => reject(fileredr.error);
    fileredr.readAsDataURL(file);
  });
}

uploadImg.addEventListener('change', (ev) => {
  let file = ev.currentTarget.files[0];
  if (!file) return; // nothing selected, so no worker is needed

  // A new worker is created for each upload
  const worker = Tesseract.createWorker({
    workerPath: 'js/tesseract/worker.min.js',
    langPath: 'js/tesseract/lang-data/4.0.0_best',
    corePath: 'js/tesseract/tesseract-core.wasm.js'
  });

  readFileAsDataURL(file).then((b64str) => {
    // Load the data URL into an <img> element for the worker
    return new Promise((resolve, reject) => {
      const img = new Image();
      img.onload = () => resolve(img);
      img.onerror = (err) => reject(err);
      img.src = b64str;
    });
  }).then((loadedImg) => {
    /* TO DO LOGIC HERE */ // See Step 3
  }).catch((err) => console.error(err));
}, false);

Note that the earlier snippet that instantiates the worker is now nested inside the event handler.
In this implementation the worker is passed an <img> element, so new Image() is initialised with its src attribute set to the uploaded image’s base64-encoded data.
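For reference, the string that FileReader.readAsDataURL() resolves to is a data URL: a MIME type followed by the file's bytes encoded in base64. The sketch below builds one by hand to show its shape. toDataURL is a hypothetical helper for illustration only; in the app the browser does this work.

```javascript
// Hypothetical helper: builds the same kind of string that
// FileReader.readAsDataURL() produces, from a MIME type and raw bytes.
function toDataURL(mimeType, bytes) {
  // Convert each byte to a character, then base64-encode the result
  let binary = '';
  for (const b of bytes) {
    binary += String.fromCharCode(b);
  }
  return 'data:' + mimeType + ';base64,' + btoa(binary);
}

// The first four bytes of a PNG file's signature
const pngSignature = new Uint8Array([0x89, 0x50, 0x4e, 0x47]);
console.log(toDataURL('image/png', pngSignature));
// prints "data:image/png;base64,iVBORw=="
```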

Step 3. Implement Tesseract API to extract Image Text

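// Placed inside the .then((loadedImg) => { ... }) callback from Step 2
(async () => {
  // Load the core, then the English language data
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');

  let result = await worker.recognize(loadedImg);
  let extractedData = result.data;

  // Join the recognised words into a single string
  let wordsArr = extractedData.words;
  let combinedText = '';
  for (let w of wordsArr) {
    combinedText += (w.text) + ' ';
  }
  // inputTxt is the textarea that displays the extracted text
  inputTxt.value = combinedText;
  // Free the worker once recognition is done
  await worker.terminate();
})();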
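OCR output often contains a few misread words. One possible refinement, sketched below, is to skip words the engine is unsure about before joining them. It assumes each word object carries a numeric confidence (0 to 100) alongside its text; joinConfidentWords and the default threshold of 60 are hypothetical choices for illustration, not part of the original app.

```javascript
// Hypothetical helper: joins recognised words, skipping any below a
// confidence threshold. The default of 60 is an arbitrary assumption.
function joinConfidentWords(words, minConfidence = 60) {
  return words
    .filter((w) => w.confidence >= minConfidence && w.text.trim() !== '')
    .map((w) => w.text.trim())
    .join(' ');
}

// Sample data shaped like the entries of result.data.words
const sampleWords = [
  { text: 'Hello', confidence: 95 },
  { text: 'w0rld', confidence: 30 },
  { text: 'there', confidence: 80 }
];
console.log(joinConfidentWords(sampleWords));
// prints "Hello there"
```

Unlike the loop above, this version does not leave a trailing space at the end of the string.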

Preview of Part I Implementation:

Screencapture by Author | Upon upload of an image, Tesseract-OCR processes the file and extracts the text into the textarea

🚩 Checkpoint: As illustrated, Part I uses Tesseract-OCR to implement the Image-to-Text aspect of this application.
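Putting Part I together, the Step 3 sequence can be wrapped in one function that always terminates the worker, even when recognition fails. This is a sketch, assuming a worker object that exposes the Tesseract.js v2 methods used in Step 3; extractText is a hypothetical name, and the worker is passed in as a parameter so the function can be tried with a stand-in object.

```javascript
// Hypothetical wrapper around the Step 3 calls. The finally block runs
// whether recognition succeeds or throws, so the worker is always freed.
async function extractText(worker, image, lang = 'eng') {
  try {
    await worker.load();
    await worker.loadLanguage(lang);
    await worker.initialize(lang);
    const { data } = await worker.recognize(image);
    // Join the recognised words with single spaces
    return data.words.map((w) => w.text).join(' ');
  } finally {
    await worker.terminate();
  }
}

// Possible usage inside the Step 2 callback:
// extractText(worker, loadedImg).then((text) => { inputTxt.value = text; });
```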

Part II. Convert Text to Speech with Web Speech API

To convert on-page text to speech in the browser, Part II of the application uses the SpeechSynthesis interface of the Web Speech API.

Reusing the JavaScript code snippet from the GitHub Repo web-speech-api, the Text-to-Speech aspect of this app is rendered as follows:

Illustration by Author | After text extraction from image, selecting the “Play” Button would convert input text to browser speech. | Language Dialect + Speed + Pitch can be customised with displayed form inputs.
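The text-to-speech step can be sketched as follows, assuming the browser exposes speechSynthesis and SpeechSynthesisUtterance. speakText is a hypothetical helper, not the code from the web-speech-api repo; the synthesiser and the utterance constructor are parameters only so that the function can be exercised outside a browser.

```javascript
// Keep a number within [min, max]
function clamp(value, min, max) {
  return Math.min(max, Math.max(min, value));
}

// Hypothetical helper: reads the given text aloud with the chosen settings.
function speakText(text, { rate = 1, pitch = 1, voice = null } = {},
                   synth = globalThis.speechSynthesis,
                   Utterance = globalThis.SpeechSynthesisUtterance) {
  if (!synth || !Utterance) {
    throw new Error('Speech synthesis is not available in this environment');
  }
  const utterance = new Utterance(text);
  // The Web Speech API accepts rate from 0.1 to 10 and pitch from 0 to 2
  utterance.rate = clamp(rate, 0.1, 10);
  utterance.pitch = clamp(pitch, 0, 2);
  if (voice) utterance.voice = voice; // e.g. one entry from synth.getVoices()
  synth.cancel(); // stop anything that is still being read
  synth.speak(utterance);
  return utterance;
}
```

In the app, a call such as speakText(inputTxt.value, { rate: 1.2 }) could be made from the “Play” button's click handler.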

Full source code is available at my GitHub repo, Text-To-Speech-App, or try it out at the demo!

Potential Use-Cases

- Data entry for business documents
- Aids for the visually impaired
- Converting scanned documents to machine-readable text for data processing

Personal Comments

OCR technology extracts textual content from images, which removes the labour-intensive need to re-type the text and saves overhead costs (time and manpower).

As expectations for data-fueled fields such as Data Analytics and Artificial Intelligence/Machine Learning continue to grow, so does the demand for digital data collection.

With WebAssembly (WASM) allowing code written in languages such as C/C++ to run in the browser, combined with existing JavaScript Web APIs, this one-off implementation is a proof of concept that a standalone Text-to-Speech (i.e. “Read Aloud”) application can be built entirely with client-side JavaScript.

Reference:
Build A Text-To-Speech App Using Client-Side JavaScript | JavaScript in Plain English