NodeJS & Tesseract.js for OCR [Example]

A short example of working through a directory of scanned documents (JPGs) and performing Optical Character Recognition on each one.

Start by installing Tesseract.js to do the OCR, glob to read the directory for filenames and fs-extra to write the txt files containing our cleaned text.

npm install tesseract.js glob fs-extra --save

Next, create the script ocr.js that uses these npm packages:

// load our NPM packages
var Tesseract = require('tesseract.js');
var glob = require('glob');
var fs = require('fs-extra');

// define where the scanned docs can be found
var path = 'public/resized/';

// read the filenames in the path
glob(path + '*.jpg', function (er, files) {
   // for each filename call the Tesseract function
   files.forEach(function (filename) {
      // set the language to French. Omitting this defaults to eng
      Tesseract.recognize(filename, 'fra')
         .progress(function (p) { console.log(filename, p) })
         .catch(err => console.error(err))
         .then(function (result) {
            var rawtxt = result.text;
            // simple regex to replace runs of unwanted characters with a space
            // (note: this also strips accented characters from the French text)
            var clntxt = rawtxt.replace(/[^a-z0-9./-]+/gi, ' ');
            // write the cleaned text to a text file
            fs.writeFile(filename + '.txt', clntxt, function (err) {
               if (err) { return console.error(err); }
               console.log('WROTE ' + filename + '.txt');
            });
         });
   });
});

Run the script with

node ocr.js

It’s not fast but it works!
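One last note: the glob pattern above only matches jpgs sitting directly inside public/resized/. If your scans are nested in subfolders, glob's ** (globstar) pattern will walk the whole tree. A minimal sketch of the change:

var glob = require('glob');
var path = 'public/resized/';
// '**' tells glob to descend into subdirectories as well
glob(path + '**/*.jpg', function (er, files) {
   console.log('found ' + files.length + ' scans');
});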

Exposing MongoDB Documents with a REST API via Nodejs and Expressjs (Part 3/3)

The internet is full of arguments about whether or not to store images (and blobs generally) in mongodb. I’m of the view it is NOT a bad thing. Especially in the context of Heroku and mLabs.  

This example builds off the previous post (‘Nodejs & Expressjs to Upload and save Image Files into MongoDB’) and uses the same example site.

A few key points to consider about the general approach.

  1. Hosting user-uploaded files on disk is not a viable solution in Heroku’s hosted nodejs environment. If you really want to do this, you’ll have to introduce Amazon AWS S3 into the mix. Not difficult – just more code than I want to have to format in WordPress (which incidentally, is STILL horrible). 
  2. Passing or sharing mongodb documents from your server-side nodejs to your browser-side app is not a great idea. Particularly for image files. You are foregoing the sophisticated caching capabilities of your browser and proxy servers; both intended to speed up network transit and response times.

The solution is to build a REST API interface in nodejs to access documents (specifically image files) in your mongodb.  This is a well established and simple approach but only if you can avoid getting tangled up in encoding.

The key takeaway is that you want to save your image base64 encoded and NOT in the nodejs default of UTF-8.

We’ll start with an Express route to access (.findOne()) our image using the unique mongodb ObjectId (_id) and will use this as the parameter passed in on the REST URL.

Minor changes allow you to use any other unique identifier such as a username or image filename.

You can append this code into the same  /routes/index.js used in the previous example.


router.get('/picture/:picture', function(req, res){
   // assign the URL parameter to a variable
   var filename = req.params.picture;
   // open the mongodb connection with the connection
   // string stored in the variable called url.
   MongoClient.connect(url, function(err, db){
      db.collection('yourcollectionname')
        // perform a mongodb search and return only one result,
        // converting the variable called filename into a valid ObjectId.
        .findOne({'_id': ObjectId(filename)}, function(err, results){
            // set the http response header so the browser knows this
            // is an 'image/jpeg' or 'image/png'
            res.setHeader('content-type', results.contentType);
            // send only the raw image bytes stored in the img
            // object's buffer element
            res.send(results.img.buffer);
         });
   });
});
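Minor changes, as mentioned earlier, let you key the lookup on something other than the ObjectId. A minimal sketch, assuming your documents also carry a filename field (the upload example in the previous post doesn't store one, so treat this as illustrative only):

// same idea, but keyed on a (hypothetical) filename field
router.get('/picture/byname/:name', function(req, res){
   var name = req.params.name;
   MongoClient.connect(url, function(err, db){
      db.collection('yourcollectionname')
        // findOne on the filename field instead of converting to an ObjectId
        .findOne({'filename': name}, function(err, results){
            res.setHeader('content-type', results.contentType);
            res.send(results.img.buffer);
         });
   });
});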

To test this, log into your Heroku dashboard and navigate into your mLabs add-on. Within your Collection you hopefully have some documents uploaded from the earlier post. It should look something like this,


{
   "_id": {
      "$oid": "58f8bd6a343aff254131cb17"
   },
   "description": "Foo!",
   "contentType": "image/jpeg",
   "size": 1239201,
   "img": "<Binary Data>"
}

Select and copy an ObjectId ($oid). In this example, 58f8bd6a343aff254131cb17

In the browser, go to

   http://localhost:3000/picture/58f8bd6a343aff254131cb17

Hopefully, you’ll see your jpeg image rendered in the browser.
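If you prefer the command line, curl gives a quick sanity check – the first call should show the content-type header the route sets, the second saves the image to a file you can open locally:

   curl -I http://localhost:3000/picture/58f8bd6a343aff254131cb17
   curl -o test.jpg http://localhost:3000/picture/58f8bd6a343aff254131cb17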

It will have come out of your mongodb Collection at mLabs, passed through your nodejs route at Heroku, traversed the Cloudflare CDN and arrived safely at your browser for rendering and caching.

Or at least that’s how it works in the running example.

Using Nodejs & Expressjs to Upload and save Image Files into MongoDB (Part 2/3)

This is a follow-on to Part 1, an introductory post.

This example will use the Expressjs web framework. Expressjs can be a deep and mind-boggling domain but we’ll use only two of its key features, routes and views.

This example will NOT use mongodb GridFS. In the interests of keeping this as simple as possible, we’ll limit our image file size to < 2MB and avoid helper libraries that obfuscate what’s going on.

So assuming you have nodejs working you can stub out a templated app. I’m going to use Express Application Generator to get a directory structure and some file templates going. I’ll also install nodemon, a useful utility which restarts the nodejs service each time you save a file change.

sudo npm install express-generator nodemon -g

I’m choosing to use the ejs templating engine instead of the default Jade engine. If you’re unfamiliar with Jade, it looks weird. A bit like going straight to Coffeescript and bypassing Javascript.

express --ejs --git testdir

Install two NPM packages

npm install mongodb multer --save

You may want to run this to make sure packages and dependencies are up to date.

npm install

Now start your node service from within your testdir and you should be able to access it from your browser with http://localhost:3000 (or whatever port it starts on)

nodemon bin/www

Only two files are really needed for this example,

  1. routes/index.js
  2. views/index.ejs

Open each in your code editor and we’ll start with views/index.ejs. This is the minimum HTML necessary to get a ‘choose file’ button and file selector going. It also includes a text field for some user entered meta data.

<form action="/uploadpicture" method="POST" enctype="multipart/form-data">
<input type="file" name="picture" accept="application/x-zip-compressed,image/*">

<input class="form-control" type="text" name="description" placeholder="Description or Message">

<input class="btn btn-primary" type="submit" value="submit">
</form>

Save this file and go to routes/index.js

We’re going to create two routes. You’ll want it to look something like this,

var express = require('express')
   , router = express.Router()
   , MongoClient = require('mongodb').MongoClient
   , ObjectId = require('mongodb').ObjectId
   , fs = require('fs-extra')
   // Your mLabs connection string
   , url = 'mongodb://username:password@yourinstanced.mlab.com:29459/yourdb'
   , multer = require('multer')
   , util = require('util')
   , upload = multer({ limits: { fileSize: 2000000 }, dest: 'uploads/' });

// Default route http://localhost:3000/
router.get('/', function(req, res){ res.render('index'); });

// Form POST action handler
router.post('/uploadpicture', upload.single('picture'), function (req, res){
   if (req.file == null) {
      // If Submit was accidentally clicked with no file selected...
      res.render('index', { title: 'Please select a picture file to submit!' });
   } else {
      MongoClient.connect(url, function(err, db){
         // read the file multer saved to its temporary upload location
         var newImg = fs.readFileSync(req.file.path);
         // encode the file as a base64 string.
         var encImg = newImg.toString('base64');
         // define your new document
         var newItem = {
            description: req.body.description,
            contentType: req.file.mimetype,
            size: req.file.size,
            img: Buffer.from(encImg, 'base64')
         };

         db.collection('yourcollectionname')
            .insert(newItem, function(err, result){
               if (err) { console.log(err); }
               var newoid = new ObjectId(result.ops[0]._id);
               // remove the temporary upload now that it is stored in mongodb
               fs.remove(req.file.path, function(err) {
                  if (err) { console.log(err); }
                  res.render('index', { title: 'Thanks for the Picture!' });
               });
            });
      });
   }
});

This should be about all you need to get a basic uploader running…

One important note here is that we’re converting the binary image file (jpg) into a text string that is base64 encoded. This is not the nodejs default (utf-8). The reason for this is not well documented according to Google but will become apparent in the next post.
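If you want to convince yourself the base64 step is lossless, here is a quick sketch you can run in node (assuming any local jpg, here called test.jpg – not a file from the example). The bytes survive a base64 round trip, while a utf-8 round trip mangles them:

var fs = require('fs');
// read the raw binary
var original = fs.readFileSync('test.jpg');
// encode to a base64 string, then decode back into a Buffer
var encoded = original.toString('base64');
var decoded = Buffer.from(encoded, 'base64');
console.log(original.equals(decoded));   // true - identical bytes
// now round trip via utf-8 instead
var mangled = Buffer.from(original.toString('utf-8'), 'utf-8');
console.log(original.equals(mangled));   // false - invalid utf-8 sequences were replaced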

The next post will examine how to extract these image files from mongodb and serve them up over http.

Free Nodejs & MongoDB Hosting at Heroku & mLabs (Part 1/3)

Node.js has been all the rage for several years now. The momentum and traction since ~2014 seem to have caught the ‘enterprise community’ off guard. Data center penetration is growing and raising alarms with Chief Information and Computer Security Officers, mostly concerned over source code pollution at GitHub and NPM. Justified or not, there is still no credible evidence-based answer to this concern.

As developers continue to pile into this stack, more are asking where to host their prototype or hobby apps. Preferably free, or at least modestly priced.

I’ve been using and recommending Heroku for a few years. Not that it’s the best, cheapest or most performant but because it’s the one I’m most familiar with.

I’ve been meaning to get to grips with Google’s Cloud hosting of Nodejs, but have yet to find the time. Also, price transparency remains a deterrent.

Heroku was around long before all the nodejs excitement started. Founded ~2007, I think.

So has the other critical piece of infrastructure – free MongoDB hosting at mLab.

mLabs is closely integrated into Heroku as an ‘add-on’ so separate registration is not required.

The Heroku dashboard serves as the portal to manage your nodejs instance, or Dyno as Heroku calls it (not to be confused with a Slug, but I am), and your mongodb collections.

Heroku is now a wholly owned subsidiary of Salesforce.com, and both Heroku and mLabs host their offerings on Amazon AWS.

A third piece of crucial free infrastructure is DNS hosting, robust Content Delivery Networking and SSL encryption.

This too can be had for free by hobbyists and developers via Cloudflare.

It’s been a well kept secret for a long time now.

So what we have is,

Free Nodejs & MongoDB Hosting = Heroku + mLabs + Cloudflare

All you need to get these three pieces up and running is an email address and about a half hour (probably more if you need DNS to propagate).

Fast forward and you’ll find the Heroku Documentation is pretty clear and will get your developer tool chain installed (Heroku CLI and MongoDB Shell) and your first test app deployed with Git.
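For reference, the loop the documentation walks you through boils down to something like this (a rough sketch – the app name is a placeholder, and the mLab add-on plan name is an assumption, so check the add-ons catalog):

heroku login
heroku create your-app-name
# provision the free mLab add-on (plan name may differ)
heroku addons:create mongolab:sandbox
git push heroku master
heroku open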

In the next post we’ll see how to get a simple app running that allows users to upload an image file and store it in mongodb.

The running example is here.

Data Visualization : Jawbone UP

What originally drew me to the UP back in 2013 was the offer of access to my own data. I was hoping to get sensor data: the actual discrete time-stamped measurements from the accelerometer and the stopwatch. Instead what I got was daily aggregations. I suspect Jawbone retain the metadata from the phone app like geotags, network details etc.

Downloading the aggregated data from the website involves finding a link buried at the bottom of the Accounts section of the User Profile page.

Interpreting the column headings required some hunting around the Jawbone Support Forums. These community Forums have since disappeared from the Jawbone website. So the table below may be the only data dictionary still floating around the internet!

It appears to have been deciphered by an enthusiastic user rather than being an official spec from Jawbone. I’d link to the forum post and credit the user, but I couldn’t find them even in Google Cache.

Essentially, this is what’s available in the CSV files.

Column name     | Type     | Description
e_calcium       | meal     | calcium content in milligrams
e_calories      | meal     | energy content in kcal
e_carbs         | meal     | carbohydrates content in grams
e_cholesterol   | meal     | cholesterol content in milligrams
e_fiber         | meal     | fiber content in grams
e_protein       | meal     | protein content in grams
e_sat_fat       | meal     | saturated fat content in grams
e_sodium        | meal     | sodium content in milligrams
e_sugar         | meal     | sugar content in grams
e_unsat_fat     | meal     | unsaturated fat (monounsaturated + polyunsaturated) content in grams
m_active_time   | movement | total number of seconds the user has been moving
m_calories      | movement | total number of calories burned in the day
m_distance      | movement | total distance in meters
m_lcat          | movement | longest consecutive active time in seconds
m_lcit          | movement | longest consecutive idle time in seconds
m_steps         | movement | total number of steps in the day
m_workout_count | movement | number of workouts in the day
m_workout_time  | movement | total number of seconds the user has worked out
o_count         | mood     | number of workouts in the day
o_mood          | mood     | total sum of mood ratings in the day
s_asleep_time   | sleep    | number of minutes, since previous midnight, when the user fell asleep (first time the user fell into light or sleep mode)
s_awake         | sleep    | seconds the user was awake
s_awake_time    | sleep    | number of minutes, since previous midnight, when the user woke up (either the band was taken out of sleep mode, or the beginning of the last awake period)
s_awakenings    | sleep    | number of times the user woke up
s_bedtime       | sleep    | number of minutes, since previous midnight, when the user set the band into sleep mode
s_deep          | sleep    | number of seconds the user was in deep sleep
s_duration      | sleep    | total number of seconds the user slept
s_light         | sleep    | number of seconds the user was in light sleep
s_quality       | sleep    | quality score (0-100)
n_asleep_time   | nap      | number of minutes, since previous midnight, when the user fell asleep (first time the user fell into light or sleep mode)
n_awake         | nap      | seconds the user was awake
n_awake_time    | nap      | number of minutes, since previous midnight, when the user woke up (either the band was taken out of sleep mode, or the beginning of the last awake period)
n_awakenings    | nap      | number of times the user woke up
n_bedtime       | nap      | number of minutes, since previous midnight, when the user set the band into sleep mode
n_deep          | nap      | number of seconds the user was in deep sleep
n_duration      | nap      | total number of seconds the user slept
n_light         | nap      | number of seconds the user was in light sleep
n_quality       | nap      | quality score (0-100)

I chose to explore this data visually with D3.js and Crossfilter.js. You could just as easily have done the same in MS Excel or Google Sheets.
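For the curious, the loading step was roughly this (a minimal sketch, assuming D3 v4 and Crossfilter are already on the page, the export is saved as jawbone.csv, and the file has a DATE column alongside the fields in the table above – treat those column names as assumptions):

// load the exported CSV and index it with crossfilter
d3.csv('jawbone.csv', function(error, rows) {
   if (error) { throw error; }
   // coerce the columns we care about from strings to numbers
   rows.forEach(function(d) {
      d.m_steps = +d.m_steps;
      d.s_duration = +d.s_duration;
   });
   var cf = crossfilter(rows);
   // dimension on the (assumed) DATE column, then total steps per day
   var byDate = cf.dimension(function(d) { return d.DATE; });
   var stepsPerDay = byDate.group().reduceSum(function(d) { return d.m_steps; });
   console.log(stepsPerDay.top(5));
});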

My experience suggests my band’s distance estimates are off by about +20% when running. Consequently, this overestimates the speed (mph) calculations performed by the app. It’s true, I could calibrate it to my own running cadence and stride. I didn’t.

Given what I know about my own Basal Metabolic Rate (BMR) and Total Energy Expenditure (TEE), I assume the calorie expenditure estimates to be equally flattering (about 15% overestimated).

I believe the band assumes approximately 2,000 steps per mile. This would be consistent with prevailing average estimates. Hence, the further you are from “average” height and weight, the higher your margin of error.

If you’re an athletic outlier (skinny distance runner or a stocky body builder) these measurements are not useful for improving performance. If you’re “average” (overweight and inactive) you’ll get more accurate measurements out of the box. Which I suppose says a lot about who this was designed for.

In performing this simple visual exploration, I was unable to learn anything I didn’t already know.

  1. I’m more active on the weekends. Not a surprise given my desk job.
  2. I sleep better at the weekends. Not a surprise given I’m more active.
  3. I appear to be more active in the warmer months. Not a surprise given I live in SD.
  4. I’m not close to the mean. Not a surprise given my heritage and genetics.

No unexpected patterns emerged. No ah-ha moments. In fact, I suspect I was happier during the missing data periods when I wasn’t wearing a band to obsessively measure my runs, workouts, sleep, vitals, macro-nutrients etc.

I’ve owned at least six UP bands (if not eight). To be honest, I’ve lost count. None lasted beyond the 90-day warranty period and all except the first were replacements. While the customer service has been excellent, the durability of the hardware was disappointing. This reflects heavily in the data and what you can do with it.

Replacement bands typically take 2-3 weeks to arrive, explaining the lengthy gaps in the illustrations. My apparently choppy performance (wide swings from the mean in the horizon charts) is a device reliability issue rather than inconsistent lifestyle choices or behavior patterns.

The UP band is unlikely to have any long lasting impact on my overall fitness, health or wellbeing. I’m pretty confident in saying, it won’t improve my Quality Adjusted Life Years. At best, the inactivity warnings remind me how much of a sedentary slob I can be.

If you’ve had better luck with your Jawbone UP and are interested in trying this analysis for yourself, my source code can be found on my Github repository.