How to create a MSQRD-like app with Google Cloud Vision

Disclaimer: this article aims to raise awareness of Google Cloud Vision; it is not meant to be a step-by-step guide to implementing an app.

I’m fascinated by what APIs can do. I founded a company that basically sells APIs, and sometimes companies ask me to help them leverage APIs in their world. So it’s no surprise that when Google announced the Cloud Vision API a few months ago, I asked to be part of the Alpha phase in order to start playing with it.

Selling APIs is tough work. They are powerful tools that can reduce time-to-market while keeping quality high, but they can be difficult for the business line to understand if they are not supported by examples.

So, when I was playing with MSQRD with some friends a couple of days ago, I thought it could be a good business case to discuss.

MSQRD has been built with an in-app face detection engine. The goal of this article, which is mainly academic, is to show a possible real-world use of the Cloud Vision API.

“Google Cloud Vision API enables developers to understand the content of an image by encapsulating powerful machine learning models in an easy to use REST API.” (source: https://cloud.google.com/vision/)

In other words, you can upload photos through the Vision API and Google’s backend analyses them. Thanks to the huge amount of data Google has collected over the last years, it can recognise things like:

  • Buildings, animals, faces (of course…), etc.
  • Companies, starting from the brands in the photo
  • Emotions of the people in the picture
  • Inappropriate content
  • Text
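
To give an idea of how these detections are requested, here is a minimal sketch of the JSON body sent to the v1 REST endpoint. The helper name build_annotate_request and the placeholder image bytes are my own assumptions for illustration; a real call would also need an API key or OAuth token and an HTTP client, which I leave out.

```python
import base64
import json

# v1 REST endpoint of the Vision API; authentication is not shown here.
ANNOTATE_URL = "https://vision.googleapis.com/v1/images:annotate"


def build_annotate_request(image_bytes, feature_types, max_results=5):
    """Return the JSON body for one image and a list of feature types.

    Hypothetical helper written for this article, not part of any SDK.
    """
    return {
        "requests": [
            {
                # Image bytes travel base64-encoded inside the JSON body.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [
                    {"type": feature, "maxResults": max_results}
                    for feature in feature_types
                ],
            }
        ]
    }


# Placeholder bytes stand in for a real JPEG or PNG file.
body = build_annotate_request(b"fake-image-bytes", ["FACE_DETECTION", "LABEL_DETECTION"])
print(json.dumps(body, indent=2))
```

Other feature types such as LOGO_DETECTION, SAFE_SEARCH_DETECTION and TEXT_DETECTION map to the remaining items in the list above and can be added to the same request.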

MSQRD is an app that hit the top of the App Store charts in the last few weeks, gaining a lot of visibility. It lets you take a selfie and put cool effects on it, like DiCaprio’s face with two Oscars or the Joker’s face. The app is really engaging because the effects are applied precisely and quickly and the app is easy to use, so you can have a lot of fun playing with it.

I’ll skip all the tasks related to creating the app and just focus on what is, in my opinion, the trickiest point: how to detect face details and apply the render to them in a very effective way.

Step 1: detect face details

Here is where you need a lot of precision. Every face has two eyes, one mouth, a nose and a chin, but their positions and shapes change from face to face. You need a way to recognise every detail with an engine that “knows” that two eyes are usually at the same height, that they look like a whitish background with something of a different colour in the centre, etc.

I think you got it.

Face detection technology has been known for years, but this is the first time it is accessible and easy to integrate for anyone with a basic knowledge of software development.

Let’s look at the response of the Cloud Vision API when I upload a face image. As with most web-based APIs, the response is text, readable by humans but designed to be processed by machines. In the case of Cloud Vision the response is JSON, a text format based on JavaScript object syntax.

The first piece of information returned is about the external and internal boundaries of the face.

In the image on the left you can see the boundary of the skin part of the face.
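
In the JSON these two boundaries are the boundingPoly and fdBoundingPoly fields of each face annotation, each holding a list of vertices. The sketch below turns them into simple boxes; the helper poly_to_box is my own name and the coordinates are hand-written sample values, not real API output.

```python
def poly_to_box(poly):
    """Convert a bounding polygon into (left, top, width, height)."""
    # A coordinate equal to 0 may be omitted from the JSON, hence the defaults.
    xs = [vertex.get("x", 0) for vertex in poly["vertices"]]
    ys = [vertex.get("y", 0) for vertex in poly["vertices"]]
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)


# Hand-written sample in the shape of one face annotation.
face = {
    "boundingPoly": {"vertices": [
        {"x": 100, "y": 50}, {"x": 400, "y": 50},
        {"x": 400, "y": 400}, {"x": 100, "y": 400},
    ]},
    "fdBoundingPoly": {"vertices": [
        {"x": 130, "y": 120}, {"x": 370, "y": 120},
        {"x": 370, "y": 360}, {"x": 130, "y": 360},
    ]},
}

outer_box = poly_to_box(face["boundingPoly"])   # whole head
skin_box = poly_to_box(face["fdBoundingPoly"])  # skin part only
print(outer_box, skin_box)
```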

Then you’ll find x, y and z coordinates of the face details like:

  • eyes and eyebrows; for each eye, detailed info about the left and right corners and the upper and lower bounds
  • upper and lower lip
  • mouth (like in the example on the left)
  • nose
  • chin
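
Each of these details arrives as an entry in the landmarks list, with a type and a position. A minimal sketch of how an app might index them by type follows; landmarks_by_type is a hypothetical helper and the positions are made-up sample values.

```python
import math


def landmarks_by_type(face):
    """Map each landmark type to its (x, y, z) position."""
    return {
        landmark["type"]: (
            landmark["position"].get("x", 0.0),
            landmark["position"].get("y", 0.0),
            landmark["position"].get("z", 0.0),
        )
        for landmark in face.get("landmarks", [])
    }


# Made-up sample values in the shape of the API response.
face = {"landmarks": [
    {"type": "LEFT_EYE", "position": {"x": 200.0, "y": 180.0, "z": 0.0}},
    {"type": "RIGHT_EYE", "position": {"x": 300.0, "y": 182.0, "z": -1.5}},
    {"type": "NOSE_TIP", "position": {"x": 250.0, "y": 240.0, "z": -30.0}},
    {"type": "MOUTH_CENTER", "position": {"x": 251.0, "y": 300.0, "z": -5.0}},
]}

points = landmarks_by_type(face)
left_eye, right_eye = points["LEFT_EYE"], points["RIGHT_EYE"]

# Distance between the eyes in the image plane, a handy size reference.
eye_distance = math.hypot(right_eye[0] - left_eye[0], right_eye[1] - left_eye[1])
print(round(eye_distance, 2))
```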

Finally, you can find info about face rotation:

  • clockwise/anti-clockwise (rollAngle)
  • leftward/rightward (panAngle)
  • upwards/downwards (tiltAngle)
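
As a rough illustration of how these angles could be used, the sketch below rotates a template point by rollAngle and narrows a template by panAngle. Both helpers are my own assumptions, and the cosine narrowing is only a crude approximation of perspective, not something the API prescribes.

```python
import math


def rotate_about(point, centre, roll_degrees):
    """Rotate a 2D point around a centre by the roll angle.

    Image coordinates have y pointing down, so a positive angle
    appears clockwise on screen.
    """
    angle = math.radians(roll_degrees)
    dx, dy = point[0] - centre[0], point[1] - centre[1]
    return (
        centre[0] + dx * math.cos(angle) - dy * math.sin(angle),
        centre[1] + dx * math.sin(angle) + dy * math.cos(angle),
    )


def foreshorten(width, pan_degrees):
    """Crude approximation: a face turned sideways looks narrower."""
    return width * math.cos(math.radians(pan_degrees))


# Made-up sample values for rollAngle and panAngle.
rotated = rotate_about((10.0, 0.0), (0.0, 0.0), 90.0)
narrowed = foreshorten(100.0, 60.0)
print(rotated, narrowed)
```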

As you can see, the Vision API also returns an estimate of the sentiments detected on the face, and they are very accurate, by the way!
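
The sentiments come back as likelihood labels (from VERY_UNLIKELY to VERY_LIKELY) in fields such as joyLikelihood and sorrowLikelihood. A small sketch of how an app might pick the strongest one is below; dominant_emotion and its threshold are hypothetical choices of mine, and the sample faces are made up.

```python
# Likelihood labels ordered from weakest to strongest.
LIKELIHOOD_RANK = {
    "UNKNOWN": 0, "VERY_UNLIKELY": 1, "UNLIKELY": 2,
    "POSSIBLE": 3, "LIKELY": 4, "VERY_LIKELY": 5,
}

EMOTION_FIELDS = {
    "joy": "joyLikelihood",
    "sorrow": "sorrowLikelihood",
    "anger": "angerLikelihood",
    "surprise": "surpriseLikelihood",
}


def dominant_emotion(face, minimum="POSSIBLE"):
    """Return the strongest emotion at or above `minimum`, else None."""
    best_name, best_rank = None, LIKELIHOOD_RANK[minimum] - 1
    for name, field in EMOTION_FIELDS.items():
        rank = LIKELIHOOD_RANK[face.get(field, "UNKNOWN")]
        if rank > best_rank:
            best_name, best_rank = name, rank
    return best_name


# Made-up sample annotations.
happy_face = {
    "joyLikelihood": "VERY_LIKELY",
    "sorrowLikelihood": "VERY_UNLIKELY",
    "angerLikelihood": "VERY_UNLIKELY",
    "surpriseLikelihood": "UNLIKELY",
}
neutral_face = {field: "VERY_UNLIKELY" for field in EMOTION_FIELDS.values()}

print(dominant_emotion(happy_face), dominant_emotion(neutral_face))
```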

Step 2: render application

I won’t go into a deep analysis of this step since there are tons of photo manipulation libraries around.

But the idea at this point is that you can create a gallery of render templates where, for each template, you know in advance all the needed details (like eyes/mouth/nose/chin positions).

Whenever you get a response from the Vision API, you can stretch/shrink, rotate and tilt your template so that it better fits the detected face, then show the result to the user.
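
One possible way to compute the stretch and rotation, sketched here under my own assumptions, is to match the two eye positions of the template with the two detected eye positions. Treating points as complex numbers gives scale, rotation and shift in one step. The function names and all coordinates are hypothetical, and this is not necessarily how MSQRD does it.

```python
import cmath
import math


def fit_similarity(template_left, template_right, face_left, face_right):
    """Find s and t such that s * z + t maps template eyes onto face eyes.

    abs(s) is the scale factor and the phase of s is the rotation.
    """
    a1, a2 = complex(*template_left), complex(*template_right)
    b1, b2 = complex(*face_left), complex(*face_right)
    s = (b2 - b1) / (a2 - a1)
    t = b1 - s * a1
    return s, t


def apply_similarity(s, t, point):
    """Map one template point into photo coordinates."""
    z = s * complex(*point) + t
    return z.real, z.imag


# Template eyes and mouth are known in advance; face eyes come from the API.
s, t = fit_similarity((30, 40), (70, 40), (200, 180), (300, 180))
scale = abs(s)
rotation_degrees = math.degrees(cmath.phase(s))
mouth_in_photo = apply_similarity(s, t, (50, 80))
print(scale, rotation_degrees, mouth_in_photo)
```

Every other template point (nose, chin, mask outline) can then be mapped with the same apply_similarity call before drawing.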

To glue together what we have discussed so far, a possible high-level flow of a MSQRD-like app is depicted below.

Use case details:

  • Users upload a photo (step 1)
  • The photo is uploaded to Google Cloud Vision API (step 2)
  • Cloud Vision API returns a JSON with all the relevant info (step 4)
  • The Render Manager receives the info about the uploaded photo, retrieves the render/filter template to be applied and modifies it to fit the uploaded photo (steps 5–6–7)
  • The filtered photo is shown to the user (step 8)
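
The steps above can be sketched as a single function. Everything here is hypothetical: detect_faces stands for the call to the Vision API, render for whatever photo manipulation library you choose, and the stubs at the bottom only exist so the sketch runs without a network or an image library.

```python
def process_photo(photo, template_name, detect_faces, templates, render):
    """Run one pass of the flow for a single uploaded photo."""
    faces = detect_faces(photo)           # steps 2 to 4: call the API, get JSON
    template = templates[template_name]   # steps 5 to 7 start: pick the template
    result = photo
    for face in faces:
        # Fit the template to this face and draw it (rest of steps 5 to 7).
        result = render(result, template, face)
    return result                         # step 8: show this to the user


# Stubs standing in for the real API call and the real rendering library.
def fake_detect(photo):
    return [{"rollAngle": 0.0}]


def fake_render(photo, template, face):
    return photo + [template]


output = process_photo(["selfie"], "joker", fake_detect, {"joker": "joker-mask"}, fake_render)
print(output)
```

Passing the detector and renderer in as arguments keeps the flow testable and makes it easy to swap the cloud call for an in-app engine later.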

The flow above is actually a cycle that must be performed many times per second in order to feel real-time to the user, and this can be challenging since everything must work like a charm. How to achieve this is out of scope, but it requires great software architecture design in the app and a lot of performance tuning.
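
To get a feeling for the numbers, here is a back-of-the-envelope sketch. The frame rate and the round-trip time are illustrative assumptions, not measurements of the Vision API.

```python
import math


def frame_budget_ms(frames_per_second):
    """Time available for one full cycle at the given frame rate."""
    return 1000.0 / frames_per_second


def frames_spanned(round_trip_ms, frames_per_second):
    """How many frames go by while one request is in flight."""
    return math.ceil(round_trip_ms * frames_per_second / 1000.0)


# Assumed values: 25 frames per second, 250 ms for one API round trip.
budget = frame_budget_ms(25)
stale_frames = frames_spanned(250, 30)
print(budget, stale_frames)
```

If a round trip spans several frames, the app would need to reuse the last known face position between responses, which is part of the performance tuning mentioned above.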

After an introduction to Google Cloud Vision and MSQRD, we looked more deeply into the Cloud Vision API and finally at how it could be used to create a MSQRD-like app.

The Cloud Vision API has many application areas and this is only one of the possibilities it offers. For example, at Stentle we are going to integrate it in order to offer our customers more tools for realising their omni-channel initiatives.

You can get further info about Google Vision API on the official web site https://cloud.google.com/vision/.

Tech Entrepreneur, Co-Founder and CEO of Stentle.com (a M-Cube Group company since 2019) — AI Advisor - Retail Transformation & E-Commerce Expert
