by Melody Yang
How I built a handwriting recognizer and shipped it to the App Store
From constructing a Convolutional Neural Network to deploying an OCR to iOS
The Motivation for the Project ✍️ 🇯🇵
While I was learning how to create deep learning models for the MNIST dataset a few months ago, I ended up making an iOS app that recognized handwritten characters.
My friend Kaichi Momose was developing a Japanese language learning app, Nukon. He coincidentally wanted to have a similar feature in it. We then collaborated to build something more sophisticated than a digit recognizer: an OCR (Optical Character Recognition/Reader) for Japanese characters (Hiragana and Katakana).
During the development of Nukon, there was no API available for handwriting recognition in Japanese. We had no choice but to build our own OCR. The biggest benefit we got from building one from scratch was that ours works offline. Users can be deep in the mountains without the internet and still open up Nukon to maintain their daily routine of learning Japanese. We learned a lot throughout the process, but more importantly, we were thrilled to ship a better product for our users.
This article will break down the process of how we built a Japanese OCR for iOS apps. For those who would like to build one for other languages/symbols, feel free to customize it by changing the dataset.
Without further ado, let’s take a look at what will be covered:
Part 1️⃣: Obtain the dataset and preprocess images
Part 2️⃣: Build & train the CNN (Convolutional Neural Network)
Part 3️⃣: Integrate the trained model into iOS
Obtain the Dataset & Preprocess Images 🈂
The dataset comes from the ETL Character Database, which contains nine sets of images of handwritten characters and symbols. Since we are going to build an OCR for Hiragana, ETL8 is the dataset we will use.
To get the images from the database, we need some helper functions that read the images and store them in an .npz file.
Once we have hiragana.npz saved, let’s start processing the images by loading the file and resizing them to 32x32 pixels. We will also add data augmentation to generate extra images that are rotated and zoomed. When the model is trained on character images at a variety of angles, it can better adapt to people’s handwriting.
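To make the preprocessing step concrete, here is a minimal NumPy-only sketch of resizing, scaling, and rotate-and-zoom augmentation. It is not the code from the project: the function names, the 15-degree and 20% ranges, and the random stand-in array used in place of the contents of hiragana.npz are all assumptions for illustration.

```python
import numpy as np


def resize_nearest(image, size=32):
    """Resize a 2-D grayscale image to size x size with nearest-neighbor sampling."""
    rows = np.arange(size) * image.shape[0] // size
    cols = np.arange(size) * image.shape[1] // size
    return image[np.ix_(rows, cols)]


def rotate_zoom(image, angle_deg=0.0, zoom=1.0, fill=0.0):
    """Rotate and zoom a 2-D image about its center, keeping the same size."""
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    # Map every output pixel back to the source pixel it copies from.
    dy, dx = (ys - cy) / zoom, (xs - cx) / zoom
    src_y = np.rint(cy + dy * np.cos(theta) - dx * np.sin(theta)).astype(int)
    src_x = np.rint(cx + dy * np.sin(theta) + dx * np.cos(theta)).astype(int)
    inside = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    out = np.full((h, w), fill, dtype=image.dtype)
    out[inside] = image[src_y[inside], src_x[inside]]
    return out


def preprocess(images, size=32):
    """Resize each image, scale pixels to [0, 1], and add a channel axis."""
    resized = np.stack([resize_nearest(img, size) for img in images])
    resized = resized.astype(np.float32)
    peak = resized.max()
    if peak > 0:
        resized /= peak
    return resized[..., np.newaxis]  # shape: (n, size, size, 1)


def augment(images, rng, max_angle=15.0, zoom_range=0.2):
    """Return one randomly rotated and zoomed copy of each (n, h, w, 1) image."""
    out = np.empty_like(images)
    for i, img in enumerate(images):
        angle = rng.uniform(-max_angle, max_angle)
        zoom = rng.uniform(1.0 - zoom_range, 1.0 + zoom_range)
        out[i, ..., 0] = rotate_zoom(img[..., 0], angle, zoom)
    return out


rng = np.random.default_rng(0)
raw = rng.random((4, 64, 64))  # stand-in data, NOT the real contents of hiragana.npz
x = preprocess(raw)
x_aug = augment(x, rng)
print(x.shape, x_aug.shape)  # (4, 32, 32, 1) (4, 32, 32, 1)
```

With real data you would load the arrays from hiragana.npz with np.load and pass them to preprocess instead of the random stand-in. A deep learning library's built-in augmentation utilities would do the same job with better interpolation; the hand-rolled version here is only meant to show what "rotated and zoomed" means at the pixel level.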
Build and Train the CNN 🏋️
Now comes the fun part! We will use Keras to construct a CNN (Convolutional Neural Network) for our model. When I first built the model, I experimented with hyper-parameters and tuned them multiple times. The combination below gave me the highest accuracy: 98.77%. Feel free to play around with different parameters yourself.
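If you want a starting point to experiment with, here is a hedged sketch of a small Keras CNN for 32x32 grayscale input and 71 output classes. The layer sizes, dropout rates, and optimizer are my assumptions for illustration and are not the combination that reached 98.77%. The small helper at the top only does the arithmetic for how the feature maps shrink, which is handy when you change kernel or pooling sizes.

```python
def conv_output_size(size, kernel, stride=1, padding="valid"):
    """Spatial size after a convolution or pooling layer (Keras padding rules)."""
    if padding == "same":
        return -(-size // stride)  # ceil(size / stride)
    return (size - kernel) // stride + 1


def build_cnn(num_classes=71, input_shape=(32, 32, 1)):
    """Build an illustrative CNN; the hyper-parameters are assumptions."""
    # Imported here so the shape helper above works without TensorFlow installed.
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model


# Trace the spatial size through conv(3) -> pool(2) -> conv(3) -> pool(2).
size = 32
for kernel, stride in [(3, 1), (2, 2), (3, 1), (2, 2)]:
    size = conv_output_size(size, kernel, stride)
flattened = size * size * 64  # 64 filters in the last convolution
print(size, flattened)  # 6 2304
```

Calling build_cnn().summary() should show the same 6x6x64 feature map before the Flatten layer, which is a quick sanity check whenever you change the architecture.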
Here are some tips if you find the performance of the model unsatisfactory in the training step:
Model is overfitting
This means that the model is not well generalized. Check out this article for intuitive explanations.
How to detect overfitting:
acc (accuracy) continues to go up, but val_acc (validation accuracy) does the opposite in the training process.
Some solutions to overfitting: regularization (e.g. dropout), data augmentation, and improving the quality of the dataset.
How to know whether the model is “learning”
The model is not learning if val_loss (validation loss) goes up or does not decrease as the training goes on.
Use TensorBoard — it provides visualizations for model performance over time. It gets rid of the tiresome task of looking at every single epoch and comparing values constantly.
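The two rules of thumb above can also be checked in a few lines of plain Python. This is only a sketch: the window size, the tolerance, and the numbers in the example are made up, and the dictionary keys follow the acc / val_acc / val_loss names used in this article (newer Keras versions report accuracy and val_accuracy instead).

```python
def trend(values, window=3):
    """Average change per epoch over the last `window` epochs."""
    recent = values[-(window + 1):]
    if len(recent) < 2:
        return 0.0
    return (recent[-1] - recent[0]) / (len(recent) - 1)


def diagnose(history, window=3, tol=1e-3):
    """Return warnings based on the recent trends of the training metrics."""
    acc = trend(history["acc"], window)
    val_acc = trend(history["val_acc"], window)
    val_loss = trend(history["val_loss"], window)
    notes = []
    if acc > tol and val_acc < -tol:
        notes.append("overfitting: acc rising while val_acc falls")
    if val_loss > -tol:
        notes.append("not learning: val_loss is flat or rising")
    return notes


# Made-up per-epoch metrics, in the shape of a Keras History.history dict.
overfit_run = {
    "acc": [0.80, 0.88, 0.93, 0.96, 0.98],
    "val_acc": [0.78, 0.84, 0.83, 0.81, 0.79],
    "val_loss": [0.70, 0.55, 0.58, 0.63, 0.69],
}
healthy_run = {
    "acc": [0.50, 0.70, 0.80, 0.90],
    "val_acc": [0.50, 0.68, 0.78, 0.88],
    "val_loss": [1.20, 0.80, 0.60, 0.40],
}
print(diagnose(overfit_run))
print(diagnose(healthy_run))
```

The first run triggers both warnings and the second triggers none. It is a crude check compared to looking at the curves in TensorBoard, but it is easy to drop into a training script.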
Once we are satisfied with the accuracy, we remove the dropout layers before saving the weights and model configuration to a file.
The only task left before moving on to the iOS part is converting hiraganaModel.h5 to a CoreML model.
output_labels are all possible outputs we will see in iOS later.
Fun fact: if you understand Japanese, you may know that the order of the output characters does not match the “alphabetical order” of Hiragana. It took us some time to realize that the images in ETL8 weren’t in “alphabetical order” (thanks to Kaichi for realizing this). The dataset was compiled by a Japanese university, though…🤔
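To see why the label order matters, here is a small sketch of how an output vector gets turned into a character. The five-character list and its order are made up for illustration and are not the real ETL8 order; decode_prediction is a hypothetical helper, not part of any library.

```python
def decode_prediction(probabilities, output_labels):
    """Return the (label, probability) pair with the highest probability."""
    if len(probabilities) != len(output_labels):
        raise ValueError("need exactly one label per model output")
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return output_labels[best], probabilities[best]


# Made-up example: labels listed in the order the dataset stores the classes.
dataset_order = ["あ", "う", "い", "お", "え"]
# The same labels sorted "alphabetically", which is what we first assumed.
alphabetical_order = sorted(dataset_order)

probabilities = [0.05, 0.80, 0.05, 0.05, 0.05]  # the model is confident in class 1

print(decode_prediction(probabilities, dataset_order))       # ('う', 0.8)
print(decode_prediction(probabilities, alphabetical_order))  # ('い', 0.8)
```

The model only knows class indices, so a label list in the wrong order silently turns a correct prediction into the wrong character.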
Integrate the Trained Model Into iOS 📲
We are finally putting everything together! Drag and drop hiraganaModel.mlmodel into an Xcode project. Then you will see something like this:
Note: Xcode will create a workspace upon copying the model. We need to switch our coding environment to the workspace; otherwise, the ML model won’t work!
The end goal is having our Hiragana model predict a character by passing in an image. To achieve this, we will create a simple UI so the user can write, and we will store the user’s writing in an image format. Lastly, we retrieve the pixel values of the image and feed them to our model.
Let’s do it step by step:
1. “Draw” characters on the canvas
strokeLayer.strokeColor can be any color. However, the background color of canvas must be black. Although our training images have a white background and black strokes, the ML model does not react well to an input image with this style.
2. Convert the drawing to a UIImage and retrieve pixel values with CVPixelBuffer
In the extension, there are two helper functions. Together, they translate an image into a pixel buffer, which holds the image’s pixel values. The input width and height should both be 32 since the input dimensions of our model are 32 by 32 pixels.
As soon as we have the pixelBuffer, we can call model.prediction() and pass in pixelBuffer. And there we go! The model returns its prediction as the output.
3. Show the output with an alert controller
This step is totally optional. As shown in the GIF at the beginning, I added an alert controller to display the result.
Voila! We just built an OCR that is demo-ready (and App-Store-ready)! 🎉🎉
Building an OCR is not all that hard. As you saw, this article consists of the steps I took and the problems I ran into while building this project. I enjoyed the process of making a bunch of Python code demonstrable by connecting it with iOS, and I intend to continue doing so.
I hope this article provides some useful information to those who want to build an OCR but have no clue where to start.
You can find the source code here.
Bonus: if you are interested in experimenting with shallow algorithms, then keep on reading!
[Optional] Train With Shallow Algorithms 🌲
Before implementing the CNN, Kaichi and I tested out other machine learning algorithms to figure out if they could get the job done (and save us some computing costs!). We picked KNN and Random Forest.
To evaluate their performances, we defined our baseline accuracy to be 1/71 = 0.014.
We assumed a person without any knowledge of the Japanese language could have a 1.4% chance of guessing a character right.
Thus, the model would be doing well if its accuracy could surpass 1.4%. Let’s see if it was the case. 😉
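As a quick sanity check on that baseline, the sketch below compares the analytic value 1/71 with a simulation of someone guessing uniformly at random. The number of trials and the seed are arbitrary choices of mine.

```python
import random

NUM_CLASSES = 71  # number of characters, as in the 1/71 baseline


def random_guess_accuracy(num_classes, trials, seed=0):
    """Fraction of uniformly random guesses that match a random true label."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        truth = rng.randrange(num_classes)
        guess = rng.randrange(num_classes)
        hits += guess == truth
    return hits / trials


baseline = 1 / NUM_CLASSES
simulated = random_guess_accuracy(NUM_CLASSES, trials=100_000)
print(f"analytic baseline:  {baseline:.3f}")
print(f"simulated guessing: {simulated:.3f}")
```

The simulated value should land very close to 0.014, which is the bar any model has to clear to be better than guessing.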
With KNN, the final accuracy we got was 54.84%. That is already much higher than 1.4%!
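For readers who have not used KNN before, here is a minimal NumPy sketch of the idea: flatten each image into a vector and classify it by majority vote among its k closest training images. This is not the code we ran; the value of k and the synthetic three-class data are assumptions for illustration.

```python
import numpy as np


def knn_predict(train_x, train_y, test_x, k=3):
    """Classify each test row by majority vote among its k nearest training rows."""
    # Squared Euclidean distances, shape (n_test, n_train).
    diffs = test_x[:, None, :] - train_x[None, :, :]
    dist2 = (diffs ** 2).sum(axis=2)
    nearest = np.argsort(dist2, axis=1)[:, :k]
    votes = train_y[nearest]  # shape (n_test, k); labels must be non-negative ints
    return np.array([np.bincount(row).argmax() for row in votes])


# Synthetic stand-in for flattened 32x32 images: three noisy clusters.
rng = np.random.default_rng(0)
centers = rng.random((3, 32 * 32))
train_y = np.repeat(np.arange(3), 20)
test_y = np.repeat(np.arange(3), 5)
train_x = centers[train_y] + 0.05 * rng.standard_normal((60, 32 * 32))
test_x = centers[test_y] + 0.05 * rng.standard_normal((15, 32 * 32))

predictions = knn_predict(train_x, train_y, test_x, k=3)
accuracy = float((predictions == test_y).mean())
print(f"accuracy on the toy data: {accuracy:.2%}")
```

The toy clusters are far apart, so this is a much easier problem than real handwriting and the score is not comparable to ours. The sketch also makes the cost visible: every prediction compares the input against every training image.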
Random Forest reached an accuracy of 79.23%, which exceeded our expectations. While tuning hyper-parameters, we got better results by increasing the number of estimators and the depth of the trees. We thought that having more trees (estimators) in the forest meant more features in the image were learned. Also, the deeper the tree, the more details it learned from the features.
If you are interested in learning more, I found this paper that discusses image classification with Random Forest.
Thank you for reading. Any thoughts and feedback are welcome!