QR codes are everywhere: want to create a more original solution? Let’s build our own fiducial marker and learn how to detect and decode it.
In this post, let’s learn how to build a new fiducial marker and how to detect it by training an object detection model. Then, we will learn how to decode our marker using image processing techniques.
Let’s break it down into three steps:
- Creating a fiducial marker
- Detecting the marker in an image
- Decoding the marker
Many fiducial markers for computer vision already exist, the most famous being the QR code. Other markers, more or less widely used and more or less robust, can be used too. Below is a non-exhaustive list of codes.
As we can see in the above image, fiducial markers can be quite different but they all have the same purpose: containing easily decodable information.
What is a good fiducial marker?
Ideally, a good fiducial marker has the following properties:
- Easy to detect: before being able to decode a marker, you have to be able to accurately detect it in an image
- Easy to decode: it has to be easy to decode a marker and without any ambiguity (i.e., a decoded marker yields a unique value)
Based on those properties, let’s now build our very own marker from the existing ones.
Designing our fiducial marker
I personally like the RUNE-tag (for very arbitrary reasons):
- The circular shape and dots have something that makes it softer than square markers
- It seems very distinguishable, making it most likely easy to detect for an object detection model
- It is easily customizable: we can tweak the number of dots per circle as well as the number of circles to make it suit our needs and expected aesthetics
But it is not flawless in its original form: the same marker seen under two different rotations may or may not decode to the same value.
To mitigate this issue, we will add one condition to the marker: one and only one quadrant has no black dots, as shown below.
Such a marker can be decoded easily: let’s consider that each quadrant can take three possible values: 0, 1 or 2 depending on the three possible cases:
- A small black dot: 0
- A large black dot: 1
- Both dots: 2
More generally speaking, considering a marker with C circle layers, a quadrant can take up to 2ᶜ−1 values (because having no black dots is reserved for the quadrant 0).
In the end, for a marker with d+1 dots per circle, the number of possible combinations is equal to (2ᶜ−1)ᵈ. For a tag with 2 circle layers and 20 dots per circle, this means 3¹⁹ ≈ 1.16 billion possible values.
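To make the arithmetic concrete, here is a minimal Python sketch of the count above. The function name marker_capacity is only illustrative and is not taken from the repository.

```python
def marker_capacity(n_layers: int, dots_per_circle: int) -> int:
    """Number of distinct codes: (2**C - 1) ** d, with d = dots_per_circle - 1."""
    values_per_position = 2 ** n_layers - 1  # "no dot at all" is reserved
    free_positions = dots_per_circle - 1     # one position is kept empty
    return values_per_position ** free_positions


print(marker_capacity(2, 20))  # 1162261467, i.e. about 1.16 billion
```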
Building our fiducial marker
Let’s explain here a piece of code used to generate a random fiducial marker image.
As you can see, the first step is to generate a list of random values. With C the number of circle layers and d+1 the number of dots per circle, we use numpy to generate a list of d values, each taking one of the 2ᶜ−1 possible values (from 0 to 2ᶜ−2).
Based on this list of random values, we compute the dot values: 0 for a white dot, 1 for a black dot. Finally, we draw the final tag, given a pixel size, and we save the output as an image. Of course, a link to the fully working code repository is available and documented at the end of the article.
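As a rough idea of what this generation step could look like (a sketch, not the repository code), the snippet below draws the random values and turns them into dot values per layer. The function names, and the convention that layer 0 holds the small dots and layer 1 the large ones, are assumptions made for illustration. Drawing the dots as pixels is left out.

```python
import numpy as np


def random_marker_code(n_layers=2, dots_per_circle=20, rng=None):
    """Draw d = dots_per_circle - 1 values, each in 0 .. 2**n_layers - 2."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.integers(0, 2 ** n_layers - 1, size=dots_per_circle - 1)


def code_to_dot_bits(code, n_layers=2):
    """Return an array bits[layer, position]: 1 for a black dot, 0 for a white one.

    Values are shifted by one so that the all-white pattern stays reserved;
    the reserved empty position is appended at the end.
    """
    shifted = np.append(np.asarray(code, dtype=np.int64) + 1, 0)
    layers = np.arange(n_layers)[:, None]
    return (shifted[None, :] >> layers) & 1


bits = code_to_dot_bits([0, 1, 2])
print(bits.tolist())  # [[1, 0, 1, 0], [0, 1, 1, 0]]
```

With this convention, a value of 0 gives a small dot only, 1 a large dot only, and 2 both dots, which matches the three cases listed earlier.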
We have chosen a marker design and we know how to generate such a marker. To be able to use it in real conditions, we need a solution able to detect and decode it in an image. This is a really simple, sequential two-step pipeline:
- Detecting the marker with object detection
- Decoding the detected marker
Let’s now go to the first step of this pipeline.
So, the first step is to detect the presence and location of a marker in a given image. To do so, there are many object detection models out there. We will use here a YOLOv8 model, which is really easy to train and use in a production environment.
But before being able to actually train an object detection model, we need data: images from different backgrounds and environments containing tags from different zoom levels and perspectives.
Instead of collecting and labeling data, which can be very time consuming, we will generate synthetic data and train the model on it.
Generating the data
We only need two ingredients to generate synthetic data to train an object detection model:
- Various background images, free of rights, that can be taken from Unsplash for example
- Marker images, that we will randomly generate on the fly
With those two ingredients, all we need is to apply some augmentation using Albumentations to generate a lot of unique, synthetically generated images with their associated labels.
Let’s provide here a piece of code that generates images, given a path to background images and marker features such as the number of circle layers and dots per circle.
This is quite a long piece of code, feel free to dig in, but in a few words it does the following:
- Generate a random tag; the boundaries of the tag image give the bounding box label
- Apply transformations such as affine, perspective or scale transformations, thanks to Albumentations
- Randomly insert this tag in a randomly selected background image
- Do it as many times as needed
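To illustrate the core of these steps, here is a simplified sketch that pastes a marker image at a random position in a background and produces the matching label in the YOLO text format (class, then box center and size, all normalized to the image size). It deliberately leaves out the Albumentations transformations, and the function name paste_marker is hypothetical.

```python
import numpy as np


def paste_marker(background, marker, rng):
    """Paste `marker` at a random position and return (image, yolo_label).

    The label is (class_id, x_center, y_center, width, height), with all
    coordinates normalized by the background size. The marker must fit
    inside the background.
    """
    bg_h, bg_w = background.shape[:2]
    mk_h, mk_w = marker.shape[:2]
    x0 = int(rng.integers(0, bg_w - mk_w + 1))
    y0 = int(rng.integers(0, bg_h - mk_h + 1))
    image = background.copy()
    image[y0:y0 + mk_h, x0:x0 + mk_w] = marker
    label = (0, (x0 + mk_w / 2) / bg_w, (y0 + mk_h / 2) / bg_h,
             mk_w / bg_w, mk_h / bg_h)
    return image, label
```

Each label is then written as one line of the text file that shares its name with the image, for example "0 0.41 0.63 0.10 0.20".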
With this method, we can easily generate a large enough dataset with hundreds or thousands of images. Below are a few examples of the created images with labels as red bounding boxes.
As we can see, the generated images are quite varied, thanks to the different backgrounds and to augmentations such as blurring and perspective.
Of course, we do not use the same background images for the train and validation sets, so that the model evaluation remains as unbiased as possible.
A Python script that generates images and associated labels in the right folders for YOLO model training is available in the GitHub repository.
Training and evaluating the model
Using the previously created dataset, we can now train an object detection model. Thanks to the Ultralytics library, which provides YOLOv8, just a few lines of code are needed to train a new model.
As we can see, all we have to do is to instantiate a model and train it on the data. After 100 epochs (or fewer if you hit the early stopping condition while training, as I did after about 80 epochs here), I got a mAP@50 of about 0.5, as we can see from the generated results below.
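For reference, a training run with the Ultralytics package could look roughly like the sketch below. The folder layout, the yolov8n.pt starting weights, the patience value and the function names are assumptions, not the settings used for the results above.

```python
def dataset_yaml_text(root):
    """Build the dataset description file content expected by Ultralytics."""
    return (
        f"path: {root}\n"
        "train: images/train\n"
        "val: images/val\n"
        "names:\n"
        "  0: marker\n"
    )


def train_marker_detector(data_yaml, epochs=100, patience=20):
    """Train a YOLOv8 model; `patience` sets the early stopping window."""
    from ultralytics import YOLO  # imported lazily: only needed for training

    model = YOLO("yolov8n.pt")  # assumed starting point: the nano checkpoint
    model.train(data=data_yaml, epochs=epochs, patience=patience, imgsz=640)
    return model.val()  # validation metrics, including mAP@50


# Example (not run here):
#   open("marker.yaml", "w").write(dataset_yaml_text("datasets/markers"))
#   metrics = train_marker_detector("marker.yaml")
```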
While the results are far from perfect, they are good enough for a model trained on synthetic data only. Let’s now test this model in real conditions with a webcam feed.
To do so, we can use the code in the following gist:
This code is quite straightforward:
- We load the model and get the webcam feed
- For each new image, we compute the model inference and display any detected bounding box
- We stop the feed upon hitting the escape key
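These three steps can be sketched as follows. This is an illustrative loop rather than the exact gist: the weights path best.pt, the window name and the function name are assumptions.

```python
ESCAPE_KEY = 27  # key code returned by cv2.waitKey for the escape key


def run_webcam(weights="best.pt", camera_index=0):
    """Show the webcam feed with detected markers until escape is pressed."""
    import cv2
    from ultralytics import YOLO

    model = YOLO(weights)
    capture = cv2.VideoCapture(camera_index)
    try:
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            result = model(frame, verbose=False)[0]
            # Each box is (x1, y1, x2, y2) in pixel coordinates
            for box in result.boxes.xyxy.tolist():
                x1, y1, x2, y2 = (int(v) for v in box)
                cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 255), 2)
            cv2.imshow("markers", frame)
            if cv2.waitKey(1) & 0xFF == ESCAPE_KEY:
                break
    finally:
        capture.release()
        cv2.destroyAllWindows()
```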
I ran this code with an image of a marker on my phone, and as we can see in the image below it worked pretty well.
While it does not detect the marker perfectly in all configurations, it works well enough for a model trained on synthetic data only. To get better results, I believe the data augmentation could be tweaked a bit, and of course real, labeled data would be very helpful.
Now that we have the first part of the pipeline done, let’s move to the second step: tag decoding.
We now have fully working code to both generate and detect our new fiducial marker.
Once you can detect a tag in an image, the next step is of course to decode it. Let’s start from a cropped image of a marker detected by our previously trained model.
I developed a decoding algorithm made of the following steps:
- Blob detection to detect the dots
- Outer circle detection and ellipse fitting
- Dots selection for homography computation
- Homography matrix computation and image unwarping
- And finally, marker decoding
The main idea is the following: as soon as I can match a detected marker with a reference marker (knowing the number of circle layers and dots per circle), I can decode it quite easily by checking whether the image is white or black at each expected dot location. But to do so, I first need to unwarp the image to make it match the reference marker.
Let’s go through these steps together.
Detecting the dots
The first task is to detect the dots in the image detected by the YOLO model.
From the input cropped image, we will apply the following image processing steps using OpenCV:
- Convert the image to grayscale
- Binarize the image with Otsu’s algorithm
- Find the dots with a blob detector
This is what the code in the following gist does:
As we can see, a lot of parameters are set for the blob detector, such as a minimum and maximum area, as well as a minimum circularity, in order to detect as many of the actual marker dots as possible. It took quite some time to fine-tune these parameters, but feel free to play with them.
Using this code on our input cropped image yields the following blob detection.
As we can see, the dots are well detected. The next step is to detect the outer circle.
Detecting the outer circle
Now, we need to detect the outermost circle layer (no matter the number of circles in a tag, this solution would generalize). This will allow us to then find the dots on the outer circle, so that we can finally unwarp the image.
To compute the ellipse, all we do is keep the larger dots (named keypoints in OpenCV) and fit an ellipse equation to those points. This is what the following code does:
When I apply this code and display the detected points as a scatter plot on which I display the fitted ellipse, I get the following result:
As we can see, the fitted ellipse is well defined and consistent with the dot positions. Note that, since a circle seen in perspective projects to an ellipse, fitting an ellipse keeps working even when the detected marker is deformed by perspective.
Now we need to find the dots that are actually on this ellipse. It is quite easy: we only need to find the dot locations that satisfy the ellipse equation we just computed, within a given threshold. This is done with the following piece of code:
Now that we know where the dots are, and which ones are on the outermost circle, we can use them to compute the homography matrix and unwarp the image.
Selecting dots for homography computation
The goal now is to find a few matching points with the reference image, in order to compute the homography matrix.
Based on this reference image above, we need to unwarp the detected blob with the right homography matrix.
In order to compute the homography matrix, we can simply use the OpenCV function findHomography. This function needs as input at least 4 matching points in the reference image and the input image, so that it can find the homography matrix. This homography matrix would then allow us to unwarp the detected image and match it with the reference.
From our detected blobs on the outermost circle, it is impossible to know where the dots were on the original reference image. So we will just select the longest chain of nearest-neighbor dots on the outermost circle, so that we can match them with the reference. To do so, there are two steps:
- Computing the adjacency matrix, so that we know, for each dot, what are its adjacent dots (if any)
- From the adjacency matrix, computing the longest chain of adjacent dots
For the first step, we can use the following code:
This code computes the adjacency matrix as a Python dict: each key is the index of a dot on the outermost circle, and the value is the list of indexes of its adjacent dots.
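A simple way to build such a dict is sketched below. The adjacency criterion used here (a distance below 1.5 times the median nearest-neighbor distance) is an assumption for illustration, not necessarily the rule used in the repository.

```python
import numpy as np


def adjacency(points, factor=1.5):
    """Map each dot index to the list of indexes of its adjacent dots."""
    pts = np.asarray(points, dtype=float)
    # Pairwise distances between all dots
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a dot is not its own neighbor
    threshold = factor * np.median(dist.min(axis=1))
    return {i: [int(j) for j in np.flatnonzero(dist[i] <= threshold)]
            for i in range(len(pts))}


print(adjacency([(0, 0), (1, 0), (2, 0), (5, 0)]))
# {0: [1], 1: [0, 2], 2: [1], 3: []}
```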
From this adjacency matrix, it is now quite easy to find the longest chain of adjacent neighbors. To do so, I used the following code:
This code will efficiently find the longest chain of adjacent dots, and return the list of their indexes.
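For illustration, the longest chain can be found with a small depth-first search over the adjacency dict. This sketch is not tuned for speed, but since a dot on a circle has at most two neighbors, the search stays cheap. The function name is hypothetical.

```python
def longest_chain(adjacency):
    """Return the longest path of adjacent dots as a list of indexes."""
    best = []

    def extend(path, visited):
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        for nxt in adjacency[path[-1]]:
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                extend(path, visited)
                path.pop()
                visited.remove(nxt)

    for start in adjacency:
        extend([start], {start})
    return best


graph = {0: [1], 1: [0, 2], 2: [1], 3: [], 4: [5], 5: [4]}
print(longest_chain(graph))  # [0, 1, 2]
```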
If we have at least 4 dots in this output, we can theoretically compute the homography matrix. Unfortunately, in most cases it won’t be really accurate, as the dots are almost on the same line, which does not allow the homography matrix to be computed accurately. To solve this problem, we will add one more dot, placed symmetrically with respect to the center: this will give a much more accurate homography computation.
We can find a symmetrical dot with respect to the center (computed while doing the ellipse fitting) with the following code:
Note that since we are on an ellipse, using the center estimation to find the symmetrical point to a given dot is not a 100% reliable method: it may output a wrong dot. This is something that we will keep in mind when computing the decoding.
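A minimal sketch of this lookup: mirror the chosen dot through the estimated center, then pick the closest detected dot. The function name is hypothetical, and as noted above the result can be off by a position or two.

```python
import numpy as np


def symmetric_dot(points, index, center):
    """Index of the dot closest to the mirror image of points[index]."""
    pts = np.asarray(points, dtype=float)
    mirrored = 2 * np.asarray(center, dtype=float) - pts[index]
    dist = np.linalg.norm(pts - mirrored, axis=1)
    dist[index] = np.inf  # never return the dot itself
    return int(np.argmin(dist))


pts = [(1, 0), (0, 1), (-1, 0.1), (0, -1)]
print(symmetric_dot(pts, 0, (0, 0)))  # 2
```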
At the end, we end up with the results in the following image, where the blue circles are the ones of the longest chain, and the red ones are the supposed symmetrical dots (one of them being part of the longest chain).
As we can see, we indeed selected the chain of 7 adjacent dots, and we selected another dot as the symmetrical one of the leftmost dot in the chain.
Unwarping the image
So now that we have selected a few dots in the input image, let’s get the matching dots in the reference image and compute the homography matrix. To do so, we need the following inputs:
- The positions of the selected dots in the cropped image: that’s what we just computed
- The equivalent positions of these dots in the reference image: that needs to be computed, knowing the reference marker
To compute these point locations in the reference image, we will use the following code.
Note that we have given one more degree of freedom with a parameter named symmetry_index_offset: it makes it possible to handle errors in the symmetrical dot computation by adding an offset to the symmetrical dot location.
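As a sketch of this computation, assume the reference dots are evenly spaced on a circle, the chain occupies positions 0, 1, 2 and so on, and the symmetrical dot of the first chain dot sits half a turn away, shifted by symmetry_index_offset. The radius, center and function name below are illustrative values only.

```python
import numpy as np


def reference_positions(chain_length, dots_per_circle=20, radius=100.0,
                        center=(128.0, 128.0), symmetry_index_offset=0):
    """Reference (x, y) of the chain dots, followed by the symmetrical dot."""
    indexes = list(range(chain_length))
    # The dot opposite to chain position 0, corrected by the offset
    indexes.append((dots_per_circle // 2 + symmetry_index_offset)
                   % dots_per_circle)
    angles = 2 * np.pi * np.asarray(indexes) / dots_per_circle
    return np.stack([center[0] + radius * np.cos(angles),
                     center[1] + radius * np.sin(angles)], axis=1)
```

The chain has to be listed in the same rotation direction as the reference, otherwise the unwarped image comes out mirrored.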
With the right dots locations in both the cropped image and the reference image, we can now compute the homography matrix and unwarp the image. To make sure we don’t make mistakes with the symmetrical dot, we will do that for an offset value in the range [-2, 2] with a step of 1, as we can see in the code snippet below:
What we do here is simply compute the homography matrix with the OpenCV function findHomography and then unwarp the image with warpPerspective. We do so for 5 offset values, so that we end up with 5 unwarped images.
The resulting images are the following:
As we can see, depending on the offset the unwarping result is quite different. Even if it’s quite easy with a visual inspection to understand that the offset of -1 is the right one, we want this check to be automated. We will handle this in the next step: the actual marker decoding.
Decoding the marker
From a given unwarped image, the last step is finally to decode the marker. We are really close, and this step is probably the easiest.
All we have to do is to check, for each expected dot location, the color of the unwarped image. Since the image went through an Otsu binarization, this is quite straightforward. We will just check if there is any black pixel in an area of 3×3 pixels around an expected dot location: if yes, then there is a dot; if no, then there is no dot.
This is basically what the above code does. Then, depending on the position, we assign a value, so that the output of this function is just a list of values. Finally, we look for a -1 value (meaning the expected quadrant with no black dot, check the section Designing our fiducial marker for a reminder about that), and rearrange the array to put it at the last index location.
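Here is a minimal sketch of such a reading function for a 2-layer marker. It assumes the expected dot locations are given as two lists (small dots on the inner circle, large dots on the outer one) and that, when several -1 values are found, the last one is moved to the end. Both choices are assumptions for illustration.

```python
import numpy as np


def has_black_pixel(binary, x, y):
    """True if any pixel in the 3x3 window centered on (x, y) is black."""
    col, row = int(round(x)), int(round(y))
    window = binary[max(row - 1, 0):row + 2, max(col - 1, 0):col + 2]
    return bool((window == 0).any())


def read_code(binary, inner_xy, outer_xy):
    """Return one value per position: -1 none, 0 small, 1 large, 2 both."""
    code = []
    for (xi, yi), (xo, yo) in zip(inner_xy, outer_xy):
        small = has_black_pixel(binary, xi, yi)
        large = has_black_pixel(binary, xo, yo)
        code.append(int(small) + 2 * int(large) - 1)
    if -1 in code:
        # Rotate the list so that the last -1 ends up at the last index
        last_empty = len(code) - 1 - code[::-1].index(-1)
        code = code[last_empty + 1:] + code[:last_empty + 1]
    return code
```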
For example, here are the computed codes for each of the 5 unwarped images:
- Offset -2: [0, 2, 0, -1, 1, -1, 0, 0, 0, 2, 2, 1, 2, 2, 0, 2, 2, 2, 0, -1]
- Offset -1: [2, 2, 2, 0, 2, 0, 1, 1, 1, 2, 2, 1, 2, 2, 0, 2, 2, 2, 0, -1]
- Offset 0: [0, -1, 2, 2, 0, 0, -1, 0, 0, -1, 0, 1, 2, 2, 0, 2, 2, 2, 0, -1]
- Offset 1: [-1, 2, 2, 2, 2, 0, -1, -1, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 0, -1]
- Offset 2: [-1, 2, 1, 2, 1, 0, 1, -1, -1, -1, -1, 0, 0, 0, 2, 2, 0, 0, 2, -1]
As we can see, there is only one image with one and only one -1 value, at the last index location: the unwarped image using an offset of -1. This is our correctly unwarped image (as we could see with a visual inspection), which allows us to actually decode the marker.
This code list being unique for each possible marker, you can either stop here, or compute a unique integer value. A unique value can be computed really easily with the following code snippet:
In our case, this would return -1 for all the wrongly unwarped images, and a value of 377667386 for the actual marker.
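One conversion that reproduces these numbers is to read the list as a base-3 number with the first value as the least significant digit, and to reject any list that does not contain exactly one -1 at the last index. The sketch below is a reconstruction under that assumption, with a hypothetical function name.

```python
def code_to_int(code, n_values=3):
    """Convert a decoded list to an integer, or -1 if the list is invalid."""
    if code.count(-1) != 1 or code[-1] != -1:
        return -1
    # Base-n_values number, first value = least significant digit
    return sum(value * n_values ** i for i, value in enumerate(code[:-1]))


good = [2, 2, 2, 0, 2, 0, 1, 1, 1, 2, 2, 1, 2, 2, 0, 2, 2, 2, 0, -1]
bad = [0, 2, 0, -1, 1, -1, 0, 0, 0, 2, 2, 1, 2, 2, 0, 2, 2, 2, 0, -1]
print(code_to_int(good))  # 377667386
print(code_to_int(bad))   # -1
```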
That’s it, we went all the way from an input image to an actual unique code! Let’s now recap and reflect on the limitations of what we did.
Now that we have all the building blocks, we just have to put them together to get a nice, custom fiducial marker decoder, that can replace a QR code!
As a recap, here are the steps in a fully working pipeline:
- From an input image, detect the markers with object detection
- For each object detected, crop the image and do the next steps
- Detect the dots with Otsu binarization and blob detection
- Find the outermost dots with ellipse fitting
- Compute the homography matrix using the longest chain of nearest-neighbor dots and a symmetrical dot
- Unwarp the image using the homography matrix
- Decode the tag using the reference image
And that’s it! I won’t make you write all the code on your own: everything is available on the GitHub repo, as well as a pre-trained object detection model.
In this repo, you will find Python scripts to run substeps (e.g. generate synthetic images, train an object detection model, etc.) as well as a Python script that runs the full pipeline with your webcam as input, so that you can test it out!
I hope you enjoyed this post and you learned something from it! I personally loved this project, because it uses machine learning as well as good old image processing.
Still, the algorithm I developed has a few limitations that I would love to overcome. Indeed, not all markers can be decoded:
- A marker with no more than 2 adjacent dots on the outermost circle would not be decoded properly
- Same for a marker with no symmetrical dots from the longest chain: it would give unreliable results because of an inaccurate homography matrix
Another limitation is that the homography sometimes mirrors the image during the unwarping, causing the code list to be reversed, and thus the final decoded integer value to be different.
If you have any idea to overcome those limitations, you are more than welcome to send me a message or even propose a pull request!
On another topic, the decoding here only gives an integer value. It is up to you to match this integer value with anything relevant (a link, an item, an image…) in your app to make it really useful. I believe it would be possible to decode such a marker as a list of ASCII characters directly, but I didn’t try it out myself: again, any contribution is more than welcome.
Original RUNE-Tag paper:
F. Bergamasco, A. Albarelli, E. Rodolà and A. Torsello, “RUNE-Tag: A high accuracy fiducial marker with strong occlusion resilience,” CVPR 2011, Colorado Springs, CO, USA, 2011, pp. 113–120, doi: 10.1109/CVPR.2011.5995544.
Original RUNE-Tag repository: https://github.com/artursg/RUNEtag