Android: Augmented reality (AR) app that overlays 3D object in the scene

[Download code for the project: GitHub]

In this post we show how to build a simple AR app that positions a 3D model at a particular location in the scene. We do not use predefined markers; instead we create a marker from the scene’s contents. If you are familiar with various AR SDKs but curious about the algorithms that they implement, then this post will give a peek into techniques used to solve the problem. In particular, we will see how to create a marker from the scene, match feature points between two images, estimate the camera pose, and render a 3D model at the marker’s location.

The output of the project on a Nexus 5 is shown below.


We use the double-tap gesture to pick a location in the scene as the reference. The red rectangle displayed in the video indicates the reference location. Please choose a location in your scene with sufficient features as the reference. You can change the reference marker location by pointing at a new location in the scene and double-tapping the screen again.

As mentioned before, our approach is largely based on native C++ code on Android, and there are a few prerequisites to understanding the project. It builds on concepts covered in previous tutorials, so familiarity with those posts is necessary to follow this one. We also assume that the reader has the required mathematical maturity and is familiar with basic concepts in computer vision.

Get the code


This project is available on GitHub. You can find more instructions to compile the project here.

This project requires devices with the armeabi-v7a ABI. I have not provided libraries for other ABIs in order to reduce the project and APK size. If your device supports a different ABI, then please get in touch with me for the required libraries.

Code overview


We retain the project structure from the previous tutorial. Let us look at the new and modified files (all paths are relative to <project_path>/app/src/main):
  • com.anandmuralidhar.cornerdetectandroid/SimpleARActivity.java: As always, we have only one activity in the project, and it is defined in this file.
  • com.anandmuralidhar.cornerdetectandroid/GestureClass.java: We bring back this file from an earlier tutorial that dealt with touch gestures. We only use the double-tap gesture in this project.
  • jni/nativeCode/simpleARClass/simpleARClass.cpp: It contains the methods of SimpleARClass, which implement all the AR-related algorithms to detect feature points and estimate the pose.
  • jni/nativeCode/common/myGLCamera.cpp: We have added a couple of functions to MyGLCamera that help in pose estimation.
  • jni/nativeCode/common/misc.cpp: We have added functions that draw the reference marker’s location and compute the camera intrinsic matrix.
  • assets/amenemhat: It contains OBJ, MTL, and JPEG files for a 3D model that we used in an earlier tutorial.

Match feature points between two images


Previously we had seen how to extract and highlight feature points in the scene. Now we will use feature points to identify the reference location in a scene. We will mainly focus on the class SimpleARClass since it contains all relevant methods.

In this project, when the user double-taps the screen, the scene currently captured in the camera’s preview is chosen as the reference image. This is the so-called “marker” in augmented reality. Then we begin tracking the marker’s position in new frames. If we are able to identify the reference image or marker in a new frame, then we mark it with a red rectangle that indicates its position in the new frame. The two steps are: creating a marker from the reference image, and matching the marker in a query (new) image.

Let us look at both these steps in detail.

How to create a marker?

A marker in augmented reality usually refers to a prominent image that is printed on a sheet of paper and placed in the scene. A search on the internet reveals this to be the most common example of augmented reality. But this approach places an additional constraint on the user: they need to print a marker. We avoid this restriction by creating a marker out of the scene’s contents. We assume that the user will double-tap while pointing at a planar surface, since the algorithms are designed to match against a planar marker.

When the user double-taps, we call SimpleARClass::DoubleTapAction and check if there are sufficient feature points in the scene:

if(DetectKeypointsInReferenceImage()) {
    trackingIsOn = true;
} else {
    trackingIsOn = false;
}

Note that we use the terms feature points and key points interchangeably to refer to the same thing. If DetectKeypointsInReferenceImage does not detect sufficient feature points, then we do not create a marker. DetectKeypointsInReferenceImage is similar to DetectAndHighlightCorners (covered in the previous tutorial) but also saves the gravity vector:

bool SimpleARClass::DetectKeypointsInReferenceImage() {

    //Detect feature points and descriptors in reference image
    cameraMutex.lock();
    cornerDetector->detectAndCompute(cameraImageForBack, cv::noArray(),
                                     referenceKeypoints, referenceDescriptors);
    cameraMutex.unlock();
    MyLOGD("Numer of feature points in source frame %d", (int)referenceKeypoints.size());

    if(referenceKeypoints.size() < MIN_KPS_IN_FRAME){
        return false;
    }

    // source gravity vector used to project keypoints on imaginary floor at certain depth
    gravityMutex.lock();
    sourceGravityVector.x = gravity[0];
    sourceGravityVector.y = gravity[1];
    sourceGravityVector.z = gravity[2];
    gravityMutex.unlock();

    return true;
}

We use OpenCV's detectAndCompute API to compute the key points as well as their descriptors. A feature descriptor is usually a vector that encodes the local neighborhood of a feature point. It lets us uniquely identify and match the feature point in a different image. Feature point detectors like ORB (which is used in this project), SIFT, SURF, etc., are evaluated on the basis of their ability to successfully match descriptors of feature points in a query image. Note that referenceKeypoints and referenceDescriptors are private variables that store the locations and descriptors for later use.

Match the marker in a query image

To match the marker, we compute the feature points and their descriptors for a new camera image. We then try to match the descriptors of the new feature points with the stored descriptors of the marker image. If we find a sufficient number of matches between the two sets, then we declare that the two images match (though we still need to estimate the new pose). The matching is done in SimpleARClass::MatchKeypointsInQueryImage. This function is called whenever a new camera preview image is available in ProcessCameraImage. We downscale the camera image in ProcessCameraImage to speed up processing and then call MatchKeypointsInQueryImage, as sketched below.
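As an illustration only (cameraPreviewImage and the 0.5 scaling factor are placeholder names/values, not necessarily what the project uses), the downscaling boils down to a single OpenCV call:

#include <opencv2/imgproc.hpp>

// shrink the preview frame before feature detection to cut the per-frame cost;
// matching quality degrades gracefully with moderate downscaling
cv::Mat downscaledImage;
cv::resize(cameraPreviewImage, downscaledImage, cv::Size(), 0.5, 0.5, cv::INTER_AREA);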

First we calculate new feature points and their descriptors in MatchKeypointsInQueryImage:

cameraMutex.lock();
cornerDetector->detectAndCompute(cameraImageForBack, cv::noArray(), queryKeypoints,
                                 queryDescriptors);
cameraMutex.unlock();

MyLOGD("Number of kps in query frame %d", (int) queryKeypoints.size());
if (queryKeypoints.size() == 0) {
    MyLOGD("Not enough feature points in query image");
    return false;
}

Next we match the above descriptors against the marker's descriptors using the k-nearest-neighbour algorithm:

std::vector<std::vector<cv::DMatch> > descriptorMatches;
std::vector<cv::KeyPoint> sourceMatches, queryMatches;
// knn-match with k = 2
matcher->knnMatch(referenceDescriptors, queryDescriptors, descriptorMatches, 2);

We have earlier initialized matcher to be of type BruteForce-Hamming in the constructor of SimpleARClass. Since we choose k = 2, for every feature point in the reference image we get the two key points in the query image whose descriptors are its closest matches.
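As a rough sketch (assuming OpenCV 3.x; the actual constructor in simpleARClass.cpp may use different parameters), the detector and matcher could be created as follows:

#include <opencv2/features2d.hpp>

// ORB gives binary descriptors that are cheap to compute and compare on a phone
cv::Ptr<cv::Feature2D> cornerDetector = cv::ORB::create(500);
// Hamming distance is the natural metric for ORB's binary descriptors
cv::Ptr<cv::DescriptorMatcher> matcher = cv::DescriptorMatcher::create("BruteForce-Hamming");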

Empirical studies have shown that simply choosing the closest neighbour of each feature point leads to poor matching performance. So we compare the distances of a feature point to both its neighbours and keep only those matches for which the distance to the closest neighbour is less than a fixed fraction (NN_MATCH_RATIO) of the distance to the second-closest neighbour:

for (unsigned i = 0; i < descriptorMatches.size(); i++) {
    if (descriptorMatches[i][0].distance < NN_MATCH_RATIO * descriptorMatches[i][1].distance) {
        sourceMatches.push_back(referenceKeypoints[descriptorMatches[i][0].queryIdx]);
        queryMatches.push_back(queryKeypoints[descriptorMatches[i][0].trainIdx]);
    }
}

Then we use the pairs of matching feature points to compute the homography matrix that relates the reference and query images:

homography = cv::findHomography(Keypoint2Point(sourceMatches),
                                Keypoint2Point(queryMatches),
                                cv::RANSAC, RANSAC_THRESH, inlierMask);

We do not extract the camera pose from the homography matrix and, strictly speaking, do not need to compute it, but we use its RANSAC inlier mask to remove outliers from the pairs of matching key points:

for (unsigned i = 0; i < sourceMatches.size(); i++) {
    if (inlierMask.at<uchar>(i)) {
        sourceInlierKeypoints.push_back(sourceMatches[i]);
        queryInlierKeypoints.push_back(queryMatches[i]);
    }
}

Then we use the homography matrix to draw the position of the marker in the new camera image:

DrawShiftedCorners(cameraImageForBack, homography);
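The helper itself lives in misc.cpp; as a rough, hypothetical sketch of the idea (assuming the reference and query frames share the same dimensions), the four corners of the reference image can be pushed through the homography with cv::perspectiveTransform and joined with red lines:

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

void DrawShiftedCornersSketch(cv::Mat &queryImage, const cv::Mat &homography) {
    // corners of the reference (marker) frame
    std::vector<cv::Point2f> srcCorners = {
            {0.f, 0.f},
            {(float) queryImage.cols, 0.f},
            {(float) queryImage.cols, (float) queryImage.rows},
            {0.f, (float) queryImage.rows}};
    std::vector<cv::Point2f> dstCorners;
    // map the reference corners into the query image through the homography
    cv::perspectiveTransform(srcCorners, dstCorners, homography);
    // join the mapped corners to outline the marker's location in red (BGR)
    for (size_t i = 0; i < dstCorners.size(); ++i) {
        cv::line(queryImage, dstCorners[i], dstCorners[(i + 1) % dstCorners.size()],
                 cv::Scalar(0, 0, 255), 4);
    }
}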

Pose estimation


Pose estimation is a rich area of study in computer vision, and it has subtle variants depending on the nature of the problem. In our case, we began by double-tapping to select a reference image as the marker. Then we identified this marker in a new camera preview frame. Now we want to compute the displacement of the marker with respect to the camera; this displacement is captured by a translation vector and a rotation vector that together correspond to the six degrees of freedom of the pose. In order to estimate the pose, we need to calculate the translation and rotation vectors. This is done in the function TrackKeypointsAndUpdatePose. This function is called by ProcessCameraImage if MatchKeypointsInQueryImage is able to successfully match feature points across the query and reference images. If we are not able to match the feature points, then we cannot estimate the pose for the new camera preview image.

In TrackKeypointsAndUpdatePose, we begin with projecting the reference image's key points on a floor that is assumed to be 75 units away from the device's camera:

sourceKeypointLocationsIn3D = myGLCamera->GetProjectedPointsOnFloor(sourceInlierPoints,
                                                                    sourceGravityVector,
                                                                    CAM_HEIGHT_FROM_FLOOR,
                                                                    cameraImageForBack.cols,
                                                                    cameraImageForBack.rows);

This is done to convert the 2D locations of the key points in the image to 3D locations in the scene. We have made two assumptions while computing the 3D locations. First, we assume that the feature points lie on the ground. Second, we assume that the device is at a height of CAM_HEIGHT_FROM_FLOOR above the plane of the reference image. The value of CAM_HEIGHT_FROM_FLOOR is not critical and can be chosen to be any reasonable number. The first assumption is critical: if the points do not lie on a planar surface, then the performance of the algorithm will be very poor.

The function GetProjectedPointsOnFloor is implemented in myGLCamera.cpp. It involves simple trigonometric calculations to determine the 3D location of a point. Briefly, we take a point's 2D coordinates in the image and use them to determine the point's location on the near plane. We create a ray that points from the camera's eye towards that point on the near plane. Then, using the gravity vector, we intersect the ray with the ground and obtain the point's 3D location on the floor. We skip the detailed derivation, but a rough sketch of the idea is given below.
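This is only a sketch of the underlying geometry (names, the FOV convention, and the sensor's sign convention are assumptions; the actual code in myGLCamera.cpp may differ):

#include <opencv2/core.hpp>
#include <glm/glm.hpp>
#include <cmath>

glm::vec3 ProjectPixelOnFloorSketch(const cv::Point2f &pixel, glm::vec3 gravity,
                                    float heightAboveFloor, float imageWidth,
                                    float imageHeight, float verticalFovDegrees) {
    // ray from the camera's eye (at the origin) through the pixel on the near plane;
    // the camera looks down the -Z axis in OpenGL convention
    float tanHalfFov = tanf(verticalFovDegrees * (float) M_PI / 180.f / 2.f);
    float x = (2.f * pixel.x / imageWidth - 1.f) * tanHalfFov * imageWidth / imageHeight;
    float y = (1.f - 2.f * pixel.y / imageHeight) * tanHalfFov;
    glm::vec3 ray = glm::normalize(glm::vec3(x, y, -1.f));

    // the floor is a plane at distance heightAboveFloor from the camera along the
    // unit gravity vector gHat (assumed here to point from the camera toward the floor)
    glm::vec3 gHat = glm::normalize(gravity);
    float t = heightAboveFloor / glm::dot(gHat, ray); // solve dot(gHat, t * ray) = heightAboveFloor
    return t * ray;                                   // 3D location of the keypoint on the floor
}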

Next we determine the camera's intrinsic matrix in TrackKeypointsAndUpdatePose, and this topic deserves a sub-section due to its importance in computer vision.

Determine camera's intrinsic matrix

We refer to OpenCV's documentation for the equations describing a pinhole camera model that is commonly used to model a scene in image processing. In this model, the camera intrinsic matrix is given by

K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix},

where f_x and f_y are the focal lengths along the X and Y axes respectively in pixel units, and (c_x, c_y) is the principal point, ideally located at the image center. We have ignored the distortion coefficients that model how real lenses distort the image. In order to determine this matrix, many augmented reality tutorials describe an extra step involving camera calibration, in which the camera's intrinsic parameters are estimated and used to construct the matrix. Such an approach is required when we are working with cameras whose intrinsic parameters are unknown. Since Android provides an API to query the field-of-view (FOV) of the device camera, we can instead determine the focal length of the camera from its FOV. We assume that the principal point is located at the center of the image.

Let us look at the function MyGLCamera::ConstructCameraIntrinsicMatForCV:

cv::Mat MyGLCamera::ConstructCameraIntrinsicMatForCV(float imageWidth, float imageHeight) {

    //derive camera intrinsic mx from GL projection-mx
    cv::Mat cameraIntrinsicMat = cv::Mat::zeros(3, 3, CV_32F);

    // fx, fy, camera centers need to be in pixels for cv
    // assume fx = fy = focal length
    // vertical FOV = 2 arctan((imageHeight / 2) / focalLength)
    float focalLength = imageHeight / 2 / tan((FOV * M_PI / 180)/2);
    cameraIntrinsicMat.at<float>(0, 0) = focalLength;
    cameraIntrinsicMat.at<float>(1, 1) = focalLength;

    // principal point = image center
    cameraIntrinsicMat.at<float>(0, 2) = imageWidth / 2;
    cameraIntrinsicMat.at<float>(1, 2) = imageHeight / 2;
    
    cameraIntrinsicMat.at<float>(2, 2) = 1.;    
    return cameraIntrinsicMat;
}

It uses the relation FOV = 2 arctan((imageHeight / 2) / focalLength) to calculate the focal length of the camera from the vertical FOV. The remaining entries of the intrinsic matrix are easy to fill.
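As a quick sanity check with purely illustrative numbers (not taken from any particular device): for a 1280×720 preview image and a vertical FOV of 60°, focalLength = 360 / tan(30°) ≈ 623.5 pixels, and the principal point is (c_x, c_y) = (640, 360).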

We query the horizontal FOV of the device (since we found this to be a reliable value on a few devices) in the Java method CameraClass::SaveCameraFOV and derive the vertical FOV from it. The vertical FOV is passed to SimpleARClass through the JNI call SetCameraParamsNative. Then the vertical FOV is passed to MyGLCamera while creating its object in SimpleARClass::PerformGLInits.
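A small sketch of the horizontal-to-vertical FOV conversion under the pinhole-camera assumption (the actual computation in CameraClass may differ):

#include <cmath>

// for a pinhole camera: tan(vFOV / 2) = tan(hFOV / 2) * (imageHeight / imageWidth)
float HorizontalToVerticalFov(float horizontalFovDegrees, float imageWidth, float imageHeight) {
    float halfHFovRadians = horizontalFovDegrees * (float) M_PI / 360.f;
    float halfVFovRadians = atanf(tanf(halfHFovRadians) * imageHeight / imageWidth);
    return halfVFovRadians * 360.f / (float) M_PI;
}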

Determine extrinsic parameters and estimate pose

The translation and rotation vectors are referred to as the extrinsic parameters of the camera. So far we have computed the 3D locations of feature points from the reference frame and the camera intrinsic matrix. We already know the 2D locations of the feature points in the camera preview image (query frame). With these three inputs we can solve for the translation and rotation vectors that determine the pose corresponding to the query frame. This is referred to as the perspective-n-point (PnP) problem. We use OpenCV's solvePnP function to solve it:

pnpResultIsValid = cv::solvePnP(sourceKeypointLocationsIn3D, queryInlierPoints,
                                cameraMatrix, distCoeffs,
                                rotationVector, translationVector);
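
For reference, solvePnP estimates the rotation R and translation t that best satisfy the pinhole projection relation (ignoring lens distortion), where (X, Y, Z) is a keypoint's 3D location on the assumed floor, (u, v) is its 2D location in the query image, K is the intrinsic matrix from the previous section, and s is a scale factor:

s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \, [R \mid t] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}.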

Remember that all these operations happen on the camera thread. Even though we have computed the extrinsic parameters, we cannot use them to render the 3D model in this thread. Let us see how to render on the GLES thread.

Render a 3D model with extrinsic parameters


Consider the function SimpleARClass::Render. At the beginning of this function, we render the camera image as the background texture, as described in the previous tutorial.

Note that the render thread executes much faster than the camera thread. By the time a new frame is processed on the camera thread, the Render function will have been called multiple times. Hence we need to save the translation and rotation vectors produced by solvePnP so that they can be used to render the 3D model while the next camera frame is being processed. We also need to flip the axes of the extrinsic parameters to be consistent with OpenGL ES's coordinate system:

// make a copy of pnp result, it will be retained till result is updated again
translationVectorCopy = translationVector.clone();
rotationVectorCopy = rotationVector.clone();

// flip OpenCV results to be consistent with OpenGL's coordinate system
translationVectorCopy.at<double>(2, 0) = -translationVectorCopy.at<double>(2, 0);
rotationVectorCopy.at<double>(0, 0) = -rotationVectorCopy.at<double>(0, 0);
rotationVectorCopy.at<double>(1, 0) = -rotationVectorCopy.at<double>(1, 0);
renderModel = true;

We set a flag renderModel to indicate that extrinsic results are available and can be used to render the model.

Then we update the Model matrix in MyGLCamera by passing the extrinsic parameters:

 
cv::Mat defaultModelPosition = cv::Mat::zeros(3,1,CV_64F);
defaultModelPosition.at<double>(2,0) = -CAM_HEIGHT_FROM_FLOOR;
myGLCamera->UpdateModelMat(translationVectorCopy, rotationVectorCopy, defaultModelPosition);

We pass an extra parameter, defaultModelPosition, that indicates the position of the 3D model in world space. Since we are assuming that the reference plane is at a distance of CAM_HEIGHT_FROM_FLOOR from the camera, we need to place the 3D model at the same distance from the camera in world space. Otherwise the model will appear to be "floating" above the reference plane. You can verify this by passing some value other than CAM_HEIGHT_FROM_FLOOR in the above code.

Let us look at the function MyGLCamera::UpdateModelMat. The first part of the function converts the translation vector to newModelMat, and this is fairly straightforward. Then we use OpenCV's Rodrigues function to convert the rotation vector to a rotation matrix. The rotation vector is a compact axis-angle representation stored as a 3-tuple: its direction is the axis of rotation and its magnitude is the rotation angle in radians.

cv::Mat newRotationMat;
cv::Rodrigues(rotationVector, newRotationMat);
newRotationMat.copyTo(newModelMat(cv::Rect(0, 0, 3, 3)));

The rotation matrix is copied into the top left corner of the model matrix. This is similar to the rotation-translation matrix as described in OpenCV's documentation.
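In other words, newModelMat ends up with the familiar rigid-transform layout

\text{newModelMat} = \begin{bmatrix} R & t \\ 0_{1 \times 3} & 1 \end{bmatrix},

where R is the 3x3 rotation matrix produced by Rodrigues and t is the translation vector.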

Then we create a glm::mat4 from OpenCV's Mat:

// convert to float and transpose: OpenCV stores matrices in row-major order while glm expects column-major data
newModelMat.convertTo(newModelMat, CV_32F);
newModelMat = newModelMat.t();
modelMat = glm::make_mat4((float *) newModelMat.data);

We use defaultModelPosition to create a translation matrix that corresponds to the default position of the 3D model in world space. This is multiplied with modelMat to obtain the new Model matrix that is used to render the model, as sketched below.
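A minimal sketch of that step (the multiplication order here is an assumption about how the project composes the matrices):

#include <glm/gtc/matrix_transform.hpp>

// translation that pushes the model down to the assumed floor in world space
glm::mat4 translateToFloor = glm::translate(glm::mat4(1.0f),
        glm::vec3((float) defaultModelPosition.at<double>(0, 0),
                  (float) defaultModelPosition.at<double>(1, 0),
                  (float) defaultModelPosition.at<double>(2, 0)));
// apply the floor offset before the pose-derived transform
modelMat = modelMat * translateToFloor;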

Back to SimpleARClass::Render; we obtain the new MVP matrix after aligning it with the current gravity vector using GetMVPAlignedWithGravity. Finally we render the 3D model (we loaded the OBJ using Assimp, as described in an earlier tutorial). The 3D model will always appear at the same location in the scene as long as we are able to track the reference marker in the query image. Furthermore, it will appear to be standing upright on the floor, since we use the gravity vector to align it. This creates the illusion that the model is actually present in the scene, which is referred to as "verisimilitude" in AR parlance.

Note: In this project we use "tracking" to refer to matching feature points in the query image with those from the marker image. This is a computationally intensive operation and cannot be performed in real time even on many high-end phones. AR apps generally use faster algorithms to track the object and may use such a matching technique to occasionally correct the extrinsic parameters. For the sake of simplicity we avoid introducing tracking algorithms in this project.

22 thoughts on “Android: Augmented reality (AR) app that overlays 3D object in the scene”

  1. These are far and away the best examples and discussions about integrating OpenCV, OpenGL and Android NDK anywhere, with examples of the Experimental gradle build system thrown in too! Brilliantly clear and a huge help to those of us like me coming to image-processing in Android for the first time. Many, many thanks.

  2. Hello,
    I found your tutorial really helpful, as I was stuck trying to do something like this for a long time and NDK was a headache for me.
    I want to show a video instead of the 3D model. I have done that earlier in an Android app using VideoView, but I don't know how to modify this project to display a video when the image is matched. Please help!

    1. Hi Meghal,
      Apologies for the long delay in replying! Strangely I did not get notifications for new comments since last month.
      I’ve not used VideoView so am not sure how to modify it. But it should be easy to display a video in a certain perspective if you read it as a stream of images (or OpenCV Mats). We are using a certain perspective transform to display a 3D model in this project, instead you can use the same perspective transform on the OpenCV Mat so that the image (or video) is overlaid at the original location in the scene. Hope this helps.

    1. We do not save the original Mat but save the keypoint locations and their descriptors in referenceKeypoints and referenceDescriptors respectively. This is covered in more detail in this section.

      1. Ok thanks, I got it. Can you tell me how you got your life sorted with NDK? Please suggest any tutorial you followed for combining Android, OpenCV and OpenGL.

    1. Hi Meghal – I ran it successfully on a Galaxy S5 – it’s much more about having the correct versions of the Android NDK and OpenCV than it is about the specific hardware.

            1. Mike,
              Thanks for your answers 🙂

              Meghal,
              You can look at the logcat and see if there are any errors as Mike pointed out. Did the previous projects work on your device?

  3. You will need to step through the code in the debugger to see where it fails and see if there are any clues in any error messages or return codes – we can’t guess at the problem without more information I’m afraid. Good luck!

  4. I got back to this awesome tutorial again and thought to change it so as to be able to add two markers, but it does not recognise the previous one. How do I add two markers here? What part of the code should I put in a loop, if any?

  5. Hi Meghal,
    This tutorial is mainly designed to recognise only one marker. You will need to go through the logic in the project and make a few changes to save two markers and compare against both of them when you receive a query image. It will require changes to all the private functions of SimpleARClass to store the keypoints for multiple markers and match against them.
    You will need to add some variables that determine how many markers you want to save, and the logic to replace existing markers with new ones when you double-tap the screen.

    1. You cannot measure the size accurately unless you have an object in the scene whose size is known. There are other techniques to guess the size based on the neighbourhood.
      It is a different problem than the one we are trying to solve here.

      1. Can you explain to me how to do that? Can I use the formula

        (u v 1) = A [R|t] (Xw Yw Zw)

        where (u v 1) is the 2D point from the image,
        A is the camera matrix,
        R is the rotation matrix,
        t is the translation vector,
        and (Xw Yw Zw) is the 3D world point,

        to measure the actual size?
