3-D Object Segmentation for Robot Handling

A commercial humanoid service robot needs to be able to perform human-like tasks. In a medical setting, one such task would be to bring medicine to a patient. The robot would need to detect the medicine bottle and move its hand to the bottle to pick it up. Locating and picking up a medicine bottle is trivial for a human. What does it take to enable a robot to do the same task? This, in fact, is a challenging problem. A robot tries to make sense of its environment based on the visual information it receives from a camera. Even with that information, creating efficient algorithms to identify an object of interest in an image, calculate the location of the robot’s arm in space, and enable the arm to pick the object up is a daunting task. For our senior capstone project at Portland State University, we researched techniques that would enable a humanoid robot to locate and identify a common object (e.g., a medicine bottle) and acquire real-time position information about the robot’s hand in order to guide it to the target object. We used an InMoov open-source, 3-D printed humanoid robot for this project (see Photo 1).

Photo 1: The InMoov robot built at Portland State University’s robotics lab



In the field of computer vision, there are two dominant approaches to this problem—one using pixel-based 2-D imagery and another using 3-D depth imagery. We chose the 3-D approach because of the availability of state-of-the-art open source algorithms, and because of the recent influx of cheap stereo depth cameras, like the Intel RealSense R200.

Solving this problem requires a proper combination of hardware and software, along with a physical robot to implement the concept. We used an Intel RealSense R200 depth camera to collect 3-D images, and an Intel NUC with a 5th Generation Core i5 to process the 3-D image information. For software, we used the open-source Point Cloud Library (PCL) to process 3-D point cloud data.[1] PCL contains several state-of-the-art 3-D segmentation and recognition algorithms, which made it easier for us to compare our design with other works in the same area. The robot arm and object positions computed by our algorithms are published to the robot via the Robot Operating System (ROS). They can then be used by other modules, such as a robot arm controller, to move the robot hand.


Object segmentation is widely applied in computer vision to locate objects in an image.[2] The basic architecture of our package, as well as many others in this field, is a sequence of processing stages—that is, a pipeline. The segmentation pipeline starts with capturing an image from a 3-D depth camera. By the last stage of the pipeline, we have obtained the location and boundary information of the objects of interest, such as the hand of the robot and the nearest grabbable object.

Figure 1: 3-D object segmentation pipeline


The object segmentation pipeline of our design is shown in Figure 1. There are four main stages in our pipeline: downsampling the input raw image, using RANSAC and plane extraction algorithms, using the Euclidean Clustering technique to segment objects, and applying a bounding box to separate objects. Let’s review each one.

The raw point clouds coming from the camera have a resolution that is far too high for segmentation to be feasible in real time. The basic technique for solving this problem is called “voxel filtering,” which entails compressing several nearby points into a single point.[3] In other words, all points in a specified cubic region of space are combined into a single point. The parameter that controls the size of this volume element is called the “leaf size.” Figure 2 shows an example of applying the voxel filter with several different leaf sizes. As the leaf size increases, the point cloud density decreases.
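To make the idea concrete, here is a minimal sketch of voxel filtering in plain Python. It is not PCL's implementation: the function name voxel_downsample, the choice of the centroid as the representative point, and the synthetic test cloud are all assumptions made for illustration.

```python
from collections import defaultdict
from math import floor


def voxel_downsample(points, leaf_size):
    """Combine all points that fall in the same cubic voxel into one point.

    Assumption: the representative point is the centroid of the voxel's
    members. `leaf_size` is the edge length of each cubic voxel.
    """
    if leaf_size <= 0:
        raise ValueError("leaf_size must be positive")

    # Group points by the integer index of the voxel that contains them.
    voxels = defaultdict(list)
    for x, y, z in points:
        key = (floor(x / leaf_size), floor(y / leaf_size), floor(z / leaf_size))
        voxels[key].append((x, y, z))

    # Replace each group with the centroid of its members.
    downsampled = []
    for members in voxels.values():
        n = len(members)
        downsampled.append(tuple(sum(axis) / n for axis in zip(*members)))
    return downsampled


# Hypothetical test cloud: a flat 10 x 10 patch of points with 1 cm spacing.
cloud = [((i + 0.5) * 0.01, (j + 0.5) * 0.01, 0.5)
         for i in range(10) for j in range(10)]

for leaf in (0.01, 0.02, 0.05):
    print(f"leaf size {leaf} m -> {len(voxel_downsample(cloud, leaf))} points")
```

Running the loop at the end prints the number of points that survive for each leaf size, which is a quick way to pick a leaf size that keeps enough detail for the objects of interest.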

Figure 2: Down-sampling results for different leaf sizes


Random sample consensus (RANSAC) is a fast, iterative method for fitting a mathematical model to data that contains outliers. In the case of a plane, RANSAC repeatedly generates candidate planes at different orientations and positions in the scene, looking for the plane with the most data points that fit the model (i.e., inliers). The two parameters used are the threshold distance and the number of iterations. The greater the threshold, the thicker the plane can be. The more iterations RANSAC is allowed, the greater the probability of finding the plane with the most inliers.

Figure 3: The effects of varying the number of iterations of RANSAC. Notice that the plane on the left (a), which only used 200 iterations, was not correctly identified, while the one on the right (b), with 600 iterations, was correctly identified.


Figure 3 shows what happens as the number of iterations is changed. The blue points represent the original data. The red points represent the plane inliers. The magenta points represent the noise (i.e., outliers) remaining after a prism extraction. In the image on the left, the plane of the table was not found because RANSAC was not given enough iterations. In the image on the right, the plane was found, and the objects above the plane are properly segmented from the original data.
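The roles of the threshold distance and the iteration count can be seen in a small sketch of RANSAC plane fitting. This is a plain-Python illustration under stated assumptions, not the PCL code used in the project: each candidate plane is built from three randomly sampled points (a common way to generate candidates), and the function names, the fixed random seed, and the synthetic table-plus-object cloud are all hypothetical.

```python
import random


def plane_from_points(p1, p2, p3):
    """Return (a, b, c, d) with a unit normal for the plane through three
    points, or None if the points are (nearly) collinear."""
    ux, uy, uz = (p2[i] - p1[i] for i in range(3))
    vx, vy, vz = (p3[i] - p1[i] for i in range(3))
    # The normal is the cross product of two edge vectors.
    a = uy * vz - uz * vy
    b = uz * vx - ux * vz
    c = ux * vy - uy * vx
    norm = (a * a + b * b + c * c) ** 0.5
    if norm < 1e-12:
        return None
    a, b, c = a / norm, b / norm, c / norm
    d = -(a * p1[0] + b * p1[1] + c * p1[2])
    return a, b, c, d


def ransac_plane(points, threshold, iterations, seed=0):
    """Return (model, inlier_indices) for the candidate with the most inliers."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    best_model, best_inliers = None, []
    for _ in range(iterations):
        model = plane_from_points(*rng.sample(points, 3))
        if model is None:
            continue
        a, b, c, d = model
        # A point is an inlier if it lies within `threshold` of the plane.
        inliers = [i for i, (x, y, z) in enumerate(points)
                   if abs(a * x + b * y + c * z + d) <= threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers


# Hypothetical scene: a slightly noisy table top near z = 0 plus one object.
data_rng = random.Random(1)
table = [(i * 0.05, j * 0.05, data_rng.uniform(-0.001, 0.001))
         for i in range(15) for j in range(15)]
obj = [(data_rng.uniform(0.30, 0.35), data_rng.uniform(0.30, 0.35),
        data_rng.uniform(0.05, 0.12)) for _ in range(40)]
cloud = table + obj

model, inliers = ransac_plane(cloud, threshold=0.01, iterations=200)
print("plane coefficients:", model)
print("inliers:", len(inliers), "of", len(cloud))
```

Removing the returned inlier indices from the cloud leaves the points above the table, which is the input the clustering stage expects. Lowering the iterations argument makes it more likely that no well-fitting candidate is ever sampled.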

After RANSAC and plane extraction in the segmentation pipeline, Euclidean Clustering is performed. This process takes the down-sampled point cloud (without the plane and its convex hull) and breaks it into clusters. Ideally, each cluster corresponds to one of the objects on the table.[4] This is accomplished by first creating a kd-tree data structure, which stores the remaining points in the cloud in a way that can be searched efficiently. The algorithm then iterates over the points in the cloud, performing a radius search for each one. Neighboring points within the threshold radius are added to the current cluster and marked as processed. This continues until every point in the cloud has been marked as processed and assigned to a cluster, at which point the algorithm terminates. After the object segmentation and recognition have been performed, the robot knows which object to pick up, but it doesn’t know the boundaries of the object.
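The clustering loop described above can be sketched in a few lines of Python. To keep the example self-contained, a simple grid-based spatial hash stands in for the kd-tree; that substitution, the function name euclidean_clusters, the min_size parameter, and the two synthetic blobs are assumptions for illustration rather than the project's actual code.

```python
from collections import defaultdict, deque
from math import dist, floor


def euclidean_clusters(points, radius, min_size=1):
    """Group points so that each point is within `radius` of at least one
    other point in its cluster. Returns a list of lists of point indices."""

    def cell(p):
        # Grid cell index; the cell edge equals the search radius.
        return tuple(floor(c / radius) for c in p)

    # Spatial hash used here in place of a kd-tree for the radius search.
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        grid[cell(p)].append(idx)

    def neighbors(idx):
        cx, cy, cz = cell(points[idx])
        # Any point within `radius` must be in one of the 27 adjacent cells.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                        if j != idx and dist(points[idx], points[j]) <= radius:
                            yield j

    processed = [False] * len(points)
    clusters = []
    for seed in range(len(points)):
        if processed[seed]:
            continue
        # Grow a new cluster outward from this unprocessed point.
        processed[seed] = True
        queue = deque([seed])
        cluster = []
        while queue:
            i = queue.popleft()
            cluster.append(i)
            for j in neighbors(i):
                if not processed[j]:
                    processed[j] = True
                    queue.append(j)
        if len(cluster) >= min_size:
            clusters.append(cluster)
    return clusters


# Hypothetical input: two 3 x 3 x 3 blobs of points, 20 cm apart.
blob_a = [(i * 0.01, j * 0.01, k * 0.01)
          for i in range(3) for j in range(3) for k in range(3)]
blob_b = [(x + 0.2, y, z) for x, y, z in blob_a]

clusters = euclidean_clusters(blob_a + blob_b, radius=0.015)
print("cluster sizes:", sorted(len(c) for c in clusters))
```

The radius plays the same role as the threshold radius in the text: too small and a single object splits into several clusters, too large and neighboring objects merge into one.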

Saroj Bardewa (saroj@pdx.edu) is pursuing an MS in Electrical and Computer Engineering at Portland State University, where he earned a BS in Computer Engineering in June 2016. His interests include computer architecture, computer vision, machine learning, and robotics.

Sean Hendrickson (hsean@pdx.edu) is a senior studying Computer Engineering at Portland State University. His interests include computer vision and machine learning.

This complete article appears in Circuit Cellar 320 (March 2017).