
PROCESSING IMAGES IN HADOOP USING MAPREDUCE

HIPI:

HIPI is the Hadoop Image Processing Interface. It provides a set of tools and an input format for processing large numbers of images using Hadoop's Distributed File System (HDFS) and MapReduce.

STEPS INVOLVED:

In HIPI, the entire process can be divided into two parts.

1) Converting all images into a single bulk file (a HIPI Image Bundle).

2) Processing the created bulk file of images using HIPI's image input formats.
    In this step, HIPI's culler class can be used to filter out images, for example those with low clarity or defects.




ISSUES WITH HIPI:

To simulate my bulk image processing scenario, I used a Java program to create multiple copies of the same image, with different names, in a single directory. Then, using HIPI's utility, I converted all the images into a bulk file (known as the HIP file).
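A minimal sketch of such a copy program is given here for illustration. It is not the original program: the class name ImageDuplicator and the copy_00001-style file names are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper (not the original program): fills a directory with
// identically-sized test images by copying one source file many times.
class ImageDuplicator {

    // Copies `source` into `targetDir` as copy_00001.<ext>, copy_00002.<ext>, ...
    // and returns the number of files written.
    static int duplicate(Path source, Path targetDir, int copies) throws IOException {
        Files.createDirectories(targetDir);
        String name = source.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String extension = dot >= 0 ? name.substring(dot) : "";
        for (int i = 1; i <= copies; i++) {
            Path target = targetDir.resolve(String.format("copy_%05d%s", i, extension));
            Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return copies;
    }
}
```

Because the number of copies is known in advance, counting the files that come back from a bundle round trip is a simple way to see whether any images were dropped.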

To check whether all the images exist in the bulk file, I ran the reverse process (converting the HIP file back into multiple images), for which HIPI also provides a utility. But I did not get all the images back and found that I was losing some images. (If you do not see this issue, please let me know.)

I could not proceed with my POC using HIPI, so I decided to create a new framework to process bulk images using MapReduce.


NEW IMAGE PROCESSING FRAMEWORK:

To avoid spawning one map task per image file, we must do what HIPI does: convert all the images into a single bundle file.

This bundle file is given as the input to MapReduce. The image input format parses the bundle file and creates a BufferedImage object for each image.
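The on-disk layout of the bundle file is not described in this post. As a rough sketch only, one simple possibility is a sequence of length-prefixed records. The class name ImageBundleWriter and the record layout (a 4-byte length followed by the encoded image bytes) are assumptions made for illustration.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical bundle writer. Assumed record layout (not taken from the post):
// [4-byte big-endian length][encoded image bytes], repeated once per image.
class ImageBundleWriter {

    // Appends every image file to `out` as one record and returns the record count.
    static int writeBundle(List<Path> images, OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        int count = 0;
        for (Path image : images) {
            byte[] encoded = Files.readAllBytes(image); // keep the JPEG/PNG bytes as they are
            data.writeInt(encoded.length);
            data.write(encoded);
            count++;
        }
        data.flush();
        return count;
    }
}
```

To place the bundle on HDFS, the OutputStream could be the stream returned by Hadoop's FileSystem.create(path). Storing the encoded bytes rather than decoded pixels keeps the bundle close to the combined size of the source files.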


IMAGE INPUT FORMAT-IMPORTANT CLASSES:

Image Combiner:
    • Merges multiple images into a single bundle file.
ImageInputFormat:
    • Returns an ImageRecordReader and manages the splits.
ImageRecordReader:
    • Manages the reading of each split.
    • Performs the initial seek of the file pointer to the start of each split.
    • Its nextKeyValue() method reads each image from the split and converts it to a BufferedImage.
BufferedImageWritable:
    • Since the key and value classes of MapReduce must be Writable (serializable) types, we cannot use BufferedImage directly as the value in the map method.
    • This is a wrapper class that simply holds the BufferedImage.
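The core of what nextKeyValue() does, reading one record and decoding it into a BufferedImage, can be sketched without any Hadoop classes. The class name ImageBundleReader and the record layout (a 4-byte length followed by the encoded image bytes) are assumptions for illustration. The sketch also ignores split boundaries, which a real record reader has to handle.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical reader for an assumed record layout:
// [4-byte big-endian length][encoded image bytes], repeated once per image.
class ImageBundleReader {

    // Reads and decodes the next image, or returns null at the end of the stream.
    static BufferedImage readNext(DataInputStream in) throws IOException {
        int length;
        try {
            length = in.readInt();
        } catch (EOFException endOfBundle) {
            return null; // no more records
        }
        byte[] encoded = new byte[length];
        in.readFully(encoded);
        // ImageIO.read returns null when no registered reader understands the bytes.
        BufferedImage image = ImageIO.read(new ByteArrayInputStream(encoded));
        if (image == null) {
            throw new IOException("Record is not a decodable image");
        }
        return image;
    }
}
```

In an actual RecordReader, the DataInputStream would wrap the stream opened on the split's file, and each decoded image would be placed into a BufferedImageWritable before being handed to the mapper.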


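import java.awt.image.BufferedImage;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Wrapper that lets a BufferedImage travel as a MapReduce value.
public class BufferedImageWritable implements WritableComparable<BufferedImageWritable> {

    BufferedImage img;

    @Override
    public void readFields(DataInput in) throws IOException {
        // Intentionally empty: the image is never deserialized in this POC.
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Intentionally empty: the image is never written back to HDFS.
    }

    @Override
    public int compareTo(BufferedImageWritable other) {
        // The image is never used as a key, so ordering does not matter.
        return 0;
    }
}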


I did not implement the readFields(), write(), and compareTo() methods, since I was not storing the input image. The purpose of my POC is to detect faces and compute the total face count.

If you want to write any images back to HDFS (in the map or the reduce phase), you will have to implement write() and readFields(). In write(), you need the logic to store the image, much as the images were written when creating the bulk file. readFields() should contain the inverse logic of write(). compareTo() need not be implemented, since the image is never used as a key in MapReduce (compareTo() is invoked during the sort phase of MapReduce).
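As a rough sketch of what such write() and readFields() logic could look like, the helper below encodes the image as PNG and writes it with a length prefix, using only java.io and javax.imageio. The class name BufferedImageCodec and the choice of PNG are assumptions, not part of the framework described here.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper holding the logic that write() and readFields() could delegate to.
class BufferedImageCodec {

    // Serializes the image as [4-byte length][PNG bytes]. PNG is lossless, so pixels survive.
    static void write(BufferedImage img, DataOutput out) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        if (!ImageIO.write(img, "png", buffer)) {
            throw new IOException("No PNG writer available for this image type");
        }
        byte[] encoded = buffer.toByteArray();
        out.writeInt(encoded.length);
        out.write(encoded);
    }

    // Inverse of write(): reads the length prefix, then decodes the PNG bytes.
    static BufferedImage readFields(DataInput in) throws IOException {
        int length = in.readInt();
        byte[] encoded = new byte[length];
        in.readFully(encoded);
        return ImageIO.read(new ByteArrayInputStream(encoded));
    }
}
```

Inside BufferedImageWritable, write(out) would then call BufferedImageCodec.write(img, out), and readFields(in) would assign the decoded result back to img.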



The image arrives in the map's value as a BufferedImage (the standard Java class for in-memory images), and most image operations are easy to perform on a BufferedImage. With HIPI, in contrast, the image arrives in the map's value as HIPI's FloatImage class, which you may find harder to manipulate.
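As one small example of an operation on a BufferedImage, the sketch below converts an image to grayscale using only the standard library. The class name GrayscaleConverter is hypothetical, and this step is shown only as an illustration, not as part of the face detection code mentioned in this post.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical example of a typical BufferedImage operation inside a mapper.
class GrayscaleConverter {

    // Returns a new 8-bit grayscale copy of `source`; the input is left unchanged.
    static BufferedImage toGray(BufferedImage source) {
        BufferedImage gray = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D graphics = gray.createGraphics();
        try {
            // Drawing onto a gray image lets Java2D perform the color conversion.
            graphics.drawImage(source, 0, 0, null);
        } finally {
            graphics.dispose();
        }
        return gray;
    }
}
```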


I have successfully implemented a face detection program using this custom input format and OpenCV.

The code that I used to develop it will be shared soon on GitHub.

For more information about HIPI, see http://dinesh-malav.blogspot.com/2015/05/image-processing-using-opencv-on-hadoop.html
