Extracting Images From PDF Documents

Discussion in 'Java Update' started by Joe, 26/7/19.

  1. Joe

    Joe VIP Member

    Extracting Images out of a PDF Document

    Today I'll show you how to build your own API that extracts all embedded images from a PDF document. As we know, PDF documents can contain images. But the question is: what kind of images? Let's take a look at the raw content of a PDF document:
    Code:
    25504446 2D312E34 0A25C3A4 C3BCC3B6 C39F0A32 | %PDF -1.4 .%.. .... ...2 | 0
    2030206F 626A0A3C 3C2F4C65 6E677468 20332030 |  0 o bj.< </Le ngth  3 0 | 1
    20522F46 696C7465 722F466C 61746544 65636F64 |  R/F ilte r/Fl ateD ecod | 2
    . . .                                        | . . .
    5247422F 46696C74 65722F44 43544465 636F6465 | RGB/ Filt er/D CTDe code | 19
    2F4C656E 67746820 36333831 38202F53 4D61736B | /Len gth  6381 8 /S Mask | 20
    20362030 2052203E 3E0A7374 7265616D 0AFFD8FF |  6 0  R > >.st ream .... | 21
    E000104A 46494600 01010000 01000100 00FFDB00 | ...J FIF. .... .... .... | 22
    . . .                                        | . . .
    25216C11 DD156FA2 56B804DD 621008E1 A4BBEB2D | %!l. ..o. V... b... ...- | 3210
    36A7E205 20B3190B FC927FF6 876A1B1A 124CF40D | 6...  ... .... .j.. .L.. | 3211
    7542B496 32FA4F14 9E82BAB2 44FFD90A 656E6473 | uB.. 2.O. .... D... ends | 3212
    74726561 6D0A656E 646F626A 0A0A3620 30206F62 | trea m.en dobj ..6  0 ob | 3213
    . . .                                        | . . .
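    Very confusing, isn't it? Well, in this document the embedded images are stored as JPEG data (note the /Filter/DCTDecode entry in the dump above); PDF writers often do this even when the original picture was a png, gif, etc. Such a JPEG stream can be found in the confusing hex content as follows:
    • Starting with: FF D8 FF E0 00 10 4A 46 49 46, where the last four bytes (4A 46 49 46) are the ASCII characters JFIF
    • Ending with: FF D9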
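The two markers above can be tried out on a small in-memory buffer before touching a real PDF. The following is a minimal sketch, not the PDFImages.java class from this post: the class name JpegMarkers and its two methods are hypothetical helpers made up for illustration. It also assumes that the first FF D9 after the start really is the end of the image, which may be too early if a JPEG carries an embedded thumbnail.

```java
// Hypothetical helper for illustration only (not part of PDFImages.java).
final class JpegMarkers {
    // JFIF start signature: FF D8 FF E0 00 10 'J' 'F' 'I' 'F'
    private static final byte[] START = {
        (byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0, 0x00, 0x10, 0x4A, 0x46, 0x49, 0x46
    };

    /** Returns the index of the first start signature at or after 'from', or -1 if there is none. */
    static int indexOfStart(byte[] data, int from) {
        for (int i = Math.max(from, 0); i + START.length <= data.length; i++) {
            int j = 0;
            while (j < START.length && data[i + j] == START[j]) j++;
            if (j == START.length) return i;
        }
        return -1;
    }

    /** Returns the index just past the first FF D9 at or after 'from', or -1 if there is none. */
    static int indexAfterEnd(byte[] data, int from) {
        for (int i = Math.max(from, 0); i + 1 < data.length; i++) {
            if (data[i] == (byte) 0xFF && data[i + 1] == (byte) 0xD9) return i + 2;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Synthetic buffer: 3 filler bytes, the start signature, 2 payload bytes, FF D9, 1 filler byte.
        byte[] data = { 1, 2, 3,
            (byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0, 0x00, 0x10, 0x4A, 0x46, 0x49, 0x46,
            0x55, 0x66, (byte) 0xFF, (byte) 0xD9, 9 };
        int start = indexOfStart(data, 0);
        int end = indexAfterEnd(data, start);
        System.out.println("start=" + start + " end=" + end); // prints "start=3 end=17"
    }
}
```

The bytes from start (inclusive) to end (exclusive) are then one candidate JPG file.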
    So, knowing the starting point and the ending point, it's a piece of cake to extract all embedded images from a PDF document and render them as JPG/JPEG images with Swing or JavaFX. Below I show how to implement the PDFImages.java API with three static methods:
    • swingImages(String pdfDoc) returns the JPG images as BufferedImage objects, which is useful for Swing applications.
    • jfxImages(String pdfDoc) returns the JPG images as JavaFX Image objects, which is useful for JavaFX applications.
    • extractImages(String pdfDoc) returns a List of byte[], one per image. Each byte[] holds the content of the corresponding JPG file.
    PDFImages.java

    Java:
    import java.util.*;
    import java.nio.file.Files;
    import java.io.*;
    import java.awt.image.BufferedImage;
    import javax.imageio.ImageIO;
    import javafx.scene.image.Image;
    // Joe Nartca (C)
    public class PDFImages {
        /**
        * Load all images from a PDF file as Swing/AWT images
        * @param pdfDoc  String, file name of the PDF file
        * @return List<java.awt.image.BufferedImage> Java BufferedImages found in the PDF file
        * @exception Exception thrown by Java (e.g. NullPointerException, IOException, etc.)
        */
        public static List<java.awt.image.BufferedImage> swingImages(String pdfDoc) throws Exception {
          List<byte[]> list = extractImages(pdfDoc);
          ArrayList<BufferedImage> Imgs = new ArrayList<java.awt.image.BufferedImage>();
          for (byte[] bb : list) Imgs.add(ImageIO.read(new ByteArrayInputStream(bb)));
          return Imgs;
        }
        /**
        * Load all images from a PDF file as JavaFX images
        * @param pdfDoc  String, file name of the PDF file
        * @return List<javafx.scene.image.Image> of ALL images found in the PDF file
        * @exception Exception thrown by Java (e.g. NullPointerException, IOException, etc.)
        */
        public static List<javafx.scene.image.Image> jfxImages(String pdfDoc) throws Exception {
          List<byte[]> list = extractImages(pdfDoc);
          ArrayList<javafx.scene.image.Image> Imgs = new ArrayList<javafx.scene.image.Image>();
          for (byte[] bb : list) Imgs.add(new Image(new ByteArrayInputStream(bb)));
          return Imgs;
        }
        /**
        * Extract the raw bytes of every JPG image (JFIF signature ... FF D9) in a PDF file
        * @param pdfDoc  String, file name of the PDF file
        * @return List<byte[]> one byte[] per image, each the content of a JPG file
        * @exception Exception thrown by Java (e.g. IOException)
        */
        public static List<byte[]> extractImages(String pdfDoc) throws Exception {
          // starting bytes of a JPG image: FF D8 FF E0 00 10 'J' 'F' 'I' 'F'
          byte[] bImg = { (byte)0xFF, (byte)0xD8, (byte)0xFF, (byte)0xE0, (byte)0x00,
                          (byte)0x10, (byte)0x4A, (byte)0x46, (byte)0x49, (byte)0x46 };
          int bLen = bImg.length;
          //
          int mx = 1024*1024; // 1 MB
          byte[] buf = Files.readAllBytes((new File(pdfDoc)).toPath());
          ByteArrayOutputStream bao = new ByteArrayOutputStream(buf.length < mx ? mx : buf.length);
          ArrayList<byte[]> list = new ArrayList<byte[]>();
          mx = buf.length;
          L1:for (int i = 0, k = 0; i < mx; ++i) {
            // only test positions where the whole signature still fits into the file
            if (buf[i] == bImg[0] && i + bLen <= mx) {
              k = i + 1;
              for (int l = 1; l < bLen && k < mx; ++l, ++k) if (buf[k] != bImg[l]) {
                i = k - 1; // the loop's ++i re-examines the mismatching byte
                continue L1;
              }
              int b = i; // look for the Ending bytes
              // the i+1 check prevents reading past the end of the file
              for (; i < mx; ++i) if (i + 1 < mx && (byte)0xFF == buf[i] && (byte)0xD9 == buf[i+1]) {
                i += 2;
                break;
              }
              bao.write(buf, b, i-b);
              list.add(bao.toByteArray());
              bao.reset(); // clear bao
            }
          }
          bao.close();
          return list;
        }
    }
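Since each byte[] returned by extractImages is described above as the content of a JPG file, the images can also be written straight to disk without Swing or JavaFX. Here is a minimal sketch under that assumption; the class JpegSaver, the method saveAll and the image-N.jpg naming scheme are hypothetical and not part of the original post.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only (not part of PDFImages.java).
final class JpegSaver {
    /**
     * Writes each byte[] as image-1.jpg, image-2.jpg, ... into dir
     * (created if missing) and returns the paths that were written.
     */
    static List<Path> saveAll(List<byte[]> images, Path dir) throws IOException {
        Files.createDirectories(dir);
        List<Path> written = new ArrayList<>();
        int n = 1;
        for (byte[] jpg : images) {
            Path target = dir.resolve("image-" + n++ + ".jpg");
            Files.write(target, jpg); // the bytes are saved unchanged
            written.add(target);
        }
        return written;
    }
}
```

With the class from this post on the classpath, a call could look like JpegSaver.saveAll(PDFImages.extractImages("all.pdf"), java.nio.file.Paths.get("out")).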
    Application
    The following JavaFX example shows how to use the static method jfxImages(String pdfDoc) on a PDF document; swingImages(String pdfDoc) is used in a similar way.
    Java:
    import javafx.application.Application;
    import javafx.geometry.*;
    import javafx.scene.Scene;
    import javafx.scene.control.*;
    import javafx.scene.image.*;
    import javafx.scene.layout.*;
    import javafx.stage.Stage;

    import java.util.List;
    //
    // Joe Nartca (C)
    public class PDFtest extends Application {

        private int pos = 0; // index of the image currently shown

        @Override
        public void start(Stage stage) {
            // get the passing arguments args[]
            Application.Parameters params = getParameters();
            List<String> list = params.getRaw();
            //
            try {
              // extract all PDF images
              List<Image> imgs = PDFImages.jfxImages(list.get(0));
              //
              Image img = imgs.get(0);
              ImageView imgView = new ImageView(img);
              imgView.setPreserveRatio(true);
              // "Next" cycles through the images and resizes the window to fit
              Button next = new Button("Next");
              next.setOnAction(e -> {
                if (++pos == imgs.size()) pos = 0;
                Image I = imgs.get(pos);
                imgView.setImage(I);
                stage.setHeight(I.getHeight()+90);
                stage.setWidth(I.getWidth());
              });
              //
              HBox hbx1 = new HBox(10);
              hbx1.setAlignment(Pos.CENTER);
              hbx1.setPadding(new Insets(0, 5, 0, 5));
              hbx1.getChildren().setAll(next);
              //
              VBox vbox = new VBox(10);
              vbox.setAlignment(Pos.CENTER);
              vbox.setPadding(new Insets(5, 5, 5, 5));
              vbox.getChildren().setAll(imgView, hbx1);
              VBox.setVgrow(imgView, Priority.ALWAYS);
              //
              Scene scene = new Scene(vbox);
              //
              stage.setScene(scene);
              stage.setWidth(img.getWidth());
              stage.setHeight(img.getHeight()+90);
              stage.setTitle("PDFImages");
              stage.show();
            } catch (Exception ex) {
              ex.printStackTrace();
            }
        }
        public static void main(String... args) {
            // fall back to the default PDF file if no argument is given
            String[] parms = { "all.pdf" };
            launch(args.length == 0 ? parms : args);
        }
    }
    The example needs a default PDF file, all.pdf, which contains 3 embedded images and can be downloaded below.
     

    Attached Files:

    • all.pdf
      File size:
      181.1 KB
      Views:
      0
    Last edited: 27/7/19
    tranhuyvc likes this.
  2. tranhuyvc

    tranhuyvc Administrator Staff Member

    It sounds great, Joe. And with the images we extract from the PDF, can we run OCR to convert them to text? How does that process work? :)
     
  3. Joe

    Joe VIP Member

    Do you mean the PDF text? Or OCR output from an OCR scanner? It depends on how the text image was created.
     
    tranhuyvc likes this.
  4. tranhuyvc

    tranhuyvc Administrator Staff Member

    I mean OCR from an image, but I think that's a different problem. As far as I know, OCR uses many fonts to extract text. Is that right, Joe? :)
     
  5. Joe

    Joe VIP Member

    Boss, OCR is Optical Character Recognition, and that means literally that the "characters" are recognized by their patterns or features. OCR works as follows:
    1. The image is scanned for different areas: dark and light areas.
    2. The dark area (the foreground, i.e. the ink) is mapped as characters.
    3. The characters are then identified by some existing patterns (fonts).
    4. If the characters are blurred or damaged and don't match any pattern, the features of the characters are used and the result is "interpolated" (guessed). Example: the letter E has the features of 1 vertical line and 3 short horizontal lines. If 1 vertical line and only 2 horizontal lines (e.g. the upper and middle lines) are detected, the letter is interpolated as either E or F. This is the fault tolerance, and the remaining ambiguity can only be ironed out by Artificial Intelligence (AI) and/or Machine Learning (ML).
    5. When the characters are recognized (by patterns or features), they are finally converted into the designated character set (e.g. ASCII).
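Steps 1 and 2 above (separating dark from light areas) can be sketched as a simple fixed-threshold binarization. This is only an illustration under stated assumptions: the class Binarizer, the method darkMask and the idea of one fixed threshold are made up here, and real OCR engines typically choose the threshold adaptively (for example with Otsu's method).

```java
import java.awt.image.BufferedImage;

// Hypothetical helper for illustration only.
final class Binarizer {
    /**
     * Returns mask[y][x] == true for dark pixels (treated as ink) and false for light ones.
     * 'threshold' is a gray level from 0 to 255; pixels darker than it count as ink.
     */
    static boolean[][] darkMask(BufferedImage img, int threshold) {
        boolean[][] mask = new boolean[img.getHeight()][img.getWidth()];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int gray = (r + g + b) / 3; // plain average as a rough brightness value
                mask[y][x] = gray < threshold;
            }
        }
        return mask;
    }
}
```

A BufferedImage returned by swingImages could be passed to darkMask directly, for example with a threshold of 128.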
    Is the explanation clear enough for you? OCR is complex work if one wants a "perfect" OCR, because the fault tolerance is sensitive to scribbled handwriting, bad printing, calligraphic writing, etc. (see E or F). AI/ML comes in by using (human) grammar or vocabulary. Example: in the word Existing the letter E is blurred. E or F? Fxisting is unlikely because it isn't an English word, but Existing is. So: E and not F.
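The E-or-F decision described above can be sketched as a dictionary lookup. This is a toy illustration, not how any particular OCR engine works: the class OcrGuess, the method resolve, the use of '?' for the one unreadable character and the tiny word list are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only.
final class OcrGuess {
    /**
     * Replaces the unreadable character (marked '?') with each candidate letter
     * and keeps only the spellings found in the dictionary (lower-case words).
     */
    static List<String> resolve(String word, char[] candidates, Set<String> dictionary) {
        List<String> matches = new ArrayList<>();
        for (char c : candidates) {
            String spelling = word.replace('?', c);
            if (dictionary.contains(spelling.toLowerCase(Locale.ROOT))) matches.add(spelling);
        }
        return matches;
    }

    public static void main(String[] args) {
        Set<String> english = new HashSet<>(Arrays.asList("existing", "fixing"));
        // The first letter is blurred: is it E or F?
        System.out.println(resolve("?xisting", new char[] { 'E', 'F' }, english)); // prints [Existing]
    }
}
```

If more than one spelling survives the lookup, the ambiguity remains and further context (grammar, neighbouring words) would be needed.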
     
    Last edited: 5/8/19
    tranhuyvc likes this.
  6. tranhuyvc

    tranhuyvc Administrator Staff Member

    :) It's very clear, Joe. I tried to use OpenCV to OCR an image, but the result was not good at all. I think it's a complex process :)
     
  7. Joe

    Joe VIP Member

    Yes, Huy.
    OCR is a very complex process. It involves hardware (an optical scanner or camera) and operating-system-specific subroutines (system calls), which must be implemented either in C/C++ or in Assembler. Java has to go native (an interface to C/C++) if you want to access the OCR hardware from Java, and that makes your Java app OS-dependent. Try downloading the OpenCV sources and you will see a lot of C/C++ code, some of it OS-specific (see the platforms folder).
    If you could give me some sample OCR image content, I could probably help you.
     
    Last edited: 9/8/19
    tranhuyvc likes this.
  8. tranhuyvc

    tranhuyvc Administrator Staff Member

    Good day, Joe!
    In the examples below, I want to get the image from a T-shirt. :)

    ao-thun-nam-ufc-mau-den-15.jpg ao-thun-tet-2019-Khong-say-khong-ve.jpg
     
  9. Joe

    Joe VIP Member

    Boss, these are JPG images and not OCR content... :(
     
    tranhuyvc likes this.
  10. tranhuyvc

    tranhuyvc Administrator Staff Member

    But Joe, the images have some text inside. :)
     
