Neeraj's Blog

There is always an open source solution..

Pdf to Html – update

After studying Michel Tu’s code I understood that he uses Apache PDFBox to process the PDF and convert it into JSON format, so I decided to experiment with PDFBox. PDFBox is a nice tool for working with PDFs. You can download its binaries or sources from here. With PDFBox you can easily convert a PDF to text or HTML: you can extract just the text from the PDF and convert it into the desired format. The ExtractText command-line tool makes this text extraction easy.

usage: java -jar pdfbox-app-x.y.z.jar ExtractText [OPTIONS] <PDF file> [Text file]

I downloaded pdfbox-app-1.6.0.jar and converted different PDF files to HTML and text to analyse its characteristics. From my experiments I observed that:

  • It converted every PDF I tried to HTML or text format, regardless of its size.
  • It converts PDF to HTML or text format with fairly good speed.
  • It extracts all the text from the PDF files.
  • It can distinguish bullets and numbering.
  • It simply discards images, drawn shapes, etc.
  • It does not work properly with headings and tables. It extracts the heading text but cannot tell that it is a heading. Likewise we get the table entries and the table headings, but we cannot tell that they come from a table.

Next I wanted to analyse the running time of this conversion, so I decided to convert a large number of files to HTML and text. For this I needed a directory with a lot of PDF files. I decided to make several copies of the same file (neeraj.pdf) and put them in the desired directory, and I wrote a simple Perl script to do it. The code is below.


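#!/usr/bin/perl
use strict;
use warnings;
use File::Copy;

# Make numbered copies of one PDF in ./data (the directory must already exist).
my $filetobecopied = "neeraj.pdf";
for (my $count = 10; $count >= 1; $count--) {
    my $newfile = "./data/" . $count . ".pdf";
    copy($filetobecopied, $newfile) or die "File cannot be copied: $!";
}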

My intention was to search for the PDF files in that directory and convert them into text format. But I cannot call java -jar pdfbox-app-x.y.z.jar ExtractText [OPTIONS] <PDF file> [Text file] for each of my PDF files, because starting the JVM for every file is time consuming. So I have to use PDFBox as a library and write a program that converts each file.
I got code which converts a PDF file to text format from here, and edited it for my purpose. First I had to adapt the code to my PDFBox version by editing the import statements. The next problem was that this code works with only one PDF, so I changed it to search for all the PDFs in the directory and convert them to text. Below I am showing only the changes I made; I edited only the main method.

public static void main(String args[]) {

    // Directory that holds the PDF files to convert (needs import java.io.File).
    String path = "/home/neeraj/Desktop/projtest/project/data/";

    File folder = new File(path);
    File[] listOfFiles = folder.listFiles();
    if (listOfFiles == null) {
        // listFiles() returns null if the path is not a readable directory.
        System.out.println("Cannot read directory: " + path);
        return;
    }

    for (int i = 0; i < listOfFiles.length; i++) {
        if (!listOfFiles[i].isFile()) {
            continue;
        }
        String files = listOfFiles[i].getName();
        if (files.endsWith(".pdf") || files.endsWith(".PDF")) {
            System.out.println(files);
            PDFTextParser pdfTextParserObj = new PDFTextParser();
            String pdfToText = pdfTextParserObj.pdftoText(path + files);

            if (pdfToText == null) {
                System.out.println("PDF to Text Conversion failed.");
            } else {
                System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
                // Write the text next to the PDF, e.g. 1.pdf -> 1.pdf.txt
                pdfTextParserObj.writeTexttoFile(pdfToText, path + files + ".txt");
            }
        }
    }
}

It worked correctly and created a text file for every PDF file in that directory.
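The file-selection step of the loop above can also be written with java.nio. This is only a sketch of that one step, not part of the original program: the class name PdfFileLister and its two helper methods are made up for illustration, and it does not call PDFBox at all. Unlike the endsWith(".pdf") || endsWith(".PDF") check, it ignores case entirely, so a name like report.Pdf is also picked up.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, named here only for illustration.
public class PdfFileLister {

    // Returns the regular files in dir whose names end in ".pdf" (any case), sorted by path.
    static List<Path> listPdfFiles(Path dir) throws IOException {
        List<Path> pdfs = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString().toLowerCase(Locale.ROOT);
                if (Files.isRegularFile(entry) && name.endsWith(".pdf")) {
                    pdfs.add(entry);
                }
            }
        }
        Collections.sort(pdfs);
        return pdfs;
    }

    // Same naming rule as the loop above: append ".txt" to the full PDF file name.
    static Path textFileFor(Path pdf) {
        return pdf.resolveSibling(pdf.getFileName().toString() + ".txt");
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory if no argument is given.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path pdf : listPdfFiles(dir)) {
            System.out.println(pdf + " -> " + textFileFor(pdf));
        }
    }
}
```

Each path returned by listPdfFiles could then be handed to the conversion call, and textFileFor gives the matching output name.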
Next I measured the running time of this code. I ran the program on 1000, 2000, 3000, 4000, 5000 and 10,000 copies of the same PDF file. It ran at a fairly good speed.
I ran the code 5 times for each number of files to check whether the running time is consistent. The result is shown below.

Ran the code 5 times for same number of same document

Then I calculated the average of these running times for each number of documents and plotted a graph.
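The post does not show how the times were measured, so the following is only one possible way to do it: run a task a fixed number of times, record each wall-clock time with System.nanoTime(), and take the mean. The class name RunTimer and the dummy workload are assumptions made for illustration; in the experiment the task would be the batch conversion itself.

```java
import java.util.Arrays;

// Hypothetical timing helper, named here only for illustration.
public class RunTimer {

    // Runs the task `runs` times and returns each wall-clock duration in seconds.
    static double[] timeRuns(Runnable task, int runs) {
        double[] seconds = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            seconds[i] = (System.nanoTime() - start) / 1e9;
        }
        return seconds;
    }

    // Arithmetic mean of the measured durations.
    static double mean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no measurements");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the directory conversion to time the real job.
        Runnable task = () -> {
            long x = 0;
            for (int i = 0; i < 1_000_000; i++) {
                x += i;
            }
        };
        double[] seconds = timeRuns(task, 5);
        System.out.println("runs (s): " + Arrays.toString(seconds));
        System.out.println("mean (s): " + mean(seconds));
    }
}
```

Timing inside one JVM like this leaves out the JVM start-up cost, which fits the decision to use PDFBox as a library rather than launching the jar once per file.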

Mean of the running times for different number of documents

The graph of the above table is given below. (This image was created from a PDF file using PDFBox.)

X axis : Number of documents, Y axis : Running Time in sec

Limitation of PdfBox

PDFBox simply extracts the text from the PDF file. It cannot determine the logical structure of the content, that is, whether the current word belongs to a heading, a table, a list, etc. PDFBox converts PDF files to text with no intelligence, only by extracting all the text. This is its main limitation.


11 thoughts on “Pdf to Html – update”

  1. K P Varma on said:

    Good one . Neeraj .

    Krishnaprasad Varma
    Technical Architect
    Prescience Soft (P) Ltd

  2. Heya i’m for the primary time here. I came across this board and I to find It truly useful & it helped me out a lot. I hope to offer one thing again and help others like you aided me.

  3. This is really interesting, You’re a very professional blogger. I have joined your feed and look ahead to in the hunt for more of your great post. Additionally, I have shared your site in my social networks

  4. naturally like your web site but you need to take a look at the spelling on several of your posts. A number of them are rife with spelling issues and I find it very troublesome to tell the truth nevertheless I will certainly come again again.

  5. Thank you a bunch for sharing this with all people you really recognize what you are speaking about! Bookmarked. Please additionally talk over with my website =). We will have a hyperlink trade contract among us

  6. Useful info. Lucky me I discovered your web site accidentally, and I am stunned why this coincidence didn’t happened in advance! I bookmarked it.

  7. Writers like you are one in a million. Thank you for your great article and interesting, original and new views on this subject. I find your content refreshing and engaging. http://www.samsung1080phdtv.net/

  8. tauruzian on said:

    Hi,
    Do you have examples on how to convert PDF to HTML?

  9. Hi Neeraj, Thank you this is really useful. I am using PDFBox currently in a java app and works very well. However I want to link it to my Android project and that is where it fails. As I understand after reading a lot of posts on the web that it uses awt and swing and hence will not be compatible with Android. Any thoughts on how I could convert PDFBox into Android compatible?
