Neeraj's Blog

There is always an open source solution..

Archive for the category “projects”

Compiling Mupdf from Source

MuPDF is a free and open source software library written in C that implements a PDF and XPS parsing and rendering engine. You can download it from http://code.google.com/p/mupdf/downloads/list. For Windows, just download “mupdf-1.0-tools-windows.zip” and install it. For Linux you cannot get a ready-made package, so you have to download the source and compile it. To compile from source you need several third-party libraries: freetype2, jbig2dec, libjpeg, openjpeg and zlib. From the download link given above you can get both the source archive (mupdf-1.0-source.tar.gz) and the third-party archive (mupdf-thirdparty-2012-04-23.zip).

(From here on I explain everything in detail, for beginners on Linux. If you are already familiar with “make”, don’t waste your time: compile all the dependencies first, and then compile the mupdf source.)

Compiling third party dependencies.

Extract the third party files. You will get a folder named “thirdparty“. Inside thirdparty, there will be 5 folders. We have to compile all of them one by one. Before starting you should install libtool.

sudo apt-get install libtool

After that we can start compiling. Let’s start with freetype-2.4.9.

cd freetype-2.4.9

./configure (If it doesn’t work, run sh autogen.sh and then ./configure again).

make

sudo make install

Next we go for jbig2dec.

cd jbig2dec

sh autogen.sh

./configure (If you do not have permission to run configure, run chmod +x configure first).

make

sudo make install

Next is jpeg-8d.

I could not compile the jpeg-8d that comes with the mupdf third-party archive, so I decided to download it separately. You can get it from http://www.ijg.org/. Download jpegsrc.v8d.tar.gz from there (it is the one for Linux) and decompress it. You will get a folder named jpeg-8d. Now continue the compilation.

cd jpeg-8d (the newly downloaded jpeg-8d)

./configure (If you do not have permission to run configure, run chmod +x configure first).

make

sudo make install

Now we go for openjpeg-1.5.0

cd openjpeg-1.5.0

./configure (If you do not have permission to run configure, run chmod +x configure first).

make

sudo make install

Next is  zlib-1.2.5

cd zlib-1.2.5

./configure (If you do not have permission to run configure, run chmod +x configure first).

make

sudo make install

Compiling MuPdf source

Now all the dependencies are compiled, and we can start compiling the mupdf source. Extract mupdf-1.0-source.tar.gz. Then,

cd mupdf-1.0-source

make

sudo make install

Now it is done. 🙂

man mupdf will show you the man page.

Note: I am using Ubuntu 11.10.


Dynamic Log File

I am now working on a project in which I have to create a folder structure for each output and store the outputs in those folders. To store the log file, I decided to create a new folder named “log” inside that folder structure, and to save the log file for each input in that folder.

Previously, I was creating a single log file named “admin.log”. With several inputs, a single log file was a mess. At that time my log4j.properties file looked like this:

# Root logger option
log4j.rootLogger=ALL, file
# Direct log messages to a log file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=<my path>admin.log
log4j.appender.file.MaxFileSize=1GB
log4j.appender.file.MaxBackupIndex=1
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{ABSOLUTE} %5p %c{1}:%L - %m%n

<my path> : The absolute path, where you want to save the admin.log

For my purpose I only had to edit my main function (there is no need to edit the log4j.properties file). I added the following code snippet to my main function.

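// Setting the log file path at runtime
// (needs java.io.FileInputStream, java.util.Properties and org.apache.log4j.PropertyConfigurator)
 String dynamicLog = "<my path>" + "admin.log";
 Properties p = new Properties();
 // load the existing settings and close the stream afterwards
 try (FileInputStream in = new FileInputStream("<property file path>\\log4j.properties")) {
     p.load(in);
 }
 p.put("log4j.appender.file.File", dynamicLog); // overrides the File value from log4j.properties
 PropertyConfigurator.configure(p);

<my path> : the path to the folder where admin.log has to be saved, ending with a path separator. (For me it is now ./inputId/log/.)

<property file path> : the absolute path to the folder that contains our log4j.properties file.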

I tested it, and it worked properly. 🙂
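The per-input path handling can also be pulled into two small helpers. This is only a minimal sketch under some assumptions: the class name DynamicLogPath, the helper names logFileFor and withLogFile, and the "input42" id are all made up for illustration, and the log4j call itself is left as a comment so that the example runs with the JDK alone.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical helper class, not part of the original project.
class DynamicLogPath {

    // Builds <outputRoot>/<inputId>/log/admin.log
    static Path logFileFor(Path outputRoot, String inputId) {
        return outputRoot.resolve(inputId).resolve("log").resolve("admin.log");
    }

    // Returns a copy of the base settings with only the appender's File entry replaced.
    static Properties withLogFile(Properties base, Path logFile) {
        Properties p = new Properties();
        p.putAll(base);
        p.setProperty("log4j.appender.file.File", logFile.toString());
        return p;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the settings loaded from log4j.properties
        Properties base = new Properties();
        base.setProperty("log4j.rootLogger", "ALL, file");
        base.setProperty("log4j.appender.file.File", "admin.log");

        Path outputRoot = Files.createTempDirectory("outputs"); // assumed output root
        Path logFile = logFileFor(outputRoot, "input42");        // assumed input id
        Files.createDirectories(logFile.getParent());            // create the log folder up front

        Properties p = withLogFile(base, logFile);
        System.out.println(p.getProperty("log4j.appender.file.File"));
        // With log4j 1.x on the classpath, the next step would be:
        // PropertyConfigurator.configure(p);
    }
}
```

Copying the properties instead of changing them in place keeps the loaded defaults reusable for the next input.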

Pdf to Html – update

After studying Michel Tu’s code I understood that he is using Apache PDFBox to process the PDF and to convert it into JSON format. So I decided to experiment with PDFBox. PDFBox is a nice tool for working with PDFs. You can download its binaries or sources from here. With PDFBox you can easily convert a PDF to text or HTML. You can extract only the text from the PDF and convert it into the desired format. With ExtractText we can easily extract text from a PDF.

usage: java -jar pdfbox-app-x.y.z.jar ExtractText [OPTIONS] <PDF file> [Text file]

I downloaded pdfbox-app-1.6.0.jar and converted different PDF files to HTML and text to analyse its characteristics. From my experiments I observed that:

  • It converted every PDF I tried to HTML or text format, regardless of its size.
  • It converts PDF to HTML or text format with fairly good speed.
  • It extracts all the text from the PDF files.
  • It can distinguish bullets and numbered lists.
  • It simply discards images, drawn shapes etc.
  • It does not work properly with headings and tables. It extracts the heading but cannot tell that it is a heading. We also get the table entries and the table headings, but we cannot tell that they come from a table.

Now I wanted to analyse the running time of this conversion. So I decided to work with a large number of files and convert them to HTML and text. For this I had to create a directory with a lot of PDF files, so I decided to make several copies of the same file (neeraj.pdf) and put them in the desired directory. I wrote a simple Perl script for this. I am adding the code below.


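#!/usr/bin/perl
use strict;
use warnings;
use File::Copy;

# Make 10 copies of neeraj.pdf named ./data/1.pdf ... ./data/10.pdf
# (the ./data folder must already exist)
my $filetobecopied = "neeraj.pdf";
for (my $count = 10; $count >= 1; $count--) {
    my $newfile = "./data/" . $count . ".pdf";
    copy($filetobecopied, $newfile) or die "File cannot be copied: $!";
}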

My intention was to search for the PDF files in that directory and convert them into text format. But I cannot call java -jar pdfbox-app-x.y.z.jar ExtractText [OPTIONS] <PDF file> [Text file] for each of my PDF files, because starting the JVM every time is a time-consuming process. So I have to use PDFBox as a library and write a program that converts each file.
I got some code which converts a PDF file to text format from here, and I edited it for my purpose. First I had to adapt the code to my PDFBox version, so I edited the import statements. The next problem was that the code works with only one PDF. So I changed it to search for all the PDFs in the directory and convert them to text. Below I am showing only the changes I have made. I edited the main method only.

public static void main(String args[]) {

    String path = "/home/neeraj/Desktop/projtest/project/data/";

    File folder = new File(path);
    File[] listOfFiles = folder.listFiles();

    for (int i = 0; i < listOfFiles.length; i++) {
        // skip sub-directories, look at regular files only
        if (listOfFiles[i].isFile()) {
            String files = listOfFiles[i].getName();
            if (files.endsWith(".pdf") || files.endsWith(".PDF")) {
                System.out.println(files);
                PDFTextParser pdfTextParserObj = new PDFTextParser();
                String pdfToText = pdfTextParserObj.pdftoText(path + files);

                if (pdfToText == null) {
                    System.out.println("PDF to Text Conversion failed.");
                } else {
                    System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
                    // the text goes next to the PDF, as <name>.pdf.txt
                    pdfTextParserObj.writeTexttoFile(pdfToText, path + files + ".txt");
                }
            }
        }
    }
}

It worked correctly and created text files of all pdf files in that directory.
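One thing to watch in a loop like this is that File.listFiles() returns null when the path is not a readable directory. A slightly more defensive way to collect the PDF files could look like the sketch below. The class name PdfLister and its method names are made up for illustration, and the extension check here ignores case completely, so it also accepts names like x.Pdf.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the original code.
class PdfLister {

    // True for names ending in .pdf, ignoring upper/lower case.
    static boolean isPdfName(String name) {
        return name.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    // Returns the PDF files directly inside dir, sorted by path.
    // A missing or unreadable directory gives an empty list.
    static List<File> listPdfs(File dir) {
        List<File> pdfs = new ArrayList<>();
        File[] entries = dir.listFiles(); // null if dir is not a readable directory
        if (entries == null) {
            return pdfs;
        }
        Arrays.sort(entries);
        for (File f : entries) {
            if (f.isFile() && isPdfName(f.getName())) {
                pdfs.add(f);
            }
        }
        return pdfs;
    }

    // Same naming as in the post: the PDF name with ".txt" appended.
    static File textFileFor(File pdf) {
        return new File(pdf.getParentFile(), pdf.getName() + ".txt");
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (File pdf : listPdfs(dir)) {
            System.out.println(pdf.getName() + " -> " + textFileFor(pdf).getName());
        }
    }
}
```

The conversion call itself (pdftoText and writeTexttoFile) would go inside the loop in main, where the names are printed.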
Next I measured the running time of this code. I ran the program with 1000, 2000, 3000, 4000, 5000 and 10,000 copies of the same PDF file. It worked with a fairly good speed.
I ran the code 5 times for each number of files to check whether the running time is consistent. The result is shown below.

Ran the code 5 times for same number of same document

Then I calculated the average of these running times and plotted a graph.

Mean of the running times for different number of documents

The graph of the above table is given below. (This image is created from a pdf file using pdfbox).

X axis : Number of documents, Y axis : Running Time in sec
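The measurement itself (run the conversion 5 times, note each running time, take the mean) can be sketched in a few lines of Java. This is not the code used for the numbers above; the class name RunTimer and its methods are made up for illustration, and the task in main is an empty placeholder where the conversion would go.

```java
// Hypothetical timing helper, not part of the original code.
class RunTimer {

    // Runs the task the given number of times and returns
    // the wall-clock time of each run in seconds.
    static double[] timeRuns(Runnable task, int runs) {
        double[] seconds = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            seconds[i] = (System.nanoTime() - start) / 1e9;
        }
        return seconds;
    }

    // Arithmetic mean; the array must not be empty.
    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        Runnable conversion = () -> { /* placeholder: convert all PDFs in the folder */ };
        double[] times = timeRuns(conversion, 5);
        System.out.printf("mean over %d runs: %.3f s%n", times.length, mean(times));
    }
}
```

System.nanoTime() is used because it is meant for measuring elapsed time, while System.currentTimeMillis() can jump if the system clock is adjusted.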

Limitation of PdfBox

PDFBox simply extracts the text from the PDF file. It cannot determine the logical structure of the content, that is, whether the current word belongs to a heading, a table, a list and so on. PDFBox converts PDF files to text with no intelligence, only by extracting all the text. This is its main limitation.

PDF to HTML

Now I am working on a new project by Michel Tu.

You can get the source code from  https://github.com/neumino/PDF-to-unusual-HTML/commit/6c28fd52962e68b17a5142db5bc5a7dc4b00cdc2.

I cloned the project to my local system. (You can use either mercurial or git for cloning. You can also download a zip file of this project from the above link.) After that I built a Java package and tried to run it with the command java -jar, but it always ended with an error. So instead of fixing the package building issue, I decided to go for the easier option of using the NetBeans IDE and tried to run the source code of the project from NetBeans. (Before trying to run the project, please make sure that you have ImageMagick installed on your system.) But this time too, I got an error.

So I checked the code, and found that the system call,

command = pathToImagemagick + " -density " + density + " " + pathToPdf + " " + pathToDirectory + imageName;

in the file Pdf2Json was not working properly. To confirm this, I ran the command in my terminal as

convert -density 108 /home/neeraj/NetBeansProjects/icresume.pdf /home/neeraj/NetBeansProjects/icresume-0.png 

It did not work, and I got error messages.

Then I ran the same command with a small change

/usr/bin/convert -density 108 /home/neeraj/NetBeansProjects/icresume.pdf /home/neeraj/NetBeansProjects/icresume-0.png 

Then it worked properly. So  I changed the pathToImagemagick variable in the class ConvertPdf as follows

static String pathToImagemagick = "/usr/bin/convert";

After doing this, I could run the project successfully. After running it, I got the following files:

pdffile-x.png : image files, one for each page of the PDF file

pdfile_words.txt : contains all the words of the PDF file with their positions, in JSON format.
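When a program starts an external tool like this, it helps to check first that the tool really exists at the configured path, and to pass the arguments as a list so that spaces in file names do not break the command. The sketch below shows that idea with ProcessBuilder. It is not how the project itself runs the command (I only show the command string above); the class name ConvertCommand and the method build are made up for illustration.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of the PDF-to-unusual-HTML project.
class ConvertCommand {

    // Builds the argument list for: <convert> -density <density> <pdf> <image>
    static List<String> build(String pathToImagemagick, int density,
                              String pathToPdf, String pathToImage) {
        return Arrays.asList(pathToImagemagick, "-density",
                String.valueOf(density), pathToPdf, pathToImage);
    }

    public static void main(String[] args) throws Exception {
        String convert = "/usr/bin/convert";
        // Fail early with a clear message if the binary is not where we expect it
        if (!new File(convert).canExecute()) {
            System.err.println("ImageMagick convert not found at " + convert);
            return;
        }
        List<String> cmd = build(convert, 108,
                "/home/neeraj/NetBeansProjects/icresume.pdf",
                "/home/neeraj/NetBeansProjects/icresume-0.png");
        // inheritIO() shows the error messages of convert in our own console
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        System.out.println("convert exited with " + p.waitFor());
    }
}
```

Printing the exit code and the error output of convert makes a wrong path or a missing installation visible immediately, instead of failing silently inside the Java program.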
