Converting PDFs into text


PDFtoTXT

I am currently working on a project to extract the text from a PDF purchase order and create an XML file to feed into our ERP system, fully automated, obviously.

I therefore needed a way to extract the text from a PDF at the command line. Luckily there is a cool set of utilities called poppler-utils, which includes pdftotext. This works perfectly and is extremely fast.

I have installed this on both a CentOS server and, more recently, an Ubuntu server.

The caveat is that the PDF needs to contain real text and not just a scanned image; this is why the Ubuntu server may be used.
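
A quick way to tell the two kinds of PDF apart is to run pdftotext and see whether any non-whitespace text comes out. The following is a minimal sketch, not part of my original setup: the function name has_text_layer is made up for illustration, and it assumes pdftotext is already on the PATH (giving - as the output file sends the text to standard output).

```shell
# has_text_layer FILE
# Returns 0 if pdftotext extracts any non-whitespace characters from FILE,
# 1 otherwise (for example, a scanned image with no text layer).
# The function name is hypothetical; pdftotext must be on the PATH.
has_text_layer() {
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

For example, has_text_layer order.pdf || echo "no text layer" would flag a scanned file (order.pdf is just a placeholder name).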

mkdir ~/software
cd ~/software
wget https://poppler.freedesktop.org/poppler-0.40.0.tar.xz
wget https://poppler.freedesktop.org/poppler-data-0.4.7.tar.gz
tar xvf poppler-0.40.0.tar.xz
tar xvf poppler-data-0.4.7.tar.gz

There are a few requirements:

yum -y install fontconfig fontconfig-devel
yum -y install cairo cairo-devel libjpeg libjpeg-devel libcurl-devel gtk-doc 
yum -y install libtool gcc-c++ lcms2 openjpeg-libs xz openjpeg-devel
yum -y install libtiff-devel lcms2-devel
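
Most of the failures shown further down come from a missing tool (xz, gcc, g++). As a rough pre-flight check, the sketch below reports which commands are absent before you start. The function name missing_tools and the particular list of tools are my own choices for illustration, not something poppler ships.

```shell
# missing_tools TOOL...
# Prints the name of every listed command that is not found on the PATH.
# Prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example: check the tools this build relies on.
missing_tools xz gcc g++ make
```

If this prints nothing, the compiler and xz are in place; it does not check the -devel libraries, which configure reports on separately.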

cd poppler-data-0.4.7
make install

tar (child): xz: Cannot exec: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now

This error appears when extracting the .tar.xz archive without xz available. Ensure that you install xz, then run tar again.

cd ../poppler-0.40.0
./configure

ERROR

checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking for style of include used by make... GNU
checking for gcc... no
checking for cc... no
checking for cl.exe... no
configure: error: in `/root/software/poppler-0.40.0':
configure: error: no acceptable C compiler found in $PATH
See `config.log' for more details

Ensure that you installed gcc-c++, which also brings in gcc

ERROR

No package 'fontconfig' found

Ensure that you installed fontconfig fontconfig-devel

INFORMATION

Building poppler with support for:
font configuration: fontconfig
splash output: yes
cairo output: no (requires cairo >= 1.10.0)
qt4 wrapper: no
qt5 wrapper: no
glib wrapper: no (requires cairo output)
introspection: no
cpp wrapper: yes
use gtk-doc: no
use libjpeg: no
use libpng: no
use libtiff: no
use zlib: no
use libcurl: no
use libopenjpeg: no
use cms: no
command line utils: yes
test data dir: /root/software/poppler-0.40.0/./../test

Warning: Using libjpeg is recommended. The internal DCT decoder is unmaintained.
Warning: Using libopenjpeg is recommended. The internal JPX decoder is unmaintained.

Ensure that you installed cairo cairo-devel libjpeg libjpeg-devel libcurl-devel gtk-doc
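
To see at a glance which optional features configure left out, you can filter its summary for entries set to "no". This is only a sketch: it assumes you saved the output yourself, for example with ./configure 2>&1 | tee configure-summary.log (a file name I made up), and the function name disabled_features is likewise hypothetical.

```shell
# disabled_features LOGFILE
# Lists the lines of a saved ./configure run of the form "name: no",
# e.g. "use libjpeg: no" or "cairo output: no (requires ...)".
# Only the text before the first colon is treated as the name, so
# "configure: error: no acceptable C compiler ..." is not reported.
disabled_features() {
    grep -E '^[^:]+:[[:space:]]+no([[:space:]]|$)' "$1"
}
```

Each line it prints points at a -devel package that was either missing or too old when configure ran.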

make
make all-recursive

make[1]: Entering directory `/root/software/poppler-0.40.0'
Making all in goo
make[2]: Entering directory `/root/software/poppler-0.40.0/goo'
CXX libgoo_la-gfile.lo
../libtool: line 1129: g++: command not found
make[2]: *** [libgoo_la-gfile.lo] Error 1
make[2]: Leaving directory `/root/software/poppler-0.40.0/goo'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/software/poppler-0.40.0'
make: *** [all] Error 2

Ensure that you installed libtool gcc-c++ lcms2 openjpeg-libs

make install

You should now have a working copy of the software.
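
As a quick sanity check, you can confirm that the binary is on the PATH and ask it for its version. The snippet below is a sketch; the function name check_pdftotext is my own. pdftotext -v prints its version banner on standard error, which is why the output is redirected before being piped to head.

```shell
# check_pdftotext
# Reports where pdftotext was found and prints the first line of its
# version banner. Returns non-zero if the command cannot be found.
check_pdftotext() {
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found on PATH" >&2
        return 1
    fi
    echo "pdftotext found at: $(command -v pdftotext)"
    pdftotext -v 2>&1 | head -n 1
}
```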

To convert a PDF, simply run:

pdftotext -layout PDF_Name Output_Filename
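
For an automated pipeline you will probably want to convert a whole directory rather than one file. The loop below is a minimal sketch under my own assumptions: the function name convert_all and the two directory arguments are hypothetical, and only the pdftotext -layout call comes from the command above.

```shell
# convert_all IN_DIR OUT_DIR
# Runs "pdftotext -layout" on every *.pdf in IN_DIR and writes a .txt file
# with the same base name into OUT_DIR. Returns non-zero if any file failed.
convert_all() {
    in_dir=$1
    out_dir=$2
    status=0
    mkdir -p "$out_dir" || return 1
    for pdf in "$in_dir"/*.pdf; do
        [ -e "$pdf" ] || continue          # the glob matched no PDFs
        txt="$out_dir/$(basename "$pdf" .pdf).txt"
        if ! pdftotext -layout "$pdf" "$txt"; then
            echo "failed: $pdf" >&2
            status=1
        fi
    done
    return "$status"
}
```

Called as convert_all /path/to/incoming /path/to/text (placeholder paths), it leaves one text file per purchase order, ready for whatever builds the XML.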


Ubuntu Install

mkdir ~/software

apt-get -y install libcurl4-gnutls-dev libcairo2-dev libcairo2 libjpeg-dev \
libtiff5-dev gtk-doc-tools g++ fontconfig libfontconfig1-dev

cd ~/software

wget https://launchpad.net/ubuntu/+archive/primary/+files/poppler-data_0.4.6.orig.tar.gz
wget https://launchpad.net/ubuntu/+archive/primary/+files/poppler_0.41.0.orig.tar.xz

tar -xvf poppler-data_0.4.6.orig.tar.gz

tar -xvf poppler_0.41.0.orig.tar.xz

cd poppler-data-0.4.6/

make install

cd ../poppler-0.41.0/

./configure --prefix=/usr

make

make install

 
