Setting up pdftotext and search_files on a shared host (Bluehost)

26 Jul 2008

Posted by CrashTest_

So this week I had to get the search_files module for Drupal 6 running on a shared host, Bluehost, for one of my customers. I had promised that we would be able to search his PDF files, but I didn't realize that the search_files and search_attachments modules both require a Linux command-line utility named pdftotext to be installed.

I did request that Bluehost install it on my box and was told that anything requiring root access wasn't going to happen.

Fine, maybe I can run it myself. After all, I did get SVN running on Bluehost, so how hard could it be?

Thanks to rasc on the Sphider PHP Search Engine forums for the solution, which I adapted to work with search_files.

Setting up Bluehost

  • First, go to foolabs and grab yourself a hot copy of Xpdf (you want the precompiled Linux binaries, not the source code)
  • Un-archive it, and get rid of everything other than the pdftotext file
  • Rename this file to pdftotext.script
  • Create a shell script (flavor of your choosing) named pdftotext that calls this pdftotext.script file, for example:
#!/bin/sh
# Pass the PDF path (first argument) to the renamed binary; "-" sends the text to stdout
/home/YOURBLUEHOSTUSERNAME/bin/pdftotext/pdftotext.script "$1" -
  • As you can tell from the code there, you will need to change YOURBLUEHOSTUSERNAME to the correct name.
  • Log in via FTP or SSH, and create a bin directory and a bin/pdftotext directory inside your home directory (NOT inside public_html, but one directory above that)
  • Make bin/pdftotext writable by all; this is where pdftotext will save the temporary files it creates
  • Upload both pdftotext (the shell script) and pdftotext.script (the renamed binary) to the bin/pdftotext directory, and make them executable (chmod 755 should work)
  • If you don't have one already, in your home directory (not public_html) create a .bashrc file, and add the following so that the web server knows where your executable files are:
export PATH=$PATH:$HOME/bin:$HOME/bin/pdftotext:.
export pdftotext_path=/home/YOURBLUEHOSTUSERNAME/bin/pdftotext/pdftotext
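
If you have SSH access, the directory and wrapper steps above can be sketched as one small script. This is only a sketch under assumptions, not a tested Bluehost recipe: APP_HOME is a made-up variable that defaults to a scratch directory so you can try it safely, and the Xpdf binary still has to be uploaded separately as pdftotext.script.

```shell
#!/bin/sh
# Sketch of the directory and wrapper setup described above.
# APP_HOME is a hypothetical variable: it defaults to a scratch directory
# so the sketch can be tried safely; run with APP_HOME="$HOME" for real use.
APP_HOME="${APP_HOME:-$(mktemp -d)}"
BIN_DIR="$APP_HOME/bin/pdftotext"

# Create bin and bin/pdftotext, and make the latter writable by all.
mkdir -p "$BIN_DIR"
chmod 777 "$BIN_DIR"

# Write the wrapper script named pdftotext that calls the renamed binary.
# The unquoted EOF lets $BIN_DIR expand now; \$1 stays literal for later.
cat > "$BIN_DIR/pdftotext" <<EOF
#!/bin/sh
"$BIN_DIR/pdftotext.script" "\$1" -
EOF
chmod 755 "$BIN_DIR/pdftotext"

# Once the Xpdf binary has been uploaded as pdftotext.script, make it executable.
if [ -f "$BIN_DIR/pdftotext.script" ]; then
  chmod 755 "$BIN_DIR/pdftotext.script"
fi

echo "Wrapper written to $BIN_DIR/pdftotext"
```

Writing the wrapper with the full path baked in avoids relying on PATH being set when the web server runs it.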

Setting up search_files in Drupal 6

  • Go to the search_files project page and download the module
  • Upload the module
  • Go to Admin - Site Building - Modules and activate the module
  • You may need to adjust your permissions to let you use the module; do that if needed
  • Go to the /admin/settings/search_files/helpers page and click "PDF"
  • In the Helper Path box, put in:
/home/YOURBLUEHOSTUSERNAME/bin/pdftotext/pdftotext %file% -
  • Click Update. Please notice the - at the end; it's needed. I have it in both the helper line and the script. It's probably not needed in both, but it DOES work this way
  • Now, find the Directories page (/admin/settings/search_files/directories) and start adding directories where you have those PDF files that you would like to have indexed, such as /home/YOURBLUEHOSTUSERNAME/public_html/files - making sure that you use the full server path
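
Before adding a directory, it can help to confirm from the shell that the path really is a full server path and that it contains PDFs. The helper below is a rough sketch; check_pdf_dir is a name I made up, it is not part of search_files, and it only matches the lowercase .pdf extension.

```shell
#!/bin/sh
# Sketch: sanity-check a directory before adding it to search_files.
# check_pdf_dir is a hypothetical helper, not part of the module.
check_pdf_dir() {
  dir="$1"
  case "$dir" in
    /*) ;;  # starts at the filesystem root, so it is a full server path
    *)  echo "not a full server path: $dir" >&2; return 1 ;;
  esac
  if [ ! -d "$dir" ]; then
    echo "no such directory: $dir" >&2
    return 1
  fi
  # Count PDF files in the directory and its subdirectories.
  find "$dir" -type f -name '*.pdf' | wc -l | tr -d ' '
}
```

For example, check_pdf_dir /home/YOURBLUEHOSTUSERNAME/public_html/files should print the number of PDFs found, while a path such as sites/all/files is rejected because it does not start from the server root.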

Run it!

I took advantage of this time as an opportunity to set up my cron job on Bluehost, and then cleared my cache in Drupal, ran cron, and watched it find hundreds of files for me.
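
For reference, a Drupal 6 cron job usually just requests cron.php on your site. The snippet below prints a crontab line you could adapt; the hourly schedule, the example.com address, and wget being available on your host are all assumptions on my part.

```shell
#!/bin/sh
# Sketch: print a crontab line that requests Drupal's cron.php once an hour.
# SITE_URL is a placeholder; replace it with your own site address.
SITE_URL="${SITE_URL:-http://www.example.com}"
CRON_LINE="0 * * * * wget -O - -q -t 1 $SITE_URL/cron.php"
echo "$CRON_LINE"
```

If your host's control panel asks for the schedule and the command separately, only the part starting at wget goes in the command field.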

You MAY need to create a new custom php.ini file for your Drupal installation (Bluehost has a utility to create a default one in cPanel) and increase the limits so that you don't run into memory allocation or timeout issues.
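
The two php.ini settings most likely to matter here are memory_limit and max_execution_time. As a rough sketch, this helper shows what a given php.ini currently sets for them; show_php_limits is a hypothetical name, and where your php.ini lives depends on your account.

```shell
#!/bin/sh
# Sketch: show the memory and time limits set in a php.ini file.
# show_php_limits is a hypothetical helper; pass it the path to your php.ini.
show_php_limits() {
  # Print only uncommented memory_limit and max_execution_time lines.
  grep -E '^[[:space:]]*(memory_limit|max_execution_time)[[:space:]]*=' "$1"
}
```

Run it as show_php_limits /path/to/your/php.ini, then raise the values in a text editor if indexing runs out of memory or time.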

Hope this helps, if you have any questions, go ahead and leave a comment!

Comments

Searching PDF files is pretty important and I believe Bluehost handles this pretty well. You definitely need to set up a cron job to ease the manual workload.


Check out ps2ascii on Bluehost. It also converts PDFs to text:

NAME
       ps2ascii - Ghostscript translator from PostScript or PDF to ASCII

SYNOPSIS
       ps2ascii [ input.ps [ output.txt ] ]
       ps2ascii input.pdf [ output.txt ]


You say to add the directory where the files are, as follows:
/home/YOURBLUEHOSTUSERNAME/public_html/files
I'm testing locally, searching txt files.
I have mine as /sites/all/files
but when I search for a file nothing comes up, even though I have a file with the keywords in it.
PLEASE HELP.


Conrad,

You are using the path from document root. You need to use the path from system root. It will look like this:

on unix:
/home/[yourusername]/public_html/sites/all/files

on Windows:
C:\Inetpub\wwwroot\sites\all\files


Thank you very much. Now when I try to run cron, I'm getting the following error:

user warning: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'LIMIT 0, 10' at line 7 query: SELECT f.* FROM files AS f LEFT JOIN search_dataset AS d ON d.`type` = 'attachment' AND d.`sid` = f.`fid` WHERE (d.`sid` IS NULL OR d.`reindex` <> 0) ORDER BY d.`reindex` ASC, f.`fid` ASC; LIMIT 0, 10 in C:\xampp\htdocs\dot\sites\all\modules\search_files\search_files_attachments.module on line 156.


OK, I disabled the attachments part and the error went away. Now when I search for a file there are no results. Please help.


Hello, and thanks for the solution up front.

I tried what you described, but it seems that I am too stupid to understand it.
I tried it as far as I understand it, but it still doesn't work.

So I listed the things I don't understand.

1. Un-archive it, and get rid of everything other than the pdftotext file

? Which one? There is a pdftotext.1 and a pdftotext.cc and some other pdftotext files that don't seem to be important.

2. Rename this file to pdftotext.script

? Still not knowing which, I used the .cc file

3. create a shell script (flavor of your choosing) that calls this pdftotext.script file, for example:
#!/bin/sh
/home/YOURBLUEHOSTUSERNAME/bin/pdftotext/pdftotext.script $1

? I copied the snippet and made a file shellscript.sh. That worked so far, but I don't know what to do with it. Where do I put it?
How do I execute it?

4. Upload both pdftotext and pdftotext.script to the /bin/pdftotext directory, and make them executable (chmod 755 should work)

? I thought I have only one file, pdftotext.script?

Please help me, I really need to use this module and I don't see any way to do it...

Thanks for your time

Ben


I'm having trouble setting up the directories.

I've added three directories where PDFs are stored and used the model referenced in the entry above:

/home/YOURBLUEHOSTUSERNAME/public_html/files

substituting my user name for YOURBLUEHOSTUSERNAME and my file directory for "files"

The module sees my listings, but clearly I'm pointing to the directories improperly, as it lists no files:

Last Index = 2009-11-22 16:52:32
Files indexed = 0
Files indexed and scheduled for reindexing = 0
Directory Rescan Age = 86400 [sec]
Next Directory (Re-)Scan at or after = 2009-11-23 16:52:32
Number of Directories configured = 3
Files found in configured Directories and Subdirectories = 0
Files without index attempt = 0
Update index

I ran cron and pressed the Update Index button, and tried adjusting my path to be relative and absolute. I wasn't completely sure what to enter for the URI path; I tried several different things, including the full URL to my PDFs and the full URL to the pdftotext directory in /bin.

I was able to run the pdftotext helper from the shell.

Any help would be greatly appreciated.