PDFs and SharePoint: What Is Recommended?

Scanning PDFs? 5 Steps to Follow

SharePoint PDF Scanning

Capturing PDF Files in SharePoint

Whether you are scanning to SharePoint, capturing pre-existing images, or creating searchable PDFs, there are several capabilities you should make sure your capture software supports. Below is a laundry list:

  1. PDF + Hidden Text is the preferred format. Most scanning devices and applications will let you create PDFs, but note that these are image-only PDFs and are not searchable. The de facto standard in the imaging industry right now is the PDF Image + Hidden Text format. It requires a capable OCR engine to produce the text layer, and it is what I call a “suitcase” document: it contains a pristine image plus a hidden text layer for search.
  2. Ensure your document capture software can import PDF files. Just about every organization has pre-existing scanned PDF files. In almost every case, these are image-only PDFs, so they cannot be searched or crawled through the PDF iFilter in SharePoint. If your capture application can import and process PDFs, you can harvest these documents, extract metadata, and OCR them to create searchable PDFs in the PDF Image + Hidden Text format.
  3. Require the ability to create and populate custom PDF headers. PDF headers allow custom metadata to be built into the PDF file itself. Why is this necessary? Once again, I go back to the “suitcase” analogy: you always want to pack everything you need. If you create a searchable PDF and pack metadata into the headers, the file becomes an all-inclusive data package. Headers speed up search and give you flexibility if you ever export files or import your PDFs into another system.
  4. Require support for PDF/A. PDF/A is the ISO standard (ISO 19005) for PDF, and its goal is a file format suitable for long-term archiving. Ensure your capture software can produce it.
  5. Name the PDF using metadata. Most scanning technologies can take extracted metadata and automatically concatenate the values into a file name that follows your custom naming standard.
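
As a rough illustration of item 5, here is a minimal Python sketch of the concatenation step. It is not taken from any particular capture product: the `build_pdf_name` helper, the field names, and the sample values are all hypothetical, and the list of rejected characters is my assumption for SharePoint 2010, so check it against the version you run.

```python
import re

# Characters assumed to be rejected in SharePoint 2010 file names, plus
# whitespace and hyphens so that any run of them collapses to one hyphen.
UNSAFE_RUN = re.compile(r'[~"#%&*:<>?/\\{|}\s-]+')


def build_pdf_name(metadata, fields, separator="_"):
    """Join the chosen metadata values into a single, safe PDF file name."""
    parts = []
    for field in fields:
        value = str(metadata.get(field, "")).strip()
        # Replace each run of unsafe characters with a single hyphen.
        value = UNSAFE_RUN.sub("-", value)
        # Drop leading/trailing hyphens and periods left over from cleaning.
        value = value.strip("-.")
        if value:  # skip fields that are missing or empty
            parts.append(value)
    if not parts:
        raise ValueError("no usable metadata values for the file name")
    return separator.join(parts) + ".pdf"


# Hypothetical metadata, as an extraction/OCR step might hand it over.
example = {
    "doc_type": "Invoice",
    "vendor": "Contoso Ltd",
    "number": "INV/2011#0042",
    "date": "2011-03-15",
}

print(build_pdf_name(example, ["doc_type", "vendor", "number", "date"]))
# prints: Invoice_Contoso-Ltd_INV-2011-0042_2011-03-15.pdf
```

Putting the most selective values (document type, vendor) first keeps related files together when a library is sorted by name. Missing fields are skipped rather than written as empty placeholders, which is a design choice you may want to reverse if your naming standard needs fixed positions.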


Anything I missed? Please add it in the comments.

Posted in imaging, OCR, office 365, PDF, scanning, sharepoint 2010
