Displaying specific PDF file content in search results

Question

5.26K views · November 7, 2013 · SP2010, SP2013

Haniel Croitoru · November 7, 2013 · 0 Comments

Hi,

Today I was faced with an interesting question regarding SharePoint’s search capabilities. I need to customize the search results so that the content of PDF files returned by the search is displayed in a specific format.

Suppose there is a document library of patient records in PDF format. Each PDF would contain (amongst other information) the patient ID, DOB, address, and hair colour. Now, if I searched for a complete or partial patient ID I would like the results to display as follows:

123431
DOB: 10/30/1973
Address: 123 Main Street, Mainsville, MA
Hair Color: Black

582654
DOB: 11/07/2001
Address: 234 Front Street, Toronto, ON
Hair Color: Blonde

The Patient ID would be found inside the PDF and would be a link to the document. The other fields would be picked up from inside the PDF. I know that the search results can be formatted using XSLT in SharePoint 2010 and using the GUI in SharePoint 2013. However, can SharePoint be set up to read data from inside the result documents and display it in the results?

Thanks,
-Haniel


8 Answers

score 0 · Answer 1 · 2013-11-08T04:03:00+00:00

Hey Haniel,

Sounds like you’re going to need a SharePoint-to-LiveLink search connector, so that SharePoint can crawl the documents in LiveLink (do they have to be in LiveLink?)

This guy did some work around this for SP2007.

BAInsight do solid work and have a connector for LiveLink.

But you’re still going to need to think about how you get the specific fields out of the content. Many of the paperless-office oriented products do this. Might be worth looking in that space too.

FYI, we’re using SeeUnity’s OpenText eDOCS connector for FAST to do the same thing. The BAInsight product is probably a bit better and definitely faster.

Regards
Craig

score 0 · Answer 2 · 2013-11-08T03:55:00+00:00

Hi Craig,

Thanks for the response. The story gets a bit more interesting. The documents will actually be stored in LiveLink and the SharePoint search will need to access the documents there. I need to find out about the structure and consistency of the metadata in the PDFs. One alternative may be to have some process external to SharePoint/LiveLink read the PDF and generate an XML file with all the metadata. When the PDF is uploaded into LiveLink, the XML would be associated with it, and then when SharePoint performs the search, the results page would look at the XML and display the metadata captured in it.
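
As a rough illustration of the XML-generation step in that external process, here is a minimal Python sketch. It assumes the field values have already been pulled out of the PDF by some other tool. The element names (PatientRecord, PatientID, DOB, Address, HairColor) and the function name are hypothetical, not something LiveLink or SharePoint prescribes.

```python
import xml.etree.ElementTree as ET


def build_sidecar_xml(fields: dict) -> str:
    """Serialize already-extracted PDF fields into a small XML document.

    The root element name and the child element names are assumptions
    made for this sketch only.
    """
    root = ET.Element("PatientRecord")
    for name, value in fields.items():
        child = ET.SubElement(root, name)
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Sample values taken from the first result in the question.
xml_text = build_sidecar_xml({
    "PatientID": "123431",
    "DOB": "10/30/1973",
    "Address": "123 Main Street, Mainsville, MA",
    "HairColor": "Black",
})
print(xml_text)
```

How the XML file gets associated with the PDF in LiveLink, and how the results page reads it, is outside the scope of this sketch.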

Regards,
-Haniel

score 0 · Answer 3 · 2013-11-08T00:16:00+00:00

Hey Haniel,

SharePoint itself won’t pick up specific content from inside documents (other than as generic document body content), with the exception of some clever stuff for Office documents (it can pick up titles etc.). I suspect you’ll need to write a pipeline enhancement (Content Enrichment Web Service), which has the ability to read PDFs and hunt out the specific content you’re wanting to extract. Hopefully it’s delimited somewhat consistently.
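
To give a feel for the "hunt out the specific content" part, here is a minimal Python sketch of the parsing step only. It is not the enrichment service itself, and it assumes the PDF text has already been extracted by some other tool and that each field sits on its own line behind a label such as "Patient ID:" or "DOB:". Those labels, the field names, and the function name are assumptions; the real documents may be laid out quite differently.

```python
import re

# Assumed label patterns; adjust to however the real PDFs delimit fields.
FIELD_PATTERNS = {
    "PatientID": re.compile(r"Patient ID:\s*(\d+)"),
    "DOB": re.compile(r"DOB:\s*(\d{2}/\d{2}/\d{4})"),
    "Address": re.compile(r"Address:\s*(.+)"),
    "HairColor": re.compile(r"Hair Colou?r:\s*(\w+)"),
}


def extract_fields(text: str) -> dict:
    """Return the first match for each field, or None when a label is missing."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1).strip() if match else None
    return result


# Hypothetical text as it might come out of a PDF text extractor.
sample_text = """Patient ID: 123431
DOB: 10/30/1973
Address: 123 Main Street, Mainsville, MA
Hair Color: Black
"""
fields = extract_fields(sample_text)
print(fields)
```

Fields whose label is not found come back as None, which makes inconsistently delimited documents easy to spot before anything is pushed into the index.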

Val blogged about it a couple of months ago.

You may also be able to get some of the fields with a custom entity extractor, but your mileage may vary depending on your content.

Alternatively, you could separately inspect the content (either manually or using various tagging systems like Pingar) and either inject metadata into the PDF (there are a number of fields available, though I don’t know how many SP Search picks up) or add as metadata in SP. I guess it depends on your volumes…

Good luck, it’s not for the faint of heart 🙂

FYI We’ve done similar things in SP2010 with FAST and will need to port to SP2013 next year…

Regards
Craig
