My company has a public-facing SharePoint 2010 publishing site that is configured for anonymous access. We also have an internal SharePoint 2010 portal. The external site lives in the DMZ and the internal site lives inside our network.
I need to be able to crawl the external site from the internal portal.
To this end, I created a content source in the internal portal’s search service application for the external website with the following settings:
- type of content to be crawled = SharePoint Sites
- start address = http://www.domain
- crawl everything under the hostname for each start address
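For anyone who prefers scripting this over clicking through Central Administration, the same content source can be created from the SharePoint 2010 Management Shell. This is only a sketch; the search service application name and content source name below are placeholders:

```powershell
# Load the SharePoint snap-in if not already present
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# "Search Service Application" is a placeholder - use your SSA's actual name
$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"

# CrawlVirtualServers corresponds to "crawl everything under the
# hostname for each start address" in the UI
New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
    -Name "External Publishing Site" `
    -Type SharePoint `
    -StartAddresses "http://www.domain" `
    -SharePointCrawlBehavior CrawlVirtualServers
```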
However, crawls return this warning:
http://www.domain
The URL was permanently moved. ( URL redirected to http://www.domain/pages/default.aspx )
This reminds me – I have a URL Rewrite mapping in IIS that points “/” to “/pages/default.aspx”. Unfortunately, when I remove that mapping, I still receive the same warning.
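For context, a root-to-welcome-page redirect in IIS URL Rewrite typically looks something like this in web.config (the rule name is arbitrary):

```xml
<system.webServer>
  <rewrite>
    <rules>
      <!-- Redirect requests for the site root to the welcome page -->
      <rule name="RootToDefault" stopProcessing="true">
        <match url="^$" />
        <action type="Redirect" url="/pages/default.aspx" redirectType="Permanent" />
      </rule>
    </rules>
  </rewrite>
</system.webServer>
```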
If I edit the content source and replace the start address with “http://www.domain/pages/default.aspx”, I receive this error:
http://www.domain/pages
Access is denied. Verify that either the Default Content Access Account has access to this repository, or add a crawl rule to crawl this repository. If the repository being crawled is a SharePoint repository, verify that the account you are using has “Full Read” permissions on the SharePoint Web Application being crawled.
Sure enough, browsing to “http://www.domain/pages” prompts me for a login and then gives me a 401 unauthorized.
But I don’t want the crawler to go to http://www.domain/pages. I want it to start at “http://www.domain/pages/default.aspx”.
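In theory, crawl rules can express exactly that: rules are matched in order, so an inclusion rule for the specific page can sit in front of an exclusion rule for the rest of the path. A sketch from the Management Shell, again with the SSA name as a placeholder:

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"

# Rules apply in order: include the welcome page first,
# then exclude everything else under /pages
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://www.domain/pages/default.aspx" -Type InclusionRule
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://www.domain/pages/*" -Type ExclusionRule
```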
Other things I tried that didn’t help:
- monkeyed around a fair bit with robots.txt on the external website
- dabbled with configuring Crawl Rules
- tried setting the content source to “type of content to be crawled = Web Sites”
- checked the ULS logs, which aren’t helping me on this one
- searched the Google
Any clues? What am I missing?
I had some time to get back to this today and I’m happy to say that I resolved the issue. So, for anyone who experiences the same problem – I hope this helps!
Because I set up my content source as a “SharePoint Site”, the crawler used the default content access account. However, the site being crawled was in a different farm and on a different domain, so that account had no access there; this content source needed different credentials.
The first time I tried to enter credentials under “Specify a different content access account” in my crawl rule for this content source, I received the message “The username or password is not valid”. The culprit: I had left “Do not allow Basic Authentication” checked.
Disregarding the warning, I unchecked this box and was able to enter the target farm’s content access account credentials.
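The same fix can be scripted. New-SPEnterpriseSearchCrawlRule accepts Basic credentials directly; the account name and password below are placeholders for the target farm’s content access account:

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"

# Placeholder credentials - substitute the target farm's
# content access account
$password = ConvertTo-SecureString "P@ssw0rd" -AsPlainText -Force

# BasicAccountRuleAccess is the scripted equivalent of unchecking
# "Do not allow Basic Authentication" and supplying an account
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://www.domain/*" `
    -Type InclusionRule `
    -AuthenticationType BasicAccountRuleAccess `
    -AccountName "EXTDOMAIN\crawlaccount" `
    -AccountPassword $password
```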
Crawled again and voilà ~ it worked!
Still not sure why I couldn’t crawl this internet-facing SharePoint 2010 site as a website, but oh well, I’m good now.