Protocol Agnostic Robots Sitemap

by Itai   Last Updated June 27, 2017 17:04 PM

Recently, I have enabled all my servers to serve everything over HTTP and HTTPS. Users can access any site via http://www.example.com or https://www.example.com. All pages are identical between the versions, so http://www.example.com/about.php is the same as https://www.example.com/about.php and so on.

URLs are relative, so they do not mention the protocol, with one exception. In other words, if a page is loaded over HTTP, it links to other pages, images, CSS, and JavaScript over HTTP, and likewise over HTTPS, so as to avoid mixed content warnings.

Now about that exception. It is in robots.txt:

Sitemap: http://www.example.com/sitemap.php

Apparently this URL must be absolute.

The problem I see is that when Google reads https://www.example.com/robots.txt it gets an HTTP sitemap! The documentation on robots.org says that one can specify multiple sitemaps, but I am not sure that listing both the HTTP and HTTPS sitemaps is a good idea, since each will contain a list of identical pages (one with HTTP and one with HTTPS).

How should Sitemap in robots.txt be handled for websites that accept HTTP and HTTPS?

Some ideas that came to mind:

  • Specify both sitemaps (as mentioned above). I am afraid this would cause duplicate content issues.
  • Only specify the HTTPS Sitemap. That gives access to all unique pages anyway.
  • Find a magical (Apache) way to send a different robots.txt via HTTP and HTTPS. Is that even possible? Could it cause issues?


Answers 2


http://www.example.com/about/
http://www.example.com/about
http://example.com/about/
http://example.com/about
https://www.example.com/about/
https://www.example.com/about

Google has been handling this kind of duplicate content for many years, so first of all, don't worry about duplicate content issues.

It is totally fine to serve the HTTP and HTTPS versions of a site at the same time, especially when you're migrating from HTTP to HTTPS. Stack Overflow also did that in the past.

Google will index only one version of each page, meaning it will not index both http://www.example.com/about.php and https://www.example.com/about.php. Most of the time, it will choose HTTPS by default.

Again, there is no need to add your sitemap to robots.txt, especially when you are thinking about Google (it is not ask.com), because it gives us the option to submit a sitemap in its webmaster tools. So create two properties in Search Console, http://www.example.com and https://www.example.com, and submit the corresponding sitemap for each.

I don't know why you're so concerned about the sitemap, robots.txt, and all that. Google can crawl and index any website without a sitemap. For example, Wikipedia does not have a sitemap, but it is crawled often because it has a good internal link structure.

Goyllo
June 27, 2017 18:14 PM

A sitemap at http://www.example.com/sitemap.php can only contain URLs from http://www.example.com/.¹ The scheme and the host must be the same.
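To make the rule concrete, here is a minimal Python sketch that checks whether a page URL may be listed in a given sitemap. The function name `in_sitemap_scope` is made up for illustration, and it only implements the scheme-and-host comparison described above; it does not consider cross submits, and it treats an explicit port (such as `:80`) as part of the host.

```python
from urllib.parse import urlsplit


def in_sitemap_scope(sitemap_url: str, page_url: str) -> bool:
    """Return True if page_url has the same scheme and host as sitemap_url.

    Hypothetical helper: only the scheme/host rule is checked here.
    """
    sitemap = urlsplit(sitemap_url)
    page = urlsplit(page_url)
    # urlsplit already lowercases the scheme; host names are
    # case-insensitive, so lowercase the netloc before comparing.
    return (sitemap.scheme, sitemap.netloc.lower()) == (
        page.scheme,
        page.netloc.lower(),
    )
```

With this check, `http://www.example.com/about.php` is in scope for the HTTP sitemap, while the HTTPS variant of the same page is not.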

So if you 1) want to provide sitemaps for both protocols, and 2) link both sitemaps via the Sitemap field in the robots.txt, you have to provide separate robots.txt files for HTTP and HTTPS:

#        http://www.example.com/robots.txt

Sitemap: http://www.example.com/sitemap.php

#        https://www.example.com/robots.txt

Sitemap: https://www.example.com/sitemap.php

(It should be easy to achieve this with Apache; see, for example, the answers to "Is there a way to disallow crawling of only HTTPS in robots.txt?")
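If robots.txt is generated dynamically rather than rewritten by Apache, the same idea can be sketched as a small Python WSGI application. This is only an illustration under assumptions: `robots_app` and the `HOST` constant are hypothetical names, and the code relies on the standard WSGI key `wsgi.url_scheme`, which may always report `http` if TLS is terminated by a proxy in front of the application.

```python
HOST = "www.example.com"  # assumed canonical host name, not read from the request


def robots_app(environ, start_response):
    """Serve a robots.txt whose Sitemap line matches the request scheme."""
    # WSGI servers set this key to "http" or "https" for each request.
    scheme = environ.get("wsgi.url_scheme", "http")
    body = f"Sitemap: {scheme}://{HOST}/sitemap.php\n".encode("utf-8")
    start_response(
        "200 OK",
        [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]
```

A request over HTTPS then receives the HTTPS sitemap URL and a request over HTTP receives the HTTP one, which matches the two files shown above.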

But you might want to provide a sitemap only for the canonical variant (e.g., only for HTTPS), because there is not much point in letting search engines parse the sitemap for the non-canonical variant, as they typically wouldn’t want to index any of its URLs. So if HTTPS should be canonical:

  1. On each HTTP page, link to its HTTPS version with the canonical link type.
  2. Provide a sitemap only on HTTPS, listing only the HTTPS URLs.
  3. Link the sitemap (ideally only) from the HTTPS robots.txt.
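Step 1 of the list above could be sketched in Python as follows. The helper name `canonical_link` is hypothetical, and the sketch assumes that the HTTPS page lives at the same host and path as the HTTP page, with no explicit port in the URL.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link(page_url: str) -> str:
    """Build a canonical <link> element pointing at the HTTPS variant."""
    parts = urlsplit(page_url)
    # Swap only the scheme; host, path, and query stay unchanged.
    https_url = urlunsplit(parts._replace(scheme="https"))
    return f'<link rel="canonical" href="{escape(https_url, quote=True)}">'
```

The returned element would go into the head of the HTTP page, for example `canonical_link("http://www.example.com/about.php")` for the about page.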

¹ Except if cross submits are used.

unor
June 28, 2017 11:29 AM
