Recently I configured all my servers to serve everything over both HTTP and HTTPS. Users can access any site via http://www.example.com or https://www.example.com. All pages are identical between the two versions, so http://www.example.com/about.php is the same as https://www.example.com/about.php and so on, with one exception.
Now about that exception. It is the Sitemap directive in robots.txt:

Sitemap: http://www.example.com/sitemap.php

Apparently this URL must be absolute.
Now the problem I see is that when Google reads https://www.example.com/robots.txt, it gets an HTTP sitemap! The sitemaps.org documentation says that one can specify multiple sitemaps, but I am not sure that listing both the HTTP and the HTTPS sitemap is a good idea, since each would contain a list of identical pages (one with HTTP URLs and one with HTTPS URLs).
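To be clear, listing both would mean a robots.txt like this (with the same sitemap.php as above):

Sitemap: http://www.example.com/sitemap.php
Sitemap: https://www.example.com/sitemap.php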
How should the Sitemap directive in robots.txt be handled for websites that are served over both HTTP and HTTPS?
Some ideas that came to mind:
Duplicates like these:

http://www.example.com/about/
http://www.example.com/about
http://example.com/about/
http://example.com/about
https://www.example.com/about/
https://www.example.com/about

are something Google has been handling for many years, so first of all, don't worry about a duplicate content issue.
It is totally fine to serve the HTTP and HTTPS versions of a site at the same time, especially while you are migrating from HTTP to HTTPS; Stack Overflow did that in the past as well.
Google will index only one version of your web page; that is, it will not index both http://www.example.com/about.php and https://www.example.com/about.php. Most of the time it will choose HTTPS by default.
And again, there is no need to add your sitemap file to robots.txt, especially where Google is concerned (it is not ask.com), because Google gives us the option to submit sitemaps through its webmaster tools. So create two properties in Search Console, one for http://www.example.com and one for https://www.example.com, and submit the corresponding sitemap to each.
I don't know why you are so concerned about the sitemap, robots.txt, and all that. Google can crawl and index any website without a sitemap. For example, Wikipedia does not have a sitemap, yet it is crawled often because it has a good internal link structure.
A sitemap at http://www.example.com/sitemap.php can only contain URLs from http://www.example.com/.¹ The scheme and the host must be the same.
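For instance, the sitemap at http://www.example.com/sitemap.php could only list HTTP URLs on that host (a minimal sketch; about.php is taken from the question):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/about.php</loc>
  </url>
  <!-- https://www.example.com/about.php would not be allowed here -->
</urlset>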
So if you 1) want to provide sitemaps for both protocols, and 2) link both sitemaps via the Sitemap field in the robots.txt, you have to provide separate robots.txt files for HTTP and HTTPS:
# http://www.example.com/robots.txt
Sitemap: http://www.example.com/sitemap.php

# https://www.example.com/robots.txt
Sitemap: https://www.example.com/sitemap.php
(It should be easy to achieve this with Apache; see, for example, the answers to Is there a way to disallow crawling of only HTTPS in robots.txt?)
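A minimal .htaccess sketch with mod_rewrite could look like the following; the file names robots-http.txt and robots-https.txt are placeholders of my own, not something taken from the linked answers:

RewriteEngine On
# Serve one file when the request came in over HTTPS ...
RewriteCond %{HTTPS} =on
RewriteRule ^robots\.txt$ /robots-https.txt [L]
# ... and another file for plain HTTP
RewriteCond %{HTTPS} =off
RewriteRule ^robots\.txt$ /robots-http.txt [L]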
But you might want to provide a sitemap only for the canonical variant (e.g., only for HTTPS), because there is not much point in letting search engines parse the sitemap for the non-canonical variant, as they typically wouldn’t want to index any of its URLs. So if HTTPS should be canonical:
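A sketch of what that could look like, assuming the same file names as in the examples above (only the HTTPS robots.txt references a sitemap):

# http://www.example.com/robots.txt
# (no Sitemap field)

# https://www.example.com/robots.txt
Sitemap: https://www.example.com/sitemap.php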
¹ Except if cross submits are used.