For background: I run statistical calculations over a large number of web pages, and this has the side effect of highlighting pages that deviate from the norm. So I end up going through many web pages that stand out from the pack at first glance.
The fetcher I use talks HTTP directly and deals leniently with the web servers out there that don't always implement HTTP according to spec. On this particular occasion, one web site, http://hairtype.naturallycurly.com, responded to the fetcher with content that was nowhere close to what the browser retrieved.
Here is what the HTML looked like:
<html lang="en">
<head>
<title>PHP Application - AWS Elastic Beanstalk</title>
<link href="http://fonts.googleapis.com/css?family=Lobster+Two" rel="stylesheet" type="text/css"></link>
<link href="https://awsmedia.s3.amazonaws.com/favicon.ico" rel="icon" type="image/ico"></link>
<link href="https://awsmedia.s3.amazonaws.com/favicon.ico" rel="shortcut icon" type="image/ico"></link>
<!--[if IE]><script src="http://html5shiv.googlecode.com/svn/trunk/html5.js"></script><![endif]-->
<link href="/styles.css" rel="stylesheet" type="text/css"></link>
</head>
<body>
<section class="congratulations">
<h1>
Congratulations!</h1>
Your AWS Elastic Beanstalk <em>PHP</em> application is now running on your own dedicated environment in the AWS Cloud<br />
You are running PHP version 5.4.20<br />
</section>
<section class="instructions">
<h2>
What's Next?</h2>
<ul>
<li><a href="http://docs.amazonwebservices.com/elasticbeanstalk/latest/dg/">AWS Elastic Beanstalk overview</a></li>
<li><a href="http://docs.amazonwebservices.com/elasticbeanstalk/latest/dg/create_deploy_PHP_eb.html">Deploying AWS Elastic Beanstalk Applications in PHP Using Eb and Git</a></li>
<li><a href="http://docs.amazonwebservices.com/elasticbeanstalk/latest/dg/create_deploy_PHP.rds.html">Using Amazon RDS with PHP</a>
<li><a href="http://docs.amazonwebservices.com/elasticbeanstalk/latest/dg/customize-containers-ec2.html">Customizing the Software on EC2 Instances</a></li>
<li><a href="http://docs.amazonwebservices.com/elasticbeanstalk/latest/dg/customize-containers-resources.html">Customizing Environment Resources</a></li>
</li>
</ul>
<h2>
AWS SDK for PHP</h2>
<ul>
<li><a href="http://aws.amazon.com/sdkforphp">AWS SDK for PHP home</a></li>
<li><a href="http://aws.amazon.com/php">PHP developer center</a></li>
<li><a href="https://github.com/aws/aws-sdk-php">AWS SDK for PHP on GitHub</a></li>
</ul>
</section>
<!--[if lt IE 9]><script src="http://css3-mediaqueries-js.googlecode.com/svn/trunk/css3-mediaqueries.js"></script><![endif]-->
</body>
</html>
This is the default AWS Elastic Beanstalk placeholder page, nowhere close to the HTML retrieved by the browser. You can try it: the real web page is about hair products.
My experience is that sometimes, based on the HTTP headers and originating IP, some web servers can return different content. Sometimes, the server has identified an IP as a bot and decided to return an error response or an outright wrong page.
So I tested the IP theory by running the fetcher from a different network, with a different outgoing IP. This time, the correct page was retrieved. Then I used curl to retrieve the page from the same network that had given me the incorrect page. To my surprise, curl retrieved the correct page. In fact, curl got the correct page from both networks.
This was quite puzzling. I thought that perhaps the web server had done some sophisticated fingerprinting and, having identified the User-Agent and maybe other headers the fetcher was using, had decided to send it a wrong page.
So, using Wireshark, I captured all the HTTP headers sent by the fetcher. Another team member then used curl, specifying these same headers:
curl -H 'User-Agent: rtw' -H 'Host: hairtype.naturallycurly.com' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Accept-Language: en-us,en;q=0.5' -H 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7' -H 'Keep-Alive: 115' -H 'Connection: keep-alive' -H 'Accept-Encoding: gzip,deflate' http://hairtype.naturallycurly.com
I was positive that curl would then fail. But of course it still returned the correct page. So my theory of sophisticated fingerprinting was wrong, or maybe the fingerprinting was even more sophisticated than I thought. I was stumped.
And then I realized that I had missed a crucial piece of data in this whole operation: the IP the fetcher used to get the page. The first thing the fetcher does is resolve the host name to an IP, and since DNS queries can be expensive and we do lots of them, the IP is retrieved from a memcached instance if it is available there. An IP may be cached for a number of hours. From the fetcher logs, I could see the IP that it was using:
DNS resolved from cache hairtype.naturallycurly.com -> /54.243.101.48
But as dig showed, that was the incorrect IP:
>>$ dig hairtype.naturallycurly.com
; <<>> DiG 9.3.6-P1-RedHat-9.3.6-20.P1.el5 <<>> hairtype.naturallycurly.com
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28108
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 4, ADDITIONAL: 4
;; QUESTION SECTION:
;hairtype.naturallycurly.com. IN A
;; ANSWER SECTION:
hairtype.naturallycurly.com. 300 IN CNAME secure-nc-04-2015-1845606936.us-east-1.elb.amazonaws.com.
secure-nc-04-2015-1845606936.us-east-1.elb.amazonaws.com. 60 IN A 23.23.197.30
secure-nc-04-2015-1845606936.us-east-1.elb.amazonaws.com. 60 IN A 54.225.215.76
;; AUTHORITY SECTION:
us-east-1.elb.amazonaws.com. 1703 IN NS ns-1119.awsdns-11.org.
us-east-1.elb.amazonaws.com. 1703 IN NS ns-1793.awsdns-32.co.uk.
us-east-1.elb.amazonaws.com. 1703 IN NS ns-235.awsdns-29.com.
us-east-1.elb.amazonaws.com. 1703 IN NS ns-934.awsdns-52.net.
;; ADDITIONAL SECTION:
ns-235.awsdns-29.com. 92612 IN A 205.251.192.235
ns-934.awsdns-52.net. 92612 IN A 205.251.195.166
ns-1119.awsdns-11.org. 92612 IN A 205.251.196.95
ns-1793.awsdns-32.co.uk. 92510 IN A 205.251.199.1
;; Query time: 11 msec
;; SERVER: 10.101.51.60#53(10.101.51.60)
;; WHEN: Fri Nov 22 12:40:20 2013
;; MSG SIZE rcvd: 345
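The comparison I did by eye between the fetcher log and the dig answer can be sketched in a few lines of Python. This is only an illustration, not the fetcher's code: the function names `resolve_ipv4` and `cached_ip_is_stale` are made up for this post, and the resolver is passed in as a parameter so the check can be tried without touching the network.

```python
import socket
from typing import Callable, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """Ask the system resolver for the current IPv4 addresses of hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4 the sockaddr is an (ip, port) tuple.
    return {info[4][0] for info in infos}


def cached_ip_is_stale(
    hostname: str,
    cached_ip: str,
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
) -> bool:
    """Return True when cached_ip is not among the freshly resolved addresses."""
    return cached_ip not in resolver(hostname)


if __name__ == "__main__":
    # Hypothetical stand-in resolver that returns the two A records from the
    # dig answer above, so the example runs offline.
    dig_answer = {"23.23.197.30", "54.225.215.76"}
    stale = cached_ip_is_stale(
        "hairtype.naturallycurly.com",
        "54.243.101.48",
        resolver=lambda _host: dig_answer,
    )
    print(stale)  # True
```

With the default resolver the answer depends on whatever DNS returns at that moment, so a load-balanced host that rotates its addresses can look stale even when the cache is only a minute old.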
All that remained now was to validate this far simpler hypothesis: that the fetcher was using a stale IP from the cache. It was trivial to do so. All I had to do was remove the domain->IP mapping from memcached.
>>$ telnet localhost 11211
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
get hairtype.naturallycurly.com
VALUE hairtype.naturallycurly.com 4096 4
6?e0
END
delete hairtype.naturallycurly.com
DELETED
get hairtype.naturallycurly.com
END
quit
Connection closed by foreign host.
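The odd-looking value `6?e0` in the session above is worth a second look. The `4` at the end of the VALUE line is the length in bytes, and my reading (an assumption, but one that matches the fetcher log) is that the cache holds the raw 4-byte IPv4 address, with telnet showing the one non-printable byte as `?`. A small Python sketch of that decoding, with `decode_cached_ipv4` being a name invented for this example:

```python
import socket


def decode_cached_ipv4(raw: bytes) -> str:
    """Turn a raw 4-byte IPv4 address into dotted-quad notation."""
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes, got %d" % len(raw))
    return socket.inet_ntoa(raw)


if __name__ == "__main__":
    # '6' is 0x36 (54), 'e' is 0x65 (101), '0' is 0x30 (48).
    # The byte shown as '?' is assumed to be 0xF3 (243).
    raw_value = bytes([0x36, 0xF3, 0x65, 0x30])
    print(decode_cached_ipv4(raw_value))  # 54.243.101.48
```

Under that assumption the cached value decodes to exactly the IP the fetcher log reported, which is none of the addresses dig returned.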
This time, the fetcher logs showed that it was indeed picking the correct IP, and of course it fetched the correct page with all the hair product details:
DNS resolved hairtype.naturallycurly.com -> /23.23.197.30
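Looking back at the dig output, the A records had a TTL of 60 seconds and the CNAME 300 seconds, while our cache could keep an entry for hours. One possible mitigation is to cap the cache lifetime by the record's TTL. Below is a rough sketch of that idea, not what the fetcher actually does: an in-process dictionary stands in for memcached, the class name `TtlDnsCache` and its parameters are invented for the example, and the clock is injectable so expiry can be exercised without waiting.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlDnsCache:
    """Hostname -> IP cache whose entries expire after min(record TTL, max TTL)."""

    def __init__(
        self,
        max_ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_ttl = max_ttl_seconds
        self._clock = clock
        # hostname -> (ip, absolute expiry time according to the clock)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def put(self, hostname: str, ip: str, record_ttl_seconds: float) -> None:
        """Store ip, keeping it no longer than the DNS record says it is valid."""
        ttl = min(record_ttl_seconds, self._max_ttl)
        self._entries[hostname] = (ip, self._clock() + ttl)

    def get(self, hostname: str) -> Optional[str]:
        """Return the cached IP, or None if it is missing or has expired."""
        entry = self._entries.get(hostname)
        if entry is None:
            return None
        ip, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[hostname]
            return None
        return ip


if __name__ == "__main__":
    fake_now = [0.0]  # mutable cell so the lambda below sees updates
    cache = TtlDnsCache(max_ttl_seconds=3600, clock=lambda: fake_now[0])
    cache.put("hairtype.naturallycurly.com", "23.23.197.30", record_ttl_seconds=60)
    print(cache.get("hairtype.naturallycurly.com"))  # 23.23.197.30
    fake_now[0] = 61.0
    print(cache.get("hairtype.naturallycurly.com"))  # None
```

The catch is that the TTL has to come from the DNS response itself, which the usual system resolver calls do not expose, so this trades fewer stale entries for more DNS queries and a more involved lookup.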
So once again, I was reminded of Occam's Razor and how important it is to:
1. Remember all the assumptions we make about how a certain software system works.
2. Validate all the assumptions, starting with the simplest first.
Happy debugging the Net!