A rel=canonical corner case for URLs with the same content
What is a canonical page?
A canonical page is the preferred version of a set of pages with very similar content. It is common for a site to list its products in alphabetical order and to offer the same list sorted by most searched or by rating, so several URLs end up showing essentially the same content.
For example,
http://www.hscripts.com/freeimages/icons/cartoons/barbie-clipart.php
http://www.hscripts.com/freeimages/icons/cartoons/barbie-clipart.php?item=5678barbie
If Google identifies duplicate URLs of this kind, it will index only the main URL and ignore the rest.
However, site owners can specify a canonical page to search engines by adding a <link> element with the attribute rel="canonical" to the <head> section of the non-canonical version of the page. Adding this link and attribute lets site owners identify sets of identical content and suggest to Google: "Of all these pages with identical content, this page is the most useful. Please prioritize it in search results."
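As a rough illustration, the tag for the second example URL above could be generated like this. It is only a sketch: the helper name canonical_link_tag is hypothetical, and it assumes that the query string merely selects a variant of the same content, which will not hold for every site.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link_tag(url: str) -> str:
    """Build a <link rel="canonical"> tag for the HEAD of a duplicate page."""
    parts = urlsplit(url)
    # Assumption: the query string and fragment only pick a variant of the
    # same content, so the canonical URL is the address without them.
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return f'<link rel="canonical" href="{escape(canonical, quote=True)}">'


variant = "http://www.hscripts.com/freeimages/icons/cartoons/barbie-clipart.php?item=5678barbie"
print(canonical_link_tag(variant))
# <link rel="canonical" href="http://www.hscripts.com/freeimages/icons/cartoons/barbie-clipart.php">
```

The printed tag would be placed in the HEAD of the page served at the variant URL, pointing back at the main page.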
Matt Cutts's discussion of the rel=canonical corner case is as follows:
"I sometimes get a question about whether Google will always use the url from rel=canonical as the preferred url. The answer is that we take rel=canonical urls as a strong hint, but in some cases we won’t use them:
- For example, if we think you’re shooting yourself in the foot by accident (pointing a rel=canonical toward a non-existent/404 page), we’d reserve the right not to use the destination url you specify with rel=canonical.
- Another example where we might not go with your rel=canonical preference: if we think your website has been hacked and the hacker added a malicious rel=canonical. I recently tweeted about that case. On the “bright” side, if a hacker can control your website enough to insert a rel=canonical tag, they usually do far more malicious things like insert malware, hidden or malicious links/text, etc.
I wanted to talk today about another case in which we won’t use rel=canonical. First off, here’s a thought exercise: should Google trust rel=canonical if we see it in the body of the HTML? The answer is no, because some websites let people edit content or HTML on pages of the site. If Google trusted rel=canonical in the HTML body, we’d see far more attacks where people would drop a rel=canonical on part of a web page to try to hijack it.
Okay, so now we come to another corner case where we probably won’t trust a rel=canonical: if we see weird stuff in your HEAD section. For example, if you start to insert regular text or other tags that we normally only see in the BODY of HTML into the HEAD of a document, we may assume that someone just forgot to close the HEAD section. We don’t allow rel=canonical in the BODY (because as I mentioned, people would spam that), so we might not trust rel=canonical in those cases, especially if it comes after the regular text or tags that we normally only see in the BODY of a page.
But in general, as long as your HEAD looks fairly normal, things should be fine. If you really want to be safe, you can make sure that the rel=canonical is the first or one of the first things in the HEAD section. Again, things should be fine either way, but if you want an easy rule of thumb: put the rel=canonical toward the top of the HEAD."
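The rule of thumb in the quote can be checked on your own pages with a small script. The sketch below is not Google's logic; it is a rough heuristic built on Python's standard html.parser, and the class name CanonicalAudit, the function audit_canonical, and the list of tags treated as normal HEAD content are all assumptions made for illustration. It reports each rel=canonical link together with a flag saying whether it sits inside the HEAD and before any stray text or BODY-style tags.

```python
from html.parser import HTMLParser

# Assumption: elements this sketch accepts as ordinary HEAD content.
HEAD_TAGS = {"title", "meta", "link", "script", "style", "base", "noscript", "template"}
# HEAD elements that may legitimately contain text.
TEXT_TAGS = {"title", "script", "style", "noscript", "template"}


class CanonicalAudit(HTMLParser):
    """Collect rel=canonical links and note whether each one looks safe."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.text_owner = None      # HEAD element whose text we are inside
        self.head_is_clean = True   # False once BODY-like content shows up
        self.findings = []          # list of (href, looks_safe) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
            return
        attr_map = dict(attrs)
        rel_values = (attr_map.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values:
            looks_safe = self.in_head and self.head_is_clean
            self.findings.append((attr_map.get("href"), looks_safe))
        elif self.in_head and tag not in HEAD_TAGS:
            self.head_is_clean = False  # e.g. a <p> or <div> inside the HEAD
        elif self.in_head and tag in TEXT_TAGS:
            self.text_owner = tag

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif tag == self.text_owner:
            self.text_owner = None

    def handle_data(self, data):
        # Bare text directly inside the HEAD counts as "weird stuff".
        if self.in_head and self.text_owner is None and data.strip():
            self.head_is_clean = False


def audit_canonical(html_text):
    parser = CanonicalAudit()
    parser.feed(html_text)
    parser.close()
    return parser.findings


page = "http://www.hscripts.com/freeimages/icons/cartoons/barbie-clipart.php"
good = f'<html><head><link rel="canonical" href="{page}"><title>A</title></head><body><p>Hi</p></body></html>'
bad = f'<html><head><title>A</title><p>stray text</p><link rel="canonical" href="{page}"></head><body></body></html>'
print(audit_canonical(good))  # [(page, True)]
print(audit_canonical(bad))   # [(page, False)]
```

A link flagged False is not necessarily ignored by a search engine; the flag only means the page departs from the "keep it near the top of a clean HEAD" advice.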