Tag Archives: spiders


Starting a Technical SEO Audit

This is already the 8th post in the series of my new marketing campaign and we’re still just in the setup phase. You may be wondering when I’ll get to keyword research or when I’ll talk about link building (which I’ll call relationship building when I get to it). Fear not! I most certainly will address those critical pieces of the SEO puzzle, but right now we’ve still got on-page SEO work to do.

So far we’ve purchased a domain name and set up hosting, Google Analytics and Google Webmaster Tools. In my opinion, there are two distinct overarching themes in SEO: the stuff you can control (on-site) and the stuff you can’t (off-site).

On-site or technical SEO is the process of making your website as user-friendly and as spider-friendly as possible by making it fast, easy to navigate and accessible to all visitors. It’s also important to use the proper code when necessary so that search engines don’t have to guess about the meaning of specific elements, such as images and videos.

Is my site indexed by Google?

I like to start a technical SEO audit by making sure the website can be discovered and indexed in its entirety by Google. It’s important to remember that behind the juggernaut that is Google is a collection of web pages. If your web pages aren’t in Google’s database (index) then any further efforts on optimizing your pages will be moot.

Read the Stanford research paper to see the vision that was to become Google.


Google can crawl and index the web at remarkable speed, so chances are your pages are already indexed. To see what is indexed you can do a site search.

Google returned 18 results (pages) indexed, but I want to compare this against my Google Webmaster Tools data, particularly my XML sitemaps.

Google Webmaster Tools Lag Time

Google Webmaster Tools (GWT) is showing that I’ve submitted a single XML sitemap that contains only 8 submitted and 7 indexed pages. I don’t have hard data to support this, but GWT always seems to lag behind the actual index in terms of displayed data. For this reason, I recommend that you use GWT as a way to start your investigation, but don’t rely on its data alone.

Even the Sitemaps and Index Status sections of GWT don’t agree with each other for the same website, so be careful about taking either at face value.

The data I trust the most is what I see in the site search, which shows that all of the pages I want indexed are indexed and none of the pages I don’t want indexed are. This means Googlebot crawled the site successfully and respected my meta robots tags. Thanks, Googlebot!

How else can we give Googlebot and other spiders directions?

The robots.txt file (I say it as “robots dot text”) is the first thing spiders look for. It gives them directions about which files and directories you do not want them to visit. A Disallow line is a request, not a command, asking the spider to stay out of the path that follows it.

Without this file, or if a directory is not listed in it, spiders will assume any directories they find are fair game and will attempt to crawl them. What you may not know is that there are thousands of spiders crawling the web, and not all of them come from search engines. So I’m going to take a look at the Wikipedia website’s robots.txt file to see if there are some additional parameters I want to add. (I chose Wikipedia because it probably gets a lot of spider traffic.)
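
To make the “request plus default” behavior concrete, here is a minimal Python sketch using the standard library’s urllib.robotparser. The two rules use agent names that appear in Wikipedia’s file, while example.com and the agent name SomeOtherBot are placeholders made up for illustration. It only shows how a well-behaved spider would read the file; a badly behaved one is free to ignore it.

```python
from urllib.robotparser import RobotFileParser

# A tiny robots.txt: one agent blocked everywhere, one explicitly allowed.
ROBOTS_TXT = """\
User-agent: WebCopier
Disallow: /

User-agent: IsraBot
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
page = "http://example.com/some-page/"  # placeholder URL, not a real site

# "Disallow: /" asks this agent to stay out of the whole site.
blocked = parser.can_fetch("WebCopier", page)

# An empty "Disallow:" means nothing is off limits for this agent.
allowed_empty = parser.can_fetch("IsraBot", page)

# An agent with no matching entry (and no "*" entry) is allowed by default.
unlisted = parser.can_fetch("SomeOtherBot", page)

print(blocked, allowed_empty, unlisted)  # False True True
```

The last line is the point of the paragraph above: any spider you do not mention is treated as welcome.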

I can see already that Wikipedia keeps tabs on spiders. And just because I don’t have as much content as Wikipedia doesn’t mean that I shouldn’t optimize user experience by eliminating unnecessary spider traffic.

To update my robots.txt file, I simply download the file, open it with a text editor and paste my new commands.

Here is what I intend to change my file to. You can also download rename-robots (rename it to robots.txt before uploading, and back up your existing file first!)

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: Zealbot
Disallow: /

User-agent: MSIECrawler
Disallow: /

User-agent: SiteSnagger
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: Fetch
Disallow: /

User-agent: Offline Explorer
Disallow: /

User-agent: Teleport
Disallow: /

User-agent: TeleportPro
Disallow: /

User-agent: WebZIP
Disallow: /

User-agent: linko
Disallow: /

User-agent: HTTrack
Disallow: /

User-agent: Microsoft.URL.Control
Disallow: /

User-agent: Xenu
Disallow: /

User-agent: larbin
Disallow: /

User-agent: libwww
Disallow: /

User-agent: ZyBORG
Disallow: /

User-agent: Download Ninja
Disallow: /

#
# Sorry, wget in its recursive mode is a frequent problem.
# Please read the man page and use it properly; there is a
# --wait option you can use to set the delay between hits,
# for instance.
#
User-agent: wget
Disallow: /

#
# The 'grub' distributed client has been *very* poorly behaved.
#
User-agent: grub-client
Disallow: /

#
# Doesn't follow robots.txt anyway, but...
#
User-agent: k2spider
Disallow: /

#
# Hits many times per second, not acceptable
# http://www.nameprotect.com/botinfo.html
User-agent: NPBot
Disallow: /

# A capture bot, downloads gazillions of pages with no public benefit
# http://www.webreaper.net/
User-agent: WebReaper
Disallow: /

SEO with Screaming Frog

If you’ve been following the series you’ll remember that we created and submitted XML sitemaps in the last post. Before I move into the spider simulation, let’s check on the status of the sitemap. Navigate to your Google Webmaster Tools page and make note of any messages or warnings you see there. In my case, there is nothing new to report. No news is good news!

The Google Webmaster Tools dashboard is a great tool to quickly spot potential issues such as crawl errors or problems with your sitemap. The middle chart depicts the number of times my site is showing up in Google search. It’s low, but that’s to be expected with a new campaign. My focus is on the right graph, which shows the status of my sitemap.

It looks like things are running smoothly in the sitemap department, but my next step is to simulate a crawl of my website using a tool called Screaming Frog.

Screaming Frog SEO Spider

The free version of Screaming Frog will crawl up to 500 URLs (pages), and since my website is new, that’s plenty. You can download Screaming Frog here: http://www.screamingfrog.co.uk/seo-spider/. To get an idea of what Screaming Frog can do, watch this video:

Starting from my home page, as directed, the spider will hop through hyperlinks much like Googlebot or another search engine spider would. Sitemaps and crawling a website are 2 ways that spiders can discover and report on content. For this exercise I’m going to filter my results to show only HTML. In other words, my pages. Screaming Frog found about 20 pages versus the 7 pages Google Webmaster Tools is showing in my sitemap.
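
As a rough illustration of what “hopping through hyperlinks” means, here is a small Python sketch of a breadth-first crawl. This is not how Screaming Frog or Googlebot is actually implemented. The link graph is a hypothetical in-memory stand-in for a site, so nothing is fetched over the network, and the page paths are made up.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "/": ["/about/", "/blog/"],
    "/about/": ["/"],
    "/blog/": ["/blog/post-1/", "/tag/spiders/"],
    "/blog/post-1/": ["/blog/"],
    "/tag/spiders/": ["/blog/post-1/"],
    "/orphan/": [],  # nothing links here, so a link-following crawl misses it
}


def crawl(start, links):
    """Follow links breadth-first from `start`; return pages in discovery order."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # visit each page only once
                seen.add(target)
                queue.append(target)
    return order


found = crawl("/", LINKS)
print(found)
# ['/', '/about/', '/blog/', '/blog/post-1/', '/tag/spiders/']
```

Notice that the crawl finds the tag page even though a sitemap might leave it out, and never finds the orphan page. That is one reason a crawl and a sitemap can report different page counts.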

What happened?

Robot Instructions

Earlier in the series I used Yoast’s WordPress SEO plugin to give the search engine spiders directions. I asked them (not ordered them) not to index certain types of pages: tags and categories. I also instructed my WordPress installation to exclude these page types from the sitemap. The image below shows that even though these pages were found, the directive “noindex” is present.
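
If you want to check a page for this directive yourself, here is a minimal Python sketch using the standard library’s html.parser. It assumes the directive is delivered as a standard robots meta tag in the page’s HTML. The sample markup is made up for illustration and is not copied from what the plugin actually outputs.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # "noindex, follow" -> ["noindex", "follow"]
        self.directives += [d.strip().lower() for d in content.split(",") if d.strip()]


def is_noindex(html):
    """Return True if the page asks spiders not to index it."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Made-up sample pages for illustration.
tag_page = '<html><head><meta name="robots" content="noindex,follow"/></head></html>'
post_page = "<html><head><title>A post</title></head></html>"

print(is_noindex(tag_page), is_noindex(post_page))  # True False
```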

Looking good! Let’s do one final check to make sure this piece of the puzzle really is falling into place. I mentioned earlier that directives like “noindex” are requests, not really commands. It’s up to the search engine to follow these requests, so I’m going to do a site search of this domain to see what Google has in its index. (Google is partly a big, really big, database.)

You can see what Google has indexed for a website by typing site:http://yourwebsite as the search string. In my case, it’s site:http://marketingchris.com.

I see that there are 14 pages indexed versus 7 pages in my sitemap. Now what happened? Google must have crawled my site prior to me updating the settings asking it to not include categories in its index. Not a big deal! I know that these instructions are in place now, so I just need to be patient and wait for the next crawl.
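
Here is a hedged Python sketch of that comparison: pull the URLs out of an XML sitemap and diff them against a list of indexed URLs. The sitemap snippet and the indexed list are invented for illustration (example.com is a placeholder). In practice you would paste in your own sitemap and the URLs you see in the site search.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sitemap for illustration.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about/</loc></url>
</urlset>
"""


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs listed in a sitemap."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)}


# Hypothetical URLs copied from a site: search.
indexed = {
    "http://example.com/",
    "http://example.com/about/",
    "http://example.com/category/seo/",
}

in_sitemap = sitemap_urls(SITEMAP_XML)
indexed_not_in_sitemap = sorted(indexed - in_sitemap)
in_sitemap_not_indexed = sorted(in_sitemap - indexed)

print(indexed_not_in_sitemap)  # ['http://example.com/category/seo/']
print(in_sitemap_not_indexed)  # []
```

Anything in the first list is a candidate for a leftover from an earlier crawl, like the category pages described above.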

Or not!

Fetch as Google

For the impatient SEO (like me), Google Webmaster Tools has a tool called Fetch as Google.

I want Googlebot to re-spider my content so it sees that some of my requests about what to index have changed. Please let me know if you have any questions and I’ll see you next time!