This is already the 8th post in my new marketing campaign series, and we’re still just in the setup phase. You may be wondering when I’ll get to keyword research or when I’ll talk about link building (which I’ll call relationship building when I get to it). Fear not! I most certainly will address those critical pieces of the SEO puzzle, but right now we’ve still got on-page SEO work to do.
So far we’ve purchased a domain name and set up hosting, Google Analytics and Google Webmaster Tools. In my opinion, there are two distinct overarching themes in SEO: the stuff you can control (on-site) and the stuff you can’t (off-site).
On-site or technical SEO is the process of making your website as user-friendly and as spider-friendly as possible by making it fast, easy to navigate and accessible to all visitors. It’s also important to use proper markup so that search engines don’t have to guess at the meaning of specific elements, such as images and videos.
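For example, descriptive alt text is one simple way to tell a spider what an image shows. The file name and alt text below are placeholders I made up, just to illustrate:

<img src="blue-widgets.jpg" alt="A stack of blue widgets ready to ship">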
Is my site indexed by Google?
I like to start a technical SEO audit by making sure the website can be discovered and indexed in its entirety by Google. It’s important to remember that behind the juggernaut that is Google sits a database (the index) of web pages. If your web pages aren’t in that index, then any further effort spent optimizing them will be moot.
Google can crawl and index the web at remarkable speed, so chances are your pages are already indexed. To see what is indexed, you can do a site: search in Google.
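For example, entering the following into Google’s search box returns only the pages from that domain that are currently in the index (substitute your own domain; example.com is just a placeholder, and you can append a path such as /blog to check a single directory):

site:example.com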
Google returned 18 results (pages) indexed, but I want to compare this against my Google Webmaster Tools data, particularly my XML sitemaps.
Google Webmaster Tools Lag Time
Google Webmaster Tools (GWT) is showing that I’ve submitted a single XML sitemap with 8 submitted pages, of which only 7 are indexed. I don’t have hard data to support this, but GWT always seems to lag behind the actual index in terms of displayed data. For this reason, I recommend that you use GWT as a way to start your investigation, but don’t rely on its data alone.
Even the Sitemaps and Index Status sections don’t match up for the same website, so be careful about taking the numbers at face value.
The data I trust the most is what I see in the site: search, which shows that all of the pages I want indexed are, and none of the pages I don’t want indexed appear. This means Googlebot crawled the site successfully and respected my meta robots tags. Thanks, Googlebot!
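For reference, this is what a meta robots tag looks like in the head of a page you don’t want indexed (a generic snippet, not copied from my site):

<meta name="robots" content="noindex, follow">

The noindex value asks search engines to keep the page out of their results, while follow still lets them crawl the links on it.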
How else can we give Googlebot and other spiders directions?
The robots.txt file (I say it as “robots dot text”) is the first thing spiders look for; it gives them directions about which files and directories you do not want them to visit. A Disallow directive is a request, not a command, asking the spider to skip the path that follows it.
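The syntax is simple: a User-agent line names a spider (or * for all of them), and each Disallow line beneath it lists a path you’re asking that spider to avoid. The /private/ directory below is just a placeholder:

# ask every spider to skip the /private/ directory
User-agent: *
Disallow: /private/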
Without this file, or if a directory is not listed in it, spiders will assume any directories they find are fair game and will attempt to crawl them. What you may not know is that there are thousands of spiders crawling the web, not just those from search engines. So I’m going to take a look at the Wikipedia website’s robots.txt file to see if there are some additional parameters I want to add. (I chose Wikipedia because it probably gets a lot of spider traffic.)
I can see already that Wikipedia keeps tabs on spiders. And just because I don’t have as much content as Wikipedia doesn’t mean that I shouldn’t optimize user experience by eliminating unnecessary spider traffic.
To update my robots.txt file, I simply download the file, open it with a text editor and paste my new commands.
Here is what I intend to change my file to. Alternatively, you can download rename-robots (rename it to robots.txt before uploading, but back up your existing file first!):
# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: Zealbot
Disallow: /

User-agent: MSIECrawler
Disallow: /

User-agent: SiteSnagger
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: Fetch
Disallow: /

User-agent: Offline Explorer
Disallow: /

User-agent: Teleport
Disallow: /

User-agent: TeleportPro
Disallow: /

User-agent: WebZIP
Disallow: /

User-agent: linko
Disallow: /

User-agent: HTTrack
Disallow: /

User-agent: Microsoft.URL.Control
Disallow: /

User-agent: Xenu
Disallow: /

User-agent: larbin
Disallow: /

User-agent: libwww
Disallow: /

User-agent: ZyBORG
Disallow: /

User-agent: Download Ninja
Disallow: /

#
# Sorry, wget in its recursive mode is a frequent problem.
# Please read the man page and use it properly; there is a
# --wait option you can use to set the delay between hits,
# for instance.
#
User-agent: wget
Disallow: /

#
# The 'grub' distributed client has been *very* poorly behaved.
#
User-agent: grub-client
Disallow: /

#
# Doesn't follow robots.txt anyway, but...
#
User-agent: k2spider
Disallow: /

#
# Hits many times per second, not acceptable
# http://www.nameprotect.com/botinfo.html
User-agent: NPBot
Disallow: /

# A capture bot, downloads gazillions of pages with no public benefit
# http://www.webreaper.net/
User-agent: WebReaper
Disallow: /
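Once the new file is uploaded, it’s worth confirming it’s live at the root of the domain. One quick way (assuming you have curl available, and substituting your own domain for example.com) is:

curl http://www.example.com/robots.txt

Keep in mind that spiders re-fetch robots.txt periodically rather than instantly, so the new rules may take a little while to be picked up.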