Search Engine Crawling and Indexing Factors
Today’s post is about getting a site crawled and indexed effectively by the major search engines. It can be frustrating for a site owner to find that her newly built site, with all its bells and whistles, simply isn’t appearing in the Google SERPs for queries relevant to her business.
Before a site can rank on the SERPs, it pays to understand the factors that influence how it gets crawled and indexed. A site can be built in a spider-friendly way that tells the crawlers what to crawl and how frequently to crawl it.
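As a concrete (and deliberately simplified) illustration, the two files most sites use to give crawlers those “what” and “how often” hints are robots.txt and an XML sitemap. The sketch below, with placeholder URLs and paths of my own choosing, writes a minimal version of each:

    # Minimal robots.txt and XML sitemap, written out by a small Python script.
    # The disallowed path and the URLs are placeholders, not recommendations for any real site.
    ROBOTS_TXT = """\
    User-agent: *
    Disallow: /checkout/
    Sitemap: https://www.example.com/sitemap.xml
    """

    SITEMAP_XML = """\
    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.com/</loc>
        <changefreq>daily</changefreq>
        <priority>1.0</priority>
      </url>
    </urlset>
    """

    for filename, content in (("robots.txt", ROBOTS_TXT), ("sitemap.xml", SITEMAP_XML)):
        with open(filename, "w", encoding="utf-8") as f:
            f.write(content)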
Posted by Ravi of Netconcepts Ltd. on 07/05/2009
Filed under: Search Engine Optimization, SEO, Spiders. Tags: backlinks, content freshness, crawling factors, domain importance, duplicate-content, external links, Feeds, google-webmaster-tools, increase crawl rate, Links, PageRank, query deserves freshness, search engine crawling, search engine indexing, signals, supplemental index, technical factors, unique content
Yahoo’s Recent Spider Improvement Beats Google’s
Yahoo!’s Search Blog announced yesterday that they are making some final changes to their spider, Slurp, standardizing their crawlers to provide a common DNS signature for identification and authorization purposes.
Previously, Slurp’s requests could come from IP addresses associated with inktomisearch.com; now they should all come from IPs whose hostnames follow this standard pattern:
[something].crawl.yahoo.net
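The point of a signature like that is to let site owners verify the bot: do a reverse DNS lookup on the requesting IP, check that the hostname ends in .crawl.yahoo.net, then forward-resolve that hostname and confirm it maps back to the same IP. A minimal Python sketch of that check (the function name and example IP are mine, purely illustrative):

    import socket

    def is_yahoo_slurp(ip_address: str) -> bool:
        """Verify a crawler claiming to be Slurp via reverse DNS plus a confirming forward lookup."""
        try:
            hostname, _, _ = socket.gethostbyaddr(ip_address)  # reverse lookup
        except socket.herror:
            return False
        if not hostname.endswith(".crawl.yahoo.net"):
            return False
        try:
            _, _, forward_ips = socket.gethostbyname_ex(hostname)  # forward confirmation
        except socket.gaierror:
            return False
        return ip_address in forward_ips

    # Example usage, with an IP address pulled from your access logs:
    # is_yahoo_slurp("74.6.8.1")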
Posted by Chris of Silvery on 06/06/2007
Filed under: Google, Spiders, Yahoo. Tags: bot-detection, bots, Googlebot, slurp, spidering, Spiders, user-agents
Dupe Content Penalty a Myth, but Negative Effects Are Not
I was interested to read Jill Whalen’s column this past week, “The Duplicate Content Penalty Myth,” at Search Engine Land. While I agree with her assessment that there really isn’t a duplicate content penalty per se, I think she left unaddressed one major issue affecting websites in this area.
Read on to see what I mean.
Posted by Chris of Silvery on 03/18/2007
Filed under: Best Practices, Content Optimization, Search Engine Optimization, SEO, Site Structure, Spiders, URLs. Tags: duplicate-content, Duplicate-Content-Penalization, Jill-Whalen, Search Engine Optimization, SEO, URL-Optimization
In other news, a new free Clinic
Search Engine Journal today opened a free SEO Clinic for sites in need of optimization or facing specific challenges they haven’t been able to overcome.
A group of leading SEOs, including Carsten Cumbrowski, Ahmed Bilal, and Rhea Drysdale, will review one submission per week, delivering a thorough review of usability and site navigation, link building, and copywriting from the perspective of placement in the four leading engines (Google, Yahoo!, MSN, and Ask).
It’s clear, though, that “free” here is free in the same way as having your site critiqued in one of the SEO clinics that experts like to host at conferences: if your site is chosen, the findings and recommendations will be posted for others to peruse. I’d do as much myself, and I appreciate their efforts to help others with these case studies, but as a website owner, someone responsible for SEO, or a marketing manager for a major brand, I might not be so inclined to have my successes and failures outlined in detail for everyone to see. That concern aside, I do hope they get some quality sites and develop a thorough library of reviews (perhaps I’ll sign up myself!).
To participate, simply contact the team here.
Posted by stephan of stephan on 02/27/2007
Filed under: Content Optimization, General, HTML Optimization, Link Building, PageRank, Search Engine Optimization, SEO, Site Structure, Spiders. Tags: help, optimization, review, Search Engine Optimization, SEO, SEO-consulting, SEO-critiques, website-design
AdSense Spider Cross-Pollinates for Google
A few bloggers, such as Jenstar, have just posted that pages spidered by Google’s AdSense bot are appearing in Google’s regular search results pages. Shoemoney blogged that Matt Cutts has officially confirmed this is happening, saying it was done so that Google wouldn’t have to spider the same content twice, and that it came as part of the recent Big Daddy infrastructure improvements.
This has a couple of interesting ramifications for SEO professionals and those of us optimizing our sites for Google: bot-detection systems may now need to be updated, and this may essentially be a new way of committing site/page submissions into Google’s indices. And we all thought automated URL submissions were dead! I’ll explain further…
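On the bot-detection point: log-analysis or crawler-whitelisting code that matches user agents may now need to treat the AdSense crawler, which identifies itself with the “Mediapartners-Google” user-agent string, the same way it treats Googlebot, since its fetches can end up in the main index. A rough sketch of that kind of check (the helper name is hypothetical):

    # User-agent tokens treated as Google crawlers; "Mediapartners-Google" is the
    # AdSense bot, grouped here with Googlebot because its crawls can feed the index.
    GOOGLE_CRAWLER_TOKENS = ("Googlebot", "Mediapartners-Google")

    def is_google_crawler(user_agent: str) -> bool:
        """Return True if the User-Agent header looks like one of Google's crawlers."""
        return any(token in user_agent for token in GOOGLE_CRAWLER_TOKENS)

    print(is_google_crawler("Mediapartners-Google/2.1"))  # True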
Posted by Chris of Silvery on 04/19/2006
Filed under: Google, Spiders. Tags: AdSense, bots, Googlebot, Robots.txt, Spiders, URL-submission
Bloody hell, that’s a lot of information
The technogeek euphoria I felt last month when Google doubled the size of its index quickly evaporated as I perused Berkeley’s “How Much Information” study. Here are some stats that will blow you away:
- The World Wide Web contains 167 terabytes of web pages on its “surface” (i.e. fixed web pages); by volume, that is seventeen times the size of the Library of Congress print collections. There are another 91,850 terabytes of data in the “deep Web” (database-driven websites that create web pages on demand).
- Email generates about 400,000 terabytes of new information each year worldwide.
- The amount of new information stored on paper, film, magnetic, and optical media has about doubled in the last three years.
- Print, film, magnetic, and optical storage media produced about 5 exabytes of new information in 2002. Ninety-two percent of the new information was stored on magnetic media, mostly in hard disks. Five exabytes of information is equivalent in size to the information contained in 37,000 new libraries the size of the Library of Congress book collections.
What I found even more amazing (and depressing) is the degree to which we consume this data. We are a society of information junkies. Witness this from the same report:
Published studies on media use say that the average American adult uses the telephone 16.17 hours a month, listens to radio 90 hours a month, and watches TV 131 hours a month. About 53% of the U.S. population uses the Internet, averaging 25 hours and 25 minutes a month at home, and 74 hours and 26 minutes a month at work — about 13% of the time.
I can’t imagine sitting in front of the ‘idiot box’ for 131 hours a month. What a terrible waste of one’s life. For an average person, that’s something like 7 years of your life — gone.
Dave of the excellent PassingNotes.com blog looks at it this way:
IF you were all of those things, then of the 720 average hours in a given month, of which you should be sleeping circa 200 (give or take a few hundred), then you’d basically be occupied by media (in some form) for over 330 hours per month – and since we spend about one-third of our lives ‘waiting for something to happen’ (bus, phone etc) and about another 20-40 hours per month in a bathroom (much higher for ted kennedy), then discount sleep, and you’ve got about 80ish hours to be a genuine, sentient human being…sad, sad world…
Posted by stephan of stephan on 12/13/2004
Filed under: Reference Material, Spiders
Is your site unfriendly to search engine spiders like MSNBot?
Microsoft blogger Eytan Seidman, writing on the MSN Search blog, offers some very useful specifics on what makes a site crawler-unfriendly, particularly to MSNBot:
An example of a page that might look “unfriendly” to a crawler is one that looks like this: http://www.somesite.com/info/default.aspx?view=22&tab=9&pcid=81-A4-76&section=848&origin=msnsearch&cookie=false…. URL’s with many (definitely more than 5) query parameters have a very low chance of ever being crawled…. If we need to traverse through eight pages on your site before finding leaf pages that nobody but yourself points to, MSNBot might choose not to go that far. This is why many people recommend creating a site map and we would as well.
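To audit your own URLs against that query-parameter guideline, a quick check along these lines works; the helper below is hypothetical, and the five-parameter threshold is simply the figure quoted in the post:

    from urllib.parse import parse_qsl, urlparse

    MAX_QUERY_PARAMS = 5  # the "definitely more than 5" threshold quoted above

    def crawl_friendly(url: str) -> bool:
        """Flag URLs whose query strings exceed the parameter count MSNBot is said to tolerate."""
        params = parse_qsl(urlparse(url).query, keep_blank_values=True)
        return len(params) <= MAX_QUERY_PARAMS

    example = ("http://www.somesite.com/info/default.aspx"
               "?view=22&tab=9&pcid=81-A4-76&section=848&origin=msnsearch&cookie=false")
    print(crawl_friendly(example))  # False: six parameters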
Posted by stephan of stephan on 11/21/2004
Filed under: Dynamic Sites, Spiders, URLs
Google’s index hits 8 billion pages. Yes folks, size does matter.
On Wednesday, the day before Microsoft unveiled the beta of Microsoft Search, Google announced that its index was now over eight billion pages strong. Impeccable timing from the Googleplex: had the announcement come just a couple of days later, Microsoft could have proudly touted a bigger web page index than Google’s. Still, an index of 5 billion documents is an impressive feat, particularly for a new search engine just out of the blocks. Google continues to show its market dominance, however, with a database of a whopping 8,058,044,651 web pages. Poor Microsoft, trumped by Google at the last minute!
Why is index size such a big deal? From the user’s perspective, a search engine that covers the Web comprehensively is going to be more useful than one whose indexation is patchy. That is why I think the Overture Site Match paid inclusion program from Yahoo! is a really bad idea: sites shouldn’t have to pay the search engine to be indexed. Rather, the search engine should strive to index as much of the Web as possible, because that makes for a better search engine.
Indeed, I see Google’s announcement as a landmark in the evolution of search engines. Search engine spiders have historically had major problems with “spider traps” — dynamic database-driven websites that serve up identical or nearly identical content at varying URLs (e.g. when there is a session ID in the URL). Alas, search engines couldn’t find their way through this quagmire without severe duplication clogging up their indices. The solution for the search engines was to avoid dynamic sites, to a large degree — or at least to approach them with caution. Over time, however, the sophistication of the spidering and indexing algorithms has improved to the point that search engines (most notably, Google) have been able to successfully index a plethora of previously un-indexed content and minimize the amount of duplication. And thus, the “Invisible Web” begins to shrink. Keep it up, Google and Microsoft!
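The heart of that de-duplication work is URL canonicalization: collapsing the many session-stamped variants a spider trap generates down to a single key before indexing. The sketch below shows the general idea only, not any engine’s actual method, and its list of session-parameter names is illustrative rather than exhaustive:

    from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

    # Common session-ID parameter names; illustrative, not exhaustive.
    SESSION_PARAMS = {"sid", "sessionid", "session_id", "phpsessid", "jsessionid", "oscsid"}

    def canonicalize(url: str) -> str:
        """Collapse near-duplicate URLs by stripping session-ID query parameters."""
        parsed = urlparse(url)
        kept = [(key, value)
                for key, value in parse_qsl(parsed.query, keep_blank_values=True)
                if key.lower() not in SESSION_PARAMS]
        return urlunparse(parsed._replace(query=urlencode(kept)))

    print(canonicalize("http://www.example.com/shop?cat=5&sid=ab12cd34"))
    # http://www.example.com/shop?cat=5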
Posted by stephan of stephan on 11/14/2004
Filed under: Google, Research and Development, Spiders
Google Store makeover still not wooing the spiders
You may recall my observation a few months ago that the Google Store is not all that friendly to search engine spiders, including Googlebot. Now that the site has had a makeover and the session IDs have been eliminated from the URLs, the many tens of thousands of duplicate pages have dropped to a mere 144. That’s a good thing, since there are only a small number of products for sale on the site. Unfortunately, a big chunk of those hundred-and-some search results leads to error pages. So even after a site rebuild, Google’s own store STILL isn’t spider friendly. And if you’re curious what the old site looked like, don’t bother checking the Wayback Machine: its bot has choked on the site since 2002, so all you’ll find for the past several years are “redirect errors”.
Posted by stephan of stephan on 10/05/2004
Filed under: Dynamic Sites, Google, Spiders
Spiders like Googlebot choke on Session IDs
Many ecommerce sites have session IDs or user IDs in the URLs of their pages. This tends either to keep the pages from getting indexed by search engines like Google, or to get the pages included over and over, clogging up the index with duplicates (this phenomenon is called a “spider trap”). Furthermore, having all these duplicates in the index causes the site’s importance score, known as PageRank, to be spread out across all the duplicates (this phenomenon is called “PageRank dilution”).
Ironically, Googlebot regularly gets caught in a spider trap while spidering one of Google’s own sites – the Google Store (where they sell branded caps, shirts, umbrellas, etc.). The store’s URLs are not very search engine friendly: they are overly complex and include session IDs. This has resulted in 3,440 duplicate copies of the Accessories page and 3,420 copies of the Office page, for example.
If you have a dynamic, database-driven website and you want to avoid your own site becoming a spider trap, you’ll need to keep your URLs simple. Try to avoid having any ?, &, or = characters in the URLs. And try to keep the number of “parameters” to a minimum. With URLs and search engine friendliness, less is more.
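A quick way to audit a URL against that rule of thumb is to count those characters and parameters. The helper below is hypothetical, and the two-parameter ceiling is an assumption on my part rather than any published limit:

    from urllib.parse import parse_qsl, urlparse

    def url_complexity(url: str) -> dict:
        """Count the dynamic-URL signals discussed above: ?, &, = characters and query parameters."""
        return {
            "special_chars": sum(url.count(ch) for ch in "?&="),
            "parameters": len(parse_qsl(urlparse(url).query, keep_blank_values=True)),
        }

    def looks_spider_friendly(url: str, max_params: int = 2) -> bool:
        """Assumed rule of thumb: at most a couple of query parameters."""
        return url_complexity(url)["parameters"] <= max_params

    print(url_complexity("http://www.example.com/store?cat=12&prod=98&sessionid=F00"))
    # {'special_chars': 6, 'parameters': 3}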
Posted by stephan of stephan on 06/25/2004
Filed under: HTML Optimization, Spiders