This week, we found some duplicate texts on our programmatic SEO site. This guide shows you how to spot, correct and avoid duplicate content on your pSEO website.
Last year, one of our niche blogger friends complained that a competitor had duplicated one of his articles on their site. Even worse, the duplicate ranked ahead of his original on Google's first page.
Fast forward to today: Google has detected the duplicate and removed it from the SERPs. It's like it never existed in the first place.
That's what happens when you reproduce a carbon copy of others' content on your website.
It's like the case of an elephant on a tree. You have no idea how it got there, but you know it’s only a matter of time before it falls down.
Even though there's no such thing as a "duplicate content penalty", having lots of duplicated content on your site can hurt your domain's authority and traffic potential.
For one thing, most programmatic SEO sites create duplicate content without even knowing it.
It can come from insufficient data or from targeting multiple variations of the same keyword.
The outcome is that they:
If you're creating duplicate pSEO content without even knowing it, how can you correct it?
This and much more is what we'll cover in this guide.
In the next few paragraphs, you are going to learn:
Let’s dive in…
Duplicate content springs up when you reproduce one piece of text on multiple URLs. It may apply to text obtained from any of these sources:
It turns out that duplicate content can occur in two places: on your own website and on third-party sites.
From your website: This is when two or more URLs have the same content or target the same keyword.
From other sites: This happens when two websites have the same content on their pages, or when someone republishes your content on their site without your permission.
In the case of cross-domain duplication, the original source has no control over third parties scraping their content.
Hence, when that happens, Google must pick one version to rank and list the others as duplicates.
So when two websites have the same content, how does Google differentiate the original from the duplicate?
Is there a percentage threshold for measuring content duplication?
Or does duplication apply to, say, when you reproduce the entire content but not paragraphs?
To answer these questions, we dug up a response from John Mueller, Senior Search Advocate at Google, to a tweet that asked the same question back in 2022.
It turns out that there’s no specific number or percentage that represents duplicate content.
But just thinking about how Google crawls and indexes pages, we believe we have an idea of how they differentiate duplicate content from the original.
So here's what we mean: When crawling, Google bots visit your site and try to understand what the page is all about before adding your content to their database (indexing).
Google marks the first variation that’s indexed as the “original”, and later submissions are marked as “duplicate”.
Following this logic, it's unlikely for a duplicate to outrank the original unless the original fails to meet Google's webmaster guidelines.
Hence, if you're reproducing duplicate content on your site, you’re likely to struggle with one or more of these:
Note: Don’t know what crawl budget is? Go here to read about it.
Now that you know the implications of having duplicate content on your site, it's time to start thinking about how to identify and eliminate it.
And how to avoid creating it moving forward.
Before we jump in, we'd like to point this out: duplicate content isn't the same as thin content.
Here’s a breakdown of how they differ from each other.
Thin content and duplicate content are similar, but they aren’t the same. Duplication has to do with reproducing existing texts on multiple URLs.
Conversely, thin content refers to content that provides little or no value to the reader.
It includes content with little text, poor structure and lots of ads.
Such content often results in a high bounce rate, low time on page, and other signals of a poor on-page experience.
So we hope the distinction is clear.
The table below compares both concepts side-by-side.
Even though this post is about how to avoid creating duplicate content for pSEO, we believe most websites already have tons of duplicate content living on their sites that they aren’t aware of.
Identifying and cleaning them up will help keep your site healthy and authoritative in the eyes of search engines.
For instance, before we created this content, we carried out a thorough check on our website and found duplicate texts on the following pages:
We are working on making them unique. And we see no reason why you shouldn’t do the same on your site.
So there are two ways to identify duplicate content:
Carry out a manual check on Google
The easiest way Google recommends to check for text duplication on a page is to copy a few words or a complete sentence and paste it into Google in quotes.
Here’s an example:
When we carried out this test on one of our pages, we found that the copied text appeared word-for-word on two of our pages.
A perfect example of "on-site duplication".
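If you want to script this manual check, here's a minimal Python sketch (the sample sentence is just an illustration) that builds the exact-match, quoted Google search URL for any snippet you copy:

```python
from urllib.parse import quote_plus

def exact_match_search_url(snippet: str) -> str:
    """Build a Google search URL that looks for the snippet verbatim."""
    # Wrapping the snippet in double quotes asks Google for an exact match.
    return "https://www.google.com/search?q=" + quote_plus(f'"{snippet}"')

url = exact_match_search_url("how to avoid duplicate content on your pSEO site")
print(url)
```

Open the printed URL in a browser: if pages other than the one you copied from show up, you've found a duplicate.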
Use Duplicate Content Checkers
Duplicate content checkers are software programmed to detect identical texts on multiple URLs on your site or from other sites.
Our best pick for duplicate content checkers include:
The one we've used is Siteliner. And it's the same one we used to identify duplicate content on our site while creating this guide.
What we love about Siteliner is that it's simple and straightforward to use.
All you have to do is to insert your page URL, and the tool runs a thorough check to identify duplicate content on your site.
When we did this for SEOmatic, it generated a report that showed that 23% of our content is duplicate text.
What next? Now, we can click on each URL to identify the texts that need a rewrite.
But in reality, it doesn't really matter. Those programmatic SEO pages rank well on Google and account for only 20-25% of the duplicate content across our overall website. Each individual page has a duplicate content match of up to 80-90%, not 100%. We believe that maintaining an 80/20 duplicate-content match ratio is enough to keep us from being penalized, and we're doing exactly that as you read this.
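A duplicate-match percentage like the one Siteliner reports can be approximated locally. The sketch below is our own illustration using Python's standard difflib, not Siteliner's actual algorithm, and the two sample page texts are made up:

```python
from difflib import SequenceMatcher

def duplicate_match_pct(text_a: str, text_b: str) -> float:
    """Return the approximate percentage of text shared between two pages."""
    return 100 * SequenceMatcher(None, text_a, text_b).ratio()

# Two programmatic pages that share a template but swap the location.
page_a = "Find the best coworking spaces in Paris with fast wifi and coffee."
page_b = "Find the best coworking spaces in Lyon with fast wifi and coffee."
print(f"{duplicate_match_pct(page_a, page_b):.0f}% duplicate")
```

Pages built from the same template with only a keyword swapped will score very high, which is exactly the kind of near-duplication this guide is about.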
Before we go further, let's recap what we've covered so far. We've discussed the types of duplicate content and outlined the various styles that can pass as duplicates.
At some point, we also analyzed how Google differentiates duplicates from originals.
And finally, we got you thinking about how to identify duplicate content.
With all those sorted, it’s time to show you how to avoid creating duplicate content in the first place.
Use AI Writers Built on the GPT-4 Model
Over the last few years, OpenAI has developed several AI models that automate repetitive writing tasks. Their GPT-4 model, in particular, is trained to understand and generate text based on natural language processing (NLP). It's one of the best models available for generating high-quality output while minimizing plagiarism.
Hence, when website owners come to us for recommendations on the best pSEO software, we usually advise them to opt for a tool with a built-in GPT-4-powered AI writer.
Such tools will give them the advantage of:
If you want a pSEO tool with such built-in content creation capabilities, SEOmatic is there for you. Just so you know, all of the programmatic SEO pages on our website were generated with SEOmatic, and most of them currently rank on the first page.
And one of our users, Ben, had this to say after using SEOmatic to generate 75 unique pages in 3 hours.
If you want to generate the same results for your business, we recommend you sign up for a free trial and take the tool for a test run before deciding to opt for a paid plan.
Go here to kickstart your pSEO journey with SEOmatic.
Differentiate Originals from Duplicates with Canonical URLs
If the same text appears on multiple pages of your site, it's important to specify which version is canonical.
Adding a canonical tag to one of the pages tells Google that this is the version to serve in search results, while the others with the same text are duplicates.
Otherwise, Google will make the pick for you, and that pick might not be your best version, costing you the chance to rank the page higher.
But how do you specify a page as a canonical?
Add the rel="canonical" tag to the page URL to specify that a blog post is canonical.
So here's how it's done; assuming this page is a canonical, the URL will be:
It’s as easy as that.
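For reference, the tag sits in the page's `<head>`. Here's a minimal sketch, with a placeholder URL standing in for your preferred page:

```html
<head>
  <!-- Tells search engines which URL is the preferred (canonical) version.
       The href below is a placeholder; use your own page's URL. -->
  <link rel="canonical" href="https://example.com/duplicate-content-guide/" />
</head>
```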
Create a Different Dataset for Each Keyword
Programmatic SEO generally relies on data: the more data you have, the better your content.
Hence, to create content that stands out, we recommend you invest enough time sourcing data.
Ideally, if it takes 3 hours to create 20 pSEO pages, you should invest 10-12 hours sourcing data.
Why should it take that long?
Our approach is to create different datasets for different keywords.
For instance, if you create content around a location-based keyword like "Paris", you can have different datasets for:
And lots more…
Doing this will level up your pSEO game and supercharge your content creation process.
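To make the idea concrete, here's a hypothetical Python sketch of one dataset per location keyword feeding a shared template. The dataset fields and city facts are made up for illustration; they aren't from any real pSEO setup:

```python
# One dataset per location keyword, each with its own facts, so the
# generated pages share a template but not their content.
datasets = {
    "Paris": {"landmark": "the Eiffel Tower", "districts": 20, "dish": "croissants"},
    "Lyon":  {"landmark": "the Basilica of Fourviere", "districts": 9, "dish": "quenelles"},
}

TEMPLATE = (
    "Exploring {city}? Start at {landmark}, wander its {districts} districts, "
    "and don't leave without trying {dish}."
)

# Render one page per keyword; distinct facts keep the pages from duplicating.
pages = {city: TEMPLATE.format(city=city, **data) for city, data in datasets.items()}

for city, page in pages.items():
    print(f"{city}: {page}")
```

The richer each keyword's dataset, the further apart the generated pages drift, which is what keeps their duplicate-match percentage down.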
Website owners who implement pSEO often fall into the trap of unintentionally creating duplicate content. Too much of it will hinder your pages' rankings.
But if you implement the strategies we've shared here, we're confident you can avoid creating duplicate content on your pSEO site.
And one last thing: you can level up your pSEO content game even further by combining this approach with a powerful tool like SEOmatic.
So why not sign up for our 14-day free trial and test-run our tool at no cost?
Today, I used SEOmatic for the first time.
It was user-friendly and efficiently generated 75 unique web pages using keywords and pre-written excerpts.
Total time cost for research & publishing was ≈ 3h (Instead of ≈12h)
Ben Farley
SaaS Founder, Salespitch
Add 10 pages to your site every week. Or 1,000 or 1,000,000.