four: the two documents share part of their important content but have different layout formats; this is called a partially duplicate page.

Near-duplicate pages are divided into 4 types, according to the combination of page content and layout format:

Such as:

three: the two documents share part of their important content and have the same layout format; this is called a layout duplicate page.


if a "high repeatability, is often.

The aim of

1. Multiple URL addresses pointing to the same page, and mirror sites

Statistics show that near-duplicate pages account for as much as 29% of all pages on the Internet, and completely identical pages account for about 22% of the total. Research also shows that, in a large-scale information acquisition system, 30% of web pages are complete or near duplicates of the other 70%.

Near-duplicate detection of web content has two applications:

points to the same site.

: in the user search stage

: Web content a high proportion of Internet pages is similar or identical to the


Benefits of duplicate web pages for search engines:


For each newly crawled web page, the crawler program uses a duplicate-detection algorithm to decide whether to index it.
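The article does not name the algorithm the crawler uses, so the following is only a minimal sketch of one common approach: word shingling combined with Jaccard similarity. The function names, the shingle size `k=3`, and the `threshold=0.8` cutoff are all assumptions chosen for illustration, not values taken from any search engine.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k is an assumed value)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def should_index(new_text, indexed_texts, threshold=0.8):
    """Index the new page only if it is not too similar to any indexed page.

    The threshold is a hypothetical cutoff; a real system would tune it.
    """
    new_sh = shingles(new_text)
    return all(jaccard(new_sh, shingles(old)) < threshold
               for old in indexed_texts)
```

Comparing a new page against every indexed page is only workable for small collections; large-scale systems typically replace this pairwise comparison with compact fingerprints, which this sketch does not attempt.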

two: the two documents have the same content but different layout formats; this is called a content duplicate page.


2. Web content that is duplicate or near duplicate

two: crawler discovery phase

such as plagiarism, excerpted content, spam, and so on

The adverse effects of

Types of page duplication that arise during search crawler fetching:


www.sina.com and www.sina.com.cn

Under normal circumstances, highly similar web content provides users with little or no new information, yet crawling it and serving it in user searches consumes a large amount of server resources.

one: the two documents have no difference in content or layout format; this is called a completely duplicate page.
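The four types in this article follow from two questions: is all of the content the same or only part of it, and is the layout format the same? A minimal sketch of that decision table is shown here; the enum, the function name, and the `"all"`/`"part"` labels are hypothetical names introduced for illustration.

```python
from enum import Enum


class DuplicateType(Enum):
    FULL = "completely duplicate page"      # same content, same layout
    CONTENT = "content duplicate page"      # same content, different layout
    LAYOUT = "layout duplicate page"        # partly same content, same layout
    PARTIAL = "partially duplicate page"    # partly same content, different layout


def classify(content_overlap, same_layout):
    """Map (content overlap, layout equality) to one of the four types.

    content_overlap must be "all" or "part" (assumed labels).
    """
    if content_overlap not in ("all", "part"):
        raise ValueError("content_overlap must be 'all' or 'part'")
    if content_overlap == "all":
        return DuplicateType.FULL if same_layout else DuplicateType.CONTENT
    return DuplicateType.LAYOUT if same_layout else DuplicateType.PARTIAL
```

For example, `classify("part", False)` returns `DuplicateType.PARTIAL`. Deciding in practice whether content is "all" or "part" the same, and whether two layouts match, is the hard part and is outside the scope of this sketch.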

duplicated web pages for search engine:

Near-duplicate documents that match the user-specified query are looked up in the list of already indexed documents, then sorted and output.

