Four: the two documents share an important part of their content but differ in layout format; this is called a partial-duplicate page.
Near-duplicate pages are divided into 4 forms, according to how the page content and the layout format combine:
Three: the two documents share an important part of their content and have the same layout format; this is called a layout-duplicate page.
if a "high repeatability, is often.
The aim of
1. Multiple URL addresses pointing to the same page, and mirror sites
Statistics show that near-duplicate pages account for as much as 29% of all pages on the Internet, and completely identical pages account for about 22%. Research also shows that, in a large-scale information acquisition system, 30% of web pages are complete or near duplicates of the other 70%.
Near-duplicate detection of web content has two applications:
points to the same site.
One: the user search stage
A high proportion of the pages on the Internet have similar or identical content.
Benefits to search engines of removing duplicate web pages:
For each newly crawled page, the crawler program runs a near-duplicate detection algorithm to decide whether the page should be indexed.
Two: the two documents have the same content but different layout formats; this is called a content-duplicate page.
2. Duplicate or near-duplicate web content
Two: the crawler discovery phase
For example plagiarism, excerpted content, spam, and so on.
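One common way to make this index-or-skip decision is to compare word shingles with Jaccard similarity. The sketch below is only an illustration of that idea, not the algorithm any particular search engine uses; the function names, the shingle size of 3 and the 0.8 threshold are all assumptions.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short texts yield one shingle holding all of their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def should_index(new_text, indexed_texts, threshold=0.8):
    """Index the new page only if no indexed page reaches the threshold."""
    new_sh = shingles(new_text)
    return all(jaccard(new_sh, shingles(old)) < threshold
               for old in indexed_texts)
```

Comparing against every indexed page is fine for a demonstration, but a real crawler would need a fingerprinting scheme that avoids pairwise comparison.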
The adverse effects of
Types of duplicate pages that a search crawler runs into:
www.sina.com and www.sina.com.cn
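Different URLs that serve exactly the same page can be grouped by hashing the fetched content. This is a minimal sketch under assumptions: the page bodies are already downloaded, the URLs are made-up placeholders, and collapsing whitespace is the only normalization applied.

```python
import hashlib
from collections import defaultdict


def fingerprint(body):
    """SHA-256 of the page body after collapsing runs of whitespace."""
    normalized = " ".join(body.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_exact_duplicates(pages):
    """Map {url: body} to a list of URL groups that share one fingerprint."""
    groups = defaultdict(list)
    for url, body in pages.items():
        groups[fingerprint(body)].append(url)
    # Only groups with more than one URL are duplicates of each other.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]
```

A hash like this only catches complete duplicates; pages that differ by even one word get different fingerprints and need a similarity measure instead.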
Under normal circumstances, highly similar web pages give users little or no new information, yet crawling them and serving them in user searches consumes a large amount of server resources.
One: the two documents do not differ in content or layout format; this is called a full-duplicate page.
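The four forms can be read as a decision on two scores, one for content similarity and one for layout similarity. The sketch below assumes both scores are already computed in the range 0 to 1; the cut-off values 0.95 and 0.5 and the function name are hypothetical, chosen only to make the classification concrete.

```python
def classify_duplicate(content_sim, layout_sim, same=0.95, partial=0.5):
    """Classify a page pair into one of the four near-duplicate forms.

    content_sim and layout_sim are similarity scores in [0, 1].
    `same` and `partial` are assumed thresholds, not values from the text.
    """
    layout_same = layout_sim >= same
    if content_sim >= same:
        # Content is the same: layout decides between forms one and two.
        return "full duplicate" if layout_same else "content duplicate"
    if content_sim >= partial:
        # Only part of the content matches: forms three and four.
        return "layout duplicate" if layout_same else "partial duplicate"
    return "not a duplicate"
```

For example, `classify_duplicate(0.99, 0.40)` returns `"content duplicate"`, the case of identical text republished under a different template.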
duplicated web pages for search engine:
Given a user-specified query, the engine finds the near-duplicate documents in the list of already indexed documents, sorts them, and outputs the result.
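A search-stage lookup of this kind might look like the following sketch. It assumes the index is a small in-memory dictionary of document texts and uses Jaccard similarity over word sets; the names and the 0.5 threshold are illustrative, not taken from any real engine.

```python
def word_set(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_near_duplicates(query, indexed, threshold=0.5):
    """Return (doc_id, score) pairs at or above the threshold, best first.

    `indexed` maps doc_id to document text.
    """
    q = word_set(query)
    scored = [(doc_id, jaccard(q, word_set(text)))
              for doc_id, text in indexed.items()]
    hits = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Sorting by score puts the closest matches first, which is one reasonable reading of "sort the output"; other orderings, such as by page quality, are equally possible.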