Misplaced Pages

w-shingling

Article snapshot taken from[REDACTED] with creative commons attribution-sharealike license. Give it a read and then ask your questions in the chat. We can research this topic together.
This article includes a list of references, related reading, or external links, but its sources remain unclear because it lacks inline citations. Please help improve this article by introducing more precise citations. (March 2023) (Learn how and when to remove this message)

In natural language processing a w-shingling is a set of unique shingles (therefore n-grams) each of which is composed of contiguous subsequences of tokens within a document, which can then be used to ascertain the similarity between documents. The symbol w denotes the quantity of tokens in each shingle selected, or solved for.

The document, "a rose is a rose is a rose" can therefore be maximally tokenized as follows:

(a,rose,is,a,rose,is,a,rose)

The set of all contiguous sequences of 4 tokens (Thus 4=n, thus 4-grams) is

{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose) } Which can then be reduced, or maximally shingled in this particular instance to { (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) }.

Resemblance

For a given shingle size, the degree to which two documents A and B resemble each other can be expressed as the ratio of the magnitudes of their shinglings' intersection and union, or

r ( A , B ) = | S ( A ) S ( B ) | | S ( A ) S ( B ) | {\displaystyle r(A,B)={{|S(A)\cap S(B)|} \over {|S(A)\cup S(B)|}}}

where |A| is the size of set A. The resemblance is a number in the range , where 1 indicates that two documents are identical. This definition is identical with the Jaccard coefficient describing similarity and diversity of sample sets.

See also

References

This article needs more complete citations for verification. Please help add missing citation information so that sources are clearly identifiable. (March 2023) (Learn how and when to remove this message)
Natural language processing
General terms
Text analysis
Text segmentation
Automatic summarization
Machine translation
Distributional semantics models
Language resources,
datasets and corpora
Types and
standards
Data
Automatic identification
and data capture
Topic model
Computer-assisted
reviewing
Natural language
user interface
Related
Category:
w-shingling Add topic