This is an implementation of Rabin and Karp's streaming hash, as described
in "Winnowing: Local Algorithms for Document Fingerprinting" by Schleimer,
Wilkerson, and Aiken. Following the suggestion of Schleimer, I am using
their second equation:

  $H[ $c[2..$k + 1] ] = (( $H[ $c[1..$k] ] - $c[1] ** $k ) + $c[$k+1] ) * $k

Each hash value encodes information about the next k values in the
stream (hence "k-gram"). This means that for any stream of n integer
values (or characters), you will get back n - k + 1 hash values.
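
As a rough illustration of the rolling update and the n - k + 1 count,
here is a minimal Python sketch. It is not the module's Perl code. It
follows the form of the update given in the Winnowing paper, with an
explicit base b. The function name kgram_hashes, the default base of 2,
and the absence of any modulus are assumptions made for this example.

```python
def kgram_hashes(values, k, base=2):
    """Yield one rolling hash per k-gram of `values` (a list of integers).

    Hypothetical helper for illustration; `base` is an assumed parameter.
    """
    if k <= 0 or len(values) < k:
        return
    # Hash of the first window: c1*b^k + c2*b^(k-1) + ... + ck*b
    h = 0
    for c in values[:k]:
        h = (h + c) * base
    yield h
    top = base ** k
    for i in range(k, len(values)):
        # Drop the oldest value, add the newest, then multiply by the base.
        h = (h - values[i - k] * top + values[i]) * base
        yield h


codes = [ord(ch) for ch in "abcdef"]   # n = 6 values
hashes = list(kgram_hashes(codes, 3))  # k = 3, so 6 - 3 + 1 hashes
```

Each update costs a constant amount of work regardless of k, which is
what makes the hash suitable for streams.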

For best results, you will want to create a code generator that filters
your data to remove all unnecessary information. For example, in a large
English document, you should probably remove all whitespace as well as
all capitalization.
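
A filtering generator of the kind described above might look like the
following Python sketch. The name filtered_codes is hypothetical and the
module's own interface is not shown here. The sketch only illustrates
dropping whitespace and folding case before the values are hashed.

```python
def filtered_codes(text):
    """Yield lowercase character codes from `text`, skipping whitespace.

    Hypothetical helper for illustration only.
    """
    for ch in text:
        if ch.isspace():
            continue
        # lower() can expand one character into several, so iterate.
        for lc in ch.lower():
            yield ord(lc)


sample = list(filtered_codes("A b\nC"))
```

With this filter, "Hello World" and "helloworld" produce the same
sequence of codes, so they would also produce the same k-gram hashes.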

WWW: https://metacpan.org/release/Algorithm-RabinKarp