## Probability of overlapping sub-sequences?

Consider a periodic sequence, seq_{p}, with period p

seq_{p} = *a*_{1}, *a*_{2}, …, *a*_{p}, *a*_{1}, *a*_{2}, …, *a*_{p}, *a*_{1}, *a*_{2}, …, *a*_{p}, …

Two sub-sequences, each of length N where 2< N< p, are produced by choosing two random start points in seq_{p} and using the next N-1 numbers. What is the probability that these two sub-sequences will overlap?

If you take M>2 such length N sub-sequences then what is the probability that **any** two will overlap?

For a more concrete example, assume that p = 2^19937 – 1 (yes, a very big number) and consider 10,000 sub-sequences each of length 10^9.

Disclaimer: I don’t have the solution to this and haven’t yet tried to find it

For the sake of clarity: can the elements from {a_1,..,a_p} duplicate each other?

Would it make a difference to the solution? If it would then I’d be tempted to split into two parts

part a) assume all p elements are unique

part b) assume that there are only q < p unique elements and so some of the a_p will be duplicates. How's that sound? For the concrete example, part b is the one that actually applies. Bonus points for anyone who can tell me where the concrete example may come from :)

Where the example comes from is the easy part. You want to know if Mersenne Twister random number sequences overlap. Not good enough at probability to give you the answer, but I can guarantee you the probability is VERY small.

For part a, I’m thinking the answer is N/(p-N).

Here is a upper bound on the probability of overlap.

There are p^M different (ordered) choices for the starting points of each sequence. Now let’s count how many don’t overlap. The first starting point can be chosen freely, in p different ways. This makes 2*N-1 of the p points forbidden as starting points for other sequences – each of the N points contained in that first subsequence and N-1 points before it. So there are p – 2*N + 1 legal starting points for the second sequence. Choosing the second sequence starting point will make *at most* 2*N-1 of the remaining points invalid – there could be overlap between the new forbidden points and old forbidden points. So for the starting point of the second sequence, there are at least p – 4*N + 2 starting points, and overall there are at least prod(p – i*(2*N – 1), i = 0 .. M – 1) M-tuples of sequences that have no overlap. Hence the probability that no two will overlap is at least prod((p – i*(2*N – 1))/p, i = 0 .. M – 1), and the probability that any two will overlap is at most one minus that number.

Trying to compute this with the given exact integers doesn’t seem to be particularly useful (and I don’t see an easy way to make it not blow up), but I think we can make a good estimate using arbitrary precision floats. Since p has about 6000 decimal digits, if we work with a precision of, say, 8000 decimal digits, we should be able to get good results in finite space. Indeed, the following Maple program (full disclosure: I work at Maplesoft) gives us a number:

Digits := 8000:

p := 2^19937 – 1:

N := 10^9:

M := 10^4:

r := 1 – mul((p – i*(2*N – 1.))/p, i = 0 .. M – 1):

evalf[25](r);

It’s about 2.32E-5985; if I do the whole thing with interval arithmetic, I get an interval width of about 3E-7996, so the result looks reliable.

2^19937 – 1 => that is bound to be the Mersenne Twister

I guess the dominating factor will be C(m,2)*2*N/P