# Two problem related to sequence of sets

Problem1

Given a sequence of sets $S_1,\ldots,S_n$ with a total of $m$ elements. Partition $[n]$, such that if $i,j$ is in the same partition class, then $S_i = S_j$.

Solve the problem by building a trie over the lexicographic ordering of the elements in the set. Since the alphabet has size $n$, it has running time $O(m\log n)$. One can get better running time using integer data structures, say $O(m\log \log n)$ using van Emde Boas tree.

$O(m)$ time is actually possible. For each $k$, we build the set $H_k = \set{j | k\in S_j}$ (as a list). We define equivalent relation $\equiv_k$ as $i\equiv_k j$ if $S_i\cap [k] =S_j\cap [k]$. If we have equivalent class of $\equiv_k$, we can obtain the equivalent class of $\equiv_{k+1}$ in $O(|H_k|)$ time. Hence together the running time is $O(m)$.

Problem2

Given a sequence of sets $S_1,\ldots,S_n$ containing a total of $m$ integers, and a integer $k$. Decide if there exists $i$ and $j$ such that $i\neq j$ and $|S_i\cap S_j|\geq k$.

We assume the elements in the sets are in $[m]$. Let $S=\bigcup_{i=1}^n S_i$.

For $k=0,1$, we can solve it in $O(m)$ time: Decide if any element appears more than once in the sets.

For larger $k$, we shall compute $|S_i\cap S_j|$ for every pair $i$ and $j$. To do this, we start with an all zero $n\times n$ matrix $C$. At the end of the algorithm, $C_{i,j} = |S_i\cap S_j|$ for all $i,j\in [n]$. For each element $x$, we find $E_x = \set{i|x\in S_i}$. This takes $O(m)$ time. We increment $C_{i,j}$ for all $i,j\in E_x$. We claim this algorithm have running time $O(nm)$. Indeed, for each $x$, we spend $|E_x|$ time in incrementing $C_{i,j}$ where $i,j\in E_x$. Hence the running time is bounded by $\sum_{x\in S} |E_x|^2$. We know $\sum_{x\in S} |E_x|=m$ and $|E_x|\leq n$. We see the worst case is when $|E_x|=n$ and $|S|=m/n$. In that case, we have running time $O(\sum_{x\in S} n^2)=O(mn)$.

Since we just want to find a pair $\set{i,j}$ where $|S_i\cap S_j|\geq k$. We can stop the algorithm as soon as $C_{i,j}\geq k$ for some $i$ and $j$. This means we can increment at most $(k-1)n^2$ times.

Together, the running time become $O(\min(nm,k n^2+m))$.

For $k=2$. One can improve the running time when $n$ is large by reduce it to a problem similar to finding rectangles or finding a $C_4$ in the incident graph. Let $n'$ be $|\bigcup_i S_i|$, we can obtain a more refined bound. Together, the final running time for $k=2$ is $O(\min(m^{4/3}, dm, n^2+m))$. Here $d$ is the degeneracy of the incident graph of the sets and the elements, which is bounded above by the maximum degree.

Recently, I had some result for $k=3$, where the running time improves to $O(m^{28/15})$.

Posted by Chao Xu on .
Tags: algorithm.