[latex3-commits] [git/LaTeX3-latex3-latex3] master: Document limitations of the RNG and how we go around them [ci skip] (884c33b)

Sun May 6 16:13:51 CEST 2018

Repository : https://github.com/latex3/latex3
On branch  : master
Link       : https://github.com/latex3/latex3/commit/884c33b1dfad0eef3222189ce1226e4fe96ea369

>---------------------------------------------------------------

commit 884c33b1dfad0eef3222189ce1226e4fe96ea369
Author: Bruno Le Floch <bruno at le-floch.fr>
Date:   Sat May 5 16:26:08 2018 +0200

    Document limitations of the RNG and how we go around them [ci skip]
    
    This is based on a detailed discussion with Jean-François B. at
    https://github.com/latex3/latex3/commit/c83d05b3f8ad6778d756b1e08d095e16a3b7e23c


>---------------------------------------------------------------

884c33b1dfad0eef3222189ce1226e4fe96ea369
 l3kernel/l3fp-random.dtx |  126 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 126 insertions(+)

diff --git a/l3kernel/l3fp-random.dtx b/l3kernel/l3fp-random.dtx
index 3d34135..4deabe4 100644
--- a/l3kernel/l3fp-random.dtx
+++ b/l3kernel/l3fp-random.dtx
@@ -96,6 +96,132 @@
   {
 %    \end{macrocode}
 %
+% Obviously, every word \enquote{random} below means
+% \enquote{pseudo-random}, as we have no access to entropy (except a
+% very unreliable source of entropy: the time it takes to run some
+% code).
+%
+% The primitive random number generator (RNG) is provided as
+% \cs{tex_uniformdeviate:D}.  Under the hood, it maintains an array of
+% $55$ $28$-bit numbers, updated with a linear recursion relation
+% (similar to Fibonacci numbers) modulo $2^{28}$.  When
+% \cs{tex_uniformdeviate:D} \meta{integer} is called (for brevity denote
+% by~$N$ the \meta{integer}), the next $28$-bit number is read from the
+% array, scaled by $N/2^{28}$, and rounded.  To prevent $0$ and $N$ from
+% appearing half as often as other numbers, they are both mapped to the
+% result~$0$.
+%
+% This process means that \cs{tex_uniformdeviate:D} only gives a uniform
+% distribution from $0$ to $N-1$ if $N$ is a divisor of $2^{28}$, so we
+% will mostly call the RNG with such power of~$2$ arguments.  If $N$
+% does not divide $2^{28}$, then the relative non-uniformity (difference
+% between probabilities of getting different numbers) is about
+% $N/2^{28}$ and could be detected after generating roughly $2^{56}/N^2$
+% random numbers.  For instance for $N=98304=3\times 2^{15}$, the bias
+% modulo~$3$ would be detectable after roughly $2^{23}$ calls to the RNG
+% (experimentally this takes roughly a second on a 1 giga-hertz
+% processor).  While this bias is not quite problematic, it is
+% uncomfortably close to being so, and it becomes worse as $N$ is
+% increased.  In our code, we shall thus combine several results from
+% the RNG\@.
+%
+% The RNG has two types of unexpected correlations.  First, everything
+% is linear modulo~$2^{28}$, hence the lowest $k$ bits of the random
+% numbers only depend on the lowest $k$ bits of the seed (and of course
+% the number of times the RNG was called since setting the seed).  The
+% recommended way to get a number from $0$ to $N-1$ is thus to scale the
+% raw $28$-bit integer, as the engine's RNG does.  We will go further
+% and in fact typically we discard some of the lowest bits.
+%
+% Second, suppose that we call the RNG with the same argument~$N$ to get
+% a set of $K$ integers in $[0,N-1]$ (throwing away repeats), and
+% suppose that $N>K^3$ and $K>55$.  The recursion used to construct more
+% $28$-bit numbers from previous ones is linear:
+% $x_n = x_{n-55} - x_{n-24}$ or $x_n = x_{n-55}-x_{n-24}+2^{28}$.
+% After rescaling and rounding we find that the result $N_n\in[0,N-1]$
+% is among $N_{n-55} - N_{n-24} + \{-1,0,1\}$ modulo~$N$ (a more
+% detailed analysis shows that $0$ appears with frequency close
+% to~$3/4$).  The resulting set thus has more triplets $(a,b,c)$ than
+% expected obeying $a=b+c$ modulo~$N$.  Namely it will have of order
+% $(K-55)\times 3/4$ such triplets, when one would expect $K^3/(6N)$.
+% This starts to be detectable around $N=2^{18}>55^3$ (earlier if one
+% keeps track of positions too, but this is more subtle than it looks
+% because the array of $28$-bit integers is read backwards by the
+% engine).  Hopefully the correlation is subtle enough to not affect
+% realistic documents so we do not specifically mitigate against this.
+% Since we typically use two calls to the RNG per \cs{int_rand:nn} we
+% would need to investigate linear relations between the $x_{2n}$ on the
+% one hand and between the $x_{2n+1}$ on the other hand.  Such relations
+% will have more complicated coefficients than $\pm 1$, which alleviates
+% the issue.
+%
+% Ideally, our algorithm should be:
+% \begin{itemize}
+% \item Uniform.  The result should be as uniform as possible assuming
+%   that the RNG's underlying $28$-bit integers are uniform.
+% \item Uncorrelated.  The result should not have detectable
+%   correlations between different seeds, similar to the lowest-bit ones
+%   mentioned earlier.
+% \item Quick.  The algorithm should be fast in \TeX{}, so no
+%   \enquote{bit twiddling}, but \enquote{digit twiddling} is ok.
+% \item Simple.  The behaviour must be documentable precisely.
+% \item Predictable.  The number of calls to the RNG should be the same
+%   for any \cs{int_rand:nn}, because then the algorithm can be modified
+%   later without changing the result of other uses of the RNG\@.
+% \item Robust.  It should work even for \cs{int_rand:nn} |{| |-|
+%   \cs{c_max_int} |}| |{| \cs{c_max_int} |}| where the range is not
+%   representable as an integer.  In fact, we also provide later a
+%   floating-point |randint| whose range can go all the way up to
+%   $2\times 10^{16}-1$ possible values.
+% \end{itemize}
+% Some of these requirements conflict.  For instance, uniformity cannot
+% be achieved with a fixed number of calls to the RNG\@.
+%
+% Denote by $\operatorname{random}(N)$ one call to
+% \cs{tex_uniformdeviate:D} with argument~$N$, and by
+% $\operatorname{ediv}(p,q)$ the \eTeX{} rounding division giving
+% $\lfloor p/q+1/2\rfloor$.  Denote by $\meta{min}$, $\meta{max}$ and
+% $R=\meta{max}-\meta{min}+1$ the arguments of \cs{int_min:nn} and the
+% number of possible outcomes.  Note that $R\in [0,2^{32}-1]$ cannot
+% necessarily be represented as an integer.  Our strategy is to get two
+% $28$-bit integers $X$ and $Y$ from the RNG, split each into $14$-bit
+% integers, as $X=X_1\times 2^{14}+X_0$ and $Y=Y_1\times 2^{14}+Y_0$
+% then return essentially
+% $\meta{min} + \lfloor R (X_1\times 2^{-14} + Y_1\times 2^{-28} +
+% Y_0\times 2^{-42} + X_0\times 2^{-56})\rfloor$.  For small~$R$ the
+% $X_0$ term has a tiny effect so we ignore it and we can compute
+% $R\times Y/2^{28}$ much more directly by $\operatorname{random}(R)$.
+% \begin{itemize}
+% \item If $R \leq 2^{17}$ then return
+%   $\meta{min} + \operatorname{ediv}(R\operatorname{random}(2^{14}) +
+%   \operatorname{random}(R) - 2^{13}, 2^{14})$.  The shift by $2^{13}$
+%   converts \eTeX{} division to truncated division.  The bound on $R$
+%   ensures that the number obtained before the shift and division is at
+%   most $2^{17}(2^{14}-1)+(2^{17}-1)=\cs{c_max_int}$.  The
+%   non-uniformity is at most of order $2^{17}/2^{42} = 2^{-25}$.
+% \item Otherwise, split $R=R_2\times 2^{28}+R_1\times 2^{14}+R_0$,
+%   precompute $R_{21}=R_2\times 2^{14}+R_1$ and
+%   $R_{10}=R_1\times 2^{14}+R_0$ and $Z = X_1\times 2^{14}+Y_1$.  Then
+%   compute on the one hand
+%   $\meta{min}+8\times Z+(R_2-8)\times Z+R_1\times X_1$ (where the
+%   weird split of $R_2\times Z$ avoids a possible overflow) and
+%   $\operatorname{ediv}(R_2\times Y_0 + R_0\times X_1 +
+%   \operatorname{ediv}(R_{21}\times X_0,2^{28}) +
+%   \operatorname{ediv}(R_{10}\times Y,2^{28}),2^{14})$ then sum them
+%   (this may give $2^{31}$, an overflow).  If the result is greater
+%   than $\meta{max}$ (in particular if it overflows) then replace it by
+%   $\meta{min}$.
+% \item To get a floating point number in $[0,1)$ just call the
+%   $R=10000\leq 2^{17}$ procedure above to produce four blocks of four
+%   digits.
+% \item To get an integer floating point number in a small range
+%   $R\leq 2^{17}$ use the above simplified algorithm.  If the range is
+%   bigger (it can be up to $2\times 10^{16}-1$), work with fixed-point
+%   numbers: get six times four digits to build a fixed point number,
+%   multiply by $R$ and add $\meta{min}$.
+% \end{itemize}
+% This is strategy is not yet implemented at all.
+%
 % \begin{macro}[EXP]{\@@_rand_uniform:}
 % \begin{variable}
 %   {