I have two methods of generating m distinct random numbers in the range [0..n-1]
我有两种方法在范围[0. n-1]中生成m个不同的随机数。
Method 1:
方法1:
//C++-ish pseudocode
int result[m];
for(i = 0; i
Method 2:
方法2:
//C++-ish pseudocode
int arr[n];
for(int i = 0; i
The first method is more efficient when n is much larger than m, whereas the second is more efficient otherwise. But "much larger" isn't that strict a notion, is it? :)
当n大于m时,第一种方法更有效,而第二种方法则更有效。但是“大得多”并不是一个严格的概念,不是吗?:)
Question: What formula of n and m should I use to determine whether method1 or method2 will be more efficient? (in terms of mathematical expectation of the running time)
问:我应该用n和m的什么公式来确定method1或method2是否会更有效?(关于运行时间的数学期望)
15
Pure mathematics:
Let's calculate the quantity of rand()
function calls in both cases and compare the results:
纯数学:让我们计算两种情况下rand()函数调用的数量,并比较结果:
Case 1: let's see the mathematical expectation of calls on step i = k
, when you already have k numbers chosen. The probability to get a number with one rand()
call is equal to p = (n-k)/n
. We need to know the mathematical expectation of such calls quantity which leads to obtaining a number we don't have yet.
情形1:让我们看一下,当你已经选择了k个数字的时候,关于step i = k的调用的数学期望。用一个rand()调用得到一个数字的概率等于p = (n-k)/n。我们需要知道这种调用量的数学期望,从而得到一个我们还没有得到的数字。
The probability to get it using 1
call is p
. Using 2
calls - q * p
, where q = 1 - p
. In general case, the probability to get it exactly after n
calls is (q^(n-1))*p
. Thus, the mathematical expectation is
Sum[ n * q^(n-1) * p ], n = 1 --> INF
. This sum is equal to 1/p
(proved by wolfram alpha).
用1次调用得到它的概率是p,使用2个调用- q *p,其中q = 1 - p,一般情况下,在n次调用后得到它的概率是(q (n-1))*p。因此,数学期望是和[n * q (n-1) * p], n = 1——> INF,这个和等于1/p(由wolfram alpha证明)。
So, on the step i = k
you will perform 1/p = n/(n-k)
calls of the rand()
function.
因此,在步骤i = k时,您将执行rand()函数的1/p = n/(n-k)调用。
Now let's sum it overall:
现在我们总结一下:
Sum[ n/(n - k) ], k = 0 --> m - 1 = n * T
- the number of rand
calls in method 1.
Here T = Sum[ 1/(n - k) ], k = 0 --> m - 1
Sum[n/(n - k)], k = 0 -> m - 1 = n * T -方法1中的rand调用数。这里T =和[1/(n - k)], k = 0 -> m - 1。
Case 2:
案例2:
Here rand()
is called inside random_shuffle
n - 1
times (in most implementations).
在这里,rand()被调用在random_shuffle n - 1次(在大多数实现中)。
Now, to choose the method, we have to compare these two values: n * T ? n - 1
.
So, to choose the appropriate method, calculate T
as described above. If T <(n - 1)/n
it's better to use the first method. Use the second method otherwise.
现在,要选择这个方法,我们需要比较这两个值:n * T ?n - 1。因此,选择合适的方法,计算T如上所述。如果T <(n - 1)/n,最好使用第一种方法。否则,使用第二种方法。
9
Check the Wikipedia description of the original Fisher-Yates algorithm. It advocates using essentially your method 1 for up to n/2, and your method 2 for the remainder.
检查维基百科对原始Fisher-Yates算法的描述。它提倡使用基本的方法1到n/2,剩下的方法是2。
6
Here's an algorithm that will work in O(n) memory and O(n) time (where n is the number of returned results, not the size of the set you're selecting from) for any result set. It's in Python for convenience because it uses a hashtable:
这是一个在O(n)内存和O(n)时间内工作的算法(其中n是返回结果的数目,而不是你选择的集合的大小)。它在Python中是为了方便,因为它使用了一个hashtable:
def random_elements(num_elements, set_size):
state = {}
for i in range(num_elements):
# Swap state[i] with a random element
swap_with = random.randint(i, set_size - 1)
state[i], state[swap_with] = state.get(swap_with, swap_with), state.get(i, i)
return [state[i] for i in range(num_elements) # effectively state[:num_elements] if it were a list/array.
This is just a partial fisher-yates shuffle, with the array being shuffled implemented as a sparse hashtable - any element that is not present is equal to its index. We shuffle the first num_elements
indices, and return those values. In the case that set_size = 1,
this is equivalent to picking a random number in the range, and in the case that num_elements = set_size
, this is equivalent to a standard fisher-yates shuffle.
这只是一个部分的fisher-yates洗牌,数组被拖拽实现为一个稀疏hashtable——任何不存在的元素都等于它的索引。我们对第一个num_elements索引进行洗牌,并返回这些值。在set_size = 1的情况下,这相当于在range中选择一个随机数,在num_elements = set_size的情况下,这相当于一个标准的fisher-yates洗牌。
It's trivial to observe that this is O(n) time, and because each iteration of the loop initializes at most two new indices in the hashtable, it's O(n) space, too.
观察到这是O(n)时间是很简单的,因为循环的每一次迭代都在哈希表中的两个新索引中初始化,也就是O(n)空间。
5
Personally, I would use Method 1, and then if M > N/2, choose N-M values, and then invert the array (return the numbers that were not picked). So for example, if N is 1000 and you want 950 of them, chose 50 values using Method 1, and then return the other 950.
我个人将使用方法1,然后如果M > N/2,选择N-M值,然后反转数组(返回未选中的数字)。例如,如果N是1000,而你想要950个,用方法1选择50个值,然后返回另一个950。
Edit: Though, if consistent performance is your goal, I would use a modified method 2, which doesn't do the full shuffle, but only shuffles the first M elements of your N length array.
编辑:虽然,如果你的目标是一致的性能,我将使用一个修改过的方法2,它不会进行完全的洗牌,但只会打乱N长度数组的第一个M元素。
int arr[n];
for(int i = 0; i
3
What about a third method?
第三种方法呢?
int result[m];
for(i = 0; i
Edit it should be <=. and it would actually additional logic to avoid collisions.
编辑它应该是<=。它实际上是额外的逻辑来避免碰撞。
This is better, an example using the Modern Method from Fisher-Yates
这是一个更好的例子,一个使用现代方法的Fisher-Yates。
//C++-ish pseudocode
int arr[n];
for(int i = 0; i
2
Talking about mathematical expectation, it's pretty useless but I will post it anyway :D
说到数学期望,它是无用的,但我还是会把它贴出来:D。
Shuffle is simple O(m).
洗牌是简单的O(m)。
Now the other algorithm is a bit more complex. The number of steps needed to generate the next number is the expected value of the number of trials, and the probability of the trial length is a geomtric distribution. So...
另一种算法有点复杂。生成下一个数字所需的步骤数是试验次数的期望值,而试验长度的概率是一个地轴分布。所以…
p=1 E[X1]=1 = 1 = 1
p=1-1/n E[x2]=1/(1-1/n) = 1 + 1/(n-1) = 1 + 1/(n-1)
p=1-2/n E[x3]=1/(1-1/n) = 1 + 2/(n-2) = 1 + 1/(n-2) + 1/(n-2)
p=1-3/n E[X4]=1/(1-2/n) = 1 + 3/(n-3) = 1 + 1/(n-3) + 1/(n-3) + 1(n-3)
....
p=1-(m-1)/n) E[Xm]=1/(1-(m-1)/n))
Note that the sum can be split up into a triangle shape, see right hand side.
注意,这个和可以被分割成一个三角形的形状,看右边。
Let's use the formula for the harmonic series: H_n = Sum k=0->n (1/k) = approx ln(k)
让我们用调和级数的公式:H_n = Sum k=0->n (1/k) =约ln(k)
Sum(E[Xk]) = m + ln(n-1)-ln(n-m-1) + ln(n-2)-ln(n-m-1) + ... = m + ln(n-1) + ln(n-2) + ... - (m-1)*ln(n-m-1) ..
And there is some forumla for the sum of harmonic series, if you are still interested I will look it up...
对于和声级数的和有一些公式,如果你们还感兴趣的话,我会查一下。
Update: actually it's quite nice formula (thanks to the brilliant Concrete Mathematics book)
更新:其实这是一个很好的公式(多亏了那本精彩的混凝土数学书)
Sum(H_k) k=0->n = n*H_n - n
So the expected number of steps:
所以期望的步骤数:
Sum(E[Xk]) = m + (n-1)*ln(n-1) - (n-1) - (n-m-1)*ln(n-m-1) - (n-m-1)) - (m-1)*ln(n-m-1).
Note: I haven't verified it.
注:我还没有核实。
1
This is a bit of a long shot, but it could work, depending on your system.
这是一个很长的镜头,但它可以工作,取决于你的系统。
The obvious flaw in this method is that on highly variable load systems your "offline" test won't be too reliable.
这种方法的明显缺点是,在高度可变的负载系统上,您的“脱机”测试不会太可靠。
0
There was suggested Fisher-Yates shuffle. Don't know if the next code generates equally distributed integers, but it is at least compact and one-pass:
有人建议说要进行“费雪”的洗牌。不知道下一个代码是否生成同样的分布式整数,但它至少是紧凑的和一次性的:
std::random_device rd;
std::mt19937 g(rd());
for (size_type i = 1; i
-1
Quite possibly it would be more simple to start it in debug mode (and keep one method as a note) for a couple of times to get an average, then use the other method to get an average from that
很有可能在调试模式下启动它(并将一个方法作为一个注释),以获得平均值,然后使用另一种方法来获得平均值。