- Python: Advanced Predictive Analytics
- Ashish Kumar, Joseph Babcock
- 2021-07-02 20:09:24
Generating random numbers and their usage
Random numbers behave like any other numbers, except that they take a different value every time the statement that generates them is executed. Random number generators rely on deterministic algorithms, which are beyond the scope of this book, to produce a different number on each call. After a very long (but finite) period, however, the sequence starts to repeat. In that sense, these numbers are not truly random and are called pseudo-random numbers.
In spite of them actually being pseudo-random, these numbers can be assumed to be random for all practical purposes. These numbers are of critical importance to predictive analysts because of the following points:
- They allow analysts to perform simulations for probabilistic multicase scenarios
- They can be used to generate dummy data frames or columns of a data frame that are needed in the analysis
- They can be used for the random sampling of data
Various methods for generating random numbers
The functions used to work with random numbers live in the random module of the numpy library. Let's have a look at the different methods of generating random numbers and their usage.
Let's start by generating a random integer between 1 and 100 (the upper bound is exclusive). This can be done as follows:
import numpy as np
np.random.randint(1,100)
If you run the preceding snippet, it will generate a random integer from 1 to 99, because randint excludes the upper bound passed to it (100 here). When I ran it, it gave me 43 as the result. It might give you something else.
To generate a random number between 0 and 1, we can write something similar to the following code:
import numpy as np
np.random.random()
These methods allow us to generate one random number at a time. What if we wanted a list of randomly generated numbers, all lying within a given interval? Let's define a function that can generate a list of n random numbers lying between a and b.
All one needs to do is define a function, wherein an empty list is created and the randomly generated numbers are appended to the list. The recipe to do that is shown in the following code snippet:
def randint_range(n,a,b):
    x=[]
    for i in range(n):
        x.append(np.random.randint(a,b))
    return x
After defining this function, we can generate, let's say, 10 numbers lying between 2 and 1000, as shown:
randint_range(10,2,1000)
On the first run, it gives something similar to the following output:
Fig. 3.8: 10 random integers between 2 and 1000
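The loop above can also be avoided. As a minimal sketch, assuming one is happy to receive a numpy array rather than a list, the size argument of randint asks for all the numbers in a single call. The function name randint_array is a hypothetical one, chosen here only for illustration:

```python
import numpy as np

def randint_array(n, a, b):
    """Return n random integers drawn from [a, b) as a NumPy array.

    randint_array is a hypothetical helper name used for illustration.
    """
    # size=n asks NumPy for all n draws in one call instead of looping
    return np.random.randint(a, b, size=n)

sample = randint_array(10, 2, 1000)
print(sample)            # 10 integers, each at least 2 and below 1000
print(sample.tolist())   # convert to a plain Python list if one is needed
```

For large n, this is usually much faster than appending to a list one number at a time.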
The randrange method is another important way to generate random numbers. It belongs to Python's built-in random module (not numpy) and is in a way an extension of the randint method, as it accepts a step argument in addition to the start and stop arguments that randint takes.
To generate three random numbers between 0 and 100, which are all multiples of 5, we can write:
import random
for i in range(3):
    print(random.randrange(0,100,5))
You should get something similar to the following screenshot, as a result (the actual numbers might change):
Another related useful method is shuffle, which shuffles a list or an array in random order. It doesn't generate a random number, per se, but nevertheless it is very useful. Let's see how it works. Let's generate a list of 100 consecutive integers and then shuffle the list:
# list() is needed in Python 3, where range() does not return a mutable list
a=list(range(100))
np.random.shuffle(a)
The list looks similar to the following screenshot before and after the shuffle:
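A quick way to inspect the effect of shuffle is to print a few elements before and after the call. The following is a small sketch along those lines; it also uses np.random.permutation, which returns a shuffled copy instead of shuffling in place, as an alternative that the text above does not cover:

```python
import numpy as np

a = list(range(100))        # 0, 1, ..., 99 as a mutable list
before = a.copy()           # keep the original order for comparison

np.random.shuffle(a)        # shuffles a in place and returns None

print(before[:10])          # first ten elements before the shuffle
print(a[:10])               # first ten elements after the shuffle
print(sorted(a) == before)  # the elements are the same, only the order changed

# permutation leaves its input alone and returns a shuffled array
p = np.random.permutation(100)
print(p[:10])
```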
The choice method is another important technique that comes in handy in many scenarios, such as simulations, that depend on selecting a random item from a list of items. It picks an item at random from a given list.
To see an example of how this method works, let's go back to the data frame that we have been using all along in this chapter. Let's import that data again and get the list of column names, using the following code snippet:
import pandas as pd
data=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Customer Churn Model.txt')
column_list=data.columns.values.tolist()
To select one column name from the list, at random, we can write it similar to the following example:
np.random.choice(column_list)
This should result in one column name being chosen at random from the list of the column names. I got Day Calls for my run. Of course, one can loop over the choice method to get multiple items, as we did for the randint method.
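Instead of looping, one can also ask choice for several items at once through its size argument. The sketch below assumes a short, made-up list of column names (only Day Calls appears in the text above; the others are hypothetical stand-ins for the real column list) and shows sampling with and without replacement:

```python
import numpy as np

# Hypothetical column names standing in for the list built from the data frame
column_list = ["State", "Day Calls", "Eve Calls", "Night Calls", "Intl Calls"]

# Three names sampled with replacement, so repeats are possible
with_repeats = np.random.choice(column_list, size=3)

# Three distinct names, sampled without replacement
distinct = np.random.choice(column_list, size=3, replace=False)

print(with_repeats)
print(distinct)
```

Sampling without replacement is the variant one would typically want when drawing a random subset of columns or rows.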
Seeding a random number
At the outset of this section on random numbers, we discussed how random numbers change their values on every execution of their call statement. They repeat their values, but only after a very long period. Sometimes, we need to generate a set of random numbers that can be reproduced exactly. This can be achieved by seeding the random number generator. Basically, the generator is given a seed (a sort of key), and reusing the same seed regenerates the same sequence of random numbers. Let's see this with an example:
np.random.seed(1)
for i in range(5):
    print(np.random.random())
In the first line, we set the seed as 1 and then generated 5 random numbers. The output looks something similar to this:
Fig. 3.9: Five random numbers generated through random method with seed 1
If one removes the seed and then generates random numbers, one will get different random numbers. Let's have a look:
for i in range(5):
    print(np.random.random())
By running the preceding code snippet, one indeed gets different random numbers, as shown in the following output screenshot:
Fig. 3.10: Five random numbers generated through random method without seed 1
However, if one brings back the seed used to generate the random numbers, one gets back the same numbers. If we run the following snippet, it regenerates the numbers shown in the first case:
np.random.seed(1)
for i in range(5):
    print(np.random.random())
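The seeding behavior described above can also be checked programmatically rather than by comparing screenshots. The following is a minimal sketch that stores the numbers in arrays (the variable names are chosen here for illustration) and compares them:

```python
import numpy as np

np.random.seed(1)
first = np.random.random(5)      # five numbers drawn right after seeding

unseeded = np.random.random(5)   # continues the same stream, so the values differ

np.random.seed(1)                # bring back the same seed
second = np.random.random(5)

print(np.array_equal(first, second))     # the seeded runs match
print(np.array_equal(first, unseeded))   # the unseeded run does not
```

Setting the seed once at the top of a script or notebook is a common way to make an analysis reproducible.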
Generating random numbers following probability distributions
If you have taken a probability class in your school or college, you might have heard of probability distributions. There are two concepts that you might want to refresh.
Probability density function
For a random variable, it is based on the count of times that the random variable attains a particular value x, or the number of times that the value of the random variable falls in a given range (bin). Dividing this count by the total number of observations gives the probability of the random variable attaining that value or falling in that range. Histograms plot this count/probability on the y axis, so it can be read off as the y axis value of a distribution plot/histogram. For a discrete random variable, this can be written as follows:
PDF(x) = Prob(X=x)
Cumulative distribution function
For a random variable, it is defined as the probability that the random variable is less than or equal to a given value x. For a given point on the x axis, it is calculated as the area enclosed by the frequency distribution curve over all the values less than or equal to x.
Mathematically, it is defined as follows:
CDF(x) = Prob(X<=x)
Fig. 3.11: CDF is the area enclosed by the curve till that value of random variable. PDF is the frequency/probability of that particular value of random variable.
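Both concepts can be estimated from a set of random numbers. The sketch below assumes numbers drawn with np.random.random (uniform between 0 and 1); the bin count of 20 and the helper name empirical_cdf are arbitrary choices made for illustration. It builds a normalized histogram as an estimate of the PDF and computes the CDF as the fraction of numbers that do not exceed x:

```python
import numpy as np

np.random.seed(0)
samples = np.random.random(100000)   # uniform numbers in [0, 1)

# Empirical PDF: a histogram normalized so that the bar areas sum to 1
density, edges = np.histogram(samples, bins=20, density=True)
widths = np.diff(edges)
print(np.sum(density * widths))      # total area under the histogram

def empirical_cdf(x):
    """Fraction of samples less than or equal to x (hypothetical helper)."""
    return np.mean(samples <= x)

print(empirical_cdf(0.25))           # roughly a quarter of the numbers
print(empirical_cdf(0.5))            # roughly half of the numbers
```

The CDF only ever increases as x grows, from 0 below the smallest sample to 1 at or above the largest one.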
There are various kinds of probability distributions that frequently occur, including the normal (famously known as the bell curve), uniform, Poisson, binomial, and multinomial distributions, and so on.
Many analyses require generating random numbers that follow a particular probability distribution. One can generate random numbers in such a fashion using the same random module of the numpy library.
Let's see how one can generate two of the most commonly used distributions, which are normal and uniform distributions.
Uniform distribution
A uniform distribution is defined by its endpoints, that is, the start and stop points. Each of the points lying in between these endpoints is supposed to occur with the same (uniform) probability, hence the name of the distribution.
If the start and stop points are a and b, each point between a and b would occur with a probability density of 1/(b-a):
Fig. 3.12: In a uniform distribution, all the random variables occur with the same (uniform) frequency/probability
As the uniform distribution is defined by its start and stop points, it is essential to know these points while generating random numbers following a uniform distribution. Thus, these points are taken as input parameters for the uniform function that is used to generate a random number following a uniform distribution. The other parameter of this function is the number of random numbers that one wants to generate.
To generate 100 random numbers lying between 1 and 100, one can write the following:
import numpy as np
randnum=np.random.uniform(1,100,100)
To check whether it indeed follows the uniform distribution, let's plot a histogram of these numbers and see whether they occur with the same probability or not. This can be done using the following code snippet:
import numpy as np
import matplotlib.pyplot as plt
# IPython/Jupyter magic; leave it out when running as a plain script
%matplotlib inline
a=np.random.uniform(1,100,100)
plt.hist(a)
The output that we get is not what we expected. It doesn't have the same probability for all the numbers, as seen in the following output:
Fig. 3.13: Histogram of 100 random numbers between 1 and 100 following uniform distribution
The reason for this is that 100 is a very small sample size, given the range (1-100), to showcase the property of the uniform distribution. We should try generating more random numbers and then see the results. Try generating around a million (1,000,000) numbers by changing the parameter in the uniform function, and then see the results of the preceding code snippet.
It should look something like the following:
Fig. 3.14: The kind of plot expected for uniform distribution, all the numbers occur with the same frequency/probability
If you observe the preceding plot closely, each of the 10 bins spans roughly 10 units and holds roughly 100,000 of the numbers (and hence has a probability of 100000/1000000=1/10). This means that each unit-wide interval occurs with a probability of about 1/10*1/10=1/100, which is close to the value that we would have expected from a set of numbers following the uniform distribution between 1 and 100 (1/(100-1)=1/99).
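The bin counts can also be read off numerically instead of from the plot. The following is a small sketch, assuming np.histogram with 10 equal-width bins (the same number of bins that plt.hist uses by default):

```python
import numpy as np

np.random.seed(0)
a = np.random.uniform(1, 100, 1000000)

# 10 equal-width bins between the smallest and the largest generated number
counts, edges = np.histogram(a, bins=10)

print(counts)             # each count should be close to 100,000
print(counts / a.size)    # each fraction should be close to 0.1
print(np.diff(edges))     # each bin is a little under 10 units wide
```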
Normal distribution
Normal distribution is the most common form of probability distribution arising from everyday real-life situations. For example, the exam score distribution of students in a class would roughly follow the normal distribution, as would the heights of the students in the class. An interesting property, formalized by the central limit theorem, is that the averages of samples drawn from many other probability distributions tend toward a normal distribution as the sample size increases. In a sense, one can say that a normal distribution is the most ubiquitous and versatile probability distribution around.
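One way to see this tendency toward the normal distribution is to average many draws from a distribution that is not normal at all. The sketch below assumes 10,000 samples of 50 uniform numbers each (both figures are arbitrary choices for illustration) and checks how the sample averages are spread:

```python
import numpy as np

np.random.seed(0)

# 10,000 samples, each holding 50 draws from a uniform distribution on [0, 1)
draws = np.random.uniform(0, 1, size=(10000, 50))
sample_means = draws.mean(axis=1)

center = sample_means.mean()
spread = sample_means.std()
print(center)    # close to 0.5, the mean of the uniform distribution
print(spread)    # close to sqrt(1/12)/sqrt(50), roughly 0.041

# For a normal distribution, about 68% of values lie within one
# standard deviation of the mean
within_one_sd = np.mean(np.abs(sample_means - center) <= spread)
print(within_one_sd)
```

Plotting a histogram of sample_means would show the familiar bell shape, even though the individual draws are uniform.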
The parameters that define a normal distribution are the mean and standard deviation. A normal distribution with a mean of 0 and a standard deviation of 1 is called a standard normal distribution. The randn function of the random module is used to generate random numbers following a standard normal distribution.
To generate 100 such numbers, one simply writes the following:
import numpy as np
a=np.random.randn(100)
To take a look at how random these values actually are, let's plot them against a list of integers:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
a=np.random.randn(100)
b=range(1,101)
plt.plot(b,a)
The output looks something like the following image. The numbers are visibly random.
Fig. 3.15: A plot of 100 random numbers following normal distribution
One can also pass the dimensions of the expected array as separate arguments. If one passes, let's say, 2 and 4 as the inputs, one would get a 2 x 4 array of numbers following a standard normal distribution:
import numpy as np
a=np.random.randn(2,4)
If no arguments are passed, it generates a single random number from the standard normal distribution.
To get numbers following a normal distribution with a mean and standard deviation other than 0 and 1 (let's say, mean 1.5 and standard deviation 2.5), one can write something like the following:
import numpy as np
a=2.5*np.random.randn(100)+1.5
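The preceding calculation holds because the standard normal variable S is obtained from a normal variable X, with mean μ and standard deviation σ, using the following formula:
S = (X - μ)/σ
Rearranging this gives X = σ*S + μ, which is what the preceding snippet computes with σ = 2.5 and μ = 1.5.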
Let's generate enough random numbers following a standard normal distribution and plot them to see whether they follow the shape of a standard normal distribution (a bell curve). This can be done using the following code snippet:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
a=np.random.randn(100000)
plt.hist(a)
The output would look something like this, which roughly looks like a bell curve (if one joins the top points of all the bins to form a curvilinear line):
Fig. 3.16: Histogram of 100000 random numbers following standard normal distribution.
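Returning to the normal distribution with mean 1.5 and standard deviation 2.5, one can check numerically that scaling and shifting standard normal numbers gives the intended mean and standard deviation. The sketch below also uses np.random.normal, which takes the mean and standard deviation directly; it is shown here as an alternative and is not the approach used in the text above:

```python
import numpy as np

np.random.seed(0)
mu, sigma = 1.5, 2.5

# Rescale standard normal draws: X = sigma*S + mu
scaled = sigma * np.random.randn(100000) + mu

# Ask for the same distribution directly (mean, standard deviation, count)
direct = np.random.normal(mu, sigma, 100000)

print(scaled.mean(), scaled.std())   # close to 1.5 and 2.5
print(direct.mean(), direct.std())   # close to 1.5 and 2.5
```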
Using the Monte-Carlo simulation to find the value of pi
Till now, we have been learning about various ways to generate random numbers. Let's now see an application of random numbers. In this section, we will use random numbers to run something called Monte-Carlo simulations to calculate the value of pi. These simulations are based on repeated random sampling or the generation of numbers.
Geometry and mathematics behind the calculation of pi
Consider a circle of radius r units inscribed in a square of side 2r units, such that the circle's diameter equals the square's side:
Fig. 3.17: A circle of radius r inscribed in a square of side 2r
What is the probability that a point chosen at random inside the square would lie inside the circle? This probability is given by the following formula:
P = (area of the circle)/(area of the square) = (π*r*r)/(2r*2r) = π/4
Thus, we find out that the probability of a point lying inside the circle is pi/4. The purpose of the simulation is to calculate this probability and use this to estimate the value of pi. The following are the steps to be implemented to run this simulation:
- Generate points with both x and y coordinates lying between 0 and 1. (This covers one quadrant of a circle of radius 1 and the matching quarter of the square, so the ratio of the areas is still pi/4.)
- Calculate x*x + y*y. If it is less than or equal to 1, the point lies inside the circle. If it is greater than 1, it lies outside the circle.
- Calculate the total number of points that lie inside the circle. Divide it by the total number of points generated to get the probability of a point lying inside the circle.
- Use this probability to calculate the value of pi.
- Repeat the process a sufficient number of times, say, 1,000 times, and generate 1,000 different values of pi.
- Take an average of all the 1,000 values of pi to arrive at the final value of pi.
Let's see how one can implement these steps in Python. The following code snippet would do just this:
import numpy as np
import matplotlib.pyplot as plt
pi_avg=0
pi_value_list=[]
for i in range(100):
    value=0
    # 1,000 random points in the unit square
    x=np.random.uniform(0,1,1000).tolist()
    y=np.random.uniform(0,1,1000).tolist()
    for j in range(1000):
        z=np.sqrt(x[j]*x[j]+y[j]*y[j])
        if z<=1:
            value+=1    # the point lies inside the circle
    float_value=float(value)
    pi_value=float_value*4/1000
    pi_value_list.append(pi_value)
    pi_avg+=pi_value
pi=pi_avg/100
print(pi)
ind=range(1,101)
fig=plt.plot(ind,pi_value_list)
fig
The preceding snippet generates 1,000 random points to calculate the probability of a point lying inside the circle and then repeats this process 100 times to get at the final averaged value of pi. These 100 values of pi have been plotted and they look as follows:
Fig. 3.18: Values of pi over 100 simulations of 1000 points each
The final averaged value of pi comes out to be 3.14584 in this run. As we increase the number of runs, the accuracy increases. One can easily wrap the preceding snippet in a function and pass the number of runs as an input for the easy comparison of pi values as an increasing number of runs are passed to this function. The following code snippet is a function to do just this:
def pi_run(nums,loops):
    pi_avg=0
    pi_value_list=[]
    for i in range(loops):
        value=0
        # nums random points in the unit square
        x=np.random.uniform(0,1,nums).tolist()
        y=np.random.uniform(0,1,nums).tolist()
        for j in range(nums):
            z=np.sqrt(x[j]*x[j]+y[j]*y[j])
            if z<=1:
                value+=1
        float_value=float(value)
        pi_value=float_value*4/nums
        pi_value_list.append(pi_value)
        pi_avg+=pi_value
    pi=pi_avg/loops
    ind=range(1,loops+1)
    fig=plt.plot(ind,pi_value_list)
    return (pi,fig)
To call this function, write pi_run(1000,100), and it should give you a similar result as was given previously with the hardcoded numbers. This function would return both the averaged value of pi as well as the plot.
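The inner loop over the individual points can also be replaced by array operations. The following is a minimal sketch of that idea, not the function used above; the name pi_estimate is hypothetical, and the plot is left out to keep the example short:

```python
import numpy as np

def pi_estimate(nums):
    """Estimate pi from nums random points in the unit square.

    pi_estimate is a hypothetical helper name used for illustration.
    """
    x = np.random.uniform(0, 1, nums)
    y = np.random.uniform(0, 1, nums)
    inside = (x * x + y * y) <= 1    # True where the point lies inside the circle
    return 4 * inside.mean()         # the fraction inside approximates pi/4

np.random.seed(0)
estimates = [pi_estimate(1000) for _ in range(100)]
print(np.mean(estimates))            # averaged value of pi over 100 runs
print(np.std(estimates))             # spread of the individual estimates
```

Comparing a boolean array against 1 and taking its mean does the counting and the division in one step, which is usually much faster than a Python loop for a large number of points.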
Generating a dummy data frame
One very important use of generating random numbers is to create a dummy data frame, which will be used extensively in this book to illustrate concepts and examples.
The basic concept of this is that the array/list of random numbers generated through the various methods described in the previous sections can be passed as the columns of a data frame. The column names and their values are passed as the keys and values of a dictionary.
Let's see an example where a dummy data frame contains two columns, A and B, which have 10 random numbers following a standard normal distribution and normal distribution, respectively.
To create such a data frame, one can run the following code snippet:
import pandas as pd
d=pd.DataFrame({'A':np.random.randn(10),'B':2.5*np.random.randn(10)+1.5})
d
The following screenshot is the output of the code:
Fig. 3.19: A dummy data frame containing 2 columns – one having numbers following standard normal distribution, the second having random numbers following normal distribution with mean 1.5 and standard deviation 2.5
Categorical/string variables can also be passed as a list to be part of a dummy data frame. Let's go back to our example of the Customer Churn Model data and use the column names as the list to be passed. This can be done as described in the following snippet:
import pandas as pd
data = pd.read_csv('E:/Personal/Learning/Datasets/Book/Customer Churn Model.txt')
column_list=data.columns.values.tolist()
a=len(column_list)
d=pd.DataFrame({'Column_Name':column_list,'A':np.random.randn(a),'B':2.5*np.random.randn(a)+1.5})
d
The output of the preceding snippet is as follows:
Fig. 3.20: Another dummy data frame. Similar to the one above but with one extra column which has the column names of the data frame named data
The index can also be passed as one of the parameters of this function. By default, it gives a range of numbers starting from 0 as the index. If we want something else as the index, we can specify it in the index parameter as shown in the following example:
import pandas as pd
d=pd.DataFrame({'A':np.random.randn(10),'B':2.5*np.random.randn(10)+1.5},index=range(10,20))
d
The output of the preceding code looks like the following:
Fig. 3.21: Passing indices to the dummy data frame
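The pieces from this section can be combined into a dummy data frame that is reproducible and does not depend on an external file. The sketch below is one possible arrangement; the seed value, the column name Group, and the labels x, y, and z are hypothetical choices made for illustration:

```python
import numpy as np
import pandas as pd

np.random.seed(42)   # fixed seed so that the same data frame is produced every time
n = 10

d = pd.DataFrame(
    {
        "A": np.random.randn(n),                        # standard normal
        "B": 2.5 * np.random.randn(n) + 1.5,            # mean 1.5, standard deviation 2.5
        "Group": np.random.choice(["x", "y", "z"], n),  # hypothetical categorical labels
    },
    index=range(10, 20),
)
print(d)
```

Because the seed is fixed, rerunning the snippet gives the same values, which is convenient when a dummy data frame is used to illustrate an example.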