In layman's terms, sampling means customizing the data distribution to suit a particular purpose or requirement. Sampling is an important topic in Data Science that is often given the least importance. Let's talk a bit about it.
The difference between Sampling and Feature Selection is that Feature Selection deals with particular features (columns), whereas Sampling deals with the classes of the samples (rows) and how to reshuffle/redistribute them.
Eg. when a particular class/label has a much lower/larger number of samples compared to the others, sampling is the best way to go.
There are many ways to perform data sampling ->
1) Random Undersampling
Use the imblearn package. RandomUnderSampler drops majority-class samples at random, while Tomek links are a more informed variant that removes majority samples lying on the class boundary. Both are used to undersample the required class, as sketched below.
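A minimal sketch with imblearn, assuming a synthetic imbalanced dataset built with scikit-learn's make_classification (the 90/10 class split is just an illustrative choice):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler, TomekLinks

# Hypothetical imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))

# Random undersampling: drop majority-class samples at random until balanced
X_rus, y_rus = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_rus))

# Tomek links: remove majority samples that form nearest-neighbour pairs
# with minority samples, cleaning up the class boundary
X_tl, y_tl = TomekLinks().fit_resample(X, y)
print(Counter(y_tl))
```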
2) Random Oversampling
Use the imblearn package. RandomOverSampler duplicates minority-class samples at random, while SMOTE is a common oversampling method that synthesizes new samples for the required class (see the sketch below).
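A minimal sketch, again with imblearn and the same kind of synthetic imbalanced data as above:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Random oversampling: duplicate minority-class samples at random
X_ros, y_ros = RandomOverSampler(random_state=42).fit_resample(X, y)
print(Counter(y_ros))

# SMOTE: interpolate between minority-class neighbours to create synthetic samples
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_sm))
```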
3) Simple Random Sampling
A simple subset selection where each element has an equal probability of being selected.
eg. subset = df.sample(100)
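A slightly fuller sketch with pandas, assuming df is any DataFrame (the toy frame below is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"value": range(1000)})

# Each row has an equal probability of being selected
subset = df.sample(n=100, random_state=42)           # a fixed number of rows
subset_frac = df.sample(frac=0.1, random_state=42)   # or a fraction of the rows
```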
4) Stratified Sampling
Stratified sampling is performed when the class labels are taken care of while sampling, i.e. each subset preserves the class proportions of the original data.
eg. train_test_split(X, y, stratify=y, test_size=0.2)
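A minimal sketch with scikit-learn, assuming an imbalanced X, y as in the earlier examples; stratify=y keeps the class ratio the same in both splits:

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# stratify=y preserves the ~90/10 class ratio in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)
print(Counter(y_train), Counter(y_test))
```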
5) Reservoir Sampling
Reservoir sampling doesn't need to know the length of the data in advance. Rather, it draws a fixed-size sample from a potentially infinite data stream, using probability to guarantee that every element seen so far has an equal chance of ending up in the sample. It is used in Big Data applications where a large stream of input flows in and close approximations are made.
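A minimal sketch of the classic "Algorithm R" form of reservoir sampling; stream can be any iterable, including one whose length is unknown:

```python
import random

def reservoir_sample(stream, k):
    """Return k items from stream, each with equal probability,
    without knowing the stream's length in advance."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = random.randint(0, i)    # pick a slot in [0, i]
            if j < k:                   # replace with probability k/(i+1)
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), 10))
```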