
ML with sklearn: the ShuffleSplit and StratifiedShuffleSplit classes in the sklearn library explained

2020-09-04 22:38, 2719 views


from sklearn.model_selection import ShuffleSplit,StratifiedShuffleSplit
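Both classes split a data set after shuffling it: the samples are randomly permuted before they are divided. If the data are stored in some order (for example sorted by class) and are split without shuffling, the training and test sets can end up with different distributions, which makes the evaluation unreliable and hurts the model's ability to generalize. StratifiedShuffleSplit is a merge of StratifiedKFold and ShuffleSplit: it returns stratified randomized folds, which are made by preserving the percentage of samples of each class.
The samples are first shuffled randomly and then divided into train/test pairs according to the parameters. n_splits sets how many independently drawn train/test pairs are produced; each pair is returned as two arrays of indices, so n splits give n groups of index values. With StratifiedShuffleSplit, every pair keeps the same class proportions: if the training set of the first pair has a class ratio of 2:1, the training sets of all later pairs have that ratio too. ShuffleSplit gives no such guarantee.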
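To make the difference between the two splitters concrete, here is a minimal sketch that counts the class members in each train/test pair. The toy labels (40 samples of class 0, 20 of class 1), the helper name class_counts_per_split and the seed are assumptions chosen for illustration; they are not from the original post.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

# Toy labels: 40 samples of class 0 and 20 of class 1 (a 2:1 ratio).
y = np.array([0] * 40 + [1] * 20)
X = np.zeros((len(y), 1))  # the feature values do not influence the split


def class_counts_per_split(splitter, X, y):
    """Return one (train_counts, test_counts) tuple per split (hypothetical helper)."""
    counts = []
    for train_idx, test_idx in splitter.split(X, y):
        train_counts = np.bincount(y[train_idx], minlength=2).tolist()
        test_counts = np.bincount(y[test_idx], minlength=2).tolist()
        counts.append((train_counts, test_counts))
    return counts


plain = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
stratified = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

plain_counts = class_counts_per_split(plain, X, y)
stratified_counts = class_counts_per_split(stratified, X, y)

print("ShuffleSplit:          ", plain_counts)
print("StratifiedShuffleSplit:", stratified_counts)
```

With these labels the stratified splitter should put 30 and 15 samples of the two classes into every training set and 10 and 5 into every test set, while the per-class counts from ShuffleSplit depend on the random draw.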

The ShuffleSplit() class

cv_split = ShuffleSplit(n_splits=6, train_size=0.7, test_size=0.2)
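As a rough illustration of what this call produces, the sketch below applies the same settings to 100 toy samples. The data, the variable names and random_state=0 are assumptions added here so that the run is repeatable; the original line does not set a seed.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(100).reshape(100, 1)  # 100 toy samples, one feature each

cv_split = ShuffleSplit(n_splits=6, train_size=0.7, test_size=0.2,
                        random_state=0)
splits = list(cv_split.split(X))

for train_idx, test_idx in splits:
    # No index may appear on both sides of a split.
    overlap = np.intersect1d(train_idx, test_idx)
    # Samples that land in neither set (train_size + test_size < 1.0).
    unused = len(X) - len(train_idx) - len(test_idx)
    print(len(train_idx), len(test_idx), len(overlap), unused)

# A second splitter with the same seed reproduces the same index arrays.
again = list(ShuffleSplit(n_splits=6, train_size=0.7, test_size=0.2,
                          random_state=0).split(X))
same = all(np.array_equal(a[0], b[0]) and np.array_equal(a[1], b[1])
           for a, b in zip(splits, again))
print("reproducible:", same)
```

Because train_size and test_size add up to 0.9, each of the six pairs should leave 10 of the 100 samples out of both sets.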

class ShuffleSplit(BaseShuffleSplit):
    """Random permutation cross-validator

    Yields indices to split data into training and test sets.

    Note: contrary to other cross-validation strategies, random splits
    do not guarantee that all folds will be different, although this is
    still very likely for sizeable datasets.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=10
        Number of re-shuffling & splitting iterations.

    test_size : float or int, default=None
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, the value is set to the
        complement of the train size. If ``train_size`` is also None, it will
        be set to 0.1.

    train_size : float or int, default=None
        If float, should be between 0.0 and 1.0 and represent the
        proportion of the dataset to include in the train split. If
        int, represents the absolute number of train samples. If None,
        the value is automatically set to the complement of the test size.

    random_state : int or RandomState instance, default=None
        Controls the randomness of the training and testing indices produced.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    Notes on the parameters
    -----------------------
    n_splits is the number of train/test pairs the data is split into.
    test_size sets the share of each train/test pair that goes to the test set.
    train_size sets the share of each train/test pair that goes to the training set; if it is None, it is set to the complement of test_size.
    random_state is the state of the pseudo-random number generator used to shuffle the samples before they are sampled.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import ShuffleSplit
    >>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [3, 4], [5, 6]])
    >>> y = np.array([1, 2, 1, 2, 1, 2])
    >>> rs = ShuffleSplit(n_splits=5, test_size=.25, random_state=0)
    >>> rs.get_n_splits(X)
    5
    >>> print(rs)
    ShuffleSplit(n_splits=5, random_state=0, test_size=0.25, train_size=None)
    >>> for train_index, test_index in rs.split(X):
    ...     print("TRAIN:", train_index, "TEST:", test_index)
    TRAIN: [1 3 0 4] TEST: [5 2]
    TRAIN: [4 0 2 5] TEST: [1 3]
    TRAIN: [1 2 4 0] TEST: [3 5]
    TRAIN: [3 4 1 0] TEST: [5 2]
    TRAIN: [3 5 1 0] TEST: [2 4]
    >>> rs = ShuffleSplit(n_splits=5, train_size=0.5, test_size=.25, random_state=0)
    >>> for train_index, test_index in rs.split(X):
    ...     print("TRAIN:", train_index, "TEST:", test_index)
    TRAIN: [1 3 0] TEST: [5 2]
    TRAIN: [4 0 2] TEST: [1 3]
    TRAIN: [1 2 4] TEST: [3 5]
    TRAIN: [3 4 1] TEST: [5 2]
    TRAIN: [3 5 1] TEST: [2 4]
    """

    @_deprecate_positional_args
    def __init__(self, n_splits=10, *, test_size=None, train_size=None,
                 random_state=None):
        super().__init__(
            n_splits=n_splits,
            test_size=test_size,
            train_size=train_size,
            random_state=random_state)
        self._default_test_size = 0.1

    def _iter_indices(self, X, y=None, groups=None):
        n_samples = _num_samples(X)
        n_train, n_test = _validate_shuffle_split(
            n_samples, self.test_size, self.train_size,
            default_test_size=self._default_test_size)

        rng = check_random_state(self.random_state)
        for i in range(self.n_splits):
            # random partition
            permutation = rng.permutation(n_samples)
            ind_test = permutation[:n_test]
            ind_train = permutation[n_test:n_test + n_train]
            yield ind_train, ind_test
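The core of _iter_indices can be sketched outside scikit-learn with NumPy alone. This is a simplified illustration, not the library's implementation: the function name shuffle_split_indices is hypothetical, and it takes n_train and n_test as already-resolved integers, whereas scikit-learn derives them from train_size and test_size through _validate_shuffle_split.

```python
import numpy as np


def shuffle_split_indices(n_samples, n_train, n_test, n_splits, seed=None):
    """Yield (train, test) index arrays, drawing a fresh permutation per split."""
    if n_train + n_test > n_samples:
        raise ValueError("n_train + n_test must not exceed n_samples")
    rng = np.random.RandomState(seed)
    for _ in range(n_splits):
        permutation = rng.permutation(n_samples)
        test = permutation[:n_test]                   # first n_test shuffled indices
        train = permutation[n_test:n_test + n_train]  # the next n_train indices
        yield train, test


pairs = list(shuffle_split_indices(n_samples=6, n_train=4, n_test=2,
                                   n_splits=5, seed=0))
for train, test in pairs:
    print("TRAIN:", train, "TEST:", test)
```

With seed=0 and these sizes the printed pairs should match the first docstring example above, since check_random_state(0) also produces a RandomState seeded with 0. Each split draws a new permutation, which is why the same sample can appear in the test set of several splits.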


The StratifiedShuffleSplit() class

StratifiedShuffleSplit(n_splits=10, test_size=None, train_size=None, random_state=None)
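One common way to use such a splitter is to pass it as the cv argument of cross_val_score. The sketch below does this on a synthetic, imbalanced data set; the data set, the LogisticRegression model and all parameter values are assumptions chosen for illustration and do not come from the original post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Synthetic, imbalanced binary problem (roughly 90% / 10%); purely illustrative.
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# Share of the minority class in each test split; stratification keeps it
# close to the share in the full label vector.
minority_share = [float(np.mean(y[test_idx] == 1))
                  for _, test_idx in sss.split(X, y)]

# A splitter object can be passed directly as the `cv` argument.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=sss)

print("minority share overall:", float(np.mean(y == 1)))
print("minority share per test split:", minority_share)
print("accuracy per split:", scores)
```

On imbalanced data like this, stratification is what keeps a test split from containing too few minority samples by chance.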

class StratifiedShuffleSplit(BaseShuffleSplit):
    """Stratified Shuffle Split cross-validator

    Provides train/test indices to split data in train/test sets.

    This cross-validation object is a merge of StratifiedKFold and
    ShuffleSplit, which returns stratified randomized folds. The folds
    are made by preserving the percentage of samples for each class.

    Note: like the ShuffleSplit strategy, stratified random splits
    do not guarantee that all folds will be different, although this is
    still very likely for sizeable datasets.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=10
        Number of re-shuffling & splitting iterations.

    test_size : float or int, default=None
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, the value is set to the
        complement of the train size. If ``train_size`` is also None, it will
        be set to 0.1.

    train_size : float or int, default=None
        If float, should be between 0.0 and 1.0 and represent the
        proportion of the dataset to include in the train split. If
        int, represents the absolute number of train samples. If None,
        the value is automatically set to the complement of the test size.

    random_state : int or RandomState instance, default=None
        Controls the randomness of the training and testing indices produced.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import StratifiedShuffleSplit
    >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
    >>> y = np.array([0, 0, 0, 1, 1, 1])
    >>> sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5,
    ...                              random_state=0)
    >>> sss.get_n_splits(X, y)
    5
    >>> print(sss)
    StratifiedShuffleSplit(n_splits=5, random_state=0, ...)
    >>> for train_index, test_index in sss.split(X, y):
    ...     print("TRAIN:", train_index, "TEST:", test_index)
    ...     X_train, X_test = X[train_index], X[test_index]
    ...     y_train, y_test = y[train_index], y[test_index]
    TRAIN: [5 2 3] TEST: [4 1 0]
    TRAIN: [5 1 4] TEST: [0 2 3]
    TRAIN: [5 0 2] TEST: [4 3 1]
    TRAIN: [4 1 0] TEST: [2 3 5]
    TRAIN: [0 5 1] TEST: [3 4 2]
    """

    @_deprecate_positional_args
    def __init__(self, n_splits=10, *, test_size=None, train_size=None,
                 random_state=None):
        super().__init__(
            n_splits=n_splits,
            test_size=test_size,
            train_size=train_size,
            random_state=random_state)
        self._default_test_size = 0.1

    def _iter_indices(self, X, y, groups=None):
        n_samples = _num_samples(X)
        y = check_array(y, ensure_2d=False, dtype=None)
        n_train, n_test = _validate_shuffle_split(
            n_samples, self.test_size, self.train_size,
            default_test_size=self._default_test_size)

        if y.ndim == 2:
            # for multi-label y, map each distinct row to a string repr
            # using join because str(row) uses an ellipsis if len(row) > 1000
            y = np.array([' '.join(row.astype('str')) for row in y])

        classes, y_indices = np.unique(y, return_inverse=True)
        n_classes = classes.shape[0]

        class_counts = np.bincount(y_indices)
        if np.min(class_counts) < 2:
            raise ValueError("The least populated class in y has only 1"
                             " member, which is too few. The minimum"
                             " number of groups for any class cannot"
                             " be less than 2.")

        if n_train < n_classes:
            raise ValueError('The train_size = %d should be greater or '
                             'equal to the number of classes = %d' %
                             (n_train, n_classes))
        if n_test < n_classes:
            raise ValueError('The test_size = %d should be greater or '
                             'equal to the number of classes = %d' %
                             (n_test, n_classes))

        # Find the sorted list of instances for each class:
        # (np.unique above performs a sort, so code is O(n logn) already)
        class_indices = np.split(np.argsort(y_indices, kind='mergesort'),
                                 np.cumsum(class_counts)[:-1])

        rng = check_random_state(self.random_state)

        for _ in range(self.n_splits):
            # if there are ties in the class-counts, we want
            # to make sure to break them anew in each iteration
            n_i = _approximate_mode(class_counts, n_train, rng)
            class_counts_remaining = class_counts - n_i
            t_i = _approximate_mode(class_counts_remaining, n_test, rng)

            train = []
            test = []

            for i in range(n_classes):
                permutation = rng.permutation(class_counts[i])
                perm_indices_class_i = class_indices[i].take(permutation,
                                                             mode='clip')

                train.extend(perm_indices_class_i[:n_i[i]])
                test.extend(perm_indices_class_i[n_i[i]:n_i[i] + t_i[i]])

            train = rng.permutation(train)
            test = rng.permutation(test)

            yield train, test

    def split(self, X, y, groups=None):
        """Generate indices to split data into training and test set.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data, where n_samples is the number of samples
            and n_features is the number of features.

            Note that providing ``y`` is sufficient to generate the splits and
            hence ``np.zeros(n_samples)`` may be used as a placeholder for
            ``X`` instead of actual training data.

        y : array-like of shape (n_samples,) or (n_samples, n_labels)
            The target variable for supervised learning problems.
            Stratification is done based on the y labels.

        groups : object
            Always ignored, exists for compatibility.

        Yields
        ------
        train : ndarray
            The training set indices for that split.

        test : ndarray
            The testing set indices for that split.

        Notes
        -----
        Randomized CV splitters may return different results for each call of
        split. You can make the results identical by setting `random_state`
        to an integer.
        """
        y = check_array(y, ensure_2d=False, dtype=None)
        return super().split(X, y, groups)
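The validation checks in _iter_indices can be triggered from user code. The following sketch shows two of them; the helper first_split_error and the toy label vectors are assumptions made for illustration, and the exact wording of the error messages may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit


def first_split_error(y, **splitter_kwargs):
    """Return the ValueError message raised while splitting on y, or None."""
    y = np.asarray(y)
    X = np.zeros(len(y))  # placeholder: only y is needed to build the splits
    sss = StratifiedShuffleSplit(n_splits=1, random_state=0, **splitter_kwargs)
    try:
        next(sss.split(X, y))  # split() is a generator; checks run on first use
    except ValueError as exc:
        return str(exc)
    return None


# Three members per class: the split succeeds.
ok = first_split_error([0, 0, 0, 1, 1, 1], test_size=0.5)
# Class 1 has a single member, so it cannot appear in both train and test.
singleton = first_split_error([0, 0, 0, 1], test_size=0.5)
# Two test samples cannot cover three classes.
too_small = first_split_error([0, 0, 1, 1, 2, 2], test_size=2)

print(ok)
print(singleton)
print(too_small)
```

Because split() is a generator, these errors surface only when the first pair is requested, not when the splitter is constructed.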

