



How do I split a custom dataset into training and test datasets?

  • nirvair asked:

    • import pandas as pd
      import numpy as np
      import cv2
      from torch.utils.data.dataset import Dataset
      class CustomDatasetFromCSV(Dataset):
          def __init__(self, csv_path, transform=None):
              self.data = pd.read_csv(csv_path)
              self.labels = pd.get_dummies(self.data['emotion']).as_matrix()
              self.height = 48
              self.width = 48
              self.transform = transform
          def __getitem__(self, index):
              pixels = self.data['pixels'].tolist()
              faces = []
              for pixel_sequence in pixels:
                  face = [int(pixel) for pixel in pixel_sequence.split(' ')]
                  # print(np.asarray(face).shape)
                  face = np.asarray(face).reshape(self.width, self.height)
                  face = cv2.resize(face.astype('uint8'), (self.width, self.height))
                  faces.append(face)
              faces = np.asarray(faces)
              faces = np.expand_dims(faces, -1)
              return faces, self.labels
          def __len__(self):
              return len(self.data)
    • This is what I could manage to do by using references from other repositories. However, I want to split this dataset into train and test.

    • How can I do that inside this class? Or do I need to make a separate class to do that?

  • Answers:

    • Fábio Perez - vote: 156

    • Starting in PyTorch 0.4.1 you can use random_split:

    • train_size = int(0.8 * len(full_dataset))
      test_size = len(full_dataset) - train_size
      train_dataset, test_dataset = torch.utils.data.random_split(full_dataset, [train_size, test_size])
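    • A minimal, self-contained sketch of the same split, for illustration only. It assumes a toy TensorDataset standing in for full_dataset, and it passes a seeded torch.Generator so the split is repeatable; the generator argument is available in PyTorch releases newer than 0.4.1, so treat it as optional. The names features and targets are made up for this example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Toy stand-in for full_dataset: 100 samples with 4 features each.
features = torch.arange(400, dtype=torch.float32).reshape(100, 4)
targets = torch.arange(100)
full_dataset = TensorDataset(features, targets)

train_size = int(0.8 * len(full_dataset))
test_size = len(full_dataset) - train_size

# Seeding the generator makes the random split repeatable across runs.
generator = torch.Generator().manual_seed(42)
train_dataset, test_dataset = random_split(
    full_dataset, [train_size, test_size], generator=generator
)

# Each split is a Subset and can be wrapped in its own DataLoader.
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

print(len(train_dataset), len(test_dataset))
```

      Because random_split returns Subset objects over the same underlying dataset, the custom Dataset class itself does not need to know about the split.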
    • benjaminplanche - vote: 127

    • Using PyTorch's SubsetRandomSampler:

    • import cv2
      import numpy as np
      import pandas as pd
      import torch
      from torch.utils.data import Dataset
      from torch.utils.data.sampler import SubsetRandomSampler
      class CustomDatasetFromCSV(Dataset):
          def __init__(self, csv_path, transform=None):
              self.data = pd.read_csv(csv_path)
              # .values replaces as_matrix(), which was removed in pandas 1.0
              self.labels = pd.get_dummies(self.data['emotion']).values
              self.height = 48
              self.width = 48
              self.transform = transform
          def __getitem__(self, index):
              # This method should return only 1 sample and label
              # (according to "index"), not the whole dataset
              # So probably something like this for you:
              pixel_sequence = self.data['pixels'][index]
              face = [int(pixel) for pixel in pixel_sequence.split(' ')]
              face = np.asarray(face).reshape(self.width, self.height)
              face = cv2.resize(face.astype('uint8'), (self.width, self.height))
              label = self.labels[index]
              return face, label
          def __len__(self):
              return len(self.labels)
      dataset = CustomDatasetFromCSV(my_path)  # my_path: path to your CSV file
      batch_size = 16
      validation_split = .2
      shuffle_dataset = True
      random_seed = 42
      # Creating data indices for training and validation splits:
      dataset_size = len(dataset)
      indices = list(range(dataset_size))
      split = int(np.floor(validation_split * dataset_size))
      if shuffle_dataset:
          # Shuffle once with a fixed seed so the split is reproducible
          np.random.seed(random_seed)
          np.random.shuffle(indices)
      train_indices, val_indices = indices[split:], indices[:split]
      # Creating PT data samplers and loaders:
      train_sampler = SubsetRandomSampler(train_indices)
      valid_sampler = SubsetRandomSampler(val_indices)
      train_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                                 sampler=train_sampler)
      validation_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                                      sampler=valid_sampler)
      # Usage Example:
      num_epochs = 10
      for epoch in range(num_epochs):
          # Train:
          for batch_index, (faces, labels) in enumerate(train_loader):
              # ...
              pass
    • Shital Shah - vote: 25

    • The current answers do random splits, which have the disadvantage that the number of samples per class is not guaranteed to be balanced. This is especially problematic when you want only a small number of samples per class. For example, MNIST has 60,000 examples, i.e. 6,000 per digit. Assume that you want only 30 examples per digit in your training set. In this case, a random split may produce an imbalance between classes (one digit with more training data than the others). So you want to make sure each digit has exactly 30 labeled examples. This is called stratified sampling.
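    • To make the difference concrete, here is a small sketch in plain Python. It assumes a made-up label list with 10 classes of 6,000 samples each (a stand-in for MNIST labels, not the real dataset) and compares the per-class counts of a plain random draw of 300 indices with a stratified draw of 30 indices per class.

```python
import random
from collections import Counter, defaultdict

rng = random.Random(0)

# Toy stand-in for MNIST labels: 10 classes, 6000 samples each.
labels = [i % 10 for i in range(60000)]

# Plain random split: draw 300 training indices without regard to class.
random_train = rng.sample(range(len(labels)), 300)
random_counts = Counter(labels[i] for i in random_train)

# Stratified split: draw exactly 30 indices from each class.
by_class = defaultdict(list)
for index, label in enumerate(labels):
    by_class[label].append(index)
stratified_train = []
for label in sorted(by_class):
    stratified_train.extend(rng.sample(by_class[label], 30))
stratified_counts = Counter(labels[i] for i in stratified_train)

print("random:    ", sorted(random_counts.items()))
print("stratified:", sorted(stratified_counts.items()))
```

      The random draw gives per-class counts that only total 300, while the stratified draw gives exactly 30 for every class.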

    • One way to do this is to use the sampler interface in PyTorch; sample code is here.
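    • The linked sample code is not reproduced here. As a rough sketch of the sampler-based idea, assuming a toy TensorDataset with 4 classes and a hypothetical k of 5 samples per class, one could collect k indices per class and hand them to SubsetRandomSampler:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.sampler import SubsetRandomSampler

# Toy dataset: 200 samples, labels cycle through 4 classes.
labels = torch.arange(200) % 4
data = torch.randn(200, 8)
dataset = TensorDataset(data, labels)

k = 5  # hypothetical number of training samples to keep per class
train_indices = []
for c in labels.unique().tolist():
    class_indices = (labels == c).nonzero(as_tuple=True)[0]
    # Shuffle within the class, then keep the first k indices.
    perm = class_indices[torch.randperm(len(class_indices))]
    train_indices.extend(perm[:k].tolist())

# The sampler restricts the loader to the chosen indices.
train_sampler = SubsetRandomSampler(train_indices)
train_loader = DataLoader(dataset, batch_size=10, sampler=train_sampler)

# Collect the labels the loader actually yields in one pass.
seen = torch.cat([batch_labels for _, batch_labels in train_loader])
```

      The dataset itself is left untouched; only the indices given to the sampler decide which samples the loader serves.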

    • Another way to do this is to just hack your way through :). For example, below is a simple implementation for MNIST, where ds is the MNIST dataset and k is the number of samples needed for each class.

    • import torch
      from torch.utils.data import TensorDataset
      def sampleFromClass(ds, k):
          # The first k samples of each class go to train, the rest to test.
          class_counts = {}
          train_data = []
          train_label = []
          test_data = []
          test_label = []
          for data, label in ds:
              # Labels may be plain ints (as in recent torchvision) or 0-dim tensors.
              label = torch.as_tensor(label)
              c = label.item()
              class_counts[c] = class_counts.get(c, 0) + 1
              if class_counts[c] <= k:
                  train_data.append(data)
                  train_label.append(torch.unsqueeze(label, 0))
              else:
                  test_data.append(data)
                  test_label.append(torch.unsqueeze(label, 0))
          train_data = torch.cat(train_data)
          train_label = torch.cat(train_label)
          test_data = torch.cat(test_data)
          test_label = torch.cat(test_label)
          return (TensorDataset(train_data, train_label),
                  TensorDataset(test_data, test_label))
    • You can use this function like this:

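    • from torchvision import datasets, transforms
      def main():
          # The original call was cut off after download=True. The ToTensor
          # transform is an assumption, so that sampleFromClass receives tensors.
          train_ds = datasets.MNIST('../data', train=True, download=True,
                                    transform=transforms.ToTensor())
          train_ds, test_ds = sampleFromClass(train_ds, 3)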

