Translation (62): Using PyTorch DataLoader with Your Own Dataset

MWHLS • 2022/05/28 8:42 pm • Python, PyTorch, Programming Languages

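Index of popular Stack Overflow questions
If you spot any translation problems, please point them out in the comments. Thanks.
I have been procrastinating more and more lately; I really should not take on so much at once.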

PyTorch: How to use DataLoaders for custom Datasets

Sarthak asked:
- How can I use torch.utils.data.Dataset and torch.utils.data.DataLoader on my own data (not just torchvision.datasets)?
- Is there a way to use the built-in DataLoaders, which are used on TorchVisionDatasets, on any dataset?
Answers:
- paho - vote: 69
- Yes, that is possible. Just create the objects yourself, e.g.
- ```
import torch.utils.data as data_utils
#
train = data_utils.TensorDataset(features, targets)
train_loader = data_utils.DataLoader(train, batch_size=50, shuffle=True)
```
- where features and targets are tensors. In the simplest case features is 2-D, i.e. a matrix where each row represents one training sample, and targets may be 1-D or 2-D, depending on whether you are trying to predict a scalar or a vector.
  - Translator's note: I have not used this method myself, but features can also be 4-D, e.g. (Num, H, W, C), when handling images. Flattening is only needed for layers that expect 2-D input; see below.
- Hope that helps!
- EDIT: response to @sarthak's question
  - Translator's note: this is exactly the image question I raised above. The features do not have to be flattened for TensorDataset itself; flattening is only needed when a later layer expects 2-D input. The original comments follow:
  - I have 3D features: 2D for an image and one extra dimension for color channels. Would it still work if I pass the features as 5000xnxnx3? 5000 is the number of data points and nxnx3 is the image size.
  - A 4D dataset can be passed as features; there is no need for the view statement.
- Basically yes. If you create an object of type TensorDataset, the constructor checks whether the first dimensions of the feature tensor (named data_tensor in the PyTorch source at the time of this answer) and the target tensor (named target_tensor) have the same length:
- ```
assert data_tensor.size(0) == target_tensor.size(0)
```
- However, if you want to feed these data into a neural network afterwards, you need to be careful. While convolution layers work on data like yours, (I think) all other types of layers expect the data in matrix form. So, if you run into an issue like this, an easy solution is to convert your 4D dataset (given as some kind of tensor, e.g. FloatTensor) into a matrix with the view method. For your 5000xnxnx3 dataset, this would look like this:
- ```
dataset_2d = dataset_4d.view(5000, -1)
```
- (The value -1 tells PyTorch to figure out the length of the second dimension automatically.)
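
As a minimal sketch of the points in this answer, the following wraps a 4-D image tensor in a TensorDataset, draws one batch, and flattens it with view before a fully connected layer. The sizes (100 samples, n = 8), the random data, and the names images, labels, and fc are assumptions chosen for illustration; they are not part of the original answer.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical data: 100 RGB images of size 8x8, stored as (N, H, W, C).
num_samples, n = 100, 8
images = torch.rand(num_samples, n, n, 3)
labels = torch.randint(0, 10, (num_samples,))

# TensorDataset only requires that all tensors share the same first dimension.
dataset = TensorDataset(images, labels)
loader = DataLoader(dataset, batch_size=50, shuffle=True)

# Draw a single batch; the images keep their 4-D layout.
batch_images, batch_labels = next(iter(loader))

# A fully connected layer expects (batch, features), so flatten each image.
flat = batch_images.view(batch_images.size(0), -1)
fc = nn.Linear(n * n * 3, 10)  # illustrative layer, not from the answer
logits = fc(flat)
```

If the batch were fed to a convolution instead, note that nn.Conv2d expects channels first, (N, C, H, W), so a (N, H, W, C) batch would need batch_images.permute(0, 3, 1, 2) beforehand.
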
- user3693922 - vote: 12
- You can easily do this by extending the data.Dataset class. According to the API, all you have to do is implement two methods: __getitem__ and __len__.
- You can then wrap the dataset with the DataLoader as shown in the API and in @pho7's answer.
- I think the ImageFolder class is a good reference. See code here.
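
To make the two required methods concrete, here is a small sketch of a Dataset subclass. The class name SquaresDataset and its toy data (sample i is the pair (i, i squared)) are hypothetical and serve only to show where __len__ and __getitem__ fit.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SquaresDataset(Dataset):
    """Hypothetical dataset: sample i is the pair (i, i**2)."""

    def __init__(self, size):
        self.x = torch.arange(size, dtype=torch.float32)
        self.y = self.x ** 2

    def __len__(self):
        # Number of samples; DataLoader uses this to build its index range.
        return len(self.x)

    def __getitem__(self, index):
        # Return one (feature, target) pair; DataLoader stacks them into batches.
        return self.x[index], self.y[index]


dataset = SquaresDataset(10)
loader = DataLoader(dataset, batch_size=4, shuffle=False)

# Ten samples with batch_size=4 give batches of 4, 4, and 2 samples.
batch_sizes = [len(x) for x, y in loader]
```

A real dataset would typically load a file or decode an image inside __getitem__ instead of indexing a precomputed tensor.
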
- Khubaib Raza - vote: 2
- Yes, you can do it. Hope this helps future readers.
- ```
import torch
from torch.utils.data import TensorDataset, DataLoader
# Two samples with five features each, one integer target per sample
inputs = [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6]]
targets = [6, 7]
batch_size = 2
# Convert the Python lists to tensors
inputs = torch.tensor(inputs)
targets = torch.IntTensor(targets)
# Wrap the tensors in a dataset and batch them with a DataLoader
dataset = TensorDataset(inputs, targets)
data_loader = DataLoader(dataset, batch_size, shuffle=True)
```
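
The snippet in this answer stops once the loader is built. As a sketch of how the loader might then be consumed, the loop below iterates over it once; the loop and the names batch_inputs and batch_targets are additions for illustration, while the data is the same as in the answer.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Same toy data as in the answer: two samples, five features each.
inputs = torch.tensor([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6]])
targets = torch.tensor([6, 7], dtype=torch.int32)

data_loader = DataLoader(TensorDataset(inputs, targets), batch_size=2, shuffle=True)

for batch_inputs, batch_targets in data_loader:
    # With two samples and batch_size=2 there is exactly one batch per epoch.
    print(batch_inputs.shape, batch_targets.shape)
```

Each iteration yields a tuple with one entry per tensor passed to TensorDataset, batched along the first dimension.
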