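Translation (62): Using the PyTorch DataLoader on Your Own Dataset
If you spot any translation problems, please point them out in the comments. Thanks.
I have been procrastinating more and more lately; clearly I should not take on too many things at once.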
PyTorch: How to use DataLoaders for custom Datasets
Sarthak asked:
- How to make use of `torch.utils.data.Dataset` and `torch.utils.data.DataLoader` on your own data (not just the `torchvision.datasets`)?
- Is there a way to use the inbuilt `DataLoaders`, which are used on `TorchVisionDatasets`, on any dataset?
Answers:
paho - vote: 69
Yes, that is possible. Just create the objects yourself, e.g.

    import torch.utils.data as data_utils

    train = data_utils.TensorDataset(features, targets)
    train_loader = data_utils.DataLoader(train, batch_size=50, shuffle=True)

where `features` and `targets` are tensors. `features` has to be 2-D, i.e. a matrix where each row represents one training sample, and `targets` may be 1-D or 2-D, depending on whether you are trying to predict a scalar or a vector.
Translator's note: I have not actually used this method. I first assumed that `features` could be 4-D (Num, H, W, C) when handling images, but as I understand it the 3-D images have to be flattened; see below.
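A minimal sketch of the snippet above with concrete data, assuming 100 random samples with 8 features each and one scalar target per sample (the sizes and the random data are made up for illustration):

```python
import torch
import torch.utils.data as data_utils

# Hypothetical data: 100 samples, 8 features each, one scalar target per sample.
features = torch.randn(100, 8)
targets = torch.randn(100)

train = data_utils.TensorDataset(features, targets)
train_loader = data_utils.DataLoader(train, batch_size=50, shuffle=True)

# Each iteration yields one (features, targets) batch.
batch_shapes = [(x.shape, y.shape) for x, y in train_loader]
for x_shape, y_shape in batch_shapes:
    print(tuple(x_shape), tuple(y_shape))
```

With `shuffle=True` the samples are drawn in a different order each epoch, but every batch keeps each feature row paired with its target.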
Hope that helps!
EDIT: response to @sarthak's question
Translator's note: this is exactly the point I raised above about image handling in `features`. It is not passed directly as 4-D as I had assumed; it still has to be flattened (I do not find that convincing, though, so I reserve judgment). The original comments were:
- I have 3D features: 2D for an image and one extra dimension for color channels. Would it still work if I pass the features as 5000xnxnx3? 5000 is the number of data points, nxnx3 is the image size.
- A 4D dataset can be passed as features; there is no need for the view statement.
Basically yes. If you create an object of type `TensorDataset`, then the constructor checks whether the first dimensions of the feature tensor (which is actually called `data_tensor`) and the target tensor (called `target_tensor`) have the same length:

    assert data_tensor.size(0) == target_tensor.size(0)

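To illustrate the check, here is a small sketch. It assumes a recent PyTorch release, where `TensorDataset` accepts any number of tensors and the argument names `data_tensor` and `target_tensor` quoted above no longer appear in the signature; the shapes are invented for the example.

```python
import torch
from torch.utils.data import TensorDataset

# Hypothetical image-like data: 20 samples of size 4x4 with 3 channels.
images = torch.randn(20, 4, 4, 3)
labels = torch.zeros(20, dtype=torch.long)

# A 4-D feature tensor is accepted; only the first dimensions must agree.
dataset = TensorDataset(images, labels)
sample_image, sample_label = dataset[0]
print(len(dataset), tuple(sample_image.shape))

# Mismatched first dimensions trip the constructor's assertion.
try:
    TensorDataset(images, labels[:10])
    mismatch_rejected = False
except AssertionError:
    mismatch_rejected = True
print(mismatch_rejected)
```

Indexing the dataset returns one sample with its original per-sample shape, so no flattening is needed at this stage.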
However, if you want to feed these data into a neural network subsequently, then you need to be careful. While convolution layers work on data like yours, (I think) all of the other types of layers expect the data to be given in matrix form. So, if you run into an issue like this, an easy solution is to convert your 4D dataset (given as some kind of tensor, e.g. `FloatTensor`) into a matrix with the `view` method. For your 5000xnxnx3 dataset, this would look like this:

    dataset_2d = dataset_4d.view(5000, -1)

(The value `-1` tells PyTorch to figure out the length of the second dimension automatically.)
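A short sketch of the flattening step, assuming a small made-up dataset (50 images of 4x4x3 rather than 5000 of nxnx3) and a single `nn.Linear` layer standing in for the non-convolutional layers mentioned above:

```python
import torch
from torch import nn

# Hypothetical stand-in for the 5000 x n x n x 3 dataset.
dataset_4d = torch.randn(50, 4, 4, 3)

# Keep the sample dimension and flatten the rest; -1 is inferred as 4 * 4 * 3.
dataset_2d = dataset_4d.view(dataset_4d.size(0), -1)

# A fully connected layer expects (batch, features) input.
linear = nn.Linear(in_features=dataset_2d.size(1), out_features=10)
output = linear(dataset_2d)
print(tuple(dataset_2d.shape), tuple(output.shape))
```

`view` requires a compatible memory layout; if the tensor has been permuted beforehand, `reshape` is the safer call because it copies when necessary.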
user3693922 - vote: 12
You can easily do this by extending the `data.Dataset` class. According to the API, all you have to do is implement two functions: `__getitem__` and `__len__`.
You can then wrap the dataset with the DataLoader as shown in the API and in @pho7's answer.
I think the `ImageFolder` class is a useful reference. See code here.
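As a minimal sketch of that recipe, here is a hypothetical `SquaresDataset` (the class name and its contents are invented for illustration) that implements the two methods and is then wrapped in a `DataLoader`:

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SquaresDataset(Dataset):
    """Hypothetical dataset: sample i is the pair (i, i squared)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # Number of samples the DataLoader can draw from.
        return self.size

    def __getitem__(self, index):
        # Return one (input, target) pair as tensors.
        x = torch.tensor([float(index)])
        y = torch.tensor([float(index) ** 2])
        return x, y


dataset = SquaresDataset(10)
loader = DataLoader(dataset, batch_size=4, shuffle=False)

batches = list(loader)
for x, y in batches:
    print(tuple(x.shape), tuple(y.shape))
```

The `DataLoader` calls `__len__` to know how many indices exist and `__getitem__` to fetch each sample, then stacks the samples into batches; the last batch is smaller when the dataset size is not a multiple of `batch_size`.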
Khubaib Raza - vote: 2
Yes, you can do it. Hope this helps future readers.

    import torch
    from torch.utils.data import TensorDataset, DataLoader

    inputs = [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6]]
    targets = [6, 7]
    batch_size = 2

    # Convert the Python lists to tensors before building the dataset.
    inputs = torch.tensor(inputs)
    targets = torch.tensor(targets, dtype=torch.int32)

    dataset = TensorDataset(inputs, targets)
    data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
