Data Preprocessing
Text Data Files
For accurate model evaluation, the data must be split into subsets such as training, validation, and test sets. The cvtk package provides convenient commands for splitting a single dataset into multiple subsets.
If the dataset is saved in a text file, use the cvtk split command.
For example, suppose you have a tab-delimited text file data.txt
with the image file paths in the first column and the label names in the second column:
data/fruits/strawberry/68e35228.jpg strawberry
data/fruits/eggplant/833bda67.jpg eggplant
data/fruits/cucumber/c1a79fff.jpg cucumber
data/fruits/eggplant/c2e2291e.jpg eggplant
data/fruits/tomato/3ee5d80e.jpg tomato
data/fruits/eggplant/3da0be49.jpg eggplant
...
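Before splitting, it can be useful to check how many samples each class has. The following is a minimal sketch using only the Python standard library; it is not part of cvtk, the variable names are illustrative, and the sample string simply reuses the six rows shown above.

```python
import csv
import io
from collections import Counter

# The six example rows from above, tab-delimited: <image path>\t<label>
sample = (
    "data/fruits/strawberry/68e35228.jpg\tstrawberry\n"
    "data/fruits/eggplant/833bda67.jpg\teggplant\n"
    "data/fruits/cucumber/c1a79fff.jpg\tcucumber\n"
    "data/fruits/eggplant/c2e2291e.jpg\teggplant\n"
    "data/fruits/tomato/3ee5d80e.jpg\ttomato\n"
    "data/fruits/eggplant/3da0be49.jpg\teggplant\n"
)

# Parse the two columns and count how often each label occurs.
reader = csv.reader(io.StringIO(sample), delimiter="\t")
counts = Counter(label for _path, label in reader)

for label, n in sorted(counts.items()):
    print(f"{label}\t{n}")
```

For a real file, replace io.StringIO(sample) with an opened file handle, for example open("data.txt", newline="").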
To split this data into training, validation, and test sets in a 6:2:2 ratio,
run the following command.
The --shuffle option shuffles the data before splitting.
cvtk split --input data.txt --output data_subset.txt --ratios 6:2:2 --shuffle
If the command runs successfully,
it generates the files data_subset.txt.0, data_subset.txt.1,
and data_subset.txt.2 in the current directory.
The number of samples in each file will roughly match the ratio specified by --ratios.
wc -l data.txt
# 400 data.txt
wc -l data_subset.txt.0 data_subset.txt.1 data_subset.txt.2
# 240 data_subset.txt.0
# 80 data_subset.txt.1
# 80 data_subset.txt.2
# 400 total
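To make the arithmetic behind the 6:2:2 split concrete, here is a minimal Python sketch of a shuffled ratio split. It is an illustration under stated assumptions, not cvtk's actual implementation: the function name split_by_ratios, the fixed seed, and the synthetic input lines are all hypothetical.

```python
import random


def split_by_ratios(items, ratios, shuffle=True, seed=0):
    """Split items into consecutive chunks whose sizes follow `ratios`."""
    items = list(items)
    if shuffle:
        # A seeded generator keeps the example reproducible.
        random.Random(seed).shuffle(items)
    total = sum(ratios)
    subsets, start, cumulative = [], 0, 0
    for r in ratios:
        cumulative += r
        # Cut at the cumulative fraction so the last chunk always ends at len(items).
        end = round(len(items) * cumulative / total)
        subsets.append(items[start:end])
        start = end
    return subsets


# Synthetic stand-in for the 400 lines of data.txt (paths and labels are made up).
lines = [f"data/img_{i:03d}.jpg\tlabel_{i % 4}" for i in range(400)]
subsets = split_by_ratios(lines, (6, 2, 2))
print([len(s) for s in subsets])  # [240, 80, 80]
```

When the number of samples is not divisible by the ratio total, the subset sizes are rounded, which is why the counts only roughly match the requested ratio.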
In general, shuffling the data makes it likely that each subset contains data from all classes.
However, if the dataset is imbalanced, the class distribution may differ between subsets, and rare classes may be missing from some of them.
In such cases, use the --stratify option
to split the data so that the class distribution is consistent across the subsets.
cvtk split --input all.txt --output data_subset.txt --ratios 6:2:2 --shuffle --stratify
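The idea behind a stratified split can be sketched in Python as follows: group the samples by label, split each group by the same ratio, then merge the pieces. This is only an illustration of the general technique, not cvtk's code; the function name stratified_split and the imbalanced toy dataset are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(records, ratios, seed=0):
    """Split (path, label) records so each subset keeps the class proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for path, label in records:
        by_label[label].append((path, label))

    total = sum(ratios)
    subsets = [[] for _ in ratios]
    for group in by_label.values():
        rng.shuffle(group)  # shuffle within each class
        start, cumulative = 0, 0
        for i, r in enumerate(ratios):
            cumulative += r
            end = round(len(group) * cumulative / total)
            subsets[i].extend(group[start:end])
            start = end
    for subset in subsets:
        rng.shuffle(subset)  # mix the classes inside each subset
    return subsets


# Hypothetical imbalanced dataset: 300 tomato images, 100 eggplant images.
records = [(f"tomato/{i}.jpg", "tomato") for i in range(300)]
records += [(f"eggplant/{i}.jpg", "eggplant") for i in range(100)]

train, valid, test = stratified_split(records, (6, 2, 2))
for name, subset in (("train", train), ("valid", valid), ("test", test)):
    print(name, dict(Counter(label for _, label in subset)))
```

Each subset keeps the 3:1 tomato-to-eggplant proportion of the full dataset (180/60, 60/20, and 60/20 samples).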
COCO Format Files
To split COCO format files into multiple subsets, use the cvtk cocosplit command.
cvtk cocosplit --input bbox.json --output data_subset.json --ratios 6:2:2 --shuffle
The command generates the files data_subset.json.0, data_subset.json.1,
and data_subset.json.2 in the current directory.
The number of samples in each file will roughly match the ratio specified by --ratios.
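Splitting a COCO file differs from splitting a text file because annotations must follow the images they belong to. The sketch below shows one plausible way to do this in Python, assuming the standard COCO keys images, annotations, and categories. It is not cvtk's implementation; split_coco and the tiny in-memory dataset are hypothetical.

```python
import random


def split_coco(coco, ratios, shuffle=True, seed=0):
    """Split a COCO dict by image, keeping each image's annotations with it."""
    images = list(coco["images"])
    if shuffle:
        random.Random(seed).shuffle(images)
    total = sum(ratios)
    subsets, start, cumulative = [], 0, 0
    for r in ratios:
        cumulative += r
        end = round(len(images) * cumulative / total)
        chunk = images[start:end]
        image_ids = {img["id"] for img in chunk}
        subsets.append({
            "images": chunk,
            # Keep only annotations that refer to images in this chunk.
            "annotations": [a for a in coco["annotations"]
                            if a["image_id"] in image_ids],
            # Every subset keeps the full category list.
            "categories": coco["categories"],
        })
        start = end
    return subsets


# Toy dataset: 10 images with 2 bounding boxes each (all values are made up).
coco = {
    "images": [{"id": i, "file_name": f"{i}.jpg"} for i in range(10)],
    "annotations": [
        {"id": i, "image_id": i // 2, "category_id": 1, "bbox": [0, 0, 10, 10]}
        for i in range(20)
    ],
    "categories": [{"id": 1, "name": "tomato"}],
}
parts = split_coco(coco, (6, 2, 2))
print([(len(p["images"]), len(p["annotations"])) for p in parts])
# [(6, 12), (2, 4), (2, 4)]
```

Because the split is made over images, the number of annotations per subset depends on how many objects each image contains.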
The cvtk package also provides the cvtk cococombine command, which combines multiple COCO format files into a single file.
Specify the paths of the files to be combined as a comma-separated list, with no spaces between the entries.
cvtk cococombine --input data_subset.json.0,data_subset.json.1,data_subset.json.2 \
--output data_combined.json
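Combining COCO files generally requires care with IDs, since separate files may reuse the same image, annotation, or category IDs. The following Python sketch shows one possible merging strategy: reassign image and annotation IDs, and match categories by name. Whether cvtk cococombine handles IDs this way is not stated here, so treat this purely as an illustration; combine_coco and the two toy datasets are hypothetical.

```python
def combine_coco(datasets):
    """Merge COCO dicts, renumbering ids and matching categories by name."""
    merged = {"images": [], "annotations": [], "categories": []}
    cat_id_by_name = {}
    for coco in datasets:
        # Map this file's category ids onto the merged category ids.
        cat_map = {}
        for cat in coco["categories"]:
            if cat["name"] not in cat_id_by_name:
                new_id = len(cat_id_by_name) + 1
                cat_id_by_name[cat["name"]] = new_id
                merged["categories"].append({**cat, "id": new_id})
            cat_map[cat["id"]] = cat_id_by_name[cat["name"]]
        # Give every image a fresh id and remember the mapping.
        img_map = {}
        for img in coco["images"]:
            new_id = len(merged["images"]) + 1
            img_map[img["id"]] = new_id
            merged["images"].append({**img, "id": new_id})
        # Rewrite annotation ids and their references.
        for ann in coco["annotations"]:
            merged["annotations"].append({
                **ann,
                "id": len(merged["annotations"]) + 1,
                "image_id": img_map[ann["image_id"]],
                "category_id": cat_map[ann["category_id"]],
            })
    return merged


# Two toy files that both use id 1 for different images and categories.
a = {
    "images": [{"id": 1, "file_name": "a.jpg"}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1, "bbox": [0, 0, 5, 5]}],
    "categories": [{"id": 1, "name": "tomato"}],
}
b = {
    "images": [{"id": 1, "file_name": "b.jpg"}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1, "bbox": [1, 1, 5, 5]}],
    "categories": [{"id": 1, "name": "eggplant"}],
}
merged = combine_coco([a, b])
print([c["name"] for c in merged["categories"]])  # ['tomato', 'eggplant']
print(merged["annotations"][1]["image_id"], merged["annotations"][1]["category_id"])  # 2 2
```

To apply the same idea to files on disk, load each file with json.load and write the result with json.dump.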