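The window is open to the moon and the frosty sky is still, so I thought I would post a short piece to pass the time.

This post uses CSV parsing as an example to talk once more about using C++20. The approaches found online are rather dated, so let's see how the newer facilities allow an elegant implementation.

Before starting, a definition: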

Comma-separated values (CSV) is a text file format that uses commas to separate values. A CSV file stores tabular data (numbers and text) in plain text, where each line of the file typically represents one data record. Each record consists of the same number of fields, and these are separated by commas in the CSV file. If the field delimiter itself may appear within a field, fields can be surrounded with quotation marks.

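A CSV file is a text format that separates values with commas. Each line is one data record, and every record has the same number of columns. Many machine learning datasets come in this format, so parsing it is a common task.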
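The quoting rule in the definition is easy to overlook, because plain splitting on commas breaks any field that contains a comma. As a rough illustration only, here is a minimal sketch of a quote-aware splitter. The helper name `split_quoted` is hypothetical, and the sketch assumes double quotes as the quotation mark, `""` as an escaped quote, and no line breaks inside fields.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper: splits one CSV record into fields, treating
// double-quoted fields as a unit and "" inside quotes as a literal quote.
std::vector<std::string> split_quoted(std::string_view line, char delimiter = ',')
{
    std::vector<std::string> fields;
    std::string current;
    bool in_quotes = false;

    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (in_quotes) {
            if (c == '"' && i + 1 < line.size() && line[i + 1] == '"') {
                current += '"';      // "" inside quotes is a literal quote
                ++i;
            } else if (c == '"') {
                in_quotes = false;   // closing quote
            } else {
                current += c;        // delimiters are ordinary characters here
            }
        } else if (c == '"') {
            in_quotes = true;        // opening quote
        } else if (c == delimiter) {
            fields.push_back(std::move(current));
            current.clear();         // reset the moved-from string
        } else {
            current += c;
        }
    }
    fields.push_back(std::move(current)); // last field, possibly empty
    return fields;
}
```

The sample rows shown in this post contain no quoted fields, so plain delimiter splitting is enough for them.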

This post uses a real dataset for the demonstration: chip-dataset. Part of its content is shown in the table below.

| | Product | Type | Release Date | Process Size (nm) | TDP (W) | Die Size (mm^2) | Transistors (million) | Freq (MHz) | Foundry | Vendor | FP16 GFLOPS | FP32 GFLOPS | FP64 GFLOPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AMD Athlon 64 3500+ | CPU | 2007-02-20 | 65.0 | 45.0 | 77.0 | 122.0 | 2200.0 | Unknown | AMD | | | |
| 1 | AMD Athlon 200GE | CPU | 2018-09-06 | 14.0 | 35.0 | 192.0 | 4800.0 | 3200.0 | Unknown | AMD | | | |
| 2 | Intel Core i5-1145G7 | CPU | 2020-09-02 | 10.0 | 28.0 | | | 2600.0 | Intel | Intel | | | |
| 3 | Intel Xeon E5-2603 v2 | CPU | 2013-09-01 | 22.0 | 80.0 | 160.0 | 1400.0 | 1800.0 | Intel | Intel | | | |
| 4 | AMD Phenom II X4 980 BE | CPU | 2011-05-03 | 45.0 | 125.0 | 258.0 | 758.0 | 3700.0 | Unknown | AMD | | | |
| 5 | Intel Xeon E5-2470 v2 | CPU | 2013-09-01 | 22.0 | 95.0 | 160.0 | 1400.0 | 2400.0 | Intel | Intel | | | |
| 6 | AMD Phenom X4 9750 (125W) | CPU | 2008-03-27 | 65.0 | 125.0 | 285.0 | 450.0 | 2400.0 | Unknown | AMD | | | |
| 7 | Intel Pentium D 930 | CPU | 2006-01-16 | 65.0 | 130.0 | 140.0 | 376.0 | 3000.0 | Intel | Intel | | | |
| 8 | Intel Core i3-1125G4 | CPU | 2020-09-02 | 10.0 | 28.0 | | | 2000.0 | Intel | Intel | | | |
| 9 | AMD Athlon 64 X2 4200+ | CPU | 2006-05-23 | 90.0 | 89.0 | 156.0 | 154.0 | 2200.0 | Unknown | AMD | | | |

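The chip dataset contains 2185 CPU records and 2668 GPU records.

With the data settled, let's write the code and analyze it as we go.

First, decide on the input and output and write the function prototype.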

using dataset_sequence_type = std::vector<std::vector<std::string>>;

auto read_csv(std::string_view file, std::string_view type = "", std::string_view delimiter = ",")
    -> std::optional<dataset_sequence_type>
{
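    // A std::string_view is not guaranteed to be null-terminated, so copy it first.
    std::ifstream data_file{std::string(file)};
    if (!data_file.is_open())
        return {};

    // do parsing

    return dataset_sequence_type{}; // placeholder until the parsing is filled in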
}

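The three parameters are the dataset file path, the type to filter by (CPU or GPU), and the delimiter. The last two are optional.

The return type is std::optional, which makes it easy to check whether the result is valid. The contained value is a dynamic two-dimensional array built from std::vector: each record is a row and each field is a column.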
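To make the role of std::optional concrete, here is a small sketch of how a caller can tell "loading failed" apart from "loaded, but empty". The functions `load_rows` and `record_count` are hypothetical stand-ins written for this illustration; they are not part of the parser in this post.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

using dataset_sequence_type = std::vector<std::vector<std::string>>;

// Hypothetical stand-in for a loader: returns no value on failure,
// otherwise a table with a single sample record.
std::optional<dataset_sequence_type> load_rows(bool source_available)
{
    if (!source_available)
        return std::nullopt;   // failure: the optional holds nothing

    dataset_sequence_type rows;
    rows.push_back({"0", "AMD Athlon 64 3500+", "CPU"});
    return rows;               // success: the optional holds the table
}

// Number of records, or 0 when loading failed.
std::size_t record_count(const std::optional<dataset_sequence_type>& rows)
{
    return rows ? rows->size() : 0;
}
```

An empty optional converts to false in a condition, so a caller can simply write `if (auto rows = load_rows(true)) { ... }`.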

Next, read the file line by line and split each line on the delimiter.

using dataset_sequence_type = std::vector<std::vector<std::string>>;

auto read_csv(std::string_view file, std::string_view type = "", std::string_view delimiter = ",")
    -> std::optional<dataset_sequence_type>
{
    std::ifstream data_file{std::string(file)}; // string_view may not be null-terminated
    if (!data_file.is_open())
        return {};

    // do parsing
    std::string line;
    dataset_sequence_type result;
    std::getline(data_file, line); // skip the title
    while (std::getline(data_file, line)) {
        auto tokens = line
                    | std::views::split(delimiter)
                    | std::views::transform([](auto&& token) {
                        // Iterator-pair constructor (C++20); never dereferences an end iterator.
                        return std::string_view(token.begin(), token.end());
                    });

        // other work
    }
}

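The header row only describes the data, so it is discarded. Parsing is where views shine: std::views::split and std::views::transform handle it easily. Because the value type of split_view is a ranges::subrange, transform is used here to convert each one into a string_view.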

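That is about half the work. The remaining difficulty lies in filtering and storing. If no filtering is needed, the type parameter can be dropped and the problem disappears.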

// ...

auto read_csv(std::string_view file, std::string_view type = "", std::string_view delimiter = ",")
    -> std::optional<dataset_sequence_type>
{
    // ...
    while (std::getline(data_file, line)) {
        auto tokens = line
                    | std::views::split(delimiter)
                    | std::views::transform([](auto&& token) {
                        // Iterator-pair constructor (C++20); never dereferences an end iterator.
                        return std::string_view(token.begin(), token.end());
                    });

        // other work
        result.push_back(dataset_sequence_type::value_type(tokens.begin(), tokens.end()));
    }

    return result;
}

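With filtering, converting every view into a std::vector first means that some vectors are discarded right after being built, which is wasteful. So filter first and store afterwards. The type is the field at index 2 of each record (after the row index and the product name). However, the transform_view here sits on top of a split_view, which is only a forward range, so it does not support random access: you cannot index a column directly as you would with a vector.

The simplest fix is std::ranges::advance, which moves an iterator forward by a given number of steps.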

// ...

auto read_csv(std::string_view file, std::string_view type = "", std::string_view delimiter = ",")
    -> std::optional<dataset_sequence_type>
{
    // ...
    while (std::getline(data_file, line)) {
        auto tokens = line
                    | std::views::split(delimiter)
                    | std::views::transform([](auto&& token) {
                        // Iterator-pair constructor (C++20); never dereferences an end iterator.
                        return std::string_view(token.begin(), token.end());
                    });

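        // filter on the Type field (index 2)
        auto it = std::ranges::begin(tokens);
        auto last = std::ranges::end(tokens);
        std::ranges::advance(it, 2, last); // stops at last if the row is too short
        if (type.empty() || (it != last && *it == type)) {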
            // save all records or filtered records.
            result.push_back(dataset_sequence_type::value_type(tokens.begin(), tokens.end()));
        }
    }

    return result;
}

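Finally, you might want to mark read_csv() as constexpr. Unfortunately std::ifstream cannot currently be used at compile time, so that is not possible. Is there another way? I will leave that open for now and look into it another day.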
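This does not answer the open question about reading a file, but the splitting step itself can run at compile time once the text is already in memory, for example as a string literal. The sketch below is only one possible direction under that assumption. The helper `nth_field` is hypothetical and relies solely on std::string_view members that are constexpr.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns the n-th (0-based) field of one record,
// or an empty view when the record has fewer fields.
constexpr std::string_view nth_field(std::string_view line, std::size_t n,
                                     char delimiter = ',')
{
    while (true) {
        const std::size_t pos = line.find(delimiter);
        if (n == 0)
            return line.substr(0, pos);   // npos means "the rest of the line"
        if (pos == std::string_view::npos)
            return {};                    // ran out of fields
        line.remove_prefix(pos + 1);      // drop the field and its delimiter
        --n;
    }
}

// Both checks are evaluated entirely by the compiler.
static_assert(nth_field("0,AMD Athlon 64 3500+,CPU,2007-02-20", 2) == "CPU");
static_assert(nth_field("0,AMD Athlon 64 3500+,CPU,2007-02-20", 9).empty());
```

Note that an empty field and a missing field both come back as an empty view here; a caller that needs to tell them apart would need a different return type.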

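The implementation is fairly general (dropping the filter, or abstracting it as a lambda, would make it more so). The complete code and a usage example: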

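#include <algorithm>
#include <fstream>
#include <iterator>
#include <optional>
#include <ranges>
#include <string>
#include <string_view>
#include <vector>

#include <fmt/core.h>
#include <fmt/ranges.h> // needed to format std::vector with fmt::print

using dataset_sequence_type = std::vector<std::vector<std::string>>;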

auto read_csv(std::string_view file, std::string_view type = "", std::string_view delimiter = ",")
    -> std::optional<dataset_sequence_type>
{
    std::ifstream data_file{std::string(file)}; // string_view may not be null-terminated
    if (!data_file.is_open()) {
        return {};
    }

    std::string line;
    std::getline(data_file, line); // skip the title
    dataset_sequence_type result;
    while (std::getline(data_file, line)) {
        auto tokens = line
                    | std::views::split(delimiter)
                    | std::views::transform([](auto&& token) {
                        // Iterator-pair constructor (C++20); never dereferences an end iterator.
                        return std::string_view(token.begin(), token.end());
                    });

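        // filter on the Type field (index 2)
        auto it = std::ranges::begin(tokens);
        auto last = std::ranges::end(tokens);
        std::ranges::advance(it, 2, last); // stops at last if the row is too short
        if (type.empty() || (it != last && *it == type)) {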
            // save all records or filtered records.
            result.push_back(dataset_sequence_type::value_type(tokens.begin(), tokens.end()));
        }
    }

    return result;
}

int main() {
    // load the dataset
    auto chip = read_csv("./datasets/chip_dataset.csv", "CPU");
    if (chip) {
        std::ranges::for_each(chip.value(), [](const dataset_sequence_type::value_type& cpu) {
            fmt::print("{}\n", cpu);
        });
    }
}

Of the many approaches, this one is neat: a handful of lines is enough to meet the requirement.
