Pinot nested json ingestion

Question

Welcome To Ask or Share your Answers For Others

Pinot nested json ingestion

asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)

Pinot nested json ingestion

I have this json schema

{
  "name":"Pete"
  "age":24,
  "subjects":[
    {
      "name":"maths"
      "grade":"A"
    },
    {
      "name":"maths"
      "grade":"B"
    }
  ]
}

and I want to ingest this into a pinot table to run a query like

select age,subjects_grade,count(*) from table group by age,subjects_grade

Is there a way to do this in a pinot job?

question from:https://stackoverflow.com/questions/65886253/pinot-nested-json-ingestion

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-06T19:19:36+0000

Pinot has two ways to handle JSON records:

1. Flatten the record during ingestion time: In this case, we treat each nested field as a separated field, so need to:

Define those fields in the table schema
Define transform functions to flatten nested fields in table config

Please see how column subjects_name and subjects_grade is defined below. Since it's an array, so both fields are multi-value columns in Pinot.

2. Directly ingest JSON records

In this case, we treat each nested field as one single field, so need to:

Define the JSON field in table schema as a string with maxLength value
Put this field into noDictionaryColumns and jsonIndexColumns in table config
Define transform functions jsonFormat to stringify the JSON field in table config

Please see how column subjects_str is defined below.

Below is the sample table schema/config/query:

Sample Pinot Schema:

{
  "metricFieldSpecs": [],
  "dimensionFieldSpecs": [
    {
      "dataType": "STRING",
      "name": "name"
    },
    {
      "dataType": "LONG",
      "name": "age"
    },
    {
      "dataType": "STRING",
      "name": "subjects_str"
    },
    {
      "dataType": "STRING",
      "name": "subjects_name",
      "singleValueField": false
    },
    {
      "dataType": "STRING",
      "name": "subjects_grade",
      "singleValueField": false
    }
  ],
  "dateTimeFieldSpecs": [],
  "schemaName": "myTable"
}

Sample Table Config:

{
    "tableName": "myTable",
    "tableType": "OFFLINE",
    "segmentsConfig": {
        "segmentPushType": "APPEND",
        "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
        "schemaName": "myTable",
        "replication": "1"
    },
    "tenants": {},
    "tableIndexConfig": {
        "loadMode": "MMAP",
        "invertedIndexColumns": [],
        "noDictionaryColumns": [
            "subjects_str"
        ],
        "jsonIndexColumns": [
            "subjects_str"
        ]
    },
    "metadata": {
        "customConfigs": {}
    },
    "ingestionConfig": {
        "batchIngestionConfig": {
            "segmentIngestionType": "APPEND",
            "segmentIngestionFrequency": "DAILY",
            "batchConfigMaps": [],
            "segmentNameSpec": {},
            "pushSpec": {}
        },
        "transformConfigs": [
            {
                "columnName": "subjects_str",
                "transformFunction": "jsonFormat(subjects)"
            },
            {
                "columnName": "subjects_name",
                "transformFunction": "jsonPathArray(subjects, '$.[*].name')"
            },
            {
                "columnName": "subjects_grade",
                "transformFunction": "jsonPathArray(subjects, '$.[*].grade')"
            }
        ]
    }
}

Sample Query:

select age, subjects_grade, count(*) from myTable GROUP BY  age, subjects_grade

select age, json_extract_scalar(subjects_str, '$.[*].grade', 'STRING') as subjects_grade, count(*) from myTable GROUP BY  age, subjects_grade

Comparing both ways, we recommend solution 1 to flatten the nested fields out when the field density is high(e.g. every document has field name and grade, then it's worth extracting them out to be new columns), it gives better query performance and better storage efficiency.

For solution 2, it's simpler in configuration, and good for sparse fields(e.g. only a few documents have certain fields). It requires to use json_extract_scalar function to access the nested field.

Please also note the behavior of Pinot GROUP BY on multi-value columns.

More references:

Pinot Column Transformation

Pinot JSON Functions

Pinot JSON Index

Pinot Multi-value Functions