Pinot has two ways to handle JSON records:
1. Flatten the record during ingestion time:
In this case, we treat each nested field as a separated field, so need to:
- Define those fields in the table schema
- Define transform functions to flatten nested fields in table config
Please see how column subjects_name
and subjects_grade
is defined below. Since it's an array, so both fields are multi-value columns in Pinot.
2. Directly ingest JSON records
In this case, we treat each nested field as one single field, so need to:
- Define the JSON field in table schema as a string with maxLength value
- Put this field into noDictionaryColumns and jsonIndexColumns in table config
- Define transform functions
jsonFormat
to stringify the JSON field in table config
Please see how column subjects_str
is defined below.
Below is the sample table schema/config/query:
Sample Pinot Schema:
{
"metricFieldSpecs": [],
"dimensionFieldSpecs": [
{
"dataType": "STRING",
"name": "name"
},
{
"dataType": "LONG",
"name": "age"
},
{
"dataType": "STRING",
"name": "subjects_str"
},
{
"dataType": "STRING",
"name": "subjects_name",
"singleValueField": false
},
{
"dataType": "STRING",
"name": "subjects_grade",
"singleValueField": false
}
],
"dateTimeFieldSpecs": [],
"schemaName": "myTable"
}
Sample Table Config:
{
"tableName": "myTable",
"tableType": "OFFLINE",
"segmentsConfig": {
"segmentPushType": "APPEND",
"segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
"schemaName": "myTable",
"replication": "1"
},
"tenants": {},
"tableIndexConfig": {
"loadMode": "MMAP",
"invertedIndexColumns": [],
"noDictionaryColumns": [
"subjects_str"
],
"jsonIndexColumns": [
"subjects_str"
]
},
"metadata": {
"customConfigs": {}
},
"ingestionConfig": {
"batchIngestionConfig": {
"segmentIngestionType": "APPEND",
"segmentIngestionFrequency": "DAILY",
"batchConfigMaps": [],
"segmentNameSpec": {},
"pushSpec": {}
},
"transformConfigs": [
{
"columnName": "subjects_str",
"transformFunction": "jsonFormat(subjects)"
},
{
"columnName": "subjects_name",
"transformFunction": "jsonPathArray(subjects, '$.[*].name')"
},
{
"columnName": "subjects_grade",
"transformFunction": "jsonPathArray(subjects, '$.[*].grade')"
}
]
}
}
Sample Query:
select age, subjects_grade, count(*) from myTable GROUP BY age, subjects_grade
select age, json_extract_scalar(subjects_str, '$.[*].grade', 'STRING') as subjects_grade, count(*) from myTable GROUP BY age, subjects_grade
Comparing both ways, we recommend solution 1 to flatten the nested fields out when the field density is high(e.g. every document has field name and grade, then it's worth extracting them out to be new columns), it gives better query performance and better storage efficiency.
For solution 2, it's simpler in configuration, and good for sparse fields(e.g. only a few documents have certain fields). It requires to use json_extract_scalar function to access the nested field.
Please also note the behavior of Pinot GROUP BY on multi-value columns.
More references:
Pinot Column Transformation
Pinot JSON Functions
Pinot JSON Index
Pinot Multi-value Functions