Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
687 views
in Technique[技术] by (71.8m points)

regex - Hive: Usage of Concat in Column Name

I am trying to get data from a table that has column name as: year_2016, year_2017, year_2018 etc. I am not sure how to get the data from this table. The data looks like:

| count_of_accidents | year_2016 | year_2017 |year_2018 | 
|--------------------|-----------|-----------|----------|
| 15                 | 12        | 5         | 1        |
| 5                  | 10        | 6         | 18       |

I have tried 'concat' function but this doesn't really work.

I have tried with this:

select SUM( count_of_accidents * concat('year_',year(regexp_replace('2018_1_1','_','-')))) 
from table_name;

The column name (year_2017 or year_2018 etc) will be passed as a parameter. So, I am not really able to hardcode the column name like this-

select SUM( count_of_accidents * year_2018) from table_name;

Is there any way I can do this?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

You can do it using regular expressions. Like this:

--create test table
create table test_col(year_2018 string, year_2019 string);

set hive.support.quoted.identifiers=none;
set hive.cli.print.header=true;

--test select using hard-coded pattern
select year_2018, `(year_)2019` from test_col;
OK
year_2018       year_2019
Time taken: 0.862 seconds

--test pattern parameter
set hivevar:year_param=2019;

select year_2018, `(year_)${year_param}` from test_col;
OK
year_2018       year_2019
Time taken: 0.945 seconds

--two parameters
set hivevar:year_param1=2018;
set hivevar:year_param2=2019;

select `(year_)${year_param1}`, `(year_)${year_param2}`  from test_col t;
OK
year_2018       year_2019
Time taken: 0.159 seconds

--parameter contains full column_name and using more strict regexp pattern
set hivevar:year_param2=year_2019;

select `^${year_param2}$` from test_col t;
OK
year_2019
Time taken: 0.053 seconds

--select all columns using single pattern year_ and four digits
select `^year_[0-9]{4}$`  from test_col t;
OK
year_2018       year_2019

Parameter should be calculated and passed to the hive script, no functions like concat(), regexp_replace are supported in the column names.

Also column aliasing does not work for columns extracted using regular expressions:

 select t.number_of_incidents, `^${year_param}$` as year1 from test_t t;

throws exception:

FAILED: SemanticException [Error 10004]: Line 1:30 Invalid table alias or column reference '^year_2018$': (possible column names are: number_of_incidents, year_2016, year_2017, year_2018)

I found a workaround to alias a column using union all with empty dataset, see this test:

create table test_t(number_of_incidents int, year_2016 int, year_2017 int, year_2018 int);
insert into table test_t values(15, 12, 5, 1); --insert test data
insert into table test_t values(5,10,6,18);

--parameter, can be passed from outside the script from command line  
set hivevar:year_param=year_2018;

--enable regex columns and print column names
set hive.support.quoted.identifiers=none;
set hive.cli.print.header=true;

--Alias column using UNION ALL with empty dataset
select sum(number_of_incidents*year1) incidents_year1 
  from
    (--UNION ALL with empty dataset to alias columns extracted
     select 0 number_of_incidents, 0 year1 where false --returns no rows because of false condition
     union all
     select t.number_of_incidents, `^${year_param}$` from test_t t
    )s;

Result:

OK
incidents_year1
105
Time taken: 38.003 seconds, Fetched: 1 row(s)

First query in the UNION ALL does not affect data because it returns no rows. But it's column names become the names of the whole UNION ALL dataset and can be used in the upper query. This trick works. If you will find a better workaround to alias columns extracted using regexp, please add your solution as well.

Update:

No need in regular expressions if you can pass full column_name as a parameter. Hive substitutes variables as is (does not calculate them) before query execution. Use regexp only if you can not pass full column name for some reason and like in the original query some pattern concatenation is needed. See this test:

--parameter, can be passed from outside the script from command line  
set hivevar:year_param=year_2018;

select sum(number_of_incidents*${year_param}) incidents_year1 from test_t t;

Result:

OK
incidents_year1
105
Time taken: 63.339 seconds, Fetched: 1 row(s)

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...