There is some gap in your understanding.
First of all, the degree of parallelism is controlled by -m <n>
or --num-mappers <n>
. By default value of --num-mappers
is 4.
Second, --split-by <column-name>
, will split your task on the basis of column-name.
Third, $CONDITIONS
, it is used internally by sqoop to achieve this splitting task.
Example, You fired a query:
sqoop import --connect jdbc:mysql://myserver:1202/ --username u1 --password p1 --query 'select * from emp where $CONDITIONS' --split-by empId --target-dir /temp/emp -m 4
Say, my empId is uniformly distributed from 1- 100.
Now, sqoop will take --split-by
column and find its max and min value using query:
SELECT MIN(empId), MAX(empId) FROM (Select * From emp WHERE (1 = 1) ) t1
See it replaced $CONDITIONS
with (1 = 1)
.
In our case, min, max values are 1 and 100.
As number of mappers are 4, sqoop will divide my query in 4 parts.
Creating input split with lower bound 'empId >= 1' and upper bound 'empId < 25'
Creating input split with lower bound 'empId >= 25' and upper bound 'empId < 50'
Creating input split with lower bound 'empId >= 50' and upper bound 'empId < 75'
Creating input split with lower bound 'empId >= 75' and upper bound 'empId <= 100'
Now $CONDITIONS
will again come into the picture. It is replaced by above range queries.
First mapper will fire query like this:
Select * From emp WHERE empId >= 25' AND 'empId < 50
and so on for other 3 mappers.
Results from all the mappers is aggregated and written to a final HDFS directory.
Regarding your query :
select deptid, avg(salary) from emp group by deptid
you will specify
--query 'select deptid, avg(salary) from emp group by deptid where $CONDITIONS'
It will be first converted to
select deptid, avg(salary) from emp group by deptid where (1 = 0)
to fetch column metadata.
I believe this query won't run in RDBMS. Try above query(having Where (1 = 0)
) directly in Mysql.
So you will not be able to use this query to fetch data using Sqoop.
Sqoop is used for simpler SQL queries.