Data science Software Course Training in Ameerpet Hyderabad

Data science Software Course Training in Ameerpet Hyderabad

Wednesday 3 May 2017

Pig : How to perform grouping by Multiple Columns


 how to perform grouping by multiple columns.
-------------------------------------------
 task: mutiple grouping with mulitiple aggregations .

 sql:
  select dno, sex , sum(sal) ,
         count(*), avg(sal), max(sal),
           min(sal) from emp
         group by dno, sex;

grunt> emp = load 'piglab/emp' using PigStorage(',')
>>   as (id:int, name:chararray,  sal:int,
>>     sex:chararray, dno:int);
grunt>

grunt> e = foreach emp generate dno, sex, sal;
grunt> grp = group e by dno, sex;
2017-05-03 18:54:14,501 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 7, column 25>  Syntax error, unexpected symbol at or near ';'
Details at logfile: /home/cloudera/pig_1493862740272.log
grunt>

above grouping is invalid.
 pig does not allow grouping by multiple fields.


solution:
  make mulitple fields as a tuple.
  and group it by the tuple field.

grunt> grp = group e by (dno,sex);
grunt> describe grp;
grp: {group: (dno: int,sex: chararray),e: {(dno: int,sex: chararray,sal: int)}}
grunt> dump grp
((11,m),{(11,m,50000),(11,m,40000)})
((12,f),{(12,f,50000)})
((12,m),{(12,m,40000),(12,m,10000),(12,m,50000)})
((13,f),{(13,f,80000),(13,f,90000)})

grunt> res = foreach grp generate
>>         group.dno as dno,
>>        group.sex as gender,
>>        SUM(e.sal) as tot,
>>        COUNT(e) as cnt,
>>       AVG(e.sal) as avg,
>>       MAX(e.sal) as max, MIN(e.sal) as min;
grunt>dump res
(11,m,90000,2,45000.0,50000,40000)
(12,f,50000,1,50000.0,50000,50000)
(12,m,100000,3,33333.333333333336,50000,10000)
(13,f,170000,2,85000.0,90000,80000)
-----------------------------------------












No comments:

Post a Comment