SreeRam Hadoop Notes: Pig : Entire Column Aggregations

Entire column aggregations.

SQL equivalent: select sum(sal) from emp;

grunt> describe emp
emp: {id: int,name: chararray,sal: int,sex: chararray,dno: int}
grunt> esal = foreach emp generate sal;
grunt> rsum = foreach esal generate SUM(sal) as tot;

-- The above is invalid,
because Pig aggregate functions can be applied only to bags (such as the inner bags produced by grouping). Here sal is a plain int field, not a bag.

solution:
way1:

grunt> e = foreach emp generate
'ventech' as org, sal;
grunt> grp1 = group e by org;

grunt> illustrate grp1
---------------------------------------------------------------------------------------------
| emp | id:int | name:chararray | sal:int | sex:chararray | dno:int |
---------------------------------------------------------------------------------------------
| | 107 | sdkfj | 80000 | f | 13 |
| | 103 | cccc | 50000 | m | 12 |
---------------------------------------------------------------------------------------------
-------------------------------------------
| e | org:chararray | sal:int |
-------------------------------------------
| | ventech | 80000 |
| | ventech | 50000 |
-------------------------------------------
-----------------------------------------------------------------------------------------
| grp1 | group:chararray | e:bag{:tuple(org:chararray,sal:int)} |
-----------------------------------------------------------------------------------------
| | ventech | {(ventech, 80000), (ventech, 50000)} |
-----------------------------------------------------------------------------------------

grunt>

difference between describe and illustrate:

describe gives only the schema of the given relation.

illustrate gives
the entire hierarchy of the data flow, along with the schema and sample data at each step.

-- illustrate is used for debugging.

grunt> dump grp1

(ventech,{(ventech,50000),(ventech,80000),(ventech,40000),(ventech,10000),(ventech,90000),(ventech,50000),(ventech,50000),(ventech,40000)})

grunt> esum = foreach grp1 generate
SUM(e.sal) as tot;

grunt> eall = foreach grp1 generate
SUM(e.sal) as tot,
MAX(e.sal) as max;

grunt> dump eall
(410000,90000)

note: Pig needs an inner bag to perform aggregations. When we group the data,
inner bags are produced.
Our task does not require any grouping field, but the grouping step is mandatory for Pig.

solution:
provide a constant as the key (grouping column),
and group by that key.

e = foreach emp generate 'ventech' as org, sal;

here, for all the rows (tuples),
the org value is constant.

when you group by the 'org' field,
all tuples are collected into one inner bag.
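
The way1 steps above can be put together as one standalone script. This is only a sketch: the file name 'emp.txt' and the comma delimiter are assumptions for illustration (the notes do not show how emp was loaded), so adjust them to your data.

```pig
-- Sketch of way1 as a script.
-- ASSUMPTION: 'emp.txt' and the ',' delimiter are hypothetical.
emp = LOAD 'emp.txt' USING PigStorage(',')
      AS (id:int, name:chararray, sal:int, sex:chararray, dno:int);

-- Attach the same constant key to every tuple.
e = FOREACH emp GENERATE 'ventech' AS org, sal;

-- Grouping by the constant yields exactly one group,
-- whose inner bag holds all tuples of e.
grp1 = GROUP e BY org;

-- SUM and MAX now receive a bag (e.sal), which is what they require.
eall = FOREACH grp1 GENERATE SUM(e.sal) AS tot, MAX(e.sal) AS max;

DUMP eall;
```

The constant value itself ('ventech') does not matter; any constant works, because its only job is to force all tuples into a single group.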

way2:

grunt> e = foreach emp generate sal;
grunt> grp2 = group e all;
grunt> describe grp2;
grp2: {group: chararray,e: {(sal: int)}}
grunt> dump grp2

(all,{(50000),(80000),(40000),(10000),(90000),(50000),(50000),(40000)})

grunt> rall = foreach grp2 generate
SUM(e.sal) as tot,
MAX(e.sal) as max,
COUNT(e) as cnt;
grunt> illustrate rall
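
As a variation on way2, the projection step can be skipped by grouping emp itself with ALL. The sketch below assumes the same hypothetical 'emp.txt' file and comma delimiter as an input source; the relation names grp_all and stats are also made up for illustration.

```pig
-- Sketch of way2 without the intermediate projection.
-- ASSUMPTION: 'emp.txt' and the ',' delimiter are hypothetical.
emp = LOAD 'emp.txt' USING PigStorage(',')
      AS (id:int, name:chararray, sal:int, sex:chararray, dno:int);

-- GROUP ... ALL puts every tuple into one group whose key is 'all'.
grp_all = GROUP emp ALL;

-- emp.sal projects the sal field out of the inner bag named emp.
stats = FOREACH grp_all GENERATE
            SUM(emp.sal) AS tot,
            MAX(emp.sal) AS max,
            MIN(emp.sal) AS min,
            AVG(emp.sal) AS avg,
            COUNT(emp)   AS cnt;

DUMP stats;
```

Note that COUNT skips tuples whose first field is null; COUNT_STAR counts every tuple, if that is what you need.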

--------------------------------------------

Wednesday, 3 May 2017
