cogroup:-
_________
to get separate inner bags (data groups) for each dataset,
so that we can perform separate aggregations on each dataset.
ds1
_______
(a,10)
(b,20)
(a,30)
(b,40)
_________
ds2
_________
(a,30)
(c,30)
(c,40)
(a,20)
_________
cg = cogroup ds1 by $0, ds2 by $0;
--> group, ds1, ds2
(a,{(a,10),(a,30)},{(a,30),(a,20)})
(b,{(b,20),(b,40)},{})
(c,{},{(c,30),(c,40)})
res = foreach cg generate
group as mykey,
COUNT(ds1) as cnt1,
COUNT(ds2) as cnt2;
(a,2,2)
(b,2,0)
(c,0,2)
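note:- COUNT ignores tuples whose first field is null, while
COUNT_STAR counts every tuple. on cogrouped bags like these, both
give the same result, since the first field of each inner tuple is
the (non-null) grouping key. the same query with COUNT_STAR, as a
sketch:

res2 = foreach cg generate
    group as mykey,
    COUNT_STAR(ds1) as cnt1,
    COUNT_STAR(ds2) as cnt2;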
_________________________________
[training@localhost ~]$ cat > sales1
p1,2000
p2,3000
p1,4000
p1,5000
p2,4000
p3,5000
[training@localhost ~]$ cat > sales2
p1,6000
p2,8000
p1,1000
p1,5000
p1,6000
p2,6000
p2,8000
[training@localhost ~]$ hadoop fs -copyFromLocal sales1 pdemo
[training@localhost ~]$ hadoop fs -copyFromLocal sales2 pdemo
grunt> s1 = load 'pdemo/sales1'
>> using PigStorage(',')
>> as (pid:chararray, price:int);
grunt> s2 = load 'pdemo/sales2'
>> using PigStorage(',')
>> as (pid:chararray, price:int);
grunt> cg = cogroup s1 by pid, s2 by pid;
grunt> describe cg
cg: {group: chararray,s1: {pid: chararray,price: int},s2: {pid: chararray,price: int}}
grunt> dump cg
(p1,{(p1,2000),(p1,4000),(p1,5000)},{(p1,6000),(p1,1000),(p1,5000),(p1,6000)})
(p2,{(p2,3000),(p2,4000)},{(p2,8000),(p2,6000),(p2,8000)})
(p3,{(p3,5000)},{})
grunt> res = foreach cg generate
>> group as pid, SUM(s1.price) as tot1,
>> SUM(s2.price) as tot2;
grunt> dump res
(p1,11000,18000)
(p2,7000,22000)
(p3,5000,)
___________________________________
cleaning nulls:-
SUM over an empty bag returns null (hence the trailing comma in the
p3 row above), so we replace any null total with 0.
grunt> describe res;
res: {pid: chararray,tot1: long,tot2: long}
grunt> res = foreach res generate
>> pid,
>> (tot1 is null ? 0:tot1) as tot1,
>> (tot2 is null ? 0:tot2) as tot2;
grunt> dump res
(p1,11000,18000)
(p2,7000,22000)
(p3,5000,0)
grunt> res = foreach res generate
>> *, tot1+tot2 as total;
grunt> dump res
(p1,11000,18000,29000)
(p2,7000,22000,29000)
(p3,5000,0,5000)
___________________________________
assignment:-
______________
task: cleaning
schema -->
trid, prid, price, mrp, qnt, discount
if price is missing, replace it with mrp.
if qnt is missing, replace it with 1.
if discount is missing, replace it with 0.
sankara.deva2016@gmail.com
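one possible approach, following the null-cleaning pattern above (a
sketch only; the input path 'pdemo/trans' and the field types are
assumptions):

trans = load 'pdemo/trans' using PigStorage(',')
    as (trid:chararray, prid:chararray, price:int,
        mrp:int, qnt:int, discount:int);
clean = foreach trans generate
    trid, prid,
    (price is null ? mrp : price) as price,
    mrp,
    (qnt is null ? 1 : qnt) as qnt,
    (discount is null ? 0 : discount) as discount;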
_______________________________
union:-
______
ds1
________
name, sal
__________
aaa,10000
bbbb,30000
cccc,40000
___________
ds2
____________
name,sal
___________
xxx,30000
yyy,40000
zzz,60000
______________
e = union ds1, ds2;
if the schemas differ (e.g., different field order):
ds3
_________
sal, name
___________
10000,abc
20000,def
____________
ds4 = foreach ds3 generate name, sal;
ds = union ds1, ds2, ds4;
___________________________
if the files have different numbers of fields:
[training@localhost ~]$ cat > e1
ravi,30000,m
rani,40000,f
giri,50000,m
[training@localhost ~]$ cat > e2
hari,60000,11
hara,70000,12
[training@localhost ~]$ hadoop fs -copyFromLocal e1 pdemo
[training@localhost ~]$ hadoop fs -copyFromLocal e2 pdemo
grunt> e1 = load 'pdemo/e1' using PigStorage(',')
>> as (name:chararray, sal:int, sex:chararray);
grunt> e2 = load 'pdemo/e2' using PigStorage(',')
>> as (name:chararray, sal:int, dno:int);
grunt> ee1 = foreach e1 generate *,0 as dno;
grunt> ee2 = foreach e2 generate name,
>> sal,'*' as sex, dno;
grunt> ee = union ee1, ee2;
grunt>
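note:- Pig 0.9+ also provides "union onschema", which aligns fields
by name and fills fields missing from an input with nulls, so the
manual padding above becomes unnecessary (a sketch; both relations
must be loaded with named schemas):

ee = union onschema e1, e2;
-- sex is null for e2's rows, dno is null for e1's rows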
________________________________________
distinct:- to eliminate duplicates
ds
id, name
_________
(101,a)
(102,b)
(103,c)
(101,a)
(102,b)
(101,a)
(102,b)
(102,x)
_____________
ds2 = distinct ds;
(101,a)
(102,b)
(103,c)
(102,x)
___________________________
emp ---> id,name,sal,sex,dno
what are the distinct dno values (departments)?
e = foreach emp generate dno;
e2 = distinct e;
how many unique depts?
grp = group e2 all;
res = foreach grp generate
COUNT(e2) as cnt;
_____________________________
order:
______
e = order emp by name;
e2 = order emp by sal desc;
e3 = order emp by sal desc,
dno , sex desc;
--- there is no limit on the number of sort fields.
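a common follow-up is top-N: order plus limit. a sketch using the
emp relation above:

sorted = order emp by sal desc;
top3 = limit sorted 3;   -- three highest-paid employees
dump top3;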
________________________________