Data science Software Course Training in Ameerpet Hyderabad

Data science Software Course Training in Ameerpet Hyderabad

Tuesday 14 June 2016

Pig Lab4


cogroup:-
_________

to get seperate inner bags (data groups) for each dataset.
 so that, we can perform seperate aggregations on each data set.

  ds1
_______
 (a,10)
 (b,20)
 (a,30)
 (b,40)
_________

  ds2
_________
 (a,30)
 (c,30)
 (c,40)
 (a,20)
_________

cg = cogroup ds1 by $0, ds2 by $0;
 --> group, ds1, ds2
(a,{(a,10),(a,30)},{(a,30),(a,20)})
(b,{(b,20),(b,40)},{})
(c,{},{(c,30),(c,40)})

res = foreach cg generate
      group as mykey, 
  COUNT(ds1) as cnt1,
  COUNT(ds2) as cnt2;
(a,2,2)
(b,2,0)
(c,0,2)
_________________________________

[training@localhost ~]$ cat > sales1
p1,2000
p2,3000
p1,4000
p1,5000
p2,4000
p3,5000
[training@localhost ~]$ cat > sales2
p1,6000
p2,8000
p1,1000
p1,5000
p1,6000
p2,6000
p2,8000
[training@localhost ~]$ hadoop fs -copyFromLocal sales1  pdemo


[training@localhost ~]$ hadoop fs -copyFromLocal sales2  pdemo

grunt> s1 = load 'pdemo/sales1'                
>>     using PigStorage(',')
>>    as (pid:chararray, price:int);
grunt> s2 = load 'pdemo/sales2'         
>>     using PigStorage(',')        
>>    as (pid:chararray, price:int);
grunt> cg = cogroup s1 by pid, s2 by pid;
grunt> describe cg
cg: {group: chararray,s1: {pid: chararray,price: int},s2: {pid: chararray,price: int}}
grunt> dump cg

(p1,{(p1,2000),(p1,4000),(p1,5000)},{(p1,6000),(p1,1000),(p1,5000),(p1,6000)})
(p2,{(p2,3000),(p2,4000)},{(p2,8000),(p2,6000),(p2,8000)})
(p3,{(p3,5000)},{})

grunt> res = foreach cg generate
>>    group as pid, SUM(s1.price) as tot1,
>>    SUM(s2.price) as tot2;
grunt> dump res

(p1,11000,18000)
(p2,7000,22000)
(p3,5000,)

___________________________________

cleaning nulls

grunt> describe res;
res: {pid: chararray,tot1: long,tot2: long}.

grunt> res = foreach res generate             
>>    pid,                                
>>    (tot1 is null   ?   0:tot1) as tot1,
>>   (tot2 is null ? 0:tot2) as tot2;     
grunt> 


(p1,11000,18000)
(p2,7000,22000)
(p3,5000,0)

    
grunt> res = foreach res generate
>>      *,  tot1+tot2 as totall;
grunt> dump res

(p1,11000,18000,29000)
(p2,7000,22000,29000)
(p3,5000,0,5000)

___________________________________

assignment:-
______________

task: cleaning 
schema -->
  trid, prid, price, mrp, qnt, discount

if  price missed, replace it by mrp.
if qnt missed , replace it by 1.
if discount missed , replace it by 0.

 sankara.deva2016@gmail.com
_______________________________

union:-
______

 ds1
________
name, sal
__________
aaa,10000
bbbb,30000
cccc,40000
___________

ds2
____________
name,sal
___________
 xxx,30000
 yyy,40000
 zzz,60000
______________

 e = union ds1  , ds2



if all files schema is different.
 ds3
_________
sal, name
___________
10000,abc
20000,def
____________

ds4 = foreach ds3 generate name, sal;

ds = union ds1, ds2, ds4;
___________________________

if files have different number of fields.

[training@localhost ~]$ cat > e1
ravi,30000,m
rani,40000,f
giri,50000,m
[training@localhost ~]$ cat > e2
hari,60000,11
hara,70000,12
[training@localhost ~]$ hadoop fs -copyFromLocal e1 pdemo
[training@localhost ~]$ hadoop fs -copyFromLocal e2 pdemo

grunt> e1 = load 'pdemo/e1' using PigStorage(',')
>>   as (name:chararray, sal:int, sex:chararray);
grunt> e2 = load 'pdemo/e2' using PigStorage(',')
>>   as (name:chararray, sal:int, dno:int);
grunt> ee1 = foreach e1 generate *,0 as dno;
grunt> ee2 = foreach e2 generate name,
>>         sal,'*' as sex, dno;
grunt> ee = union ee1, ee2;
grunt> 

________________________________________

distinct:- to eliminate duplicates

    ds
  _id,name_________
  (101,a)
  (102,b)
  (103,c)
  (101,a)
  (102,b)
  (101,a)
  (102,b)
  (102,x)
_____________

  ds2 = distinct ds;
  (101,a)
  (102,b)
  (103,c)
  (102,x)
___________________________

emp ---> id,name,sal,sex,dno

 what are different dno(departments)s

 e = foreach emp geneate dno;
 e2 = distinct e

 how many unique depts?

  grp = group e2 all;
 res = foreach grp generate
 COUNT(e2) as cnt; 

_____________________________

order:
______

 e = order emp by name;
 e2 = order emp by sal desc;
 e3 = order emp by sal desc,
              dno , sex desc;

--- no limit for max number of sort fields.

________________________________












































































  



















































No comments:

Post a Comment