Data science Software Course Training in Ameerpet Hyderabad

Data science Software Course Training in Ameerpet Hyderabad

Tuesday, 14 June 2016

Pig Lab4


cogroup:-
_________

to get seperate inner bags (data groups) for each dataset.
 so that, we can perform seperate aggregations on each data set.

  ds1
_______
 (a,10)
 (b,20)
 (a,30)
 (b,40)
_________

  ds2
_________
 (a,30)
 (c,30)
 (c,40)
 (a,20)
_________

cg = cogroup ds1 by $0, ds2 by $0;
 --> group, ds1, ds2
(a,{(a,10),(a,30)},{(a,30),(a,20)})
(b,{(b,20),(b,40)},{})
(c,{},{(c,30),(c,40)})

res = foreach cg generate
      group as mykey, 
  COUNT(ds1) as cnt1,
  COUNT(ds2) as cnt2;
(a,2,2)
(b,2,0)
(c,0,2)
_________________________________

[training@localhost ~]$ cat > sales1
p1,2000
p2,3000
p1,4000
p1,5000
p2,4000
p3,5000
[training@localhost ~]$ cat > sales2
p1,6000
p2,8000
p1,1000
p1,5000
p1,6000
p2,6000
p2,8000
[training@localhost ~]$ hadoop fs -copyFromLocal sales1  pdemo


[training@localhost ~]$ hadoop fs -copyFromLocal sales2  pdemo

grunt> s1 = load 'pdemo/sales1'                
>>     using PigStorage(',')
>>    as (pid:chararray, price:int);
grunt> s2 = load 'pdemo/sales2'         
>>     using PigStorage(',')        
>>    as (pid:chararray, price:int);
grunt> cg = cogroup s1 by pid, s2 by pid;
grunt> describe cg
cg: {group: chararray,s1: {pid: chararray,price: int},s2: {pid: chararray,price: int}}
grunt> dump cg

(p1,{(p1,2000),(p1,4000),(p1,5000)},{(p1,6000),(p1,1000),(p1,5000),(p1,6000)})
(p2,{(p2,3000),(p2,4000)},{(p2,8000),(p2,6000),(p2,8000)})
(p3,{(p3,5000)},{})

grunt> res = foreach cg generate
>>    group as pid, SUM(s1.price) as tot1,
>>    SUM(s2.price) as tot2;
grunt> dump res

(p1,11000,18000)
(p2,7000,22000)
(p3,5000,)

___________________________________

cleaning nulls

grunt> describe res;
res: {pid: chararray,tot1: long,tot2: long}.

grunt> res = foreach res generate             
>>    pid,                                
>>    (tot1 is null   ?   0:tot1) as tot1,
>>   (tot2 is null ? 0:tot2) as tot2;     
grunt> 


(p1,11000,18000)
(p2,7000,22000)
(p3,5000,0)

    
grunt> res = foreach res generate
>>      *,  tot1+tot2 as totall;
grunt> dump res

(p1,11000,18000,29000)
(p2,7000,22000,29000)
(p3,5000,0,5000)

___________________________________

assignment:-
______________

task: cleaning 
schema -->
  trid, prid, price, mrp, qnt, discount

if  price missed, replace it by mrp.
if qnt missed , replace it by 1.
if discount missed , replace it by 0.

 sankara.deva2016@gmail.com
_______________________________

union:-
______

 ds1
________
name, sal
__________
aaa,10000
bbbb,30000
cccc,40000
___________

ds2
____________
name,sal
___________
 xxx,30000
 yyy,40000
 zzz,60000
______________

 e = union ds1  , ds2



if all files schema is different.
 ds3
_________
sal, name
___________
10000,abc
20000,def
____________

ds4 = foreach ds3 generate name, sal;

ds = union ds1, ds2, ds4;
___________________________

if files have different number of fields.

[training@localhost ~]$ cat > e1
ravi,30000,m
rani,40000,f
giri,50000,m
[training@localhost ~]$ cat > e2
hari,60000,11
hara,70000,12
[training@localhost ~]$ hadoop fs -copyFromLocal e1 pdemo
[training@localhost ~]$ hadoop fs -copyFromLocal e2 pdemo

grunt> e1 = load 'pdemo/e1' using PigStorage(',')
>>   as (name:chararray, sal:int, sex:chararray);
grunt> e2 = load 'pdemo/e2' using PigStorage(',')
>>   as (name:chararray, sal:int, dno:int);
grunt> ee1 = foreach e1 generate *,0 as dno;
grunt> ee2 = foreach e2 generate name,
>>         sal,'*' as sex, dno;
grunt> ee = union ee1, ee2;
grunt> 

________________________________________

distinct:- to eliminate duplicates

    ds
  _id,name_________
  (101,a)
  (102,b)
  (103,c)
  (101,a)
  (102,b)
  (101,a)
  (102,b)
  (102,x)
_____________

  ds2 = distinct ds;
  (101,a)
  (102,b)
  (103,c)
  (102,x)
___________________________

emp ---> id,name,sal,sex,dno

 what are different dno(departments)s

 e = foreach emp geneate dno;
 e2 = distinct e

 how many unique depts?

  grp = group e2 all;
 res = foreach grp generate
 COUNT(e2) as cnt; 

_____________________________

order:
______

 e = order emp by name;
 e2 = order emp by sal desc;
 e3 = order emp by sal desc,
              dno , sex desc;

--- no limit for max number of sort fields.

________________________________












































































  



















































1 comment:

  1. Good Post! Thank you so much for sharing this pretty post, it was so good to read and useful to improve my knowledge as updated one, keep blogging.
    Big Data Hadoop Training in electronic city

    ReplyDelete