Data science Software Course Training in Ameerpet Hyderabad

Wednesday 3 May 2017

Pig: Data Types and Operators

 Data types:

  simple data types:
 ---------------------
   int --> 32-bit signed integer.
   long --> 64-bit signed integer.
   float --> 32-bit floating point.
   double --> 64-bit floating point.
   boolean --> true/false.
 
 -------------------------
 complex data types:
 -------------------
  tuple
  bag



 name : chararray [ in older versions (before 0.7) the
                    maximum length was 2 GB;
                    in later versions it is 4 GB ]
        ----> variable length.

 age : int
 sal : double
 sex : chararray

 wife: tuple: (rani,24,hyd)
 children : bag : {(sony,4,m),(tony,2,f)}

 sample tuples of an outer bag (the profiles relation):

        profiles
-------------------------------
 (Ravi, 26, M, (rani,24,hyd), {(sony,4,m),(tony,2,f)})
   :
   :
------------------------------------------
 pig latin statement:
-----------------------
 structure:

 <alias of relation> = <operator along with expressions>;

  these expressions change from one operator to another.

 1) load
 2) describe
 3) dump
 4) store
 5) foreach
 6) filter
 7) limit
 8) sample
 9) group
 10) cogroup
 11) union
 12) join
 13) left outer join
 14) right outer join
 15) full outer join
 16) cross
 17) pig
 18) exec
 19) run
 20) illustrate

load:
------
  to load data from a
  file into a pig relation.
  [ logical load ]

  A = load 'file1' using PigStorage(',')
      as (a:int, b:int, c:int);

  here A is the alias of the relation.

  convention: use capital letters for relation names.

  the input file can be an
  hdfs or local file;
  that depends on the start-up mode of pig.
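
  as a concrete example, the emp relation used in the next sections
  could be loaded like this (a minimal sketch; the file name 'emp'
  and its comma-separated columns are assumptions, not from the notes):

  grunt> emp = load 'emp' using PigStorage(',')
             as (name:chararray, age:int, sal:double, sex:chararray);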
------------------
 describe:
  --> to get the schema of a relation.

 grunt> describe emp;
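
 with the emp load sketched above, this prints the schema; the
 output looks roughly like the following (format is illustrative):

 emp: {name: chararray,age: int,sal: double,sex: chararray}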

dump: -- to execute the data flow
      and write the output to the console.

  A = load 'file1' ......
  B = foreach A generate ...
  C = foreach B generate ...
  D = group C by ....
  E = foreach D generate ....

 grunt> dump E;
 ---> the flow is executed starting from the root relation,
      and the output is written to the console.

store: -- to execute the data flow
    and write the output to a file.

  the file can be local/hdfs, depending on the start-up mode.

grunt> store E into '/user/cloudera/myresults';
 ---> myresults will be the output directory.
   the files inside it are named
   'part-m-<number>'  --> output written by a mapper,
   or 'part-r-<number>' --> output written by a reducer.

limit:
------------
  ---> to get the first n tuples.
 grunt> X = limit A 3;

  ---> to get the last n tuples,
   two solutions:
   i) a udf
  ii) joins (a simpler alternative is sketched below).
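
  a third approach (a minimal sketch, not from the notes): order the
  relation in descending order of some field, then take the first n.
  the field name 'a' comes from the load example above.

  grunt> ordered = order A by a desc;
  grunt> last3 = limit ordered 3;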

-----------------------------------
 Sample:
 -------
   to get random samples.

   two types of sampling techniques:

 i) sampling without replacement
   ---> different sample sets
    don't have common elements (tuples).
    solution: Hive bucketing.

 ii) sampling with replacement
   ---> different sample sets
    can have common tuples.
    solution: pig sampling.

 grunt> s1 = sample products 0.05;
 grunt> s2 = sample products 0.05;
 grunt> s3 = sample products 0.05;


 filter:
 -------
   to filter tuples based on a given criterion.

 males = filter emp by (sex == 'm');
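
 conditions can also be combined; a minimal sketch (the 50000 salary
 threshold is an illustrative assumption):

 grunt> rich_males = filter emp by (sex == 'm') and (sal > 50000.0);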

3 ways to create subsets:
 i) limit  ii) sample  iii) filter


 foreach:
 --------
     to process each tuple.

    i) to filter fields.
   ii) to copy data from one relation to another.
  iii) to change field order.
   iv) to rename fields.
    v) to change field data types.
   vi) to perform transformations
       with given expressions.
  vii) conditional transformations.
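
  a minimal sketch showing several of these uses against the emp
  relation loaded earlier (the bonus formula and the 50000 band
  threshold are illustrative assumptions):

  grunt> emp2 = foreach emp generate
             name,                                 -- (i) keep only some fields
             (long) age as age_in_years,           -- (iv) rename and (v) cast a field
             sal * 0.10 as bonus,                  -- (vi) transformation with an expression
             ((sal > 50000.0) ? 'HIGH' : 'LOW') as band;  -- (vii) conditional transformation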

 ETL:
   extract, transform, load.

  extracting from databases,
  performing transformations,
  loading into target systems.
  ---> the above is a bad approach for big data.

ELT is recommended for big data:

 E --> extract from an rdbms, using sqoop.
 L --> load into hdfs.
 T --> transform using Pig/hive/mr/spark.
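
 a minimal sketch of the E and L steps with sqoop (the connection
 string, credentials, table name, and target directory are all
 illustrative assumptions):

 $ sqoop import \
       --connect jdbc:mysql://dbhost/salesdb \
       --username etl_user -P \
       --table emp \
       --target-dir /user/cloudera/emp

 the files written under /user/cloudera/emp can then be loaded into
 a pig relation and transformed, as in the load and foreach examples
 above.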
