Data science Software Course Training in Ameerpet Hyderabad

Wednesday 3 May 2017

Pig: Data Types and Operators

 Data types:

  simple data types:
 ---------------------
   int --> 32-bit signed integer.
   long --> 64-bit signed integer.
   float --> 32-bit floating point.
   double --> 64-bit floating point.
   boolean --> true/false.
 
 -------------------------
 complex data types:
 -------------------
  tuple
  bag



 name : chararray [ in older versions (before 0.7) the
                    maximum length was 2 GB;
                    in later versions it is 4 GB ]
        ----> variable length.

 age : int
 sal : double
 sex : chararray

 wife: tuple: (rani,24,hyd)
 children : bag : {(sony,4,m),(tony,2,f)}

 sample tuples of an outer bag (the profiles relation):

        profiles
-------------------------------
 (Ravi, 26, M, (rani,24,hyd), {(sony,4,m),(tony,2,f)})
   :
   :
------------------------------------------
 pig latin statement:
-----------------------
 structure:

 <alias of relation> = <operator along with expressions>;

  these expressions change from one operator to another.

 1) load
 2) describe
 3) dump
 4) store
 5) foreach
 6) filter
 7) limit
 8) sample
 9) group
 10) cogroup
 11) union
 12) join
 13) left outer join
 14) right outer join
 15) full outer join
 16) cross
 17) pig
 18) exec
 19) run
 20) illustrate

load:
------
  to load data from a
  file into a pig relation.
  [ logical load ]

  A = load 'file1' using PigStorage(',')
      as (a:int, b:int, c:int);

  here A is the alias of the relation.

  convention: use capital letters for relation names.

  the input file can be an
  hdfs or local file;
  that depends on the start-up mode of pig.
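
  as a concrete example, the emp relation used in the next sections
  could be loaded like this (a minimal sketch; the file name 'emp'
  and its comma-separated columns are assumptions, not from the notes):

  grunt> emp = load 'emp' using PigStorage(',')
             as (name:chararray, age:int, sal:double, sex:chararray);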
------------------
 describe:
  --> to get the schema of a relation.

 grunt> describe emp;
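
 with the emp load sketched above, this prints the schema; the
 output looks roughly like the following (format is illustrative):

 emp: {name: chararray,age: int,sal: double,sex: chararray}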

dump: -- to execute the data flow
      and write the output to the console.

  A = load 'file1' ......
  B = foreach A generate ...
  C = foreach B generate ...
  D = group C by ....
  E = foreach D generate ....

 grunt> dump E;
 ---> the flow is executed starting from the root relation,
      and the output is written to the console.

store: -- to execute the data flow
    and write the output to a file.

  the file can be local/hdfs, depending on the start-up mode.

grunt> store E into '/user/cloudera/myresults';
 ---> myresults will be the output directory.
   the files inside it are named
   'part-m-<number>'  --> output written by a mapper,
   or 'part-r-<number>' --> output written by a reducer.

limit:
------------
  ---> to get the first n tuples.
 grunt> X = limit A 3;

  ---> to get the last n tuples,
   two solutions:
   i) a udf
  ii) joins (a simpler alternative is sketched below).
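
  a third approach (a minimal sketch, not from the notes): order the
  relation in descending order of some field, then take the first n.
  the field name 'a' comes from the load example above.

  grunt> ordered = order A by a desc;
  grunt> last3 = limit ordered 3;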

-----------------------------------
 Sample:
 -------
   to get random samples.

   two types of sampling techniques:

 i) sampling without replacement
   ---> different sample sets
    don't have common elements (tuples).
    solution: Hive bucketing.

 ii) sampling with replacement
   ---> different sample sets
    can have common tuples.
    solution: pig sampling.

 grunt> s1 = sample products 0.05;
 grunt> s2 = sample products 0.05;
 grunt> s3 = sample products 0.05;


 filter:
 -------
   to filter tuples based on a given criterion.

 males = filter emp by (sex == 'm');
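
 conditions can also be combined; a minimal sketch (the 50000 salary
 threshold is an illustrative assumption):

 grunt> rich_males = filter emp by (sex == 'm') and (sal > 50000.0);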

3 ways to create subsets:
 i) limit  ii) sample  iii) filter


 foreach:
 --------
     to process each tuple.

    i) to filter fields.
   ii) to copy data from one relation to another.
  iii) to change field order.
   iv) to rename fields.
    v) to change field data types.
   vi) to perform transformations
       with given expressions.
  vii) conditional transformations.
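
  a minimal sketch showing several of these uses against the emp
  relation loaded earlier (the bonus formula and the 50000 band
  threshold are illustrative assumptions):

  grunt> emp2 = foreach emp generate
             name,                                 -- (i) keep only some fields
             (long) age as age_in_years,           -- (iv) rename and (v) cast a field
             sal * 0.10 as bonus,                  -- (vi) transformation with an expression
             ((sal > 50000.0) ? 'HIGH' : 'LOW') as band;  -- (vii) conditional transformation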

 ETL:
   extract, transform, load.

  extracting from databases,
  performing transformations,
  loading into target systems.
  ---> the above is a bad approach for big data.

ELT is recommended for big data:

 E --> extract from an rdbms, using sqoop.
 L --> load into hdfs.
 T --> transform using Pig/hive/mr/spark.
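
 a minimal sketch of the E and L steps with sqoop (the connection
 string, credentials, table name, and target directory are all
 illustrative assumptions):

 $ sqoop import \
       --connect jdbc:mysql://dbhost/salesdb \
       --username etl_user -P \
       --table emp \
       --target-dir /user/cloudera/emp

 the files written under /user/cloudera/emp can then be loaded into
 a pig relation and transformed, as in the load and foreach examples
 above.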
