Sunday, July 07, 2013

Apache Sqoop Cookbook: Unlocking Hadoop for Your Relational Database, by Kathleen Ting and Jarek Jarcec Cecho

What is Apache Sqoop? It's a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. I tested it a bit with Sqoop 1 ^____^
I tested it with MySQL; here is a quick example.
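Before the import can run, the MySQL side needs a database, a user for Sqoop, and the cities table. Here is a minimal setup sketch; the database name, user, and the three rows match the transcript below, but the exact DDL and GRANT statements are my assumption:

    # Minimal MySQL setup sketch (run as the MySQL root user).
    # The names and rows match the transcript below; the exact DDL is assumed.
    mysql -u root -p <<'SQL'
    CREATE DATABASE sqoop;
    CREATE USER 'sqoop'@'localhost' IDENTIFIED BY 'sqoop';
    GRANT ALL PRIVILEGES ON sqoop.* TO 'sqoop'@'localhost';
    USE sqoop;
    CREATE TABLE cities (
      id      INT NOT NULL PRIMARY KEY,
      country VARCHAR(50),
      city    VARCHAR(50)
    );
    INSERT INTO cities VALUES
      (1, 'USA', 'Palo Alto'),
      (2, 'Czech Republic', 'Brno'),
      (3, 'USA', 'Sunnyvale');
    SQL

With the table in place, the import itself: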
    [oracle@linux01 sqoop-1.4.2-cdh4.2.1]$ hadoop fs -ls
    13/07/07 14:07:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Found 2 items
    drwxr-xr-x   - oracle supergroup          0 2013-07-07 12:35 hbase
    drwxr-xr-x   - oracle supergroup          0 2013-07-02 15:09 input
    [oracle@linux01 sqoop-1.4.2-cdh4.2.1]$ ./bin/sqoop import --connect jdbc:mysql://localhost/sqoop --username sqoop --password sqoop --table cities
    13/07/07 14:07:54 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
    13/07/07 14:07:54 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
    13/07/07 14:07:54 INFO tool.CodeGenTool: Beginning code generation
    13/07/07 14:07:56 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `cities` AS t LIMIT 1
    13/07/07 14:07:56 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `cities` AS t LIMIT 1
    13/07/07 14:07:56 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/oracle/kiji-bento-buri/cluster/bin/../lib/hadoop-2.0.0-mr1-cdh4.2.1
    Note: /tmp/sqoop-oracle/compile/4ee8b9a55045a237f2fb2961bae8a487/cities.java uses or overrides a deprecated API.
    Note: Recompile with -Xlint:deprecation for details.
    13/07/07 14:08:06 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-oracle/compile/4ee8b9a55045a237f2fb2961bae8a487/cities.jar
    13/07/07 14:08:06 WARN manager.MySQLManager: It looks like you are importing from mysql.
    13/07/07 14:08:06 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
    13/07/07 14:08:06 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
    13/07/07 14:08:06 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
    13/07/07 14:08:06 INFO mapreduce.ImportJobBase: Beginning import of cities
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/home/oracle/kiji-bento-buri/cluster/lib/hadoop-2.0.0-mr1-cdh4.2.1/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/home/oracle/kiji-bento-buri/cluster/lib/hbase-0.94.2-cdh4.2.1/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    13/07/07 14:08:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    13/07/07 14:08:13 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    13/07/07 14:08:18 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `cities`
    13/07/07 14:08:19 INFO mapred.JobClient: Running job: job_20130707123538807_0001
    13/07/07 14:08:20 INFO mapred.JobClient:  map 0% reduce 0%
    13/07/07 14:09:19 INFO mapred.JobClient:  map 33% reduce 0%
    13/07/07 14:09:20 INFO mapred.JobClient:  map 66% reduce 0%
    13/07/07 14:09:38 INFO mapred.JobClient:  map 100% reduce 0%
    13/07/07 14:09:51 INFO mapred.JobClient: Job complete: job_20130707123538807_0001
    13/07/07 14:09:51 INFO mapred.JobClient: Counters: 23
    13/07/07 14:09:51 INFO mapred.JobClient:   File System Counters
    13/07/07 14:09:51 INFO mapred.JobClient:     FILE: Number of bytes read=0
    13/07/07 14:09:51 INFO mapred.JobClient:     FILE: Number of bytes written=705255
    13/07/07 14:09:51 INFO mapred.JobClient:     FILE: Number of read operations=0
    13/07/07 14:09:51 INFO mapred.JobClient:     FILE: Number of large read operations=0
    13/07/07 14:09:51 INFO mapred.JobClient:     FILE: Number of write operations=0
    13/07/07 14:09:51 INFO mapred.JobClient:     HDFS: Number of bytes read=295
    13/07/07 14:09:51 INFO mapred.JobClient:     HDFS: Number of bytes written=54
    13/07/07 14:09:51 INFO mapred.JobClient:     HDFS: Number of read operations=3
    13/07/07 14:09:51 INFO mapred.JobClient:     HDFS: Number of large read operations=0
    13/07/07 14:09:51 INFO mapred.JobClient:     HDFS: Number of write operations=3
    13/07/07 14:09:51 INFO mapred.JobClient:   Job Counters
    13/07/07 14:09:51 INFO mapred.JobClient:     Launched map tasks=3
    13/07/07 14:09:51 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=122318
    13/07/07 14:09:51 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
    13/07/07 14:09:51 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
    13/07/07 14:09:51 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
    13/07/07 14:09:51 INFO mapred.JobClient:   Map-Reduce Framework
    13/07/07 14:09:51 INFO mapred.JobClient:     Map input records=3
    13/07/07 14:09:51 INFO mapred.JobClient:     Map output records=3
    13/07/07 14:09:51 INFO mapred.JobClient:     Input split bytes=295
    13/07/07 14:09:51 INFO mapred.JobClient:     Spilled Records=0
    13/07/07 14:09:51 INFO mapred.JobClient:     CPU time spent (ms)=8520
    13/07/07 14:09:51 INFO mapred.JobClient:     Physical memory (bytes) snapshot=323211264
    13/07/07 14:09:51 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3169579008
    13/07/07 14:09:51 INFO mapred.JobClient:     Total committed heap usage (bytes)=188350464
    13/07/07 14:09:51 INFO mapreduce.ImportJobBase: Transferred 54 bytes in 102.0092 seconds (0.5294 bytes/sec)
    13/07/07 14:09:51 INFO mapreduce.ImportJobBase: Retrieved 3 records.
    [oracle@linux01 sqoop-1.4.2-cdh4.2.1]$ hadoop fs -ls
    13/07/07 14:10:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Found 3 items
    drwxr-xr-x   - oracle supergroup          0 2013-07-07 14:09 cities
    drwxr-xr-x   - oracle supergroup          0 2013-07-07 12:35 hbase
    drwxr-xr-x   - oracle supergroup          0 2013-07-02 15:09 input
    [oracle@linux01 sqoop-1.4.2-cdh4.2.1]$ type java
    java is /u01/app/oracle/product/12.1.0/dbhome_1/jdk/bin/java
    [oracle@linux01 sqoop-1.4.2-cdh4.2.1]$ hadoop fs -ls cities
    13/07/07 14:10:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Found 5 items
    -rw-r--r--   3 oracle supergroup          0 2013-07-07 14:09 cities/_SUCCESS
    drwxr-xr-x   - oracle supergroup          0 2013-07-07 14:08 cities/_logs
    -rw-r--r--   3 oracle supergroup         16 2013-07-07 14:09 cities/part-m-00000
    -rw-r--r--   3 oracle supergroup         22 2013-07-07 14:09 cities/part-m-00001
    -rw-r--r--   3 oracle supergroup         16 2013-07-07 14:09 cities/part-m-00002
    [oracle@linux01 sqoop-1.4.2-cdh4.2.1]$ hadoop fs -cat cities/part-*
    13/07/07 14:11:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    1,USA,Palo Alto
    2,Czech Republic,Brno
    3,USA,Sunnyvale
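Two things in the log above are worth acting on: the warning about putting the password on the command line (use -P to be prompted instead) and the hint about the MySQL-specific fast path (--direct, which requires the mysqldump utility on the cluster nodes). A sketch of the same import with both suggestions applied, plus -m 1 to use a single mapper instead of the three that produced the part-m-* files above:

    # Same import, taking the log's advice: prompt for the password (-P)
    # and use the MySQL fast path (--direct). -m 1 runs one mapper, so the
    # output is a single file instead of three part-m-* files.
    ./bin/sqoop import \
      --connect jdbc:mysql://localhost/sqoop \
      --username sqoop -P \
      --table cities \
      --direct \
      -m 1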
Sqoop with Oracle 12c - Multitenant

However, I didn't get to try it with Oracle 12c Multitenant myself.
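Still, for reference, here is a hypothetical and untested sketch. In a multitenant database, Sqoop should connect to a pluggable database (PDB) through its service name in the JDBC thin URL; the host, port, service name, user, and table below are all placeholders:

    # Untested sketch: import from an Oracle 12c pluggable database (PDB).
    # The thin URL points at the PDB's service name. Host, port, service,
    # user, and table are placeholders, and the Oracle JDBC driver jar
    # must be available in Sqoop's lib directory.
    ./bin/sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/pdb1 \
      --username scott -P \
      --table CITIES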
Beyond that, I'm not going to walk through more Sqoop samples here; instead, I would like to mention a book about it: Apache Sqoop Cookbook: Unlocking Hadoop for Your Relational Database, by Kathleen Ting and Jarek Jarcec Cecho. The book was written for anyone who is interested in Apache Sqoop and wants to start learning about it (Sqoop version 1). Readers will find plenty of samples to learn from and can keep the book around as a reference or manual. It will help you quickly learn how to deploy Sqoop and then apply it in your environment. The samples in the book use MySQL, Oracle, and PostgreSQL databases; however, readers can adapt them for SQL Server, Netezza, Teradata, or other relational systems.
So what will readers be able to learn from the book?
- Transfer data from a single database table into your Hadoop ecosystem
- Keep table data and Hadoop in sync by importing data incrementally (see the sketch after this list)
- Import data from more than one database table
- Customize transferred data by calling various database functions
- Export generated, processed, or backed-up data from Hadoop to your database
- Run Sqoop within Oozie, Hadoop’s specialized workflow scheduler
- Load data into Hadoop’s data warehouse (Hive) or database (HBase)
- Handle installation, connection, and syntax issues common to specific database vendors
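As an example of the incremental import mentioned in the list, here is a minimal sketch. The flags are standard Sqoop 1 options; the check column and last value assume the cities table from my test above. On each run, only rows whose id is greater than the given --last-value are fetched and appended:

    # Sketch of an incremental import against the cities table above.
    # Only rows with id > 3 are fetched and appended to the HDFS directory.
    ./bin/sqoop import \
      --connect jdbc:mysql://localhost/sqoop \
      --username sqoop -P \
      --table cities \
      --incremental append \
      --check-column id \
      --last-value 3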

The book has 94 pages (PDF) across 7 chapters. Before starting to read, I believe readers should install MySQL (or another database), Sqoop version 1 (I chose sqoop-1.4.2-cdh4.2.1, which was easy for me), and Apache Hadoop (for my testing, I chose Kiji). I found a few small mistakes in the book's samples, but I could adjust them and they still worked for testing and learning. It might be that I was using a different Sqoop version than the authors. As long as readers can get the ideas from the book and from Apache Sqoop, I don't think this is a problem.

About the authors:
Kathleen Ting is currently a Customer Operations Engineering Manager at Cloudera where she helps customers deploy and use the Hadoop ecosystem in production. She has spoken on Hadoop, ZooKeeper, and Sqoop at many Big Data conferences including Hadoop World, ApacheCon, and OSCON. She's contributed to several projects in the open source community and is a Committer and PMC Member on Sqoop.

Jarek Jarcec Cecho is currently a Software Engineer at Cloudera where he develops software to help customers better access and integrate with the Hadoop ecosystem. He has led the Sqoop community in the architecture of the next generation of Sqoop, known as Sqoop 2. He's contributed to several projects in the open source community and is a Committer and PMC Member on Sqoop, Flume, and MRUnit.
