Wednesday, May 08, 2013

Learned a little bit about importing data from MySQL into HDFS using Sqoop

I had a chance to read the book Hadoop Real-World Solutions Cookbook (thank you ^______^), and it reminded me that I had never tried Sqoop. As you may know, Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. So I wanted to test it a bit, and I chose to read data from MySQL into HDFS.
First, I load my sample data into MySQL:
[surachart@centos ~]$ mysql -u surachart mydb -p < a.sql
Enter password:
[surachart@centos ~]$ mysql -u surachart mydb -p
Enter password:
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 10
Server version: 5.1.69 Source distribution

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select count(*) from mytable;
+----------+
| count(*) |
+----------+
|  1295462 |
+----------+
1 row in set (0.00 sec)

mysql>
After loading the data, I downloaded Sqoop (the binary distribution) and tested it.
[surachart@centos ~]$ ls sqoop-1.4.3.bin__hadoop-0.20.tar.gz
sqoop-1.4.3.bin__hadoop-0.20.tar.gz
[surachart@centos ~]$ tar zxf sqoop-1.4.3.bin__hadoop-0.20.tar.gz

[surachart@centos ~]$ cd  sqoop-1.4.3.bin__hadoop-0.20
[surachart@centos sqoop-1.4.3.bin__hadoop-0.20]$ cd bin/
[surachart@centos bin]$ pwd
/home/surachart/sqoop-1.4.3.bin__hadoop-0.20/bin

[surachart@centos bin]$ ./sqoop
Error: /usr/lib/hadoop does not exist!
Please set $HADOOP_COMMON_HOME to the root of your Hadoop installation.
[surachart@centos bin]$ export HADOOP_COMMON_HOME=/usr
[surachart@centos bin]$ ./sqoop
Error: /usr/lib/hadoop-mapreduce does not exist!
Please set $HADOOP_MAPRED_HOME to the root of your Hadoop MapReduce installation.
[surachart@centos bin]$ export HADOOP_MAPRED_HOME=/usr
[surachart@centos bin]$ ./sqoop
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Try 'sqoop help' for usage.
[surachart@centos bin]$ export HBASE_HOME=/home/surachart/hbase
[surachart@centos bin]$ ./sqoop
Try 'sqoop help' for usage.
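To avoid re-exporting these variables in every new shell, they could be added to ~/.bashrc. The paths below are the ones from this setup; adjust them to your own install locations:

```shell
# Sqoop environment from this setup -- adjust paths for your installation.
export HADOOP_COMMON_HOME=/usr
export HADOOP_MAPRED_HOME=/usr
export HBASE_HOME=/home/surachart/hbase
```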

[surachart@centos bin]$ ./sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.

[surachart@centos bin]$ hadoop fs -mkdir /user/surachart/import
[surachart@centos bin]$ hadoop fs -ls /user/surachart/import
[surachart@centos bin]$
[surachart@centos bin]$ ./sqoop import -m 1 --connect jdbc:mysql://localhost:3306/mydb --username surachart --password password --table mytable --target-dir /user/surachart/import
13/05/08 12:11:42 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
13/05/08 12:11:42 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
13/05/08 12:11:42 INFO tool.CodeGenTool: Beginning code generation
13/05/08 12:11:43 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
        at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:716)
        at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
        at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:605)
        at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:628)
        at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:235)
        at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:219)
        at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:283)
        at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1255)
        at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1072)
        at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:82)
        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:390)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:476)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
Note: Sqoop ships without the MySQL JDBC driver, so download mysql-connector-java-5.1.25 from http://dev.mysql.com/downloads/connector/j/ and copy the jar into Sqoop's lib directory.
[surachart@centos ~]$ ls mysql-connector-java-5.1.25.zip
mysql-connector-java-5.1.25.zip
[surachart@centos ~]$ unzip mysql-connector-java-5.1.25.zip
[surachart@centos ~]$ cp mysql-connector-java-5.1.25/mysql-connector-java-5.1.25-bin.jar sqoop-1.4.3.bin__hadoop-0.20/lib/
Now everything should be fine. Try again, this time adding --direct to use the mysqldump fast path.
[surachart@centos ~]$ cd sqoop-1.4.3.bin__hadoop-0.20/bin/
[surachart@centos bin]$ ./sqoop import --direct -m 1 --connect jdbc:mysql://localhost:3306/mydb --username surachart --password password --table mytable --target-dir /user/surachart/import
13/05/08 12:27:54 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
13/05/08 12:27:54 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
13/05/08 12:27:54 INFO tool.CodeGenTool: Beginning code generation
13/05/08 12:27:56 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `mytable` AS t LIMIT 1
13/05/08 12:27:56 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `mytable` AS t LIMIT 1
13/05/08 12:27:56 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr
Note: /tmp/sqoop-surachart/compile/7d652c85de6562a56d07c0b4017e3cd4/mytable.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
13/05/08 12:28:04 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-surachart/compile/7d652c85de6562a56d07c0b4017e3cd4/mytable.jar
13/05/08 12:28:04 INFO manager.DirectMySQLManager: Beginning mysqldump fast path import
13/05/08 12:28:04 INFO mapreduce.ImportJobBase: Beginning import of mytable
13/05/08 12:28:10 INFO mapred.JobClient: Cleaning up the staging area hdfs://centos:8020/user/surachart/.staging/job_201305081226_0002
13/05/08 12:28:10 ERROR security.UserGroupInformation: PriviledgedActionException as:surachart cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /user/surachart/import already exists
13/05/08 12:28:10 ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /user/surachart/import already exists
        at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:949)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:912)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:912)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
        at org.apache.sqoop.mapreduce.ImportJobBase.doSubmitJob(ImportJobBase.java:173)
        at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:151)
        at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:221)
        at org.apache.sqoop.manager.DirectMySQLManager.importTable(DirectMySQLManager.java:92)
        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:403)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:476)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
The import fails because the output directory /user/surachart/import (created earlier with hadoop fs -mkdir) already exists, so I point --target-dir at a new subdirectory instead:
[surachart@centos bin]$ ./sqoop import --direct -m 1 --connect jdbc:mysql://localhost:3306/mydb --username surachart --password password --table mytable --target-dir /user/surachart/import/mytable
13/05/08 12:28:23 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
13/05/08 12:28:24 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
13/05/08 12:28:24 INFO tool.CodeGenTool: Beginning code generation
13/05/08 12:28:26 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `mytable` AS t LIMIT 1
13/05/08 12:28:26 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `mytable` AS t LIMIT 1
13/05/08 12:28:26 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr
Note: /tmp/sqoop-surachart/compile/98909e84f7e2215902ebdbcca40cfa9f/mytable.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
13/05/08 12:28:33 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-surachart/compile/98909e84f7e2215902ebdbcca40cfa9f/mytable.jar
13/05/08 12:28:34 INFO manager.DirectMySQLManager: Beginning mysqldump fast path import
13/05/08 12:28:34 INFO mapreduce.ImportJobBase: Beginning import of mytable
13/05/08 12:28:42 INFO mapred.JobClient: Running job: job_201305081226_0003
13/05/08 12:28:43 INFO mapred.JobClient:  map 0% reduce 0%
13/05/08 12:29:34 INFO mapred.JobClient:  map 100% reduce 0%
13/05/08 12:29:56 INFO mapred.JobClient: Job complete: job_201305081226_0003
13/05/08 12:29:56 INFO mapred.JobClient: Counters: 18
13/05/08 12:29:56 INFO mapred.JobClient:   Job Counters
13/05/08 12:29:56 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=54352
13/05/08 12:29:56 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/05/08 12:29:56 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/05/08 12:29:56 INFO mapred.JobClient:     Launched map tasks=1
13/05/08 12:29:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/05/08 12:29:56 INFO mapred.JobClient:   File Output Format Counters
13/05/08 12:29:56 INFO mapred.JobClient:     Bytes Written=23687032
13/05/08 12:29:56 INFO mapred.JobClient:   FileSystemCounters
13/05/08 12:29:56 INFO mapred.JobClient:     HDFS_BYTES_READ=87
13/05/08 12:29:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=36476
13/05/08 12:29:56 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=23687032
13/05/08 12:29:56 INFO mapred.JobClient:   File Input Format Counters
13/05/08 12:29:56 INFO mapred.JobClient:     Bytes Read=0
13/05/08 12:29:56 INFO mapred.JobClient:   Map-Reduce Framework
13/05/08 12:29:56 INFO mapred.JobClient:     Map input records=1
13/05/08 12:29:56 INFO mapred.JobClient:     Physical memory (bytes) snapshot=73830400
13/05/08 12:29:56 INFO mapred.JobClient:     Spilled Records=0
13/05/08 12:29:56 INFO mapred.JobClient:     CPU time spent (ms)=16210
13/05/08 12:29:56 INFO mapred.JobClient:     Total committed heap usage (bytes)=29818880
13/05/08 12:29:56 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1179168768
13/05/08 12:29:56 INFO mapred.JobClient:     Map output records=1295462
13/05/08 12:29:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=87
13/05/08 12:29:56 INFO mapreduce.ImportJobBase: Transferred 22.5897 MB in 81.613 seconds (283.4336 KB/sec)
13/05/08 12:29:56 INFO mapreduce.ImportJobBase: Retrieved 1295462 records.
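As a quick sanity check on the summary line, the reported transfer rate follows from the transferred size and elapsed time. A one-liner with awk (Sqoop's own rounding differs slightly in the last digit):

```shell
# 22.5897 MB transferred in 81.613 seconds, expressed in KB/sec.
awk 'BEGIN { printf "%.4f KB/sec\n", 22.5897 * 1024 / 81.613 }'
# prints 283.4334 KB/sec
```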


Check the data in HDFS.
[surachart@centos bin]$ hadoop fs -ls /user/surachart/import
Found 1 items
drwx------   - surachart surachart          0 2013-05-08 12:29 /user/surachart/import/mytable
[surachart@centos bin]$ hadoop fs -ls /user/surachart/import/mytable
Found 2 items
-rw-------   3 surachart surachart          0 2013-05-08 12:29 /user/surachart/import/mytable/_SUCCESS
-rw-------   3 surachart surachart   23687032 2013-05-08 12:29 /user/surachart/import/mytable/part-m-00000
Note: the data in the /user/surachart/import/mytable/part-m-00000 file is in CSV format.
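Because each record is one comma-separated line, standard text tools can slice it. A minimal sketch on a made-up row (the real rows would come from `hadoop fs -cat /user/surachart/import/mytable/part-m-00000`, and the column names here are hypothetical):

```shell
# Split a sample Sqoop-style CSV record into fields with awk.
echo '1,surachart,2013-05-08' | awk -F',' '{ printf "id=%s name=%s date=%s\n", $1, $2, $3 }'
# prints: id=1 name=surachart date=2013-05-08
```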

