
2010-11-30

Puppet 2.6.3 and Ruby

Make sure you install Ruby 1.8.6 (probably i686 at the time of this article). I've found that if you simply
yum --enablerepo=ruby install ruby -y

you will end up with 1.8.5 for x86_64 AND 1.8.6 for i686. The x86_64 install will be the default, though, and you will end up with weird Ruby errors ("import not recognized", or something like that). Your Ruby version should be the first thing you suspect!

So, do the following as an amendment to my previous posting:
yum --enablerepo=ruby install ruby.i686


Run the following to make sure you have 1.8.6 available as the default ruby runtime:
ruby --version
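
If you want to double-check which architectures actually ended up installed, an rpm query along these lines should list every installed ruby package, one line per arch:
rpm -q --qf '%{NAME}-%{VERSION}.%{ARCH}\n' ruby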

Getting Puppet up and Running

First off: if, after reading all of the articles and documentation and doing all of the Googling you can possibly muster, you still can't solve your problem, go onto IRC at
Freenode.net #puppet

The folks there will be more than happy to help you, LIVE!

Many thanks go out to "ZipKid", "Volcane", and "whack" for helping me out!

Ok, so I had some real trouble even getting Puppet up and running. Again, I'm running CentOS 5.5, and I decided to have Puppet run on its own special VM so that I have some segregation of responsibilities in our infrastructure.

That said, I decided to try and install the latest and greatest version, Puppet 2.6.3.

In order to accomplish the install using yum, please go to the following VERY helpful site and read the instructions thoroughly before starting on your journey:
http://www.craigdunn.org/2010/08/part-1-installing-puppet-2-6-1-on-centos-with-yumrpm/


Ok, so when you run into certificate errors, you need to keep a couple of things in mind:
1) Run everything with --no-daemonize --debug (puppetmasterd, puppetd)
2) If you get the error, start ALL OVER AGAIN by going onto the master server and performing the following steps:
2 a) puppet cert --list --all
2 b) puppet cert --remove <certmachine.domain.com>
3) Delete EVERYTHING inside /var/lib/puppet/ssl/ and /etc/puppet/ssl/ on ALL machines in question (master and agent(s)); the commands are sketched out just below.
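
Put together, the clean-up looks something like the following (taken straight from the steps above; be careful, the rm lines wipe ALL existing certs on the box):
puppet cert --list --all
puppet cert --remove <certmachine.domain.com>
rm -rf /var/lib/puppet/ssl/*
rm -rf /etc/puppet/ssl/*
The first two commands run on the master only; the rm lines run on the master AND every agent in question.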

Now re-run the cert request line as shown
puppet agent --no-daemonize --debug --waitforcert 60 --test


You'll see the request come in on the server output. Open another console and run the following on the master:
puppet cert --list

Find the line that shows the FQDN of the agent requesting the certificate and run
puppet cert --sign <machine.domain.com>


That should do it. You can run the following from the agent to see if the two are talking properly:
puppet agent --noop --test --server=<servername.domain.com>


You shouldn't see any errors; if you do, go through your .pp files and clean them up as needed.

2010-11-29

Running Sqoop under an Active Directory User Account

We installed Sqoop on a VM here for some end users to import data from our SQLServer instances throughout the enterprise. This is great news for us all, as no one needs to get their hands dirty exporting data through SSMS anymore.

So on this VM, we have CentOS 5.5, and installed Hive, Pig, and Sqoop. All properly configured to communicate with our cluster.

We installed Likewise-Open for Active Directory login information. I don't want to be in charge of maintaining a list of users that can have access to the items in our Hadoop interface. That's just a pain. If we find that people are misbehaving, we can narrow this down as much as we need to.

We have a domain group called HadoopUsers. I have certain people assigned to this group, and I can see that they are members of it in Linux by issuing:
id MYDOMAIN\\userX
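
As a quick sanity check (userX is just a placeholder here, and the exact way Likewise formats group names in the id output can vary), you can grep for the group directly:
id MYDOMAIN\\userX | grep -i hadoopusers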


All that remains is giving that group sufficient access to the local machine in order to run Sqoop.

Running Hive and Pig seems to work without any elevated privileges, but you will see something like the following exception when trying to import data using Sqoop without them:

10/11/26 08:55:24 ERROR tool.ImportTool: Encountered IOException running import job: java.io.FileNotFoundException: File /tmp/sqoop-MyDomain\userX/compile/9dc233654e097695be7aaf4dd4d5cd81/QueryResult.jar does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:372)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:207)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1270)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1246)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1218)
at org.apache.hadoop.mapred.JobClient.copyAndConfigureFiles(JobClient.java:722)
at org.apache.hadoop.mapred.JobClient.copyAndConfigureFiles(JobClient.java:609)
at org.apache.hadoop.mapred.JobClient.access$300(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:808)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:793)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:793)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:495)
at com.cloudera.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:107)
at com.cloudera.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:166)
at com.cloudera.sqoop.manager.SqlManager.importQuery(SqlManager.java:418)
at com.cloudera.sqoop.tool.ImportTool.importTable(ImportTool.java:352)
at com.cloudera.sqoop.tool.ImportTool.run(ImportTool.java:423)
at com.cloudera.sqoop.Sqoop.run(Sqoop.java:134)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at com.cloudera.sqoop.Sqoop.runSqoop(Sqoop.java:170)
at com.cloudera.sqoop.Sqoop.runTool(Sqoop.java:196)
at com.cloudera.sqoop.Sqoop.main(Sqoop.java:205)

-sh-3.2$


All I did to relieve this issue was add the MyDomain\\HadoopUsers group to the sudoers file and give them root access. This is a tad hazardous, I know, but it is simple and it works. Again, we can narrow this down further if we find that users are not behaving themselves.

Here is the line I added to the bottom of '/etc/sudoers':
%MyDomain\\HadoopUsers ALL=(ALL) ALL
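
If we ever do need to rein this in, the rule can be narrowed to specific commands instead of a blanket ALL. A rough sketch (the binary paths here are assumptions, so adjust them to wherever sqoop and hadoop actually live on your system):
%MyDomain\\HadoopUsers ALL=(ALL) /usr/bin/sqoop, /usr/bin/hadoop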

2010-11-26

Sqoop, SQLServer, DateTime, and Fun!

Ok, so writing a simple query like the following:


SELECT a.* FROM table a WHERE a.timestamp >= '20101125'


Should be pretty straightforward for importing data into Hive through Sqoop, but alas there are some hurdles.

I'll just get straight to the solution (I know people hate to read), so here is the final command-line example:


sqoop import --driver com.microsoft.sqlserver.jdbc.SQLServerDriver --connect 'jdbc:sqlserver://localhost;user=xxxxxx;password=xxxxxxx;database=MyDB' --query "select a.* from tableX a where (a.timestamp >= '2010-11-25 00:00:00') AND \$CONDITIONS" --target-dir /data/feedlog --split-by 'MyIDColumn' --fields-terminated-by '\t'


Basically, enclose the entire --query value in double quotes, and escape the $ on $CONDITIONS with a \ so that it survives the initial bash evaluation instead of being expanded away.
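
If you want to see why that backslash matters, a harmless echo test makes it obvious (assuming you don't already have a $CONDITIONS variable set in your shell):
echo "AND $CONDITIONS"     # bash expands it to nothing, prints: AND
echo "AND \$CONDITIONS"    # escaped, prints: AND $CONDITIONS
The second form is what Sqoop needs to see, since it substitutes its own split conditions for that token at run time.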

Happy hunting!

2010-11-23

Puppet to Manage Your Hadoop Cluster

Ok, I'm sure everyone has heard about how important it is to have a configuration management system up and running for your Hadoop cluster. This can't be overstated. I have 4 machines right now, and I already HATE copying files manually between them.

That said, everywhere I go, Puppet seems to be the utility of choice for this task.
http://www.puppetlabs.com

Again, I'm using CentOS (version 5.5 at the time of this writing), so you should download the RPM packages (common, client, and server) from puppetlabs and install from there.

Depending on what else you have done to this system, you may not have a repository configured that has rubygems(-stomp) available via YUM. If you get the following error, then refer to the next section; otherwise, skip on down a few lines:

No package rubygems available.


In order to get the rubygems-stomp dependency out of the way, perform the following steps, excerpted from http://www.threedrunkensysadsonthe.net/2010/04/mcollective-on-centos/

Enable ELFF
rpm -Uvh http://download.elff.bravenet.com/5/i386/elff-release-5-3.noarch.rpm


Install ruby and dependencies
yum -y install rubygems rubygem-stomp


A great start is to read the configuration docs:
http://docs.puppetlabs.com/guides/configuring.html

Active Directory/LDAP and Pig

So, in order to connect the Staging server to our Active Directory for user authentication, I simply used Likewise-Open, which can be downloaded and easily installed from http://www.likewise.com/download/index.php

Running the CLI utility was a breeze and only required a Domain Admin account and a reboot to succeed.

I'm going to assume that the machine you are attempting to get this going on has Sun's Java 6. I have u22 x64 installed from the rpm I downloaded from sun.com.

Now we can log in with our LDAP accounts. Great, but we have no permissions. What next?

Fortunately for me, Hive worked straight away, and was automatically connecting to the cluster's HDFS. So I'm good there. Created a table and it showed up in Hue under the proper user/group.

BTW, use Hue for all of your HDFS exploration tasks; it is so much easier than typing 'hadoop fs -ls /' over and over to find your way through the tree!

To get Pig to run successfully under such a limited user account on Linux, we need to set JAVA_HOME with the following lines, but where?


JAVA_HOME=/usr/java/latest
export JAVA_HOME


Adding these lines to each user's ~/.bash_profile or ~/.bashrc doesn't take automatically, because when an LDAP user logs in they are not going to be running bash (assuming you are running CentOS 5.x). We need this export set globally, so simply add the above two lines to /etc/profile (anywhere). Have an LDAP user log in again and make sure that it works.
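
A quick way to verify without hunting down an actual user (the account name here is just a placeholder):
su - MYDOMAIN\\userX -c 'echo $JAVA_HOME'
Because su - starts a login shell, /etc/profile gets sourced, so you should see /usr/java/latest echoed back.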

You *could* also set each user's default shell to /bin/bash by adding them to the /etc/passwd file, but this does not scale well as the number of users that *will* be accessing the system grows.

OINK OINK!!

A Brief Overview of Our Setup

Ok, so setting up a Hadoop cluster isn't super straightforward, but there are lots of folks out there attempting to bring it all together for us. If I miss a reference here or there, please email me and let me know so that I can source the entries that I place in this blog.

Our setup, for the time being, is rather simple. We have 4 nodes in total: one is our NameNode/JobTracker (herein referred to as Node1), and the other three are DataNodes (referred to as Node2-4). This will grow over the next couple of months into at least a 10-node configuration, so having a proper configuration management system installed will be crucial to the survival of the opscon (me).

We also have a server that is our "Staging" instance (referred to as Node0, but perhaps another name would be more fitting). This server has Active Directory integration and will be where all users will go to load data, run queries, etc. For the time being this server will also be the host of the MySQL instance for Hive and Puppet, as well as the host for the Puppet Master daemon.

We are implementing Cloudera's distribution of Hadoop, including Hive, Pig, Sqoop, and a variety of other utilities.

My next entry will cover the Active Directory integration on our Staging server. It will be fairly brief, and will detail how to ensure that Pig can be executed by a user with near-zero permissions.