Presentation on Apache Hive at Big Data TechCon
- Install and set up Hadoop and Hive. The easiest way is to download a demo VM with Hadoop, Hive, and HBase preinstalled; Cloudera provides such demo VMs.
- On your demo VM, download the dataset (source: http://stat-computing.org/dataexpo/2009/the-data.html)
mkdir -p ~/hive
cd ~/hive
wget http://stat-computing.org/dataexpo/2009/2008.csv.bz2
bzip2 -d 2008.csv.bz2
The dataset contains on-time flight performance data from 2008, originally released by the Research and Innovative Technology Administration (RITA).
- Ensure that your virtual machine can connect to the internet. If you are running VirtualBox on Ubuntu 12.10, you may be hitting a known bug that affects the demo VM's internet connectivity.
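Before loading the file into Hive, it can help to look at the CSV header, since each field needs a matching column in the table definition. A minimal sketch, assuming the decompressed file is at ~/hive/2008.csv and that its first line is a header row; `list_columns` is a hypothetical helper name, not part of any tool:

```shell
# Print the header fields of a CSV file, one per line, numbered from 1.
# Assumes the first line of the file is a comma-separated header row.
list_columns() {
  head -n 1 "$1" | tr ',' '\n' | nl
}

# Usage: list_columns ~/hive/2008.csv
```

The number of fields printed should match the number of columns declared for the Hive table.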
- Verify contents of HDFS
hadoop fs -ls /
- Pi job
hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop*examples.jar pi 10 100
- Wordcount job
hadoop fs -mkdir input
hadoop fs -put /etc/hadoop/conf/*.xml input
hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop*examples.jar wordcount input output
By the way, if you re-run this job, it will fail. Why is that?
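One likely answer to the question above: MapReduce refuses to start a job whose output directory already exists, so the second run stops because `output` is still there from the first run. A minimal sketch of a re-runnable version, assuming the same example jar path as above and a Hadoop release whose `hadoop fs` supports `-rm -r` (older releases use `-rmr`); `rerun_wordcount` is a hypothetical helper name:

```shell
# Delete the previous output directory (if any), then run wordcount again.
# $1 = HDFS input directory, $2 = HDFS output directory.
rerun_wordcount() {
  # Ignore the error from -rm when the directory does not exist yet.
  hadoop fs -rm -r "$2" 2>/dev/null || true
  hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop*examples.jar wordcount "$1" "$2"
}

# Usage: rerun_wordcount input output
```

Writing each run to a fresh output directory name would work just as well and keeps earlier results around.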
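- Load the on-time flight performance data from 2008 into HDFS
hadoop fs -mkdir /user/hive/warehouse/flight_data
hadoop fs -put ~/hive/2008.csv /user/hive/warehouse/flight_data
- Verify it got loaded
hadoop fs -ls /user/hive/warehouse/flight_data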
- Create Hive table
CREATE EXTERNAL TABLE flight_data(
year INT,
month INT,
day INT,
day_of_week INT,
dep_time INT,
crs_dep_time INT,
arr_time INT,
crs_arr_time INT,
unique_carrier STRING,
flight_num INT,
tail_num STRING,
actual_elapsed_time INT,
crs_elapsed_time INT,
air_time INT,
arr_delay INT,
dep_delay INT,
origin STRING,
dest STRING,
distance INT,
taxi_in INT,
taxi_out INT,
cancelled INT,
cancellation_code STRING,
diverted INT,
carrier_delay STRING,
weather_delay STRING,
nas_delay STRING,
security_delay STRING,
late_aircraft_delay STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/flight_data';
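Once the table exists, a quick aggregate confirms that Hive can read the file. A minimal sketch, assuming the `hive` CLI is on the PATH and the table was created as above; the query and the helper name `avg_arr_delay_by_carrier` are illustrative, not part of the original presentation:

```shell
# Run a HiveQL query from the shell: average arrival delay per carrier.
avg_arr_delay_by_carrier() {
  hive -e "
    SELECT unique_carrier, AVG(arr_delay) AS avg_arr_delay
    FROM flight_data
    WHERE year IS NOT NULL
    GROUP BY unique_carrier
    ORDER BY avg_arr_delay DESC
    LIMIT 10;
  "
}

# Usage: avg_arr_delay_by_carrier
```

The `year IS NOT NULL` filter is a precaution: if the CSV begins with a header line, Hive would read it as a data row whose INT fields are NULL.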
To disable safe mode:
hadoop dfsadmin -safemode leave
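The safe mode state can be checked before forcing the NameNode out of it. A minimal sketch, assuming `hadoop dfsadmin -safemode get` reports the state with the text `Safe mode is ON`; `leave_safemode_if_on` is a hypothetical helper name:

```shell
# Leave HDFS safe mode only when the NameNode reports that it is on.
leave_safemode_if_on() {
  if hadoop dfsadmin -safemode get | grep -q "Safe mode is ON"; then
    hadoop dfsadmin -safemode leave
  else
    echo "Safe mode is already off"
  fi
}

# Usage: leave_safemode_if_on
```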