Apache Hive dealing with different data format

Introduction

When it comes to big data and modern warehousing technology you must have heard about Apache hive.

Official Definition- The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. The structure can be projected onto data already in storage. A command-line tool and JDBC driver are provided to connect users to Hive.

Hive is created at Facebook but later Facebook donated hive to the Apache community.

Hive provides SQL like language called HiveQL with schema on read and transparently converts queries to MapReduce, Apache Tez, and Spark jobs.

One thing which makes the hive different from other databases/warehouses is it can digest all types of data format (Structured and semi Structured) and it uses Tez/map reduces in the background which reduces time to be executed by Hive Query.

In this blog, we will cover converting different data types(semi-structured) residing in a data file into hive structured tables. We will take input files as comma-separated values(CSV), TSV, Avro, parquet, JSON, and XML.

Our input file has Setup as follows.

Data we are going to process

Comma-Separated Values(CSV)

For CSV file we need to create a table in the location(hdfs) where CSV file is present.

create external table data (
Name string,
emp_id bigint,
salary decimal(12,6),
address string,
gender string,
is_new boolean
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘,’
STORED AS TEXTFILE
LOCATION ‘/bank/csv’
tblproperties(“skip.header.line.count”=”1");

Since we have a header in our text file so we have used skip.header property in tblproperties.

TSV(Tab Separated Values)

TSV files are just like CSV files just their separator is different. Instead of “,” are going to use “\t”.

create external table data_tsv (
Name string,
emp_id bigint,
salary decimal(12,6),
address string,
gender string,
is_new boolean
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\t’
STORED AS TEXTFILE
LOCATION ‘/bank/tsv/’
tblproperties(“skip.header.line.count”=”1");

This method can be applied for any character separated file including |(pipe), \001, etc.

Avro

Before Jumping into Avro ingestion we must know what Serde means in the Apache hive.

Serde- SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for IO. The interface handles both serialization and deserialization and also interpreting the results of serialization as individual fields for processing. A SerDe allows Hive to read in data from a table, and write it back out to HDFS in any custom format. Anyone can write their own SerDe for their own data formats.

For Avro, we need to have one Avro schema file. This ingestion can be achieved by using Avro Serde. Since we are using Avro schema for inferring the data type and column names hence we do not need to specify the same at the time of table creation. The schema is kept at in HDFS location.

CREATE TABLE data_avro
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.avro.AvroSerDe’
STORED AS INPUTFORMAT ‘org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat’
OUTPUTFORMAT ‘org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat’
LOCATION ‘/data/avro/’
tblproperties(“avro.schema.url”=”/data/schema/schema.avsc”);

We can use schema directly without specifying the file path using below tblproperty

tblproperties(“avro.schema.literal”=’{“type”:”record”,”name”:”emp_data”,”namespace”:”org.apache.hive.example”,”fields”:[{“name”:”Name”,”type”:”string”},{“name”:”emp_id”,”type”:”int”},{“name”:”salary”,”type”:”int”},{“name”:”address”,”type”:”string”},{“name”:”gender”,”type”:”string”},{“name”:”Is_new”,”type”:”boolean”}]}’)

Parquet

Parquet is quite tricky since we can not specify parquet schema in table creation. so we need to make sure what data type and column name are there in parquet schema should exactly match with the column name and column data type we are giving in hive table structure.

For parquet schema, we can use mr-parquet library.

command to fetch parquet schema using parquet tools from mr-parquet project.

java -jar parquet-tools-1.12.0-SNAPSHOT.jar schema data.parquet

parquet schema of our data-

message org.apache.nifi.nifiRecord {
optional binary Name (STRING);
optional int32 emp_id;
optional int32 salary;
optional binary address (STRING);
optional binary gender (STRING);
optional boolean Is_new;
}

Table structure-

create external table data_parquet (
Name string,
emp_id bigint,
salary int,
address string,
gender string,
is_new boolean
)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe’
STORED AS INPUTFORMAT ‘org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat’
OUTPUTFORMAT ‘org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat’
LOCATION ‘/data/parquet’;

JSON

For the JSON file type, we have two ways to do our task.

  1. Create a temp table to load JSON as a string and then use the get_json_object() function in Apache hive and load attribute to our target table.
  2. Using JSON serde.

you can use ‘org.apache.hadoop.hive.serde2.JsonSerDe’ Serde.

Since we have seen how to use Serde in the previous format. Now we will use the first method. We will load json data into a temp table as a string and then we load our data into the target table. get_json_obj() is case-sensitive so Make sure in get_json_obj() you have correct case else you get a null value in column. Also, make sure that there is one json record per hive row.

CREATE TEMPORARY TABLE

create table temp_json (
body string
)
LOCATION ‘/data/json’;

TARGET TABLE STRUCTURE

create table data_json (
Name string,
emp_id bigint,
salary int,
address string,
gender string,
is_new boolean
)
STORED AS TEXTFILE
LOCATION ‘/data/data_json/’;

Inserting data into main table

insert overwrite table data_json
select get_json_object(body,’$.Name’) as Name,
get_json_object(body,’$.emp_id’) as emp_id ,
get_json_object(body,’$.salary’) as salary ,
get_json_object(body,’$.address’) as address ,
get_json_object(body,’$.gender’) as gender ,
CASE lower(get_json_object(body,’$.Is_new’) ) WHEN “true” THEN true WHEN “false” THEN false ELSE NULL END as Is_new
from temp ;

We can also use an array and array of structure in get_json_obj() if needed.

You have observed that in the case of Is_new we have used a case-when-then statement. That was purposeful because Apache hive converts any string to boolean true when it is not null. So we have to explicitly tell hive to if true then set Is_new to true else-if false else NULL.

XML

In XML, we need to use Xpath for data ingestion. just like JSON, we have two ways to ingest data.

  1. Using XML Serde. we can use com.ibm.spss.hive.serde2.xml.XmlInputFormat. This serde is not available by default in Apache hive so we need to import this from Github.
  2. create a temp table and extract data from the temp table using XPath and load data into the target table.

we will be using the second method here. If you want that I should show you how to use Serde in XML and JSON please let me know in the comment section. I will post the same.

TEMP TABLE

create table temp_xml (
body string
) LOCATION ‘/data/xml’;

TARGET TABLE

create table data_xml (
Name string,
emp_id bigint,
salary int,
address string,
gender string,
is_new boolean
) STORED AS TEXTFILE
LOCATION ‘/data/data_xml/’;

Extracting data from temp(temp_xml) table and inserting to target(data_xml) table.

insert overwrite table data_xml
select
xpath_string(body,’element/Name/text()’) as Name,
xpath_int(body,’element/emp_id/text()’) as emp_id,
xpath_float(body,’element/salary/text()’) as salary,
xpath_string(body,’element/address/text()’) as address ,
xpath_string(body,’element/gender/text()’) as gender,
CASE lower(xpath_string(body,’element/Is_new/text()’) ) WHEN “true” THEN true WHEN “false” THEN false ELSE NULL END as Is_new
from temp_xml;

If we use Xpath directly it will give results as an array like this.

So we have used xpath_{data_type of column}, So we could save some computation.

Conclusion

A hive is a wonderful tool to query big data since it uses a map-reduce paradigm in the background and we didn’t have to worry about Java code. We just need to write SQL like queries and adding Beeline into hive added more value to data visualization. there are many optimization techniques in Apache hive like partitioning, bucketing which makes query runtime smaller. But according to me which makes it different from all other RDBMS, is its ability to ingest every type of structured and semi-structured data.

It is even easier in Apache Spark. Let me know in the comment section if you want me to write a blog on that.

Senior data Engineer at lumiq.ai