
How a newline can ruin your Hive


If you do not fully understand how Hive/Impala stores your data, it might cost you badly.

I've learnt this the hard way.

Symptom #1: Weird values in ingested Hive table

You double-checked with select distinct(gender) from customers that the gender column in your source RDBMS really contains only the values male, female and NULL. However, when you ingest the table into Hive (maybe with Apache Sqoop or Cornet) and run the same check there, you see that weird values have crept in:

> select distinct(gender) from customers;  -- Run in Hive/Impala
+-----------------+
| gender          |
+-----------------+
| NULL            |
| CA 94304        |
| male            |
| Page Mill Road  |
| female          |
+-----------------+

Symptom #2: Inconsistent size of ingested Hive table

You check with select count(*) from customers that the table in your source RDBMS has 156,010 rows. You ingest the table into Hive and BAM! All of a sudden there are 14 more customers.

Maybe the business is doing great and you gained 14 customers before you started the ingestion? Wondering, you check the source table size again.

Still 156,010.
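
For the record, the check is identical on both sides; only the result differs. A minimal sketch, with the counts taken from the example above:

-- Run in the source RDBMS
select count(*) from customers;   -- 156,010

-- Run in Hive/Impala after ingestion
select count(*) from customers;   -- 156,024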

Symptom #3: Weird data when copied from another Hive table

This one is by far the strangest.

You already have the customers table ingested in Hive. Values in the gender column look fine. The table has the correct size of 156,010 rows. All is fine.

You do some data cleaning with SQL and copy the result into a new table as follows:

CREATE TABLE customers_superclean
AS SELECT name, coalesce(gender, 'unknown') AS gender FROM customers;

You check the size of the new table. BAM! 25 new customers in there. Impossible!
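
If you want to see where the extra customers come from, a group-by on the copied column is telling. A minimal sketch, assuming the gender alias from the statement above:

select gender, count(*) from customers_superclean group by gender;

Values that are clearly not genders, plus unexpected NULLs, will show up next to male, female and unknown.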

No errors, no warnings

Neither Hive nor Impala nor Sqoop gave you any error or warning. You have no idea what's going on. Somewhere at the back of your head, you start questioning the whole Hadoop infrastructure. You feel like the cool stuff you do all day with the data has been compromised.

Cause: Hive delimiters present in the data

All these problems can occur if the ingested data contains characters that Hive uses to delimit fields and rows. Typically, these are newlines. For instance, let's assume that the customers table in the source RDBMS contains the following data (notice the newline in the first street name):

gender | street
-------+--------------------------
male   | Page Mill Road\nCA 94304
female | Great America Parkway

If one naively ingests this data into a Hive table using the default settings (a text file with fields delimited by \001 and rows delimited by newlines), the data gets broken. Look at the corresponding delimited file in HDFS:

male\001Page Mill Road
CA 94304
female\001Great America Parkway

When Hive/Impala reads from this file, it finds three customers instead of two. The additional customer has gender CA 94304 and no street specified. On top of that, the street field of the first customer is missing the postal code.

Particularly interesting case: Copying binary data to text file

Assume that the Hive table called customers uses Avro or Parquet for data storage and that the data contains newlines. Querying the customers table directly via Hive or Impala works as expected. However, let's create a new table as follows:

CREATE TABLE customers_superclean
AS SELECT gender, street FROM customers;

The problem with this command is that the new table is backed by a newline-delimited text file. An Avro/Parquet record which contains a newline will be split into two records in the new table. Symptom #3 is born.
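
You can estimate in advance how many records such a copy will break. A minimal sketch, assuming street is the only column that may contain a newline:

select count(*) from customers where street like '%\n%';

Each embedded newline adds one extra record to the text-backed copy.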

Solution 1: Use binary storage formats like Avro or Parquet

If possible, use binary storage for your Hive tables, for instance Apache Avro or Apache Parquet. Since these formats do not use dedicated characters to split a file into records and fields, Hive/Impala can read data with special characters properly.

Also, Avro and Parquet make it possible to safely copy records from one Hive table to another. For instance, check out the following command, which copies data into a Parquet table:

CREATE TABLE customers_superclean STORED AS parquet
AS SELECT gender, street FROM customers;

The Parquet storage ensures that even if data in the customers table contains newlines or other delimiters, the data will be properly copied and interpreted in the new table.
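
As a quick sanity check after such a copy, the row counts of both tables should match (numbers from the example above):

select count(*) from customers;              -- 156,010
select count(*) from customers_superclean;   -- 156,010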

Solution 2: Ensure the delimited text file does not contain Hive delimiters

When ingesting data into a delimited text file, you have to ensure that the file does not contain characters that Hive uses to split data into rows and fields. In general, there are two options to achieve this:

1. Clean the data of Hive delimiters before or during ingestion. Sqoop, for instance, provides the --hive-drop-import-delims and --hive-delims-replacement options for exactly this purpose.
2. Instruct Hive to use delimiters that do not occur in the data, for example:

CREATE TABLE customers (gender STRING, street STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  LINES TERMINATED BY '\002';
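
If you go for the first option and the data is already queryable in Hive (for instance, in an Avro/Parquet-backed table), regexp_replace can strip the offending characters during a copy. A minimal sketch; the customers_clean table name is just for illustration, and only the street column is assumed to contain newlines:

CREATE TABLE customers_clean
AS SELECT gender, regexp_replace(street, '\\n|\\r', ' ') AS street
FROM customers;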

Happy ingesting!

This blog is written by Marcel Krcah, an independent consultant for product-oriented software engineering. If you like what you read, sign up for my newsletter.