How to ingest data from Azure DataMarket into Hadoop

Published on 01 April 2015, in #data-engineering

For those of you who haven't encountered it yet, Azure DataMarket is quite an exciting platform by Microsoft which provides standardized access to plenty of interesting datasets. As of now, there are over 200 datasets from broad range of topics, including weather, demographics, automotive, agriculture, real-estate, etc. There might be a golden nugget for your bussiness hidden in there, so I would definitely recommend to go and check the platform out.

Recently, we have discovered such a golden nugget ourselves: details about every car sold in Netherlands in the past 15 years. A pretty exciting dataset, considering that the company I work for operates in the Dutch market of electric vehicles. What is more, the data is free, updated daily and comes shipped with a REST API interface. What more can we wish for?

To maximally leverage potential of the dataset, we ingest it into Hive, which allows us to:

run fast ad-hoc exploratory SQL queries with Impala,
explore the dataset with BI tools,
enrich the data model by combining car details with other data and
expand our dashboards with insights about the car market.

Avro-backed external Hive tables FTW! #

Regarding the storage, we opted for an Avro-backed external Hive table. These types of tables rock, since:

Hive automatically infers table schema from the Avro schema. There is no need to explicitly enumerate columns with their types, which is very useful if you deal with tables of many columns.
Changes in table schema are not an issue. You just recreate the table with the new Avro schema and the Avro schema evolution will automagically take care of properly reading the data stored with the old schema.
Data is easily processed by other tools like Apache Spark or MapReduce. The data resides in a directory of your choice (rather than in a Hive directory) and the Avro format has first-class support in Hadoop tooling.
New data is easily added to the table. Just add a new Avro file to the corresponding HDFS directory and Hive will automatically pick it up. With Impala, call REFRESH <table-name>; to notify Impala about arrival of new data.
Hive data partitioning is well supported so you can optimize for query performance.
Dropping external Hive table does not delete data from HDFS. Small thing, but very handy.

Ingestion: From OData XML to an Avro Hive table #

Here are detailed steps how we get that data from DataMarket into Hive.

Step 1: Download the dataset as XML files, in a standardized OData format #

DataMarket publishes datasets as XML. Each XML follows the same OData schema (checkout the example XML with cars data). DataMarket provides this data via a REST interface, which supports various query parameters. Using the parameters, you control the subset of data you want to download.

Since the car dataset that we are interested in is about 15 GiB large, filtering proved very useful for our purposes. With the $filter, $top and $inlinecount parameters and a bit of curl and Bash, it is straightforward to download the whole dataset into files cars_2000.xml till cars_2015.xml, where the number indicates the year where a car was registered in Netherlands.

Step 2: Convert the dataset into Avro, using `odata2avro` utility #

To convert an XML file in OData schema to Avro, we created a Python command-line tool called odata2avro which does all the heavy lifting for you.

Just install the tool with pip install odata2avro and use it as follows:

$ odata2avro cars_2013.xml cars.avsc cars_2013.avro

This command reads an XML file cars_2013.xml and creates two files:

cars.avsc - an Avro schema, in json, describing the dataset,
cars_2013.avro - a binary Avro file containing the dataset from the XML.

Step 3: Upload the Avro schema and Avro files to HDFS #

In this case, let's create and upload the data to /datamarket HDFS directory:

$ hdfs dfs -mkdir -p /datamarket/cars
$ hdfs dfs -put cars_20*.xml /datamarket/cars
$ hdfs dfs -put cars.avsc /datamarket

Step 4: Create an external Avro-backed Hive table #

To create an external Hive table with a schema according to /datamarket/cars.avsc and the data located in /datamarket/cars, use the following Hive command with the AvroSerDe:

CREATE EXTERNAL TABLE cars
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/datamarket/cars'
TBLPROPERTIES ('avro.schema.url'='hdfs:///datamarket/cars.avsc');

Step 5: Query & profit! #

Congrats! As of now, the data is accessible in both Hive and Impala:

$ impala-shell --query 'select * from cars;'

Bonus step: Keeping data updated #

The car dataset is append-only, so for us it's pretty straightforward to keep the data in Hive updated. We run a daily job which:

downloads XML data from DataMarket for the current year 2015,
converts the XML data to an Avro file called cars_2015.avro,
uploads the Avro file to HDFS and replaces the file uploaded on the previous day and
calls REFRESH cars; so Impala registers the new data.

That's all!

← Previous post: Introduction to Apache Zookeeper, backbone of big-data systems
→ Next post: How a newline can ruin your Hive

This blog is written by Marcel Krcah, an independent consultant for product-oriented software engineering. If you like what you read, sign up for my newsletter