Using Apache Avro with Java and Maven

March 8th, 2014 by Micha Kops

Apache Avro is a serialization framework similar to Google’s Protocol Buffers or Apache Thrift. It offers features like rich data structures, a compact binary format, simple integration with dynamic languages and more.

In the following short five-minute tutorial, we’re going to specify a schema for books in JSON format, use the Avro Maven plugin to generate the Java classes from it and finally serialize some books into a single file.

Why another Framework
Maven Dependencies
Defining the Book Schema
Schema Compiling / Generating Classes
Serializing / Deserializing from a File
Tutorial Sources
Resources

Why another Framework

What are the advantages of Avro over Google Protocol Buffers (see my article about Protocol Buffers) or Apache Thrift?

Imho one thing I like is the use of JSON as the format for schema definitions. Another good thing is the fact that the schema is written to the serialized file, so there might be fewer problems when using different versions of a schema.

If you’re interested in a more detailed (but biased) comparison, feel free to have a look at this nice presentation from Igor Anishchenko.

Maven Dependencies

We’re adding two entries to our pom.xml: one is the Apache Avro library as a dependency, the other one is the Avro Maven plugin that allows us to generate Java classes from our schema files.

We’re configuring the plugin to look in src/main/avro for schema files and to put the generated Java classes into src/main/java.

<dependencies>
	<dependency>
		<groupId>org.apache.avro</groupId>
		<artifactId>avro</artifactId>
		<version>1.7.6</version>
	</dependency>
</dependencies>
 
<build>
	<plugins>
		<plugin>
			<groupId>org.apache.avro</groupId>
			<artifactId>avro-maven-plugin</artifactId>
			<version>1.7.6</version>
			<executions>
				<execution>
					<phase>generate-sources</phase>
					<goals>
						<goal>schema</goal>
					</goals>
					<configuration>
						<sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
						<outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
					</configuration>
				</execution>
			</executions>
		</plugin>
	</plugins>
</build>

Defining the Book Schema

An Avro schema is written in JSON and we may use different primitive or complex types here. A union like ["int", "null"] means that the field holds either an int or no value at all.

Detailed information can be found in the Avro documentation.

This is our book schema in src/main/avro/book.avsc:

{
	"namespace": "com.hascode.entity",
	"type": "record",
	"name": "Book",
	"fields": [
		{"name": "name", "type": "string"},
		{"name": "id",  "type": ["int", "null"]},
		{"name": "category", "type": ["string", "null"]}
	 ]
}
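
To get a feeling for what the union types in this schema mean for the serialized data, here is a minimal sketch that encodes a single book by hand using only the JDK. It is not part of the tutorial sources and does not use the Avro library; the class BookBinaryLayoutSketch and its methods are made up for illustration. It follows the binary encoding as I understand it from the Avro specification: ints and longs are zig-zag encoded variable-length integers, a string is its UTF-8 length followed by the bytes, and a union is the zero-based index of the chosen branch followed by that branch's value (nothing at all for null).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hand-rolled sketch of the binary layout of one Book record; NOT the Avro library. */
class BookBinaryLayoutSketch {

	/** Writes a number as a zig-zag encoded variable-length integer. */
	static void writeLong(ByteArrayOutputStream out, long value) {
		long zigzag = (value << 1) ^ (value >> 63);
		while ((zigzag & ~0x7FL) != 0) {
			// 7 payload bits per byte, high bit set means "more bytes follow"
			out.write((int) ((zigzag & 0x7F) | 0x80));
			zigzag >>>= 7;
		}
		out.write((int) zigzag);
	}

	/** Writes a string as its UTF-8 byte length followed by the bytes. */
	static void writeString(ByteArrayOutputStream out, String s) {
		byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
		writeLong(out, utf8.length);
		out.write(utf8, 0, utf8.length);
	}

	/** Encodes the fields in schema order: name, id, category. */
	static byte[] encodeBook(String name, Integer id, String category) {
		ByteArrayOutputStream out = new ByteArrayOutputStream();
		writeString(out, name);
		if (id != null) {
			writeLong(out, 0); // union branch 0 = "int"
			writeLong(out, id);
		} else {
			writeLong(out, 1); // union branch 1 = "null", no value bytes
		}
		if (category != null) {
			writeLong(out, 0); // union branch 0 = "string"
			writeString(out, category);
		} else {
			writeLong(out, 1); // union branch 1 = "null", no value bytes
		}
		return out.toByteArray();
	}

	public static void main(String[] args) {
		byte[] bytes = encodeBook("Some book", 456, "Horror");
		System.out.println(bytes.length + " bytes");
	}
}
```

With these rules, the book "Some book" / 456 / "Horror" takes 21 bytes. No field names appear in the data, which is why a reader needs the schema to make sense of it. The records in a real Avro data file are additionally wrapped in blocks, which this sketch ignores.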

Schema Compiling / Generating Classes

You may run the following command to create the Book class needed from the schema file:

mvn generate-sources

Serializing / Deserializing from a File

The following snippet serializes three books to a temporary file, afterwards deserializes them again and prints each one to the output.

package com.hascode.tutorial;
 
import java.io.File;
import java.io.IOException;
 
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;
 
import com.hascode.entity.Book;
 
public class FileSerializationExample {
	public static void main(final String[] args) throws IOException {
		Book book1 = Book.newBuilder().setId(123).setName("Programming is fun")
				.setCategory("Fiction").build();
		Book book2 = new Book("Some book", 456, "Horror");
		Book book3 = new Book();
		book3.setName("And another book");
		book3.setId(789);
		File store = File.createTempFile("book", ".avro");
 
		// serializing
		System.out
				.println("serializing books to temp file: " + store.getPath());
		DatumWriter<Book> bookDatumWriter = new SpecificDatumWriter<Book>(
				Book.class);
		DataFileWriter<Book> bookFileWriter = new DataFileWriter<Book>(
				bookDatumWriter);
		bookFileWriter.create(book1.getSchema(), store);
		bookFileWriter.append(book1);
		bookFileWriter.append(book2);
		bookFileWriter.append(book3);
		bookFileWriter.close();
 
		// deserializing
		DatumReader<Book> bookDatumReader = new SpecificDatumReader<Book>(
				Book.class);
		DataFileReader<Book> bookFileReader = new DataFileReader<Book>(store,
				bookDatumReader);
		while (bookFileReader.hasNext()) {
			Book b1 = bookFileReader.next();
			System.out.println("deserialized from file: " + b1);
		}
		// release the underlying file handle
		bookFileReader.close();
	}
 
}

Running the example code should produce the following output:

serializing books to temp file: /tmp/book5516033028097754203.avro
deserialized from file: {"name": "Programming is fun", "id": 123, "category": "Fiction"}
deserialized from file: {"name": "Some book", "id": 456, "category": "Horror"}
deserialized from file: {"name": "And another book", "id": 789, "category": null}
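
Because the schema is stored in the header of the file, it can be recovered from the file alone. The following sketch, again JDK-only and not part of the tutorial sources, pulls the schema JSON out of such a header. It assumes the object container file layout as I understand it from the Avro specification: the magic bytes Obj plus the byte 1, followed by a metadata map whose avro.schema entry holds the schema. The class and method names are hypothetical; in a real project you would rather ask the DataFileReader for its schema via getSchema().

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

/** Sketch: reads the schema JSON from the header of an Avro data file; NOT the Avro library. */
class AvroHeaderSketch {
	static final byte[] MAGIC = { 'O', 'b', 'j', 1 };

	/** Reads a zig-zag encoded variable-length integer. */
	static long readLong(ByteBuffer buf) {
		long raw = 0;
		int shift = 0;
		int b;
		do {
			b = buf.get() & 0xFF;
			raw |= (long) (b & 0x7F) << shift;
			shift += 7;
		} while ((b & 0x80) != 0);
		return (raw >>> 1) ^ -(raw & 1);
	}

	/** Reads a length-prefixed byte sequence (used for both map keys and values). */
	static byte[] readBytes(ByteBuffer buf) {
		byte[] data = new byte[(int) readLong(buf)];
		buf.get(data);
		return data;
	}

	/** Returns the value of the "avro.schema" metadata entry, or null if there is none. */
	static String readSchemaJson(byte[] fileBytes) {
		ByteBuffer buf = ByteBuffer.wrap(fileBytes);
		byte[] magic = new byte[MAGIC.length];
		buf.get(magic);
		if (!Arrays.equals(magic, MAGIC)) {
			throw new IllegalArgumentException("not an Avro data file");
		}
		// the metadata map is written as blocks of entries, terminated by a block of size 0
		for (long count = readLong(buf); count != 0; count = readLong(buf)) {
			if (count < 0) {
				count = -count;
				readLong(buf); // a negative count is followed by the block size in bytes
			}
			for (long i = 0; i < count; i++) {
				String key = new String(readBytes(buf), StandardCharsets.UTF_8);
				byte[] value = readBytes(buf);
				if ("avro.schema".equals(key)) {
					return new String(value, StandardCharsets.UTF_8);
				}
			}
		}
		return null;
	}

	public static void main(String[] args) throws IOException {
		if (args.length != 1) {
			System.err.println("usage: AvroHeaderSketch <file.avro>");
			return;
		}
		System.out.println(readSchemaJson(Files.readAllBytes(Paths.get(args[0]))));
	}
}
```

Passing the path of the temp file printed by the example above as the only argument should print the Book schema as JSON.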

Tutorial Sources

Please feel free to download the tutorial sources from my Bitbucket repository, fork it there or clone it using Git:

git clone https://bitbucket.org/hascode/avro-tutorial.git

Resources

Tags: Apache, avro, google protocol buffers, maven, serialization, thrift

This entry was posted on Saturday, March 8th, 2014 at 8:26 pm and is filed under Java.

4 Responses to “Using Apache Avro with Java and Maven”

Juergen Weber Says:
March 10th, 2014 at 1:42 pm
or with JDK onboard Plain Old CORBA:

module entity
{
	struct Book
	{
		string name;
		long id;
		string category;
	};

	typedef sequence<Book> BookSequence;
};

package t;

import java.io.FileOutputStream;
import java.util.Properties;

import org.omg.CORBA.Any;
import org.omg.CORBA.ORB;
import org.omg.IOP.Codec;
import org.omg.IOP.CodecFactory;
import org.omg.IOP.CodecFactoryHelper;
import org.omg.IOP.ENCODING_CDR_ENCAPS;
import org.omg.IOP.Encoding;

import entity.Book;
import entity.BookSequenceHelper;

public class T {

	public static void main(String[] args) throws Exception {
		ORB orb = ORB.init(args, null);

		org.omg.CORBA.Object obj = orb.resolve_initial_references("CodecFactory");
		CodecFactory codecFactory = CodecFactoryHelper.narrow(obj);

		Codec codec = codecFactory.create_codec(new Encoding(
				ENCODING_CDR_ENCAPS.value, (byte) 1, (byte) 2));

		Book[] books = {
				new Book("Programming is fun", 1, "Fiction"),
				new Book("Some book", 2, "Horror") };

		Any any = orb.create_any();
		BookSequenceHelper.insert(any, books);

		byte[] b = codec.encode(any);

		FileOutputStream fos = new FileOutputStream("books.iiop");
		fos.write(b);
		fos.close();

		Any decoded = codec.decode(b);
		Book[] books2 = BookSequenceHelper.extract(decoded);

		System.out.println(books2);
	}
}
micha kops Says:
March 10th, 2014 at 9:58 pm
Interesting, haven’t seen CORBA in a while! Thanks for your input! :)
laki Says:
February 22nd, 2018 at 1:02 pm
Nice article, it helped me to understand the Avro Maven plugin. I am working on my college project, thank you very much.

Could you please tell me where the Maven command is executed, I mean in which folder? Correct me if I am wrong: do I need to run it where the Avro schema file is located?

Can you explain how to do this using Eclipse? I am having issues with that approach.

Micha Kops Says:
February 22nd, 2018 at 7:53 pm

Hi,

this is my project structure:

.
├── pom.xml
├── README.md
└── src
    ├── main
    │   ├── avro
    │   │   └── book.avsc
    │   ├── java
    │   │   └── com
    │   │       └── hascode
    │   │           ├── entity
    │   │           │   └── Book.java
    │   │           └── tutorial
    │   │               └── FileSerializationExample.java
    │   └── resources
    └── test
        ├── java
        └── resources

I’m running Maven from the project directory. In Eclipse IDE you might need the m2eclipse integration.