Saturday, June 21, 2014

Pail Framework - Thoughts

In the book Big Data (MEAP), Nathan Marz describes a framework called Pail (dfs-datastores) [1], which is a data storage solution on top of Hadoop. Among other things, it supports record schemas and merges small files into larger chunks for better HDFS performance.
Some points to note about Pail: 
  • Pail is a thin abstraction over files and folders from the dfs-datastores library.
  • Pail makes it significantly easier to manage a collection of records in a batch processing context.
  • The Pail abstraction frees us from having to think about file formats and greatly reduces the complexity of the storage code.
  • It enables us to vertically partition a dataset, and it provides a dead-simple API for common operations like appends, compression, and consolidation.
  • Pail is just a Java library and underneath it uses the standard file APIs provided by Hadoop.
  • Pail makes it easy to satisfy all of the requirements we have for storage on the batch layer.
  • We treat a pail as an unordered collection of records.
  • Internally, those records are stored across files that can be nested in arbitrarily deep sub-directories.
Because a pail is just files and folders on HDFS, we can use standard Hadoop tools to access them. Pail files are given globally unique names, so a pail can be written to concurrently by multiple writers without conflicts. Additionally, a reader can read from a pail while it's being written to without having to worry about half-written files.
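As a minimal sketch of the concurrency point (hypothetical path; this assumes Pail.create with no structure argument, which stores raw byte[] records), two writers can hold output streams into the same pail at once because each stream writes to its own uniquely named file:

Pail pail = Pail.create("/tmp/mypail");        // a raw pail stores byte[] records
TypedRecordOutputStream a = pail.openWrite();
TypedRecordOutputStream b = pail.openWrite();  // a second, concurrent writer
a.writeObject(new byte[] {1, 2, 3});           // each stream writes to its own
b.writeObject(new byte[] {4, 5, 6});           // uniquely named file, so no conflicts
a.close();
b.close();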
Typed Pails
We don't have to work with raw binary records when using Pail; it lets us work with real objects instead. At the file level, data is stored as a sequence of bytes. To work with real objects, we provide Pail with information about what type our records will be and how to serialize and deserialize objects of that type to and from binary data.
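The type information is supplied through a PailStructure implementation, which tells Pail the record class, how to serialize and deserialize records, and (via getTarget and isValidTarget) how records should be vertically partitioned into sub-directories. Here is a minimal sketch of what the IntegerPailStructure used below can look like; it is based on the PailStructure interface from dfs-datastores (the package name varies across versions), and the no-partitioning choice (an empty target list) is just the simplest option:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Collections;
import java.util.List;

import com.backtype.hadoop.pail.PailStructure;  // older releases: backtype.hadoop.pail

public class IntegerPailStructure implements PailStructure<Integer> {
    public Class getType() {
        return Integer.class;
    }

    // Encode each record as its 4-byte representation.
    public byte[] serialize(Integer record) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(record);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return bytes.toByteArray();
    }

    public Integer deserialize(byte[] record) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(record))) {
            return in.readInt();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // An empty target list means records are not vertically partitioned
    // into sub-directories.
    public List<String> getTarget(Integer record) {
        return Collections.emptyList();
    }

    // With no partitioning scheme, any directory layout is a valid target.
    public boolean isValidTarget(String... dirs) {
        return true;
    }
}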
Ex: To create an integer pail:

Pail<Integer> intpail = Pail.create("/tmp/intpail", new IntegerPailStructure());
When writing records to the pail, we can give it Integer objects directly and Pail will handle the serialization, as shown in the following code:
TypedRecordOutputStream int_os = intpail.openWrite();
int_os.writeObject(1);
int_os.writeObject(2);
int_os.writeObject(3);
int_os.close();
Likewise, when we read from the pail, it will deserialize records for us. Here's how we can iterate through all the objects in the integer pail we just wrote to:
for (Integer record : intpail) {
    System.out.println(record);
}
Pail - Appends 
Using the append operation, we can add all the records from one pail into another. Here's an example of appending a pail called “source” into a pail called “target”:
Pail source = new Pail("/tmp/source");
Pail target = new Pail("/tmp/target");
target.copyAppend(source);
The append operation is smart: it checks both pails to make sure it's valid to append them together. For example, it won't let us append a pail that stores strings into a pail that stores integers.
There are three types of supported append operations, sketched below:
  1. copyAppend
  2. moveAppend
  3. absorb
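The latter two aren't shown above, so here's a rough sketch (the semantics in the comments are my understanding of the library, not something covered in detail here): moveAppend moves the source pail's files instead of copying them, and absorb picks the cheapest valid strategy.

Pail source = new Pail("/tmp/source");
Pail target = new Pail("/tmp/target");

// In practice we would pick one of these, not run all three in sequence:
target.copyAppend(source);  // copies source's files; source is left intact
target.moveAppend(source);  // moves the files instead; cheaper, but empties source
target.absorb(source);      // chooses the best strategy (move when possible, else copy)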
Pail - Consolidate
Sometimes records end up being spread across lots of small files. This carries a major performance cost when we want to process that data in a MapReduce job, since MapReduce will need to launch many more tasks.
The solution is to combine those small files into larger files so that more data can be processed in a single task. Pail supports this directly by exposing a consolidate method. This method launches a MapReduce job that combines files into larger files in a scalable way.
To consolidate a pail, we simply call pail.consolidate().
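As a minimal sketch in context (hypothetical path, assuming the target pail from the append example has accumulated many small files):

Pail pail = new Pail("/tmp/target");
pail.consolidate();  // launches a MapReduce job that merges small files into larger ones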
Summary
PROS: 
It's important to be able to think about and manipulate data at the record level rather than the file level, and by abstracting away file formats and directory structure, Pail lets us do exactly that. It frees us from the details of data storage while making it easy to do robust, enforced vertical partitioning as well as common operations like appends and consolidation. Without Pail, these basic tasks are manual and difficult; with it, vertical partitioning happens automatically and appends and consolidation are one-liners. This means we can focus on how we want to process our records rather than on how to store them.
CONS:
- Lack of active developer support.
- Not many developers working on or committing to the official Git repository.
- Not many open-source projects using it.
- Lack of documentation.
Additional References:
[1] https://github.com/nathanmarz/dfs-datastores
[2] http://misaxionsoftware.wordpress.com/2012/07/10/step-to-big-data-hello-pail/#more-190
