<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Simon Buckle&#039;s Weblog</title>
	<atom:link href="http://www.simonbuckle.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.simonbuckle.com</link>
	<description>Random thoughts for random people</description>
	<lastBuildDate>Wed, 05 Jun 2013 10:32:54 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Quote</title>
		<link>http://www.simonbuckle.com/2013/06/05/quote/</link>
		<comments>http://www.simonbuckle.com/2013/06/05/quote/#comments</comments>
		<pubDate>Wed, 05 Jun 2013 10:32:54 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.simonbuckle.com/?p=801</guid>
		<description><![CDATA[I came across this quote the other day and thought I would share it: Nothing in the world can take the place of persistence. Talent will not; nothing is more common than unsuccessful men with talent. Genius will not; unrewarded genius is almost a proverb. Education will not; the world is full of educated derelicts. [...]]]></description>
				<content:encoded><![CDATA[<p>I came across this quote the other day and thought I would share it:</p>
<blockquote><p>Nothing in the world can take the place of persistence. Talent will not; nothing is more common than unsuccessful men with talent. Genius will not; unrewarded genius is almost a proverb. Education will not; the world is full of educated derelicts. Persistence and determination alone are omnipotent. The slogan “press on” has solved and always will solve the problems of the human race.</p>
<p>- Calvin Coolidge</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.simonbuckle.com/2013/06/05/quote/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Introduction to VoltDB</title>
		<link>http://www.simonbuckle.com/2012/12/11/introduction-to-voltdb/</link>
		<comments>http://www.simonbuckle.com/2012/12/11/introduction-to-voltdb/#comments</comments>
		<pubDate>Tue, 11 Dec 2012 16:31:03 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.simonbuckle.com/?p=768</guid>
		<description><![CDATA[Following up on the recent tradition, or so it seems, of starting every one of my blog posts with the words, &#8220;Introduction to&#8221;, my VoltDB tutorial has (finally!) been published on the developerWorks site: Introduction to VoltDB. The latest version of the source code that accompanies the article can be cloned from the VoltDB example [...]]]></description>
				<content:encoded><![CDATA[<p>Following up on the recent tradition, or so it seems, of starting every one of my blog posts with the words, &#8220;Introduction to&#8221;, my VoltDB tutorial has (finally!) been published on the developerWorks site: <a href="http://www.ibm.com/developerworks/java/library/os-voltdb/index.html" title="Introduction to VoltDB">Introduction to VoltDB</a>.</p>
<p>The latest version of the source code that accompanies the article can be cloned from the VoltDB example project on my GitHub account <a href="https://github.com/sbuckle/voltdb-example" title="VoltDB article source code">here</a>.</p>
<p>If you have any feedback, please leave a comment.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.simonbuckle.com/2012/12/11/introduction-to-voltdb/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Introduction to Riak: Part Deux</title>
		<link>http://www.simonbuckle.com/2012/05/15/introduction-to-riak-part-deux/</link>
		<comments>http://www.simonbuckle.com/2012/05/15/introduction-to-riak-part-deux/#comments</comments>
		<pubDate>Tue, 15 May 2012 15:13:30 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.simonbuckle.com/?p=753</guid>
		<description><![CDATA[Part 2 of my introduction to Riak has just been published. You can view it here: http://www.ibm.com/developerworks/web/library/os-riak2/index.html I&#8217;ve had a quick look and there appears to be an encoding issue in Listing 2. Ignore the question marks. It should read: $ curl -i http://localhost:8098/riak/odds/ ... { "odds":"", "description":"" } &#160; Hopefully they will have corrected [...]]]></description>
				<content:encoded><![CDATA[<p>Part 2 of my introduction to Riak has just been published. You can view it here:</p>
<p><a href="http://www.ibm.com/developerworks/web/library/os-riak2/index.html">http://www.ibm.com/developerworks/web/library/os-riak2/index.html</a></p>
<p>I&#8217;ve had a quick look and there appears to be an encoding issue in Listing 2. Ignore the question marks. It should read:</p>
<pre>
$ curl -i http://localhost:8098/riak/odds/&lt;key&gt;
...
{ "odds":"", "description":"" }
</pre>
<p>Hopefully they will have corrected it by the time you read this.</p>
<p>Other than that it&#8217;s more or less how I submitted it (I think). I&#8217;ll go over it in more detail later on. Let me know what you think.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.simonbuckle.com/2012/05/15/introduction-to-riak-part-deux/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Simulating Auto Increment in VoltDB</title>
		<link>http://www.simonbuckle.com/2012/04/29/simulating-auto-increment-in-voltdb/</link>
		<comments>http://www.simonbuckle.com/2012/04/29/simulating-auto-increment-in-voltdb/#comments</comments>
		<pubDate>Sun, 29 Apr 2012 21:15:27 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.simonbuckle.com/?p=735</guid>
		<description><![CDATA[Auto incrementing fields are quite useful, particularly for allocating values to primary keys. MySQL has AUTO_INCREMENT and PostgreSQL has a SERIAL data type. VoltDB has neither, nor anything remotely close to them. This brief article will show you how to simulate auto-incrementing fields in VoltDB. It assumes some knowledge of VoltDB. VoltDB implements a subset [...]]]></description>
				<content:encoded><![CDATA[<p>Auto incrementing fields are quite useful, particularly for allocating values to primary keys. MySQL has AUTO_INCREMENT and PostgreSQL has a SERIAL data type. <a href="http://www.voltdb.com">VoltDB</a> has neither, nor anything remotely close to them. This brief article will show you how to simulate auto-incrementing fields in VoltDB. It assumes some knowledge of VoltDB.</p>
<p>VoltDB implements a subset of ANSI-standard SQL. It supports the basic CRUD operations (INSERT, SELECT, UPDATE, DELETE) but it does not have support for automatically generating unique identifiers. It is possible, however, to simulate these in VoltDB, as per this <a href="http://community.voltdb.com/faq#id463827">entry</a> in the FAQ. What we can do is create a table that stores the name of the table and the current value that can be used as the unique value for, say, a given column. The schema for the table is shown below:<br />
</p>
<pre>CREATE TABLE IDENTIFIER (
   TABLE_NAME VARCHAR(100) NOT NULL,
   CURRENT_VALUE INTEGER DEFAULT 1 NOT NULL,
   PRIMARY KEY (TABLE_NAME)
);</pre>
<div style="margin-top: 20px; margin-bottom: 15px;">The next step is to create a stored procedure that, when called, will return the current value for a given table. The stored procedure will read the current value, increment it, and then return the value to the client. <span id="more-735"></span></div>
<p>The stored procedure looks like this:</p>
<pre>import org.voltdb.*;

@ProcInfo (
  partitionInfo = "IDENTIFIER.TABLE_NAME: 0",
  singlePartition = true
)
public class GenerateUniqueIdentifier extends VoltProcedure {

  public final SQLStmt select = new SQLStmt(
    "SELECT CURRENT_VALUE FROM IDENTIFIER WHERE TABLE_NAME = ?"
  );

  public final SQLStmt update = new SQLStmt(
    "UPDATE IDENTIFIER SET CURRENT_VALUE = CURRENT_VALUE + 1 " +
    "WHERE TABLE_NAME = ?"
  );

  public VoltTable[] run(String tableName)
    throws VoltAbortException {

      voltQueueSQL(select, tableName);
      VoltTable[] idResult = voltExecuteSQL();

      voltQueueSQL(update, tableName);
      voltExecuteSQL(true);

      return idResult; // Return the current value for the table
  }
}</pre>
<p>As this procedure demonstrates, it is possible to execute multiple SQL statements from inside a stored procedure. It also highlights the advantage of having single-threaded partitions. The table is partitioned on the table name, so the current value for a given table (each row) is stored in a single partition. As each partition is single-threaded, and stored procedures are run sequentially, there is no risk of the current value for a given table being altered between the time the procedure reads the current value and the time it updates it; this would not be the case in a multi-threaded environment, where some kind of locking would have to be used.</p>
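<p>The effect of single-threaded partitions can be imitated outside VoltDB. What follows is a minimal sketch in plain Java, not VoltDB code: the class <code>PartitionSketch</code>, its methods and the table name "ORDERS" are made up for illustration, and a single-threaded executor stands in for a partition. Several caller threads ask for identifiers at once, yet the read-then-increment step needs no lock because the executor runs one request at a time.</p>
<pre>import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PartitionSketch {

  // Stand-in for the IDENTIFIER table; only the partition thread touches it.
  private final Map&lt;String, Integer&gt; currentValues = new HashMap&lt;&gt;();

  // One thread per partition: submitted work runs strictly one item at a time.
  private final ExecutorService partition = Executors.newSingleThreadExecutor();

  // Mirrors the procedure: read the current value, increment it, return what was read.
  private int generate(String tableName) {
    int current = currentValues.getOrDefault(tableName, 1);
    currentValues.put(tableName, current + 1);
    return current;
  }

  // Fires the given number of calls from a pool of client threads
  // and returns the distinct identifiers that were handed out.
  public static Set&lt;Integer&gt; run(int clients, int calls) throws Exception {
    PartitionSketch sketch = new PartitionSketch();
    Set&lt;Integer&gt; ids = ConcurrentHashMap.newKeySet();
    ExecutorService callers = Executors.newFixedThreadPool(clients);
    for (int i = 0; i &lt; calls; i++) {
      callers.submit(() -&gt;
          ids.add(sketch.partition.submit(() -&gt; sketch.generate("ORDERS")).get()));
    }
    callers.shutdown();
    callers.awaitTermination(10, TimeUnit.SECONDS);
    sketch.partition.shutdown();
    return ids;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(run(8, 1000).size() + " distinct identifiers from 1000 calls");
  }
}</pre>
<p>If the requests were allowed to run <code>generate</code> directly on the caller threads instead of going through the single-threaded executor, two callers could read the same value before either wrote it back, which is the situation that would call for locking.</p>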
<p>With the stored procedure in place, it can be called at will. The returned values can then be used in subsequent calls to other stored procedures. And that is how you can simulate auto-increment in VoltDB. <del>At some point I will upload a complete working solution</del>. Check out the <code>autoincrement</code> branch of my example VoltDB <a href="https://github.com/sbuckle/voltdb-example">project</a> to play around with some working code that implements the above. The End.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.simonbuckle.com/2012/04/29/simulating-auto-increment-in-voltdb/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Introduction to Riak</title>
		<link>http://www.simonbuckle.com/2012/03/13/introduction-to-riak/</link>
		<comments>http://www.simonbuckle.com/2012/03/13/introduction-to-riak/#comments</comments>
		<pubDate>Tue, 13 Mar 2012 21:25:34 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.simonbuckle.com/?p=726</guid>
		<description><![CDATA[Several months ago &#8211; in a galaxy far, far away &#8211; I received an email inquiring as to whether I was interested in writing a couple of articles about Riak for IBM&#8217;s developerWorks site. I was, so I did &#8211; I first wrote something about Riak a while back on this site over here. Anyway, after a [...]]]></description>
				<content:encoded><![CDATA[<p>Several months ago &#8211; in a galaxy far, far away &#8211; I received an email inquiring as to whether I was interested in writing a couple of articles about <a title="Riak" href="http://en.wikipedia.org/wiki/Riak">Riak</a> for IBM&#8217;s developerWorks site. I was, so I did &#8211; I first wrote something about Riak a while back on this site over <a href="http://www.simonbuckle.com/2011/08/27/analyzing-apache-logs-with-riak/">here</a>. Anyway, after a bit of a wait, the first one was released into the wild today. You can read it here:</p>
<p><a href="http://www.ibm.com/developerworks/library/os-riak1/">http://www.ibm.com/developerworks/library/os-riak1/</a></p>
<p>It&#8217;s mostly intact although a few paragraphs appear to have fallen by the wayside. Not really surprising as the article was supposed to be under 3000 words whereas the (final) version I submitted was quite a bit over that.</p>
<p>There&#8217;s a second installment but I have no idea when it will be published. Or if, for that matter. I guess that may depend on the reaction to the first one <img src='http://www.simonbuckle.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.simonbuckle.com/2012/03/13/introduction-to-riak/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Multi-Tenancy</title>
		<link>http://www.simonbuckle.com/2011/12/27/multi-tenancy/</link>
		<comments>http://www.simonbuckle.com/2011/12/27/multi-tenancy/#comments</comments>
		<pubDate>Tue, 27 Dec 2011 11:38:18 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.simonbuckle.com/?p=712</guid>
		<description><![CDATA[I attended the Alfresco conference in London in the middle of November and there was a fair amount of talk about Alfresco’s cloud offering that &#8211; if it’s not already available &#8211; was due to be launched fairly soon. It will be a hosted service and will allow a single instance of Alfresco to host [...]]]></description>
				<content:encoded><![CDATA[<p>I attended the Alfresco conference in London in the middle of November and there was a fair amount of talk about Alfresco’s cloud offering that &#8211; if it’s not already available &#8211; was due to be launched fairly soon. It will be a hosted service and will allow a single instance of Alfresco to host multiple sites (or tenants). This is usually referred to as multi-tenancy. There are a number of different <a href="http://msdn.microsoft.com/en-us/library/aa479086.aspx">approaches</a> but the simplest one involves sharing the same database; at the database level you can think of each entry in a table, e.g. forum posts, having something like a site ID column that indicates which site the entry belongs to.</p>
<p>I started thinking about it and I don’t get it. I understand technically how multi-tenancy works; I just don’t see the benefits of making an application multi-tenant aware! <span id="more-712"></span>With the advancements in virtualisation technology made in recent years, why not just get a couple of servers<a title="" href="#_ftn1">[1]</a>, install some virtualisation software and create a virtual instance for each site?</p>
<p>I can imagine that the development effort required in making an application multi-tenant aware is non-trivial &#8211; this implies a significant monetary cost in terms of the number of developers working on it and the time it takes to develop &#8211; and then there is the potential security risk of tenant data being exposed to other tenants due to bugs in the software. It could happen, you never know! Presumably you will also need to maintain a separate version of the application that doesn’t support multi-tenancy (?)</p>
<p>I guess upgrades would be easier &#8211; you would only have to do it once as opposed to having to upgrade each virtual instance.</p>
<p>It just seems to me to be much simpler to go down the virtualisation route. You don’t need to do anything to the application; all you need to do is install some virtualisation software and off you go. Obviously I’m oversimplifying it but you get the idea. Perhaps I have missed something. I would be interested to know what you think.</p>
<hr />
<p><a title="" href="#_ftnref1">[1]</a> I realise it’s not fashionable nowadays to actually buy/rent your own dedicated servers as everything has to be in the &#8220;cloud&#8221;!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.simonbuckle.com/2011/12/27/multi-tenancy/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Yubico Java Client Changes</title>
		<link>http://www.simonbuckle.com/2011/10/04/yubico-java-client-changes/</link>
		<comments>http://www.simonbuckle.com/2011/10/04/yubico-java-client-changes/#comments</comments>
		<pubDate>Wed, 05 Oct 2011 06:29:29 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.simonbuckle.com/?p=697</guid>
		<description><![CDATA[Just a quick note. As part of the integration work I did getting YubiKey to work with Alfresco, I also added support for signatures and making validation queries in parallel to the Yubico Java client (I forked the original client) so it should now work with version 2 of the validation protocol; see this FAQ. Hopefully I [...]]]></description>
				<content:encoded><![CDATA[<p>Just a quick note. As part of the integration work I did getting YubiKey to work with Alfresco, I also added support for signatures and making validation queries in parallel to the Yubico Java client (I forked the original client) so it should now work with version 2 of the validation protocol; see this <a href="http://www.yubico.com/server-v2-faq">FAQ</a>. Hopefully I didn&#8217;t fork it up!</p>
<p>You can grab it from my GitHub account: <a href="https://github.com/sbuckle/yubico-java-client">https://github.com/sbuckle/yubico-java-client</a></p>
<p>If you do decide to use it, you might want to pick and choose which bits you want to pull as I have made other changes not related to the enhancements in version 2.0 of the validation protocol. Now I wonder if I&#8217;ll get my five free YubiKeys <img src='http://www.simonbuckle.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p><em>Update: I did. Just ordered them. Thanks Yubico <img src='http://www.simonbuckle.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </em></p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.simonbuckle.com/2011/10/04/yubico-java-client-changes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Two-Factor Authentication with Alfresco</title>
		<link>http://www.simonbuckle.com/2011/09/29/two-factor-authentication-with-alfresco/</link>
		<comments>http://www.simonbuckle.com/2011/09/29/two-factor-authentication-with-alfresco/#comments</comments>
		<pubDate>Thu, 29 Sep 2011 19:27:39 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.simonbuckle.com/?p=680</guid>
		<description><![CDATA[So what is two-factor authentication? I&#8217;ll defer that explanation to the Wikipedia page on the subject. Most systems require users to identify themselves using a username and password. The problem is that if people choose a weak password, which evidence suggests they do, suddenly your secure authentication system is not so secure. Using &#8220;something you [...]]]></description>
				<content:encoded><![CDATA[<p>So what is two-factor authentication? I&#8217;ll defer that explanation to the Wikipedia page on the <a href="http://en.wikipedia.org/wiki/Two-factor_authentication">subject</a>. Most systems require users to identify themselves using a username and password. The problem is that if people choose a weak password, which evidence suggests they do, suddenly your secure authentication system is not so secure. Using &#8220;something you have&#8221; in the authentication process makes it much more secure. Take cash machines as an example. If I discover your PIN number, I can only take money out of an ATM if I am in possession of your bank card. Without the card, knowing the PIN number is not going to help me steal your money.</p>
<p>I&#8217;ve created an Alfresco extension that implements two-factor authentication using a <a href="http://yubico.com/yubikey">YubiKey</a>.</p>
<p>What is a YubiKey? It&#8217;s a device that you plug into your USB port and it generates one-time passwords (OTPs). It&#8217;s similar to RSA&#8217;s <a href="http://en.wikipedia.org/wiki/SecurID">SecurID</a>, only a lot cheaper. So now, in addition to specifying your username and password, you also have to submit an OTP when logging in &#8211; the OTPs are validated by Yubico&#8217;s servers.</p>
<p>Using a key like this makes logging in a lot more secure as it is now no longer possible to log in just using a username and password. In addition, each key is tied to a particular user account &#8211; the extension takes care of this &#8211; so it&#8217;s not possible to just use any key; the user has to use the key that has been (uniquely) assigned to them. The screencast below shows how it works.</p>
<iframe width="560" height="315" src="http://www.youtube.com/embed/jlc1DrPkX5c" frameborder="0" type="text/html"></iframe>
<p style="margin-top: 15px;"><del datetime="2011-09-30T11:35:28+00:00">I will release the extension shortly.</del> You can download the extension from <a href="https://github.com/sbuckle/Alfresco-Yubikey-Extension">here</a>. I&#8217;ll be attending Alfresco DevCon in London so come and say hi and I can give you a live demo of the system. In the meantime, if you have any questions, feel free to leave a comment or send me an email.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.simonbuckle.com/2011/09/29/two-factor-authentication-with-alfresco/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using Hadoop to Analyze Apache Log Files</title>
		<link>http://www.simonbuckle.com/2011/09/01/using-hadoop-to-analyze-apache-log-files/</link>
		<comments>http://www.simonbuckle.com/2011/09/01/using-hadoop-to-analyze-apache-log-files/#comments</comments>
		<pubDate>Thu, 01 Sep 2011 13:29:57 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.simonbuckle.com/?p=647</guid>
		<description><![CDATA[After my post a few days ago about analyzing Apache log files with Riak, I thought I would follow that up by showing how to do the same thing using Hadoop. I am not going to cover how to install Hadoop; I am going to assume you already have it installed. What is it they [...]]]></description>
				<content:encoded><![CDATA[<p>After my post a few days ago about <a href="http://www.simonbuckle.com/2011/08/27/analyzing-apache-logs-with-riak/">analyzing Apache log files with Riak</a>, I thought I would follow that up by showing how to do the same thing using Hadoop. I am not going to cover how to install Hadoop; I am going to assume you already have it installed. What is it they say about assumptions? Also, any Hadoop commands are executed relative to the directory where Hadoop is installed ($HADOOP_HOME).<span id="more-647"></span></p>
<p>As before, the first thing to do is to get log data into the system so it can be analyzed; in this case the Hadoop Distributed File System (HDFS). I copied all of my log files into /tmp/logs on the server where I was running Hadoop. To import the data into HDFS, run the following command:</p>
<div style="margin-top: 15px; margin-bottom: 15px;">
<pre>./bin/hadoop dfs -put /tmp/logs /var/logs</pre>
</div>
<p>This copies all the files from the directory /tmp/logs on the local file system into the location /var/logs in HDFS. Remember: the destination location is the location in HDFS and does NOT correspond to a directory in the local file system. To see if the log files were copied, run the following:</p>
<div style="margin-top: 15px; margin-bottom: 15px;">
<pre>./bin/hadoop dfs -ls /var/logs</pre>
</div>
<p>You should see a list of the files you just imported.</p>
<p>Just like with Riak, we need two classes: a Mapper and a Reducer. The Mapper class contains a map function, which is called for each input record &#8211; with the default text input format, which is what the job below uses, that means each line of each log file. The Reducer class contains a reduce function, which is called once for each key, and takes as input the intermediate results from the map phases.</p>
<p>The code for the Mapper class looks like this:</p>
<div style="margin-top: 15px; margin-bottom: 15px;">
<pre>public static class LogEntryMapper extends Mapper&lt;Object, Text, Text, IntWritable&gt; {

  private final static IntWritable one = new IntWritable(1);
  private Text url = new Text();

  private Pattern p = Pattern.compile("(?:GET|POST)\\s([^\\s]+)");

  public void map(Object key, Text value, Context context)
  throws IOException, InterruptedException {
	String[] entries = value.toString().split("\r?\n");
	for (int i=0, len=entries.length; i&lt;len; i+=1) {
		Matcher matcher = p.matcher(entries[i]);
		if (matcher.find()) {
			url.set(matcher.group(1));
			context.write(url, one);
		}
	}
  }
}</pre>
</div>
<p>The map function takes the text it is given and splits it at the newline character to get an array of individual log entries (with the default text input format each call receives a single line, so the array holds just one entry). The URL is then extracted from each line and a count of 1 is assigned to each URL instance. The reduce function takes these individual counts and adds them together to get a total for each key (i.e. URL). The Reducer class looks like this:</p>
<div style="margin-top: 15px; margin-bottom: 15px;">
<pre>public static class LogEntryReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; {

	private IntWritable total = new IntWritable();

	public void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context)
	throws IOException, InterruptedException {
		int sum = 0;
		for (IntWritable value : values) {
		    sum += value.get();
		}
		total.set(sum);
		context.write(key, total);
	}
}</pre>
</div>
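<p>The logic of the two classes can be tried without a cluster. The sketch below is plain Java with no Hadoop classes: <code>LocalLogCount</code> and the sample log lines are made up for illustration, and only the regular expression is taken from the Mapper above. It emits a count of 1 for every matching line and sums the counts per URL in one pass, which is what the map and reduce phases do between them.</p>
<div style="margin-top: 15px; margin-bottom: 15px;">
<pre>import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LocalLogCount {

  // Same pattern as LogEntryMapper: capture the token after GET or POST.
  private static final Pattern REQUEST = Pattern.compile("(?:GET|POST)\\s([^\\s]+)");

  // "Map": one count of 1 per matching line. "Reduce": sum the counts per URL.
  public static Map&lt;String, Integer&gt; count(String[] lines) {
    Map&lt;String, Integer&gt; totals = new TreeMap&lt;&gt;();
    for (String line : lines) {
      Matcher matcher = REQUEST.matcher(line);
      if (matcher.find()) {
        totals.merge(matcher.group(1), 1, Integer::sum);
      }
    }
    return totals;
  }

  public static void main(String[] args) {
    // Hypothetical log entries, invented for this example.
    String[] lines = {
      "127.0.0.1 - - [23/Aug/2011:10:00:00 +0000] \"GET /index.php HTTP/1.1\" 200 512",
      "127.0.0.1 - - [23/Aug/2011:10:00:01 +0000] \"GET /robots.txt HTTP/1.1\" 200 64",
      "127.0.0.1 - - [23/Aug/2011:10:00:02 +0000] \"POST /index.php HTTP/1.1\" 200 128"
    };
    System.out.println(count(lines)); // prints {/index.php=2, /robots.txt=1}
  }
}</pre>
</div>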
<p>The main() method sets up the map-reduce job and starts it:</p>
<div style="margin-top: 15px; margin-bottom: 15px;">
<pre>public static void main(String[] args) throws Exception {
	Configuration conf = new Configuration();

	if (args.length != 2) {
		System.err.println("Usage: loganalyzer &lt;in&gt; &lt;out&gt;");
		System.exit(2);
	}

	Job job = new Job(conf, "analyze log");
	job.setJarByClass(LogAnalyzer.class);
	job.setMapperClass(LogEntryMapper.class);
	job.setReducerClass(LogEntryReducer.class);
	job.setOutputKeyClass(Text.class);
	job.setOutputValueClass(IntWritable.class);
	FileInputFormat.addInputPath(job, new Path(args[0]));
	FileOutputFormat.setOutputPath(job, new Path(args[1]));

	System.exit(job.waitForCompletion(true) ? 0 : 1);
}</pre>
</div>
<p>As I mentioned in my <a href="http://www.simonbuckle.com/2011/08/27/analyzing-apache-logs-with-riak/#combiner">previous article</a>, it&#8217;s possible with Hadoop to specify a combiner class to perform a reduce-type operation after each map phase. If you want to do that here, you will need to add the following line to the job definition:</p>
<div style="margin-top: 15px; margin-bottom: 15px;">
<pre>job.setCombinerClass(LogEntryReducer.class);</pre>
</div>
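<p>Reusing the Reducer as the combiner works here because addition gives the same total whether the counts are summed in one go or in stages. A small plain Java sketch of that property follows; <code>CombinerSketch</code> and the numbers in it are made up for illustration and it does not use Hadoop.</p>
<div style="margin-top: 15px; margin-bottom: 15px;">
<pre>import java.util.Arrays;

public class CombinerSketch {

  // The same operation LogEntryReducer performs: add up the counts for one key.
  public static int sum(int... counts) {
    return Arrays.stream(counts).sum();
  }

  public static void main(String[] args) {
    // Hypothetical counts for one URL coming out of two map tasks.
    int[] fromFirstMapper = {1, 1, 1};
    int[] fromSecondMapper = {1, 1};

    // Without a combiner, all five records reach the reducer.
    int direct = sum(1, 1, 1, 1, 1);

    // With a combiner, each map task pre-sums locally and only two records are sent.
    int combined = sum(sum(fromFirstMapper), sum(fromSecondMapper));

    System.out.println(direct + " " + combined); // prints "5 5"
  }
}</pre>
</div>
<p>A reduce operation that is not associative and commutative, such as computing an average directly, could not be reused as a combiner in this way.</p>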
<p>Now that the data is available to analyze, the map-reduce job can be started from a local JAR file:</p>
<div style="margin-top: 15px; margin-bottom: 15px;">
<pre>./bin/hadoop jar loganalyzer.jar loganalyzer /var/logs /var/logs-output</pre>
</div>
<p><em>(If you are following along at home you will need to download <a href="/files/loganalyzer.jar">loganalyzer.jar</a>)</em></p>
<p>This tells the map-reduce job to take all the files in /var/logs as input and output the results of the job to /var/logs-output (in HDFS). To copy the results locally, once the job is complete, execute the following:</p>
<div style="margin-top: 15px; margin-bottom: 15px;">
<pre>./bin/hadoop dfs -getmerge /var/logs-output /tmp/logs-output</pre>
</div>
<p>If you open up the file /tmp/logs-output, you should see something like this:</p>
<div style="margin-top: 15px; margin-bottom: 15px;">
<pre>simonbuckle.com/feed/atom       11
simonbuckle.com/index.php       13
simonbuckle.com/robots.txt      2
...</pre>
</div>
<p>That&#8217;s it! The results could be tidied up a bit by removing the hostname as that doesn&#8217;t add anything but other than that, it does the job.</p>
<p>The code that defines the Mapper and Reducer classes is <a href="/files/LogAnalyzer.java">here</a> &#8211; I have put everything in one file to make it easier to read. The code for the driver is <a href="/files/LogDriver.java">here</a>. All code was compiled against Hadoop version 0.20.203.0.</p>
<p>As before, feel free to leave any comments.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.simonbuckle.com/2011/09/01/using-hadoop-to-analyze-apache-log-files/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Analyzing Apache Logs with Riak</title>
		<link>http://www.simonbuckle.com/2011/08/27/analyzing-apache-logs-with-riak/</link>
		<comments>http://www.simonbuckle.com/2011/08/27/analyzing-apache-logs-with-riak/#comments</comments>
		<pubDate>Sat, 27 Aug 2011 15:59:21 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.simonbuckle.com/?p=622</guid>
		<description><![CDATA[This article will show you how to do some Apache log analysis using Riak and MapReduce. Specifically it will give an example of how to extract URLs from Apache logs stored in Riak (the map phase) and provide a count of how many times each URL was requested (the reduce phase). So what is Riak? [...]]]></description>
				<content:encoded><![CDATA[<p>This article will show you how to do some Apache log analysis using Riak and MapReduce. Specifically it will give an example of how to extract URLs from Apache logs stored in Riak (the map phase) and provide a count of how many times each URL was requested (the reduce phase).</p>
<p>So what is Riak? According to Wikipedia it&#8217;s &#8220;a NoSQL database implementing the principles from Amazon&#8217;s Dynamo paper&#8221;. Or, put another way,  it&#8217;s a distributed key-value store that has built-in support for MapReduce. If you aren&#8217;t familiar with MapReduce a good starting point would be to read Google&#8217;s <a href="http://labs.google.com/papers/mapreduce.html">MapReduce paper</a>. I am not going to go over how to install Riak; there&#8217;s a good <a href="http://wiki.basho.com/The-Riak-Fast-Track.html">tutorial</a> for that on the Riak website. Riak also has a lot of other features that won&#8217;t be covered here.<span id="more-622"></span></p>
<p>Right, off we go.</p>
<p>The first thing to do is get your log data into Riak. The example will use Riak&#8217;s HTTP API, which lets you read, create and delete content using HTTP GET, PUT, POST, DELETE etc. Run the following command from wherever your log data is stored:</p>
<div style="margin-top: 15px; margin-bottom: 15px;">
<pre>curl -v -X PUT http://localhost:8091/riak/logs/2011-08-23 \
    -H "Content-Type: text/plain" --data-binary @2011-08-23.log</pre>
</div>
<p>In this case I am storing the log data in a bucket called &#8220;logs&#8221; and I am also providing a key (&#8220;2011-08-23&#8221;) for this particular log. If you do a POST to just the bucket, i.e. you don&#8217;t specify a key, Riak will generate a key for you; you will be able to see the generated key in the &#8220;Location&#8221; header in the HTTP response. Also, note the use of the <code>--data-binary</code> flag. It&#8217;s really important because if you use <code>-d</code> instead, curl will very kindly strip out all of the newline characters from the text &#8211; as I eventually found out! Not what you want.</p>
<p>Now that the log data is stored in Riak, you can query it. This is where MapReduce comes in. Riak&#8217;s MapReduce supports writing map and reduce functions in either JavaScript or Erlang. <a name="combiner"></a>Currently there is no support for applying an optional combiner function after each map task, although other frameworks, such as Hadoop, do support this. For example, when counting the number of words in a set of documents, each map task may produce lots of records of the type &lt;&#8220;at&#8221;, 1&gt;. Rather than sending all of these individual records over the network, it would be beneficial to merge the counts for each individual record on each node before sending them to the reduce phase; however, in this scenario, it wouldn&#8217;t be difficult to just add the logic for doing the merge in the map function itself. I used JavaScript for my map and reduce functions. Queries are specified using JSON. The query for analysing the log(s) looks something like this:</p>
<div style="margin-top: 15px; margin-bottom: 15px;">
<pre>{
   "inputs": [["logs", "2011-08-23"]],
   "query": [
	{ "map": { "language": "javascript", "name": "LogAnalyzer.mapLogEntry" } },
	{ "reduce": { "language": "javascript", "name": "LogAnalyzer.reduceLogEntry" } }
   ]
}</pre>
</div>
<p>The inputs consist of an array of arrays; each entry specifies the bucket name and the corresponding key of the log we want to process. In this case the bucket name and key correspond to the log that was loaded previously. Only one map phase is defined but you can specify more than one if you want to. During the map phase the request URL will be extracted from each line in the logs. The reduce phase will take the results from the map phase and sum the counts for each URL, returning the totals to the client. Both the map and reduce phases in this example use named queries. The source for the functions is <a href="https://github.com/sbuckle/Riak-Log-Analyzer/blob/master/js/LogAnalyzer.js">here</a>.</p>
<p>You can define anonymous JavaScript functions directly in your queries; there is an example of using an anonymous function halfway down this <a href="http://wiki.basho.com/Loading-Data-and-Running-MapReduce-Queries.html">page</a>. Unless your functions are trivial I recommend that you name your functions and have Riak load them when it starts up. To do that you will need to modify the configuration file <code>(&lt;node&gt;/etc/app.config)</code> for each node in your cluster. Open the config file, locate the variable &#8220;js_source_dir&#8221; and set it to wherever you have your JavaScript files, e.g. <code>{js_source_dir, "/Users/simon/Projects/Riak/js"}</code>. Make sure it&#8217;s uncommented. You will need to restart your nodes for the changes to take effect.</p>
<p>To run the query, save it to a file, open up a terminal and run the following command:</p>
<div style="margin-top: 15px; margin-bottom: 15px;">
<pre>curl -v -H "Content-Type: application/json" \
      http://localhost:8091/mapred -d @log-query.json</pre>
</div>
<p>Hopefully, you should get back something like this:</p>
<div style="margin-top: 15px; margin-bottom: 15px;">
<pre>[{"www.simonbuckle.com/feed/" : 19},
    {"www.simonbuckle.com/2006/01/19/design-revamp-2" : 1}, ...]</pre>
</div>
<p>So that&#8217;s how to do some analysis on your Apache logs using Riak. All the code can be found on <a href="https://github.com/sbuckle/Riak-Log-Analyzer">GitHub</a>. There&#8217;s also an example of how to do a distributed word count that I didn&#8217;t cover here.</p>
<p>There is a lot more information about Riak on the Riak website. At some point it would be nice to be able to specify queries in languages other than JavaScript and Erlang. The map and reduce phases in this example are trivial but I can envisage a scenario where you might want to do some rather complex analysis during each phase so it would be nice to be able to use external libraries rather than having to write stuff from scratch each time.</p>
<p>Feel free to leave a comment if you have any questions.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.simonbuckle.com/2011/08/27/analyzing-apache-logs-with-riak/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
