Tuesday, March 8, 2011

Understanding Maven Dependency Mediation (Part 2)

In my last post I talked about how Maven mediates conflicts in the dependency tree in what is probably the vast majority of all builds. Somewhat surprisingly, this default algorithm ignores the actual version numbers of the declared dependencies. Instead, it uses the distance to the root of the dependency tree as the basis for deciding which dependencies make it into the build.

Now I'm going to cover another approach Maven can take to handle conflicts among transitively resolved dependencies. Maven switches to this alternative algorithm as soon as it detects a concrete version number for one of the dependencies involved in a conflict. In my last post I explained that version numbers without any decoration are considered a recommendation; Maven may or may not follow it. In contrast, if you put square brackets around the version in your dependency declaration, Maven assumes you mean it. That is, you have now defined an exact version for the given dependency. If Maven is not able to deliver the desired version(s) to your build, it generates an appropriate error and fails.

Just as a side note: the syntax for concrete versions allows many more variants than just one specific version. In short, it supports the following definitions:

* [A]: Defines one specific version A
* [A,B]: Defines a range of versions from A to B (inclusive)
* [A,]: Defines a range from A (inclusive) upwards
* [,A]: Defines a range up to A (inclusive)

Instead of square brackets, you can use round brackets to define a version range that excludes the delimiting versions. For example, [A,B) defines a range from A (inclusive) up to B (exclusive).
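For illustration, a dependency declaration using the range syntax might look like this (the coordinates and range values are just example numbers, not taken from a real POM):

```xml
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <!-- any Log4J from 1.2.12 (inclusive) up to 1.2.16 (exclusive) -->
    <version>[1.2.12,1.2.16)</version>
</dependency>
```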

Now, let's see what happens if a concrete version pops up in the dependency tree. Honestly, I didn't find a real example, because concrete versions are used very rarely. So I constructed a hypothetical sample based on the ones from my last post. It assumes that someone introduced a concrete version range in the POM of commons-logging. The range specifies that commons-logging needs Log4J in a version between 1.2.12 and 1.2.16 (inclusive). Our sample project itself references this hypothetical version of commons-logging as well as Log4J V1.2.11.
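The hypothetical declaration in the POM of commons-logging would then look like this (remember, this is a constructed example, not the real commons-logging POM):

```xml
<!-- hypothetical snippet from the POM of commons-logging -->
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <!-- concrete range: 1.2.12 up to 1.2.16, both inclusive -->
    <version>[1.2.12,1.2.16]</version>
</dependency>
```

Our sample project itself would still declare a plain recommended version, as in the examples of the last post:

```xml
<!-- in the POM of our sample project -->
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.11</version>
</dependency>
```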

If we execute this build, Maven decides that Log4J V1.2.16 must be included. This is in contrast to the examples in my last post, where Maven decided in all cases that Log4J V1.2.11 was the "right one" for the build. The point is that this different behavior is not triggered by the POM of our sample project, but by the POM of commons-logging. We're not in control of this POM; someone else changed it, and suddenly we end up with a different Log4J version in our build.

So why did this happen? Maven detected a conflict in the dependency tree: we requested Log4J V1.2.11, while commons-logging requested a different range of Log4J versions. Since commons-logging used a concrete version range, the default algorithm is skipped. Instead, Maven actually looks at the values of the conflicting versions. First, it tries to find an overlapping range of all defined concrete versions. In our case this is simple: there is just one concrete version range. Next, Maven checks whether the nearest recommended version fits into that overlapping range. If it fits, Maven selects the recommended version. In our case it doesn't (V1.2.11 doesn't fit into [1.2.12,1.2.16]), so Maven selects the highest version of the range. In our sample, this is V1.2.16.

What would happen if the version range were [1.2.11,1.2.16]? In this case Maven would select Log4J V1.2.11, because now the recommended version in our POM fits into the concrete range [1.2.11,1.2.16] defined in commons-logging. And what would happen if we defined a concrete version [1.2.11] in our POM and commons-logging defined [1.2.12,1.2.16]? In this case the build would fail, because there is no overlapping version range.
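The failing combination from the last scenario would look like this: both sides now declare concrete versions, and the two definitions don't overlap, so Maven has no version it is allowed to pick.

```xml
<!-- in the POM of our sample project -->
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <!-- exactly V1.2.11, no other version acceptable -->
    <version>[1.2.11]</version>
</dependency>

<!-- in the (hypothetical) POM of commons-logging -->
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <!-- 1.2.12 up to 1.2.16, both inclusive: excludes 1.2.11 -->
    <version>[1.2.12,1.2.16]</version>
</dependency>
```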

After reading all this, you might come to the conclusion that all dependency versions should be defined using concrete version numbers. IMO, this is one of two valid options. The other is to make use of the POM section dependencyManagement. Instead of defining the versions directly at the dependencies, you can define them below the dependencyManagement element in your POMs. Maven guarantees that all versions you define in this section are actually used for your build. In the example above, you get Log4J V1.2.11; the concrete version range defined in commons-logging is ignored. The drawback of this approach is obvious: you get your desired version, but you don't get a notification that an unresolved conflict exists in the dependency tree.
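For the example above, pinning Log4J via dependencyManagement could look like this sketch:

```xml
<!-- in the POM of our sample project (or a parent POM) -->
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <!-- this version wins for the whole build,
                 including transitively resolved occurrences of log4j -->
            <version>1.2.11</version>
        </dependency>
    </dependencies>
</dependencyManagement>
```

Note that dependencyManagement only manages versions; the dependency itself must still be declared (directly or transitively) to end up in the build.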

Wednesday, February 23, 2011

Understanding Maven Dependency Mediation (Part 1)


One of the most powerful features of Maven is certainly the management of dependencies. As far as I know, the lack of dependency management in other well-established build tools was actually one of the reasons to start developing Maven. The whole dependency management machinery relies on the fact that people declare all required dependencies for their project explicitly in the POM file. Maven reads this information and decides which jars must be loaded from the repositories, and so on. Most of you know all this well, so I won't repeat the basics here.

However, sometimes the dependency management in Maven seems to have a "bad day". Unknown jars pop up in your war file, or Maven uses a completely wrong version for your build. If things like this happen, it is likely that a mechanism called dependency mediation decided to screw up your well-defined list of dependencies. Let's take a look at that beast.

As you know, Maven resolves all required jars for your build transitively from the entries in your POM. For example, if you define a dependency on commons-logging and commons-logging itself defines a dependency on log4j, both commons-logging.jar and log4j.jar will be added to your build. More formally, transitivity means that if A->B and B->C, then A->C. Transitive dependencies are very cool, and they are the prerequisite for really getting all classes required by your project into the classpath, war or whatever.

However, transitive dependencies don't come for free. One of the problems Maven faces is conflicts in the dependency tree. The image shows such a conflict. The sample project defines two direct dependencies: one on commons-logging-1.1 and one on log4j-1.2.13. Now, because Maven transitively loads all dependencies defined for commons-logging-1.1, a second version of log4j (V1.2.12) pops up in the dependency tree. If this happens, the mechanism called dependency mediation kicks in. Its job is to decide which of the two log4j versions must be used for the build. In this example, log4j-1.2.13 will be selected. But why?
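The dependency section of the sample project looks like this:

```xml
<dependencies>
    <dependency>
        <groupId>commons-logging</groupId>
        <artifactId>commons-logging</artifactId>
        <version>1.1</version>
    </dependency>
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.13</version>
    </dependency>
</dependencies>
```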

A first, simple explanation would be that V1.2.13 is newer (larger) than V1.2.12, and that's why Maven uses it. To verify whether this is true, we can perform a simple test: we change the version of our dependency on log4j from 1.2.13 to 1.2.11. If our explanation is correct, Maven should now select log4j-1.2.12 for the build. Try it in one of your projects: Maven still uses V1.2.11 of Log4J. Obviously, the actual value of the version has no effect on the dependency mediation.

In fact, Maven knows several strategies to resolve conflicts in the dependency tree. In the above example, the most commonly used and simplest strategy was chosen. It is triggered if the dependency in your POM looks like this:

<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.11</version>
</dependency>

The important thing is the version number. It is stated "as is", without any decoration (I'll come back to this in a follow-up post). If you define version numbers in this syntax, Maven treats them as recommended versions. So what you're actually telling Maven is: "I would prefer V1.2.11 of Log4J, but hey, I can live with any other version, too".

If all versions that are in conflict in your dependency tree have been defined like this, Maven simply chooses the one with the smallest distance from the root of the tree. In our example, log4j-1.2.11 has a distance of 1 to the root and log4j-1.2.12 has a distance of 2. Consequently, Maven chooses V1.2.11. The actual value of the version doesn't matter at all. This algorithm is as simple as can be - and it works in almost all cases. This is why Maven behaves exactly as you expect in most project situations: you get the version you declare in your POM, because your own dependencies always have the smallest distance to the root of the dependency tree.

But beware. Let's say someone in the Apache commons-logging project decides for whatever reason to change the POM of commons-logging like this:

<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>[1.2.12]</version>
</dependency>

Now commons-logging no longer defines a recommended version 1.2.12, but a specific one. This is done simply by adding square brackets around the version number. If this happens, you will suddenly end up with log4j-1.2.12 in your build - without ever changing your own dependency on log4j-1.2.11. Confused? I'll explain this behavior in a follow-up post.

Thursday, January 27, 2011

How to Mavenize big, monolithic legacy Java projects


Last year a customer asked me to modularize a rather critical legacy Java system. It had grown over the years and was based on a rather monolithic Ant-based build process. The architecture of the system clearly defined layers and components, but this architecture was not supported by the project layout. Due to the monolithic layout, the well-defined architecture became more and more corrupted, because nothing in the build process actually enforced the rules and restrictions of the architecture. My customer did not feel comfortable with this situation and decided to invest some time and money in a major refactoring of the system. The goal of the refactoring was to break up the monolithic project structure into multiple smaller modules with clearly defined dependencies on each other.

After some investigation, we decided very early in the refactoring process to move the whole thing to Maven. This was not an easy decision. If you decide to use Maven, you have to adhere to the conventions and rules defined by the Maven build process. Projects that have their own, Ant-based build process typically do not follow any rules but their own. So converting such a project to Maven can be a daunting task. After some prototyping, I was confident that I would be able to accomplish it, and here is how I did it:

A very good starting point for the refactoring was the existing software architecture, which defined layers and components. The goal was to define, more or less, one separate module per component. Additionally, some infrastructure modules were necessary that encapsulated, for example, the configuration files. However, even with this basic strategy in mind, the practical transformation of one big project into about 50 modules is a complicated thing. The main problem is the dependencies between the various modules. In a Maven POM, you explicitly define all modules you depend on. If this dependency set is incomplete, the resulting project won't build. If you only have 2 or 3 modules, you can figure out the dependencies by trial and error. This won't work in a reasonable time frame for 50+ modules.

Another pitfall is cyclic dependencies. Maven won't build projects with cyclic dependencies between modules. A cyclic dependency occurs, for example, when module A depends on B, which in turn depends on A. The problem is that in most cases the cycle is not as simple as in this example; typically, you have something like A->B->C->D->A. Since Maven resolves dependencies transitively, it will detect this cycle and immediately cancel the build (which is perfectly all right IMO, because cyclic dependencies are a bad thing and should never occur in a well-structured project).

In short, I was faced with two problems. First, I needed a way to cut the big project into the "right" modules and to find out the dependencies between them. Second, I had to identify and break up any cyclic dependencies between the new modules. To work efficiently, all this had to be done in a kind of simulation, before actually performing the refactoring and breaking up the project.

Luckily, there are some tools out there that are able to support this kind of task. I chose SonarJ for this specific project. SonarJ allows you to define a target architecture for a given software system. In my case, I defined the planned 50+ modules as the target architecture. Next, I defined the target dependencies between the modules using a graphical modelling tool within SonarJ. The cool thing is that SonarJ is then able to identify all modules, packages and even classes that break these rules. Often the violations are caused by wrongly placed packages (someone placed the code in component A, but it plainly belongs in component B). I was then able to fix such errors by virtually moving the package to the right component. This movement occurs only in the architecture model of SonarJ; no source code is changed (yet).

Additionally, SonarJ has a very powerful tool set to identify and break up cyclic dependencies. In most cases, the break-up is also done by virtually moving classes or packages around in the model. Of course, some cycles are caused by genuinely bad code that needs thorough refactoring. In these cases, one can put the culprit classes on a todo list with some additional notes for later, manual rework.

As soon as my SonarJ model was complete, with all cycles resolved and all dependencies defined, I "only" needed to actually perform the refactoring:

  1. Create new Maven modules for all modules defined in SonarJ. This task included the creation of the necessary parent modules that group modules of the same type. As a result of this task, a project structure with all Maven POMs was created, but still without any content (source code, resources, etc.). Of course, the newly created POMs included all dependencies defined in SonarJ.
  2. Copy all source-code files, resources, config-files, test cases etc. from the old monolithic Ant project to the correct new Maven module.

For a project of this size, one cannot perform these tasks manually. Since I am a big fan of Groovy, I decided to write some scripts that interpret the SonarJ model file (which, luckily, is a well-defined XML file). The first set of scripts created the Maven module structure. To keep things as simple as possible, I first created a set of simple Maven archetypes that contained empty templates for all needed module types. The script itself simply iterated through the SonarJ file and invoked the "right" archetype with some parameters. The actual creation of the directories, POM files, etc. was performed by the Maven Archetype Plugin. A second set of scripts performed the copy tasks using AntBuilder. AntBuilder is incredibly powerful at copying things, and hence the script fits on a single page.

Basically, I wrote a generator that interpreted the SonarJ model to perform the refactoring. This approach allowed me to easily try out variants or fix problems in the model that showed up after the generation. After I finished everything, I realized that this was the first time I had used a model-driven approach successfully! I'm still not a fan of MDSD, but hey, in this case it really worked. Of course, it was not real software development but "just" a refactoring with some Groovy scripts, but I have the feeling that any other approach would have been much more painful.