Yes We Scan – Exploring Libraries
One of the talks I’m currently giving at conferences (e.g. BED-Con 2015, Java Forum Nord 2015, JDD Krakow 2015) is about exploring 3rd party libraries which are only available as fully packaged artifacts like JAR files. The intro slides can be found here.
The first part of the live demo shows how API changes between two different versions of the same library can be detected. The approach is already described in one of my earlier blog posts.
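Let’s now concentrate on the second part, which is about finding potential hotspots and structural problems. For this I’m going to use the core libraries of well-known JPA implementations: Hibernate, OpenJPA and EclipseLink. This post is quite long, so here’s what we’re going to do: scan the libraries, gather some statistics, look for deprecations and thrown exceptions, and finally collect metrics on type and package level.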
All you need is a Java 7 runtime environment and the command line distribution of jQAssistant, which is available as a ZIP file. In the following examples the variable JQASSISTANT_HOME points to the directory which is created after unpacking.
First of all the libraries need to be scanned by jQAssistant. Therefore they need a place on our hard disk or SSD – let’s copy them all into a directory called „jpa“:
jpa/hibernate-core-3.6.6.Final.jar
jpa/hibernate-core-4.3.8.Final.jar
jpa/openjpa-2.3.0.jar
jpa/org.eclipse.persistence.core-2.6.0.jar
As you can see the Hibernate Core JAR is present in two different versions – we’re interested to see if we can detect some changes between them.
Now let’s trigger the scanner:
$JQASSISTANT_HOME/bin/jqassistant.sh scan -f jpa
%JQASSISTANT_HOME%\bin\jqassistant.cmd scan -f jpa
We see some output like this:
Entering C:/Development/projects/YesWeScan/jpa
Entering /hibernate-core-3.6.6.Final.jar
Leaving /hibernate-core-3.6.6.Final.jar (2114 entries, 29843 ms)
Entering /hibernate-core-4.3.8.Final.jar
Leaving /hibernate-core-4.3.8.Final.jar (3710 entries, 41287 ms)
Entering /openjpa-2.3.0.jar
Leaving /openjpa-2.3.0.jar (2221 entries, 29522 ms)
Entering /org.eclipse.persistence.core-2.6.0.jar
Leaving /org.eclipse.persistence.core-2.6.0.jar (2276 entries, 29583 ms)
Leaving C:/Development/projects/YesWeScan/jpa (4 entries, 134318 ms)
(Don’t worry about the logged times: it will be much faster on your machine. At the time of writing this blog post I’m sitting on a bus with my notebook in energy saving mode…)
Now it’s time to start the integrated Neo4j server and execute some queries:
$JQASSISTANT_HOME/bin/jqassistant.sh server
%JQASSISTANT_HOME%\bin\jqassistant.cmd server
This makes the Neo4j browser available in our web browser at the URL http://localhost:7474.
As a little warm-up let’s get some statistics on how many Java classes are contained in each of the scanned artifacts. Enter the following query in the top area of the Neo4j browser and hit Ctrl-Enter:
match (a:Artifact)-[:CONTAINS]->(t:Type) return a.fileName, count(t)
The following result appears:
We can see that all JARs contain more or less the same number of types except Hibernate 4, which has grown significantly over its predecessor. Let’s assume that they provide more or less the same functionality.
Getting back to the query we see that the label „Type“ has been used. Didn’t we look for classes? Actually Java defines types which can be classes, interfaces, enumerations or annotations. Therefore jQAssistant always adds two labels on a node representing a scanned „.class“ file: „Type“ and a label which classifies the type further, i.e. „Class“, „Interface“, „Enum“ or „Annotation“. If we only wanted to know the number of class types per artifact the query would look like this:
match (a:Artifact)-[:CONTAINS]->(t:Type:Class) return a.fileName, count(t)
We can also find out which types are required by these libraries, i.e. which are referenced by contained types but not available in the JARs:
match (a:Artifact)-[:REQUIRES]->(t:Type) return a.fileName, t.fqn
Note that for all types which are required by an artifact there’s no information available if it is a class, interface, enumeration or annotation. Therefore these nodes only carry the label „Type“ with the property „fqn“ identifying the fully qualified name:
As developers we might be interested in knowing whether there are any deprecations on type or method level before we start using a newer version of a library. Let’s start with deprecated types per artifact. On code level these are identified by „the presence of an annotation of type @java.lang.Deprecated“:
match (a:Artifact)-[:CONTAINS]->(t:Type), (t)-[:ANNOTATED_BY]->(:Annotation)-[:OF_TYPE]->(d:Type) where d.fqn="java.lang.Deprecated" return a.fileName, count(t)
There are two interesting facts to be noticed:
1. The number of deprecated types has doubled from Hibernate 3 to 4.
2. OpenJPA has no deprecated types – either the APIs didn’t change or the project does not work with deprecations.
Besides the statistics we are interested in the actual types that have been deprecated. For that we only need to change the return clause of the last query:
... return a.fileName, t.fqn
Or for a more compact result:
... return a.fileName, collect(t.fqn)
Now let’s drill down to method level by gaining some statistics first:
match (a:Artifact)-[:CONTAINS]->(t:Type)-[:DECLARES]->(m), (m)-[:ANNOTATED_BY]->(:Annotation)-[:OF_TYPE]->(d:Type) where d.fqn="java.lang.Deprecated" return a.fileName, count(m)
The result of the query reveals that OpenJPA actually works with deprecations but only on method level.
Again the return clause may be changed to return the actual methods and their declaring types:
... return a.fileName, t.fqn, m.signature
In this section we’re going to find out which kind of exceptions (or better Throwables) may be thrown by the libraries under inspection. Therefore we need to solve two little problems:
First of all we need to identify which types are actually exceptions. Usually we would try to find all types which directly or indirectly inherit from „java.lang.Throwable“:
match (e:Type)-[:EXTENDS*]->(t:Type) where t.fqn = "java.lang.Throwable" return e.fqn
In our case this won’t work as the full inheritance hierarchy of throwables is not available in our database. As an example take a method declared by a type within our libraries which throws a java.lang.IllegalArgumentException. There’s a node representing IllegalArgumentException in the database but there’s no information available about its super type. For that we would have needed to scan the file rt.jar of the Java Runtime Environment as well. But the JRE is not the only library which is missing; there could also be exception types coming from other libraries. So the appropriate solution would be scanning all dependencies, but this would make the queries for our analysis more difficult because we would need to add more filters.
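To make the limitation concrete, here is a minimal Python sketch (not part of jQAssistant) that mimics the EXTENDS* traversal over a hand-written super type map. The map contents and the type com.example.MyException are assumptions for illustration. The point is that a type whose super type was never scanned cannot be traced back to java.lang.Throwable.

```python
from typing import Dict, Optional

# Hypothetical excerpt of scanned EXTENDS relations: type -> super type.
# java.lang.IllegalArgumentException exists as a node, but its super type
# is unknown because rt.jar was not scanned.
SCANNED_EXTENDS: Dict[str, Optional[str]] = {
    "com.example.MyException": "java.lang.IllegalArgumentException",
    "java.lang.IllegalArgumentException": None,
}

# The same relations if the JRE had been scanned as well.
WITH_JRE: Dict[str, Optional[str]] = {
    **SCANNED_EXTENDS,
    "java.lang.IllegalArgumentException": "java.lang.RuntimeException",
    "java.lang.RuntimeException": "java.lang.Exception",
    "java.lang.Exception": "java.lang.Throwable",
    "java.lang.Throwable": "java.lang.Object",
}


def is_throwable(fqn: str, extends: Dict[str, Optional[str]]) -> bool:
    """Follow super type links upwards, similar to (e)-[:EXTENDS*]->(t)."""
    seen = set()
    current = extends.get(fqn)
    while current is not None and current not in seen:
        if current == "java.lang.Throwable":
            return True
        seen.add(current)
        current = extends.get(current)
    return False
```

With SCANNED_EXTENDS the walk stops at IllegalArgumentException and reports False; with WITH_JRE the same call reaches java.lang.Throwable.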
Let’s take another approach which is a bit unsafe but sufficient for our case: exception types usually have „Exception“ as suffix in their names, e.g. „IllegalArgumentException“. Let’s execute the following query:
match (e:Type) where e.fqn=~ ".*Exception" set e:Exception return e.fqn order by e.fqn
It might be confusing that the result contains the same exception type more than once (e.g. antlr.ANTLRException). The explanation for that is that a node is created for each artifact which requires a specific exception type – obviously two of our inspected libraries depend on ANTLR so we see two nodes.
If we inspect the query a bit further we notice that it contains a clause „set e:Exception“ – we’re adding a label „Exception“ to each type which has been identified as an exception type according to our naming heuristic. The idea behind this is to make further queries easier to write and read – from now on we can just use the label instead of filtering by type names:
match (e:Exception:Type) return e.fqn
Note: if we’d use the rule mechanism provided by jQAssistant this query would become a so called concept.
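The naming heuristic can be tried outside the database as well. The following Python sketch applies the same pattern to a few type names; the list of candidates is made up for illustration, and com.example.ExceptionHandler is a hypothetical type. It also shows where the heuristic is unsafe: throwables that do not end in „Exception“ are missed.

```python
import re

# Same pattern as in the Cypher query. The =~ operator has to match the
# whole string, which re.fullmatch mirrors here.
EXCEPTION_PATTERN = re.compile(r".*Exception")


def looks_like_exception(fqn: str) -> bool:
    """Return True if the fully qualified name ends with 'Exception'."""
    return EXCEPTION_PATTERN.fullmatch(fqn) is not None


candidates = [
    "java.lang.IllegalArgumentException",
    "antlr.ANTLRException",
    "java.lang.Error",               # a Throwable that the heuristic misses
    "com.example.ExceptionHandler",  # hypothetical: contains, but does not end with, the suffix
]

# Corresponds to the nodes that would receive the label "Exception".
labeled = [fqn for fqn in candidates if looks_like_exception(fqn)]
```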
The first problem is solved, so let’s see what the second one is and how to get around it: there’s no information available in the graph about the point at which an exception of a specific type is thrown. This is a limitation of the current byte code scanner (shame on the author…); maybe it’s an interesting feature to be implemented in the future.
We can apply another heuristic to find methods which throw exceptions: usually an instance must be created before it can be thrown. So we’ll just look for constructor invocations of exception types as the following query does:
match (e:Exception)-[:DECLARES]->(c:Constructor), (a:Artifact)-[:CONTAINS]->(t:Type)-[:DECLARES]->(m:Method)-[i:INVOKES]->(c) where not (m:Constructor) return a.fileName as file, e.fqn as exception, t.fqn as type, m.signature as method, i.lineNumber order by exception, file, type, method
The screenshot only shows the first results of the query. By scrolling down in the browser we see that some unexpected exception types are used by the implementations, e.g. java.lang.Exception, which usually is not considered good practice.
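The idea behind the constructor heuristic can be sketched in a few lines of Python. The record type ConstructorCall and the sample data are assumptions for illustration and do not reflect jQAssistant’s internal model. The filter keeps constructor invocations of exception types and drops callers which are constructors themselves, as these typically just call the constructor of a super class.

```python
from typing import Iterable, List, NamedTuple, Set


class ConstructorCall(NamedTuple):
    """One INVOKES edge whose target is a constructor (hypothetical model)."""
    caller_type: str             # fqn of the type declaring the calling method
    caller_signature: str        # signature of the calling method
    caller_is_constructor: bool  # True if the caller itself is a constructor
    created_type: str            # fqn of the type whose constructor is invoked
    line_number: int


def exception_creations(calls: Iterable[ConstructorCall],
                        exception_types: Set[str]) -> List[ConstructorCall]:
    """Keep calls that create an exception from within a regular method."""
    hits = [
        call for call in calls
        if call.created_type in exception_types and not call.caller_is_constructor
    ]
    return sorted(hits, key=lambda c: (c.created_type, c.caller_type, c.caller_signature))


calls = [
    ConstructorCall("com.example.Dao", "void save(java.lang.Object)", False,
                    "java.lang.IllegalArgumentException", 42),
    # A constructor calling its super constructor: excluded by the filter.
    ConstructorCall("com.example.DaoException", "void <init>(java.lang.String)", True,
                    "java.lang.RuntimeException", 7),
    # Creation of a non-exception type: not of interest.
    ConstructorCall("com.example.Dao", "void load()", False,
                    "java.util.ArrayList", 13),
]
found = exception_creations(
    calls, {"java.lang.IllegalArgumentException", "java.lang.RuntimeException"})
```

Note that an exception which is created but never thrown (e.g. only logged) would still show up, so the result is an approximation.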
Things are getting more interesting if we filter for exceptions that are provided by the JRE (hence the additional where clause containing a regular expression):
match (e:Exception)-[:DECLARES]->(c:Constructor), (a:Artifact)-[:CONTAINS]->(t:Type)-[:DECLARES]->(m:Method)-[i:INVOKES]->(c) where not (m:Constructor) and e.fqn =~ "java\\.lang\\..*" return a.fileName as file, e.fqn as exception, t.fqn as type, m.signature as method, i.lineNumber order by exception, file, type, method
The result shows that the libraries are also creating instances of „java.lang.NullPointerException“ – did we expect that?
In this section we will start to gather some metrics that may help to identify hotspots in the JPA libraries. Let’s start with a quite simple one:
match (a:Artifact)-[:CONTAINS]->(t:Type)-[:DECLARES]->(m:Method) return a.fileName, t.fqn, count(m) as Methods order by Methods desc limit 10
The query returns types ordered descending by the number of methods they declare. The values are quite impressive!
But wait, interpretation of metrics always requires some knowledge about the context: at first glance the type org.apache.openjpa.kernel.jpql.JPQL with 942 methods seems to be a hotspot. However, it’s most likely that the class has been generated from a grammar representing the query language of JPA, so we’re not really interested in it (unless we want to start a discussion about allowed complexity in generated code).
The same probably holds for the second candidate, i.e. org.hibernate.internal.CoreMessageLogger_$logger. If we take a deeper look at it (e.g. by another query or simply decompiling the class file) we see that it contains log messages and methods which all look very similar.
The next entry in the result is the type org.eclipse.persistence.descriptors.ClassDescriptor with 455 methods – decompilation reveals code that looks hand-crafted – therefore it should be treated as a problem.
Let’s have a look at another metric – the depth of inheritance hierarchies:
match h=(class:Class)-[:EXTENDS*]->(super:Type) return class.fqn, length(h) as Depth order by Depth desc
We see a maximum of 7 levels, mostly originating from Hibernate classes that seem to represent AST (abstract syntax tree) structures. Developers usually say that they start losing orientation at about 4 levels.
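As a rough illustration of what the depth value means, the following Python sketch counts the EXTENDS hops from a class up to the top of the known hierarchy. The class names are hypothetical. The Cypher query returns one row for every path, i.e. for every ancestor; the longest path per class corresponds to the value computed here.

```python
from typing import Dict, Optional

# Hypothetical EXTENDS relations: class -> super class.
extends: Dict[str, Optional[str]] = {
    "com.example.ast.Leaf": "com.example.ast.Mid",
    "com.example.ast.Mid": "com.example.ast.Base",
    "com.example.ast.Base": "java.lang.Object",
}


def inheritance_depth(fqn: str, extends: Dict[str, Optional[str]]) -> int:
    """Number of EXTENDS hops from fqn to the last known super type."""
    depth = 0
    seen = {fqn}
    current = extends.get(fqn)
    while current is not None and current not in seen:
        depth += 1
        seen.add(current)
        current = extends.get(current)
    return depth
```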
Let’s switch to potential hotspots regarding incoming and outgoing dependencies of types. Both queries are similar except that the direction of a relationship has to be switched:
match (a:Artifact)-[:CONTAINS]->(t:Type), (t)<-[d:DEPENDS_ON]-() return a.fileName, t.fqn, count(d) as FanIn order by FanIn desc limit 10
The type with the highest fan-in is org.hibernate.HibernateException, i.e. lots of other types depend on it. So this class could be a potential hotspot as changes to it could affect lots of other classes. As it is an exception which hopefully is quite stable this is not necessarily a real problem. Looking at the next candidates we see session related types of Hibernate and EclipseLink, which is a common problem of O/R mappers.
match (a:Artifact)-[:CONTAINS]->(t:Type), (t)-[d:DEPENDS_ON]->() return a.fileName, t.fqn, count(d) as FanOut order by FanOut desc limit 10
This result shows the types with the highest fan-out, i.e. they depend on lots of other types and therefore can be treated as sensitive to changes (i.e. potentially unstable). Interestingly in the case of Hibernate 4 we can again observe session related implementations as in the result before. Even if the types are not the same (actually SessionImpl implements SessionImplementor) we get the impression that session handling is a fundamental part of unstable structures in Hibernate.
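Both metrics boil down to counting edges per node, once by their target and once by their source. A minimal Python sketch with made-up DEPENDS_ON edges (all type names are hypothetical):

```python
from collections import Counter

# Hypothetical DEPENDS_ON edges between types: (dependent, dependency).
depends_on = [
    ("com.example.SessionImpl", "com.example.AppException"),
    ("com.example.SessionImpl", "com.example.Persister"),
    ("com.example.Persister", "com.example.AppException"),
    ("com.example.Loader", "com.example.AppException"),
]

# Fan-in: number of incoming edges, i.e. how many dependencies point at a type.
fan_in = Counter(dependency for _, dependency in depends_on)

# Fan-out: number of outgoing edges, i.e. how many dependencies a type has.
fan_out = Counter(dependent for dependent, _ in depends_on)

# Equivalent of "order by ... desc limit 1".
top_fan_in = fan_in.most_common(1)
top_fan_out = fan_out.most_common(1)
```

In this toy data the exception type has the highest fan-in and the session implementation the highest fan-out, which mirrors the pattern described above on a much smaller scale.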
A common metric is cyclomatic complexity: „It is a quantitative measure of the number of linearly independent paths through a program’s source code.“ as Wikipedia explains. As a rule of thumb: the higher the value the harder it is to read and test the code as more variations must be considered.
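For reference, the textbook definition computes the value from the control flow graph as M = E - N + 2P (edges, nodes, connected components). The Python sketch below applies that formula to two tiny hand-written graphs. It only illustrates the definition and makes no claim about how a particular tool derives its estimation.

```python
from typing import Iterable, Set, Tuple


def cyclomatic_complexity(nodes: Set[str],
                          edges: Iterable[Tuple[str, str]],
                          components: int = 1) -> int:
    """McCabe's formula M = E - N + 2P for a control flow graph."""
    return len(list(edges)) - len(nodes) + 2 * components


# Straight-line method: entry -> exit.
linear_nodes = {"entry", "exit"}
linear_edges = [("entry", "exit")]

# Method with a single if/else: two paths from entry to exit.
branch_nodes = {"entry", "then", "else", "exit"}
branch_edges = [
    ("entry", "then"),
    ("entry", "else"),
    ("then", "exit"),
    ("else", "exit"),
]
```

The straight-line method yields 1 and the method with one if/else yields 2, i.e. every additional decision point adds one path that has to be read and tested.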
The Java scanner of jQAssistant gathers an estimation of cyclomatic complexity on method level so we can use it to find potential hotspot methods:
match (a:Artifact)-[:CONTAINS]->(t:Type), (t)-[:DECLARES]->(m) where has(m.cyclomaticComplexity) return m.cyclomaticComplexity as CC,a.fileName,t.name +"#" + m.signature as Method order by CC desc limit 10
The values are very high: Checkstyle by default uses a limit of 10. The good news is that the first candidates are again generated classes (JPQL related stuff) but we can also see methods declared by AnnotationBinder and Configuration from the Hibernate implementation there.
It is possible to aggregate cyclomatic complexity per type:
match (a:Artifact)-[:CONTAINS]->(t:Type), (t)-[:DECLARES]->(m) where has(m.cyclomaticComplexity) return sum(m.cyclomaticComplexity) as CC, a.fileName, t.name as Type order by CC desc limit 10
The result proves that working with JPQL obviously is not trivial but luckily the top candidates are most likely generated from a grammar. The other types in the list are hotspots which are most likely hard to test. And there’s another interesting detail: AbstractEntityPersister has already been detected during fan-out analysis – the Hibernate developers should have a look at it as the situation became even worse with the newer release.
As the last part of the analysis let’s investigate if we can find something of interest on package level. Again let’s start with a simple metric: the number of types contained in each package:
match (a:Artifact)-[:CONTAINS]->(p:Package)-[:CONTAINS]->(t:Type) return a.fileName, p.fqn, count(t) as types order by types desc limit 20
The result gives us the feeling that OpenJPA uses much bigger packages than the other libraries. We can verify this assumption by performing aggregations, i.e. determining the maximum and average number of types per package in each of the artifacts:
match (a:Artifact)-[:CONTAINS]->(p:Package)-[:CONTAINS]->(t:Type) with a, p, count(t) as types return a.fileName, max(types), avg(types)
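The query aggregates in two steps: the „with“ clause first counts types per artifact and package, then max() and avg() are evaluated per artifact. The same two steps can be sketched in Python; the rows are invented sample data, not results from the scanned libraries.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# Hypothetical (artifact, package, type) rows.
rows = [
    ("lib-a.jar", "a.core", "A1"),
    ("lib-a.jar", "a.core", "A2"),
    ("lib-a.jar", "a.core", "A3"),
    ("lib-a.jar", "a.util", "U1"),
    ("lib-b.jar", "b.all", "B1"),
    ("lib-b.jar", "b.all", "B2"),
]


def package_size_stats(rows: Iterable[Tuple[str, str, str]]) -> Dict[str, Tuple[int, float]]:
    """Return artifact -> (max types per package, average types per package)."""
    # Step 1: count types per artifact and package ("with a, p, count(t) as types").
    sizes: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for artifact, package, _type in rows:
        sizes[artifact][package] += 1
    # Step 2: aggregate the package sizes per artifact ("max(types), avg(types)").
    return {
        artifact: (max(counts.values()), mean(counts.values()))
        for artifact, counts in sizes.items()
    }


stats = package_size_stats(rows)
```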
Similar to type level metrics it is also possible to determine fan-in and fan-out on package level. The information is not directly available from the scanned data but can be inferred from type level:
match (p1:Package)-[:CONTAINS]->(t1:Type)-[:DEPENDS_ON]->(t2:Type)<-[:CONTAINS]-(p2:Package) where p1 <> p2 create unique (p1)-[:DEPENDS_ON]->(p2) return p1.fqn as package, count(distinct p2) as PackageDependencies order by PackageDependencies desc
The result contains many packages from Hibernate but there’s none coming from OpenJPA. The latter is no real surprise as OpenJPA puts many classes into one package (see the metric above) – so potential problems are hidden within the packages.
Looking at the last query we notice that it not only returns the number of outgoing dependencies per package but also creates (i.e. materializes) DEPENDS_ON relations on this level. This information is useful to find out if there are cycles between packages:
match (a:Artifact), (a)-[:CONTAINS]->(p1:Package), (a)-[:CONTAINS]->(p2:Package), (p1)-[:DEPENDS_ON]->(p2), path=shortestPath((p2)-[:DEPENDS_ON*]->(p1)) return a.fileName, p1.fqn as package, extract(p in nodes(path) | p.fqn) as Cycle order by package
The query searches for all packages p1 which depend on a package p2 where a path exists traversing DEPENDS_ON relations back to p1. The result contains p1 and all nodes which are extracted from the path.
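Both steps, deriving package dependencies from type dependencies and checking for a path back, can be illustrated with a small Python sketch. The type names are hypothetical, the package is simply taken from the fully qualified name, and unlike the query the sketch does not restrict both packages to the same artifact.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def package_of(type_fqn: str) -> str:
    """Everything before the last dot of a fully qualified type name."""
    return type_fqn.rpartition(".")[0]


def package_dependencies(type_edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Materialize DEPENDS_ON between packages from type level edges."""
    deps: Dict[str, Set[str]] = defaultdict(set)
    for source, target in type_edges:
        p1, p2 = package_of(source), package_of(target)
        if p1 != p2:  # corresponds to "where p1 <> p2"
            deps[p1].add(p2)
    return deps


def reachable(deps: Dict[str, Set[str]], start: str, goal: str) -> bool:
    """Depth-first search for a DEPENDS_ON path from start to goal."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(deps.get(node, ()))
    return False


def packages_in_cycles(deps: Dict[str, Set[str]]) -> List[str]:
    """Packages p1 that depend on some p2 from which p1 can be reached again."""
    return sorted(
        p1 for p1, targets in deps.items()
        if any(reachable(deps, p2, p1) for p2 in targets)
    )


type_edges = [
    ("x.a.A", "x.b.B"),
    ("x.b.B", "x.c.C"),
    ("x.c.C", "x.a.A2"),  # closes the cycle x.a -> x.b -> x.c -> x.a
    ("x.c.C", "x.d.D"),   # x.d is only a target and not part of a cycle
    ("x.a.A", "x.a.A2"),  # same package: ignored
]
cyclic = packages_in_cycles(package_dependencies(type_edges))
```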
We can use these cycles as a metric: how many packages in a library are involved in cycles? Higher values may be used as an indicator for structural problems which make it hard to determine the impact of changes:
match (a:Artifact), (a)-[:CONTAINS]->(p1:Package), (a)-[:CONTAINS]->(p2:Package), (p1)-[:DEPENDS_ON]->(p2), path=shortestPath((p2)-[:DEPENDS_ON*]->(p1)) return a.fileName, count(p1)
At first glance OpenJPA is the winner for that metric, but we already know that this comes at the price of big packages. The much more interesting result is the evolution of Hibernate.
jQAssistant allows us to get insight into Java artifacts by simply scanning them and executing queries over the obtained structural information on several levels. This comes at the price of writing queries on our own but with all the flexibility to apply filters according to our own needs and the possibility to enrich data with our own concepts (i.e. Exception) to make analysis easier.