Shadows Of The Past: Analysis Of Git Repositories
17 May 2017
Did you read the book Your Code As A Crime Scene by Adam Tornhill?
You definitely should, as the author goes beyond the usual ways of inspecting the code base of a project! He takes a close look at the history hidden in version control systems like Git and gathers very useful information about hotspots and coupling on file or architectural level, and even derives insights such as hidden communication structures in organizations.
Some time ago Jens Nerche (Kontext E: company site, tech blog) published the Git plugin for jQAssistant and it was immediately clear that we should perform similar kinds of analysis with it. So the idea for the talk „Shadows Of The Past – Analysis Of Git Repositories“ was born and first presented at JavaLand in March 2017. The following post explains the analysis steps which were demonstrated there; you can find the introduction slides here.
Scanning A Project’s History
A nice fact about JavaLand is that it is mainly driven by the Java community, namely the Java User Groups from all over Germany represented by their parent organization iJUG. So it is no surprise that the official conference app „DukeCon“ is also developed by the community; you can find it under the organization of the same name on GitHub.
For the talk a fork of the „dukecon_server“ project was used; you can find it at https://github.com/DirkMahler/dukecon_server. If you want to replay the steps presented here you should clone it from there to your own machine and build it using the following command:
mvn clean install -DskipTests
(we’re not interested in test results here…)
While the build is running take a look at the file pom.xml in the project root. It contains the setup for jQAssistant including the Git plugin:
<plugin>
  <groupId>com.buschmais.jqassistant</groupId>
  <artifactId>jqassistant-maven-plugin</artifactId>
  <version>${jqassistant-maven-plugin.version}</version>
  <extensions>true</extensions> <!-- required as Spring Boot packaging is activated -->
  <executions>
    <execution>
      <goals>
        <goal>scan</goal>
        <goal>analyze</goal>
      </goals>
      <configuration>
        ...
        <scanIncludes>
          <scanInclude>
            <path>${project.basedir}/.git</path>
          </scanInclude>
        </scanIncludes>
        ...
      </configuration>
    </execution>
  </executions>
  <dependencies>
    <dependency>
      <groupId>de.kontext-e.jqassistant.plugin</groupId>
      <artifactId>jqassistant.plugin.git</artifactId>
      <version>1.2.0</version>
    </dependency>
  </dependencies>
</plugin>
The Git plugin is declared as a dependency of the jQAssistant Maven plugin and a ‚scanIncludes‘ section has been added to include the local repository in the scan.
The Git Graph
For running the queries you’ll need some understanding of the data model generated by the Git scanner. The corresponding slide from the presentation can be summarized as follows:
- A commit belongs to an author and contains one or more changes where each one modifies a file.
- All commits except the initial one have at least one relation to a parent commit.
- A branch is represented by a node which references the last commit, i.e. the head of the branch.
- A parent commit with more than one child indicates that a new branch has been created.
- Commits with more than one parent relation are merge commits.
- A tag references a commit.
Using Cypher it is now possible and surprisingly easy to explore the graph! Just start the embedded Neo4j server using
mvn jqassistant:server
and open your browser with the URL http://localhost:7474. You can copy & paste the queries from this post into the top area of the Neo4j UI and execute them by hitting Ctrl-Enter.
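As a read-only warm-up (not part of the talk), the model described above can be probed directly. The following sketch relies only on the ‚HAS_PARENT‘ relation and the ‚date‘ and ‚message‘ properties that are used elsewhere in this post. It lists commits with more than one child, i.e. the points where the history forks:

```cypher
// Commits that are the parent of more than one other commit
MATCH (child:Commit)-[:HAS_PARENT]->(parent:Commit)
WITH parent, count(child) as children
WHERE children > 1
RETURN parent.date as date, parent.message as message, children
ORDER BY children desc, date desc
LIMIT 10
```

The limit of 10 is an arbitrary choice to keep the output short.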
All About Authors
Let’s start with a simple query that returns the messages of the last 20 commits and their authors:
MATCH (a:Author)-[:COMMITTED]->(c:Commit)
RETURN a.name, c.message
ORDER BY c.date desc
LIMIT 20
The result is as simple as the query is, just some history… So let’s continue with some statistics, i.e. the commits per author:
MATCH (a:Author)-[:COMMITTED]->(c:Commit)
RETURN a.name, count(c) as commits
ORDER BY commits DESC
The result shows a very common problem of Git repositories: some authors appear more than once (e.g. Niko and Gerd). This is usually caused by different user and e-mail settings used by the developers on different machines. So let’s clean up the database by deleting the duplicates and assigning their commits to one author node. This is done in two similar steps:
MATCH (a:Author)
WITH a.name as name, collect(a) as authors
WITH head(authors) as author, tail(authors) as duplicates
UNWIND duplicates as duplicate
MATCH (duplicate)-[:COMMITTED]->(c:Commit)
MERGE (author)-[:COMMITTED]->(c)
DETACH DELETE duplicate
RETURN author.name, count(duplicate)
The query collects all ‚Author‘ labeled nodes grouped by their ‚name‘ attribute. The head node of each author collection is selected and considered the original one; all others (i.e. ‚tail(authors)‘) are treated as duplicates. For each duplicate the ‚COMMITTED‘ relations to ‚Commit‘ labeled nodes are propagated to the original author and the duplicate is deleted.
The following query uses the same approach; the only difference is that the ‚email‘ attribute of the authors is used for grouping instead of the ‚name‘ attribute used before:
MATCH (a:Author)
WITH a.email as email, collect(a) as authors
WITH head(authors) as author, tail(authors) as duplicates
UNWIND duplicates as duplicate
MATCH (duplicate)-[:COMMITTED]->(c:Commit)
MERGE (author)-[:COMMITTED]->(c)
DETACH DELETE duplicate
RETURN author.name, count(duplicate)
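A small read-only check can make the effect of the cleanup visible. This is only a sketch built on the ‚name‘ and ‚email‘ attributes mentioned above. Run before the cleanup it should list the affected authors together with their e-mail addresses, afterwards it should return no rows:

```cypher
// Author names that are still represented by more than one node
MATCH (a:Author)
WITH a.name as name, count(a) as nodes, collect(distinct a.email) as emails
WHERE nodes > 1
RETURN name, nodes, emails
ORDER BY nodes desc
```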
Let’s execute the query for counting the commits per author again:
The duplicates are gone now. The next step is to get statistics about the files changed per author:
MATCH (a:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->(:Change)-[:MODIFIES]->(file:File)
RETURN a.name, count(file) as files
ORDER BY files DESC
Again there’s a problem with the result but this time it’s hidden. The numbers contain changes which are the result of merge commits and this adds unwanted noise to the statistics. For that reason merge commits (i.e. those with more than one parent relation) will be labeled with ‚Merge‘:
MATCH (c:Commit)-[:HAS_PARENT]->(p:Commit)
WITH c, count(p) as parents
WHERE parents > 1
SET c:Merge
RETURN count(c)
The query for determining the number of commits and changed files per author can be re-written as follows with an additional WHERE clause:
MATCH (a:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->(:Change)-[:MODIFIES]->(file:File)
WHERE NOT c:Merge
RETURN a.name, count(distinct c) as commits, count(file) as files
ORDER BY files DESC
Looking at the first two rows it can be observed that Falk made fewer commits than Gerd but at the same time changed many more files. This can be verified with the following query returning the changed files per commit and author:
MATCH (a:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->(:Change)-[:MODIFIES]->(file:File)
WHERE NOT c:Merge
WITH a, count(distinct c) as commits, count(file) as files
RETURN a.name, (toFloat(sum(files))/toFloat(commits)) as filesPerCommit, commits
ORDER BY filesPerCommit DESC
On average Falk changed 5.4 files per commit while the value for Gerd is much lower (1.7). Is it possible to identify the reason for that?
Let’s have a look at the files they’re typically changing. For this purpose every file node is enriched with a property that indicates the file type. The following query takes the path of each file and splits it on the character ‚.‘. The last part of the resulting collection (splittedFileName) is used as the file type and stored as a property:
MATCH (f:Git:File)
WITH f, split(f.relativePath, ".") as splittedFileName
SET f.type = splittedFileName[size(splittedFileName)-1]
RETURN f.type, count(f) as files
ORDER BY files DESC
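To see what the split expression does with less regular paths, it can be tried on a few literal values without touching the graph. The paths below are made-up examples, not necessarily files of the DukeCon repository:

```cypher
// Hypothetical sample paths, only for illustrating the expression
UNWIND ["pom.xml", "src/main/groovy/Example.groovy", "Dockerfile", ".gitignore"] as path
WITH path, split(path, ".") as splittedFileName
RETURN path, splittedFileName[size(splittedFileName)-1] as type
```

A path without any dot ends up with the complete path as its type (here „Dockerfile“), and a dot file such as „.gitignore“ is reported with the type „gitignore“. So much for the edge cases; back to the query against the real repository.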
The result reveals that the application is mainly written in Groovy. Now let’s find file types and their top authors:
MATCH (a:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->(:Change)-[:MODIFIES]->(file:File)
WHERE NOT c:Merge
RETURN file.type, a.name, count(file) as files
ORDER BY files DESC, file.type
Falk is an expert in Groovy and Java!
But what about Gerd? He seems to care a lot about XML files but what is this about? Let’s dig deeper and see which files he cared about most:
MATCH (a:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->(:Change)-[:MODIFIES]->(file:File)
WHERE a.name = "Gerd Aschemann" and not c:Merge
RETURN DISTINCT file.relativePath, count(c) as commits
ORDER BY commits desc
The result mainly contains pom.xml and application property files, so Gerd is obviously doing the build and integration management!
Short recap: After getting familiar with the structures created by the jQAssistant Git plugin and cleaning up the data it was easy to identify technology experts. I can assure you that these results are valid as I know both guys personally – trust me, they’re worth their money!
Exclusive And Distributed Ownership
Having experts in a team is a good thing in general. But this can quickly turn into a problem if knowledge about parts of an application is held exclusively by one person who might leave the team some day for some reason (not going to speculate about this here…). So are there any files which are exclusively owned by one author?
MATCH (author:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f)
WHERE NOT c:Merge
WITH collect(distinct author) as authors, f // distinct: an author with several commits on a file still counts once
WHERE size(authors)=1
UNWIND authors as author
RETURN author.name, collect(f.relativePath)
The query returns a quite large result containing many files. The screenshot shows that Gerd added some configuration files that others might not be familiar with.
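If the full list of paths is too long to read, the same idea can be condensed into a count per author. This is a sketch that reuses the pattern from above and counts every author only once per file:

```cypher
// Number of files that have been touched by exactly one author
MATCH (author:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f:File)
WHERE NOT c:Merge
WITH f, collect(distinct author) as authors
WHERE size(authors) = 1
WITH authors[0] as author, count(f) as exclusiveFiles
RETURN author.name as author, exclusiveFiles
ORDER BY exclusiveFiles desc
```

Note that the history may also contain files which have been deleted in the meantime, so the numbers are an upper bound for the current code base.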
But also the opposite case might be of interest: What about files which have been changed by many authors?
MATCH (author:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f)
WHERE NOT c:Merge
WITH f, collect(distinct author.name) as authors
RETURN size(authors) as count, f.relativePath
ORDER BY count desc
The total numbers are quite low compared to other projects. But there’s already something typical here: build descriptors (in this case pom.xml files) are on the top ranks. What does it mean?
It would require deeper inspection of what actually has been changed within the files by the different authors but one very likely reason is that this file type covers different aspects which need to be maintained by different roles: dependencies, version, assembly, configuration, project information, etc. So the question might be asked if build descriptors violate the Single Responsibility Principle (SRP).
Back In Time
Let’s continue with a look at the distribution of commits over time. For this purpose time trees are built up from the values of the ‚date‘ property of the commit nodes:
MATCH (c:Commit)
WITH c, split(c.date, "-") as parts
MERGE (y:Year{year:parts[0]})
MERGE (m:Month{month:parts[1]})-[:OF_YEAR]->(y)
MERGE (d:Day{day:parts[2]})-[:OF_MONTH]->(m)
MERGE (c)-[:OF_DAY]->(d)
RETURN y, m, d
In this case the query returns a graph which very nicely shows the tree for a commit (manually expanded) from Alexander on 2016-01-05. Based on this structure it is possible to determine the number of commits and changed files per month:
MATCH (a:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f:File),
      (c)-[:OF_DAY]->(d)-[:OF_MONTH]->(m)-[:OF_YEAR]->(y)
WHERE NOT c:Merge
RETURN y.year as year, m.month as month, count(distinct c) as commits, count(f) as files
ORDER BY year, month
JavaLand is a conference that takes place once a year usually at the end of March. So it is not a surprise that the commit activity of the team increases significantly about three months before. But it can also be observed that autumn is a good time for pushing the application forward! This pattern repeated for at least the last two years.
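If the time tree is not needed for anything else, a similar overview of commits per month might be obtained directly from the ‚date‘ property. This is only a sketch and assumes that the property is a string of the form yyyy-MM-dd, as the split on „-“ and the date 2016-01-05 mentioned above suggest:

```cypher
// Commits per month, derived from the first 7 characters of the date (yyyy-MM)
MATCH (c:Commit)
WHERE NOT c:Merge
RETURN substring(c.date, 0, 7) as month, count(c) as commits
ORDER BY month
```

The tree has the advantage that years, months and days become nodes which can be connected to other structures, while this variant leaves the graph untouched.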
Another interesting piece of information regarding commit times is their distribution over the hours of the day. This time the aggregation is performed without building up a tree of nodes:
MATCH (c:Commit)
// c.time looks like "23:16:10 +0100"
WITH c, split(c.time, " ") as timeAndTimeZone
// timeAndTimeZone[0] looks like "23:16:10"
WITH c, split(timeAndTimeZone[0], ":") as time
RETURN time[0] as hour, count(c) as commits
ORDER BY commits desc
Most commits are made after 21:00 (in the time zone of the author). This is not necessarily within the usual working hours, which leads to two possible conclusions: the team members either have very bad employers and should look for a new job, or DukeCon is just their hobby. If the latter is assumed, these late hours make it very likely that the members producing the majority of the commits first spend time with their families, and we should be grateful that they still bring all these nice things together!
Of course it is possible to ask the same question for a specific author:
MATCH (a:Author)-[:COMMITTED]->(c:Commit)
WHERE a.name="Falk Sippach"
WITH a, c, split(c.time, " ") as timeAndTimeZone
WITH a, c, split(timeAndTimeZone[0],":") as time
RETURN time[0] as hour, count(c) as commits
ORDER BY hour
The result supports the assumption that the project is developed by volunteers in their spare time. But wait: Aren’t this query and the one before intruding on the privacy of the developers? Admittedly, the project is open source and the information is available on GitHub for everyone. But until now it was hidden in the deep vaults of Git. Tools like jQAssistant make this information easily accessible and there’s definitely potential for using it in unintended and bad ways – so please don’t be evil!
Metrics Of Change
Let’s move the focus away from people towards files. If a file is changing quite often this may indicate some instability. So let’s try to identify these potential hotspots:
MATCH (c:Commit)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f:File)
WHERE NOT c:Merge
RETURN f.relativePath, count(c) as commits
ORDER BY commits desc
The top positions are held by pom.xml files and this is a very common result. The reason for this has already been discussed above (potential violation of the SRP).
The result also shows Groovy files. It would be interesting to see if there are code metrics which indicate the reason behind their change frequency. For gathering this information it is necessary to correlate the source files (represented by Git nodes) with the Groovy or Java structures. The latter are compiled to bytecode which is scanned by jQAssistant and provides the name of the source file.
The following query is used to create a „HAS_SOURCE“ relation between Groovy/Java classes and their corresponding Git files:
MATCH (p:Package)-[:CONTAINS]->(t:Type)
WITH t, p.fileName + "/" + t.sourceFileName as sourceFileName // e.g. "/org/dukecon/model/Location.java"
MATCH (f:Git:File)
WHERE f.relativePath ends with sourceFileName
MERGE (t)-[:HAS_SOURCE]->(f)
RETURN f.relativePath, collect(t.fqn)
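Not every scanned type necessarily finds its way to a Git file, e.g. if sources are generated during the build. A read-only check like the following sketch may help to spot such gaps. It uses only the labels, relations and properties from the query above and is meant to be run after it:

```cypher
// Types that provide a source file name but could not be matched to a Git file
MATCH (:Package)-[:CONTAINS]->(t:Type)
WHERE t.sourceFileName IS NOT NULL
  and not (t)-[:HAS_SOURCE]->()
RETURN t.fqn as type, t.sourceFileName as sourceFileName
ORDER BY type
```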
Using the HAS_SOURCE relation the most frequently changed classes can be gathered, including the sum of their cyclomatic complexity and effective lines of code:
MATCH (c:Commit)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f:File)
WHERE NOT c:Merge
WITH f, count(c) as commits
MATCH (t:Type)-[:HAS_SOURCE]->(f), (t)-[:DECLARES]->(m:Method)
RETURN f.relativePath as path, commits, sum(m.cyclomaticComplexity) as complexity, sum(m.effectiveLineCount) as lines
ORDER BY commits desc
The class that has been changed most often is also reported to have a relatively high number of lines but, more importantly, a very high cyclomatic complexity: DoagDataExtractor. It seems that it contains quite complex logic that makes maintenance difficult. Looking at the name of the class it seems to provide importing/conversion functionality for external data (DOAG is the German Oracle user group). Integration with other systems quite often is a pain for developers…
There are now at least two metrics reporting this class as a potential problem. Thus it becomes a candidate for a hotspot that developers should watch carefully and possibly try to simplify (i.e. refactor). There might be other metrics providing more hints: DoagDataExtractor also has the highest number of outgoing dependencies to other classes (fan-out) and therefore might also be highly sensitive to changes in the application.
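The fan-out itself was not shown in the talk. A query along the following lines could be used to determine it. This is a sketch which assumes that the HAS_SOURCE relations from above exist and that type dependencies are represented by ‚DEPENDS_ON‘ relations, as used in the section on temporal coupling:

```cypher
// Outgoing type dependencies (fan-out) of all types with a source file in Git
MATCH (t:Type)-[:HAS_SOURCE]->(:File)
OPTIONAL MATCH (t)-[:DEPENDS_ON]->(other:Type)
WHERE other <> t
RETURN t.fqn as type, count(distinct other) as fanOut
ORDER BY fanOut desc
LIMIT 10
```

The OPTIONAL MATCH keeps types without any dependency in the result (with a fan-out of 0); the limit of 10 is arbitrary.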
Temporal Coupling
In the previous section direct dependencies to other classes have been mentioned. The version control system can additionally tell whether there are files which are often changed together and thus indicate a „temporal coupling“ between code structures. The next query creates a relation „HAS_TEMPORAL_COUPLING_TO“ between two files which have been part of the same commit and adds an attribute „weight“ to it indicating the number of such commits:
MATCH (c:Commit),
      (c)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f1),
      (c)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f2)
WHERE f1 <> f2 and id(f1) < id(f2) and not c:Merge
WITH f1, f2, count(c) as commits
MERGE (f1)-[coupling:HAS_TEMPORAL_COUPLING_TO]->(f2)
SET coupling.weight = commits
RETURN count(*)
This makes it straightforward to find the pairs of files with the highest temporal coupling:
MATCH (f1:File)-[coupling:HAS_TEMPORAL_COUPLING_TO]->(f2:File)
RETURN f1.relativePath, f2.relativePath, coupling.weight as weight
ORDER BY weight desc
The result shows that the following files evolve together:
- build descriptors (pom.xml)
- classes and their test classes
- data extractor classes and their data (conference.yml)
- model classes
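The weight is an absolute number, so files that change very often will naturally show up with high values. One possible normalization (my own sketch, not part of the talk) is to relate the number of shared commits to the average number of commits of both files. The threshold of 5 shared commits is a hypothetical value for suppressing noise:

```cypher
// Shared commits relative to the average number of commits of both files
MATCH (f1:File)-[coupling:HAS_TEMPORAL_COUPLING_TO]->(f2:File)
WHERE coupling.weight >= 5 // hypothetical threshold
MATCH (c1:Commit)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f1)
WHERE NOT c1:Merge
WITH f1, f2, coupling, count(distinct c1) as commits1
MATCH (c2:Commit)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f2)
WHERE NOT c2:Merge
WITH f1, f2, coupling.weight as shared, commits1, count(distinct c2) as commits2
RETURN f1.relativePath, f2.relativePath, shared, commits1, commits2,
       toFloat(shared) / ((commits1 + commits2) / 2.0) as degree
ORDER BY degree desc, shared desc
LIMIT 20
```

A degree close to 1 means that the two files have almost always been changed together.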
Let’s focus a bit more on classes:
MATCH (t1:Type)-[:HAS_SOURCE]->(f1),
      (t2:Type)-[:HAS_SOURCE]->(f2),
      (f1)-[tc:HAS_TEMPORAL_COUPLING_TO]->(f2)
RETURN t1.fqn, t2.fqn, tc.weight as weight
ORDER BY weight desc
The result shows model classes but it looks very redundant as all inner classes are still included. Let’s label them with „Inner“:
MATCH (t1:Type)-[:DECLARES]->(t2:Type)
SET t2:Inner
RETURN t1.fqn, collect(t2.fqn)
Now inner classes may be excluded easily:
MATCH (t1:Type)-[:HAS_SOURCE]->(f1),
      (t2:Type)-[:HAS_SOURCE]->(f2),
      (f1)-[tc:HAS_TEMPORAL_COUPLING_TO]->(f2)
WHERE NOT (t1:Inner or t2:Inner)
RETURN t1.fqn, t2.fqn, tc.weight as weight
ORDER BY weight desc
The result confirms the observations made before but provides a better overview of the dependent types. At this point it becomes interesting to detect temporally coupled types which do not have an explicit type dependency, i.e. which are only implicitly coupled:
MATCH (t1:Type)-[:HAS_SOURCE]->(f1),
      (t2:Type)-[:HAS_SOURCE]->(f2),
      (f1)-[tc:HAS_TEMPORAL_COUPLING_TO]->(f2)
WHERE NOT (t1:Inner or t2:Inner)
  and not (t1)-[:DEPENDS_ON]-(t2) // allow any direction
RETURN t1.fqn, t2.fqn, tc.weight as weight
ORDER BY weight desc
Besides the model there’s a pair of classes that evolves together without having an explicit dependency: „DataProviderLoader“ and „DoagDataExtractor“.
Analysis of temporal coupling between classes may deliver interesting results but for larger projects the information is too fine-grained. The value increases if these couplings are projected onto architectural building blocks. The following query uses the generated artifacts (i.e. Maven projects) for this:
MATCH (a1:Artifact)-[:CONTAINS]->(t1:Type)-[:HAS_SOURCE]->(f1),
      (a2:Artifact)-[:CONTAINS]->(t2:Type)-[:HAS_SOURCE]->(f2),
      (f1)-[tc:HAS_TEMPORAL_COUPLING_TO]->(f2)
WHERE a1 <> a2 and not (t1:Inner or t2:Inner)
RETURN a1.fqn, a2.fqn, sum(tc.weight) as weight
ORDER BY weight desc
The tests (test-jar) evolve together with the web application and the latter responds to API changes. This is nothing surprising so what’s the value of it? The DukeCon project consists of only 2 modules and only a few people are working on it. The situation changes if the project consists of hundreds of modules and several teams are working on the same code base. Even if this is not desirable there are many projects out there which are organized like that. In this case it can be investigated which teams need to work closely together to align on changes. The following query therefore provides an overview of the main authors per artifact:
MATCH (artifact:Artifact)-[:CONTAINS]->(:Type)-[:HAS_SOURCE]->(f),
      (author:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f)
WHERE NOT c:Merge
RETURN artifact.fqn, author.name, count(c) as commits
ORDER BY artifact.fqn, commits desc
Commit Messages
Every commit contains a message and even without deep knowledge in natural language processing it is possible to extract valuable information by analyzing contained words. These can be extracted to nodes which are referenced by each corresponding commit using a „CONTAINS_WORD“ relation:
MATCH (c:Commit)
WHERE NOT c:Merge
WITH c, reduce(t=toLower(c.message), delim in [ "-","(",")","!",".",":",";","/","," ] | replace(t, delim, "")) as message //1
UNWIND split(message, " ") as word //2
WITH c, trim(word) as word
WHERE NOT word in ["", "a","the","of","and","from","to","with","for","in","by","some","on","use","as","is"] //3
MERGE (w:Word{word:word})
MERGE (c)-[:CONTAINS_WORD]->(w)
RETURN word, count(c) as count
ORDER BY count desc
The query first removes extra characters (1) from the message, splits it into an array using the space character as separator (2) and after trimming removes stop words that do not provide extra information (3).
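The three steps can be tried on a single literal message without writing anything to the graph. The message below is made up for illustration and not taken from the DukeCon history:

```cypher
// Hypothetical commit message, only for illustrating the cleanup steps
WITH "Fixed NPE in DoagDataExtractor (again), added test!" as raw
WITH reduce(t=toLower(raw), delim in [ "-","(",")","!",".",":",";","/","," ] | replace(t, delim, "")) as message
UNWIND split(message, " ") as word
WITH trim(word) as word
WHERE NOT word in ["", "a","the","of","and","from","to","with","for","in","by","some","on","use","as","is"]
RETURN word
```

This returns the words „fixed“, „npe“, „doagdataextractor“, „again“, „added“ and „test“. Note that removing characters like „-“ or „/“ glues hyphenated words and paths together, which is acceptable for a rough statistic but not for a precise analysis. Run against the real history, the full query yields the counts discussed next.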
The word that is used most often is „added“ in 86 commits, followed by „fixed“ in 44 commits. Having these words on the top positions is quite common. At first glance the ratio between both is good (about 2:1) but in row 6 there’s also the word „fix“ with 24 commits, making it a bit worse. It would be nice to remove this redundancy by applying more cleanup steps (e.g. stemming) but it is not necessary at this time. Let’s determine the ratio of fixes to all commits per author:
MATCH (a:Author)-[:COMMITTED]->(c:Commit)
WHERE NOT c:Merge
OPTIONAL MATCH (c)-[fixes:CONTAINS_WORD]->(w:Word)
WHERE w.word starts with "fix"
WITH a, count(fixes) as fixes, count(c) as commits
RETURN a.name, fixes, commits, toFloat(fixes)/toFloat(commits) as relativeFixes
ORDER BY relativeFixes DESC
First of all the numbers are only representative for authors with a high number of commits, e.g. Falk or Gerd. It can be observed that about 25% of Falk’s commits are fixes whereas for Gerd it is only 14%. But this comparison is dangerous: they work with different technologies and the reason may be hidden there. Such an analysis should be done very carefully before drawing any conclusions about the skills of people!
Much more reliable information is provided by the number of fixes per file:
MATCH (c:Commit)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f:File)
WHERE NOT c:Merge
OPTIONAL MATCH (c)-[fixes:CONTAINS_WORD]->(w:Word)
WHERE w.word starts with "fix"
WITH f, count(fixes) as fixes, count(c) as commits
WHERE commits > 10
RETURN f.relativePath, fixes, commits, toFloat(fixes)/toFloat(commits) as relativeFixes
ORDER BY relativeFixes desc
LIMIT 10
The query includes only those files with a total number of commits higher than 10 to avoid noise. The result confirms the observations made before:
- The class „DoagDataExtractor“ has already been identified as a potential hotspot (number of changes in correlation with lines of code/cyclomatic complexity). It’s not surprising that the corresponding test class „DoagDataExtractorSpec“ is also affected.
- From the name of „JavaLandDataExtractor“ it can be concluded that it plays a similar role in the system and it seems that importing/converting data is one of the more difficult problems to solve.
- Last but not least the pom.xml files are back again in the result and it seems that the build system also causes some difficulties.
Wrap Up
This post demonstrated how – with quite low effort – valuable information can be extracted from the Git history of a project, e.g.:
- Experts in domains and technologies
- Activity of authors, e.g. main developers vs. occasional contributors
- Distribution of development activity over time
- Potential hotspots in the code
- Temporal coupling of files, classes or even artifacts
The analysis steps have been performed using jQAssistant, the Git Plugin, Neo4j and the wonderful Cypher query language. This approach opens up the possibility to create reports according to individual needs. So why wait – start exploring!
And don’t forget to read the book, it provides much more interesting ideas!
2 Comments
Thank you for the detailed description! That was what I’m looking for but all other descriptions I found were not detailed at all…
Greets
John
Hi John,
thanks for the feedback! Which other descriptions are you referring to, and what were you missing?
Cheers,
Dirk