Shadows Of The Past: Analysis Of Git Repositories

Have you read the book Your Code As A Crime Scene by Adam Tornhill?

You definitely should, as the author goes beyond the usual ways of inspecting the code base of a project! He takes a close look at the history hidden in version control systems like Git and gathers very useful information about hotspots and coupling on file or architectural level, and even derives insights such as hidden communication structures in organizations.

Some time ago Jens Nerche (Kontext E) published the Git plugin for jQAssistant, and it was immediately clear that we should perform similar kinds of analysis with it. So the idea for the talk "Shadows Of The Past – Analysis Of Git Repositories" was born and first presented at JavaLand in March 2017. This post explains the analysis steps that were demonstrated there; the introduction slides from the talk are available as well.

Scanning A Project’s History

A nice fact about JavaLand is that it is mainly driven by the Java community, namely the Java User Groups from all over Germany represented by their umbrella organization iJUG. So it is no surprise that the official conference app "DukeCon" is also developed by the community; you can find it on GitHub under the organization of the same name.

For the talk a fork of the "dukecon_server" project was used, which you can find at https://github.com/DirkMahler/dukecon_server. If you want to replay the steps presented here, clone it to your own machine and build it using the following command:

mvn clean install -DskipTests

(we’re not interested in test results here…)

While the build is running take a look at the file pom.xml in the project root. It contains the setup for jQAssistant including the Git plugin:

<plugin>
  <groupId>com.buschmais.jqassistant</groupId>
  <artifactId>jqassistant-maven-plugin</artifactId>
  <version>${jqassistant-maven-plugin.version}</version>
  <extensions>true</extensions> <!-- required as Spring Boot packaging is activated -->
  <executions>
    <execution>
      <goals>
        <goal>scan</goal>
        <goal>analyze</goal>
      </goals>
      <configuration>
        ...
        <scanIncludes>
          <scanInclude>
            <path>${project.basedir}/.git</path>
          </scanInclude>
        </scanIncludes>
        ... 
      </configuration>
    </execution>
  </executions>
  <dependencies>
    <dependency>
      <groupId>de.kontext-e.jqassistant.plugin</groupId>
      <artifactId>jqassistant.plugin.git</artifactId>
      <version>1.2.0</version>
    </dependency>
  </dependencies>
</plugin>

The Git plugin is declared as a dependency of the jQAssistant Maven plugin, and a 'scanIncludes' section has been added to include the local repository in the scan.

The Git Graph

For running the queries you’ll need some understanding of the data model generated by the Git scanner. The following points summarize the model as shown on the slide from the presentation:

  • A commit belongs to an author and contains one or more changes where each one modifies a file.
  • All commits except the initial one have at least one relation to a parent commit.
  • A branch is represented by a node which references the last commit, i.e. the head of the branch.
  • A parent commit with more than one child indicates that a new branch has been created.
  • Commits with more than one parent relation are merge commits.
  • A tag references a commit.

Using Cypher it is now possible and surprisingly easy to explore the graph! Just start the embedded Neo4j server using

mvn jqassistant:server

and open your browser with the URL http://localhost:7474. You can copy & paste the queries from this post into the top area of the Neo4j UI and execute them by hitting Ctrl-Enter.
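To get a first impression of the structures described above, a query along the following lines lists which kinds of nodes the Git scanner connected with which relations. It is only a sketch and assumes that every node created by the Git plugin carries the label 'Git' (as the 'Git:File' nodes used later in this post do); the aliases 'sourceLabels', 'relation' and 'targetLabels' are chosen freely.

```cypher
// Assumption: all nodes created by the Git scanner carry the label Git
MATCH
  (source:Git)-[r]->(target:Git)
RETURN DISTINCT
  labels(source) as sourceLabels, type(r) as relation, labels(target) as targetLabels
ORDER BY
  relation
```

The relation names returned here are the ones that can be used in the patterns of all following queries.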

All About Authors

Let’s start with a simple query that returns the messages of the 20 most recent commits and their authors:

MATCH
  (a:Author)-[:COMMITTED]->(c:Commit)
RETURN
  a.name, c.message
ORDER BY
  c.date desc
LIMIT 20

The result is as simple as the query is, just some history… So let’s continue with some statistics, i.e. the commits per author:

MATCH
  (a:Author)-[:COMMITTED]->(c:Commit)
RETURN
  a.name, count(c) as commits
ORDER BY
  commits DESC

The result shows a very common problem of Git repositories: some authors appear more than once (e.g. Niko and Gerd). This is usually caused by different user and e-mail settings used by the developers on different machines. So let’s clean up the database by assigning the commits of duplicates to a single author node and deleting the duplicates. This is done in two similar steps:

MATCH
  (a:Author)
WITH
  a.name as name, collect(a) as authors
WITH
  head(authors) as author, tail(authors) as duplicates
UNWIND
  duplicates as duplicate
MATCH
  (duplicate)-[:COMMITTED]->(c:Commit)
MERGE
  (author)-[:COMMITTED]->(c)
DETACH DELETE
  duplicate
RETURN
  author.name, count(duplicate)

The query collects all nodes labeled 'Author', grouped by their 'name' attribute. The head node of each collection is treated as the original one, all others (i.e. 'tail(authors)') are treated as duplicates. For each duplicate the 'COMMITTED' relations to 'Commit' nodes are propagated to the original author and the duplicate is deleted.
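Since this clean-up deletes nodes, it may be worth previewing first which authors would be affected. The following is a read-only sketch that relies only on the 'name' and 'email' attributes already used in this post; the aliases 'nodes' and 'emails' are chosen freely.

```cypher
// Read-only preview: author names that occur on more than one node
MATCH
  (a:Author)
WITH
  a.name as name, collect(a) as authors
WHERE
  size(authors) > 1
RETURN
  name, size(authors) as nodes, [author in authors | author.email] as emails
ORDER BY
  nodes DESC
```

Grouping by a.email instead of a.name gives the corresponding preview for the e-mail based clean-up.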

The following query uses the same approach; the only difference is that the 'email' attribute of the authors is used for grouping instead of the 'name' attribute:

MATCH
  (a:Author)
WITH
  a.email as email, collect(a) as authors
WITH
  head(authors) as author, tail(authors) as duplicates
UNWIND
  duplicates as duplicate
MATCH
  (duplicate)-[:COMMITTED]->(c:Commit)
MERGE
  (author)-[:COMMITTED]->(c)
DETACH DELETE
  duplicate
RETURN
  author.name, count(duplicate)

Executing the query for counting the commits per author again shows that the duplicates are gone. The next step is to get statistics about the files changed per author:

MATCH
  (a:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->(:Change)-[:MODIFIES]->(file:File)
RETURN
  a.name, count(file) as files
ORDER BY
  files DESC

Again there’s a problem with the result, but this time it’s hidden: the numbers contain changes which are the result of merge commits, and this adds unwanted noise to the statistics. For that reason merge commits (i.e. those with more than one parent relation) will be labeled with 'Merge':

MATCH
  (c:Commit)-[:HAS_PARENT]->(p:Commit)
WITH
  c, count(p) as parents
WHERE
  parents > 1
SET
  c:Merge
RETURN
  count(c)

The query for determining the number of commits and changed files per author can be re-written as follows with an additional WHERE clause:

MATCH
  (a:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->(:Change)-[:MODIFIES]->(file:File)
WHERE NOT
  c:Merge
RETURN
  a.name, count(distinct c) as commits, count(file) as files
ORDER BY
  files DESC

Looking at the first two rows, it can be observed that Falk made fewer commits than Gerd but at the same time changed many more files. This can be verified with the following query returning the changed files per commit and author:

MATCH
  (a:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->(:Change)-[:MODIFIES]->(file:File)
WHERE NOT
  c:Merge
WITH
  a, count(distinct c) as commits, count(file) as files
RETURN
  a.name, toFloat(files)/toFloat(commits) as filesPerCommit, commits
ORDER BY
  filesPerCommit DESC

On average Falk changed 5.4 files per commit while the value for Gerd is much lower (1.7). Is it possible to identify the reason for that?
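An average can be dominated by a few very large commits, so one quick check is to list the commits that touched the most files. This is just a sketch built from the same pattern as above; it additionally assumes that showing 'c.date' and 'c.message' is helpful, and the limit of 10 is arbitrary.

```cypher
MATCH
  (a:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->(:Change)-[:MODIFIES]->(file:File)
WHERE NOT
  c:Merge
WITH
  // one row per commit, so identical messages are not merged
  a, c, count(file) as files
RETURN
  a.name, c.date, c.message, files
ORDER BY
  files DESC
LIMIT 10
```

If the top rows belong to only one or two commits of an author, the average for that author says little about the typical commit.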

Let’s have a look at the files they typically change. To this end every file node is enriched with a property that indicates the file type. The following query takes the relative path of each file and splits it on the character '.'. The last part of the resulting collection (splittedFileName) is used as the file type and stored as a property:

MATCH
  (f:Git:File)
WITH
  f, split(f.relativePath, ".") as splittedFileName
SET
  f.type = splittedFileName[size(splittedFileName)-1]
RETURN 
  f.type, count(f) as files
ORDER BY
  files DESC

The result reveals that the application is mainly written in Groovy. Now let’s find file types and their top authors:

MATCH
  (a:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->(:Change)-[:MODIFIES]->(file:File)
WHERE NOT
  c:Merge
RETURN
  file.type, a.name, count(file) as files
ORDER BY
  files DESC, file.type

Falk is an expert for Groovy and Java!
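If only the top author per file type is of interest, the previous result can be condensed. The following sketch assumes the 'type' property set above and relies on collect() keeping the order established by the preceding ORDER BY; the aliases 'ranking' and 'topAuthor' are chosen freely.

```cypher
MATCH
  (a:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->(:Change)-[:MODIFIES]->(file:File)
WHERE NOT
  c:Merge
WITH
  file.type as type, a.name as author, count(file) as files
ORDER BY
  files DESC
WITH
  // per file type: authors sorted by number of changed files
  type, collect({author: author, files: files}) as ranking
RETURN
  type, ranking[0].author as topAuthor, ranking[0].files as files
ORDER BY
  files DESC
```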

But what about Gerd? He seems to care a lot about XML files but what is this about? Let’s dig deeper and see which files he cared about most:

MATCH
  (a:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->(:Change)-[:MODIFIES]->(file:File)
WHERE
  a.name = "Gerd Aschemann"
  and not c:Merge
RETURN DISTINCT
  file.relativePath, count(c) as commits
ORDER BY
  commits desc

The result mainly contains pom.xml and application property files, so Gerd is obviously doing the build and integration management!

Short recap: After getting familiar with the structures created by the jQAssistant Git plugin and cleaning up the data, it was easy to identify technology experts. I can assure you that these results are valid as I know both guys personally – trust me, they’re worth their money!

Exclusive And Distributed Ownership

Having experts in a team is in general a good thing. But this can quickly turn into a problem if knowledge about parts of an application is held exclusively by one person who might leave the team some day for some reason (not going to speculate about this here…). So are there any files which are exclusively owned by one author?

MATCH
  (author:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f)
WHERE NOT
  c:Merge
WITH
  collect(distinct author) as authors, f
WHERE
  size(authors)=1
UNWIND 
  authors as author
RETURN
  author.name, collect(f.relativePath)

The query returns a fairly large result containing many files. It shows, for example, that Gerd added some configuration files that others might not be familiar with.
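Because the list of files is long, a summary per author may be easier to read. The following sketch counts the exclusively owned files instead of listing them; it counts each author once per file, and the alias 'exclusiveFiles' is chosen freely.

```cypher
MATCH
  (author:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f)
WHERE NOT
  c:Merge
WITH
  f, collect(distinct author) as authors
WHERE
  size(authors) = 1
WITH
  // the single remaining author of each file
  head(authors) as author, count(f) as exclusiveFiles
RETURN
  author.name, exclusiveFiles
ORDER BY
  exclusiveFiles DESC
```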

But the opposite case might also be of interest: what about files which have been changed by many authors?

MATCH
  (author:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f)
WHERE NOT
  c:Merge
WITH
  f, collect(distinct author.name) as authors
RETURN
  size(authors) as count, f.relativePath
ORDER BY
  count desc

The total numbers are quite low compared to other projects. But there’s already something typical here: build descriptors (in this case pom.xml files) are in the top ranks. What does this mean?
It would require a deeper inspection of what has actually been changed within the files by the different authors, but one very likely reason is that this file type covers different aspects which need to be maintained by different roles: dependencies, versions, assembly, configuration, project information, etc. So the question may be asked whether build descriptors violate the Single Responsibility Principle (SRP).

Back In Time

Let’s continue with a look at the distribution of commits over time. For this purpose time trees are built from the values of the 'date' property of the commit nodes:

MATCH
  (c:Commit)
WITH
  c, split(c.date, "-") as parts 
MERGE
  (y:Year{year:parts[0]})
MERGE
  (m:Month{month:parts[1]})-[:OF_YEAR]->(y)
MERGE
  (d:Day{day:parts[2]})-[:OF_MONTH]->(m)
MERGE
  (c)-[:OF_DAY]->(d)
RETURN
  y, m, d

In this case the query returns a graph which very nicely shows the tree for a commit (manually expanded) from Alexander on 2016-01-05. Based on this structure it is possible to determine the number of commits and changed files per month:

MATCH
  (a:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f:File),
  (c)-[:OF_DAY]->(d)-[:OF_MONTH]->(m)-[:OF_YEAR]->(y)
WHERE NOT
  c:Merge
RETURN
  y.year as year, m.month as month, count(distinct c) as commits, count(f) as files
ORDER BY
  year, month

JavaLand is a conference that takes place once a year, usually at the end of March. So it is no surprise that the commit activity of the team increases significantly about three months before. But it can also be observed that autumn is a good time for pushing the application forward! This pattern has repeated for at least the last two years.
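To make such a seasonal pattern easier to spot, the commits can also be aggregated per calendar month across all years. The following sketch assumes that the time tree and the 'Merge' label from the previous queries exist; the alias 'years' (number of years contributing to a month) is chosen freely.

```cypher
// Assumes the time tree and the Merge label created by the previous queries
MATCH
  (c:Commit)-[:OF_DAY]->(:Day)-[:OF_MONTH]->(m:Month)-[:OF_YEAR]->(y:Year)
WHERE NOT
  c:Merge
RETURN
  m.month as month, count(distinct y) as years, count(c) as commits
ORDER BY
  month
```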

There’s another interesting piece of information regarding commit times: their distribution over the hours of the day. This time the aggregation is performed without building up a tree of nodes:

MATCH
  (c:Commit)
WITH
  // 23:16:10 +0100
  c, split(c.time, " ") as timeAndTimeZone
WITH
  // 23:16:10
  c, split(timeAndTimeZone[0], ":") as time
RETURN
  time[0] as hour, count(c) as commits
ORDER BY
  commits desc

Most commits are made after 21:00 (within the time zone of the author), which is not necessarily within the usual working hours. This leads to two possible conclusions: either the team members have very bad employers and should look for a new job, or DukeCon is just their hobby. If the latter is assumed then, given these late hours, it is very likely that the members producing the majority of the commits first spend time with their families, and we should be grateful that they still bring all these nice things together!
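As the hours are relative to the author’s time zone, it can be useful to see which UTC offsets actually occur in the repository. The following sketch assumes that 'c.time' always has the format shown in the comments of the query above (e.g. "23:16:10 +0100"); the alias 'utcOffset' is chosen freely.

```cypher
MATCH
  (c:Commit)
WITH
  // "23:16:10 +0100" -> ["23:16:10", "+0100"]
  c, split(c.time, " ") as timeAndTimeZone
RETURN
  timeAndTimeZone[1] as utcOffset, count(c) as commits
ORDER BY
  commits DESC
```

If nearly all commits share one or two offsets (e.g. winter and summer time of the same zone), the hours of different authors can be compared directly.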

Of course it is possible to ask for the distribution of commits over the hours of the day for a specific author as well:

MATCH
  (a:Author)-[:COMMITTED]->(c:Commit)
WHERE
  a.name="Falk Sippach"
WITH
  a, c, split(c.time, " ") as timeAndTimeZone
WITH
  a, c, split(timeAndTimeZone[0],":") as time
RETURN
  time[0] as hour, count(c) as commits
ORDER BY
  hour

The result supports the assumption that the project is developed by volunteers in their spare time. But wait: aren’t this query and the one before intruding on the privacy of the developers? Of course, the project is open source and the information is available on GitHub for everyone. But until now it was hidden in the deep vaults of Git. Tools like jQAssistant make this information easily accessible, and there’s definitely potential for usage in unintended and bad ways, so please don’t be evil!

Metrics Of Change

Let’s move the focus away from people towards files. If a file is changing quite often this may indicate some instability. So let’s try to identify these potential hotspots:

MATCH
  (c:Commit)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f:File)
WHERE NOT
  c:Merge
RETURN
  f.relativePath, count(c) as commits
ORDER BY
  commits desc

The top positions are held by pom.xml files, which is a very common result. The reason for this has already been discussed above (potential violation of the SRP).
The result also shows Groovy files. It would be interesting to see if there are code metrics which indicate the reason behind their change frequency. For gathering this information it is necessary to correlate the source files (represented by Git nodes) with the Groovy or Java structures. The latter are compiled to bytecode, which is scanned by jQAssistant and provides the name of the source file.
The following query creates a "HAS_SOURCE" relation between Groovy/Java classes and their corresponding Git files:

MATCH
  (p:Package)-[:CONTAINS]->(t:Type)
WITH
  t, p.fileName + "/" + t.sourceFileName as sourceFileName // e.g. "/org/dukecon/model/Location.java"
MATCH
  (f:Git:File)
WHERE
  f.relativePath ends with sourceFileName
MERGE
  (t)-[:HAS_SOURCE]->(f)
RETURN
  f.relativePath, collect(t.fqn)

Using the HAS_SOURCE relation, the most frequently changed classes can be gathered together with the sum of their cyclomatic complexity and effective lines of code:

MATCH
  (c:Commit)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f:File)
WHERE NOT
  c:Merge
WITH
  f, count(c) as commits
MATCH
  (t:Type)-[:HAS_SOURCE]->(f),
  (t)-[:DECLARES]->(m:Method)
RETURN
  f.relativePath as path, commits, sum(m.cyclomaticComplexity) as complexity, sum(m.effectiveLineCount) as lines
ORDER BY
  commits desc

The class that has been changed most often is also reported to have a relatively high number of lines but, more importantly, a very high cyclomatic complexity: DoagDataExtractor. It seems that it contains quite complex logic that makes maintenance difficult. Looking at the name of the class, it seems to provide importing/conversion functionality for external data (DOAG is the German Oracle user group). Integration with other systems is quite often a pain for developers…

There are now at least two metrics reporting this class as a potential problem. Thus it becomes a candidate for being a hotspot that developers should watch carefully and possibly try to simplify (i.e. by refactoring). There might be other metrics providing more hints: DoagDataExtractor also has the highest number of outgoing dependencies to other classes (fan-out) and therefore might also be highly sensitive to changes in the application.
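The fan-out mentioned here can be determined with a query along the following lines. It is a sketch that assumes the 'HAS_SOURCE' relations created above and the 'DEPENDS_ON' relations between types provided by jQAssistant; the alias 'fanOut' and the limit of 10 are chosen freely.

```cypher
// Only types with a source file in Git are considered (HAS_SOURCE created above)
MATCH
  (t:Type)-[:HAS_SOURCE]->(:File),
  (t)-[:DEPENDS_ON]->(other:Type)
RETURN
  t.fqn as type, count(distinct other) as fanOut
ORDER BY
  fanOut DESC
LIMIT 10
```

Note that the count includes dependencies to types outside the project (e.g. from libraries) unless the pattern for 'other' is restricted further.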

Temporal Coupling

In the previous section direct dependencies to other classes were mentioned. The version control system can also tell whether there are files which are often changed together, which indicates a "temporal coupling" between code structures. The next query creates a relation "HAS_TEMPORAL_COUPLING_TO" between two files which have been part of the same commit and adds an attribute "weight" to it indicating the number of such commits:

MATCH
  (c:Commit),
  (c)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f1),
  (c)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f2)
WHERE
  f1 <> f2
  and id(f1) < id(f2)
  and not c:Merge
WITH
  f1, f2, count(c) as commits 
MERGE
  (f1)-[coupling:HAS_TEMPORAL_COUPLING_TO]->(f2)
SET
  coupling.weight = commits
RETURN
  count(*)

This makes it straightforward to find pairs of files with the highest temporal coupling:

MATCH
  (f1:File)-[coupling:HAS_TEMPORAL_COUPLING_TO]->(f2:File)
RETURN
  f1.relativePath, f2.relativePath, coupling.weight as weight
ORDER BY
  weight desc

The result shows that the following files evolve together:

  • build descriptors (pom.xml)
  • classes and their test classes
  • data extractor classes and their data (conference.yml)
  • model classes

Let’s focus a bit more on classes:

MATCH
  (t1:Type)-[:HAS_SOURCE]->(f1),
  (t2:Type)-[:HAS_SOURCE]->(f2),
  (f1)-[tc:HAS_TEMPORAL_COUPLING_TO]->(f2)
RETURN
  t1.fqn, t2.fqn, tc.weight as weight
ORDER BY
  weight desc

The result shows model classes but it looks very redundant as all inner classes are still included. Let’s label them with „Inner“:

MATCH 
  (t1:Type)-[:DECLARES]->(t2:Type)
SET
  t2:Inner
RETURN
  t1.fqn, collect(t2.fqn)

Now inner classes may be excluded easily:

MATCH
  (t1:Type)-[:HAS_SOURCE]->(f1),
  (t2:Type)-[:HAS_SOURCE]->(f2),
  (f1)-[tc:HAS_TEMPORAL_COUPLING_TO]->(f2)
WHERE NOT
  (t1:Inner or t2:Inner)  
RETURN
  t1.fqn, t2.fqn, tc.weight as weight
ORDER BY
  weight desc

The result confirms the observations made before but provides a better overview of the dependent types. At this point it becomes interesting to detect temporally coupled types which do not have an explicit type dependency, i.e. which are only implicitly coupled:

MATCH
  (t1:Type)-[:HAS_SOURCE]->(f1),
  (t2:Type)-[:HAS_SOURCE]->(f2),
  (f1)-[tc:HAS_TEMPORAL_COUPLING_TO]->(f2)
WHERE NOT
  (t1:Inner or t2:Inner)  
  and not (t1)-[:DEPENDS_ON]-(t2) // allow any direction
RETURN
  t1.fqn, t2.fqn, tc.weight as weight
ORDER BY
  weight desc

Besides the model there’s a pair of classes that evolves together without having an explicit dependency: "DataProviderLoader" and "DoagDataExtractor".

Analysis of temporal coupling between classes may deliver interesting results, but for larger projects the information is too fine-grained. The value increases if these couplings are projected to architectural building blocks. The following query uses the generated artifacts (i.e. Maven projects) for this:

MATCH
  (a1:Artifact)-[:CONTAINS]->(t1:Type)-[:HAS_SOURCE]->(f1),
  (a2:Artifact)-[:CONTAINS]->(t2:Type)-[:HAS_SOURCE]->(f2),
  (f1)-[tc:HAS_TEMPORAL_COUPLING_TO]->(f2)
WHERE
  a1 <> a2 
  and not (t1:Inner or t2:Inner)
RETURN
  a1.fqn, a2.fqn, sum(tc.weight) as weight
ORDER BY
  weight desc  

The tests (test-jar) evolve together with the web application and the latter responds to API changes. This is nothing surprising so what’s the value of it? The DukeCon project consists of only 2 modules and only a few people are working on it. The situation changes if the project consists of hundreds of modules and several teams are working on the same code base. Even if this is not desirable there are many projects out there which are organized like that. In this case it can be investigated which teams need to work closely together to align on changes. The following query therefore provides an overview of the main authors per artifact:

MATCH
  (artifact:Artifact)-[:CONTAINS]->(:Type)-[:HAS_SOURCE]->(f),
  (author:Author)-[:COMMITTED]->(c:Commit)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f)
WHERE NOT
  c:Merge
RETURN
  artifact.fqn, author.name, count(distinct c) as commits
ORDER BY
  artifact.fqn, commits desc

Commit Messages

Every commit contains a message and even without deep knowledge in natural language processing it is possible to extract valuable information by analyzing contained words. These can be extracted to nodes which are referenced by each corresponding commit using a „CONTAINS_WORD“ relation:

MATCH
  (c:Commit)
WHERE NOT
  c:Merge
WITH
  c, reduce(t=toLower(c.message), delim in [
    "-","(",")","!",".",":",";","/",","
  ] | replace(t, delim, "")) as message //1
UNWIND
  split(message, " ") as word //2
WITH
  c, trim(word) as word
WHERE NOT
  word in ["", "a","the","of","and","from","to","with","for","in","by","some","on","use","as","is"] //3
MERGE
  (w:Word{word:word})
MERGE
  (c)-[:CONTAINS_WORD]->(w)
RETURN
  word, count(c) as count
ORDER BY
  count desc

The query first removes extra characters (1) from the message, splits it into an array using the space character as separator (2) and after trimming removes stop words that do not provide extra information (3).
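The three steps can be tried out in isolation on a literal string, without touching the graph. In the following sketch the commit message is made up for illustration; everything else mirrors the query above.

```cypher
// Hypothetical commit message, used only to illustrate the three steps
WITH
  "Fixed time-zone handling in the importer (and added a test)" as rawMessage
WITH
  reduce(t=toLower(rawMessage), delim in [
    "-","(",")","!",".",":",";","/",","
  ] | replace(t, delim, "")) as message //1
UNWIND
  split(message, " ") as word //2
WITH
  trim(word) as word
WHERE NOT
  word in ["", "a","the","of","and","from","to","with","for","in","by","some","on","use","as","is"] //3
RETURN
  word
```

For this message the query returns the words fixed, timezone, handling, importer, added and test. Note that removing "-" joins "time-zone" into a single word, which may or may not be desired.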

The word that is used most often is "added" in 86 commits, followed by "fixed" in 44 commits. Having these words in the top positions is quite common. At first glance the ratio between the two is good (about 2:1), but in row 6 there’s also the word "fix" with 24 commits, making it a bit worse. It would be nice to remove this redundancy by applying more cleanup steps (e.g. stemming), but it is not necessary at this time. Let’s identify the ratio of fixes to all commits per author:

MATCH
  (a:Author)-[:COMMITTED]->(c:Commit)
WHERE NOT
  c:Merge
OPTIONAL MATCH
  (c)-[fixes:CONTAINS_WORD]->(w:Word)
WHERE
  w.word starts with "fix"
WITH
  a, count(fixes) as fixes, count(c) as commits
RETURN
  a.name, fixes, commits, toFloat(fixes)/toFloat(commits) as relativeFixes
ORDER BY
  relativeFixes DESC

First of all, the numbers are only representative for authors with a high number of commits, e.g. Falk or Gerd. It can be observed that about 25% of Falk’s commits are fixes, whereas for Gerd it is only 14%. But this comparison is dangerous: they work with different technologies and the reason may be hidden there. Such an analysis should be done very carefully before drawing any conclusions about the skills of people!
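One more detail deserves care: a commit whose message contains several words starting with "fix" (e.g. "fix" and "fixed") is matched once per word by the query above and is therefore counted more than once. A variant that counts every commit exactly once might look like the following sketch; the alias 'fixWords' is chosen freely.

```cypher
MATCH
  (a:Author)-[:COMMITTED]->(c:Commit)
WHERE NOT
  c:Merge
OPTIONAL MATCH
  (c)-[:CONTAINS_WORD]->(w:Word)
WHERE
  w.word starts with "fix"
WITH
  // one row per commit; fixWords is 0 if the message contains no such word
  a, c, count(w) as fixWords
WITH
  a, count(c) as commits, sum(CASE WHEN fixWords > 0 THEN 1 ELSE 0 END) as fixes
RETURN
  a.name, fixes, commits, toFloat(fixes)/toFloat(commits) as relativeFixes
ORDER BY
  relativeFixes DESC
```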

A much more reliable piece of information is the number of fixes per file:

MATCH
  (c:Commit)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(f:File)
WHERE NOT
  c:Merge
OPTIONAL MATCH
  (c)-[fixes:CONTAINS_WORD]->(w:Word)
WHERE
  w.word starts with "fix"
WITH
  f, count(fixes) as fixes, count(c) as commits
WHERE
  commits > 10
RETURN
  f.relativePath, fixes, commits, toFloat(fixes)/toFloat(commits) as relativeFixes
ORDER BY
  relativeFixes desc
LIMIT 
  10

The query includes only those files with a total number of commits higher than 10 to avoid noise. The result confirms the observations made before:

  • The class „DoagDataExtractor“ has already been identified as a potential hotspot (number of changes in correlation with lines of code/cyclomatic complexity). It’s not surprising that the corresponding test class „DoagDataExtractorSpec“ is also affected.
  • From the name of „JavaLandDataExtractor“ it can be concluded that it plays a similar role in the system and it seems that importing/converting data is one of the more difficult problems to solve.
  • Last but not least the pom.xml files are back again in the result, and it seems that the build system also poses some difficulties.

Wrap Up

This post demonstrated how – with quite low effort – valuable information can be extracted from the Git history of a project, e.g.:

  • Experts in domains and technologies
  • Activity of authors, e.g. main developers vs. occasional contributors
  • Distribution of development activity over time
  • Potential hotspots in the code
  • Temporal coupling of files, classes or even artifacts

The analysis steps have been performed using jQAssistant, the Git Plugin, Neo4j and the wonderful Cypher query language. This approach opens up the possibility to create reports according to individual needs. So why wait – start exploring!

And don’t forget to read the book, it provides much more interesting ideas!

About the author:

@dirkmahler

2 Comments

  1. John - 21 June 2017 - 8:55

    Thank you for the detailed description! That was what I’m looking for, but all other descriptions I found were not detailed at all…

    Greets

    John

    • Dirk Mahler - 21 June 2017 - 9:24

      Hi John,

      thanks for the feedback! Which other descriptions are you referring to, and what were you missing?

      Cheers,

      Dirk
