@danielcompton Thanks, the wiki was helpful. I used the rsync command above to download only the pom files, around 1GB of data. I am parsing them to write to a database. I see multiple duplicate poms that result in duplicate data. Is there a reason for that? Does clojars keep redundant copies for some reason?
➜ my-wonderful-copy-of-clojars $ diff aleph/aleph/0.1.0-SNAPSHOT/aleph-0.1.0-20100502.112537-10.pom aleph/aleph/0.1.0-SNAPSHOT/aleph-0.1.0-20100502.112537-11.pom
The diff command shows no difference between the two. If there is already a database with this info, it will save me a lot of time.
Those are SNAPSHOT versions, they’re work-in-progress versions that people can publish. They probably changed in the source files but not necessarily in the POM
You can probably ignore them TbH
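A minimal sketch of ignoring them while parsing, in Python. It assumes the mirrored tree uses the `<group>/<artifact>/<version>/<file>.pom` layout seen in the paths above; the function name `unique_release_poms` is made up for illustration. It skips any version directory ending in `-SNAPSHOT` and also drops byte-identical poms by content hash.

```python
import hashlib
from pathlib import Path


def unique_release_poms(root):
    """Yield pom paths under root, skipping SNAPSHOT dirs and byte-identical duplicates."""
    seen = set()
    for pom in sorted(Path(root).rglob("*.pom")):
        # Assumed layout: <group>/<artifact>/<version>/<file>.pom
        if pom.parent.name.endswith("-SNAPSHOT"):
            continue
        digest = hashlib.sha256(pom.read_bytes()).hexdigest()
        if digest in seen:
            continue  # same bytes already yielded under another path
        seen.add(digest)
        yield pom
```

The content hash is a blunt tool: it only catches poms that are identical byte for byte, like the two in the diff above.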
Makes sense, thanks. Is there a database already available with this info? I ask because some of the projects have Unicode symbols that break the XML parsing. E.g.
file : my-wonderful-copy-of-clojars/speclj/speclj/2.1.3/speclj-2.1.3.pom
content : <comments>Copyright \251 2011 Micah Martin All Rights Reserved.</comments>
Exception :
Unhandled java.lang.IllegalAccessException
class clojure.lang.Reflector cannot access class
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException (in module java.xml) be$
java.xml does not export com.sun.org.apache.xerces.internal.impl.io to unnamed module @3629b018
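One possible reading of that failure, as a guess: `\251` is octal for byte 0xA9, which is the copyright sign in Latin-1 but not a valid UTF-8 sequence on its own, so a parser expecting UTF-8 rejects it. A minimal Python sketch of a fallback, assuming the offending poms really are Latin-1 (the helper name `parse_pom_bytes` is hypothetical):

```python
import xml.etree.ElementTree as ET


def parse_pom_bytes(raw):
    """Parse pom bytes, falling back to Latin-1 when they are not valid UTF-8."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the stray bytes are Latin-1, e.g. 0xA9 (octal 251) for the copyright sign.
        text = raw.decode("latin-1")
    # The parser receives already-decoded text, so the bytes are not decoded a second time.
    return ET.fromstring(text)
```

Latin-1 accepts any byte, so the fallback never fails, but it can silently produce wrong characters if a file is in some other encoding.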
@xtreak29 be aware that just parsing the pom files will give you the direct dependencies, not the transitive dependencies the direct ones have. In order to get the full dependency tree, you’ll need to use a maven resolver to walk the tree
one hacky (but somewhat straightforward) way to do that is with Maven: mvn -f /path/to/pom dependency:tree
Yes, that is a good catch. Also, I could see some of the libraries using dependencies in their code that are not declared in project.clj.
right, those would be dependencies brought in transitively
with dependency:tree, you’ll have to regex parse the output, but it’s regular at least
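A rough Python sketch of that regex parsing. It assumes the default text output of `dependency:tree`, where each dependency line is `[INFO] `, then 3-character indent units, then `+- ` or `\- `, then a `group:artifact:type:version:scope` coordinate (with an extra classifier field in some cases). `SAMPLE` is hand-written input for illustration, not captured Maven output.

```python
import re

# One tree line: "[INFO] ", zero or more 3-character indent units ("|  " or "   "),
# a branch marker ("+- " or "\- "), then the colon-separated coordinate.
TREE_LINE = re.compile(r"^\[INFO\] ((?:[| ] {2})*)[+\\]- (\S+)")


def parse_dependency_tree(output):
    """Return (depth, group, artifact, version, scope) tuples from dependency:tree text."""
    deps = []
    for line in output.splitlines():
        m = TREE_LINE.match(line)
        if not m:
            continue  # log noise, download messages, the root project line
        depth = len(m.group(1)) // 3 + 1
        parts = m.group(2).split(":")
        if len(parts) == 5:
            group, artifact, _type, version, scope = parts
        elif len(parts) == 6:  # coordinate carries a classifier
            group, artifact, _type, _classifier, version, scope = parts
        else:
            continue
        deps.append((depth, group, artifact, version, scope))
    return deps


# Hand-written sample input (illustrative only).
SAMPLE = r"""[INFO] Scanning for projects...
[INFO] example:app:jar:1.0.0
[INFO] +- org.clojure:clojure:jar:1.8.0:compile
[INFO] \- example:lib:jar:0.2.0:compile
[INFO]    \- org.clojure:core.async:jar:0.2.395:compile
"""

for dep in parse_dependency_tree(SAMPLE):
    print(dep)
```

Depth 1 entries are direct dependencies; anything deeper was pulled in transitively.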
Looks like the above command downloads jars
➜ ~ mvn -f my-wonderful-copy-of-clojars/speclj/speclj/2.6.1/speclj-2.6.1.pom dependency:tree
[INFO] Scanning for projects...
Downloading: https://repo.maven.apache.org/maven2/org/codehaus/mojo/build-helper-maven-plugin/1.7/build-helper-maven-plugin-1.7.pom
Downloaded: https://repo.maven.apache.org/maven2/org/codehaus/mojo/build-helper-maven-plugin/1.7/build-helper-maven-plugin-1.7.pom (6 KB at 2.9 KB/sec)
yes, it will resolve all the dependencies in order to process their pom files, unfortunately
it will also download jars it needs to run dependency:tree
Got it. That will be a network-heavy task for me, since doing a full dependency detection of clojars on a tiny DigitalOcean droplet will kill it 🙂
Right now I have parsed the poms and inserted them into MongoDB, and I am using a query to see which libraries use dependencies that cannot run on Clojure 1.9, so I can file issues.
> db.clojars.distinct("url", {dependencies: {$elemMatch : {artifactId: "core.async"}}, "version": {$lt: "0.3.442"}}, {url: 1, '_id': false})
[
"http://github.com/pleasetrythisathome/tao",
"http://dsteurer.org",
"https://github.com/cncommerce/beetlejuice",
]
# More output
The problem is that I parse every pom file, and there are cases where an older pom reports an outdated dependency that is already fixed in a newer one. I need to pick only the latest pom in each directory to avoid false positives.
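One way to pick the latest pom per artifact, sketched in Python. It assumes the `<group>/<artifact>/<version>/<file>.pom` layout and uses a deliberately naive version key (numeric parts compared as integers, anything after the first `-` treated as a pre-release qualifier). This is not Maven's actual version ordering, and both `version_key` and `latest_poms` are hypothetical names.

```python
import re
from pathlib import Path


def version_key(version):
    """Naive sort key: numeric parts as ints; a qualified version sorts below its release."""
    head, _, qualifier = version.partition("-")
    numbers = tuple(int(n) for n in re.findall(r"\d+", head))
    return (numbers, qualifier == "", qualifier)


def latest_poms(root):
    """Map each artifact directory to the pom of its highest non-SNAPSHOT version."""
    latest = {}
    for pom in Path(root).rglob("*.pom"):
        version_dir = pom.parent
        version = version_dir.name
        if version.endswith("-SNAPSHOT"):
            continue
        artifact_dir = version_dir.parent
        best = latest.get(artifact_dir)
        if best is None or version_key(version) > version_key(best.parent.name):
            latest[artifact_dir] = pom
    return latest
```

Comparing integers also avoids a trap with plain string comparison, where `"0.10.0"` sorts before `"0.9.0"`.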
@tcrawley Do you know of any database that has all this info in structured form? I am assuming that clojars uses a DB backend to construct the pages.
Clojars does store the direct dependencies in the db, but not the transitive ones
Is the db publicly available for download somewhere?
Assuming there is a dump of the db or its tables without any sensitive info, I can use that instead of relying on my own efforts.
No, there’s currently no way to get that db. We could possibly expose the dependencies via an api call, but that still wouldn’t give you the full tree
and you’d be lacking any custom repos defined in the poms to resolve the transitive tree for those dependencies
Got it. Thanks a lot for your help on this 🙂
My pleasure