Aaron Swartz has been arrested and accused of a multitude of crimes (for a breakdown of them, go here) for gaming JSTOR, a large journal retrieval site that hosts many journals. As someone who works with these retrieval services quite often, and who has actually hit the limit on the amount of citation data you can pull from them, I can say they can be frustrating. Some of the work I’m personally doing right now involves citation analysis and co-authorship analysis, which let you see networks of knowledge flows. Another method is to do a word analysis within articles, creating knowledge networks based on what the articles are about and what knowledge each of them contains. Apparently, Swartz has done something like this in the past. Some of my colleagues also use techniques that allow additional gathering of information. Most of this information, even when you have legal access, is difficult and very time-consuming to procure. In this case, Swartz had access and may have been able to get hold of this data through other means: JSTOR mentioned in one of its releases that it has a program that allows high-volume access to its publications.
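To give a feel for the co-authorship analysis idea, here is a minimal sketch: given author lists per paper, count how often each pair of authors appears together, which gives the weighted edges of a co-authorship network. The paper data and author names below are invented for illustration; real work would pull author lists from a retrieval service.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy dataset: each paper is just its list of authors.
papers = [
    ["Alice", "Bob"],
    ["Alice", "Carol"],
    ["Alice", "Bob", "Carol"],
]

# Every pair of co-authors on a paper becomes an edge; repeated
# collaboration increases the edge weight.
edges = Counter()
for authors in papers:
    for pair in combinations(sorted(authors), 2):
        edges[pair] += 1

for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- {b}: {weight}")
```

The resulting edge list can be fed into any graph library to look at who the hubs of a field are, or how knowledge flows between research groups.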
This case has also made me think of a few other issues with our current knowledge retrieval systems and repositories. Companies need to make money off these publications, so we can’t have them for free. However, through my research I’ve used articles that are 20 years old. If this knowledge were patented, I would be able to access and use it with no problem at this point; in many cases it could happen even sooner, since many patents aren’t renewed after a certain time frame. Using a scientific article is typically more like using something published under a Creative Commons license, which means you can remix the information. Through citations you give credit where it is due. In most cases you can get access to the data and models if you give the person credit, either through citations or co-authorship. Why does this work? Because the research is publicly funded.
Authors can also pay to allow full free access to their work, depending on the journal. However, in most cases they don’t, or the article doesn’t stay free continuously. There is some relief from the burden of paying for individual articles, though: Google Scholar can find articles that scientists host on their personal websites and provide access to “working paper” versions, pre-publication drafts that often remain online even after the final article has been published.
I think publicly funded research needs an exception to copyright law that changes the term from 70 years to 10 years. Depending on the field, even 10 years is too long: in the work my wife is doing, articles that old are typically cited only to give credit to trailblazers, and those papers tend to be cited in the hundreds compared to the average of the tens. Once the copyright expired, there would be much more competition to distribute the articles, which would reduce the risk to the knowledge community if any given retrieval system or journal failed.
This Swartz case scares me in general, because it will make it even more difficult to access information, and it suggests you carry a large risk if you create scripts that make it easier to get access to massive amounts of data.