Too many R packages: CRAN is inundated with submissions

79 points by ionychal 7 hours ago

parsimo2010 4 hours ago

I feel like CRAN should be used for packages that are expressly made for others to use, and with effort put in to the documentation and vignettes.

If you’re making a package for a small team or aren’t pushing it to a large audience then just keep it on a GitHub repository. It is almost as easy to install from GitHub with devtools as it is to install.packages().

MostlyStable 4 hours ago

Exactly. I made a small package for my lab and just put it on GitHub. My guess is that many academics don't know that you can install packages from sources other than CRAN.

jdw64 6 hours ago

People would typically choose based on CRAN TaskViews or follow conventional methodologies, but what I notice from this is that R is truly a language used only by those who use it. And the people who use it are usually master's students or professors; it's rarely used at the undergraduate level. So even those with that level of academic background and training must have had their own implementation roadblocks. Could that be why the use of R has exploded with the help of AI? Looking at this, I think it's fair to understand that even domain experts found programming difficult. Seeing this, can we really say that AI is always bad? For some people, it has become both the hands and a voice for their words.

RA_Fisher 6 hours ago

Programming is a lot easier than statistics bc it’s deterministic, whereas statistics is stochastic (that extends and encompasses deterministic functions).
AI speeds up learning, so I bet that’s what you’re noticing with R.
As an aside, the best programmers these days are probabilistic programmers (who write stochastic functions). Our languages are Stan and PyMC. Both can be called by Python or R, and AI writes all of them extremely well. So it seems to me that the underlying language matters less than ever.
- davemp 5 hours ago
  
  Picking up on some Dunning-Kruger effect here.
  Programming isn’t even a field in the same way as prob&stats. Computer science does in fact have non-deterministic subfields such as information theory.
  
  RA_Fisher 5 hours ago
  
  There’ll always be boundary tending, true. Only a portion of CS deals with stochastic functions though, whereas all of statistics is stochastic. That makes a big difference, bc the world is complex.
  Information theory doesn’t even incorporate utility.
- jdw64 5 hours ago
  
  I partially agree, but I also differ on some points. The part I agree with is that probabilistic programming is difficult and that advanced programmers tend to enjoy it. Where I differ is on the claim that programming is deterministic. At the script level, programming is deterministic and sequential, but once it crosses a certain threshold, it becomes absolutely probabilistic. That's because latency, locks, and asynchronous communication start to intervene. If programming were fully deterministic, C's undefined behavior wouldn't exist; everyone would have prevented it.
  R these days mostly uses the tidyverse, which feels like a variant of DOP (Data-Oriented Programming). It's a kind of data flow, so it's different from typical OOP. I also occasionally work with statisticians (being a freelancer, ETL work is more common than you'd think), and I know what you mean by Stan and PyMC. I know they're powerful tools for Bayesian statistics and multilevel modeling. I know the basic syntax and examples, but I wouldn't say I know them well. My level is mainly focused on the scientists who hire me, and those tools still don't come up often in my country.
  That said, I think we differ on the bigger picture because academic code isn't everything. Academic code is typically algorithm‑centric, like LeetCode problems, but most production work revolves around code hygiene and responsibility (algorithms are usually already established ones). Anyway, that's not the main point. What you said is mostly correct, but my focus was on something else: even people who studied at that level can be surprisingly clumsy at expressing themselves through programming. Regardless, thanks for your input, and I agree that AI is good at programming. But using a programming language generally means understanding its tradeoffs, and R is tricky in that regard since it feels like a mix of OOP and DOP variants
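To make the deterministic/stochastic distinction in this subthread concrete, here is a minimal Python sketch. It is not from any commenter; `bootstrap_mean` and the data are made up for illustration. It shows that a stochastic function becomes reproducible once its seed is fixed, which is one sense in which stochastic functions can encompass deterministic ones:

```python
import random

def bootstrap_mean(values, seed=None):
    """Mean of one bootstrap resample of `values` (hypothetical helper)."""
    rng = random.Random(seed)  # private generator; seed=None uses system entropy or the clock
    resample = [rng.choice(values) for _ in values]
    return sum(resample) / len(resample)

data = [2.0, 4.0, 4.0, 5.0, 9.0]

# Same seed, same resample, same answer: the stochastic function is reproducible.
assert bootstrap_mean(data, seed=42) == bootstrap_mean(data, seed=42)

# Without a seed, repeated calls will usually (not always) differ.
print(bootstrap_mean(data), bootstrap_mean(data))
```

The latency and lock effects mentioned above are a different source of nondeterminism (scheduling rather than sampling), and this sketch does not capture them.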
latexr 5 hours ago

> Seeing this, can we really say that AI is always bad?
Is anyone arguing “AI is always bad”? I think the argument is clearly “the negatives outweigh the positives”.
- jdw64 5 hours ago
  
  You're right. I think I overstated it. Since English isn't my native language, I might have used some stronger words than intended. Thank you for pointing that out.
PaulHoule 5 hours ago

There is some great stuff in R but from a software engineering standpoint I'd much rather data scientists work in Python.
At risk of sounding like ChatGPT, it's not an R thing, it's a general thing. Turn [showdead] on in your profile and see how Show HN is flooded with AI slop projects and we all know GitHub is drowning in it.
- jdw64 4 hours ago
  
  I also think Python is a bit better. (Though, unlike you, my programming skills are directly tied to my livelihood, so it benefits me if one language can cover as much ground as possible. Being locked into a specific domain just narrows the number of jobs I can take on.) You're not wrong, but it makes me pretty sad that all my homepage submissions are marked as 'showdead' and no one ever sees them. Maybe my submissions would look like rubbish by your standards. But looking at it that way, there's also the gap between what people expect and what the site's filters decide.
  
  PaulHoule 3 hours ago
  
  I've got a very Clark Kent kind of a job doing very ordinary work at a university unit which is authoritative in its domain and don't talk a lot about what I do there because the last thing I want to do is have people think my opinions have anything to do with my employer (and the second to last thing I want to do is post statements to that effect!)
  I code Java and Javascript by day and mostly Python for my side projects because of the practicality. I've always been the guy who can finish projects that other people couldn't by attending to the essential details that everybody else feels entitled to ignore.
  As for your problem you are making the classic mistake of repeatedly posting links to your blog and nothing but links to your blog. If you were finding articles from other sources and posting (say) 10% from your own blog you wouldn't be tripping up the filters.
  Sure I seem to have led a glamorous life of enterprise sales, industrial espionage and always being ready to write and give a talk in 48 hours if a TED speaker gets kidnapped but the reality is I haven't had time to fix the busted Python packages that my autoposter depends on. I am way too busy transforming into a fox when I go down the elevator and casting glamours on people and I am always telling this witch that I am the familiar of that witch (and the other way around) but on the 6th floor they have no idea who they are dealing with.
  
  jdw64 3 hours ago
  
  thanks!
colechristensen 4 hours ago

A considerable amount of work for grad students is answering the question: "How the f#$% do I get this code to compile and run"
Some other researcher, often with limited skills in your native tongue, even more limited skills in software development best practices, wrote some code for a paper between 5 and 50 years ago and your PI has told you to use that code and some OTHER code together at the same time to validate some experiment he wants you to do.
In the past you would take days/weeks/months to get this to work, but with an LLM?
I'm envious of the grad students of today for the amount of nonsense which is bypassable.
- nxobject 3 hours ago
  
  The other half is: "What combination of packages and task views do I actually need to not reinvent the wheel for this particular type of analysis?"
  
  freehorse 2 hours ago
  
  And the third half is "what preprocessing and analysis methods do I actually need".
  Because I have never met a person who is great at that last part (methods theory) and sucks at the others (technical implementation; because the same work and effort leads one to train both). The issue is that AI solves all these problems at once, which will probably result in more academics understanding their methods and choices in preprocessing etc even less. At least this is what I have seen, and seen it getting worse.
  I wish the problem was just finding the right packages. Web search, and mentoring/talking to colleagues are pretty good solutions to that. LLMs are more of a gamble here if one uses them as an authoritative source: they may suggest the right package, or they may take you on a long trip to nowhere, depending on random factors.

alastairr 4 hours ago

The surprising thing to me is that it's taken as long as it has for CRAN to have this problem. As others have said, this is happening everywhere.

cscheid 4 hours ago

CRAN has very different social structures and culture than other languages' package repos. CRAN _will_ pull your package from the repository if you fail to play nice with all packages, for example. This is evidently controversial, but it's accepted practice and one possible explanation for why CRAN feels more cohesive than NPM or PyPI.
- alastairr 4 hours ago
  
  I remember the controversies well :) but I guess even their extra gating is starting to creak by the sounds of it.
polski-g 3 hours ago

spamming refresh on npmjs's latest packages list is a crazy experience.

f311a 4 hours ago

It's the same on any package index now.

Mairoce 6 hours ago

Frankly the bigger problem is an over-reliance among R instructors on the tidyverse, an ever-expanding ecosystem of redundant functions and anti-patterns. They’re teaching new R users that everything can be solved with yet another package import and skipping over teaching them how to use the already powerful and intuitive base packages.

mjhay 6 hours ago

I’m not saying it doesn’t have flaws, but the tidyverse is still the most coherent and functional ML/stat computing ecosystem I’ve ever used. R packages outside of the tidyverse can get pretty gnarly. Even the R stdlib is usually considered to be inconsistent and riddled with legacy cruft.
- 331c8c71 5 hours ago
  
  It's certainly quite pleasant to work with...but I would rather use sql for etl, the backend be whatever it needs to be...
  The real world data transformations can get gnarly very quickly and sql is the perfect common denominator compared to dplyr which is still niche...
  How do you feel about polars?
  
  mjhay 5 hours ago
  
  I’m a big fan of Polars. It’s really fast and memory efficient. With the lazy streaming functionality, I’ve been able to easily process 1 TB+ data on a single machine (you do have to be careful to not do any operation that would cause the whole DF to materialize in that case).
  It’s certainly miles better than Pandas, which has a terrible API in addition to being comically inefficient. In my group, we generally use it for any new work, and have also swapped out pandas for polars in critical spots of our existing code - the latter giving a huge benefit relative to the amount of work it took.
  I largely agree with you on SQL being the common denominator, but there are some things that are just awkward in SQL, and much easier to do in Python or other general purpose language.
- pks016 2 hours ago
  
  I may be in the minority, but I don't like the tidyverse ecosystem. I prefer data.table for most of my uses.
  
  donkeyboy 8 minutes ago
  
  Data.table is just so much faster, and the sql-like syntax is easier to understand.
- Mairoce 21 minutes ago
  
  The core of the problem is that the tidyverse is trying to turn R into a user-friendly real-time calculator, rather than a tool for stable, deterministic, and literate data analysis.
nswizzle31 5 hours ago

I couldn’t disagree more. The base packages are a complete mess. If R had been subset to only the tidyverse 5 years ago then it wouldn’t have lost so much ground to Python in nearly all fields.
Posit is obviously the only organization with the pull to do that, and I feel like they got pulled in 10 directions during the move to AI and trying to also support Python. R Shiny is dead too which sucks because reflex.dev just copied them and ate their lunch in 3 months.
- PaulHoule 5 hours ago
  
  Python is just such a good Swiss army knife and it's never a waste to learn: you can do data science and you can do almost anything else. It's the BASIC of the 21st century.
- Mairoce 5 hours ago
  
  The proof is in the pudding. Every single grad student of mine that was brought up on the tidyverse produces gigantic R markdown files with 20 imports to accomplish something that would be shorter and much much easier to understand (and review!) with a base package or with one of a small number of packages (box, data.table) designed by people who understand programming.
  Not to mention the ridiculous styling/formatting of most tidyverse users, which Wickham and others seem to promote. One of the reasons R has lost ground to other languages recently is that most R code these days is ugly
  
  tylermw 4 hours ago
  
  > The proof is in the pudding. Every single grad student of mine that was brought up on the tidyverse produces gigantic R markdown files with 20 imports to accomplish something that would be shorter and much much easier to understand (and review!) with a base package or with one of a small number of packages (box, data.table) designed by people who understand programming.
  The fact that young people are producing sub-optimal code (in terms of whatever optimization criteria you are choosing--here, it sounds like terseness) is not strong evidence that a particular software ecosystem (tidyverse) is flawed. Young people producing bad code is not surprising. They're your grad students, mentor them, and maybe they'll adapt to your ways of thinking. Or not.
  > One of the reasons R has lost ground to other languages recently is that most R code these days is ugly
  Citation needed, surely. That this article is about an increase in the number of CRAN submissions, and that pseudo-quantitative indices like the TIOBE index show R's slice of the pie growing, is evidence to the contrary.
  
  Mairoce 3 hours ago
  
  > Young people producing bad code is not surprising. They're your grad students, mentor them, and maybe they'll adapt to your ways of thinking. Or not.
  You’re right, mentorship is key and I do my best to suggest better practices. They are often quite happy to find out they can do more with less and can forget having to remember multiple additional syntaxes (looking at you “ggplot2”).
  I somewhat understand why R instructors lean towards the tidyverse - Wickham’s group produces a ton of tutorials and workbooks, so it’s easy to just point students there - but it has led to entire cohorts of people producing poor code
  
  jochapjo 1 hour ago
  
  For doing "more with less" in graphics, I would rather learn a unique syntax for a package that is based on the grammar of graphics (ggplot2) than use a package with standard syntax and some other foundation.
  
  Mairoce 34 minutes ago
  
  Good you find value in that framework, but it doesn’t seem like a useful starting point for first-time R learners interested in plotting and exploring their data. I have a colleague that integrates ggplot2 and other tidyverse packages into their undergraduate classes, and the students struggle quite a bit with creating basic plots since they now have to learn two things instead of one.
  
  zippyman55 4 hours ago
  
  That was always my struggle w tidyverse vs base mastery. From the Looney Tunes cartoon of the road runner vs the coyote, the coyote used tidyverse and the road runner used base R.
  
  Tarq0n 3 hours ago
  
  Data.table is a masterclass in bad API design. Its lack of success despite its technical merits is entirely of their own doing.

jochapjo 2 hours ago

I'm a recent first-time CRAN submitter. I believe my package went through 2 rounds of human review. I doubt R has a severe "too much AI slop" problem relative to other languages, but I can see how human reviewers would get inundated.

dizhn 4 hours ago

CRAN is not a conventional package repo. Its audience is not really people who care about programming or software. It is a means to an end for them and slop is perfectly fine. The language itself is also very simple and has defaults that people don't even bother changing. For example the default output file name. It doesn't ask for an output file name when you save output.

As a result of the above, it is full of packages that come with associated datasets right in the package itself. Packages with a tiny script and gigabytes of data. Or perhaps just the data without any actual code.

Very weird universe.

gnerd00 4 hours ago

OK you are right but that is selective for an "overview". The attention to documentation has always been outstanding for substantial packages. The culture is to make many repetitive steps into one-liner "magic" that sometimes is very very useful; lastly, the completeness of advanced statistical methods in standard libraries is real. PS: I do not like the R language at all myself, but to be fair there are reasons it is widely used in higher ed.
- dizhn 3 hours ago
  
  I only dipped into it a little bit while helping out a friend. It looked weird to me but I didn't mean to sound so negative. Sorry. I am sure it does get the job done or people wouldn't be using either R or CRAN.
- nxobject 3 hours ago
  
  > I do not like the R language at all myself, but to be fair there are reasons it is widely used in higher ed.
  In the same boat... from a PL perspective, yikes (especially the macro mechanism that somehow never seemed to be planned, but somehow exists). As a working statistician? It really does get work done quickly.
  To pass inputs with complex unevaluated syntax, I've seen...
  – ad-hoc string parsing (lavaan etc.);
  – formulas (which somehow the tidyverse doesn't use);
  – base R syntax manipulation by round-tripping between as.list and as.call;
  – and whatever wheel reinvention with bizarre semantics that the tidyverse uses.
  
  hadley 2 hours ago
  
  You can learn about the theory that underlies tidyeval at https://adv-r.hadley.nz/quasiquotation.html. I'd claim that it's neither reinventing the wheel (because it solves problems that the base equivalents do not) nor bizarre (because it is backed by a deep, well-founded theory).
hadley 4 hours ago

CRAN is a weird universe, but not (just) for the reasons you mention. CRAN is still heavily human maintained which means that there's a high chance that an actual human will look at your packages (at least for your first package). This imposes a considerably higher barrier to entry than most package repos, and hence I suspect CRAN actually has a considerably lower percentage of slop.
- tylermw 4 hours ago
  
  Absolutely correct. CRAN takes down and rejects packages all the time for minor issues and violations of their rules and guidelines. And there are a lot of them: https://cran.r-project.org/doc/manuals/r-release/R-exts.html
  The fact that there is a human (and one with expertise in R) reviewing each incoming package makes pure vibe coded slop much, much harder to get approved.
  
  nxobject 3 hours ago
  
  Even though I've dealt with this, I'm genuinely appreciative of the requirements: among many stipulations, packages that monkeypatch are prohibited (I have a few that add diagnostics to advanced analyses), online API access needs robust error handling... and there is a conformance/diagnostic suite.
  https://cran.r-project.org/web/packages/policies.html
  
  tylermw 3 hours ago
  
  Yep, as someone who both uses lots of R packages and writes lots of R packages (which, in turn, import other R packages), I've grown to appreciate the strictness of CRAN: if a package is on CRAN, I'm just about guaranteed to not have installation issues or have it screw up my environment. To me, that's the one major job of a package repository, and CRAN does it well, even if it does cause package authors pain at times :)
morpheuskafka 2 hours ago

> For example the default output file name. It doesn't ask for an output file name when you save output.
I'm not really sure what output file even means for an interpreted language, but GCC doesn't ask either, it will spit out an a.out by default (not even .elf or something logical).
- dizhn 17 minutes ago
  
  I can't find the script now but I am talking about text output to a file. Perhaps people were normally running the script in a Jupyter-like env and that's why they didn't bother with file names and it created a file with a default name when I ran it to test the interpreter.

ianbooker 5 hours ago

I see "AI and R" in three perspectives:

First, usage: Using R for our undergrads in the time of LLMs is brilliant. ChatGPT slops out working code for their needs. Not pretty but works better than in 2022.

Second, development: Mastering R is hard, because it's a calculus (Kalkül). Tidyverse mediates some of it, but still. This is the perfect breeding ground for slopification. Let's see.

Third, errata: I would love to know the percentage of science built on R to this day. I mean insights and analysis supported by it and its vast packages. What if somewhere, deep down in the stack there is an ancient bug that dented all of this? I think AI might help us here, or will review slop negate this?

colechristensen 4 hours ago

>What if somewhere, deep down in the stack there is an ancient bug that dented all of this?
Science is built on libraries with experience, that have been validated extensively against reality. Code often written by people who have retired and died because that exact same code has been validated and pinned to reality for decades. It is of course possible that a load bearing bug survives for a long time conspiring with an incorrect model of reality to give validated results, but wide use tends to eliminate these things.

piokoch 4 hours ago

We have too many videos (since creating one is so easy), too much music (since recording it is so easy), too many books (since publishing an e-book is so easy). Now the same story is happening again, for software. But this time it causes more trouble...

nickcageinacage 5 hours ago

vibe coding hell is the reason

greenavocado 5 hours ago

The solution to this problem will be a web of trust featuring a vouching system that auto-closes PRs by default. I already see this being implemented in projects.

dofm 5 hours ago

R slop. Oof.

What an awful thing to imagine. It's already the programming language of choice for egregious abuses of good practice.

ActionHank 5 hours ago

I do wonder if there isn't enough computer science / software engineering that is being taught as part of data science.
People I've worked with that used R and manged data / did analysis didn't really seem too concerned with long term maintenance.
Secondary observation: these same people were the first to preach the AI coding gospel.
- mjhay 5 hours ago
  
  Bingo. The typical data scientist has a master's or PhD in a non-CS quantitative field, and has had exactly zero CS or software eng classes. It’s a shame, because once you get over some of the idiosyncrasies, R is a really powerful and flexible functional language.
- mr_toad 5 hours ago
  
  > People I've worked with that used R and manged data / did analysis didn't really seem too concerned with long term maintenance.
  Unless you’re the poor schmuck who is given the task of running the code written by the previous analyst, who has probably already left the company. Often it’s easier to just throw something together from scratch and then look for a new job, perpetuating the problem.
- dofm 5 hours ago
  
  One of the things that always reassures me about LLMs is that as well as being trained on languages with reasonably well-designed grammars, they will also have seen lots of examples of good practice in their training set.
  Two things that make me wonder if they can possibly turn out good quality R.
  Perhaps a true test of AGI will be when you ask it to write an application in R and it refuses for fear of what people might think.
- ngriffiths 4 hours ago
  
  At my job I switch between writing analysis code for research projects and writing code for apps. The difference in mindset is so dramatic. In the same way that good software has consistent names and interfaces that are ~useless when you just need the code to run once, research code has its own requirements that are ~useless in software. It's honestly a big challenge to switch back and forth. So I think it just reflects the main skillset of the people who use it (caring is not enough).
buellerbueller 5 hours ago

Conversely, it is the programming language of choice for people who don't assume that their expertise in one domain (data science) translates into expertise in the whole of human knowledge (as we often see among techbros generally and here specifically).
As a working data scientist, I know I am not a computer scientist or a 10x engineer (hell, I am probably a 0.8x engineer), but that's not where my expertise is. My engineer co-workers are 0.01x data scientists, but you won't see me complaining that they don't know the Central Limit Theorem or how to build a causal inference engine.
- malshe 3 hours ago
  
  Your comment reminds me of a techbro blog I came across a few years ago. He was an influential "data scientist" on Twitter with CS rather than stats/econometrics background. In this post, he literally used linear regression on a categorical dependent variable. He just relabeled the categories 1, 2, 3, etc. Worse, when people pointed out the problem to him, he couldn't understand what was wrong about it and started pushing back.
  It's been a while so I don't remember any details. I don't go on Twitter/X as much as I used to in those days.
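For readers wondering why relabeling categories as 1, 2, 3 is a problem, here is a minimal Python sketch. It assumes the categories are nominal (unordered), which the comment above does not actually confirm; the data, the category names, and the `ols_slope` helper are all made up for illustration. The fitted slope depends entirely on which arbitrary integers the categories are given:

```python
def ols_slope(x, y):
    """Slope of the least-squares line y ~ a + b*x (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# Made-up data: one numeric predictor and a three-level categorical outcome.
x = [1, 2, 3, 4, 5, 6]
category = ["a", "a", "b", "b", "c", "c"]

# Two equally arbitrary ways to turn the labels into integers.
coding_1 = {"a": 1, "b": 2, "c": 3}
coding_2 = {"a": 3, "b": 1, "c": 2}

slope_1 = ols_slope(x, [coding_1[c] for c in category])
slope_2 = ols_slope(x, [coding_2[c] for c in category])

# Same data, different labels: the slope even changes sign.
print(slope_1 > 0, slope_2 < 0)
```

Because nothing in the data prefers one coding over the other, neither slope means anything. Models built for categorical outcomes, such as multinomial logistic regression, do not depend on the integer labels in this way.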