I started my PhD in NLP a year or so before the advent of Transformers, and finished it just as ChatGPT was unveiled (I literally defended a week before). Halfway through, I felt the sudden acceleration of NLP, where suddenly there was so much happening everywhere at once. Before that, knowing one’s domain and the state-of-the-art GCN, CNN or BERT architectures was enough.
Since then, I’ve been working in a semi-related area (computer-assisted humanities) as a data engineer/software developer/ML engineer (it’s a small team, so many hats). I haven’t been following the latest news much, so I recently tried to get up to speed with recent developments.
But there are so many! Everywhere. Even just in NLP, not counting all the other fields such as reinforcement learning, computer vision, all the fundamentals of ML, etc. It is damn near impossible to gather an in-depth understanding of a model, as they are so complex and numerous. All of them are built on top of other ones, so you also need to read up on those to understand anything. I follow some people on LinkedIn who just drop new names every week or so. Looking for papers in top conferences is also daunting, as there is no guarantee that an award-winning paper will translate to an actual system, while companies churn out new architectures without making the research paper/methodology public. It’s overwhelming.
So I guess my question is twofold: how does one get up to speed after a year of not really being in the field? And how does one keep up after that?
The Last Week in AI podcast is great for keeping up to date, and you can do a deep dive on whatever catches your interest.
I don’t. I’d rather focus on drilling every nook and cranny of the attention mechanism so that reading any of these papers becomes easier.
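To make that concrete, here is a minimal sketch of single-head scaled dot-product attention in plain NumPy. The variable names and shapes are my own illustrative choices, not taken from any particular paper or library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Minimal single-head attention: Q, K, V have shape (seq_len, d_k)."""
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled to keep the softmax well-behaved
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block attention to masked positions
    # Normalize scores into attention weights per query (softmax)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of the value vectors
    return weights @ V                          # (seq_len, d_k)

# Toy usage: 4 tokens, 8-dimensional queries/keys/values, self-attention
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

Once this core is second nature, most variants in new papers read as tweaks to the scores, the mask, or how heads are combined.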
I’d say if you truly understand Transformers both theoretically and intuitively, you’re already in the top 10% of MLEs. Though I’d imagine most PhDs understand it.
“Truly understand Transformers theoretically”!? Could you please share references that explain the theory behind transformers?
No one does. Do you really think that engineers/researchers at OpenAI, Google Brain/DeepMind, MS Research, Meta Research, etc. are up to date on all topics?
We’re not. We just focus on our current field of expertise/daily job for the most part. Professors at universities usually have a wider (but not deeper) view, but only the top ones.
Then it’s kind of sad, because a lot of discoveries have been made by looking at what other disciplines were doing and cross-pollinating (genetic algorithms, attention, etc.). Plus, how does one know if they want to branch into another domain? But you’re right, there is too much…
Idk, depends on your standards for what it means to keep up. I skim, pick out things that seem relevant/useful to whatever I’m focused on right now, and put more time into that paper/blog/whatever. I think everybody does the same.
Yes, it seems from all the answers that I just try to go too deep. Unfortunately, it feels like nowadays it’s just tweaking and trying architectures, and there is no “common thread” or big mechanism to know about, like there was with kernels or attention.
There’s an online group called Transferred Learnings that holds monthly sessions on the latest developments. They are private though, and vet everyone to make sure you’re actually working in ML.
Thanks! When I get back (soon) into a full-time ML position, I’ll be sure to check it out.
I believe you can join as long as you’ve been working in ML historically.
They just want to avoid non-technical folks more generally.
It’s basically impossible to be completely caught up, so don’t feel bad. I’m not really sure it’s all that useful either; you should know of technologies/techniques/architectures and what they are used for. You don’t need to know the details of how they work or how to implement them from scratch. Just being aware means you know what to research when the appropriate problem comes your way.
Also, a lot of the newest stuff is just hype and won’t stick. If you’ve been in ML research since 2017 (when transformers came out), you should know that. How many different CNN architectures came out between ResNet in 2015 and now? And still most people simply use ResNet.
My tactic is to start by checking the papers that actually GET INTO major conferences (NeurIPS, ICLR, ICML are a good start). This narrows the search considerably. Doing a Google Scholar search, for example, will just yield an insurmountable number of papers. This is, in part, due to the standard “make public before it is accepted” methodology (arXiv preprints are fantastic, but they also increase the noise level dramatically).
Now, having been burnt by the chaos of the review processes at the aforementioned conferences, I am certainly aware that their publications are by no means the “gold standard”, but the notion of peer review, including the improvements it is meant to drive, is powerful nonetheless.
That helps narrow it down. Though many discoveries are not published anymore. Reminds me of Mikolov, who was rejected pretty much everywhere, and word vectors ended up being such a big deal. Or that OpenAI does not publish their models.
Honestly, I mostly just follow Hugging Face’s blog and articles. I know there are some fancy new attention improvements, alternatives to RLHF, GPU whatever optimizations, etc., but I’m not going to implement those myself. If it’s not in Hugging Face’s ecosystem, then I most likely wouldn’t use it in my daily work/production code anyway.
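To illustrate the kind of workflow I mean, here is a minimal sketch using the `transformers` pipeline API; the checkpoint name below is just an example, swap in whatever is current on the Hub:

```python
# Requires: pip install transformers torch
from transformers import pipeline

# The pipeline API picks a reasonable default model per task, or you can name one
# explicitly (this checkpoint is just an illustrative example from the Hub).
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("Keeping up with ML papers is exhausting but rewarding."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```

If a new technique never reaches this level of integration, the odds that I would ever ship it are low anyway.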
Hugging Face is for sure a godsend, even though I’m still at a semi-loss with their API. It has changed so much, and there is so much more now that it has become a little confusing. Nothing a little work can’t fix! But that raises a question for me: how do these people manage to get every model out so fast?
Yeah, reading all their latest releases already takes me a lot of time, so I mostly just stop there. They also don’t have a lot of documentation for their latest stuff, so it takes a bit of time to figure things out. I think their packages will settle down to a more stable state after a year or two, once the NLP hype cools down a bit.
You don’t. The process is broken, but nobody cares anymore.
- Big names and labs want to maintain the status quo = churning papers out (and fighting on Twitter… erm, X, of course).
- If you’re a Ph.D. student, you just want to get the hell out of there and hopefully ride the wave a bit and make something of it = going along and churning some papers out.
- If you’re a researcher in a lab, you don’t really care as long as you try something that works, and eventually you have to prove in the yearly/biannual/whatever review that you actually did some work = churn whatever paper out.
Now, if by any chance, for any absolutely crazy reason, you’re someone who’s actually curious about understanding the foundations of ML, who wants to deeply reason about why ReLU behaves the way it does compared to ELU, or, I don’t know, who questions why a model with 90 billion parameters behaves almost the same as one compressed by a factor of 2000x while only losing 0.5% of accuracy, in brief, the science behind it all, then you’re absolutely doomed.
In ML… (DL, since you mention NLP), the name of the game is improving some “metric” with an aesthetically appealing name but not-so-strong underlying development (fairness, perplexity). All of this, of course, using 8 GPUs, 90B parameters and zero replications of your experiment. Ok, let’s be fair, there are indeed some papers that replicate their experiments a total of… 10… times. “The boxplot shows our median is higher; I won’t comment on its variance, we will leave it for future work.”
So, yes…that’s the current state of affairs right there.
Hold on, why is it useless to understand why a model which is 2000x smaller has only a 0.5% reduction in accuracy? Isn’t that insanely valuable?
It is absolutely valuable. But the mainstream is more interested in beating the next metric than in investigating why such phenomena happen. To be fair, there are quite a few researchers trying to do that. I’ve read a few papers in that direction.
But the thing is, in order to experiment with it you need 40 GPUs, and the people with 40 GPUs available are more worried about other things. That was the whole gist of my rant…
You managed to put into words what bugs me about the field nowadays. What kills me most is that third point you made: no one cares what the model does IRL, only how it improves a metric on a benchmark task and dataset. When the measure becomes the objective, you’re not doing proper science anymore.
The doomed student is me :’(
Try having done your PhD before SVMs were well known… yeah, the struggle is real…
Know some general ideas like attention, diffusion, vector DBs, backprop, dropout, U-Net, etc., and where they work best. Additionally, know the SOTA models for general use cases. When a new use case arises, you should know where to dig and how to code. Have a strong understanding of all these new concepts and feel free to code them yourself (see the sketch below). Most papers are just combinations of these general ideas. Only when you are on a project should you read in-depth papers on that use case.
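As a quick example of “code them yourself”, here is a minimal sketch of inverted dropout in NumPy; it is purely illustrative and not taken from any library:

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: randomly zero activations during training and rescale
    the survivors so the expected activation stays the same at test time."""
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    keep_prob = 1.0 - p
    mask = rng.random(x.shape) < keep_prob   # keep each unit with probability 1 - p
    return x * mask / keep_prob              # rescale surviving activations

# Toy usage
x = np.ones((2, 4))
print(dropout(x, p=0.5, rng=np.random.default_rng(0)))
```

Writing even a toy version like this once makes it much easier to recognize the idea when it reappears inside a bigger architecture.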
I spend an hour each morning going through the preprints on arxiv.org, skimming a half dozen or so and selecting perhaps one to save for our weekly symposium (I like to have at least one really good paper a week to share).
It’s usually easy to tell if a paper is a follow-up or response to another, and if that’s the case, I might skim those too. These get supplemented with what pops up here and on HN, which might extend back a few months (higher signal, less noise).
This is enough to feel like I have my finger on the pulse of one topic within ML.
Wow, that is a lot of work. It’s awesome that you manage to keep your finger on the pulse of AI, as you said. That is the kind of discipline I cannot follow. Just one hour at work in the morning would destroy the rest of my day ^^
The main trick is learning to filter out the bs “attention aware physics informed multimodal graph centric two-stage transformer attention LLM with clip-aware positional embeddings for text-to-image-to-audio-to-image-again finetuned representation learning for dog vs cat recognition and also blockchain” papers with no code.
That still leaves you with quite a few good papers, so you need to narrow down to your specific research area. There’s no way you can stay caught up on all of ML.
Yeah, those bs ones pop up everywhere. If only there was some model to sort between those and the good ones… And I’m kind of giving up on being caught up, seeing all the answers.
Read Noam Shazeer’s work; you have now caught up.
This website helps me: