RCD Program Story: The University of Oklahoma

The power of human connection: How talking with a friend during a time of grief and loss led to the creation of the enviable OU Supercomputing Center

With a flagship campus in Norman, a flourishing suburb of Oklahoma City, The University of Oklahoma (OU) might not be your first thought when it comes to academic research. But at the time of our interview, OU was experiencing impressive growth, reporting a 33% increase in research funding in fiscal year 2023. CaRCC had the pleasure of speaking with Henry Neeman, director of the OU Supercomputing Center for Education and Research (OSCER), to learn how he got his start in the field and how his program is working to serve the needs of OU’s growing number of researchers.

The following Q&A has been lightly edited for brevity and clarity.

How did you get involved with OSCER?

Way back in the summer of 2000, my brother passed away, and it was exactly as awful as you would expect. As part of how I dealt with that, I talked on the phone a lot with friends. One of them was also a colleague, and the conversation wandered, as will happen. I started ranting about how there were people at OU who were using HPC, but they never cooperated, and I thought that if they did, we could get a little 32-node cluster on campus or something.

He thought I was proposing something, when in fact I was just ranting. So he went to his boss and said, “Henry has this great idea.” And his boss said, “That is a great idea. Tell Henry to do that.” So we started a little brown bag group, just getting together once a month, having lunch, talking about that stuff – again, with the hope that we would write a little MRI [NSF Major Research Instrumentation] proposal or something. In fact, I think back then it wasn’t even that concrete.

How did the program grow to where it is now? 

Well, around that time, we also got our first CIO. And this was during the dot com boom, so there was money, and he wanted to make a splash. He got wind of us, and we had already talked with the VP for research. So we had a sort-of pitch meeting with him – except I wasn’t planning to go. It was going to be just the big PIs who were going to pitch it, and I was just a little peon. But I ran into my then boss, who was running one of the big research centers, in the elevator the Friday before the meeting, and he said, “You’re coming, right?” He told me to come, bring a few slides, and be ready to talk about it. And somehow I ended up being in charge of the meeting. Everybody expected me to be in charge because I had run the brown bag group. And by the end of the meeting, everybody suddenly thought that I should be in charge of this thing, which I did not think was the case!

There was another meeting later, and at that meeting the CIO said, “I’m gonna have somebody on campus in the next two weeks to be in charge of this.” And we all thought he was blowing smoke, because back in 2001, you weren’t going to have somebody in two weeks. And when people thought about supercomputing, they didn’t say, “Oklahoma!” 

And then the next day, he called me into his office, and suddenly I was in charge of a supercomputing center that didn’t exist. So we made it exist. He managed to get us some budget for machines – I want to say low seven figures – which was way more than I was expecting. It just snowballed from there. We went from 44 users in the first year to around 1,500 users now, with about 500 or 600 of them running jobs every year. We’ve been growing now for 22 years.

Interestingly, at the beginning, only two people expressed skepticism about this. One was a dean who had himself run an HPC center in the 90s, and apparently it didn’t go well, so he was skeptical. The other was a faculty member who said, “I’m worried that five years from now it won’t exist.” Well, four times longer than that has passed, we’re still going strong, and he has remained a very loyal user all that time.

So, how did things grow from there? 

You know, it’s the nature of this kind of job that there’s no promotion path. If you’re the director of your institution’s HPC center, you can either go into some kind of national leadership role that is unrelated to your institution, or you can go to a bigger institution and be the HPC center director there, or you can try to become VPR or CIO.

Well, I very much appreciate the people who do enterprise stuff, but that’s not interesting to me personally. The CIO position is not of interest to me at all. I don’t want my boss’s job, ever. I’m very grateful to him for doing it! On the VPR part, I’ve never been a proper faculty member, so I’d have no credibility. So it’s pointless to even think about putting me in a position like that. I also don’t want it; that doesn’t sound like fun to me.

So the way to a promotion, for me, was to make the impact footprint bigger and have more people doing more stuff that’s of more value to the institution. And that’s the way we’ve approached it. I continually consider how to make our impact footprint bigger.

What sorts of impacts does OSCER have today?

We do an annual symposium, and I give a State of the Center address at the beginning of each one. Last year we just passed the billion-dollar mark of external research funding facilitated by our center. That’s a combination of OU funds and collaborator funds, not just OU; OU is about half a billion. But that billion-dollar mark was very compelling. And it’s impressive for an institution like ours, where it’s been less than 10 years that annual research expenditure has been over $100 million for the Norman campus. Overall, we’re in the $350 to $400 million range, but the Norman campus, which is where the bulk of our usage is, has been growing quickly in recent years. And I think it’s about 4,500 publications that we’ve facilitated. So that’s pretty good. Overall, we’ve done a reasonable job.

I think it’s a pretty good measure of the impact that we’ve had, kind-of a core measure that’s easy to remember, in that it represents about 20 to 22-ish percent of the external research expenditure of our institution during that period. And of course it’s ramped up, because at the beginning our contribution was minuscule, as would be expected. So now it must be higher than that, but I don’t know what the actual numbers are year by year, because I think that’s effectively impossible to measure, particularly because of the way we identify which funds we should be credited with facilitating.

We have a button that people press when they’re filling out the paperwork for a proposal, and we also beg people for information every year. I pester the faculty every September and say “Please, please, please, please tell us grants and publications that we’ve facilitated!” And every year a subset of them respond. 

Then if you get down to individual areas: OU is a big meteorology school – depending on whether you ask us or Penn State, we’re either number one or number two – so that’s a big part of what we support. We also support a lot of chemistry and chemical engineering, molecular dynamics, and a lot of high energy physics, because we’re part of the Southwest Tier 2 center for the ATLAS collaboration. Lately, of course, we’ve had big growth in machine learning.

So we’re buying more and more GPU cards. Before, when GPUs were about traditional floating point, we had very low uptake; now that it’s TensorFlow, there’s really unlimited demand. So we’re just trying to figure out how to apportion our budget on that stuff. We try to be neutral in terms of what we focus on. The history of every discipline has a kind-of sine curve of funding, up and down, and we don’t want to be in the situation where, when Discipline A loses funding, we don’t have an established relationship with Discipline B that’s on an upswing. So we try to serve everybody really well.

What are the types of clients you serve?

The clients within these sorts of research areas are typically a few different populations, ranging from those just getting started to mature researchers. It’s been a while since I’ve looked at this, but the last time I did, between five and seven years ago, we were seeing about two-thirds new users every three years. So the turnover was really high. When you say mature users, honestly, that devolves down to faculty who actually run jobs and a small number of research scientists, because most faculty have their students and postdocs run the jobs. So we have primarily grad students and postdocs. The grad students are far more master’s students than PhD students, and master’s students turn over very rapidly. Postdocs turn over pretty rapidly too – two to three years is pretty standard, right? So it’s not surprising. I don’t think anything about us is unusual in that sense.

Do you provide training for these clients? 

We do a lot of research training. We facilitate the computing part of it, which for most of them means running canned codes – most commonly free, open source community codes, and occasionally commercial stuff – so the code is a black box. We do help them deploy the code in many cases, so we have a very large software repository, but if they want to deploy it themselves in their home directory, of course, they’re welcome to do that. That’s most commonly seen these days with people doing Python.

What other services and support does OSCER offer its researchers?

We have five major systems. We have a traditional cluster supercomputer and a small internal research cloud that clients can buy into. Then on the storage side, we have two systems: a large-scale persistent disk system that they can buy into and a large-scale tape archive that they can buy tape cartridges for. And I should mention that on the supercomputer, we do both a free part and a condominium part where they can buy servers.

We also do a fairly extreme version of condominium, in the sense that if they buy it, they get to decide who runs jobs on it and when; we don’t decide that for them. We maintain root permission, but they can run batch jobs in their own queues for whatever duration they want. It’s unlimited, where unlimited is defined as until the next outage.

And then the fifth system is a science DMZ that all of those are hooked into. Of course, there’s no charge for the science DMZ. That’s funded entirely by the CIO and grant funds.

On the people side, there’s the software deployment I already mentioned. That’s a huge thing, because most of our researchers are not sufficiently experienced with Linux to be able to deploy poorly designed code with a very quirky deployment mechanism. So we run EasyBuild. And we will also help them if they want to deploy it in their home directory. Of course, we get a lot of uptake on Conda, so we help them with that. Or they can always deploy their own, and we typically don’t have visibility into that, which is fine.
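For readers unfamiliar with the self-service pattern Neeman describes, here is a minimal sketch of a researcher creating their own Conda environment in their home directory rather than requesting a central deployment. The environment name and package list are placeholders, not anything OSCER prescribes.

    # Minimal sketch: a researcher creating a personal Conda environment
    # in their home directory (Conda's default location) instead of
    # asking the center to deploy the software. The environment name and
    # packages are placeholders, not OSCER-specific.
    import subprocess

    def create_user_env(name, packages):
        """Create a user-owned conda environment with the given packages."""
        subprocess.run(
            ["conda", "create", "--yes", "--name", name, *packages],
            check=True,  # raise if conda reports an error
        )

    if __name__ == "__main__":
        create_user_env("myproject", ["python=3.11", "numpy", "pandas"])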

Then we do a significant amount of facilitation. We have effectively 1.7 FTE of facilitators who are very knowledgeable, and they spend a lot of time helping researchers be productive. Each one of them has their own individual domain expertise.

One of them is a geophysicist and the other is a high energy physicist. But they don’t develop a high degree of expertise in a discipline; instead, they develop a high degree of expertise in how to help researchers across many disciplines. OU founded and co-leads the Virtual Residency Program to train research computing facilitators, so we have a pretty good feel for how to get the facilitators to help the researchers be productive.

And then we do some training workshops. In fact, I encouraged the facilitators to put together a 2024 training calendar with a couple of different levels – an introductory level with basic Linux commands and basic Slurm stuff, and then an intermediate level, which was effectively how to run Python jobs, including for machine learning and including interactive stuff like Jupyter. We also see some users who want to use VS Code. We don’t discourage that for their Python stuff, but we also don’t encourage them, because it’s super easy to break the login nodes running VS Code. So we’d rather they do Jupyter.
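As an illustration of the kind of workflow that intermediate level covers, here is a minimal sketch of submitting a Python job to Slurm from Python itself. The partition name, time limit, and script name are hypothetical; OSCER’s actual queue configuration is not described in the interview.

    # Minimal sketch: submit a Python batch job to Slurm programmatically.
    # Assumes `sbatch` is on PATH; the partition, time limit, and script
    # name below are placeholders, not OSCER's actual configuration.
    import subprocess
    import textwrap

    BATCH_SCRIPT = textwrap.dedent("""\
        #!/bin/bash
        #SBATCH --job-name=demo
        #SBATCH --partition=normal
        #SBATCH --ntasks=1
        #SBATCH --time=01:00:00
        #SBATCH --output=demo_%j.out
        python my_analysis.py
        """)

    def submit(script):
        """Pipe a batch script to sbatch on stdin; return its confirmation."""
        result = subprocess.run(
            ["sbatch"], input=script, text=True,
            capture_output=True, check=True,
        )
        return result.stdout.strip()  # e.g. "Submitted batch job 12345"

    if __name__ == "__main__":
        print(submit(BATCH_SCRIPT))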

In total, how many people are involved at OSCER?

In addition to the 1.7 FTE of facilitators, which is two distinct human beings, there’s me, and my percentage varies, although my labor commitment never does. Sometimes I’ve got portions of my time being paid by grants, sometimes I don’t, because I spent a little over six years as joint co-manager of the XSEDE Campus Engagement Program, which includes the Campus Champions. Dana Brunson and I did that together. That’s over now, and I’m not directly involved with the ACCESS project, although I’m a booster, of course.

Then we have five full-time system administrators, though I should say that between a third and a half of their time is spent on researcher-facing tasks, including software deployment. And just this year, for the first time, we have three student employees: a grad assistant, an undergrad specialist, and an undergrad intern. Those folks are working out great. We’re really pleased with how that’s going.

What drove you to add student employees and interns? 

We’ve always been desperate for more help. We never even advertised the positions, and nobody told us they had been posted, but we got an unbelievable number of candidates, which was shocking. And it was in the middle of the summer, I might add. So the whole thing just blew my mind. And we got three great people who came in with some system administration experience, which we were not expecting. We were expecting to have to start them from the ground up.

We thought maybe the grad assistant might come in with some, but most PhD-granting institutions have zero courses on system administration, especially Linux system administration, and vanishingly few have something like a degree or certificate program in it. At my institution, the closest we have is some operating systems courses, which are not system administration. So we were just blown away by the number and quality of the candidates we interviewed. It was amazing.

How does OSCER partner with other teams regionally, nationally, and within the institution?

Okay, so, that’s always complicated and messy. We’ve never partnered in a formal sense with XSEDE. It’s more that we have a limited number of researchers who consume plenty of XSEDE resources. Some of our researchers, mostly a subset of the weather folks, are huge users of the national resources. But a large number of OU users have no interest in the national resources, have never used them, and probably never will.

If someone comes to us and says, “We want to learn how to use the national resources,” that’s straightforward to do. So we don’t have a lot of worry around that.

Regionally, we’re part of the Great Plains Network. That’s something I’m more involved in than the rest of the team, because it’s more at a strategic level than an operational level. Within the state, we have something called the OneOklahoma Cyberinfrastructure Initiative – I think it’s soon growing to five or six institutions – with resources that we make available not only to our own researchers, but to any non-commercial research or education project in the state. That’s been very successful, and the cost of doing it is very low.

Are there other OU teams doing research computing or research software that you interact with?

We have two research software groups on campus that are part of the development community for national community codes. We have a group that’s involved with WRF, and we have a group that’s involved with a coastal simulation code called ADCIRC, which is a collaboration among OU, UNC, and Notre Dame. Beyond the national community codes, we have a small contingent, maybe 5 to 10 percent of our users, who do homebrew. The big change is that there’s now a middle ground between full-on homebrew and downloading somebody else’s code, which is Python wrappers around things like TensorFlow and PyTorch – so we’re seeing more and more of that.

We are also pretty closely allied with the research data folks in our libraries, and in fact, one of them attends the OneOklahoma Cyberinfrastructure Initiative calls. There’s a data science and statistics interest group that’s led out of our Health Sciences campus in Oklahoma City. We have a good relationship with them and occasionally present to them. We’re also hooked in with governing bodies like the IT council at the Norman campus.

And then, because we’re part of central IT, we’re very closely aligned with our enterprise sister teams. We work very closely with them, and in fact, several of those teams attend our weekly meeting so that they know what’s going on with the things we’re working on. That includes the network team and the operations team, and the security team is sort-of in and out as needed.

One of the problems we saw early on was that we were seen as a bolt-on instead of a core part of central IT. Again, this is 20 years ago, not now. But when there was a budget crunch during the dot com collapse, it was floated: “Well, why don’t we just dump that? It will save us a fortune.” Happily, that did not prevail, but I’ve developed a healthy paranoia about it, and I’m always proving that we’re worth the money. So that’s part of my approach of developing strong relationships with the other teams. We want to make sure they understand the benefit of what we’re doing – not only to the institution, but also to the central IT organization.

It seems like one of the central tensions for your kind of group is straddling both worlds of the VPR and the CIO. Inevitably you end up having to explain a lot to the people you report to, and you also have to go out of your way to maintain relationships or exchange information with the other group. 

Yeah, I think that’s absolutely true. The data I’ve collected, which are by no means complete, suggest that for about two-thirds of institutions that have on-campus research computing, it’s under the CIO. About a quarter are under the VP for research, and whatever’s left is a random jumble of everything else. And most CIOs have very limited exposure to research computing, so it’s important to help them understand how different research computing is from enterprise, and why. I’ve developed a slide deck to show why it is that way and how that’s beneficial, including to the enterprise side of central IT.

When there’s someone new, it’s part of my job to help them understand it, and I take that part of my job very seriously. We’re an academic institution, so teaching is part of our job. Professional development is part of our job. An academic enterprise IT leader does need to understand the basic nature of research computing at their institution, at least if they’re at a research-intensive institution. It’s fascinating to first watch the chills running down their spine and the blood draining out of them when I explain the concept of two nines – 99% uptime – and why that’s actually normal. And then, when I explain why, you can see the light bulb go on behind their eyes and they get it. And that’s when I know I’ve done my job.
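For readers outside operations, the arithmetic behind that reaction is worth spelling out: each additional “nine” of uptime shrinks the allowed downtime by a factor of ten. A quick back-of-the-envelope sketch:

    # Back-of-the-envelope: what an uptime percentage means in allowed
    # downtime per year. Two nines is routine for research clusters with
    # scheduled maintenance; enterprise services often aim far higher.
    HOURS_PER_YEAR = 365 * 24  # 8760

    for label, uptime in [("two nines", 0.99),
                          ("three nines", 0.999),
                          ("five nines", 0.99999)]:
        downtime_hours = HOURS_PER_YEAR * (1 - uptime)
        print(f"{label} ({uptime:.5f}): ~{downtime_hours:.2f} hours/year down")

    # two nines  -> ~87.6 hours/year, over three and a half days
    # five nines -> ~0.09 hours/year, about five minutes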

Describe how your budget and funding are structured. 

On the equipment side, we have some internal core funding through OU’s regular budget, as part of IT’s regular budget. That has stayed pretty steady since the beginning, meaning it has not grown with inflation. On the people side, it’s grown: we started as two people, me and a sysadmin, and we’re now the team I just described. That’s grown substantially over that time, but the hardware and software budget is up only a little bit. And that’s fine; we make it work.

It used to be that about every three to four years, we would do a forklift upgrade for a new cluster, but the university doesn’t do leasing of large equipment anymore, so now we do rolling upgrades, where we’re making purchases every year. Obviously, that changes the labor picture quite a lot. It gives us a lot more flexibility. So if we make bad decisions at the beginning, that’s okay; we can recover. We just wait until the next fiscal year. But the flip side, of course, is that we’re always in deployment mode. We never get out of it, which was not true under the old regime.

So, other than occasional external funding for people and occasional external funding for equipment, the bulk of our people and equipment are funded internally, and the positions are either 100 percent hard money or backed by 100 percent hard money. I can buy out any portion of my time with soft money – external funds – whenever I want, and nobody has a complaint about that. But I know that if the grant money runs out, I’m still 100 percent covered. I don’t lose my job because the grant ends.

How do you manage your team and keep them organized and moving forward?

We have two all-hands meetings a week, on Tuesday and Friday mornings. And then, you know, it’s constant email. They’ve got a Slack channel. I’m rarely on the Slack channel – I’m not really a Slack person – but communication is constant. We had a crisis yesterday: the little baby Ceph system that our OpenStack uses to run all the administrative stuff and our internal research cloud crashed, and it turned out to be a bigger problem than just rebooting. So they were up until 2:40 in the morning, when we finally released it back to everybody. It was almost 24 hours of downtime, because it had started at 4:50 in the morning on Sunday and ended at 2:40 in the morning on Monday. So these guys are heroes. We are not a 24/7 shop; we are a business hours shop. They basically ruined their weekend to get back into production, and I have deep gratitude for them.

How does the team track work that has to be done?

Everybody except the students is on salary, and we don’t do any hour tracking other than PTO. We don’t do time cards, we don’t ask people to check in, and we don’t ask people to track how many hours they work a day, because they’re on salary.

There’s a ticket tracking system; that’s the way we interact with users, by email. So there’s an email address that they send email to, either to continue an existing ticket or to start a new ticket. We are typically managing dozens of tickets at a time, and we get probably half a dozen to a dozen new tickets per business day. We used to structure it so that our systems-facing folks would take shifts, where for half the day they would be responsible for whatever was coming into the ticket queue. But we’ve moved on, because that was a bit too disruptive to their ability to focus on their work. Now the facilitators are the front line on all the tickets, and they assign tickets to the systems folks, initially at random. We sort-of treat it as a machine learning problem for people: over time, they get a sense of who to assign specific kinds of tickets to. And anytime a ticket comes in and gets assigned to someone, that person is responsible for either doing it themselves, or trading it to someone else and making sure it gets done. That’s been working out pretty well. We’ve been doing that for almost a year now, and I’m pretty pleased with how it’s going.
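The “machine learning problem for people” framing maps neatly onto a simple weighted-random policy. What follows is only a toy sketch of the idea, with invented admin names and ticket categories; OSCER’s actual process is the human judgment described above.

    # Toy sketch of random-then-learned ticket routing: start uniform,
    # then bias assignment toward whoever has resolved similar tickets.
    # Admin names and ticket categories are invented for illustration.
    import random
    from collections import defaultdict

    ADMINS = ["admin_a", "admin_b", "admin_c"]
    resolved = defaultdict(lambda: defaultdict(int))  # admin -> category -> count

    def assign(category):
        # A weight of 1 + past resolutions keeps unseen categories random
        # while nudging familiar ones toward the proven handler.
        weights = [1 + resolved[a][category] for a in ADMINS]
        return random.choices(ADMINS, weights=weights)[0]

    def record_resolution(admin, category):
        resolved[admin][category] += 1

    # Example: after admin_b resolves several "storage" tickets, new
    # storage tickets are more likely (not guaranteed) to go to admin_b.
    for _ in range(5):
        record_resolution("admin_b", "storage")
    print(assign("storage"))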

What are your plans for the future of your program?

I can tell you we’ve been thinking strategically, and it depends on the specific thing. When we think about our cluster supercomputer, we don’t have grandiose plans; it’s very much continuing in the same vein, but with newer server, CPU, and GPU models. We’re just tracking stuff as it goes and adjusting on the fly based on what’s the best value proposition. We’ve become very heterogeneous in our cluster.

On the storage side, of course, you can’t play it quite that way, so we built a storage charging model that aligns with how research is funded – which is that a random amount of money shows up on a random date and is available for a random duration. We can’t do recurring charges; it’s just not practical. Instead, what we do is drive down the cost, and then we do a purchase model where they’re buying portions of one twenty-fourth of a 24-bay server full of disk drives. And I mentioned Ceph: we’re running Ceph as the underlying software technology because it’s free. Now, the trade-off is it’s a lot more labor, but we don’t charge the researchers for that labor; we’re just charging them the hardware costs, because that makes it sustainable. The way this has worked has been really brilliant, because they make a one-time upfront purchase and it’s in production. We guarantee it’ll be in production for seven years, and they can add more at any time. And the barrier to entry is under $1,000. So we’ve seen more uptake in two years than in the previous ten – both in terms of amount of capacity and, more importantly, in terms of the number of research groups participating. We’re now at somewhere around 80 research groups; on the previous system, it was 12.
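To make the buy-in arithmetic concrete, here is a toy illustration. Only the one-twenty-fourth-of-a-24-bay-server structure and the under-$1,000 barrier come from the interview; the dollar figures and drive size below are invented for the example.

    # Toy illustration of the slice-based buy-in. The server cost and
    # drive size are hypothetical; only the 24-bay / one-slice-per-bay
    # structure and the sub-$1,000 barrier come from the interview.
    server_cost = 20_000   # hypothetical: chassis + 24 drives + warranty
    bays = 24
    drive_tb = 18          # hypothetical raw capacity per drive

    slice_price = server_cost / bays
    print(f"Buy-in per slice: ${slice_price:,.0f} for {drive_tb} TB raw")
    # -> about $833 per slice, consistent with a barrier under $1,000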

Are there other things your research clients are looking for in the coming years?

Well, yes, but we sort-of anticipated the big one, which is archival storage. Archival storage has to last for a long time. Now, especially with data sharing requirements from the NIH, and soon all the others, carrying multi-year post-grant expectations, you’ve got to have something you can pay for during the grant period that will still be there afterward. So we have an institution-wide Globus license, which includes file sharing, so they can address their data sharing requirement trivially. We work with our libraries so that the institutional repository can point to what’s on the tape archive, but not actually include the data, because they don’t have, and can’t afford, anywhere near that much capacity.

I spend a lot of my time thinking about what will be the next tape archive. In the early years of a tape archive, it’s easy to get people in, but in the later years, you’re saying, “Well, our plan is to get another NSF grant for the next tape archive, and then we’ll transition your data over, and we’re hoping to make that free.” It’s easy to say that, but you can’t promise it, so there’s some risk on their part in buying into the archive in the later years. Now, we have delivered on the transition from our first tape archive to our current one, but it is risky for them. So the more I can think through the plan, the better it is for everybody.

And the other problem is that long-term planning is difficult, in the sense that things evolve so rapidly. If someone tells you they know what technology will look like five years from now, they’re lying. They have no idea. Let alone ten years. And we’ve got to make 7 to 10 year projections, because otherwise the faculty won’t buy in. Otherwise, they want to buy their own standalone resources, and that is not a model the institution wants, especially because of security concerns: grad students, who are who they would put in charge of the machine, do not have the security training that’s needed to do this stuff. So our job is to make the alternative very attractive, so we don’t have grad students running a server in a closet. A huge part of that is pricing. Another part is that we’ll take care of it so you don’t have to worry about it. And the third part that’s really key is that it will be available for a long time.

Is there anything we didn’t cover that you think people should absolutely know about OSCER?

There are plenty of things we haven’t talked about, and all of them ultimately drive back to how we can lower the price of each thing so that we can provide as many things as possible with the budget we’ve got. One of the things we’ve been trying to do is drive down the cost of storage by not making it attractive to buy SSD, because on a per-terabyte basis it is way more expensive. Every five years, they claim that five years from now that will no longer be true. After 20 years, I’m still waiting; I no longer believe it. We try to make it as invisible to them as possible, but we want them buying the affordable stuff, and we want a mechanism so that we can buy a fixed amount of expensive stuff instead of a proportional amount of expensive stuff. And I’ve got a whole screed on that.

What’s your elevator pitch for OSCER?

Science and engineering make the impossible possible. Supercomputing makes the impossible practical. We are in the business of making the impossible practical. Come to us with your computational needs, and we’ll make it good and affordable at the same time.