RCD Program Story: Cornell University

Finding the Sweet Spot: How Cornell provides reliable, high-quality services while maintaining time to grow and innovate

Nestled in the picturesque Finger Lakes region of upstate New York, Cornell University is an Ivy League and Land Grant institution renowned for its engineering and medical research. In the fiscal year 2023-2024, it reported $1.22 billion in funded research. CaRCC spoke with Rich Knepper, Director of the Center for Advanced Computing (CAC), to learn more about the services and support they offer to Cornell researchers. The following Q&A has been edited for brevity and clarity.

Can you tell us a little bit about the Center for Advanced Computing (CAC) at Cornell?

At CAC, we provide computational services to Cornell faculty. While we offer some infrastructure, our most important asset is our computational scientists, who serve as consultants on research projects. They often embed themselves in faculty research teams, co-authoring papers and, in some cases, serving as co-PIs or PIs on funded projects.

We benefit from long-term relationships and engagements with Cornell faculty. In addition to the expertise we provide, we maintain an on-premises cloud that supports both computational work and storage. We also run a sizable managed cluster program for faculty who wish to own their own systems and do their own computations.

Our consultants engage in long-term collaborations with faculty, rather than focusing on project-by-project work. They often become involved in projects that spin off into new ones, as our faculty have many ideas and seek out our computational scientists for support. While long-term engagements are common, we also work on single projects that may last anywhere from a few months to a year or more.

Are there other services offered, such as training?

Yes, we provide extensive training through our Cornell Virtual Workshop program, which has been around a long time and has been accessed by over 350,000 unique visitors. The program is asynchronous, allowing attendees to go at their own pace. It covers a range of topics, including programming, parallel computing, and how to use a variety of computing resources and manage data.

We’ve also developed virtual workshops for several projects in the national cyberinfrastructure arena, including Indiana University’s Jetstream2 and Texas Advanced Computing Center’s Frontera. We’ll be collaborating with TACC on training for the NSF Leadership-Class Computing Facility.

Additionally, we provide training at our Ithaca campus and at our medical campus in Manhattan. We’re currently running the third year of our Scientific Computing Training Series, which is available to everyone in the Cornell system and some partner hospitals. Twenty-three of these webinars are posted on our YouTube channel for public access.

Our systems team also provides infrastructure and cloud computing consulting to the Scientific Computing Unit at our medical campus and collaborates with the Epigenomics Core here in Ithaca to support their infrastructure.

Are there things you’ve learned from offering virtual workshops and any suggestions you would give to others who are thinking about offering them?

Creating and maintaining virtual workshops is a significant effort. The tech landscape evolves quickly, so it’s crucial to regularly review and update content. It’s important to stay invested in the ongoing review process and be intentional about creating new content, focusing on topics essential to your partners and audience. We really try to dial these in to meet the needs of our constituency.

We’ve learned that breaking up training into smaller modules works better than traditional, longer courses. People prefer to select the most relevant material for them at any given time and access it quickly.

We are also piloting an NSF CyberTraining project called HPC-ED. It’s a federated system for sharing and showcasing training materials among institutions. The idea is not only to share your materials with others but also to discover and incorporate materials from other institutions into your own local portal. This way, you don’t need to develop all your training materials yourself, and you can access vetted content without the headaches and overhead of maintaining it.

How many people work for CAC and are all those full-time employees?

Overall, we have 18 people at CAC. We have one administrative coordinator, split with another research center. While we don’t have a huge amount of administrative coordination needs, it’s helpful to have that capability when necessary. Everybody else is full time.

Are there student employees or interns?

What we usually find at Cornell is that the summer internship program isn’t competitive because many undergraduates secure internships at companies like Meta, Amazon, and in the finance sector. These high-powered, high-paying internships are hard to compete with. However, we do have some students interested in working with us on specific topics during the year, and we’ve had some great successes. Although we don’t have a formal internship program, we try to make it work when students come to us with a focus that aligns with our center. For example, we’ve worked with students on projects related to energy efficiency, data center management, and cloud computing. We’ve also hosted NSF Research Experiences for Undergraduates (REU) participants.

Who are your clients and constituents?

We work well with a broad range of departments, but it often depends on the service. For consulting, it’s all over the map, with clients from veterinary science, engineering, earth and atmospheric science, bioinformatics, and more.

For our managed cluster services, those folks tend to be in engineering and sometimes astronomy and physics. They’re traditional HPC users who want to have their own clusters and are well versed in managing them and getting their labs up and running.

Our cloud services are used by a wide range of researchers, from the social sciences and business school to groups at Weill Cornell Medicine in Manhattan and Qatar.

What kinds of partnerships does CAC have within Cornell and beyond?

Within the institution we partner quite well with Central IT. For example, they have a cloud team that handles provisioning, identity management, and security. And we want to leverage that, as well as the billing piece. We don’t want to do those things. What we want to do is facilitate the faculty and be efficient and responsive to their needs. So, we provide a layer on top of that, and we communicate back and forth with the cloud team and Central IT. We also collaborate with the Bioinformatic Institute’s BioHPC Group, which provides specific life science computing services such as large-memory systems. They run a lot of these systems using our managed cluster service: we manage the platform, and they manage the applications.

Nationally, we seek partnerships that extend our competencies and empower Cornell faculty. For example, we currently have an NSF CSSI collaboration with NCAR, working to containerize the Weather Research & Forecasting Model (WRF) software and related MET validation and visualization tools. We’re working on pieces to make WRF more portable and scalable, and more effective for teaching and demonstration. We also try to integrate with folks in the ACCESS project wherever we can. We were involved in the XSEDE project for quite a few years, and the ACCESS community is very familiar with us and our capabilities, although we are not a funded partner.
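As a rough illustration of what containerization makes possible, the hypothetical command below runs a pre-built WRF image with Apptainer; the image name and paths are placeholders, not artifacts of the NCAR collaboration.

```
# Hypothetical example: running a containerized WRF case with Apptainer.
# "wrf-demo.sif" and the bind-mounted case directory are placeholders.
apptainer exec --bind $PWD/case:/wrf/run wrf-demo.sif \
    bash -c "cd /wrf/run && mpirun -np 4 ./wrf.exe"
```

The appeal for teaching and demonstration is that the same image can run on a laptop, an on-prem cloud VM, or a cluster without rebuilding WRF and its dependency stack on each system.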

Other partners we provide services to include the Institute for Research and Innovation in Software for High Energy Physics (IRIS-HEP), the North American Nanohertz Observatory for Gravitational Waves (NANOGrav), the Integrative Research in Biology program (IntBIO), and the CCAT Observatory, which is preparing a new submillimeter telescope that will be located in Chile’s Atacama Desert at 18,400 feet.

CAC is a mature program. How did it get started, and how has it evolved?

CAC was originally the Cornell Theory Center, one of the first supercomputing centers along with NCSA, Pittsburgh, and SDSC. After losing funding for our national center, we found other ways to become sustainable, including industry partnerships with Dell, Microsoft, Intel, and MathWorks. At the time, Dell was primarily a PC company that wanted to get into the server business. We convinced them that there was no better way to promote their servers than to build a supercomputer with their technology. As a result, we deployed the first Dell HPC cluster to make the TOP500 list. We also operated a Financial Solutions Center across the street from the New York Stock Exchange with IBM, SGI, and software partners.

In 2007, Dave Lifka, the previous center director, introduced a cost recovery model that has helped us stay sustainable. We continuously work to provide value that faculty can see. Transitioning to a cost recovery model isn’t easy. People really don’t like going from a fully subsidized model to a pay-your-own-way model. However, we’ve focused on providing the right resources to researchers and then getting out of their way—creating an environment with the security and safeguards they need, while offering the expertise to help them make the most of those resources without cost overages. I think that’s been a key to CAC’s long-term success.

We’ve fine-tuned our approach to track and maintain the proper balance between Cornell-funded internal projects and externally funded ones. We’ve also adjusted our fees for our on-prem cloud, ensuring they remain highly competitive with public cloud options. I think we’ve found a solid niche as far as that service goes.

Tell me more about finding that niche that’s the sweet spot for CAC versus other providers.

The on-prem cloud at CAC is designed to be an infrastructure where you can get a significant amount of work done. However, we’re not a hyperscaler, and we’re never going to be able to compete on that. The way we describe this is: if you need extreme scale, there are other places that do that in a much better way. If you need access to exotic accelerators or specific services that ride on top of that, then it’s probably a good idea to interface with those providers directly. And you get the flexibility of not having to buy whatever new thing is coming out while still being able to try it out.

Where we do really good work, I think, is in meeting the needs of people who have medium to significant core-count requirements and need solid VMs with no oversubscription of resources and a decent amount of memory per core. We have some GPU resources, and if researchers are going to scale, they can scale with us, or they can begin their scaling with us very cheaply compared to experimenting in the public cloud.

So, if you need to try things out on Red Cloud, it’s very economical to do that and to figure out what the constraints are and what kind of problems you have. You don’t run into nearly as many issues with overages or unexpected things. And we have some staff and flexibility to work with researchers, as opposed to the public providers.

We’re also working to provide advanced capabilities at Cornell and in New York in general. I serve on Empire AI’s Technical Advisory Committee. Empire AI is a recently announced $400 million New York State and university partnership. It’s an emerging computing option for Cornell researchers who have problems that require very large-scale GPUs.

You’ve mentioned cost recovery, but are there other important sources of funding for CAC?

We receive a subsidy from the Vice Provost for Research & Innovation that allows us to lower our rates for Cornell faculty, while applying higher rates for external projects. Additionally, we have a Partner Program that’s been in place for a number of years. It’s a specialized model where corporations, universities, and public agencies can purchase access to a bundle of goodies. These might include time with computational scientists or cloud resources, depending on what the partner needs and what we can offer. While the scale of this program is much smaller compared to our internal recovery model, it’s still an important source of funding, as are our eCornell courses for professionals. These courses, which cover topics like Python, data management, and data visualization, generate substantial royalties for our center.

How do you communicate CAC’s impact? Are there metrics you track? Are there qualitative success stories you track?

We track both quantitative and qualitative metrics. We have a new faculty flyer available on our website that outlines what we offer. We also generate a one-page annual overview that includes the number of projects and users we served, the number of core hours and amount of storage we delivered, the number of nodes we manage in our private cluster service, and things like that. We also use case studies to highlight specific projects.

One faculty experience I like to highlight demonstrates multiple areas of success. A new faculty member joined Cornell with a workflow already set up at a traditional HPC center and was very happy with the dedicated HPC model. However, this researcher had a bursty usage pattern, requiring a lot of cores but not consistently. After several discussions, and some resistance, we proposed an elastic cluster on top of our cloud, allowing him and his students to submit SLURM jobs without the need for a dedicated resource. This model, which only bills for active machine usage, turned out to be far more cost-effective and efficient than acquiring and managing dedicated hardware. It has been a great success and a prime example of the importance of asking, “What are the real requirements here, and what does it take to support them?”
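For readers unfamiliar with how such an elastic setup works, the excerpt below sketches the general shape of a SLURM power-saving configuration over on-demand cloud instances. It is a generic, hypothetical sketch rather than CAC’s actual configuration; the node names, sizes, and script paths are placeholders.

```
# Hypothetical slurm.conf excerpt for an elastic partition backed by cloud VMs.
# Node names, counts, memory sizes, and script paths are placeholders.

# Scripts the controller calls to create and tear down instances on demand
ResumeProgram=/usr/local/sbin/resume_cloud_nodes.sh
SuspendProgram=/usr/local/sbin/suspend_cloud_nodes.sh

ResumeTimeout=600     # seconds allowed for a new instance to boot and register
SuspendTime=300       # seconds a node may sit idle before it is torn down

# Nodes marked State=CLOUD exist (and incur charges) only while jobs need them
NodeName=burst[001-020] CPUs=16 RealMemory=64000 State=CLOUD
PartitionName=elastic Nodes=burst[001-020] Default=YES MaxTime=7-00:00:00 State=UP
```

From the researcher’s side, jobs are submitted with sbatch exactly as on a dedicated cluster; the controller boots instances when jobs queue and retires them once the idle timer expires, which is what keeps billing limited to active usage.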

Are there other outreach or communication activities CAC does within the institution or beyond?

Within the institution, I attend new faculty orientations and write an annual letter to deans updating them on our offerings. We also have a pretty extensive mailing list; when opportunities come up, we push them out through direct email campaigns. And we also leverage the university’s communications channels, though we’re one of many bullet points. On a national scale, we maintain a modest booth at the Supercomputing conference. This serves as an opportunity to meet with partners or potential collaborators and maintain relationships.

How does work get done and organized within CAC? Do you have team meetings? Do you have ticketing systems?

We’re organized into two teams: a systems team and a consulting team. The consulting director holds regular team meetings and check-ins with staff. They also conduct one-on-ones to make sure that projects are moving along and they run a bidding process for new projects. If a faculty member has a project that’s going to take a certain amount of time, the director asks who has time in their schedule and how they will fit it in. Consultants also have regular project meetings with the researchers they’re working with.

On the systems side, there are team meetings and individual check-ins, with coordination around maintenance, outages, data center changes, and other issues. We also hold a monthly all-staff meeting to share updates I have from Cornell leadership and to discuss internal matters. Additionally, we have a monthly proposal meeting to review externally funded project opportunities and potential partnerships.

For service delivery, we use a ticketing system called Request Tracker. It’s an internal system, but we’re likely to change to a Cornell-wide system to offload that service management load. We also have a home-grown project management system that allows PIs to easily start projects, add resources, and manage project membership and access. This system also tracks our service usage and handles accounting. It’s recently been revamped to be easier to use and more responsive to faculty needs.

What are some of your near- and medium-term priorities? And how do you choose between them?

Storage continues to be a priority, but we’re also placing a strong focus on security and compliance, particularly for projects that require higher levels of security. We’re working closely with Cornell’s Chief Information Security Officer and Central IT teams to try and figure out the best path forward. We really need to offer something that’s at the very least a kind of federated solution, so we know that everyone’s on the same page.

In the slightly longer term, there’s the question of retention and reproducibility, which is sort of the flip side of the security issue of how to meet data-sharing requirements. How do we ensure that only necessary data is shared, without exposing everything? And how long do we need to retain data to address reproducibility claims and similar concerns?

How does governance work for CAC? Who do you report to and is there an advisory board or stakeholder you meet with regularly?

We report to the Vice Provost for Research & Innovation. We don’t currently have an advisory board and haven’t had one for some time. The thinking behind this was that if a service isn’t getting purchased, you de-invest from that service and go back to the drawing board. However, relying solely on negative feedback isn’t always the best way to steer through the currents. So, we’ve been exploring a different model: an “experience lab.” This would involve talking to users, and to non-users who are meeting their needs outside of our services, to figure out what they’re looking for. From there, we’d experiment with solutions, where people are invested in the outcomes. What do they truly need to support their research, and what level of support will be sustainable for both the university and faculty members?

We want to better understand how to address these needs through a more structured approach, rather than relying solely on an advisory board where members often focus on their individual interests. Identifying and prioritizing the actual needs is something we want to pursue. We’re still in the early stages and haven’t had the opportunity to fully implement this approach yet, but that’s the direction we’re heading.

Was that focus group idea inspired by some other institution or the VPRI’s experience elsewhere?

That came from the VPRI, who had tried this model at another institution and was interested to see if it would work here. I’m really excited about this approach because, rather than just holding a stream of meetings with an advisory board, we can dig deeper into the real needs of our users. It’s not about steering the direction, but about researching how infrastructure and consulting services meet or don’t meet the needs of our faculty. That’s much more valuable.

Is there anything we didn’t cover that you think an emerging research computing leader might benefit from knowing about CAC?

I think one of the challenges many organizations face, especially one like ours where hours count, is finding the slack to take on new projects. What does the ramp-up look like when someone approaches you with a request outside your organization’s current competencies? Sending people away from your center is not a good way to stay in business. You need to figure out how to build competencies as you go, without putting all the costs on the user. What we’ve tried to do is find ways to ensure we don’t have to turn people away when they come to us with a project. I think this has been a key factor in our success.

Another challenge is finding the time and focus to pull in external projects and funding, especially when grant proposal work isn’t compensated. We’ve worked to figure out how to support that effort. Over time, we’ve built a reserve of funds we can tap into for these purposes.

We don’t reserve every single hour of a consultant’s time for cost-recovery work. There’s dedicated time for learning and proposal prep because there needs to be some give and take in those areas.

Are there any other success stories you’d like to share?

Early on when I came to the center, we had a faculty member who was retiring and moving to become a chair at another university. She had built up a fairly large infrastructure that others in her department were using. To the center’s credit, she said, “Well, I’m retiring, you should turn this over to CAC.” And it was a lot of stuff. There was equipment of varying ages and provenance, and there was a building project that would render the current space unusable.

So, our team was able to inventory all the equipment, figure out how to move it to our data center, get utility out of it, and continue to make use of it for a long time.

Since then, we’ve actually had people join the initiative. While we don’t have a broad condo model for computing, this is more like a mini condo within chemical and biological engineering, where people are pooling their resources. I think it’s been worthwhile for everyone, even though it has taken some adjustment. It has required effort to get used to it, but it’s really become a solid resource for them. Now, it’s starting to grow, with an additional set of around 75 nodes being added to the existing jointly owned resource.

Do you have an elevator pitch you give for your center?

CAC provides computational resources, but our greatest asset is our people and the expertise they provide. We’ve built an infrastructure that lets faculty use these resources seamlessly, without unnecessary obstacles.