Jclark-ctr (John Clark)
User

Projects

User does not belong to any projects.

User Details

User Since: Jul 24 2019, 8:11 PM (259 w, 2 d)
Availability: Available
LDAP User: Jclark-ctr
MediaWiki User: Jclark-ctr

Recent Activity

Thu, Jul 11

Jclark-ctr closed T369042: Netbox Reporting Triage - Week of 2024-7-12 as Resolved.
Thu, Jul 11, 7:53 PM · DC-Ops
Jclark-ctr updated the task description for T369042: Netbox Reporting Triage - Week of 2024-7-12.
Thu, Jul 11, 7:09 PM · DC-Ops

Wed, Jul 3

Jclark-ctr updated subscribers of T363399: Q4:rack/setup/install parsoidtest1001.

@Papaul if you get a chance, can you look at this one?

Wed, Jul 3, 10:38 PM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops
Jclark-ctr claimed T363399: Q4:rack/setup/install parsoidtest1001.
Wed, Jul 3, 1:24 PM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops
Jclark-ctr updated the task description for T363399: Q4:rack/setup/install parsoidtest1001.
Wed, Jul 3, 1:23 PM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops

Tue, Jul 2

Jclark-ctr updated Other Assignee for T369042: Netbox Reporting Triage - Week of 2024-7-12, added: VRiley-WMF.
Tue, Jul 2, 1:12 PM · DC-Ops
Jclark-ctr created T369042: Netbox Reporting Triage - Week of 2024-7-12.
Tue, Jul 2, 1:11 PM · DC-Ops
Jclark-ctr added a comment to T364870: Q4:rack/setup/install new cloudcephmon hosts.

@Andrew @dcaro thank you for providing the update. Do you have host names for these? Also, please update preseed.yaml and site.pp.

Tue, Jul 2, 12:27 AM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
Jclark-ctr added a comment to T364429: Q4:rack/setup/install an-conf100[4-6].

@BTullis if you get a chance, please update the files. These are ready to be imaged and handed over.

Tue, Jul 2, 12:23 AM · Patch-For-Review, SRE, Data-Engineering, ops-eqiad, DC-Ops
Jclark-ctr closed T368866: Degraded RAID on aqs1013 as Resolved.

duplicate of T362033

Tue, Jul 2, 12:21 AM · DC-Ops, SRE, ops-eqiad
Jclark-ctr updated subscribers of T363344: Q4:rack/setup/install cloudcephosd10[35-38].

@VRiley-WMF if you can, update with the 2nd network connection, then hand over to @cmooney.

Tue, Jul 2, 12:18 AM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
Jclark-ctr updated Other Assignee for T363344: Q4:rack/setup/install cloudcephosd10[35-38], added: VRiley-WMF.
Tue, Jul 2, 12:17 AM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
Jclark-ctr claimed T363344: Q4:rack/setup/install cloudcephosd10[35-38].
Tue, Jul 2, 12:16 AM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
Jclark-ctr updated the task description for T363344: Q4:rack/setup/install cloudcephosd10[35-38].
Tue, Jul 2, 12:16 AM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops

Mon, Jul 1

Jclark-ctr updated the task description for T363344: Q4:rack/setup/install cloudcephosd10[35-38].
Mon, Jul 1, 11:43 PM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
Jclark-ctr closed T363341: Q4:rack/setup/install cloudcephosd10[39-41] as Resolved.
Mon, Jul 1, 10:49 PM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
Jclark-ctr updated the task description for T363341: Q4:rack/setup/install cloudcephosd10[39-41].
Mon, Jul 1, 10:48 PM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops

Fri, Jun 28

Jclark-ctr claimed T363341: Q4:rack/setup/install cloudcephosd10[39-41].
Fri, Jun 28, 10:17 PM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
Jclark-ctr updated the task description for T363341: Q4:rack/setup/install cloudcephosd10[39-41].
Fri, Jun 28, 10:16 PM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
Jclark-ctr added a comment to T363341: Q4:rack/setup/install cloudcephosd10[39-41].

cloudcephosd1039
2nd cable serial#20220008 port 1
cloudcephosd1040
2nd cable serial#20220043 port 5
cloudcephosd1041
2nd cable serial#20220011 port 7

Fri, Jun 28, 9:27 PM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
Jclark-ctr closed T365485: Q4:rack/setup/install dbproxy102[89] as Resolved.
Fri, Jun 28, 8:21 PM · SRE, Data-Persistence, ops-eqiad, DC-Ops
Jclark-ctr updated the task description for T365485: Q4:rack/setup/install dbproxy102[89].
Fri, Jun 28, 8:20 PM · SRE, Data-Persistence, ops-eqiad, DC-Ops
Jclark-ctr claimed T365485: Q4:rack/setup/install dbproxy102[89].
Fri, Jun 28, 7:51 PM · SRE, Data-Persistence, ops-eqiad, DC-Ops
Jclark-ctr updated the task description for T365485: Q4:rack/setup/install dbproxy102[89].
Fri, Jun 28, 7:48 PM · SRE, Data-Persistence, ops-eqiad, DC-Ops
Jclark-ctr assigned T364429: Q4:rack/setup/install an-conf100[4-6] to BTullis.
Fri, Jun 28, 7:02 PM · Patch-For-Review, SRE, Data-Engineering, ops-eqiad, DC-Ops
Jclark-ctr assigned T368766: ManagementSSHDown to VRiley-WMF.

Did the mgmt IP address get updated during any maintenance you performed?

Fri, Jun 28, 6:58 PM · SRE, DC-Ops, ops-eqiad
Jclark-ctr closed T368767: PowerSupplyFailure as Resolved.

Reseated PSU

Fri, Jun 28, 6:56 PM · SRE, ops-eqiad, DC-Ops
Jclark-ctr closed T368564: Degraded RAID on aqs1013 as Resolved.

duplicate T362033

Fri, Jun 28, 6:54 PM · Data-Engineering, DC-Ops, SRE, ops-eqiad
Jclark-ctr closed T368099: Degraded RAID on aqs1013 as Resolved.

duplicate T362033

Fri, Jun 28, 6:54 PM · DC-Ops, SRE, ops-eqiad
Jclark-ctr closed T365165: Q4:rack/setup/install krb1002 as Resolved.
Fri, Jun 28, 2:24 PM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops
Jclark-ctr updated the task description for T365165: Q4:rack/setup/install krb1002.
Fri, Jun 28, 2:23 PM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops
Jclark-ctr updated the task description for T364429: Q4:rack/setup/install an-conf100[4-6].
Fri, Jun 28, 12:48 PM · Patch-For-Review, SRE, Data-Engineering, ops-eqiad, DC-Ops

Thu, Jun 27

Jclark-ctr added a comment to T364429: Q4:rack/setup/install an-conf100[4-6].

@BTullis can you update the preseed.yaml and site.pp files for these servers?

Thu, Jun 27, 8:17 PM · Patch-For-Review, SRE, Data-Engineering, ops-eqiad, DC-Ops
Jclark-ctr moved T368099: Degraded RAID on aqs1013 from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
Thu, Jun 27, 2:00 PM · DC-Ops, SRE, ops-eqiad
Jclark-ctr updated the task description for T364416: Q4:rack/setup/install deploy1003.
Thu, Jun 27, 12:42 PM · SRE, serviceops, ops-eqiad, DC-Ops
Jclark-ctr added a comment to T364416: Q4:rack/setup/install deploy1003.

@akosiaris please update the site.pp file for this server.

Thu, Jun 27, 12:11 PM · SRE, serviceops, ops-eqiad, DC-Ops

Tue, Jun 25

Jclark-ctr updated subscribers of T365165: Q4:rack/setup/install krb1002.

@MoritzMuehlenhoff would you be able to update site.pp file for this server?

Tue, Jun 25, 1:42 PM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops
Jclark-ctr updated the task description for T365165: Q4:rack/setup/install krb1002.
Tue, Jun 25, 1:40 PM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops

Thu, Jun 20

Jclark-ctr closed T367004: Inbound interface errors as Resolved.

No faults since Jun 9th.

Thu, Jun 20, 7:10 PM · SRE, DC-Ops, ops-eqiad
Jclark-ctr closed T362841: Degraded RAID on aqs1014 as Resolved.

Have not seen any errors return on this; closing this ticket.

Thu, Jun 20, 1:16 PM · DC-Ops, Cassandra, SRE, ops-eqiad
Jclark-ctr closed T367789: Relabel eqiad kubernetes nodes as Resolved.
Thu, Jun 20, 1:14 PM · SRE, ops-eqiad, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Jclark-ctr closed T367789: Relabel eqiad kubernetes nodes, a subtask of T351074: Move servers from the appserver/api cluster to kubernetes, as Resolved.
Thu, Jun 20, 1:13 PM · serviceops, MW-on-K8s
Jclark-ctr closed T367678: Degraded RAID on aqs1013 as Resolved.

duplicate T362033

Thu, Jun 20, 12:58 PM · SRE, ops-eqiad, DC-Ops
Jclark-ctr closed T367209: Degraded RAID on aqs1013 as Resolved.

duplicate T362033

Thu, Jun 20, 12:58 PM · DC-Ops, SRE, ops-eqiad
Jclark-ctr closed T367766: hw troubleshooting: firmware upgrade for mw1359.eqiad.wmnet, mw1364.eqiad.wmnet, mw1365.eqiad.wmnet, mw1412.eqiad.wmnet as Resolved.
Thu, Jun 20, 12:57 PM · SRE, serviceops, ops-eqiad, DC-Ops
Jclark-ctr added a comment to T367766: hw troubleshooting: firmware upgrade for mw1359.eqiad.wmnet, mw1364.eqiad.wmnet, mw1365.eqiad.wmnet, mw1412.eqiad.wmnet.

Updated iDRAC on all servers listed to iDRAC Firmware Version 7.00.00.171.

Thu, Jun 20, 12:57 PM · SRE, serviceops, ops-eqiad, DC-Ops
Jclark-ctr closed T367854: db1165 network flapping issues as Resolved.

Replaced cable

Thu, Jun 20, 12:37 PM · SRE, ops-eqiad, DC-Ops, DBA
Jclark-ctr updated the task description for T364416: Q4:rack/setup/install deploy1003.
Thu, Jun 20, 12:28 PM · SRE, serviceops, ops-eqiad, DC-Ops

Tue, Jun 18

Jclark-ctr added a comment to T367766: hw troubleshooting: firmware upgrade for mw1359.eqiad.wmnet, mw1364.eqiad.wmnet, mw1365.eqiad.wmnet, mw1412.eqiad.wmnet.

@Clement_Goubert did you need just iDRAC updated? We can do that easily. BIOS requires a reboot.

Tue, Jun 18, 12:29 AM · SRE, serviceops, ops-eqiad, DC-Ops

Mon, Jun 17

Jclark-ctr closed T367499: hw troubleshooting: server fails to reboot for clouddb1018.eqiad.wmnet as Resolved.

@Marostegui Updated iDRAC and BIOS firmware.

Mon, Jun 17, 2:21 PM · cloud-services-team (Hardware), SRE, ops-eqiad, DC-Ops
Jclark-ctr closed T367075: Degraded RAID on ganeti1019 as Resolved.

resolved with T367071

Mon, Jun 17, 1:20 PM · DC-Ops, SRE, ops-eqiad
Jclark-ctr closed T367071: ganeti1019 is down as Resolved.
Mon, Jun 17, 1:18 PM · DC-Ops, ops-eqiad, SRE, Infrastructure-Foundations, Ganeti

Jun 11 2024

Jclark-ctr added a comment to T367071: ganeti1019 is down.

@MoritzMuehlenhoff after replacing the failed drive it looked like it might boot, but it still fails. It might need to be reimaged. I do not have root access, so I am unable to proceed past this.

Jun 11 2024, 9:26 PM · DC-Ops, ops-eqiad, SRE, Infrastructure-Foundations, Ganeti
Jclark-ctr added a comment to T367071: ganeti1019 is down.

Failed drive was also replaced.

Jun 11 2024, 9:09 PM · DC-Ops, ops-eqiad, SRE, Infrastructure-Foundations, Ganeti
Jclark-ctr added a comment to T367071: ganeti1019 is down.

@MoritzMuehlenhoff Replaced DIMM.

Jun 11 2024, 9:09 PM · DC-Ops, ops-eqiad, SRE, Infrastructure-Foundations, Ganeti
Jclark-ctr added a comment to T367071: ganeti1019 is down.

@MoritzMuehlenhoff Can I take the server down to replace the DIMM?

Jun 11 2024, 5:57 PM · DC-Ops, ops-eqiad, SRE, Infrastructure-Foundations, Ganeti
Jclark-ctr added a comment to T367071: ganeti1019 is down.

DIMM B1

BankLabel: B
CacheSize: Information Not Available
CurrentOperatingSpeed: 2400 MHz
DeviceDescription: DIMM B1
DeviceType: Memory
FQDD: DIMM.Socket.B1
InstanceID: DIMM.Socket.B1
LastSystemInventoryTime: 2024-06-10T19:54:17
LastUpdateTime: 2022-02-02T00:40:40
ManufactureDate: Mon Sep 10 07:00:00 2018 UTC
Manufacturer: Micron Technology
MemoryTechnology: DRAM
MemoryType: DDR-4
Model: DDR4 DIMM
NonVolatileSize: Information Not Available
PartNumber: 36ASF4G72PZ-2G6E1
PrimaryStatus: Ok
Rank: Double Rank
RemainingRatedWriteEndurance: Information Not Available
SerialNumber: 1E5C734E
Size: 32768 MB
Speed: 2666 MHz
SystemEraseCapability: Not Supported
VolatileSize: 32768 MB

Jun 11 2024, 5:05 PM · DC-Ops, ops-eqiad, SRE, Infrastructure-Foundations, Ganeti
Jclark-ctr added a comment to T367071: ganeti1019 is down.

The system memory has faced uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.

Jun 11 2024, 5:04 PM · DC-Ops, ops-eqiad, SRE, Infrastructure-Foundations, Ganeti
Jclark-ctr added a comment to T367071: ganeti1019 is down.

This server is out of warranty. Will check decom servers to see if we have any suitable DIMMs.

Jun 11 2024, 5:03 PM · DC-Ops, ops-eqiad, SRE, Infrastructure-Foundations, Ganeti

Jun 6 2024

Jclark-ctr added a comment to T366102: Patch circiut CRT-008647.

Installed cross connect; link came up on port. Cable ID #5229.

Jun 6 2024, 10:30 PM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops, netops

Jun 5 2024

Jclark-ctr added a comment to T363119: db1246 crashed.

Replaced broken cable; server went 2 weeks without the fault returning.

Jun 5 2024, 1:01 PM · DC-Ops, SRE, ops-eqiad, DBA

Jun 4 2024

Jclark-ctr closed T366583: hw troubleshooting: firmware upgrade for mw1358.eqiad.wmnet as Resolved.

manually updated firmware
iDRAC Firmware Version 7.00.00.171
BIOS Version 2.21.1

Jun 4 2024, 8:48 PM · SRE, Infrastructure-Foundations, serviceops, ops-eqiad, DC-Ops
Jclark-ctr closed T366583: hw troubleshooting: firmware upgrade for mw1358.eqiad.wmnet, a subtask of T365571: Rename wikikube worker nodes during OS reimage, as Resolved.
Jun 4 2024, 8:47 PM · Kubernetes, Prod-Kubernetes, serviceops

Jun 3 2024

Jclark-ctr closed T364060: Degraded RAID on cloudcephosd1031 as Resolved.
Jun 3 2024, 4:05 PM · Cloud-VPS, Cloud-Services-Origin-Alert, cloud-services-team, DC-Ops, SRE, ops-eqiad

May 30 2024

Jclark-ctr added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

@akosiaris kafka-main1010 has imaged but is still failing the cookbook for me. Would you be able to try that one for me?

May 30 2024, 7:29 PM · SRE, serviceops, ops-eqiad, DC-Ops
Jclark-ctr updated the task description for T363344: Q4:rack/setup/install cloudcephosd10[35-38].
May 30 2024, 1:48 PM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
Jclark-ctr updated the task description for T363341: Q4:rack/setup/install cloudcephosd10[39-41].
May 30 2024, 1:47 PM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops

May 29 2024

Jclark-ctr updated the task description for T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.
May 29 2024, 10:18 PM · SRE, serviceops, ops-eqiad, DC-Ops
Jclark-ctr added a comment to T364060: Degraded RAID on cloudcephosd1031.

@dcaro the drive was listed as ready in iDRAC. Converted to non-RAID; it should be visible now.

May 29 2024, 2:04 PM · Cloud-VPS, Cloud-Services-Origin-Alert, cloud-services-team, DC-Ops, SRE, ops-eqiad

May 28 2024

Jclark-ctr closed T364060: Degraded RAID on cloudcephosd1031 as Resolved.

Replaced Failed Drive

May 28 2024, 6:39 PM · Cloud-VPS, Cloud-Services-Origin-Alert, cloud-services-team, DC-Ops, SRE, ops-eqiad
Jclark-ctr added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

@akosiaris still failing with the same issue for kafka-main1010.

May 28 2024, 6:34 PM · SRE, serviceops, ops-eqiad, DC-Ops

May 24 2024

Jclark-ctr added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

I was able to correct the DHCP issue for kafka-main1010, but imaging still fails.

Screenshot 2024-05-24 at 8.50.08 AM.png (906×1 px, 247 KB)
@akosiaris did you have this issue with other servers?

May 24 2024, 12:56 PM · SRE, serviceops, ops-eqiad, DC-Ops

May 23 2024

Jclark-ctr closed T365711: Relabel eqiad Kubernetes hosts as Resolved.

Relabeled servers

May 23 2024, 4:17 PM · SRE, serviceops, DC-Ops, ops-eqiad
Jclark-ctr claimed T364060: Degraded RAID on cloudcephosd1031.

You have successfully submitted request SR191070960.
Ordered replacement drive. Will update when it arrives.

May 23 2024, 4:13 PM · Cloud-VPS, Cloud-Services-Origin-Alert, cloud-services-team, DC-Ops, SRE, ops-eqiad
Jclark-ctr updated subscribers of T364984: cloudvirt1041: can't boot after reimage.

@aborrero I am stuck right now. I did attempt to reimage, with no luck. Unsure what version of grub we have installed, but it looks like the same issue as this bug. @Papaul do you have any insight on this? https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=987008

May 23 2024, 4:00 PM · DC-Ops, ops-eqiad, User-aborrero, cloud-services-team, SRE
Jclark-ctr added a comment to T363119: db1246 crashed.

The replacement cable did just arrive yesterday, after multiple rounds of back and forth with Dell. Can we leave this open for 1 more week to make sure the error will not return? Leave the server running and I will reach out to you for downtime next week for the replacement.

May 23 2024, 2:18 PM · DC-Ops, SRE, ops-eqiad, DBA
Jclark-ctr claimed T365711: Relabel eqiad Kubernetes hosts.
May 23 2024, 2:13 PM · SRE, serviceops, DC-Ops, ops-eqiad
Jclark-ctr added a comment to T365346: Degraded RAID on db1172.

Replaced drive

May 23 2024, 2:12 PM · DC-Ops, DBA, SRE, ops-eqiad
Jclark-ctr claimed T365346: Degraded RAID on db1172.
May 23 2024, 2:05 PM · DC-Ops, DBA, SRE, ops-eqiad
Jclark-ctr added a comment to T365346: Degraded RAID on db1172.

@Marostegui I do have a spare disk. Can I swap it at any time?

May 23 2024, 2:05 PM · DC-Ops, DBA, SRE, ops-eqiad

May 22 2024

Jclark-ctr updated subscribers of T364870: Q4:rack/setup/install new cloudcephmon hosts.

@Andrew @dcaro these have arrived. Once we find out the racking information we will be able to rack and image them fairly quickly.

May 22 2024, 1:53 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops

May 21 2024

Jclark-ctr claimed T365289: partial power outage for lsw1-e5-eqiad.
May 21 2024, 3:01 PM · DC-Ops, SRE, netops, Infrastructure-Foundations, ops-eqiad

May 15 2024

andrea.denisse awarded T360356: Request access to servers Dcops group a Like token.
May 15 2024, 7:07 PM · User-Elukey, SRE, Infrastructure-Foundations

May 14 2024

Jclark-ctr updated subscribers of T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

@akosiaris could you please update the preseed.yaml file? I did take care of the site.pp file for codfw and eqiad.

May 14 2024, 10:25 PM · SRE, serviceops, ops-eqiad, DC-Ops
Jclark-ctr added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

kafka-main1010
Rack: E 5
U 26
Cableid : 2013339101771
Port : 6

May 14 2024, 2:42 PM · SRE, serviceops, ops-eqiad, DC-Ops
Jclark-ctr added a comment to T360356: Request access to servers Dcops group.

@Volans I also see this as a learning opportunity; most of these are just logs. Some dcops members are very light on Linux, and we could be expanding knowledge and could become more valuable members of the team. I do love cookbooks, but sometimes they fail, and it would be nice if we could continue to teach and train coworkers.

May 14 2024, 2:17 PM · User-Elukey, SRE, Infrastructure-Foundations
Jclark-ctr updated the task description for T360356: Request access to servers Dcops group.
May 14 2024, 2:17 PM · User-Elukey, SRE, Infrastructure-Foundations

May 13 2024

Jclark-ctr added a comment to T360356: Request access to servers Dcops group.

@Volans The main purpose is gathering debug information. I would prefer to grep mesg /log files instead of searching through the entire output. mdadm commands would allow us to one day rebuild failed software RAIDs.

May 13 2024, 6:44 PM · User-Elukey, SRE, Infrastructure-Foundations

May 9 2024

Jclark-ctr added a comment to T363119: db1246 crashed.

@Marostegui you can put the server back in rotation. I uploaded multiple photos to Dell yesterday, but they replied this morning requesting the part number so they can send the correct part.

Screenshot 2024-05-09 at 10.31.55 AM.png (988×1 px, 2 MB)
I attached the photo that was sent to Dell. I do not expect the part to arrive till next week. We can work on this at a later date.

May 9 2024, 2:37 PM · DC-Ops, SRE, ops-eqiad, DBA

May 8 2024

Jclark-ctr added a comment to T363119: db1246 crashed.

I believe we are good to reimage the server; the OS looks corrupt. Could you wait till tomorrow to put it back in production while I wait for Dell to respond on whether they will send out a new cable?

May 8 2024, 3:02 PM · DC-Ops, SRE, ops-eqiad, DBA
Jclark-ctr added a comment to T363119: db1246 crashed.

I am powering it up now and will check iDRAC.

May 8 2024, 2:43 PM · DC-Ops, SRE, ops-eqiad, DBA
Jclark-ctr added a comment to T363119: db1246 crashed.

Replaced backplane: cable that connects RAID card <-> backplane / power control board. I did find a cable with a loose pin on the power control board (not replaced) and will be reaching out to Dell regarding it. It has been reseated in the connector and should be fine for the time being.

May 8 2024, 2:39 PM · DC-Ops, SRE, ops-eqiad, DBA

May 7 2024

Jclark-ctr added a comment to T363119: db1246 crashed.

On Friday Dell agreed to replace the backplane and cables. Shipped out Monday; expected arrival Tuesday.

May 7 2024, 8:17 PM · DC-Ops, SRE, ops-eqiad, DBA

May 2 2024

Jclark-ctr added a comment to T363086: ManagementSSHDown parse1002.eqiad.wmnet.

@akosiaris iDRAC has stayed up for 4 days now; possibly relocating it to a different port helped. We won't know until it is put in use again. This server is out of warranty. If it fails again, could we look at swapping it with another decom server?

May 2 2024, 11:02 PM · DC-Ops, SRE, ops-eqiad
Jclark-ctr assigned T363660: Degraded RAID on centrallog1002 to Papaul.
May 2 2024, 7:55 PM · DC-Ops, SRE, ops-eqiad
Jclark-ctr added a comment to T363660: Degraded RAID on centrallog1002.

@andrea.denisse We have been having a few issues with software RAIDs, and we are trying to pinpoint which slot these drives are in. iDRAC is not listing the drives. I will message you for assistance.

May 2 2024, 1:57 PM · DC-Ops, SRE, ops-eqiad

Apr 30 2024

Jclark-ctr closed T362871: hw troubleshooting: disk failure for an-worker1087 as Resolved.
Apr 30 2024, 2:34 PM · SRE, ops-eqiad, DC-Ops
Jclark-ctr closed T362871: hw troubleshooting: disk failure for an-worker1087, a subtask of T362860: Apparent disk failure on an-worker1087, as Resolved.
Apr 30 2024, 2:34 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
Jclark-ctr added a comment to T363119: db1246 crashed.

@Marostegui "At the creation of the ticket I requested to not repeat any troubleshooting steps that were not effective."
Followed up with Dell again; they should be sending out parts shortly.

Apr 30 2024, 2:32 PM · DC-Ops, SRE, ops-eqiad, DBA
Jclark-ctr added a comment to T363086: ManagementSSHDown parse1002.eqiad.wmnet.

iDRAC is still up after almost 24 hours. I did move the iDRAC port on the switch to a different group of ports; will monitor it.

Apr 30 2024, 2:23 PM · DC-Ops, SRE, ops-eqiad